[Biojava-l] [biojavax] EMBL parser : features parsing
Richard Holland
richard.holland at ebi.ac.uk
Tue Apr 18 09:21:49 UTC 2006
I have committed an UNTESTED patch based on Jolyon's suggestion, and
also attempted to fix the split-on-equals problem Morgane observed.
Please let me know if there are any problems with it.
As this problem affected the UniProt parser in a similar manner (much of
the code is identical), the same fixes were applied there too.
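
For illustration only, here is a minimal sketch of the split-on-equals idea, assuming the fix amounts to splitting a qualifier at its first "=" so that any later "=" stays in the value. The class QualifierSplit and the helper splitQualifier are hypothetical names made up for this example; this is not the committed EmblFormat code.

```java
public class QualifierSplit {

    // Hypothetical helper (not part of biojavax): splits a "/tag=value"
    // qualifier into {tag, value} at the FIRST '=' only.
    static String[] splitQualifier(String line) {
        String body = line.trim();
        if (body.startsWith("/")) {
            body = body.substring(1);
        }
        // A limit of 2 yields at most two pieces, so any '=' inside
        // the value is left untouched.
        String[] parts = body.split("=", 2);
        String tag = parts[0];
        String val = parts.length > 1 ? parts[1] : "";
        // Drop the surrounding double quotes, if present.
        if (val.length() >= 2 && val.startsWith("\"") && val.endsWith("\"")) {
            val = val.substring(1, val.length() - 1);
        }
        return new String[]{tag, val};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

With Morgane's example qualifier, the tag is "note" and the value keeps its embedded "=" instead of being cut short.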
cheers,
Richard
On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> Hi Morgane,
>
> I have amended the EmblFormat readSection method as below and the
> parsing seems to work; please test it.
>
> I think that the last bit of annotation is carried over into the next
> feature so before adding the new feature I dump the annotation and reset
> currentTag and currentVal.
>
> if (!line.startsWith(" ")) {
>     //--------- new code starts ---------------------------
>     if (currentTag!=null) {
>         section.add(new String[]{currentTag,currentVal.toString()});
>         currentTag = null;
>         currentVal = null;
>     }
>     //--------- new code ends -----------------------------
>     // case 1 : word value - splits into key-value on its own
>     section.add(line.split("\\s+"));
> }
>
> Cheers,
>
> Jolyon
>
>
>
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> THOMAS-CHOLLIER
> Sent: 12 April 2006 09:35
> To: biojava-l at open-bio.org
> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>
> Hello again,
>
> I am currently using biojavax to parse EMBL files exported from the
> Ensembl website.
>
> Compared to the EBI files I have, they show a difference in the feature
> lines: sometimes only one "/word" is present, for example:
>
> EBI file :
>
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
>
> Ensembl file :
>
> FT   gene            complement(1..3218)
> FT                   /gene="ENSMUSG00000038227"
>
> The problem I encounter is that the parser correctly converts the "/word"
> into a Note, but the Note is then attached to the immediately
> following feature (e.g. mRNA).
> The current gene feature thus has no annotation.
>
> This behavior is reproducible by removing one "/word" from an EBI file.
>
> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
> feature qualifier (e.g. /note="transcript_id=ENSMUST00000048680"), which
> ends up as an incomplete Note, as the parser seems to split on "=" to
> separate the Key and the Value.
>
> Thanks for your help,
>
> Morgane.
>
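
The carry-over Morgane describes, and the flush Jolyon proposes, can be reproduced in a small standalone sketch. It is a simplified stand-in for the feature-table loop, not the real readSection method. It assumes the leading "FT   " has already been stripped from each line, and the class name FeatureFlushSketch, the parse method, and the mRNA location are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FeatureFlushSketch {

    // Simplified stand-in for a feature-table reader. Lines that do not
    // start with a space open a new feature; indented lines carry qualifiers.
    static List<String[]> parse(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String line : lines) {
            if (!line.startsWith(" ")) {
                // Flush the pending qualifier BEFORE recording the new
                // feature, so it stays with the feature it belongs to.
                if (currentTag != null) {
                    section.add(new String[]{currentTag, currentVal.toString()});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(line.split("\\s+"));
            } else {
                String body = line.trim();
                if (body.startsWith("/")) {
                    // A new qualifier starts: store the previous one first.
                    if (currentTag != null) {
                        section.add(new String[]{currentTag, currentVal.toString()});
                    }
                    String[] kv = body.substring(1).split("=", 2);
                    currentTag = kv[0];
                    currentVal = new StringBuilder(kv.length > 1 ? kv[1] : "");
                } else if (currentTag != null) {
                    // Continuation line of the current qualifier value.
                    currentVal.append(body);
                }
            }
        }
        // Flush whatever is still pending at the end of the table.
        if (currentTag != null) {
            section.add(new String[]{currentTag, currentVal.toString()});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "gene            complement(1..3218)",
            "                /gene=\"ENSMUSG00000038227\"",
            "mRNA            join(1..100,200..300)"); // hypothetical location
        for (String[] entry : parse(lines)) {
            System.out.println(String.join(" | ", entry));
        }
    }
}
```

In this sketch the single /gene qualifier is emitted before the mRNA entry. Without the flush in the first branch it would only be stored after the mRNA line, which matches the misplaced Note in the report.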
--
Richard Holland
European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD, UK
Tel: +44-(0)1223-494416