[Biojava-l] [biojavax] EMBL parser : features parsing

Richard Holland richard.holland at ebi.ac.uk
Tue Apr 18 09:21:49 UTC 2006


I have committed an UNTESTED patch based on Jolyon's suggestion, and
also attempted to fix the split-on-equals problem Morgane observed. 

Please let me know if there are any problems with it.

As this problem affected the UniProt parser in a similar manner (much of
the code is identical), the same fixes were applied there too.
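
For illustration only, here is a minimal sketch of splitting a qualifier on the first "=" alone, so that a value such as transcript_id=ENSMUST00000048680 stays in one piece. It is not the committed patch: the class QualifierSplit and the helper splitQualifier are hypothetical names, and the quote stripping is an assumption about what a caller would want.

```java
class QualifierSplit {
    // Hypothetical helper, not part of BioJava. Splits a feature qualifier
    // such as /note="transcript_id=ENSMUST00000048680" into key and value,
    // cutting only at the first '='.
    static String[] splitQualifier(String qualifier) {
        // Drop the leading '/' if there is one.
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        // A limit of 2 leaves any later '=' inside the value.
        String[] parts = body.split("=", 2);
        String key = parts[0];
        String value = parts.length > 1 ? parts[1] : "";
        // Assumption: one pair of surrounding double quotes is removed.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[]{key, value};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

The two-argument form of String.split is used here because the one-argument form cuts at every "=", which is the behaviour Morgane reported.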

cheers,
Richard

On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> Hi Morgane,
> 
> I have amended the EmblFormat readSection method as below and the
> parsing seems to work; please test it.
> 
> I think that the last bit of annotation is carried over into the next
> feature so before adding the new feature I dump the annotation and reset
> currentTag and currentVal.
> 
> if (!line.startsWith(" ")) {
>   //--------- new code starts ---------------------------
>   // flush the pending tag/value so it stays with the previous feature
>   if (currentTag != null) {
>     section.add(new String[]{currentTag, currentVal.toString()});
>     currentTag = null;
>     currentVal = null;
>   }
>   //--------- new code ends -----------------------------
>   // case 1 : word value - splits into key-value on its own
>   section.add(line.split("\\s+"));
> }
> 
> Cheers,
> 
> Jolyon
> 
> 
> 
> -----Original Message-----
> From: biojava-l-bounces at lists.open-bio.org
> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> THOMAS-CHOLLIER
> Sent: 12 April 2006 09:35
> To: biojava-l at open-bio.org
> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> 
> Hello again,
> 
> I am currently using biojavax to parse EMBL files exported from Ensembl 
> website.
> 
> Compared to the EBI files I have, they differ in the Features lines:
> sometimes only one "/word" is present, e.g.:
> 
> EBI file :
> 
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> 
> Ensembl file :
> 
> FT   gene         complement(1..3218)
> FT                   /gene="ENSMUSG00000038227"
> 
> The problem I encounter is that the parser correctly converts the "/word"
> into a Note, but the Note is then attached to the immediately
> following feature (e.g. mRNA).
> The current gene feature thus has no annotation.
> 
> This behavior is reproducible by removing one "/word" from an EBI file.
> 
> Apart from this issue, I noticed that Ensembl EMBL files use "=" inside a
> feature (e.g. /note="transcript_id=ENSMUST00000048680"), which ends up
> as an incomplete Note, as the parser seems to split on "=" to separate
> the Key and the Value.
> 
> Thanks for your help,
> 
> Morgane.
> 
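
To make the carry-over in the quoted thread concrete, here is a self-contained sketch of a much simplified feature reader that flushes the pending tag before starting a new feature, in the spirit of Jolyon's change. It is not the EmblFormat code. The class FeatureSectionSketch is hypothetical, the input is assumed to have the "FT   " prefix already removed, and the mRNA location in the example is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FeatureSectionSketch {
    // Simplified stand-in for a feature-table reader (hypothetical, not BioJava).
    // Assumes each line has already had its "FT   " prefix removed, so feature
    // lines start in column 0 and qualifier lines start with a space.
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String line : lines) {
            if (!line.startsWith(" ")) {
                // New feature: flush the pending qualifier first, so it is
                // recorded before this feature rather than after it.
                if (currentTag != null) {
                    section.add(new String[]{currentTag, currentVal.toString()});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(line.trim().split("\\s+"));
            } else {
                String text = line.trim();
                if (text.startsWith("/")) {
                    // New qualifier: store the previous one, then start afresh.
                    if (currentTag != null) {
                        section.add(new String[]{currentTag, currentVal.toString()});
                    }
                    String[] kv = text.substring(1).split("=", 2);
                    currentTag = kv[0];
                    currentVal = new StringBuilder(kv.length > 1 ? kv[1] : "");
                } else if (currentTag != null) {
                    // Continuation line of the current qualifier value.
                    currentVal.append(' ').append(text);
                }
            }
        }
        // End of section: store whatever is still pending.
        if (currentTag != null) {
            section.add(new String[]{currentTag, currentVal.toString()});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "gene         complement(1..3218)",
            "                /gene=\"ENSMUSG00000038227\"",
            "mRNA         join(1..100,200..300)"); // made-up location
        for (String[] entry : readSection(lines)) {
            System.out.println(String.join(" | ", entry));
        }
    }
}
```

Without the flush in the first branch, the /gene entry would be added only after the mRNA line, which matches the misplaced Note Morgane described.
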
-- 
Richard Holland
European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD, UK
Tel: +44-(0)1223-494416
---------------



