[Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]

Thu Apr 20 12:16:00 UTC 2006

Did you use the latest CVS version? About a minute before my previous
email I committed a change that I think should have fixed that.
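
For reference, a minimal sketch of the buffering logic discussed in the
quoted messages below: a qualifier is held as pending and flushed when the
next qualifier or the next feature starts, then exactly once more at the end
of the section. The class and method names are hypothetical and the handling
is simplified (quotes are kept on values); this is not the actual EmblFormat
code, only an illustration under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual EmblFormat.readSection code.
final class FeatureSectionSketch {

    /** Turns feature-table lines into {key, value} pairs, in input order. */
    static List<String[]> readSection(List<String> rawLines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;

        for (String raw : rawLines) {
            if (raw.length() <= 5) continue;      // bare "FT" line, nothing to parse
            String line = raw.substring(5);       // drop the "FT   " prefix
            if (!line.startsWith(" ")) {
                // New feature: flush the pending qualifier first so that it
                // stays with the feature it belongs to.
                if (currentTag != null) {
                    section.add(new String[] {currentTag, currentVal.toString()});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(line.trim().split("\\s+"));   // {feature key, location}
            } else {
                String text = line.trim();
                if (text.startsWith("/")) {
                    // New qualifier: flush the previous one, then start this one.
                    if (currentTag != null) {
                        section.add(new String[] {currentTag, currentVal.toString()});
                    }
                    String[] kv = text.substring(1).split("=", 2);
                    currentTag = kv[0];
                    currentVal = new StringBuilder(kv.length > 1 ? kv[1] : "");
                } else if (currentTag != null) {
                    currentVal.append(text);      // continuation of a wrapped value
                }
            }
        }
        // End of section: flush the last pending qualifier exactly once.
        if (currentTag != null) {
            section.add(new String[] {currentTag, currentVal.toString()});
        }
        return section;
    }
}
```

Fed the gene/mRNA lines quoted below, this returns the pairs in file order,
with /note directly after the gene feature and before mRNA.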

On Thu, 2006-04-20 at 13:08 +0100, Jolyon Holdstock wrote:
> I've run the sequence through the parser and it seems to work OK. I
> iterate through the features and, for each one, through its
> annotations.
> 
> Based on the input....
> 
> FT   source          1..118
> FT                   /organism="Triturus helveticus"
> FT                   /mol_type="genomic DNA"
> FT                   /clone="Thel.b9"
> FT                   /db_xref="taxon:256425"
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> FT   mRNA            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT   CDS             <1..>118
> FT                   /codon_start=2
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT                   /db_xref="UniProtKB/TrEMBL:Q2LK47"
> FT                   /protein_id="ABA39736.1"
> FT                   /translation="KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW"
> 
> The output is....
> 
> ========================================
> Feature: (#0) lcl:DQ158013/DQ158013.1:source,EMBL(1..118)
> Note: (#0) biojavax:mol_type: genomic DNA
> Note: (#1) biojavax:clone: Thel.b9
> ========================================
> Feature: (#1) lcl:DQ158013/DQ158013.1:gene,EMBL(<1..118>)
> Note: (#2) biojavax:gene: Hoxb9
> Note: (#3) biojavax:note: Hoxb-9
> ========================================
> Feature: (#2) lcl:DQ158013/DQ158013.1:mRNA,EMBL(<1..118>)
> Note: (#4) biojavax:gene: Hoxb9
> Note: (#5) biojavax:product: HOXB9
> ========================================
> Feature: (#3) lcl:DQ158013/DQ158013.1:CDS,EMBL(<1..118>)
> Note: (#6) biojavax:codon_start: 2
> Note: (#7) biojavax:gene: Hoxb9
> Note: (#8) biojavax:product: HOXB9
> Note: (#9) biojavax:protein_id: ABA39736.1
> Note: (#10) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
> Note: (#11) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
> =============================================
> 
> This looks OK; the one thing I've just noticed is that the last
> annotation of the last feature is assigned twice.
> 
> Jolyon
> 
> 
> -----Original Message-----
> From: Richard Holland [mailto:richard.holland at ebi.ac.uk] 
> Sent: 20 April 2006 13:05
> To: mthomas at dbm.ulb.ac.be
> Cc: Jolyon Holdstock; biojava-l at open-bio.org
> Subject: Re: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> 
> Hi.
> 
> I made some small changes to the code, although nothing that would fix
> this kind of problem, committed it back to CVS, checked it out again,
> compiled, and ran a test program that read in an EMBL file with the
> feature table you describe below, and output it in EMBL format to
> another file. I then compared the two files... and found no differences!
> The split-on-equals problem didn't occur, and all notes appeared
> alongside their correct features.
> 
> Could there be a problem maybe with the script you are using?
> 
> I've really no idea what the problem is as I can't reproduce it based on
> the current CVS contents!
> 
> cheers,
> Richard
> 
> On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
> > Hi,
> > 
> > I have tested today's version from CVS.
> > 
> > Both EBI and Ensembl files now behave the same way.
> > The last annotation of a feature is nevertheless still attached to the
> > immediately following feature.
> > e.g. :
> > 
> > FT   gene            <1..>118
> > FT                   /gene="Hoxb9"
> > FT                   /note="Hoxb-9"
> > FT   mRNA            <1..>118
> > FT                   /gene="Hoxb9"
> > FT                   /product="HOXB9"
> > FT   CDS             <1..>118
> > 
> > /note="Hoxb-9" is related to mRNA
> > /product="HOXB9" is related to CDS
> > 
> > Concerning the split-on-equals problem, I still observe the problem :
> > 
> >  [(#2) biojavax:note: transcript_i]
> > 
> > for this annotation :  /note="transcript_id=ENSMUST00000048680"
> > 
> > Thanks for helping,
> > 
> > Cheers,
> > 
> > Morgane.
> > 
> > Richard Holland wrote:
> > > I have committed an UNTESTED patch based on Jolyon's suggestion, and
> > > also attempted to fix the split-on-equals problem Morgane observed. 
> > >
> > > Please let me know if there are any problems with it.
> > >
> > > As this problem affected the UniProt parser in a similar manner (much of
> > > the code is identical), the same fixes were applied there too.
> > >
> > > cheers,
> > > Richard
> > >
> > > On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> > >   
> > >> Hi Morgane,
> > >>
> > >> I have amended the EmblFormat readSection method as below and the
> > >> parsing seems to work; please test it.
> > >>
> > >> I think that the last bit of annotation is carried over into the next
> > >> feature, so before adding the new feature I dump the annotation and reset
> > >> currentTag and currentVal.
> > >>
> > >> if (!line.startsWith(" ")) {
> > >>   //--------- new code starts ---------------------------
> > >>   // flush the pending qualifier so it stays with the previous feature
> > >>   if (currentTag != null) {
> > >>     section.add(new String[]{currentTag, currentVal.toString()});
> > >>     currentTag = null;
> > >>     currentVal = null;
> > >>   }
> > >>   //--------- new code ends -----------------------------
> > >>   // case 1 : word value - splits into key-value on its own
> > >>   section.add(line.split("\\s+"));
> > >> }
> > >>
> > >> Cheers,
> > >>
> > >> Jolyon
> > >>
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: biojava-l-bounces at lists.open-bio.org
> > >> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> > >> THOMAS-CHOLLIER
> > >> Sent: 12 April 2006 09:35
> > >> To: biojava-l at open-bio.org
> > >> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> > >>
> > >> Hello again,
> > >>
> > >> I am currently using biojavax to parse EMBL files exported from the
> > >> Ensembl website.
> > >>
> > >> Compared to the EBI files I have, they show a difference in the feature
> > >> lines:
> > >>
> > >> sometimes only one "/word" is present, e.g.:
> > >>
> > >> EBI file :
> > >>
> > >> FT   gene            <1..>118
> > >> FT                   /gene="Hoxb9"
> > >> FT                   /note="Hoxb-9"
> > >>
> > >> Ensembl file :
> > >>
> > >> FT   gene         complement(1..3218)
> > >> FT                   /gene="ENSMUSG00000038227"
> > >>
> > >> The problem I encounter is that the parser correctly converts the "/word"
> > >> into a Note, but the Note is then attached to the immediately
> > >> following feature (e.g. mRNA).
> > >> The current gene feature thus has no annotation.
> > >>
> > >> This behavior is reproducible when removing one "/word" from an EBI file.
> > >>
> > >> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
> > >> feature qualifier (e.g. /note="transcript_id=ENSMUST00000048680"), which
> > >> ends up as an incomplete Note, as the parser seems to split on "=" to
> > >> separate the Key and the Value.
> > >>
> > >> Thanks for your help,
> > >>
> > >> Morgane.
> > >>
> > >>     
> > 
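
On the split-on-equals point raised above, a minimal sketch of splitting a
qualifier at the first "=" only, so that any later "=" stays inside the
value. The class and method names are hypothetical and not part of the
BioJava API; it assumes the qualifier has already been joined onto a single
line.

```java
// Hypothetical helper for illustration; not part of the BioJava API.
final class QualifierSplitSketch {

    /** Splits /key="value" at the first '=' only; returns {key, value}. */
    static String[] splitQualifier(String qualifier) {
        String text = qualifier.trim();
        if (text.startsWith("/")) {
            text = text.substring(1);
        }
        // A limit of 2 stops String.split after the first '=', so any
        // further '=' characters remain part of the value.
        String[] parts = text.split("=", 2);
        String key = parts[0];
        String value = parts.length > 1 ? parts[1] : "";   // qualifier without a value
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        System.out.println(kv[0] + ": " + kv[1]);
    }
}
```

With the Ensembl qualifier quoted above, the key is "note" and the value is
the full "transcript_id=ENSMUST00000048680" rather than a truncated string.
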
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416