[Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]

Thu Apr 20 12:08:40 UTC 2006

I've run the sequence through the parser and it seems to work OK. I
iterate through the features and then through the annotations of each
feature.

Based on the input....

FT   source          1..118
FT                   /organism="Triturus helveticus"
FT                   /mol_type="genomic DNA"
FT                   /clone="Thel.b9"
FT                   /db_xref="taxon:256425"
FT   gene            <1..>118
FT                   /gene="Hoxb9"
FT                   /note="Hoxb-9"
FT   mRNA            <1..>118
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT   CDS             <1..>118
FT                   /codon_start=2
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT                   /db_xref="UniProtKB/TrEMBL:Q2LK47"
FT                   /protein_id="ABA39736.1"
FT                   /translation="KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW"

The output is....

========================================
Feature: (#0) lcl:DQ158013/DQ158013.1:source,EMBL(1..118)
Note: (#0) biojavax:mol_type: genomic DNA
Note: (#1) biojavax:clone: Thel.b9
========================================
Feature: (#1) lcl:DQ158013/DQ158013.1:gene,EMBL(<1..118>)
Note: (#2) biojavax:gene: Hoxb9
Note: (#3) biojavax:note: Hoxb-9
========================================
Feature: (#2) lcl:DQ158013/DQ158013.1:mRNA,EMBL(<1..118>)
Note: (#4) biojavax:gene: Hoxb9
Note: (#5) biojavax:product: HOXB9
========================================
Feature: (#3) lcl:DQ158013/DQ158013.1:CDS,EMBL(<1..118>)
Note: (#6) biojavax:codon_start: 2
Note: (#7) biojavax:gene: Hoxb9
Note: (#8) biojavax:product: HOXB9
Note: (#9) biojavax:protein_id: ABA39736.1
Note: (#10) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
Note: (#11) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
=============================================

This looks OK. The one thing I've just noticed is that the last
annotation of the last feature is assigned twice.
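
To illustrate the flush logic under discussion, here is a minimal,
self-contained sketch. It is not the BioJavaX EmblFormat source: the
class name FeatureTableSketch, the unquote helper and this standalone
readSection are hypothetical and only borrow the currentTag/currentVal
idea from the patch. A pending qualifier is flushed before a new feature
or qualifier starts, and once more at the end of input. Clearing
currentTag after each flush is what stops the last annotation from being
carried over or added twice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the BioJavaX EmblFormat source. */
final class FeatureTableSketch {

    /** Removes one pair of surrounding double quotes, if present. */
    static String unquote(String v) {
        boolean quoted = v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"");
        return quoted ? v.substring(1, v.length() - 1) : v;
    }

    /** Turns EMBL "FT" lines into [key, value] pairs, in file order. */
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String raw : lines) {
            // Skip non-FT lines and bare "FT" lines that carry no content.
            if (!raw.startsWith("FT") || raw.length() <= 5) continue;
            String line = raw.substring(5);
            String body = line.trim();
            if (body.isEmpty()) continue;

            boolean newFeature = !line.startsWith(" ");
            boolean newQualifier = !newFeature && body.startsWith("/");
            if ((newFeature || newQualifier) && currentTag != null) {
                // Flush the pending qualifier before anything new starts,
                // then clear it so it cannot be added a second time.
                section.add(new String[] {currentTag, unquote(currentVal.toString())});
                currentTag = null;
                currentVal = null;
            }
            if (newFeature) {
                // Feature key and location, e.g. "gene" and "<1..>118".
                section.add(body.split("\\s+", 2));
            } else if (newQualifier) {
                // Split on the first '=' only, so values may contain '='.
                int eq = body.indexOf('=');
                currentTag = eq < 0 ? body.substring(1) : body.substring(1, eq);
                currentVal = new StringBuilder(eq < 0 ? "" : body.substring(eq + 1));
            } else if (currentTag != null) {
                // Continuation line, appended without a separator (a simplification).
                currentVal.append(body);
            }
        }
        // End of input: flush whatever is still pending, exactly once.
        if (currentTag != null) {
            section.add(new String[] {currentTag, unquote(currentVal.toString())});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> ft = List.of(
            "FT   gene            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT   mRNA            <1..>118",
            "FT                   /product=\"HOXB9\"");
        for (String[] kv : readSection(ft)) {
            System.out.println(String.join(" -> ", kv));
        }
    }
}
```

In this sketch each qualifier pair directly follows the feature pair it
belongs to, so a caller can attach notes to the most recently seen
feature.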

Jolyon

-----Original Message-----
From: Richard Holland [mailto:richard.holland at ebi.ac.uk] 
Sent: 20 April 2006 13:05
To: mthomas at dbm.ulb.ac.be
Cc: Jolyon Holdstock; biojava-l at open-bio.org
Subject: Re: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]

Hi.

I made some small changes to the code, although nothing that would fix
this kind of problem, committed it back to CVS, checked it out again,
compiled, and ran a test program that read in an EMBL file with the
feature table you describe below, and output it in EMBL format to
another file. I then compared the two files... and found no differences!
The split-on-equals problem didn't occur, and all notes appeared
alongside their correct features.
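
For reference, the general way to avoid a split-on-equals problem is to
split a qualifier on its first "=" only, so that any "=" inside the
quoted value is kept. The following is a minimal sketch of that idea; the
class QualifierSplitSketch and the method splitQualifier are hypothetical
names for illustration and are not taken from the BioJavaX source.

```java
/** Hypothetical sketch; not the BioJavaX source. */
final class QualifierSplitSketch {

    /** Splits "/key=value" into {key, value} on the first '=' only. */
    static String[] splitQualifier(String qualifier) {
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        // A limit of 2 stops String.split after the first '='.
        String[] kv = body.split("=", 2);
        String key = kv[0];
        String value = kv.length > 1 ? kv[1] : "";
        // Drop one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        // prints "note: transcript_id=ENSMUST00000048680"
        System.out.println(kv[0] + ": " + kv[1]);
    }
}
```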

Could there be a problem maybe with the script you are using?

I've really no idea what the problem is as I can't reproduce it based on
the current CVS contents!

cheers,
Richard

On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
> Hi,
> 
> I have tested today's version from CVS.
> 
> Both EBI and Ensembl files now react the same way.
> However, the last annotation of a feature is still attached to the
> immediately following feature,
> e.g.:
> 
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> FT   mRNA            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT   CDS             <1..>118
> 
> /note="Hoxb-9" is related to mRNA
> /product="HOXB9" is related to CDS
> 
> Concerning the split-on-equals problem, I still observe the problem :
> 
>  [(#2) biojavax:note: transcript_i]
> 
> for this annotation :  /note="transcript_id=ENSMUST00000048680"
> 
> Thanks for helping,
> 
> Cheers,
> 
> Morgane.
> 
> Richard Holland wrote:
> > I have committed an UNTESTED patch based on Jolyon's suggestion, and
> > also attempted to fix the split-on-equals problem Morgane observed. 
> >
> > Please let me know if there are any problems with it.
> >
> > As this problem affected the UniProt parser in a similar manner (much of
> > the code is identical), the same fixes were applied there too.
> >
> > cheers,
> > Richard
> >
> > On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> >   
> >> Hi Morgane,
> >>
> >> I have amended the EmblFormat readSection method as below and the
> >> parsing seems to work; please test it.
> >>
> >> I think that the last bit of annotation is carried over into the next
> >> feature so before adding the new feature I dump the annotation and reset
> >> currentTag and currentVal.
> >>
> >> if (!line.startsWith(" ")) {
> >> //--------- new code starts ---------------------------
> >>   if (currentTag!=null) {
> >>     section.add(new String[]{currentTag,currentVal.toString()});
> >>     currentTag = null;
> >>     currentVal = null;
> >>   }
> >> //--------- new code ends -----------------------------
> >> // case 1 : word value - splits into key-value on its own
> >>   section.add(line.split("\\s+"));
> >> }
> >>
> >> Cheers,
> >>
> >> Jolyon
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: biojava-l-bounces at lists.open-bio.org
> >> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> >> THOMAS-CHOLLIER
> >> Sent: 12 April 2006 09:35
> >> To: biojava-l at open-bio.org
> >> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> >>
> >> Hello again,
> >>
> >> I am currently using biojavax to parse EMBL files exported from the
> >> Ensembl website.
> >>
> >> Compared to the EBI files I have, they differ in the feature lines:
> >> sometimes only one "/word" is present, e.g.:
> >>
> >> EBI file :
> >>
> >> FT   gene            <1..>118
> >> FT                   /gene="Hoxb9"
> >> FT                   /note="Hoxb-9"
> >>
> >> Ensembl file;
> >>
> >> FT   gene         complement(1..3218)
> >> FT                   /gene="ENSMUSG00000038227"
> >>
> >> The problem I encounter is that the parser correctly converts the
> >> "/word" into a Note, but the Note is then attached to the immediately
> >> following feature (i.e. mRNA).
> >> The current gene feature thus has no annotation.
> >>
> >> This behavior is reproducible by removing one "/word" from an EBI file.
> >>
> >> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
> >> feature qualifier (e.g. /note="transcript_id=ENSMUST00000048680"), which
> >> ends up as an incomplete Note, as the parser seems to split on "=" to
> >> separate the Key and the Value.
> >>
> >> Thanks for your help,
> >>
> >> Morgane.
> >>
> >>     
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416
