[Biojava-l] [biojavax] EMBL parser : features parsing
Richard Holland
richard.holland at ebi.ac.uk
Thu Apr 20 12:05:00 UTC 2006
Hi.
I made some small changes to the code, although nothing that would fix
this kind of problem, committed it back to CVS, checked it out again,
compiled, and ran a test program that read in an EMBL file with the
feature table you describe below, and output it in EMBL format to
another file. I then compared the two files... and found no differences!
The split-on-equals problem didn't occur, and all notes appeared
alongside their correct features.
Could the problem perhaps be in the script you are using?
I've really no idea what the problem is as I can't reproduce it based on
the current CVS contents!
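
For comparison, here is a minimal standalone sketch of the behaviour I
would expect. It is not the EmblFormat code: the class and method names
(FeatureTableSketch, parseQualifier, groupFeatures) are made up for
illustration, and it assumes the standard EMBL layout where the feature
key starts in column 6 and qualifiers are indented further. It flushes the
pending qualifier before starting a new feature, and splits each qualifier
on the first "=" only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not part of biojavax.
class FeatureTableSketch {

    /** Splits a qualifier such as /note="a=b" into {key, value} on the FIRST '=' only. */
    static String[] parseQualifier(String qualifier) {
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        String[] parts = body.split("=", 2); // limit 2 keeps any later '=' inside the value
        String value = parts.length > 1 ? parts[1] : "";
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] { parts[0], value };
    }

    /**
     * Groups raw "FT" lines into features. Each feature is a list whose first
     * entry is {key, location} and whose remaining entries are {tag, value}.
     */
    static List<List<String[]>> groupFeatures(List<String> ftLines) {
        List<List<String[]>> features = new ArrayList<>();
        List<String[]> current = null;  // feature being filled
        StringBuilder pending = null;   // qualifier text still being accumulated
        for (String line : ftLines) {
            if (line.length() <= 5) {
                continue; // nothing after the "FT" prefix
            }
            String rest = line.substring(5);
            if (!rest.startsWith(" ")) {
                // New feature: flush the pending qualifier to the PREVIOUS feature first.
                if (pending != null) {
                    current.add(parseQualifier(pending.toString()));
                    pending = null;
                }
                current = new ArrayList<>();
                features.add(current);
                current.add(rest.trim().split("\\s+", 2));
            } else if (current != null) {
                String text = rest.trim();
                if (text.startsWith("/")) {
                    // A new qualifier closes the one before it.
                    if (pending != null) {
                        current.add(parseQualifier(pending.toString()));
                    }
                    pending = new StringBuilder(text);
                } else if (pending != null) {
                    pending.append(' ').append(text); // continuation line
                }
            }
        }
        if (pending != null) {
            current.add(parseQualifier(pending.toString())); // last qualifier in the table
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "FT   gene            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT                   /note=\"transcript_id=ENSMUST00000048680\"",
            "FT   mRNA            <1..>118",
            "FT                   /product=\"HOXB9\"",
            "FT   CDS             <1..>118");
        for (List<String[]> feature : groupFeatures(lines)) {
            System.out.println(feature.get(0)[0] + " " + feature.get(0)[1]);
            for (String[] q : feature.subList(1, feature.size())) {
                System.out.println("    " + q[0] + " = " + q[1]);
            }
        }
    }
}
```

If the flush step is dropped, the last qualifier of each feature ends up
on the following feature, which is the symptom described below.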
cheers,
Richard
On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
> Hi,
>
> I have tested today's version from CVS.
>
> Both EBI and Ensembl files now react the same way.
> However, the last annotation of a feature is still attached to the
> feature that immediately follows it.
> e.g. :
>
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> FT   mRNA            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT   CDS             <1..>118
>
> /note="Hoxb-9" ends up attached to mRNA
> /product="HOXB9" ends up attached to CDS
>
> Concerning the split-on-equals problem, I still observe the problem :
>
> [(#2) biojavax:note: transcript_i]
>
> for this annotation : /note="transcript_id=ENSMUST00000048680"
>
> Thanks for helping,
>
> Cheers,
>
> Morgane.
>
> Richard Holland wrote:
> > I have committed an UNTESTED patch based on Jolyon's suggestion, and
> > also attempted to fix the split-on-equals problem Morgane observed.
> >
> > Please let me know if there are any problems with it.
> >
> > As this problem affected the UniProt parser in a similar manner (much of
> > the code is identical), the same fixes were applied there too.
> >
> > cheers,
> > Richard
> >
> > On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> >
> >> Hi Morgane,
> >>
> >> I have amended the EmblFormat readSection method as below and the
> >> parsing seems to work; please test it.
> >>
> >> I think that the last bit of annotation is carried over into the next
> >> feature so before adding the new feature I dump the annotation and reset
> >> currentTag and currentVal.
> >>
> >> if (!line.startsWith(" ")) {
> >>     //--------- new code starts ---------------------------
> >>     if (currentTag!=null) {
> >>         section.add(new String[]{currentTag,currentVal.toString()});
> >>         currentTag = null;
> >>         currentVal = null;
> >>     }
> >>     //--------- new code ends -----------------------------
> >>     // case 1 : word value - splits into key-value on its own
> >>     section.add(line.split("\\s+"));
> >> }
> >>
> >> Cheers,
> >>
> >> Jolyon
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: biojava-l-bounces at lists.open-bio.org
> >> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> >> THOMAS-CHOLLIER
> >> Sent: 12 April 2006 09:35
> >> To: biojava-l at open-bio.org
> >> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> >>
> >> Hello again,
> >>
> >> I am currently using biojavax to parse EMBL files exported from Ensembl
> >> website.
> >>
> >> Compared to the EBI files I have, they show a difference in the Features
> >> lines: sometimes only one "/word" is present, e.g.:
> >>
> >> EBI file :
> >>
> >> FT   gene            <1..>118
> >> FT                   /gene="Hoxb9"
> >> FT                   /note="Hoxb-9"
> >>
> >> Ensembl file :
> >>
> >> FT   gene            complement(1..3218)
> >> FT                   /gene="ENSMUSG00000038227"
> >>
> >> The problem I encounter is that the parser correctly converts the "/word"
> >> into a Note, but the Note is then attached to the immediately
> >> following feature (e.g. mRNA).
> >> The current gene feature thus has no annotation.
> >>
> >> This behavior is reproducible by removing one "/word" from an EBI file.
> >>
> >> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
> >> qualifier value (e.g. /note="transcript_id=ENSMUST00000048680"), which ends
> >> up as an incomplete Note, as the parser seems to split on "=" to separate
> >> the Key and the Value.
> >>
> >> Thanks for your help,
> >>
> >> Morgane.
> >>
> >>
>
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416