[Biojava-l] [biojavax] EMBL parser : features parsing
Morgane THOMAS-CHOLLIER
mthomasc at vub.ac.be
Thu Apr 20 09:35:54 UTC 2006
Hi,
I have tested today's version from CVS.
EBI and Ensembl files now behave the same way.
However, the last annotation of each feature is still attached to the
feature that immediately follows it.
e.g. :
FT   gene            <1..>118
FT                   /gene="Hoxb9"
FT                   /note="Hoxb-9"
FT   mRNA            <1..>118
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT   CDS             <1..>118
/note="Hoxb-9" ends up attached to mRNA (it belongs to gene)
/product="HOXB9" ends up attached to CDS (it belongs to mRNA)
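To make the expected behaviour concrete, here is a minimal, self-contained sketch. It is not the biojavax code: the class name FeatureFlushSketch is hypothetical, the method is only loosely modelled on the readSection snippet quoted below, and it assumes the standard EMBL layout in which a feature key starts in column 6 while qualifiers and continuation lines are indented further. The point it illustrates is that a buffered qualifier has to be flushed before the next feature is added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the actual EmblFormat implementation.
class FeatureFlushSketch {

    // Returns {key, location} entries for features and {tag, value}
    // entries for qualifiers, in the order they belong together.
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String raw : lines) {
            // Assumption: the first 5 columns hold the "FT" line code and padding.
            String body = raw.length() > 5 ? raw.substring(5) : "";
            if (!body.startsWith(" ")) {
                // New feature: flush the buffered qualifier first, so that it
                // stays with the previous feature instead of the new one.
                if (currentTag != null) {
                    section.add(new String[]{currentTag, currentVal.toString()});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(body.trim().split("\\s+", 2));
            } else {
                String text = body.trim();
                if (text.startsWith("/")) {
                    // New qualifier: flush the previous one, then buffer this one.
                    if (currentTag != null) {
                        section.add(new String[]{currentTag, currentVal.toString()});
                    }
                    int eq = text.indexOf('=');
                    currentTag = eq < 0 ? text.substring(1) : text.substring(1, eq);
                    String val = eq < 0 ? "" : text.substring(eq + 1);
                    currentVal = new StringBuilder(val.replace("\"", ""));
                } else if (currentVal != null) {
                    // Continuation of a multi-line qualifier value.
                    currentVal.append(' ').append(text.replace("\"", ""));
                }
            }
        }
        // Flush whatever is still buffered at the end of the section.
        if (currentTag != null) {
            section.add(new String[]{currentTag, currentVal.toString()});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "FT   gene            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT                   /note=\"Hoxb-9\"",
            "FT   mRNA            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT                   /product=\"HOXB9\"",
            "FT   CDS             <1..>118");
        for (String[] entry : readSection(lines)) {
            System.out.println(String.join(" | ", entry));
        }
    }
}
```

With the example above, the note entry is emitted before the mRNA entry and the product entry before the CDS entry, which is the grouping I would expect from the parser.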
Concerning the split-on-equals problem, I still observe it. I get :
[(#2) biojavax:note: transcript_i]
for this annotation : /note="transcript_id=ENSMUST00000048680"
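A possible way to avoid this, as a sketch only: split the qualifier on the first "=" and leave any later "=" inside the value. The class and method names below (QualifierSplitSketch, splitQualifier) are hypothetical and do not exist in biojavax; the only library call relied upon is the standard String.split(regex, limit).

```java
// Hypothetical sketch, not the actual biojavax parsing code.
class QualifierSplitSketch {

    // Splits a qualifier such as /note="a=b" into {"note", "a=b"}.
    static String[] splitQualifier(String qualifier) {
        String body = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        // A limit of 2 splits on the first '=' only, so the value is kept whole.
        String[] kv = body.split("=", 2);
        String key = kv[0];
        String value = kv.length > 1 ? kv[1] : "";
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[]{key, value};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```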
Thanks for helping,
Cheers,
Morgane.
Richard Holland wrote:
> I have committed an UNTESTED patch based on Jolyon's suggestion, and
> also attempted to fix the split-on-equals problem Morgane observed.
>
> Please let me know if there are any problems with it.
>
> As this problem affected the UniProt parser in a similar manner (much of
> the code is identical), the same fixes were applied there too.
>
> cheers,
> Richard
>
> On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
>
>> Hi Morgane,
>>
>> I have amended the EmblFormat readSection method as below and the
>> parsing seems to work; please test it.
>>
>> I think that the last bit of annotation is carried over into the next
>> feature so before adding the new feature I dump the annotation and reset
>> currentTag and currentVal.
>>
>> if (!line.startsWith(" ")) {
>>     //--------- new code starts ---------------------------
>>     if (currentTag != null) {
>>         section.add(new String[]{currentTag, currentVal.toString()});
>>         currentTag = null;
>>         currentVal = null;
>>     }
>>     //--------- new code ends -----------------------------
>>     // case 1 : word value - splits into key-value on its own
>>     section.add(line.split("\\s+"));
>> }
>>
>> Cheers,
>>
>> Jolyon
>>
>>
>>
>> -----Original Message-----
>> From: biojava-l-bounces at lists.open-bio.org
>> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
>> THOMAS-CHOLLIER
>> Sent: 12 April 2006 09:35
>> To: biojava-l at open-bio.org
>> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>>
>> Hello again,
>>
>> I am currently using biojavax to parse EMBL files exported from the
>> Ensembl website.
>>
>> Compared to the EBI files I have, they show a difference in the
>> Features lines: sometimes, only one "/word" is present, i.e.:
>>
>> EBI file :
>>
>> FT   gene            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /note="Hoxb-9"
>>
>> Ensembl file :
>>
>> FT   gene            complement(1..3218)
>> FT                   /gene="ENSMUSG00000038227"
>>
>> The problem I encounter is that the parser correctly converts the
>> "/word" into a Note, but the Note is then attached to the immediately
>> following feature (i.e. mRNA).
>> The current gene feature thus has no annotation.
>>
>> This behavior is reproducible when removing one "/word" from an EBI file.
>>
>> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
>> feature (e.g. /note="transcript_id=ENSMUST00000048680"), which ends up
>> as an incomplete Note, as the parser seems to split on "=" to separate
>> the Key and the Value.
>>
>> Thanks for your help,
>>
>> Morgane.
>>
>>
--
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)
Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium
Tel : +32 2 629 15 22
**********************************************************
Stop Using Internet Explorer, choose FIREFOX !