[Biojava-l] [biojavax] EMBL parser : features parsing

Thu Apr 20 09:35:54 UTC 2006

Hi,

I have tested today's version from CVS.

Both EBI and Ensembl files now behave the same way.
However, the last annotation of a feature is still attached to the
feature that immediately follows it,
e.g. :

FT   gene            <1..>118
FT                   /gene="Hoxb9"
FT                   /note="Hoxb-9"
FT   mRNA            <1..>118
FT                   /gene="Hoxb9"
FT                   /product="HOXB9"
FT   CDS             <1..>118

/note="Hoxb-9" is attached to mRNA (it should belong to gene)
/product="HOXB9" is attached to CDS (it should belong to mRNA)
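
The grouping I would expect can be sketched as follows. This is only a
minimal, self-contained illustration in the spirit of Jolyon's suggestion
quoted below, not the actual EmblFormat code: the class FeatureFlushSketch,
the method readFeatureTable, and the simplification that a feature key
starts at column 6 of an FT line are all assumptions made for the example.
The point is that any pending qualifier is flushed before a new feature key
is recorded, so /note stays with gene and /product stays with mRNA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and structure are NOT taken from biojavax.
class FeatureFlushSketch {

    // Returns {key, value} pairs in reading order. A feature key line yields
    // {featureKey, location}; a qualifier line yields {qualifierName, rawValue}.
    public static List<String[]> readFeatureTable(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;

        for (String raw : lines) {
            if (!raw.startsWith("FT") || raw.length() <= 5) {
                continue; // not a feature-table line
            }
            // Assumption: drop "FT" plus three spaces; feature keys then start at index 0.
            String line = raw.substring(5);

            if (!line.startsWith(" ")) {
                // New feature: flush the pending qualifier first so that it
                // stays with the feature it was read under.
                if (currentTag != null) {
                    section.add(new String[]{currentTag, currentVal.toString()});
                    currentTag = null;
                    currentVal = null;
                }
                section.add(line.trim().split("\\s+"));
            } else {
                String text = line.trim();
                if (text.startsWith("/")) {
                    // New qualifier: flush the previous one, then start collecting.
                    if (currentTag != null) {
                        section.add(new String[]{currentTag, currentVal.toString()});
                    }
                    String[] kv = text.substring(1).split("=", 2);
                    currentTag = kv[0];
                    currentVal = new StringBuilder(kv.length > 1 ? kv[1] : "");
                } else if (currentVal != null) {
                    // Continuation line of the current qualifier value.
                    currentVal.append(' ').append(text);
                }
            }
        }
        // Flush whatever is still pending at the end of the table.
        if (currentTag != null) {
            section.add(new String[]{currentTag, currentVal.toString()});
        }
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "FT   gene            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT                   /note=\"Hoxb-9\"",
            "FT   mRNA            <1..>118",
            "FT                   /gene=\"Hoxb9\"",
            "FT                   /product=\"HOXB9\"",
            "FT   CDS             <1..>118");
        for (String[] entry : readFeatureTable(lines)) {
            System.out.println(String.join("\t", entry));
        }
    }
}
```

With this ordering, every qualifier is emitted before the next feature key,
which is the association I would expect from the parser.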

Concerning the split-on-equals problem, I still observe it :

 [(#2) biojavax:note: transcript_i]

for this annotation :  /note="transcript_id=ENSMUST00000048680"
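
To illustrate what I think is happening, here is a small sketch. It is a
guess, not the biojavax code: the class QualifierSplitSketch and both
methods are hypothetical. naiveSplit shows one way the truncated value
"transcript_i" could arise (split on every "=", then strip the first and
last character of the second piece as if they were the quotes).
splitOnFirstEquals shows the behaviour I would expect, using the two-argument
String.split so that only the first "=" separates key and value.

```java
// Hypothetical sketch; it only mimics the symptom and a possible fix.
class QualifierSplitSketch {

    // Assumed faulty behaviour: split on EVERY '=' and treat the second
    // piece as a fully quoted value.
    public static String[] naiveSplit(String qualifier) {
        String[] parts = qualifier.substring(1).split("=");
        String value = parts.length > 1 ? parts[1] : "";
        if (value.length() >= 2) {
            // Strips the opening quote and, wrongly, the last real character.
            value = value.substring(1, value.length() - 1);
        }
        return new String[]{parts[0], value};
    }

    // Expected behaviour: split on the FIRST '=' only, then remove the
    // surrounding quotes if both are present.
    public static String[] splitOnFirstEquals(String qualifier) {
        String[] parts = qualifier.substring(1).split("=", 2);
        String value = parts.length > 1 ? parts[1] : "";
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[]{parts[0], value};
    }

    public static void main(String[] args) {
        String q = "/note=\"transcript_id=ENSMUST00000048680\"";
        System.out.println(naiveSplit(q)[1]);         // prints transcript_i
        System.out.println(splitOnFirstEquals(q)[1]); // prints transcript_id=ENSMUST00000048680
    }
}
```

The limit argument of 2 keeps any further "=" characters inside the value,
which is what the Ensembl notes need.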

Thanks for helping,

Cheers,

Morgane.

Richard Holland wrote:
> I have committed an UNTESTED patch based on Jolyon's suggestion, and
> also attempted to fix the split-on-equals problem Morgane observed. 
>
> Please let me know if there are any problems with it.
>
> As this problem affected the UniProt parser in a similar manner (much of
> the code is identical), the same fixes were applied there too.
>
> cheers,
> Richard
>
> On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
>   
>> Hi Morgane,
>>
>> I have amended the EmblFormat readSection method as below and the
>> parsing seems to work; please test it.
>>
>> I think that the last bit of annotation is carried over into the next
>> feature so before adding the new feature I dump the annotation and reset
>> currentTag and currentVal.
>>
>> if (!line.startsWith(" ")) {
>> //--------- new code starts ---------------------------
>>   if (currentTag!=null) {
>>     section.add(new String[]{currentTag,currentVal.toString()});
>>     currentTag = null;
>>     currentVal = null;
>>   }
>> //--------- new code ends -----------------------------
>> // case 1 : word value - splits into key-value on its own
>>   section.add(line.split("\\s+"));
>> }
>>
>> Cheers,
>>
>> Jolyon
>>
>>
>>
>> -----Original Message-----
>> From: biojava-l-bounces at lists.open-bio.org
>> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
>> THOMAS-CHOLLIER
>> Sent: 12 April 2006 09:35
>> To: biojava-l at open-bio.org
>> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing
>>
>> Hello again,
>>
>> I am currently using biojavax to parse EMBL files exported from Ensembl 
>> website.
>>
>> Compared to the EBI files I have, they differ in the feature lines:
>> sometimes only one "/word" is present, e.g.:
>>
>> EBI file :
>>
>> FT   gene            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /note="Hoxb-9"
>>
>> Ensembl file :
>>
>> FT   gene         complement(1..3218)
>> FT                   /gene="ENSMUSG00000038227"
>>
>> The problem I encounter is that the parser correctly converts the "/word"
>> into a Note, but the Note is then attached to the immediately following
>> feature (i.e. mRNA).
>> The current gene feature thus has no annotation.
>>
>> This behavior can be reproduced by removing one "/word" from an EBI file.
>>
>> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
>> qualifier value (e.g. /note="transcript_id=ENSMUST00000048680"), which
>> ends up as an incomplete Note, as the parser seems to split on "=" to
>> separate the Key and the Value.
>>
>> Thanks for your help,
>>
>> Morgane.
>>
>>     

-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium

Tel : +32 2 629 15 22
**********************************************************