[Biojava-l] [biojavax] EMBL parser : features parsing[Resolved]

Morgane THOMAS-CHOLLIER mthomasc at vub.ac.be
Thu Apr 20 12:30:10 UTC 2006


I updated my sources a few minutes ago and everything works fine
now (both the annotation and the split-on-equals problems).

I've tested both the EBI file and Ensembl file.

Thanks for fixing the problems !!

Cheers,

Morgane

Jolyon Holdstock wrote:
> No, I'll update my source.
>
> Thanks,
>
> Jolyon
>
>
> -----Original Message-----
> From: Richard Holland [mailto:richard.holland at ebi.ac.uk] 
> Sent: 20 April 2006 13:16
> To: Jolyon Holdstock
> Cc: mthomas at dbm.ulb.ac.be; biojava-l at open-bio.org
> Subject: RE: [Biojava-l] [biojavax] EMBL parser : features
> parsing[Scanned]
>
> Did you use the latest CVS version? (I committed a change that I think
> should have fixed that about 1 minute before my previous email).
>
>
> On Thu, 2006-04-20 at 13:08 +0100, Jolyon Holdstock wrote:
>   
>> I've run the sequence through the parser and it seems to work OK. I
>> iterate through the features and then iterate through the annotations of
>> that feature
>>
>> Based on the input....
>>
>> FT   source          1..118
>> FT                   /organism="Triturus helveticus"
>> FT                   /mol_type="genomic DNA"
>> FT                   /clone="Thel.b9"
>> FT                   /db_xref="taxon:256425"
>> FT   gene            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /note="Hoxb-9"
>> FT   mRNA            <1..>118
>> FT                   /gene="Hoxb9"
>> FT                   /product="HOXB9"
>> FT   CDS             <1..>118
>> FT                   /codon_start=2
>> FT                   /gene="Hoxb9"
>> FT                   /product="HOXB9"
>> FT                   /db_xref="UniProtKB/TrEMBL:Q2LK47"
>> FT                   /protein_id="ABA39736.1"
>> FT                   /translation="KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW"
>>
>> The output is....
>>
>> ========================================
>> Feature: (#0) lcl:DQ158013/DQ158013.1:source,EMBL(1..118)
>> Note: (#0) biojavax:mol_type: genomic DNA
>> Note: (#1) biojavax:clone: Thel.b9
>> ========================================
>> Feature: (#1) lcl:DQ158013/DQ158013.1:gene,EMBL(<1..118>)
>> Note: (#2) biojavax:gene: Hoxb9
>> Note: (#3) biojavax:note: Hoxb-9
>> ========================================
>> Feature: (#2) lcl:DQ158013/DQ158013.1:mRNA,EMBL(<1..118>)
>> Note: (#4) biojavax:gene: Hoxb9
>> Note: (#5) biojavax:product: HOXB9
>> ========================================
>> Feature: (#3) lcl:DQ158013/DQ158013.1:CDS,EMBL(<1..118>)
>> Note: (#6) biojavax:codon_start: 2
>> Note: (#7) biojavax:gene: Hoxb9
>> Note: (#8) biojavax:product: HOXB9
>> Note: (#9) biojavax:protein_id: ABA39736.1
>> Note: (#10) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
>> Note: (#11) biojavax:translation: KYQTLELEKEFLFNMYLTRDRRHEVARLLNLSERQVKIW
>> =============================================
>>
>> This looks OK, the one thing I've just noticed is that the last piece of
>> annotation of the last feature is assigned twice.
>>
>> Jolyon
>>
>>
>> -----Original Message-----
>> From: Richard Holland [mailto:richard.holland at ebi.ac.uk] 
>> Sent: 20 April 2006 13:05
>> To: mthomas at dbm.ulb.ac.be
>> Cc: Jolyon Holdstock; biojava-l at open-bio.org
>> Subject: Re: [Biojava-l] [biojavax] EMBL parser : features
>> parsing[Scanned]
>>
>> Hi.
>>
>> I made some small changes to the code, although nothing that would fix
>> this kind of problem, committed it back to CVS, checked it out again,
>> compiled, and ran a test program that read in an EMBL file with the
>> feature table you describe below, and output it in EMBL format to
>> another file. I then compared the two files... and found no differences!
>> The split-on-equals problem didn't occur, and all notes appeared
>> alongside their correct features.
>>
>> Could there be a problem maybe with the script you are using?
>>
>> I've really no idea what the problem is as I can't reproduce it based on
>> the current CVS contents!
>>
>> cheers,
>> Richard
>>
>> On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
>>     
>>> Hi,
>>>
>>> I have tested today's version from CVS.
>>>
>>> Both EBI and Ensembl files now behave the same way.
>>> The last annotation of a feature is nevertheless still attached to the
>>> immediately following feature, e.g.:
>>>
>>> FT   gene            <1..>118
>>> FT                   /gene="Hoxb9"
>>> FT                   /note="Hoxb-9"
>>> FT   mRNA            <1..>118
>>> FT                   /gene="Hoxb9"
>>> FT                   /product="HOXB9"
>>> FT   CDS             <1..>118
>>>
>>> /note="Hoxb-9" is attached to mRNA
>>> /product="HOXB9" is attached to CDS
>>>
>>> Concerning the split-on-equals problem, I still observe the problem:
>>>  [(#2) biojavax:note: transcript_i]
>>>
>>> for this annotation :  /note="transcript_id=ENSMUST00000048680"
>>>
>>> Thanks for helping,
>>>
>>> Cheers,
>>>
>>> Morgane.
>>>
>>> Richard Holland wrote:
>>>       
>>>> I have committed an UNTESTED patch based on Jolyon's suggestion, and
>>>> also attempted to fix the split-on-equals problem Morgane observed.
>>>> Please let me know if there are any problems with it.
>>>>
>>>> As this problem affected the UniProt parser in a similar manner (much of
>>>> the code is identical), the same fixes were applied there too.
>>>>
>>>> cheers,
>>>> Richard
>>>>
>>>> On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
>>>>   
>>>>         
>>>>> Hi Morgane,
>>>>>
>>>>> I have amended the EmblFormat readSection method as below and the
>>>>> parsing seems to work; please test it.
>>>>>
>>>>> I think that the last bit of annotation is carried over into the next
>>>>> feature, so before adding the new feature I dump the annotation and reset
>>>>> currentTag and currentVal.
>>>>>
>>>>> if (!line.startsWith(" ")) {
>>>>> //--------- new code starts ---------------------------
>>>>>   if (currentTag!=null) {
>>>>>     section.add(new String[]{currentTag,currentVal.toString()});
>>>>>     currentTag = null;
>>>>>     currentVal = null;
>>>>>   }
>>>>> //--------- new code ends -----------------------------
>>>>> // case 1 : word value - splits into key-value on its own
>>>>>   section.add(line.split("\\s+"));
>>>>> }
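
For reference, the effect of this change can be reproduced outside biojavax
with a small stand-alone sketch. It is not the actual EmblFormat code: the
class name FeatureTableSketch, the helper flush, and the assumption that the
leading "FT   " has already been stripped from each line (so feature-key lines
start in column one and qualifier lines start with spaces) are illustrative
assumptions only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the biojavax EmblFormat implementation.
final class FeatureTableSketch {

    // Appends the pending qualifier (tag and value), if there is one.
    private static void flush(List<String[]> section, String tag, StringBuilder val) {
        if (tag != null) {
            section.add(new String[] {tag, val.toString()});
        }
    }

    // Turns feature-table lines (assumed to have the "FT   " prefix removed)
    // into an ordered list of key/value pairs.
    static List<String[]> readSection(List<String> lines) {
        List<String[]> section = new ArrayList<>();
        String currentTag = null;
        StringBuilder currentVal = null;
        for (String line : lines) {
            if (!line.startsWith(" ")) {
                // New feature: flush the pending qualifier first so that it
                // stays with the previous feature, then reset the state.
                flush(section, currentTag, currentVal);
                currentTag = null;
                currentVal = null;
                section.add(line.trim().split("\\s+"));
            } else {
                String text = line.trim();
                if (text.startsWith("/")) {
                    // New qualifier: emit the previous one, start a new one.
                    flush(section, currentTag, currentVal);
                    int eq = text.indexOf('=');
                    currentTag = eq < 0 ? text.substring(1) : text.substring(1, eq);
                    currentVal = new StringBuilder(eq < 0 ? "" : text.substring(eq + 1));
                } else if (currentTag != null) {
                    // Continuation of a wrapped value (joined without a
                    // separator in this sketch).
                    currentVal.append(text);
                }
            }
        }
        // Emit whatever is still pending at the end of the table.
        flush(section, currentTag, currentVal);
        return section;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "gene            <1..>118",
            "                /gene=\"Hoxb9\"",
            "                /note=\"Hoxb-9\"",
            "mRNA            <1..>118",
            "                /product=\"HOXB9\"");
        for (String[] pair : readSection(lines)) {
            System.out.println(String.join(" | ", pair));
        }
    }
}
```

If the flush in the first branch is removed, the pending /note pair is only
emitted when the next qualifier starts, that is after the mRNA line, which
matches the misattribution reported in this thread.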
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jolyon
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: biojava-l-bounces at lists.open-bio.org
>>>>> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
>>>>> THOMAS-CHOLLIER
>>>>> Sent: 12 April 2006 09:35
>>>>> To: biojava-l at open-bio.org
>>>>> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
>>>>>
>>>>> Hello again,
>>>>>
>>>>> I am currently using biojavax to parse EMBL files exported from the
>>>>> Ensembl website.
>>>>>
>>>>> Compared to the EBI files I have, they show a difference in the feature
>>>>> lines:
>>>>>
>>>>> Sometimes only one "/word" is present, e.g.:
>>>>>
>>>>> EBI file :
>>>>>
>>>>> FT   gene            <1..>118
>>>>> FT                   /gene="Hoxb9"
>>>>> FT                   /note="Hoxb-9"
>>>>>
>>>>> Ensembl file :
>>>>>
>>>>> FT   gene         complement(1..3218)
>>>>> FT                   /gene="ENSMUSG00000038227"
>>>>>
>>>>> The problem I encounter is that the parser correctly converts the "/word"
>>>>> into a Note, but the Note is then attached to the immediately
>>>>> following feature (e.g. mRNA).
>>>>> The current gene feature thus has no annotation.
>>>>>
>>>>> This behavior is reproducible when removing one "/word" from an EBI file.
>>>>>
>>>>> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
>>>>> feature qualifier (e.g. /note="transcript_id=ENSMUST00000048680"), which
>>>>> ends up as an incomplete Note, as the parser seems to split on "=" to
>>>>> separate the key and the value.
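
One way to avoid the truncated value is to split each qualifier on the first
"=" only. A minimal sketch of that idea follows. It is not the code that went
into CVS, and the class and method names (QualifierSplitSketch,
splitQualifier) are made up for illustration.

```java
// Hypothetical stand-alone sketch; not the biojavax parser code.
final class QualifierSplitSketch {

    // Splits a qualifier such as /note="a=b" into {key, value}, cutting at
    // the first '=' only and removing one pair of surrounding double quotes.
    static String[] splitQualifier(String qualifier) {
        String text = qualifier.startsWith("/") ? qualifier.substring(1) : qualifier;
        int eq = text.indexOf('=');
        if (eq < 0) {
            // Qualifier without a value.
            return new String[] {text, ""};
        }
        String key = text.substring(0, eq);
        String value = text.substring(eq + 1);
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        String[] kv = splitQualifier("/note=\"transcript_id=ENSMUST00000048680\"");
        // prints: note: transcript_id=ENSMUST00000048680
        System.out.println(kv[0] + ": " + kv[1]);
    }
}
```

String.split("=", 2) would give the same key/value split. The indexOf form is
used here only because it makes the "first '=' wins" rule explicit.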
>>>>>
>>>>> Thanks for your help,
>>>>>
>>>>> Morgane.
>>>>>
>>>>>     
>>>>>           


-- 
**********************************************************
Morgane THOMAS-CHOLLIER, PHD Student (mthomasc at vub.ac.be)

Vrije Universiteit Brussels (VUB)
Laboratory of Cell Genetics
Pleinlaan 2
1050 Brussels
Belgium




