[Biojava-l] Ensembl gene parsing

Stein Aerts stein.aerts at esat.kuleuven.ac.be
Wed Jan 29 09:57:10 EST 2003


Hi,

Currently, when parsing an exported sequence of an Ensembl mouse gene 
(using the Export Data function at www.ensembl.org), there appear to be 
3 problems, listed below.
I tried to attach an example, the exported sequence of the Igf1 gene, 
but the message was bounced because of a suspicious header...

1. Some of the exon locations start with .0:
I think this is a bug in the EMBL formatting at Ensembl?

FT   exon            .0:44020..44364
FT                   /exon_id="ENSMUSE00000233709"
FT                   /start_phase=0
FT                   /end_phase=0
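
As a workaround, the exported file could be cleaned line by line before 
it reaches the parser. The following is only a sketch: the class and 
method names are made up for illustration and are not part of BioJava, 
and it assumes that the ".0:" prefix carries no information and can 
simply be dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not BioJava API.
class EnsemblLocationFix {
    // "FT", three spaces, a feature key, padding, then a location
    // that starts with the unexpected ".0:" prefix.
    private static final Pattern BOGUS_PREFIX =
        Pattern.compile("^(FT {3}\\S+\\s+)\\.0:(.*)$");

    // Returns the line without the ".0:" prefix; any other line
    // (qualifier lines, normal locations) is returned unchanged.
    static String stripBogusPrefix(String line) {
        Matcher m = BOGUS_PREFIX.matcher(line);
        return m.matches() ? m.group(1) + m.group(2) : line;
    }
}
```

Each line read from the export would be passed through 
stripBogusPrefix() before the result is handed to the EMBL parser.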



2. For a CDS feature, nothing follows the CDS key on its own line (no 
location); the first annotation is written on the next line. This is not 
found by the EMBL parser.
I think that is also a bug at Ensembl?

FT   CDS             
FT                   /gene="ENSMUSG00000020053"
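
To find the affected features in an exported file, a check along these 
lines might do. It is only a sketch, under the assumption that the 
feature key starts at column 6 and that a key with nothing after it 
means the location is missing; the class and method names are invented 
for illustration and are not BioJava API.

```java
// Hypothetical helper, not BioJava API.
class EmptyLocationCheck {
    // True if the line is a feature key line ("FT   <key>") with
    // nothing but whitespace after the key, i.e. no location.
    static boolean hasEmptyLocation(String line) {
        if (!line.startsWith("FT   ") || line.length() < 6) {
            return false;
        }
        if (line.charAt(5) == ' ') {
            return false; // continuation or qualifier line, no key here
        }
        String[] parts = line.substring(5).trim().split("\\s+");
        return parts.length == 1; // only the key, no location token
    }
}
```

Lines for which hasEmptyLocation() returns true could then be reported 
or skipped before parsing.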



3. Some of the lines cannot be parsed. For example, the parser writes to 
System.out: "This line could not be parsed: exon            2001..2159"
I don't understand this one: I cannot see anything wrong with these 
features?

FT   exon            2001..2159
FT                   /exon_id="ENSMUSE00000248454"
FT                   /start_phase=0
FT                   /end_phase=0



Thank you in advance!

Stein.

-- 
Stein Aerts BioI at SISTA
K.U.Leuven ESAT-SCD Belgium
http://www.esat.kuleuven.ac.be/~dna/BioI



