[Biojava-l] Ensembl gene parsing

saerts saerts at mailserv.esat.kuleuven.ac.be
Wed Jan 29 10:35:11 EST 2003


Hi Ewan,
I know of Mart (and I like it), but it is not suited to automated sequence
retrieval by gene_stable_id (a SOAP web service for the Export Data
function would be nice). Anyway, I guess the Mart output would currently have
the same faults. Do you reckon the Ensembl bugs will be fixed in the short
term?
No ideas on the cause of the 3rd problem? I will probably have to change the
parser source code to print the stack trace instead of the message "could not
be parsed" when parsing errors occur.
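
A minimal sketch of the stack-trace idea, in plain Java. The class and method names are hypothetical and are not part of BioJava, and the exception type the real parser throws is not known from this thread, so the helper accepts any Throwable. It keeps the existing message and appends the full trace so the failing code path becomes visible.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical diagnostic helper; not part of BioJava. */
class ParseDiagnostics {
    /** Returns the usual message for the offending line, followed by the full stack trace. */
    static String describeFailure(String line, Throwable cause) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        out.println("This line could not be parsed: " + line);
        cause.printStackTrace(out); // writes the exception and its frames into the buffer
        out.flush();
        return buffer.toString();
    }
}
```

Inside the parser's catch block this would be used as `System.err.println(ParseDiagnostics.describeFailure(line, e));`, assuming the offending line and the caught exception are both in scope there.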

Thx,
Stein.

PS it is very annoying that my mails are always bounced because of a 'suspicious
header'; am I doing something wrong?


Ewan Birney wrote:

On Wed, 29 Jan 2003, Stein Aerts wrote:

  

Hi,

When parsing an exported sequence of an Ensembl mouse gene 
(using the Export Data function at www.ensembl.org), there currently 
appear to be 3 problems, listed below.
I tried to attach an example of an exported sequence of the Igf1 gene, 
but the message was bounced because of a suspicious header...

1. Some of the exon locations start with .0:
I think this is a bug in the EMBL formatting at Ensembl?
    


Yes, this is pretty certainly a fault at our end, and I think I know where 
it is.

  

FT   exon            .0:44020..44364
FT                   /exon_id="ENSMUSE00000233709"
FT                   /start_phase=0
FT                   /end_phase=0
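
Until the export is fixed, one possible workaround is to pre-process the file and drop the stray `.0:` prefix before it reaches the parser. The sketch below assumes that the coordinates after the prefix are the intended ones, which this thread does not confirm. The class and method names are hypothetical and are not part of BioJava or Ensembl.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-processing helper; not part of BioJava or Ensembl. */
class EnsemblLocationFix {
    // Matches a feature-key line ("FT", three spaces, key, padding)
    // whose location begins with the stray ".0:" prefix.
    private static final Pattern STRAY_PREFIX =
        Pattern.compile("^(FT {3}\\S+\\s+)\\.0:");

    /** Returns the line with a leading ".0:" removed from its location, if present. */
    static String stripStrayPrefix(String line) {
        // Keep the "FT <key> <padding>" part (group 1) and drop only the prefix.
        return STRAY_PREFIX.matcher(line).replaceFirst("$1");
    }
}
```

Qualifier lines such as `/exon_id=...` have no feature key directly after `FT`, so they do not match and pass through unchanged.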



2. The first annotation of a CDS feature is written on the next line 
after CDS. This is not found by the EMBL parser.
I think that this is also a bug at Ensembl?

    


This is probably a line-length issue. I wonder what the right thing to do 
here is... Hmmm

  

FT   CDS             
FT                   /gene="ENSMUSG00000020053"
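
To see how many features in an export are affected, a small check can flag feature-key lines that carry no location on the same line, as in the CDS example above. This is only a sketch: the class and method names are hypothetical, and it relies on the layout visible in the examples here (the key starts directly after `FT` plus three spaces, while qualifier lines are blank at that position).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sanity check for exported feature tables; not part of BioJava. */
class FeatureTableCheck {
    /** Returns the 1-based numbers of FT feature-key lines that have no location. */
    static List<Integer> keysWithoutLocation(List<String> lines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // A feature-key line has a non-blank character right after "FT   ".
            if (line.startsWith("FT   ") && line.length() > 5 && line.charAt(5) != ' ') {
                String[] parts = line.substring(5).trim().split("\\s+");
                if (parts.length == 1) {
                    hits.add(i + 1); // key only, nothing that could be a location
                }
            }
        }
        return hits;
    }
}
```

The reported line numbers could then be compared with the features the parser skips.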



3. Some of the lines cannot be parsed. For example, the parser writes to 
System.out: "This line could not be parsed: exon            2001..2159"
This one I don't understand; I cannot see a problem with these features.

FT   exon            2001..2159
FT                   /exon_id="ENSMUSE00000248454"
FT                   /start_phase=0
FT                   /end_phase=0



Thank you in advance!

    


Stein - have you tried Mart inside Ensembl? For most people, this is a far 
easier way to get bulk downloads in a very easy-to-parse format.


http://www.ensembl.org/Homo_sapiens/martview


Choose feature list and/or gene structure when you get to output.



The Ensembl bugs should be fixed of course... ;)



  

Stein.

-- 
Stein Aerts BioI at SISTA
K.U.Leuven ESAT-SCD Belgium
http://www.esat.kuleuven.ac.be/~dna/BioI


_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l

    


-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney at ebi.ac.uk>. 
-----------------------------------------------------------------
