[Biojava-l] Parsing EMBL files from Ensembl

Thomas Down td2@sanger.ac.uk
Thu, 21 Nov 2002 09:55:06 +0000


On Thu, Nov 21, 2002 at 09:23:58AM +0100, Stein Aerts wrote:
> 
> Since today, apparently something changed on the "export data" function 
> of Ensembl. When retrieving a gene based on its ensembl id, e.g. 
> ENSG00000110092 with 2000 bp on either side, and requesting only gene 
> features, then until yesterday, the resulting EMBL formatted file had 
> ID= ENSG00000110092 but now it has ID :
> 
> ID   Chromosome 11 71948701 to 71966070  ENSEMBL; DNA; HUM; 17370 BP.

There have been very substantial changes in the Ensembl codebase
between versions 8 and 9.  More-or-less a complete rewrite, I think.

> This line could not be parsed: CDS             
> join(-1151..-840,1654..1777,1995..2434)

I don't think EMBL entries are meant to have coordinates
outside 1..length.  Anything that doesn't fall within
the attached sequence is supposed to be represented in the
accession:x..y format.

We could probably add some kind of `tolerant' flag on the BioJava
parser which either truncates such locations, or turns the relevant
Features into RemoteFeatures.  It's probably also worth talking to
the Ensembl people to see if this change was actually intentional,
though.
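
To make the truncation option concrete, here is a rough standalone
sketch of what it could do with the CDS location quoted above.  It
does not use any BioJava classes: TolerantLocation, parseSpans and
truncate are hypothetical names made up for illustration, and the
sketch only handles plain local x..y spans (no complement(), no
accession:x..y references).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava: illustrates truncating
// feature spans so that they fall within 1..length of the sequence.
public class TolerantLocation {
    // Matches one span such as "1654..1777" or "-1151..-840".
    private static final Pattern SPAN =
        Pattern.compile("(-?\\d+)\\.\\.(-?\\d+)");

    /** Extracts the {start, end} pairs from a simple location string. */
    static List<int[]> parseSpans(String location) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = SPAN.matcher(location);
        while (m.find()) {
            spans.add(new int[] {
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2))
            });
        }
        return spans;
    }

    /** Clamps each span to 1..length and drops spans entirely outside. */
    static List<int[]> truncate(List<int[]> spans, int length) {
        List<int[]> kept = new ArrayList<>();
        for (int[] s : spans) {
            int start = Math.max(s[0], 1);
            int end = Math.min(s[1], length);
            if (start <= end) {
                kept.add(new int[] { start, end });
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> spans =
            parseSpans("join(-1151..-840,1654..1777,1995..2434)");
        // 17370 is the sequence length given on the ID line.
        for (int[] s : truncate(spans, 17370)) {
            System.out.println(s[0] + ".." + s[1]);
        }
    }
}
```

With the location above and a length of 17370, the first span lies
entirely before position 1 and is dropped, so only 1654..1777 and
1995..2434 remain.  Silently losing an exon like this is the obvious
drawback of truncation compared with the RemoteFeature approach.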

     Thomas.