[Bioperl-l] Problem with parsing ENSEMBL genbank flat file with genbank2gff3.pls

Tue Jan 18 04:16:29 EST 2005

On Mon, 17 Jan 2005, Chris Mungall wrote:

> 
> It is a genbank formatted file - you can download it from the url Vladimir
> provides below.
> 
> There seem to be a few oddities to do with the ensembl-flavour genbank
> format which may be causing problems for the unflattener:
> 
> * There doesn't appear to be any 'gene' features - a gene model is just
> mRNAs and CDSs. This means the files don't even contain essential stuff
> like the gene symbol!

The symbols are on the mRNA and CDS (in fact most identifiers map to the
mRNA and CDS). Each mRNA and CDS has the ENSG identifier in there. We
could of course put in a Gene line as well, and I can flag this up to the
guys. We should do this as it is easy enough to do.
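
To illustrate the point: since each mRNA and CDS carries the ENSG identifier, a consumer could rebuild a gene-level feature by grouping on it. The following is only a minimal Python sketch of that idea, not what the unflattener does. The tuple layout, the coordinates and the second identifier are made up for illustration.

```python
from collections import defaultdict

# Hypothetical flattened features: (type, start, end, strand, gene_id).
# Coordinates and the second identifier are invented for this example.
features = [
    ("mRNA", 100, 900, 1, "ENSG00000146556"),
    ("CDS", 150, 850, 1, "ENSG00000146556"),
    ("mRNA", 2000, 2600, -1, "ENSG_EXAMPLE_2"),
    ("CDS", 2100, 2500, -1, "ENSG_EXAMPLE_2"),
]

def synthesize_genes(feats):
    """Group features by gene ID and return one spanning 'gene' per group."""
    groups = defaultdict(list)
    for feat in feats:
        groups[feat[4]].append(feat)
    genes = []
    for gene_id, members in groups.items():
        start = min(f[1] for f in members)
        end = max(f[2] for f in members)
        strands = {f[3] for f in members}
        # A group with mixed strands (e.g. trans-splicing) gets strand 0.
        strand = strands.pop() if len(strands) == 1 else 0
        genes.append(("gene", start, end, strand, gene_id))
    # Order the synthesized genes by start coordinate.
    return sorted(genes, key=lambda g: g[1])

genes = synthesize_genes(features)
```

Here `genes` holds one spanning feature per identifier, which is roughly the information an explicit gene line would carry.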

However Chris, as you imply, we don't consider our EMBL or GenBank flat
files somehow definitive - the Mart tool allows highly flexible
downloading of gene structure (GTF) and other things and if we do
implement a GFF3 dumper it is likely to be via the Mart tool again.

Underneath this, the database and the Perl and Java APIs allow nearly any
sort of information to be yanked out, and the database is directly
accessible over the internet at ensembldb.ensembl.org.

   --> I'll ask the guys here to put in a gene line - Chris - what 
precisely do you need in the format to tickle your unflattener right?

   --> GFF3 direct dumping is on the 2005 todo list, but not at the top at the
moment.

> 
> * In the feature entry, for the reverse strand, ensembl nests the
> complement function inside the join function, listing sublocations in a
> 3'->5' direction. This is unusual, but not problematic in itself.
> However, I'm not 100% convinced that the bioperl genbank parser handles
> these cases correctly - I will expand on this in another email. It's not
> a problem for the vast majority of cases, but it will be problematic for
> certain rare situations where the sublocations are of mixed strand (eg
> trans-spliced genes).
> 
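
For reference, the two nestings can be compared with a small sketch. This is a minimal Python illustration, not the bioperl parser: it assumes that complement() flips the strand and reverses the order of whatever it wraps, and it handles only plain `start..end` ranges (no fuzzy ends, single bases or remote references). The function names are hypothetical.

```python
import re

def _split_top_level(s):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts

def _parse(s, strand):
    if s.startswith("complement(") and s.endswith(")"):
        # Flip the strand and reverse the order of the wrapped ranges.
        inner = _parse(s[len("complement("):-1], -strand)
        return inner[::-1]
    if s.startswith("join(") and s.endswith(")"):
        out = []
        for part in _split_top_level(s[len("join("):-1]):
            out.extend(_parse(part, strand))
        return out
    m = re.fullmatch(r"(\d+)\.\.(\d+)", s)
    if m is None:
        raise ValueError("unsupported location: " + s)
    return [(int(m.group(1)), int(m.group(2)), strand)]

def parse_location(loc):
    """Return (start, end, strand) tuples in 5'->3' order of the feature."""
    return _parse(loc.replace(" ", ""), 1)

outer = parse_location("complement(join(1..10,20..30))")
inner = parse_location("join(complement(20..30),complement(1..10))")
mixed = parse_location("join(complement(20..30),40..50)")
```

Under these assumptions `outer` and `inner` come out identical, so the nesting alone loses nothing. Only the inner form can express the mixed-strand case in `mixed`, which is where a parser that assumes one strand per feature would go wrong.
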
> I can implement a hack in the unflattener for the first problem. However,
> the question is - is it worth it? Without the gene feature the
> ensembl-flavoured genbank files seem not particularly useful (granted it
> is possible to get the gene data by integrating with LocusLink/EntrezGene
> but is it worth it?). I know for a fact that the data structures
> underlying ensembl are sound, so it seems counterproductive to use nothing
> but genbank/embl as a flat file distribution format (and to drop the gene
> features on top of that!). I know ensembl use GTF a lot internally, it
> would be great to see use made of this format (or even better, GFF3) for
> data distribution. Perhaps there's something I'm missing here.. I'll wait
> for comment from someone from ensembl before progressing here, to avoid
> any pointless work...
> 
> Cheers
> Chris
> 
> On Mon, 17 Jan 2005, Scott Cain wrote:
> 
> > Hi Vladimir,
> >
> > Not to ask a question on the level of "is it plugged in", but are you sure
> > it is a genbank formatted file?  I think you would get a different error
> > if it weren't, but I just wanted to make sure.
> >
> > Scott
> >
> > ----------------------------------------------------------------------
> > Scott Cain, Ph. D.				 	 cain at cshl.org
> > GMOD Coordinator, http://www.gmod.org/			 (216)392-3087
> > ----------------------------------------------------------------------
> >
> >
> > On Mon, 17 Jan 2005, Chris Mungall wrote:
> >
> > >
> > > Hi Vladimir
> > >
> > > The genbank2gff3 script, in scripts/Bio-DB-GFF, is attempting to recover
> > > information which the genbank flat file format often loses; this is the
> > > information about which mRNA relates to which CDS. You may or may not need
> > > this information, it depends why you are doing the conversion. If you
> > > don't need this, you may want just a straightforward genbank->gff
> > > conversion. Let me know if this is what you want to do and I can help with
> > > that.
> > >
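
As a rough picture of what a straightforward conversion amounts to: each feature becomes one tab-separated, nine-column GFF3 line, with no attempt to link mRNAs to CDSs. The Python sketch below is only an illustration and is not how genbank2gff3 works. The helper name, the "ensembl" source value, the coordinates and the attribute choices are assumptions.

```python
from urllib.parse import quote

def gff3_line(seqid, source, ftype, start, end, strand, attributes, phase="."):
    """Format one feature as a GFF3 line (hypothetical helper)."""
    strand_char = {1: "+", -1: "-"}.get(strand, ".")
    # Percent-encode characters that are reserved in column 9 (; = & ,).
    attr = ";".join(
        f"{key}={quote(str(value), safe=' :/')}"
        for key, value in attributes.items()
    )
    columns = [seqid, source, ftype, str(start), str(end),
               ".", strand_char, str(phase), attr or "."]
    return "\t".join(columns)

# Coordinates here are invented; the seqid and gene ID come from the report.
line = gff3_line(
    "chromosome:NCBI35:1:1:994676:1", "ensembl", "CDS",
    150, 850, 1, {"ID": "cds1", "Name": "ENSG00000146556"}, phase=0,
)
```

A flat conversion like this keeps every feature but leaves out the Parent attributes that would tie a CDS to its mRNA, which is the information the unflattener tries to recover.
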
> > > If you _do_ wish to preserve the mRNA to CDS mappings, be aware that it
> > > isn't always possible to recover these with 100% fidelity from the genbank
> > > flat files. You may wish to pursue alternate approaches, such as
> > > downloading ensembl as a mysql dump (any ensembl folks around.. any plans
> > > to offer downloads in alternate formats such as gff3? This would be
> > > fantastic)
> > >
> > > If you'd prefer to carry on via the genbank flat file route, here's what
> > > you should do:
> > >
> > > * get the latest version of genbank2gff3.PLS I have just checked into cvs
> > > (I can send you a copy if you are using a bioperl release and not cvs)
> > >
> > > * run the script with the "--ethresh 3" option. This will raise the error
> > > severity threshold at which problems with the genbank file become
> > > showstoppers.
> > >
> > > In addition, I will take a look at this particular file and see what it is
> > > that is causing problems and get back to you.
> > >
> > > Cheers
> > > Chris
> > >
> > > On Mon, 17 Jan 2005, Babenko, Vladimir (NIH/NLM/NCBI) wrote:
> > >
> > > >     Greetings,
> > > > While parsing a genbank file taken from:
> > > > ftp://ftp.ensembl.org/pub/current_human/data/flatfiles/genbank/Homo_sapiens.0.dat
> > > > as of Jan 2005,
> > > > I'm getting the following unflattening error:
> > > > --------------------------------------------------------
> > > > Processing file /ENSEMBL/Homo_sapiens.0.dat...
> > > > working on contig
> > > > chromosome:NCBI35:1:1:994676:1...chromosome:NCBI35:1:1:994676:1 Unflattening
> > > > error:
> > > > Details:
> > > > ------------- EXCEPTION  -------------
> > > > MSG: PROBLEM, SEVERITY==2
> > > > no containers possible for SeqFeature of type: CDS; this SF is being placed
> > > > at root level
> > > > SF [Bio::SeqFeature::Generic=HASH(0x86485d8)]: CDS; ENSG00000146556
> > > >
> > > > STACK Bio::SeqFeature::Tools::Unflattener::problem
> > > > /Bio/SeqFeature/Tools/Unflattener.pm:940
> > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_group
> > > > /Bio/SeqFeature/Tools/Unflattener.pm:1983
> > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_groups
> > > > /Bio/SeqFeature/Tools/Unflattener.pm:1744
> > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq
> > > > /Bio/SeqFeature/Tools/Unflattener.pm:1449
> > > > STACK (eval) genbank2gff3.PLS:345
> > > > STACK main::unflatten_seq genbank2gff3.PLS:344
> > > > STACK toplevel genbank2gff3.PLS:209
> > > >
> > > > --------------------------------------
> > > >
> > > > Possible gene unflattening error withchromosome:NCBI35:1:1:994676:1: consult
> > > > STDERR
> > > >
> > > > Using bioperl-1.5.0.RC2 under Linux.
> > > >
> > > >     Would be grateful for the hint,
> > > >       Vladimir
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l at portal.open-bio.org
> > > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > >
> > >
> >
> >
> 

-----------------------------------------------------------------
Ewan Birney.  Work:  +44 1223 494420
             Email:  birney "at" ebi.ac.uk 
Clerical Assistant:  shelley "at" ebi.ac.uk
Please cc shelley for urgent or diary-dependent requests
-----------------------------------------------------------------