[Bioperl-l] Problem with parsing ENSEMBL genbank flat file with genbank2gff3.pls

Chris Mungall cjm at fruitfly.org
Mon Jan 17 21:33:31 EST 2005


It is a genbank formatted file - you can download it from the url Vladimir
provides below.

There seem to be a few oddities to do with the ensembl-flavour genbank
format which may be causing problems for the unflattener:

* There don't appear to be any 'gene' features - a gene model is just
mRNAs and CDSs. This means the files don't even contain essential stuff
like the gene symbol!

* In the feature entry, for the reverse strand, ensembl nests the
complement function inside the join function, listing sublocations in a
3'->5' direction. This is unusual, but not problematic in itself.
However, I'm not 100% convinced that the bioperl genbank parser handles
these cases correctly - I will expand on this in another email. It's not
a problem for the vast majority of cases, but it will be problematic for
certain rare situations where the sublocations are of mixed strand (e.g.
trans-spliced genes).
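
To make the two location layouts concrete, here is a minimal Python sketch.
It is not BioPerl code and says nothing about how the bioperl genbank parser
actually behaves. It flattens a simple location string into
(start, end, strand) triples so the two nestings can be compared. The
function names and the coordinates are made up for illustration, and only
plain `start..end`, `join(...)` and `complement(...)` are handled.

```python
import re

_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def split_top_level(s):
    """Split on commas that are not inside parentheses."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts


def parse_location(loc, strand=1):
    """Return [(start, end, strand), ...] for a simple location string."""
    loc = loc.strip()
    if loc.startswith("join(") and loc.endswith(")"):
        out = []
        for part in split_top_level(loc[len("join("):-1]):
            out.extend(parse_location(part, strand))
        return out
    if loc.startswith("complement(") and loc.endswith(")"):
        inner = parse_location(loc[len("complement("):-1], -strand)
        # Complementing a joined region flips the strand and reverses
        # the order of the sublocations.
        return inner[::-1]
    m = _RANGE.fullmatch(loc)
    if not m:
        raise ValueError(f"unsupported location: {loc!r}")
    return [(int(m.group(1)), int(m.group(2)), strand)]


def has_mixed_strands(sublocations):
    """True if the sublocations do not all lie on the same strand."""
    return len({strand for _, _, strand in sublocations}) > 1


# complement wrapped around join
outer = parse_location("complement(join(2691..4571,4918..5163))")
# complement nested inside join, the layout described above
nested = parse_location("join(complement(4918..5163),complement(2691..4571))")
# a mixed-strand location, which can only be written with nested complements
mixed = parse_location("join(complement(10..20),30..40)")
```

In this sketch `outer` and `nested` come out as the same list of triples, so
for single-strand features the two layouts carry the same information. Only
the nested form can express something like `mixed`, which is where a parser
that assumes one strand per feature would go wrong.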

I can implement a hack in the unflattener for the first problem. However,
the question is whether it is worth it. Without the gene feature the
ensembl-flavoured genbank files don't seem particularly useful (granted, it
is possible to get the gene data by integrating with LocusLink/EntrezGene,
but is that worth the effort?). I know for a fact that the data structures
underlying ensembl are sound, so it seems counterproductive to use nothing
but genbank/embl as a flat file distribution format (and to drop the gene
features on top of that!). I know ensembl use GTF a lot internally; it
would be great to see use made of this format (or even better, GFF3) for
data distribution. Perhaps there's something I'm missing here. I'll wait
for comment from someone from ensembl before progressing, to avoid
any pointless work.
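
For anyone wondering what such a hack might amount to, here is a rough
Python sketch. It is not the unflattener's code and not a committed design.
It assumes every mRNA and CDS carries some gene identifier (the error report
quoted below shows a CDS labelled ENSG00000146556, but where that identifier
lives in the record is an assumption here), and it simply synthesizes one
'gene' feature per identifier spanning all of its mRNAs and CDSs. The
`Feature` type, `infer_genes`, and the GENE_A/GENE_B identifiers are all
hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    type: str      # e.g. "mRNA", "CDS", "gene"
    start: int
    end: int
    strand: int    # +1 or -1
    gene_id: str   # assumed to be present on every mRNA and CDS


def infer_genes(features):
    """Synthesize one 'gene' feature per (gene_id, strand) group.

    The gene span is the smallest interval covering every mRNA and CDS
    that shares the identifier. Other feature types are ignored.
    """
    spans = {}
    for f in features:
        if f.type not in ("mRNA", "CDS"):
            continue
        key = (f.gene_id, f.strand)
        lo, hi = spans.get(key, (f.start, f.end))
        spans[key] = (min(lo, f.start), max(hi, f.end))
    # Emit genes ordered by their position on the sequence.
    ordered = sorted(spans.items(), key=lambda item: item[1])
    return [Feature("gene", lo, hi, strand, gene_id)
            for (gene_id, strand), (lo, hi) in ordered]


features = [
    Feature("mRNA", 100, 900, 1, "GENE_A"),
    Feature("CDS", 150, 800, 1, "GENE_A"),
    Feature("mRNA", 2000, 3000, -1, "GENE_B"),
    Feature("CDS", 2100, 3100, -1, "GENE_B"),
]
genes = infer_genes(features)
```

A gene built this way has coordinates and an identifier but still no gene
symbol, which is the part that cannot be recovered from the flat file alone.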

Cheers
Chris

On Mon, 17 Jan 2005, Scott Cain wrote:

> Hi Vladimir,
>
> Not to ask a question on the level of "is it plugged in", but are you sure
> it is a genbank formatted file?  I think you would get a different error
> if it weren't, but I just wanted to make sure.
>
> Scott
>
> ----------------------------------------------------------------------
> Scott Cain, Ph. D.				 	 cain at cshl.org
> GMOD Coordinator, http://www.gmod.org/			 (216)392-3087
> ----------------------------------------------------------------------
>
>
> On Mon, 17 Jan 2005, Chris Mungall wrote:
>
> >
> > Hi Vladimir
> >
> > The genbank2gff3 script, in scripts/Bio-DB-GFF, attempts to recover
> > information which the genbank flat file format often loses: the
> > information about which mRNA relates to which CDS. You may or may not need
> > this information; it depends on why you are doing the conversion. If you
> > don't need it, you may want just a straightforward genbank->gff
> > conversion. Let me know if this is what you want to do and I can help with
> > that.
> >
> > If you _do_ wish to preserve the mRNA to CDS mappings, be aware that it
> > isn't always possible to recover these with 100% fidelity from the genbank
> > flat files. You may wish to pursue alternate approaches, such as
> > downloading ensembl as a mysql dump. (Any ensembl folks around? Any plans
> > to offer downloads in alternate formats such as gff3? This would be
> > fantastic.)
> >
> > If you'd prefer to carry on via the genbank flat file route, here's what
> > you should do:
> >
> > * get the latest version of genbank2gff3.PLS, which I have just checked
> > into cvs (I can send you a copy if you are using a bioperl release and
> > not cvs)
> >
> > * run the script with the "--ethresh 3" option. This will raise the error
> > severity threshold at which problems with the genbank file become
> > showstoppers.
> >
> > In addition, I will take a look at this particular file and see what it is
> > that is causing problems and get back to you.
> >
> > Cheers
> > Chris
> >
> > On Mon, 17 Jan 2005, Babenko, Vladimir (NIH/NLM/NCBI) wrote:
> >
> > >     Greetings,
> > > While parsing a genbank file taken from:
> > > ftp://ftp.ensembl.org/pub/current_human/data/flatfiles/genbank/Homo_sapiens.0.dat
> > > as of Jan 2005,
> > > I'm getting the following unflattening error:
> > > --------------------------------------------------------
> > > Processing file /ENSEMBL/Homo_sapiens.0.dat...
> > > working on contig
> > > chromosome:NCBI35:1:1:994676:1...chromosome:NCBI35:1:1:994676:1 Unflattening
> > > error:
> > > Details:
> > > ------------- EXCEPTION  -------------
> > > MSG: PROBLEM, SEVERITY==2
> > > no containers possible for SeqFeature of type: CDS; this SF is being placed
> > > at root level
> > > SF [Bio::SeqFeature::Generic=HASH(0x86485d8)]: CDS; ENSG00000146556
> > >
> > > STACK Bio::SeqFeature::Tools::Unflattener::problem
> > > /Bio/SeqFeature/Tools/Unflattener.pm:940
> > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_group
> > > /Bio/SeqFeature/Tools/Unflattener.pm:1983
> > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_groups
> > > /Bio/SeqFeature/Tools/Unflattener.pm:1744
> > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq
> > > /Bio/SeqFeature/Tools/Unflattener.pm:1449
> > > STACK (eval) genbank2gff3.PLS:345
> > > STACK main::unflatten_seq genbank2gff3.PLS:344
> > > STACK toplevel genbank2gff3.PLS:209
> > >
> > > --------------------------------------
> > >
> > > Possible gene unflattening error withchromosome:NCBI35:1:1:994676:1: consult
> > > STDERR
> > >
> > > Using bioperl-1.5.0.RC2 under Linux.
> > >
> > >     Would be grateful for the hint,
> > >       Vladimir
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at portal.open-bio.org
> > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > >
> >
>
>

