[Bioperl-l] Problem with parsing ENSEMBL genbank flat file with genbank2gff3.pls
Chris Mungall
cjm at fruitfly.org
Tue Jan 18 11:20:15 EST 2005
OK, so it looks like EnsMart may solve Vladimir's problem by bypassing the
genbank-format files altogether.
Ewan - it'd be nice to see the GFF/GTFs appear in the main ftp download
area too at some point, as well as via dynamic EnsMart download. As for
tweaking the ensembl genbank output, I think adding a feature of type
'gene', with a single location covering the maximal extent of all its
mRNAs, as is fairly standard in genbank-format files, should do it.
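
To make the suggested 'gene' feature concrete, here is a minimal sketch in plain Python of how a single gene location could be derived from the mRNAs that share a gene identifier. The tuple layout, the identifier "ENSG_EXAMPLE", the coordinates and the function names are all hypothetical; this is not how ensembl or bioperl compute it.

```python
def gene_extents(mrnas):
    """Return {gene_id: (start, end, strand)} spanning all mRNAs of each gene.

    Each mRNA is a (gene_id, strand, exons) tuple, where exons is a list of
    1-based inclusive (start, end) pairs. This layout is an assumption made
    for illustration only.
    """
    spans = {}
    for gene_id, strand, exons in mrnas:
        lo = min(start for start, _ in exons)
        hi = max(end for _, end in exons)
        if gene_id in spans:
            old_lo, old_hi, _ = spans[gene_id]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        spans[gene_id] = (lo, hi, strand)
    return spans


def gene_location(start, end, strand):
    """Format a single-span genbank-style location string."""
    span = f"{start}..{end}"
    return f"complement({span})" if strand < 0 else span


# Two made-up transcripts of one reverse-strand gene.
mrnas = [
    ("ENSG_EXAMPLE", -1, [(500, 600), (100, 200)]),
    ("ENSG_EXAMPLE", -1, [(500, 650), (300, 350)]),
]
extents = gene_extents(mrnas)
location = gene_location(*extents["ENSG_EXAMPLE"])
```

The resulting location is one unspliced span, which is all a container feature needs in order to hold the mRNAs and CDSs that fall inside it.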
Cheers
Chris
On Tue, 18 Jan 2005, Ewan Birney wrote:
> On Mon, 17 Jan 2005, Chris Mungall wrote:
>
> >
> > It is a genbank formatted file - you can download it from the url Vladimir
> > provides below.
> >
> > There seem to be a few oddities to do with the ensembl-flavour genbank
> > format which may be causing problems for the unflattener:
> >
> > * There doesn't appear to be any 'gene' features - a gene model is just
> > mRNAs and CDSs. This means the files don't even contain essential stuff
> > like the gene symbol!
>
> The symbols are on the mRNA and CDS (in fact most identifiers map to the
> mRNA and CDS). Each mRNA and CDS has the ENSG identifier in there. We
> could of course put in a Gene line as well, and I can flag this up to the
> guys. We should do this as it is easy enough to do.
>
>
> However Chris, as you imply, we don't consider our EMBL or GenBank flat
> files somehow definitive - the Mart tool allows highly flexible
> downloading of gene structure (GTF) and other things and if we do
> implement a GFF3 dumper it is likely to be via the Mart tool again.
>
>
> Underneath this, the database and Perl and Java APIs allow nearly any sort
> of information to be yanked out, and the database is internet accessible
> directly at ensembldb.ensembl.org.
>
>
> --> I'll ask the guys here to put in a gene line - Chris - what
> precisely do you need in the format to tickle your unflattener right?
>
> --> GFF3 direct dumping is in 2005 todo list, but not at the top at the
> moment.
>
>
>
>
> >
> > * In the feature entry, for the reverse strand, ensembl nests the
> > complement function inside the join function, listing sublocations in a
> > 3'->5' direction. This is unusual, but not problematic in itself.
> > However, I'm not 100% convinced that the bioperl genbank parser handles
> > these cases correctly - I will expand on this in another email. It's not
> > a problem for the vast majority of cases, but it will be problematic for
> > certain rare situations where the sublocations are of mixed strand (eg
> > trans-spliced genes).
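
To illustrate the two nesting styles, here is a small sketch of a location-string reader in plain Python. It handles only simple `a..b` ranges plus `join` and `complement`, and the function names are made up for this example; it is not the bioperl parser. The point is that `complement(join(...))` and `join(complement(...),...)` describe the same segments for a single-strand feature, while only the second form can express mixed strands.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(text):
    """Return segments as (start, end, strand) in the order they are joined."""
    text = text.strip()
    if text.startswith("complement(") and text.endswith(")"):
        inner = parse_location(text[len("complement("):-1])
        # Complementing a joined region reverses the order and flips strands.
        return [(s, e, -strand) for s, e, strand in reversed(inner)]
    if text.startswith("join(") and text.endswith(")"):
        segments = []
        for part in split_top_level(text[len("join("):-1]):
            segments.extend(parse_location(part))
        return segments
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if not match:
        raise ValueError(f"unsupported location: {text!r}")
    return [(int(match.group(1)), int(match.group(2)), 1)]


outer = parse_location("complement(join(10..50,100..200))")
nested = parse_location("join(complement(100..200),complement(10..50))")
mixed = parse_location("join(complement(100..200),300..400)")
```

Here `outer` and `nested` come out as the same list of reverse-strand segments, whereas `mixed` carries one segment on each strand, which is the rare case where a parser that assumes a single strand per feature would go wrong.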
> >
> > I can implement a hack in the unflattener for the first problem. However,
> > the question is - is it worth it? Without the gene feature the
> > ensembl-flavoured genbank files seem not particularly useful (granted it
> > is possible to get the gene data by integrating with LocusLink/EntrezGene
> > but is it worth it?). I know for a fact that the data structures
> > underlying ensembl are sound, so it seems counterproductive to use nothing
> > but genbank/embl as a flat file distribution format (and to drop the gene
> > features on top of that!). I know ensembl use GTF a lot internally, it
> > would be great to see use made of this format (or even better, GFF3) for
> > data distribution. Perhaps there's something I'm missing here.. I'll wait
> > for comment from someone from ensembl before progressing here, to avoid
> > any pointless work...
> >
> > Cheers
> > Chris
> >
> > On Mon, 17 Jan 2005, Scott Cain wrote:
> >
> > > Hi Vladimir,
> > >
> > > Not to ask a question on the level of "is it plugged in", but are you sure
> > > it is a genbank formatted file? I think you would get a different error
> > > if it weren't, but I just wanted to make sure.
> > >
> > > Scott
> > >
> > > ----------------------------------------------------------------------
> > > Scott Cain, Ph. D. cain at cshl.org
> > > GMOD Coordinator, http://www.gmod.org/ (216)392-3087
> > > ----------------------------------------------------------------------
> > >
> > >
> > > On Mon, 17 Jan 2005, Chris Mungall wrote:
> > >
> > > >
> > > > Hi Vladimir
> > > >
> > > > The genbank2gff3 script, in scripts/Bio-DB-GFF, is attempting to recover
> > > > information which the genbank flat file format often loses: namely,
> > > > which mRNA relates to which CDS. You may or may not need
> > > > this information; it depends on why you are doing the conversion. If you
> > > > don't need this, you may want just a straightforward genbank->gff
> > > > conversion. Let me know if this is what you want to do and I can help with
> > > > that.
> > > >
> > > > If you _do_ wish to preserve the mRNA to CDS mappings, be aware that it
> > > > isn't always possible to recover these with 100% fidelity from the genbank
> > > > flat files. You may wish to pursue alternate approaches, such as
> > > > downloading ensembl as a mysql dump (any ensembl folks around.. any plans
> > > > to offer downloads in alternate formats such as gff3? This would be
> > > > fantastic)
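
As an illustration of why the mRNA to CDS pairing cannot always be recovered, here is one plausible coordinate-based heuristic sketched in plain Python. It is an assumption for the sake of example, not necessarily what the Unflattener does: a CDS is treated as compatible with an mRNA when each CDS segment lies inside consecutive exons and the internal splice boundaries coincide. The feature names and coordinates are made up.

```python
def cds_fits_mrna(cds, mrna):
    """True if the CDS segments sit in consecutive exons with matching junctions."""
    cds, mrna = sorted(cds), sorted(mrna)
    previous = None
    for i, (c_start, c_end) in enumerate(cds):
        hits = [j for j, (s, e) in enumerate(mrna) if s <= c_start and c_end <= e]
        if not hits:
            return False
        j = hits[0]
        exon_start, exon_end = mrna[j]
        # Every CDS segment after the first must start exactly at an exon
        # start, in the exon directly following the previous one.
        if i > 0 and (c_start != exon_start or j != previous + 1):
            return False
        # Every CDS segment before the last must end exactly at an exon end.
        if i < len(cds) - 1 and c_end != exon_end:
            return False
        previous = j
    return True


def candidate_mrnas(cds, mrnas):
    """Names of all mRNAs that could contain the given CDS."""
    return [name for name, exons in mrnas.items() if cds_fits_mrna(cds, exons)]


mrnas = {
    "mRNA_a": [(100, 200), (300, 400), (500, 600)],
    "mRNA_b": [(100, 200), (500, 600)],
}
unique = candidate_mrnas([(150, 200), (300, 400), (500, 550)], mrnas)
ambiguous = candidate_mrnas([(120, 180)], mrnas)
orphan = candidate_mrnas([(10, 50)], mrnas)
```

A CDS confined to an exon shared by both transcripts fits either of them, so coordinates alone cannot decide the pairing, and a CDS that fits no mRNA at all has no candidate container.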
> > > >
> > > > If you'd prefer to carry on via the genbank flat file route, here's what
> > > > you should do:
> > > >
> > > > * get the latest version of genbank2gff3.PLS, which I have just checked
> > > > into cvs (I can send you a copy if you are using a bioperl release and
> > > > not cvs)
> > > >
> > > > * run the script with the "--ethresh 3" option. This will raise the error
> > > > severity threshold at which problems with the genbank file become
> > > > showstoppers.
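
The idea behind a severity threshold can be sketched generically in plain Python. The class name, the default of 0 and the strictly-greater comparison are assumptions made for illustration and are not copied from the Unflattener source: problems at or below the threshold are only recorded, and anything above it aborts the run.

```python
class ProblemLog:
    """Collects problems and raises only when severity exceeds a threshold."""

    def __init__(self, error_threshold=0):
        # Assumed default: every problem with severity above 0 is fatal.
        self.error_threshold = error_threshold
        self.problems = []

    def problem(self, severity, message):
        self.problems.append((severity, message))
        # Assumed rule: strictly greater than the threshold is fatal.
        if severity > self.error_threshold:
            raise RuntimeError(f"PROBLEM, SEVERITY=={severity}: {message}")


message = "no containers possible for SeqFeature of type: CDS"

# With a raised threshold, a severity-2 problem is logged and the run continues.
lenient = ProblemLog(error_threshold=3)
lenient.problem(2, message)

# With the assumed default threshold, the same problem stops the run.
strict = ProblemLog()
try:
    strict.problem(2, message)
    raised = False
except RuntimeError:
    raised = True
```

Under these assumptions a severity-2 problem like the one in the report below would be recorded rather than thrown once the threshold is set to 3.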
> > > >
> > > > In addition, I will take a look at this particular file and see what it is
> > > > that is causing problems and get back to you.
> > > >
> > > > Cheers
> > > > Chris
> > > >
> > > > On Mon, 17 Jan 2005, Babenko, Vladimir (NIH/NLM/NCBI) wrote:
> > > >
> > > > > Greetings,
> > > > > While parsing a genbank file taken from:
> > > > > > ftp://ftp.ensembl.org/pub/current_human/data/flatfiles/genbank/Homo_sapiens.0.dat
> > > > > > as of Jan 2005,
> > > > > I'm getting the following unflattening error:
> > > > > --------------------------------------------------------
> > > > > Processing file /ENSEMBL/Homo_sapiens.0.dat...
> > > > > working on contig
> > > > > chromosome:NCBI35:1:1:994676:1...chromosome:NCBI35:1:1:994676:1 Unflattening
> > > > > error:
> > > > > Details:
> > > > > ------------- EXCEPTION -------------
> > > > > MSG: PROBLEM, SEVERITY==2
> > > > > no containers possible for SeqFeature of type: CDS; this SF is being placed
> > > > > at root level
> > > > > SF [Bio::SeqFeature::Generic=HASH(0x86485d8)]: CDS; ENSG00000146556
> > > > >
> > > > > STACK Bio::SeqFeature::Tools::Unflattener::problem
> > > > > /Bio/SeqFeature/Tools/Unflattener.pm:940
> > > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_group
> > > > > /Bio/SeqFeature/Tools/Unflattener.pm:1983
> > > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_groups
> > > > > /Bio/SeqFeature/Tools/Unflattener.pm:1744
> > > > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq
> > > > > /Bio/SeqFeature/Tools/Unflattener.pm:1449
> > > > > STACK (eval) genbank2gff3.PLS:345
> > > > > STACK main::unflatten_seq genbank2gff3.PLS:344
> > > > > STACK toplevel genbank2gff3.PLS:209
> > > > >
> > > > > --------------------------------------
> > > > >
> > > > > Possible gene unflattening error withchromosome:NCBI35:1:1:994676:1: consult
> > > > > STDERR
> > > > >
> > > > > Using bioperl-1.5.0.RC2 under Linux.
> > > > >
> > > > > Would be grateful for the hint,
> > > > > Vladimir
> > > > > _______________________________________________
> > > > > Bioperl-l mailing list
> > > > > Bioperl-l at portal.open-bio.org
> > > > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > > > >
> > > >
> > >
> > >
> >
>
> -----------------------------------------------------------------
> Ewan Birney. Work: +44 1223 494420
> Email: birney "at" ebi.ac.uk
> Clerical Assistant: shelley "at" ebi.ac.uk
> Please cc shelley for urgent or diary-dependent requests
> -----------------------------------------------------------------
>
>