[Bioperl-l] Problem with parsing ENSEMBL genbank flat file with genbank2gff3.pls

Jason Stajich jason.stajich at duke.edu
Mon Jan 17 21:21:57 EST 2005


I have been using EnsMart to grab GFF2/GTF (or GFF-like) output and
reformatting it for GFF3 with reasonable success.  You probably want
just the output columns, so you can reformat things to have the CDS
start/end and the Gene, Exon->Transcript->Peptide identifiers all in
the same report.

This is a lot easier than parsing genbank flatfiles, and it is the whole
point of EnsMart.
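
As a rough sketch of that reformatting step (not the script I actually
use): assuming a tab-delimited export with a header row, each row could be
turned into a GFF3 CDS line along these lines. The column names, the
"ensembl" source tag, and the choice of attributes are all assumptions
made for illustration; adjust them to whatever your export contains.

```python
import csv
import io

# Hypothetical column names; a real EnsMart export may label these differently.
COLUMNS = ("chromosome", "gene_id", "transcript_id", "peptide_id",
           "exon_id", "cds_start", "cds_end", "strand")


def row_to_gff3(row, source="ensembl"):
    """Format one export row (a dict keyed by COLUMNS) as a GFF3 CDS line."""
    # Assumes strand is reported as 1/-1 (or +/-); anything not positive maps to "-".
    strand = "+" if str(row["strand"]).strip() in ("1", "+1", "+") else "-"
    # GFF3 requires start <= end regardless of strand.
    start, end = sorted((int(row["cds_start"]), int(row["cds_end"])))
    attributes = ";".join([
        "ID=" + row["peptide_id"],
        "Parent=" + row["transcript_id"],
        "gene_id=" + row["gene_id"],
        "exon_id=" + row["exon_id"],
    ])
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    # GFF3 expects a real phase (0/1/2) for CDS lines; this assumed export has
    # none, so "." is only a placeholder here.
    return "\t".join([row["chromosome"], source, "CDS", str(start), str(end),
                      ".", strand, ".", attributes])


def convert(tsv_text):
    """Convert a tab-delimited export (with header row) into GFF3 text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = ["##gff-version 3"]
    for row in reader:
        # Skip rows without CDS coordinates (e.g. purely non-coding exons).
        if not row["cds_start"] or not row["cds_end"]:
            continue
        lines.append(row_to_gff3(row))
    return "\n".join(lines) + "\n"
```

Reading the export into a string and printing convert(text) would then
give one CDS line per coding exon, with the gene, transcript, peptide and
exon identifiers all carried in the attributes column.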

-jason
On Jan 17, 2005, at 2:51 PM, Chris Mungall wrote:

>
> Hi Vladimir
>
> The genbank2gff3 script, in scripts/Bio-DB-GFF, is attempting to recover
> information which the genbank flat file format often loses; this is the
> information about which mRNA relates to which CDS. You may or may not
> need this information; it depends on why you are doing the conversion.
> If you don't need it, you may want just a straightforward genbank->gff
> conversion. Let me know if this is what you want to do and I can help
> with that.
>
> If you _do_ wish to preserve the mRNA to CDS mappings, be aware that it
> isn't always possible to recover these with 100% fidelity from the
> genbank flat files. You may wish to pursue alternate approaches, such as
> downloading ensembl as a mysql dump. (Any ensembl folks around? Any
> plans to offer downloads in alternate formats such as gff3? This would
> be fantastic.)
>
> If you'd prefer to carry on via the genbank flat file route, here's
> what you should do:
>
> * get the latest version of genbank2gff3.PLS, which I have just checked
> into cvs (I can send you a copy if you are using a bioperl release and
> not cvs)
>
> * run the script with the "--ethresh 3" option. This will raise the
> error severity threshold at which problems with the genbank file become
> showstoppers.
>
> In addition, I will take a look at this particular file, see what it is
> that is causing problems, and get back to you.
>
> Cheers
> Chris
>
> On Mon, 17 Jan 2005, Babenko, Vladimir (NIH/NLM/NCBI) wrote:
>
>>     Greetings,
>> While parsing a genbank file taken from
>> ftp://ftp.ensembl.org/pub/current_human/data/flatfiles/genbank/Homo_sapiens.0.dat
>> as of Jan 2005, I'm getting the following unflattening error:
>> --------------------------------------------------------
>> Processing file /ENSEMBL/Homo_sapiens.0.dat...
>> working on contig
>> chromosome:NCBI35:1:1:994676:1...chromosome:NCBI35:1:1:994676:1
>> Unflattening error:
>> Details:
>> ------------- EXCEPTION  -------------
>> MSG: PROBLEM, SEVERITY==2
>> no containers possible for SeqFeature of type: CDS; this SF is being
>> placed at root level
>> SF [Bio::SeqFeature::Generic=HASH(0x86485d8)]: CDS; ENSG00000146556
>>
>> STACK Bio::SeqFeature::Tools::Unflattener::problem
>> /Bio/SeqFeature/Tools/Unflattener.pm:940
>> STACK Bio::SeqFeature::Tools::Unflattener::unflatten_group
>> /Bio/SeqFeature/Tools/Unflattener.pm:1983
>> STACK Bio::SeqFeature::Tools::Unflattener::unflatten_groups
>> /Bio/SeqFeature/Tools/Unflattener.pm:1744
>> STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq
>> /Bio/SeqFeature/Tools/Unflattener.pm:1449
>> STACK (eval) genbank2gff3.PLS:345
>> STACK main::unflatten_seq genbank2gff3.PLS:344
>> STACK toplevel genbank2gff3.PLS:209
>>
>> --------------------------------------
>>
>> Possible gene unflattening error withchromosome:NCBI35:1:1:994676:1:  
>> consult
>> STDERR
>>
>> Using bioperl-1.5.0.RC2 under Linux.
>>
>>     Would be grateful for a hint,
>>       Vladimir
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at portal.open-bio.org
>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/


