[Bioperl-l] Re: Problem with Unflattener

Chris Mungall cjm at fruitfly.org
Tue Dec 9 21:52:17 EST 2003


Oh, not quite. Sorry, not an expert on the various pre-gff3 flavours

But I imagine you could take the ensembl GFF (which is of 'GTF' flavour?)
OR an actual ensembl mysql database, both of which sensibly preserve
gene/transcript/protein/exon nesting structure, and turn that into GFF3
without too much difficulty. Ok, you start losing yourself in a maze of
converters, but the advantage is you never have to go through that lossy
step to genbank format where you lose the nesting structure. There are
some situations where it could be simply impossible for the Unflattener to
recover some of this lost structure.

On Tue, 9 Dec 2003, Scott Cain wrote:

> Thanks.
>
> Re: GFF from ensembl: You can get it as GFF3?  Could you send me a link
> if so.  (You can tell I'm a little incredulous.)
>
> Scott
>
> On Tue, 2003-12-09 at 21:19, Chris Mungall wrote:
> > Hi Scott
> >
> > Bug squashed, do a cvs update and it should work
> >
> > The problem was that this record uses /locus_tag instead of /gene - the
> > unflattener should be able to detect this in magic mode, but there was one
> > place where "/gene" was hardcoded.
> >
> > By the way, for this particular record you can get the exact same data
> > from ensembl, already unflattened (or rather, never flattened into genbank
> > format in the first place). Nevertheless, this sort of thing is extremely
> > useful for testing Unflattener.pm, so carry on testing! Really I should do
> > a full QC by comparing ensembl sourced GFF and the results of
> > ensembl->genbank->unflattener->gff, but I haven't got round to this yet.
> >
> > Cheers
> > Chris
> >
> > On Tue, 9 Dec 2003, Scott Cain wrote:
> >
> > > Hello Chris,
> > >
> > > I am using Unflattener to create a genbank2gff script that is more
> > > robust than what we have now.  As one of my example Genbank files, I am
> > > using an A. gambiae chromosome:
> > >
> > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=31249389&dopt=GenBank&term=NW_045730&qty=1
> > >
> > > When I try to run the simplified script below, I get the following
> > > error:
> > >
> > > ------------- EXCEPTION  -------------
> > > MSG: structure_type 2 is currently unknown
> > > STACK Bio::SeqFeature::Tools::Unflattener::unflatten_seq /usr/local/lib/perl5/site_perl/5.8.1/Bio/SeqFeature/Tools/Unflattener.pm:1345
> > > STACK toplevel ./simple.pl:19
> > >
> > > --------------------------------------
> > >
> > > As I read Unflattener, structure_type should only be set if I set it
> > > explicitly, right?  So how is it getting set here, and how do I make it
> > > stop?
> > >
> > > Here's the script:
> > > #!/usr/bin/perl -w
> > > use strict;
> > > use Bio::SeqIO;
> > > use Bio::SeqFeature::Tools::Unflattener;
> > >
> > > my $unflattener = Bio::SeqFeature::Tools::Unflattener->new;
> > >
> > > my $seqio = Bio::SeqIO->new(
> > >     -file   => 'NW_045730.1.gbk',
> > >     -format => 'GenBank'
> > > );
> > >
> > > open OUT, '>out.gff';
> > >
> > > while ( my $seq = $seqio->next_seq() ) {
> > >     my $acc = $seq->accession;
> > >
> > >     # get top level unflattended SeqFeatureI objects
> > >     my @sfs = $unflattener->unflatten_seq(
> > >         -seq       => $seq,
> > >         -use_magic => 1
> > >     );
> > >
> > >     foreach my $sf (@sfs) {
> > >         my $gffio =
> > >           $sf->gff_format( Bio::Tools::GFF->new( -gff_version => 3 ) );
> > >
> > >         $sf->seq_id($acc);
> > >
> > >         if ( $sf->primary_tag() eq 'source' ) {
> > >             $sf->add_tag_value( 'ID', $acc );
> > >             $sf->primary_tag('region');
> > >         }
> > >         print OUT $sf->gff_string . "\n";
> > >     }
> > > }
> > > close OUT;
> > > ---------------------------
> > >
> > > Thanks,
> > > Scott
> > >
> > >
> >
>



More information about the Bioperl-l mailing list