[Bioperl-l] Re: Changes to GFF 2.5 "unflattening" code

Fri Dec 12 18:58:23 EST 2003

Nice one Scott!

I imagine this script would be very useful to plenty of non-GMOD/chado
folks. Is there anything chado- or GMOD-specific about it? Can we add it to
bioperl instead of GMOD? (IMHO there are far too few scripts in bioperl,
which is fine for the hardcore object-heads who'll roll their own in a
few minutes, but not so great for new users.)

What do you think of rolling some of the logic up from the script into
bioperl modules? For example, the typemapping stuff could go into
Bio::SeqFeature::Tools::TypeMapper, which already has a method for mapping
to the Sequence Ontology.

Mapping of the SeqFeature nesting hierarchy to GFF ID/Parent tags could
also take place in FeatureHolderI, as discussed on this list the other
week.

By the way, what are you doing for parent features that don't have a
natural ID? Are you creating artificial surrogate IDs?
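
To make the surrogate-ID question concrete, here is a minimal sketch of one
way a nested feature hierarchy could be flattened to GFF3 ID/Parent tags.
It is not what genbank2gff3.pl or any bioperl module does: the features are
plain hashes rather than SeqFeature objects, and the field names and the
helpers feature_id() and to_gff3() are made up for illustration. A parent
with no natural ID gets a surrogate built from its type and a counter.

```perl
use strict;
use warnings;

# Hypothetical feature records: plain hashes, not bioperl objects.
# 'id' is optional; 'sub' holds the child features.
my %counter;    # per-type counter used to mint surrogate IDs

# Return the feature's own ID, or mint a surrogate such as "mRNA_1".
sub feature_id {
    my ($feat) = @_;
    $feat->{id} = $feat->{type} . '_' . ++$counter{ $feat->{type} }
        unless defined $feat->{id};
    return $feat->{id};
}

# Walk the hierarchy depth-first, returning one GFF3 line per feature.
sub to_gff3 {
    my ($feat, $parent_id) = @_;
    my @kids = @{ $feat->{sub} || [] };
    my @attrs;
    # A feature needs an ID only if it has one already or has children.
    push @attrs, 'ID=' . feature_id($feat) if @kids || defined $feat->{id};
    push @attrs, "Parent=$parent_id" if defined $parent_id;
    my $line = join "\t",
        $feat->{seq_id}, '.', $feat->{type}, $feat->{start}, $feat->{end},
        '.', $feat->{strand}, '.', (join(';', @attrs) || '.');
    return ($line, map { to_gff3($_, $feat->{id}) } @kids);
}

# Example: a gene with a natural ID, an mRNA without one, and two exons.
my %common = (seq_id => 'chr1', strand => '+');
my $gene = {
    %common, type => 'gene', start => 100, end => 900, id => 'geneA',
    sub => [
        { %common, type => 'mRNA', start => 100, end => 900,
          sub => [
              { %common, type => 'exon', start => 100, end => 300 },
              { %common, type => 'exon', start => 600, end => 900 },
          ] },
    ],
};

my @lines = to_gff3($gene);
print "$_\n" for @lines;
```

In this sketch the mRNA comes out with ID=mRNA_1;Parent=geneA and both exons
with Parent=mRNA_1. A per-type counter is only unique within one run, so a
real converter would probably need something more stable.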

With this logic in bioperl modules, we could easily roll out
genbank2chadoxml, genbank2ensembl, genbank2game, genbank2das,
genbank2biosql and FASTA file generators like genbank2intron_fasta,
genbank2spliced_utr_fasta, genbank2exon_fasta, genbank2intergenic_fasta,
genbank2my_favourite_SO_type_fasta and so on. I think this is the sort of
thing people are very often after when they start downloading and
wrestling with the bioperl object model.

By the way, we often say genbank when what we really mean is
genbank/embl(/ddbj?). Is there a handy, catchy short name for this
collective, or shall we carry on just using the term genbank to denote the
collection of genbank-like formats?

This is all incredibly useful stuff in my opinion. For ages we've been
able to say "we have a parser for format X" in bioperl, but really it's
still been a semantic quagmire; the parsing is just the first step.

Cheers
Chris

On Fri, 12 Dec 2003, Scott Cain wrote:

> Lincoln and Sheldon,
>
> For your information, I wrote a new genbank2gff3.pl script for use with
> the pending GMOD release.  I anticipate that it will form the foundation
> for rewriting the biofetch adaptor.  It uses Unflattener.pm and seems to
> work for the organisms I tested (human, worm, fly, mosquito, and
> E. coli).  It is in the GMOD cvs in the schema repository at
> schema/chado/load/bin/genbank2gff.PLS.
>
> Scott
>
> On Fri, 2003-12-12 at 10:56, bioperl-l-request at portal.open-bio.org
> wrote:
> > Hi Mark, Sheldon,
> >
> > I saw your change to the _parse_gff2_group code in Bio::DB::GFF, which
> > prioritizes "gene", "locus_tag" and "transcript" as group fields in
> > the column 9 attributes.  I like it, but unfortunately it breaks some
> > other code that I have, including the GMOD tutorial.
> >
> > I think you'll like what I've done instead.  I've added a
> > preferred_groups() method to which you pass a list of group names.
> > Then, this list will be used as the priority list to pluck out groups
> > from the GFF2 attribute list.  To get your previous behavior, you need
> > to do this:
> >
> >  $db = Bio::DB::GFF->new(-preferred_groups => ['gene','locus_tag','transcript'],
> >                          @other_args);
> >  $db->load_gff(...);
> >
> > or this
> >
> >  $db = Bio::DB::GFF->new(@other_args);
> >  $db->preferred_groups('gene','locus_tag','transcript');
> >  $db->load_gff(...);
> >
> > You'll have to change your existing scripts accordingly.  Sure, this
> > should be merged with Chris's unflattener, but then again let's just
> > get to GFF3 as quickly as we possibly can and leave this nightmare
> > behind us!
> >
> > Lincoln
>