[Biopython-dev] Bio.GFF and Brad's code

Wed Apr 8 12:49:08 UTC 2009

Hi Michiel;

> Thanks for your work on the GFF parser; I'm dealing with GFF files
> quite a lot. Could you maybe give a simple example of how to use your
> GFF parser, once it's included into Biopython?

Awesome; I'm glad it will be useful. I'd definitely welcome any
feedback you have on the API or implementation. At this stage we can
be flexible and hopefully get it finalized before it hits Biopython.
I will get some user documentation together soon, but here is some
basic usage.

To parse an entire GFF file, getting all features at once:

from BCBio.GFF.GFFParser import GFFAddingIterator

gff_iterator = GFFAddingIterator()
rec_dict = gff_iterator.get_all_features(gff_file)

The returned dictionary is like a dictionary from SeqIO.to_dict;
keys are ids and values are SeqRecords.

You can also seed the parser with an initial dictionary containing
sequences or other features, and the features from the GFF file will
be added to those records:

from Bio import SeqIO

with open(seq_file) as seq_handle:
    seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)

If a file is very large, you have two ways of limiting how much is
parsed at once. The first is to specify which items you are interested
in, so that only those are returned. This code will parse out coding
transcripts on chromosome I:

cds_limit_info = dict(
    gff_source_type=[('Coding_transcript', 'gene'),
                     ('Coding_transcript', 'mRNA'),
                     ('Coding_transcript', 'CDS')],
    gff_id=['I'],
)
rec_dict = gff_iterator.get_all_features(gff_file, limit_info=cds_limit_info)
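
To make the meaning of those limits concrete, here is a rough
standard-library sketch of the kind of filtering they describe. It is
not how the parser is implemented; the matches_limits helper and the
sample lines are made up for illustration, and it assumes tab-separated
GFF lines with the record id, source and type in the first three
columns.

```python
def matches_limits(line, limit_info):
    """Return True if a GFF line passes every limit that is present.

    Hypothetical helper for illustration only, not part of BCBio.GFF.
    """
    if not line.strip() or line.startswith("#"):
        return False  # skip blank lines, comments and directives
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return False
    rec_id, source, feature_type = parts[0], parts[1], parts[2]
    id_limit = limit_info.get("gff_id")
    if id_limit is not None and rec_id not in id_limit:
        return False
    source_type_limit = limit_info.get("gff_source_type")
    if (source_type_limit is not None
            and (source, feature_type) not in source_type_limit):
        return False
    return True


# Invented example lines, loosely in the style of a WormBase GFF3 file.
sample_lines = [
    "##gff-version 3\n",
    "I\tCoding_transcript\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "I\tCoding_transcript\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n",
    "I\tCoding_transcript\tCDS\t150\t800\t.\t+\t0\tID=cds1;Parent=mrna1\n",
    "I\tExpr_profile\texperimental_result_region\t100\t900\t.\t+\t.\tID=expr1\n",
    "II\tCoding_transcript\tgene\t50\t400\t.\t-\t.\tID=gene2\n",
]

demo_limit_info = dict(
    gff_source_type=[("Coding_transcript", "gene"),
                     ("Coding_transcript", "mRNA"),
                     ("Coding_transcript", "CDS")],
    gff_id=["I"],
)

kept = [line for line in sample_lines if matches_limits(line, demo_limit_info)]
```

Only the three Coding_transcript lines on chromosome I survive the
filter; the expression profile line and the chromosome II gene are
dropped.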

The second is to use an iterator over a section of the file:

for rec_dict in gff_iterator.get_features(gff_file, target_lines=1000000):
    # handle the partial rec dictionary for this chunk of the file
    pass
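
As a rough picture of what iterating in chunks means, the sketch below
groups feature lines into batches of a target size using only the
standard library. The iter_line_chunks name and the demo data are
invented for illustration; this is not the parser's code, and a real
parser would probably also need to avoid cutting a group of related
parent and child features in half, which this sketch ignores.

```python
import io


def iter_line_chunks(handle, target_lines):
    """Yield lists of at most target_lines feature lines from a handle.

    Hypothetical helper for illustration only; blank lines and lines
    starting with '#' are skipped and do not count towards the target.
    """
    chunk = []
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        chunk.append(line)
        if len(chunk) >= target_lines:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


# Invented five-feature file held in memory for the demonstration.
demo_handle = io.StringIO(
    "##gff-version 3\n"
    "I\tCoding_transcript\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "I\tCoding_transcript\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "I\tCoding_transcript\tCDS\t150\t800\t.\t+\t0\tID=cds1;Parent=mrna1\n"
    "II\tCoding_transcript\tgene\t50\t400\t.\t-\t.\tID=gene2\n"
    "II\tCoding_transcript\tmRNA\t50\t400\t.\t-\t.\tID=mrna2;Parent=gene2\n"
)

chunk_sizes = [len(chunk) for chunk in iter_line_chunks(demo_handle, target_lines=2)]
```

Because only one chunk is held at a time, memory use depends on the
chunk size rather than on the size of the whole file.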

Finally, there is an interface to examine a GFF file and figure out
useful ways to limit it. This will give you a dictionary of all
possible ways to limit a file along with the counts in each:

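# assumes GFFExaminer is importable from the same module as GFFAddingIterator
from BCBio.GFF.GFFParser import GFFExaminer

gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)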

and this will give a dictionary of the parent-child relationships in
the file:

gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)

Since GFF providers tend to differ in how they structure their
information, this helps get a quick overview of the file to
determine how to manage it.
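
For a feel of what such an overview involves, here is a small
standard-library sketch that counts record ids and (source, type) pairs
and collects parent-child relationships between feature types. It
assumes GFF3-style ID= and Parent= attributes in the ninth column. The
summarize_gff function, the layout of its return values and the sample
lines are all assumptions made for illustration; the dictionaries that
GFFExaminer actually returns may be organized differently.

```python
import collections


def summarize_gff(lines):
    """Return (limits, parent_child) summaries for GFF3-style lines.

    Hypothetical helper for illustration only, not part of BCBio.GFF.
    """
    id_counts = collections.Counter()
    source_type_counts = collections.Counter()
    type_by_feature_id = {}
    pending = []  # (parent feature id, child (source, type)) pairs

    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 9:
            continue
        rec_id, source, feature_type = parts[0], parts[1], parts[2]
        id_counts[rec_id] += 1
        source_type_counts[(source, feature_type)] += 1

        # Parse key=value pairs from the attribute column.
        attrs = {}
        for item in parts[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
        if "ID" in attrs:
            type_by_feature_id[attrs["ID"]] = (source, feature_type)
        for parent_id in attrs.get("Parent", "").split(","):
            if parent_id:
                pending.append((parent_id, (source, feature_type)))

    # Resolve parents after the full pass so that order does not matter.
    parent_child = collections.defaultdict(set)
    for parent_id, child in pending:
        parent = type_by_feature_id.get(parent_id)
        if parent is not None:
            parent_child[parent].add(child)

    limits = {"gff_id": dict(id_counts),
              "gff_source_type": dict(source_type_counts)}
    return limits, {key: sorted(val) for key, val in parent_child.items()}


# Invented example lines for the demonstration.
demo_lines = [
    "##gff-version 3\n",
    "I\tCoding_transcript\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "I\tCoding_transcript\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n",
    "I\tCoding_transcript\tCDS\t150\t400\t.\t+\t0\tID=cds1;Parent=mrna1\n",
    "I\tCoding_transcript\tCDS\t500\t800\t.\t+\t0\tID=cds2;Parent=mrna1\n",
    "II\tCoding_transcript\tgene\t50\t400\t.\t-\t.\tID=gene2\n",
]

demo_limits, demo_pc_map = summarize_gff(demo_lines)
```

The counts show which ids and source/type pairs are worth limiting on,
and the parent-child map shows how a given provider nests its features.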

Happy to hear any thoughts you might have. Thanks,
Brad

>
> --Michiel.
> 
> 
> --- On Mon, 4/6/09, Brad Chapman <chapmanb at 50mail.com> wrote:
> 
> > From: Brad Chapman <chapmanb at 50mail.com>
> > Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> > To: biopython-dev at lists.open-bio.org
> > Date: Monday, April 6, 2009, 6:08 PM
> > Peter;
> > Thanks for the plug. GFF parsing is moving along; the main two
> > things I would like to finish before proposing it for inclusion
> > are writing of GFF files and putting GFF into BioSQL with the
> > nested features. The code does work for parsing, and I've been
> > using it for some real projects; anyone who would like to test it
> > is more than welcome.
> > 
> > As far as the current Bio.GFF, that is a bit of a conundrum. The
> > current code does work, and for some cases it would be nice to
> > have the utility of working with GFF from a database. Eventually
> > BioSQL from GFF may supplant that, but that should be finished
> > and tested first. I would argue for keeping it in.
> > 
> > However, it is a bit confusing if someone is looking for a
> > parser. It would make more sense if it lived under a namespace
> > like Bio.GFF.DB. What do you think about adding a warning that it
> > is going to move to a new namespace and then moving it there, if
> > we don't hear any complaints, for 1.51? This is less cumbersome
> > than a removal for users since it's just an import change.
> > 
> > Brad
> > 
> > 
> > 
> > > Brad has been working on his GFF parsing code - see progress
> > > reports on his blog http://bcbio.wordpress.com/ and his code on
> > > github, http://github.com/chapmanb/bcbb/tree/master/gff
> > >
> > > Potentially this could make it into Biopython 1.51, and I was
> > > just thinking about where the code would go. Brad is supporting
> > > both GFF3 and the loosely defined GFF2 variants, so Bio.GFF seems
> > > a good place. There would also be a wrapper under Bio.SeqIO for
> > > loading GFF files as SeqRecord objects (I haven't played with
> > > Brad's code, but it can do this already).
> > >
> > > However, we already have a Bio.GFF module from Michael Hoffman
> > > created back in 2002 which accesses MySQL General Feature Format
> > > (GFF) databases created with BioPerl. Perhaps we should poll the
> > > main discussion list now, and if there are no responses from
> > > people using it, we could deprecate Bio.GFF for Biopython 1.50?
> > > Under our current deprecation policy we shouldn't then remove
> > > Bio.GFF until Biopython 1.52 at the earliest,
> > > http://biopython.org/wiki/Deprecation_policy
> > >
> > > What do you think Brad? How about using Bio.GFF3 instead?
> > > 
> > > Peter
> > > _______________________________________________
> > > Biopython-dev mailing list
> > > Biopython-dev at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/biopython-dev