[Biopython-dev] Bio.GFF and Brad's code

Tue Apr 14 11:04:39 UTC 2009

On Tue, Apr 14, 2009 at 11:36 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> Usually, when I use a GFF file I either don't have an associated Fasta file,
> or I am not particularly interested in the original sequences. So while this
> approach is useful for some people, in its current form it's not exactly
> generally usable.
>
> First, let's discuss how to represent the information contained in a GFF
> file.  SeqRecords are good if the GFF file is associated with a Fasta file
> (or contains the sequence itself), but if not it seems to be a bit awkward.

I think parsing a GFF file with Bio.SeqIO into SeqRecord object(s) can
still be useful even without the sequence.  The list of SeqFeature
objects belonging to each SeqRecord can be used for example with
GenomeDiagram to draw a picture of the organism.  Because you lack the
sequence, you won't be able to include GC% or GC skew, but it is nice
to visualize the annotation all the same.  You could also do things
like calculating the ratio of genic to intergenic sequence, or hunting
for overlapping genes - although for these it may be easier to work
with a lower-level representation.
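
As a rough illustration of the low-level approach, here is a minimal
sketch that hunts for overlapping genes using nothing but coordinates.
The (name, start, end) tuples and the find_overlaps function are
hypothetical and are not part of Biopython; coordinates are assumed to
be 1-based and inclusive, as in GFF.

```python
from typing import List, Tuple

# Hypothetical low-level representation: (name, start, end), using
# 1-based inclusive coordinates as found in a GFF file.
Gene = Tuple[str, int, int]


def find_overlaps(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return pairs of gene names whose coordinate ranges overlap."""
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                # Sorted by start, so no later gene can overlap gene a.
                break
            overlaps.append((name_a, name_b))
    return overlaps


genes = [("geneA", 100, 500), ("geneB", 450, 900), ("geneC", 1000, 1200)]
print(find_overlaps(genes))
```

No sequence data is needed for this, only the start and end columns.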

> How about the following (and I think Peter was hinting at the same idea):
>
> The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects
> that closely resemble the GFF file structure. For example, we use the
> GFF specified fields (<seqname> <source> <feature> <start> <end>
> <score> <strand> <frame> [attributes] [comments]) as attributes to
> Bio.GFF.Record objects.

That sounds possible to me - although I haven't given the basic
Bio.GFF.Record structure any thought, nor indeed have I examined what
data objects Brad is returning at the moment.
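
To make the idea concrete, a record that mirrors the GFF columns
might look something like the sketch below.  This is only an
assumption about what such an object could be: GFFRecord and
parse_gff_line are hypothetical names, not existing Biopython API,
and the sketch ignores trailing comments and leaves the attributes
column as an unparsed string.

```python
from typing import NamedTuple, Optional


class GFFRecord(NamedTuple):
    """Hypothetical stand-in for a Bio.GFF.Record (not an existing class)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> GFFRecord:
    """Split one tab-separated GFF data line into its named fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = parts[:8]
    # The attributes column is optional, so fall back to an empty string.
    attributes = parts[8] if len(parts) > 8 else ""
    return GFFRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame, attributes,
    )


rec = parse_gff_line("chr1\tGenBank\tgene\t190\t255\t.\t+\t.\tID=gene0001")
print(rec.seqname, rec.feature, rec.start, rec.end)
```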

> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the
> appropriate fields of a SeqRecord.

Yes - much like how Bio.SeqIO calls other modules like Bio.GenBank and
Bio.SwissProt now.  However, regarding the implementation, I wouldn't
automatically insist the Bio.SeqIO GFF wrapper *has* to use a
Bio.GFF.Record internally (assuming we have such a thing) as that
could be a performance bottleneck.  I guess it depends on how simple
the Bio.GFF.Record objects are.
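
For instance, a wrapper could in principle skip the intermediate
record object and group the raw columns by sequence name directly.
The sketch below is one hypothetical way to do that in plain Python;
iter_features_by_sequence is an invented name, and it assumes all the
lines for a given sequence are contiguous in the file.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical minimal feature: (type, start, end, strand).
Feature = Tuple[str, int, int, str]


def iter_features_by_sequence(
    lines: Iterable[str],
) -> Iterator[Tuple[str, List[Feature]]]:
    """Yield (seqname, features) for each run of lines sharing a seqname."""
    rows = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")  # skip blanks/comments
    )
    for seqname, group in groupby(rows, key=lambda row: row[0]):
        features = [(r[2], int(r[3]), int(r[4]), r[6]) for r in group]
        yield seqname, features


gff_lines = [
    "##gff-version 3\n",
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "chr1\t.\tCDS\t10\t90\t.\t+\t0\tParent=g1\n",
    "chr2\t.\tgene\t5\t50\t.\t-\t.\tID=g2\n",
]
for seqname, features in iter_features_by_sequence(gff_lines):
    print(seqname, features)
```

Each (seqname, features) pair is the kind of data a SeqRecord with a
list of SeqFeature objects would be built from.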

> Here, we have to think about two cases:
> Simply creating a SeqRecord based on the GFF file, and adding the
> information in the GFF file as annotations to a pre-existing set of SeqRecords.
> (I am not sure if we need a separate function for that, or, as Peter suggested,
> let the user do that himself, guided by some examples in the documentation).

Simply creating SeqRecord objects from a GFF file is the standard
Bio.SeqIO approach.  Combining data from a GFF file and a FASTA
file is rather like the FASTA+QUAL situation.  There we document
(in the docstrings, not yet in the tutorial) how to use Bio.SeqIO
to read in two sets of SeqRecord objects and combine them, and we
also provide a "paired file iterator" to do this for you.  Right
now this function is in Bio.SeqIO.QualityIO, but I am open to moving
it and the low-level bits to somewhere like Bio.Sequencing.Quality
instead (as long as we do this before Biopython 1.50 is released).

I have pondered a "paired file iterator" function for Bio.SeqIO for
dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take
TWO file handles and return SeqRecord objects.  Interestingly all the
examples thus far are FASTA+other.  Anyway, this could be added later
if need be.
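
A generic version of the pairing idea could be sketched roughly as
follows.  This is an assumption about the shape of such a function,
not the existing Bio.SeqIO.QualityIO code: paired_iterator is a
hypothetical name, it takes already-parsed records rather than file
handles, and it assumes both inputs list the same identifiers in the
same order (true for FASTA+QUAL; a GFF file would first need its
features grouped per sequence).

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
_MISSING = object()  # sentinel marking an exhausted iterator


def paired_iterator(
    primary: Iterable[A],
    secondary: Iterable[B],
    key_a: Callable[[A], str],
    key_b: Callable[[B], str],
) -> Iterator[Tuple[A, B]]:
    """Yield (a, b) pairs in lockstep, checking that identifiers agree."""
    iter_b = iter(secondary)
    for a in primary:
        b = next(iter_b, _MISSING)
        if b is _MISSING:
            raise ValueError("second input ended before the first")
        if key_a(a) != key_b(b):
            raise ValueError(
                "identifier mismatch: %r vs %r" % (key_a(a), key_b(b))
            )
        yield a, b
    if next(iter_b, _MISSING) is not _MISSING:
        raise ValueError("second input has extra records")


# Toy stand-ins for parsed FASTA and QUAL records: (identifier, data).
fasta = [("seq1", "ACGT"), ("seq2", "GGCC")]
qual = [("seq1", [30, 30, 20, 10]), ("seq2", [40, 40, 40, 40])]
for (name, seq), (_, scores) in paired_iterator(
    fasta, qual, key_a=lambda r: r[0], key_b=lambda r: r[0]
):
    print(name, seq, scores)
```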

> Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to
> see the "raw" GFF data, depending on their needs.
> How does that sound?

Pretty much what I had in mind - although as I said, I've not given
much thought to how to present the "raw" GFF data.

Peter