[Biopython-dev] Bio.GFF and Brad's code

Fri Dec 4 13:40:10 UTC 2009

Hi all;
Peter, thanks for the feedback. Thoughts below.

> Looking at your code, BCBio.GFF.parse(...) would return
> SeqRecord objects (with SeqFeatures). That seems
> redundant to me as one would expect people to just use
> Bio.SeqIO.parse(handle, "gff3") instead. I would instead
> have expected BCBio.GFF.parse(...) to iterate over the
> features in a GFF file.

This would work for simple cases, but for most real-life cases you
will likely want to limit the parse to the subset of features you are
interested in. That helps avoid memory problems, and is equivalent to
a track view in UCSC or Ensembl. I've found it very useful in all of
the work I've done with the parser.
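
As a rough sketch of what that kind of filtering amounts to (this is
not the BCBio.GFF API; iter_gff_lines and its seqids/types arguments
are names made up for illustration), the file can be streamed and
unwanted rows dropped before anything is built in memory:

```python
import io

def iter_gff_lines(handle, seqids=None, types=None):
    """Yield the nine GFF columns of each row, keeping only requested rows.

    Hypothetical helper for illustration: seqids and types are optional
    sets of allowed values for column 1 and column 3.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        if seqids is not None and cols[0] not in seqids:
            continue
        if types is not None and cols[2] not in types:
            continue
        yield cols

# Made-up example data, only to show the call.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "chr1\tsrc\texon\t1\t50\t.\t+\t.\tParent=g1\n"
    "chr2\tsrc\tgene\t5\t80\t.\t-\t.\tID=g2\n"
)
kept = list(iter_gff_lines(example, seqids={"chr1"}, types={"gene"}))
```

Because it is a generator, only the rows that pass the filter are ever
held by the caller.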

We could use SeqIO here, but then there is the issue of passing
along the additional arguments. The simplicity of SeqIO is really
nice, so I'm not sure we'd want to clutter it with them.

So we could support basic parsing in SeqIO, but it would be useful to
have this GFF-specific parsing as well, since passing the additional
arguments will be a regular use case.

> Also, and we'd touched on this before - I'd much prefer to
> have the GFF module quite "low level" using either new
> GFF-specific classes or simple Python objects (e.g. for
> each feature use a tuple of ints and strings for the first
> feature columns plus a dict for the final extendible
> column of annotation).

Yes, it is implemented this way. The parse_simple function returns
a line-by-line parse of the file as a dictionary, which is then used
to build up the SeqFeature objects:

http://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFParser.py

We can document and flesh that out, although I'm not really sure how
useful it will be. It's pretty easy to build your own simple
line-by-line GFF parser; the only advantage of this code over a
home-brew is that it handles tricky annotation cases.
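
For reference, a home-brew line-by-line parser along those lines might
look like the sketch below. It is not parse_simple; parse_gff3_line is
a hypothetical name, and it assumes well-formed GFF3 with nine
tab-separated columns and key=value attributes. It returns a tuple for
the fixed columns and a dict for the attribute column:

```python
from urllib.parse import unquote

def parse_gff3_line(line):
    """Split one GFF3 feature line into (fixed_columns, attributes).

    Hypothetical helper for illustration; it does none of the tricky
    annotation handling discussed above.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    for pair in attr_text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Multiple values are comma separated; split first, then
        # decode the URL-style escapes in each value.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    fixed = (seqid, source, ftype, int(start), int(end), score, strand, phase)
    return fixed, attributes

fixed, attributes = parse_gff3_line(
    "chr1\tsrc\texon\t1\t50\t.\t+\t0\tID=e1;Parent=t1,t2;Note=first%20exon"
)
```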

For all of my uses, the real win was being able to build up the
exon/intron structures of multiple transcripts from the file. This is
not trivial to do on your own, and handling it is where the code pays
off, especially for older GFF2 and GTF formatted files.
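
To give a feel for the nesting problem, here is a minimal sketch for
the easy GFF3 case, where every feature names its parents explicitly.
It is not how the BCBio.GFF code does it; build_hierarchy and the
tuple layout (id, parent ids, type, start, end) are assumptions made
for the example, and the data is invented:

```python
from collections import defaultdict

def build_hierarchy(features):
    """Group features under their parents.

    features: iterable of (feature_id, parent_ids, ftype, start, end).
    Returns (top_level, children), where children maps a parent id to
    the list of features that name it as a parent.
    """
    children = defaultdict(list)
    top_level = []
    for feat in features:
        parent_ids = feat[1]
        if parent_ids:
            # A feature may belong to several parents, e.g. an exon
            # shared between two transcripts.
            for pid in parent_ids:
                children[pid].append(feat)
        else:
            top_level.append(feat)
    return top_level, children

features = [
    ("gene1", [], "gene", 1, 1000),
    ("tx1", ["gene1"], "mRNA", 1, 1000),
    ("tx2", ["gene1"], "mRNA", 1, 800),
    ("exon1", ["tx1", "tx2"], "exon", 1, 200),
    ("exon2", ["tx1"], "exon", 600, 1000),
    ("exon3", ["tx2"], "exon", 500, 800),
]
top_level, children = build_hierarchy(features)

# Exon coordinates for each transcript of gene1.
exons_by_transcript = {
    tx[0]: [(f[3], f[4]) for f in children[tx[0]] if f[2] == "exon"]
    for tx in children["gene1"]
}
```

GFF2 and GTF files have no explicit parent links like this, so the
grouping has to be inferred from other attributes, which is where most
of the difficulty lies.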

> From a technical point of view, a justification for this
> separation is the GFF details are not a perfect fit to the
> SeqRecord and SeqFeature objects and forcing their
> use adds unnecessary overheads for people wanting
> to work directly with the features themselves.

Why are SeqRecord and SeqFeature not appropriate for GFF? We could 
improve them to make things more lightweight, as we discussed
previously, but conceptually the values fit into the framework fine.

> Also, by splitting the code into basic parsing and a
> SeqRecord/SeqFeature conversion layer (which I
> would put in Bio/SeqIO/GffIO.py) we can add the
> code in two steps (first GFF parsing, then SeqIO
> support).

We can do this as is. I'm not suggesting SeqIO support right now,
and want to target getting the GFF parser as is into Biopython.

> I think this split is useful as this is a very big job to do
> properly: Once we have GFF to SeqRecord parsing,
> we need to try and ensure that it is compatible with the
> GenBank to SeqRecord parsing. This is important as
> we would in effect be extending Biopython to allow
> GFF3 to GenBank conversions. For testing all this,
> we can grab the same data in the two file formats
> (e.g. from the NCBI) and perhaps also use EMBOSS.

Do you think GFF to GenBank is a common use case? Agreed that it is
very hard, but this really has less to do with the object
structure in Biopython and more to do with how things
are represented and named in the original source files. GenBank has
some "consistency" since it is produced mostly by NCBI, but GFF
files are all over the place.

This can be tackled later if someone wants, but right now my goals
are simply:

- Produce Biopython objects from GFF3/GTF/GFF2 files
- Represent nested features
- Allow GFF2/GTF to GFF3 conversion

The current code should already cover all of this. We can formalize
the raw parse_simple output for line-by-line parsing if people find it
useful, but otherwise we should leave these bigger projects for down
the line.

Brad