[Biopython-dev] Bio.GFF and Brad's code

Fri Dec 4 14:25:40 UTC 2009

On Fri, Dec 4, 2009 at 1:40 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
> Peter, thanks for the feedback. Thoughts below.
>
>> Looking at your code, BCBio.GFF.parse(...) would return
>> SeqRecord objects (with SeqFeatures). That seems
>> redundant to me as one would expect people to just use
>> Bio.SeqIO.parse(handle, "gff3") instead. I would instead
>> have expected BCBio.GFF.parse(...) to iterate over the
>> features in a GFF file.
>
> This would work for simple cases, but for most real life cases you
> will likely want to limit the file to a subset of things you are
> interested in. It helps reduce memory problems, and is equivalent to
> a track system view in UCSC or Ensembl. I find it very useful for
> all of the work I've done with it.

Understood - a Bio.GFF.parse() function which returns features
could take various arguments (e.g. filters), or for full
flexibility, the user could use the parser object directly.
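
For illustration only, a minimal sketch of what a feature-returning
parse function with filtering arguments might look like. The names
parse_features, limit_types and limit_sources are hypothetical (they
are not the BCBio.GFF or Biopython API), only GFF3 style key=value
attributes are handled, and percent-escapes are not decoded.

```python
import io


def parse_features(handle, limit_types=None, limit_sources=None):
    """Yield one tuple per GFF3 feature line (hypothetical helper).

    Each tuple holds the eight fixed columns (start and end as ints)
    followed by a dict for the final attributes column.
    """
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        parts = line.split("\t")
        if len(parts) != 9:
            continue  # not a feature line
        seqid, source, ftype, start, end, score, strand, phase, attr_text = parts
        # Filter early so unwanted features are never kept in memory
        if limit_types is not None and ftype not in limit_types:
            continue
        if limit_sources is not None and source not in limit_sources:
            continue
        attributes = {}
        for pair in attr_text.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key.strip()] = value.strip().split(",")
        yield (seqid, source, ftype, int(start), int(end),
               score, strand, phase, attributes)


example = (
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1\n"
)
exons = list(parse_features(io.StringIO(example), limit_types={"exon"}))
```

Because this is a generator which filters line by line, only the
requested subset of features is ever built.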

> We could use SeqIO here, but then there is the issue of passing
> along the additional arguments. The simplicity of SeqIO is really
> nice, so not sure if we'd want to clutter SeqIO with it.
>
> So we could support basic parsing in SeqIO, but it would be useful to
> have this GFF specific parsing as the additional arguments will be a
> regular use case.

This is already catered for: Bio.SeqIO.parse() and read()
don't take arbitrary arguments (currently), but the underlying
Bio.SeqIO.XxxIO.XxxIterator() they invoke may do so, i.e. you
could have Bio.SeqIO.GffIO.GffIterator() and perhaps variants
(e.g. GFF2 vs GFF3) which take filtering arguments.
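
The pattern described here can be sketched as follows. This is a
stand-alone mock-up and not Biopython code: gff_iterator stands in
for a possible Bio.SeqIO.GffIO.GffIterator(), parse stands in for
Bio.SeqIO.parse(), limit_types is a hypothetical argument, and the
"records" are plain dicts rather than SeqRecord objects.

```python
import io


def gff_iterator(handle, limit_types=None):
    """Format-specific iterator which accepts a filtering argument."""
    records = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 9:
            continue
        seqid, ftype = parts[0], parts[2]
        if limit_types is not None and ftype not in limit_types:
            continue
        records.setdefault(seqid, []).append(
            (ftype, int(parts[3]), int(parts[4])))
    # One record per sequence identifier, in order of first appearance
    for seqid, features in records.items():
        yield {"id": seqid, "features": features}


# Simple front end: format name -> iterator, no extra arguments
_ITERATORS = {"gff3": gff_iterator}


def parse(handle, format):
    """Generic entry point which takes no format-specific arguments."""
    return _ITERATORS[format](handle)


data = (
    "chr1\tdemo\tgene\t1\t50\t.\t+\t.\tID=g1\n"
    "chr1\tdemo\texon\t1\t20\t.\t+\t.\tParent=g1\n"
    "chr2\tdemo\tgene\t5\t9\t.\t-\t.\tID=g2\n"
)
# Basic use goes through the simple front end...
everything = list(parse(io.StringIO(data), "gff3"))
# ...while filtering means calling the iterator directly
genes_only = list(gff_iterator(io.StringIO(data), limit_types={"gene"}))
```

The front end stays uncluttered, and anyone who needs the extra
arguments calls the format-specific iterator themselves.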

>> Also, and we'd touched on this before - I'd much prefer to
>> have the GFF module quite "low level" using either new
>> GFF-specific classes or simple Python objects (e.g. for
>> each feature use a tuple of ints and strings for the first
>> feature columns plus a dict for the final extendible
>> column of annotation).
>
> Yes, it is implemented this way. The parse_simple function returns
> a line by line parse of the file as a dictionary, which is then used
> to build up the SeqFeature objects:
>
> http://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFParser.py
>
> We can document and flesh that out, although I'm not really sure how
> useful it will be. It's pretty easy to build your own simple
> line-by-line GFF parser; the only advantage of this code over a
> home-brew is that it handles tricky annotation cases.

I still think it would be useful to have Bio/GFF/Parser.py (or
similar) as the low level parser, and Bio/SeqIO/GffIO.py (or
similar) to turn this into SeqRecord and SeqFeature objects.
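
A rough sketch of such a two layer split, under stated assumptions:
low_level_parse and to_features are hypothetical names, and
ConvertedFeature is only a stand-in for SeqFeature so the example
runs without Biopython. The conversion step shifts the GFF start
(1-based, inclusive) to the 0-based, end-exclusive Python style
coordinates which SeqFeature locations use; mapping an unknown
strand to None is an assumption.

```python
from collections import namedtuple

# Stand-in for a SeqFeature (hypothetical, for illustration only)
ConvertedFeature = namedtuple(
    "ConvertedFeature", ["type", "start", "end", "strand", "qualifiers"])

STRAND_MAP = {"+": 1, "-": -1}  # anything else becomes None


def low_level_parse(lines):
    """Layer 1: yield (tuple of fixed columns, dict of attributes)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attributes = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key] = value.split(",")
        fixed = (cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                 cols[5], cols[6], cols[7])
        yield fixed, attributes


def to_features(rows):
    """Layer 2: turn low level rows into (seqid, feature object) pairs."""
    for fixed, attributes in rows:
        seqid, source, ftype, start, end, score, strand, phase = fixed
        qualifiers = dict(attributes)
        qualifiers["source"] = [source]
        # GFF is 1-based inclusive; convert to 0-based, end-exclusive
        yield seqid, ConvertedFeature(
            ftype, start - 1, end, STRAND_MAP.get(strand), qualifiers)


lines = ["chr1\tdemo\tgene\t100\t900\t.\t-\t.\tID=gene1\n"]
rows = list(low_level_parse(lines))
converted = list(to_features(rows))
```

People who only want the raw features can stop after the first
layer and never pay for building the heavier objects.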

> For all of my uses, the real win was being able to build up the
> multiple transcript exon/intron structures from the file. This is
> not trivial to do on your own, and the real win of the code is in
> handling this, especially for older GFF2 and GTF formatted files.
>
>> From a technical point of view, a justification for this
>> separation is the GFF details are not a perfect fit to the
>> SeqRecord and SeqFeature objects and forcing their
>> use adds unnecessary overheads for people wanting
>> to work directly with the features themselves.
>
> Why are SeqRecord and SeqFeature not appropriate for GFF? We could
> improve them to make things more lightweight, as we discussed
> previously, but conceptually the values fit into the framework fine.

It is the nested features that worry me. Perhaps the existing
location operator (e.g. "join") could be set to something
like "parent/child" if the subfeatures are used to hold child
features rather than the elements of a join? We need
the GenBank output code etc. to be able to tell these
apart reliably.
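
To make the idea concrete, a small sketch using a stand-in class
(not the real SeqFeature, although the attribute names mirror its
location_operator and sub_features). The "parent/child" value is
just the tentative suggestion above and is not an existing
Biopython convention; join_parts and child_features are
hypothetical helpers of the kind an output module might use.

```python
class Feature:
    """Hypothetical stand-in holding the two attributes in question."""

    def __init__(self, feature_type, start, end,
                 location_operator="", sub_features=None):
        self.type = feature_type
        self.start = start
        self.end = end
        self.location_operator = location_operator
        self.sub_features = sub_features or []


def join_parts(feature):
    """Return the (start, end) pieces of a join, or None if not a join."""
    if feature.location_operator == "join":
        return [(part.start, part.end) for part in feature.sub_features]
    return None


def child_features(feature):
    """Return nested child features, or [] if sub_features is a join."""
    if feature.location_operator == "parent/child":
        return list(feature.sub_features)
    return []


# A gene holding an mRNA child, versus a CDS made of two joined pieces
gene = Feature("gene", 0, 900, "parent/child",
               [Feature("mRNA", 0, 900)])
cds = Feature("CDS", 0, 600, "join",
              [Feature("CDS", 0, 300), Feature("CDS", 400, 600)])
```

With an explicit marker like this, a GenBank writer would emit
join(...) only for the CDS and treat the gene's subfeatures as
separate features to write out.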

>> Also, by splitting the code into basic parsing and a
>> SeqRecord/SeqFeature conversion layer (which I
>> would put in Bio/SeqIO/GffIO.py) we can add the
>> code in two steps (first GFF parsing, then SeqIO
>> support).
>
> We can do this as is. I'm not suggesting SeqIO support right now,
> and want to target getting the GFF parser as is into Biopython.

My point is the moment you include GFF -> SeqRecord
code (even if not explicitly via the Bio.SeqIO namespace)
it opens us up to people giving these SeqRecord objects
to SeqIO for output (e.g. as GenBank).

>> I think this split is useful as this is a very big job to do
>> properly: Once we have GFF to SeqRecord parsing,
>> we need to try and ensure that it is compatible with the
>> GenBank to SeqRecord parsing. This is important as
>> we would in effect be extending Biopython to allow
>> GFF3 to GenBank conversions. For testing all this,
>> we can grab the same data in the two file formats
>> (e.g. from the NCBI) and perhaps also use EMBOSS.
>
> Do you think GFF to GenBank is a common use case?

I suspect it's something I'd want to do when working with
new genome annotations. GeneMark produces GFF, while
Prodigal produces (simple) GenBank. The SOLiD pipeline
corona produces GFF. Sometimes you can get both: the
tool RAST outputs GenBank, GFF, GTF and EMBL files.

> Agreed that it is very hard, but this really had less to do
> with the object structure in Biopython and more to do
> with how things are represented and named in the
> original source files. GenBank has some "consistency"
> since it is produced mostly by NCBI, but GFF files are
> all over the place.
>
> This can be tackled later if someone wants, but right
> now my goals are simply:
>
> - Produce Biopython objects from GFF3/GTF/GFF2 files
> - Represent nested features
> - Allow GFF2/GTF to GFF3 conversion
>
> This should be done with the current code. We can
> formalize the raw parse_simple output for the line-by-line
> if people find it useful, but otherwise we should leave
> these bigger projects for down the line.

Worthy goals, but if by "Produce Biopython objects from
GFF3/GTF/GFF2 files" you mean SeqRecords with
SeqFeatures then (as I said above) we are opening up the
GFF to GenBank can of worms. There is no "later" :(

Peter