[Biopython-dev] Merging the GFF3 and VCF branches

Wed Jun 24 21:54:44 UTC 2015

On Wed, Jun 10, 2015 at 12:15 PM, Ryan Dale <dalerr at niddk.nih.gov> wrote:

>
> On 06/10/2015 01:44 PM, Brad Chapman wrote:
>
>> Eric and Peter;
>> Thanks again for moving this forward. cc'ing in Ryan as well, in case he
>> hasn't seen this discussion.
>>
>>> So I suppose the remaining tasks are, in no particular order:
>>>
>>> - Add/port Brad's GFF-GenBank converters and tests to Biopython. Ensure
>>> all the tests pass.
>>>
>> I'd suggest moving those scripts to use gffutils, rather than rely on
>> bcbb/gff. Ryan's implementation is better and I'd prefer to deprecate
>> mine and move forward with his work.
>>
>>> - Enable GFF3 support by merging or porting from Brad's branch, bcbb/gff,
>>> or gffutils?
>>>
>> My vote is for gffutils.
>>
>>>> What to add for parent/child relationships between features is
>>>> yet to be decided.
>>>>
>>> I wonder if we can follow the lead of one of the GFF implementations
>>> mentioned above.
>>>
>>> Has this been discussed in a more recent thread that I didn't link
>>> here?
>>>
>> I lost track of this as well, so I'm not sure of the best starting place.
>> I don't have a strong opinion and am open to doing whatever y'all think
>> is best.
>>
>> Thanks again,
>> Brad
>>
>
> Hi all -
>
> Brad, thanks for the CC.  I'd be happy to help out getting any/all of
> gffutils into BioPython. Let me give a high-level overview so you can
> decide what makes sense to bring into BioPython . . .
>

Awesome. (Sorry for the lag.) I've looked through the gffutils code to see
how this might work.

Starting with the most mundane, I see that gffutils has these dependencies
(Biopython aims for a functioning dependency-free installation):

six -- Could port using Bio._py3k, straightforward but monotonous work.

argh, argcomplete -- Only for argument parsing in the "gffutils-cli"
script, maybe not needed in Biopython.

simplejson -- I think this is roughly the same code as the "json" module in
the standard library of Python 2.6+. Since Biopython doesn't support Python
2.5 anymore, we can probably just import "json" instead of "simplejson" in
feature.py and helpers.py.

pyfaidx -- This takes some consideration. Since GSoC 2014, Biopython can
index a genome-scale FASTA file with sqlite3 using its own index format,
not the samtools faidx format. I don't see a ton of pygr-style indexing in
gffutils beyond just extracting the specified subsequences from a FASTA
file, so Biopython's internal solution may suffice. This is not yet merged;
the pull request is here:
https://github.com/biopython/biopython/pull/356

If reading the .fai file is mandatory but writing it is not, then I can
contribute a minimal ~100-line implementation of that (which could
alternatively go into Biopython if we prefer):
https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py
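
To make the read-only case concrete, here is a rough sketch of what such a reader could look like. It is not the code at the link above and not pyfaidx; the function names read_fai and fetch are made up for illustration. It assumes the usual five tab-separated .fai columns (name, length, byte offset, bases per line, bytes per line) and 0-based, half-open coordinates:

```python
def read_fai(fai_handle):
    """Parse .fai text into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    for line in fai_handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[0]
        length, offset, linebases, linewidth = (int(x) for x in fields[1:5])
        index[name] = (length, offset, linebases, linewidth)
    return index


def fetch(fasta_handle, index, name, start, end):
    """Return bases [start, end) of sequence `name` (0-based, half-open).

    fasta_handle must be opened in binary mode so byte offsets are exact.
    """
    length, offset, linebases, linewidth = index[name]
    start = max(start, 0)
    end = min(end, length)
    if start >= end:
        return ""
    # Byte position of base i: skip whole lines first, then move within the line.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases
    fasta_handle.seek(first)
    raw = fasta_handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return raw.decode("ascii").replace("\n", "").replace("\r", "")
```

Usage would be something like opening the FASTA with open(path, "rb"), the index with open(path + ".fai"), and calling fetch(fasta, read_fai(fai), "chr1", 100, 200).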

> There are two main tricky parts to working with GFF/GTF: parsing the
> attributes and inferring the hierarchy of parent/child relationships.
>
> The parsing is mostly self-contained in gffutils.parser. It borrows the
> idea of a "dialect" from the built-in Python csv module, and the kinds of
> trickiness we see in Brad's pathological cases are encoded in the fields of
> the dialect (see comments in the gffutils.constants.dialect dictionary).
>

This looks valuable to have in Biopython even without inferring
parent-child relationships. Would it be possible to start by extracting and
merging the GFF3 parser, and work on the parent-child relationships
separately?
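
For anyone following along, this is roughly how I picture a dialect-driven attribute parser. It is only a toy sketch: the dictionary keys below are invented for this example and are not the ones gffutils uses, and it does not handle separators embedded inside quoted GTF values, which the real parser has to deal with.

```python
from urllib.parse import unquote

# Hypothetical dialects for this sketch; key names are illustrative only
# and are not copied from gffutils.
GFF3_DIALECT = {
    "field_separator": ";",
    "keyval_separator": "=",
    "multival_separator": ",",
    "quoted_values": False,
}
GTF_DIALECT = {
    "field_separator": ";",
    "keyval_separator": " ",
    "multival_separator": ",",
    "quoted_values": True,
}


def parse_attributes(column, dialect):
    """Parse the ninth GFF/GTF column into {key: [values]} using `dialect`."""
    attrs = {}
    for field in column.split(dialect["field_separator"]):
        field = field.strip()
        if not field:
            continue  # tolerate leading or trailing separators
        key, sep, value = field.partition(dialect["keyval_separator"])
        if dialect["quoted_values"]:
            value = value.strip().strip('"')
        # Split multiple values, then undo percent-encoding on each one.
        values = [unquote(v) for v in value.split(dialect["multival_separator"])] if sep else []
        attrs.setdefault(key, []).extend(values)
    return attrs
```

The point is that the parsing loop stays the same and only the dialect changes between GFF3 and GTF.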

> The relationships are by far the hardest. I could write a lot about the
> difficulties of GFF vs GTF, but let's just say a sqlite3 db is the most
> portable and performant way I've found to use both GFF and GTF and
> interconvert between them. The bulk of gffutils' code and complexity is for
> working on this task.
>
> Converting GFF to BioPython objects while reliably keeping track of
> parent/child relations requires parsing the entire file, creating a
> database, and then querying the db for the relations. gffutils does this,
> and currently creates SeqFeature objects. Any additional CompoundLocation
> stuff can easily be added, as long as there's a gffutils database to get
> relationship info from. Likewise, assuming presence of a db, Brad's scripts
> can easily be ported. I can certainly work on this.
>
> So I guess the big question is if you want to introduce all the sqlite3
> machinery to BioPython in order to access relationship info, or just use
> the parser.
>

I think we're happy to use sqlite3 wherever it's a sensible engineering
choice, since it's part of the standard library. Biopython users may want
the option to skip the database if parent-child relationships are not
needed, or keep it in RAM to avoid hitting the disk.
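
As a sketch of the in-RAM option: sqlite3 accepts the special path ":memory:", so the same code can target a file or memory. The table layout and function names below are hypothetical, chosen just for this example, and are not gffutils' actual schema. The recursive query assumes SQLite 3.8.3 or later.

```python
import sqlite3


def build_relations(pairs, path=":memory:"):
    """Store (parent_id, child_id) pairs; ":memory:" keeps the db in RAM."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    conn.executemany("INSERT INTO relations VALUES (?, ?)", pairs)
    conn.commit()
    return conn


def children(conn, parent):
    """Return the direct children of `parent`, sorted by ID."""
    rows = conn.execute(
        "SELECT child FROM relations WHERE parent = ? ORDER BY child", (parent,)
    )
    return [row[0] for row in rows]


def descendants(conn, root):
    """Return all features below `root` at any depth, sorted by ID."""
    sql = """
        WITH RECURSIVE tree(id) AS (
            SELECT child FROM relations WHERE parent = ?
            UNION
            SELECT r.child FROM relations AS r JOIN tree AS t ON r.parent = t.id
        )
        SELECT id FROM tree ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, (root,))]
```

Passing a filename instead of the default path would give the on-disk behavior, and users who only want the parser would never call any of this.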

-Eric