<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Jun 10, 2015 at 12:15 PM, Ryan Dale <span dir="ltr"><<a href="mailto:dalerr@niddk.nih.gov" target="_blank">dalerr@niddk.nih.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class="h5"><br>
On 06/10/2015 01:44 PM, Brad Chapman wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Eric and Peter;<br>
Thanks again for moving this forward. cc'ing in Ryan as well, in case he<br>
hasn't seen this discussion.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
So I suppose the remaining tasks are, in no particular order:<br>
<br>
- Add/port Brad's GFF-GenBank converters and tests to Biopython. Ensure all<br>
the tests pass.<br>
</blockquote>
I'd suggest moving those scripts to use gffutils, rather than rely on<br>
bcbb/gff. Ryan's implementation is better and I'd prefer to deprecate<br>
mine and move forward with his work.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
- Enable GFF3 support by merging or porting from Brad's branch, bcbb/gff,<br>
or gffutils?<br>
</blockquote>
My vote is for gffutils.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
What to add for parent/child relationships between features is<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
yet to be decided.<br>
</blockquote>
I wonder if we can follow the lead of one of the GFF implementations<br>
mentioned above.<br>
<br>
Has this been discussed in a more recent thread that I didn't link<br>
here?<br>
</blockquote>
I lost this as well so am not sure the best starting place. I don't have<br>
a strong opinion and open to doing whatever y'all think is best.<br>
<br>
Thanks again,<br>
Brad<br>
</blockquote>
<br></div></div>
Hi all -<br>
<br>
Brad, thanks for the CC. I'd be happy to help out getting any/all of gffutils into BioPython. Let me give a high-level overview so you can decide what makes sense to bring into BioPython . . .<br></blockquote><div><br></div><div>Awesome. (Sorry for the lag.) I've looked through the gffutils code to see how this might work.<br><br>Starting with the most mundane, I see that gffutils has these dependencies (Biopython aims for a functioning dependency-free installation):<br><br><span class=""><span class=""></span>six<span class=""> -- Could port using Bio._py3k, straightforward but monotonous work.<br></span></span><span class=""><span class=""></span><br>argh<span class="">, argcomplete -- Only for argument parsing in the "gffutils-cli" script, maybe not needed in Biopython.<br></span></span><span class=""><span class=""></span><span class=""><br></span></span><span class=""><span class=""></span>simplejson<span class=""> -- I think this is roughly the same code as the "json" module in the standard library of Python 2.6+. Since Biopython doesn't support Python 2.5 anymore, we can probably just import "json" instead of "simplejson" in feature.py and helpers.py.<br><br></span></span><span class=""><span class=""></span>pyfaidx<span class=""> -- This takes some consideration. <span id="goog_685299984"></span>Since GSoC 2014<span id="goog_685299985"></span>, Biopython can index a genome-scale FASTA file with sqlite3 using its own index format, not the samtools faidx format. I don't see a ton of pygr-style indexing in gffutils beyond just extracting the specified subsequences from a FASTA file, so Biopython's internal solution may suffice. This is not yet merged; the pull request is here:<br><a href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br><br>If reading the .fai file is mandatory but writing it is not, then I can contribute a minimal ~100-line implementation of that (which could alternatively go into Biopython if we prefer):<br><a href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a><br><br></span></span></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
There are two main tricky parts to working with GFF/GTF: parsing the attributes and inferring the hierarchy of parent/child relationships.<br>
<br>
The parsing is mostly self-contained in gffutils.parser. It borrows the idea of a "dialect" from the built-in Python csv module, and the kinds of trickiness we see in Brad's pathological cases are encoded in the fields of the dialect (see comments in the gfftutils.constants.dialect dictionary).<br></blockquote><div><br></div><div>This looks valuable to have in Biopython even without inferring parent-child relationships. Would it be possible to start by extracting and merging the GFF3 parser, and work on the parent-child relationships separately?<br><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
The relationships are by far the hardest. I could write a lot about the difficulties of GFF vs GTF, but let's just say a sqlite3 db is the most portable and performant way I've found to use both GFF and GTF and interconvert between them. The bulk of gffutils' code and complexity is for working on this task.<br>
<br>
Converting GFF to BioPython objects while reliably keeping track of parent/child relations requires parsing the entire file, creating a database, and then querying the db for the relations. gffutils does this, and currently creates SeqFeatures objects. Any additional CompoundLocation stuff can easily be added, as long as there's a gffutils database to get relationship info from. Likewise, assuming presence of a db, Brad's scripts can easily be ported. I can certainly work on this.<br>
<br>
So I guess the big question is if you want to introduce all the sqlite3 machinery to BioPython in order to access relationship info, or just use the parser.<br></blockquote></div><br></div><div class="gmail_extra">I think we're happy to use sqlite3 wherever it's a sensible engineering choice, since it's part of the standard library. Biopython users may want the option to skip the database if parent-child relationships are not needed, or keep it in RAM to avoid hitting the disk.<br><br><br></div><div class="gmail_extra">-Eric<br></div></div>