<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Jun 10, 2015 at 12:15 PM, Ryan Dale <span dir="ltr">&lt;<a href="mailto:dalerr@niddk.nih.gov" target="_blank">dalerr@niddk.nih.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><div class="h5"><br>

On 06/10/2015 01:44 PM, Brad Chapman wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Eric and Peter;<br>

Thanks again for moving this forward. cc&#39;ing in Ryan as well, in case he<br>

hasn&#39;t seen this discussion.<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

So I suppose the remaining tasks are, in no particular order:<br>

<br>

- Add/port Brad&#39;s GFF-GenBank converters and tests to Biopython. Ensure all<br>

the tests pass.<br>

</blockquote>

I&#39;d suggest moving those scripts to use gffutils, rather than rely on<br>

bcbb/gff. Ryan&#39;s implementation is better and I&#39;d prefer to deprecate<br>

mine and move forward with his work.<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

- Enable GFF3 support by merging or porting from Brad&#39;s branch, bcbb/gff,<br>

or gffutils?<br>

</blockquote>

My vote is for gffutils.<br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

What to add for parent/child relationships between features is<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

yet to be decided.<br>

</blockquote>

I wonder if we can follow the lead of one of the GFF implementations<br>

mentioned above.<br>

<br>

Has this been discussed in a more recent thread that I didn&#39;t link<br>

here?<br>

</blockquote>

I lost this as well so am not sure of the best starting place. I don&#39;t have<br>

a strong opinion and am open to doing whatever y&#39;all think is best.<br>

<br>

Thanks again,<br>

Brad<br>

</blockquote>

<br></div></div>

Hi all -<br>

<br>

Brad, thanks for the CC.  I&#39;d be happy to help out getting any/all of gffutils into BioPython. Let me give a high-level overview so you can decide what makes sense to bring into BioPython . . .<br></blockquote><div><br></div><div>Awesome. (Sorry for the lag.) I&#39;ve looked through the gffutils code to see how this might work.<br><br>Starting with the most mundane, I see that gffutils has these dependencies (Biopython aims for a functioning dependency-free installation):<br><br><span class=""><span class=""></span>six<span class=""> -- Could port using Bio._py3k, straightforward but monotonous work.<br></span></span><span class=""><span class=""></span><br>argh<span class="">, argcomplete -- Only for argument parsing in the &quot;gffutils-cli&quot; script, maybe not needed in Biopython.<br></span></span><span class=""><span class=""></span><span class=""><br></span></span><span class=""><span class=""></span>simplejson<span class=""> -- I think this is roughly the same code as the &quot;json&quot; module in the standard library of Python 2.6+. Since Biopython doesn&#39;t support Python 2.5 anymore, we can probably just import &quot;json&quot; instead of &quot;simplejson&quot; in feature.py and helpers.py.<br><br></span></span><span class=""><span class=""></span>pyfaidx<span class=""> -- This takes some consideration. <span id="goog_685299984"></span>Since GSoC 2014<span id="goog_685299985"></span>, Biopython can index a genome-scale FASTA file with sqlite3 using its own index format, not the samtools faidx format. I don&#39;t see a ton of pygr-style indexing in gffutils beyond just extracting the specified subsequences from a FASTA file, so Biopython&#39;s internal solution may suffice. 
This is not yet merged; the pull request is here:<br><a href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br><br>If reading the .fai file is mandatory but writing it is not, then I can contribute a minimal ~100-line implementation of that (which could alternatively go into Biopython if we prefer):<br><a href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a><br><br></span></span></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


There are two main tricky parts to working with GFF/GTF: parsing the attributes and inferring the hierarchy of parent/child relationships.<br>

<br>

The parsing is mostly self-contained in gffutils.parser. It borrows the idea of a &quot;dialect&quot; from the built-in Python csv module, and the kinds of trickiness we see in Brad&#39;s pathological cases are encoded in the fields of the dialect (see comments in the gffutils.constants.dialect dictionary).<br></blockquote><div><br></div><div>This looks valuable to have in Biopython even without inferring parent-child relationships. Would it be possible to start by extracting and merging the GFF3 parser, and work on the parent-child relationships separately?<br><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

The relationships are by far the hardest. I could write a lot about the difficulties of GFF vs GTF, but let&#39;s just say a sqlite3 db is the most portable and performant way I&#39;ve found to use both GFF and GTF and interconvert between them. The bulk of gffutils&#39; code and complexity is for working on this task.<br>

<br>

Converting GFF to BioPython objects while reliably keeping track of parent/child relations requires parsing the entire file, creating a database, and then querying the db for the relations. gffutils does this, and currently creates SeqFeature objects. Any additional CompoundLocation stuff can easily be added, as long as there&#39;s a gffutils database to get relationship info from. Likewise, assuming presence of a db, Brad&#39;s scripts can easily be ported. I can certainly work on this.<br>

<br>

So I guess the big question is if you want to introduce all the sqlite3 machinery to BioPython in order to access relationship info, or just use the parser.<br></blockquote></div><br></div><div class="gmail_extra">I think we&#39;re happy to use sqlite3 wherever it&#39;s a sensible engineering choice, since it&#39;s part of the standard library. Biopython users may want the option to skip the database if parent-child relationships are not needed, or keep it in RAM to avoid hitting the disk.<br><br><br></div><div class="gmail_extra">-Eric<br></div></div>
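<div dir="ltr">To make the in-RAM option concrete, here is a rough sketch of the parse, store, query flow using only the standard library. It is not gffutils&#39; actual schema or API: the table layout and the names parse_attributes, build_db and descendants are made up for illustration, every feature is assumed to carry an ID attribute, and the recursive query assumes SQLite 3.8.3 or newer. Passing a file path instead of &quot;:memory:&quot; would give the on-disk variant.</div>

```python
import sqlite3
from urllib.parse import unquote

# Tiny GFF3 example (tab-separated); hypothetical data for illustration only.
GFF3_TEXT = """\
##gff-version 3
chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=demo
chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\texample\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1
chr1\texample\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1
"""


def parse_attributes(column):
    """Parse a GFF3 attribute column into {key: [values]}.

    Handles only the plain GFF3 form (key=value pairs separated by ';',
    multiple values separated by ','), not the GTF or pathological dialects.
    """
    attrs = {}
    for field in column.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def build_db(gff_text, path=":memory:"):
    """Load features and parent/child links into SQLite (in RAM by default)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE features (id TEXT PRIMARY KEY, seqid TEXT, "
        "featuretype TEXT, feat_start INTEGER, feat_end INTEGER, strand TEXT)"
    )
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        seqid, _src, ftype, start, end, _score, strand, _phase, attr_col = (
            line.split("\t")
        )
        attrs = parse_attributes(attr_col)
        fid = attrs["ID"][0]  # assumption: every feature has an ID
        conn.execute(
            "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?)",
            (fid, seqid, ftype, int(start), int(end), strand),
        )
        for parent in attrs.get("Parent", []):
            conn.execute("INSERT INTO relations VALUES (?, ?)", (parent, fid))
    conn.commit()
    return conn


def descendants(conn, parent_id, featuretype=None):
    """Return all features below parent_id, optionally filtered by type."""
    sql = """
        WITH RECURSIVE sub(id) AS (
            SELECT child FROM relations WHERE parent = ?
            UNION
            SELECT r.child FROM relations r JOIN sub s ON r.parent = s.id
        )
        SELECT f.id, f.featuretype, f.feat_start, f.feat_end
        FROM features f JOIN sub s ON f.id = s.id
        WHERE ? IS NULL OR f.featuretype = ?
        ORDER BY f.feat_start, f.id
    """
    return conn.execute(sql, (parent_id, featuretype, featuretype)).fetchall()


if __name__ == "__main__":
    db = build_db(GFF3_TEXT)
    for row in descendants(db, "gene1", featuretype="exon"):
        print(row)
```

<div dir="ltr">Users who only need the parsed attributes could call parse_attributes on its own and skip the database entirely, which is the split between parser and relationship machinery discussed above.</div>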