<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<div class="moz-cite-prefix">On 06/24/2015 05:54 PM, Eric Talevich
wrote:<br>
</div>
<blockquote
cite="mid:CAMC681=6NOf=r7Tjw94qYwvWwjBdLvPTex9LDJ944pow=s1JPA@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On Wed, Jun 10, 2015 at 12:15 PM,
Ryan Dale <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:dalerr@niddk.nih.gov" target="_blank">dalerr@niddk.nih.gov</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div class="">
<div class="h5"><br>
On 06/10/2015 01:44 PM, Brad Chapman wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> Eric and Peter;<br>
Thanks again for moving this forward. cc'ing in Ryan
as well, in case he<br>
hasn't seen this discussion.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> So I suppose
the remaining tasks are, in no particular order:<br>
<br>
- Add/port Brad's GFF-GenBank converters and tests
to Biopython. Ensure all<br>
the tests pass.<br>
</blockquote>
I'd suggest moving those scripts to use gffutils,
rather than rely on<br>
bcbb/gff. Ryan's implementation is better and I'd
prefer to deprecate<br>
mine and move forward with his work.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> - Enable GFF3
support by merging or porting from Brad's branch,
bcbb/gff,<br>
or gffutils?<br>
</blockquote>
My vote is for gffutils.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> What to add
for parent/child relationships between features is<br>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> yet to be
decided.<br>
</blockquote>
I wonder if we can follow the lead of one of the
GFF implementations<br>
mentioned above.<br>
<br>
Has this been discussed in a more recent thread
that I didn't link<br>
here?<br>
</blockquote>
I lost this as well so am not sure the best starting
place. I don't have<br>
a strong opinion and open to doing whatever y'all
think is best.<br>
<br>
Thanks again,<br>
Brad<br>
</blockquote>
<br>
</div>
</div>
Hi all -<br>
<br>
Brad, thanks for the CC. I'd be happy to help out getting
any/all of gffutils into BioPython. Let me give a
high-level overview so you can decide what makes sense to
bring into BioPython . . .<br>
</blockquote>
<div><br>
</div>
<div>Awesome. (Sorry for the lag.) I've looked through the
gffutils code to see how this might work.<br>
<br>
Starting with the most mundane, I see that gffutils has
these dependencies (Biopython aims for a functioning
dependency-free installation):<br>
<br>
<span class=""><span class=""></span>six<span class=""> --
Could port using Bio._py3k, straightforward but
monotonous work.<br>
</span></span><span class=""><span class=""></span><br>
argh<span class="">, argcomplete -- Only for argument
parsing in the "gffutils-cli" script, maybe not needed
in Biopython.<br>
</span></span><span class=""><span class=""></span><span
class=""><br>
</span></span><span class=""><span class=""></span>simplejson<span
class=""> -- I think this is roughly the same code as
the "json" module in the standard library of Python
2.6+. Since Biopython doesn't support Python 2.5
anymore, we can probably just import "json" instead of
"simplejson" in feature.py and helpers.py.<br>
<br>
</span></span><span class=""><span class=""></span>pyfaidx<span
class=""> -- This takes some consideration. <span
id="goog_685299984"></span>Since GSoC 2014<span
id="goog_685299985"></span>, Biopython can index a
genome-scale FASTA file with sqlite3 using its own
index format, not the samtools faidx format. I don't
see a ton of pygr-style indexing in gffutils beyond
just extracting the specified subsequences from a
FASTA file, so Biopython's internal solution may
suffice. This is not yet merged; the pull request is
here:<br>
<a moz-do-not-send="true"
href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>
<br>
If reading the .fai file is mandatory but writing it
is not, then I can contribute a minimal ~100-line
implementation of that (which could alternatively go
into Biopython if we prefer):<br>
<a moz-do-not-send="true"
href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a><br>
<br>
</span></span></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> There are two main
tricky parts to working with GFF/GTF: parsing the
attributes and inferring the hierarchy of parent/child
relationships.<br>
<br>
The parsing is mostly self-contained in gffutils.parser.
It borrows the idea of a "dialect" from the built-in
Python csv module, and the kinds of trickiness we see in
Brad's pathological cases are encoded in the fields of the
dialect (see comments in the gfftutils.constants.dialect
dictionary).<br>
</blockquote>
<div><br>
</div>
<div>This looks valuable to have in Biopython even without
inferring parent-child relationships. Would it be possible
to start by extracting and merging the GFF3 parser, and
work on the parent-child relationships separately?<br>
<br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"> The relationships are
by far the hardest. I could write a lot about the
difficulties of GFF vs GTF, but let's just say a sqlite3
db is the most portable and performant way I've found to
use both GFF and GTF and interconvert between them. The
bulk of gffutils' code and complexity is for working on
this task.<br>
<br>
Converting GFF to BioPython objects while reliably keeping
track of parent/child relations requires parsing the
entire file, creating a database, and then querying the db
for the relations. gffutils does this, and currently
creates SeqFeatures objects. Any additional
CompoundLocation stuff can easily be added, as long as
there's a gffutils database to get relationship info from.
Likewise, assuming presence of a db, Brad's scripts can
easily be ported. I can certainly work on this.<br>
<br>
So I guess the big question is if you want to introduce
all the sqlite3 machinery to BioPython in order to access
relationship info, or just use the parser.<br>
</blockquote>
</div>
<br>
</div>
<div class="gmail_extra">I think we're happy to use sqlite3
wherever it's a sensible engineering choice, since it's part
of the standard library. Biopython users may want the option
to skip the database if parent-child relationships are not
needed, or keep it in RAM to avoid hitting the disk.<br>
<br>
<br>
</div>
<div class="gmail_extra">-Eric<br>
</div>
</div>
</blockquote>
<br>
Hi Eric - <br>
<br>
Regarding the dependencies, I'm pretty sure we can drop them all
(I'll address them individually below) for integration with
Biopython. I'm imagining gffutils will remain as an independent
project, with a subset of useful parts shared with Biopython. In
this case, it would be important to minimize the "merge barrier" to
facilitate bugfixes/improvements between code bases in both
directions.<br>
<br>
With that in mind:<br>
<br>
<blockquote type="cite"><span class=""><span class=""></span>six<span
class=""> -- Could port using Bio._py3k, straightforward but
monotonous work.</span></span></blockquote>
<br>
I think the way to address this is to make a copy of Bio._py3k in
gffutils and use that to replace six in the main gffutils repo. Then
picking-and-choosing pieces of gffutils to put in Biopython will be
straightforward: just edit the import from "gffutils._py3k" to
"Bio._py3k".<br>
<br>
<blockquote type="cite"><span class="">argh<span class="">,
argcomplete -- Only for argument parsing in the "gffutils-cli"
script, maybe not needed in Biopython.</span></span></blockquote>
<br>
Agreed, nothing targeted for Biopython integration needs these.<br>
<br>
<blockquote type="cite"><span class="">simplejson<span class=""> --
I think this is roughly the same code as the "json" module in
the standard library of Python 2.6+. Since Biopython doesn't
support Python 2.5 anymore, we can probably just import "json"
instead of "simplejson" in feature.py and helpers.py.</span></span></blockquote>
<br>
simplejson won in some performance benchmarks I had done, but not by
a huge amount. It's a drop-in replacement for json, so this should
be a straightforward fix.<br>
<br>
<blockquote type="cite"><span class="">pyfaidx<span class=""> --
This takes some consideration. <span id="goog_685299984"></span>Since
GSoC 2014<span id="goog_685299985"></span>, Biopython can
index a genome-scale FASTA file with sqlite3 using its own
index format, not the samtools faidx format. I don't see a ton
of pygr-style indexing in gffutils beyond just extracting the
specified subsequences from a FASTA file, so Biopython's
internal solution may suffice. This is not yet merged; the
pull request is here:<br>
<a href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>
<br>
If reading the .fai file is mandatory but writing it is not,
then I can contribute a minimal ~100-line implementation of
that (which could alternatively go into Biopython if we
prefer):<br>
<a class="moz-txt-link-freetext"
href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a></span></span></blockquote>
<br>
Right, it's only used for extracting subsequences from a FASTA.
Could either drop this functionality altogether, or use the options
you suggest. I would prefer the Biopython internal solution, since a
read-only approach limits the user to pre-constructed indexes -- or
requires them to install additional dependencies to create their
own.<br>
<br>
<blockquote type="cite">Would it be possible to start by extracting
and merging the GFF3 parser, and work on the parent-child
relationships separately?</blockquote>
<br>
Certainly. I think this is a good strategy. I think it would be good
to hold off on the sequence extraction for now as well.<br>
<br>
<blockquote type="cite">I think we're happy to use sqlite3 wherever
it's a sensible engineering choice, since it's part of the
standard library. Biopython users may want the option to skip the
database if parent-child relationships are not needed, or keep it
in RAM to avoid hitting the disk.<br>
</blockquote>
<br>
If parent-child relationships are not needed, then maybe all you
need to do is parse:<br>
<br>
from gffutils.iterators import DataIterator<br>
for feature in DataIterator('annotations.gff'):<br>
# do something with feature<br>
<br>
The attributes-field-parsing machinery ported from Brad's code
handles the pathological attributes fields, but other than that it's
pretty trivial and like any line-by-line parser uses very little
RAM.<br>
<br>
Creating parent-child relationships is a different beast. This takes
a lot longer and is a lot more complex:<br>
<br>
from gffutils import create_db<br>
db = create_db('annotations.gff', 'annotations.db')<br>
<br>
I should point out that using ":memory:" for the database name puts
it in RAM. The downside is no persistence, so the time cost of
parsing/constructing the db (~10 mins for 1.2M-feature human GENCODE
GTF) has to be spent every time.<br>
<br>
Anyway, I think the next step is to get a draft PR going to iron out
the details of parser integration. Where do you want this this live?
Bio.GFF? Bio.GTF? <br>
<br>
-ryan<br>
</body>
</html>