<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <br>
    <div class="moz-cite-prefix">On 06/24/2015 05:54 PM, Eric Talevich
      wrote:<br>
    </div>
    <blockquote
cite="mid:CAMC681=6NOf=r7Tjw94qYwvWwjBdLvPTex9LDJ944pow=s1JPA@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">On Wed, Jun 10, 2015 at 12:15 PM,
            Ryan Dale <span dir="ltr">&lt;<a moz-do-not-send="true"
                href="mailto:dalerr@niddk.nih.gov" target="_blank">dalerr@niddk.nih.gov</a>&gt;</span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">
              <div class="">
                <div class="h5"><br>
                  On 06/10/2015 01:44 PM, Brad Chapman wrote:<br>
                  <blockquote class="gmail_quote" style="margin:0px 0px
                    0px 0.8ex;border-left:1px solid
                    rgb(204,204,204);padding-left:1ex"> Eric and Peter;<br>
                    Thanks again for moving this forward. cc'ing in Ryan
                    as well, in case he<br>
                    hasn't seen this discussion.<br>
                    <br>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex"> So I suppose
                      the remaining tasks are, in no particular order:<br>
                      <br>
                      - Add/port Brad's GFF-GenBank converters and tests
                      to Biopython. Ensure all<br>
                      the tests pass.<br>
                    </blockquote>
                    I'd suggest moving those scripts to use gffutils,
                    rather than rely on<br>
                    bcbb/gff. Ryan's implementation is better and I'd
                    prefer to deprecate<br>
                    mine and move forward with his work.<br>
                    <br>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex"> - Enable GFF3
                      support by merging or porting from Brad's branch,
                      bcbb/gff,<br>
                      or gffutils?<br>
                    </blockquote>
                    My vote is for gffutils.<br>
                    <br>
                    <blockquote class="gmail_quote" style="margin:0px
                      0px 0px 0.8ex;border-left:1px solid
                      rgb(204,204,204);padding-left:1ex"> What to add
                      for parent/child relationships between features is<br>
                      <blockquote class="gmail_quote" style="margin:0px
                        0px 0px 0.8ex;border-left:1px solid
                        rgb(204,204,204);padding-left:1ex"> yet to be
                        decided.<br>
                      </blockquote>
                      I wonder if we can follow the lead of one of the
                      GFF implementations<br>
                      mentioned above.<br>
                      <br>
                      Has this been discussed in a more recent thread
                      that I didn't link<br>
                      here?<br>
                    </blockquote>
                    I lost this as well so am not sure the best starting
                    place. I don't have<br>
                    a strong opinion and open to doing whatever y'all
                    think is best.<br>
                    <br>
                    Thanks again,<br>
                    Brad<br>
                  </blockquote>
                  <br>
                </div>
              </div>
              Hi all -<br>
              <br>
              Brad, thanks for the CC.  I'd be happy to help out getting
              any/all of gffutils into BioPython. Let me give a
              high-level overview so you can decide what makes sense to
              bring into BioPython . . .<br>
            </blockquote>
            <div><br>
            </div>
            <div>Awesome. (Sorry for the lag.) I've looked through the
              gffutils code to see how this might work.<br>
              <br>
              Starting with the most mundane, I see that gffutils has
              these dependencies (Biopython aims for a functioning
              dependency-free installation):<br>
              <br>
              <span class=""><span class=""></span>six<span class=""> --
                  Could port using Bio._py3k, straightforward but
                  monotonous work.<br>
                </span></span><span class=""><span class=""></span><br>
                argh<span class="">, argcomplete -- Only for argument
                  parsing in the "gffutils-cli" script, maybe not needed
                  in Biopython.<br>
                </span></span><span class=""><span class=""></span><span
                  class=""><br>
                </span></span><span class=""><span class=""></span>simplejson<span
                  class=""> -- I think this is roughly the same code as
                  the "json" module in the standard library of Python
                  2.6+. Since Biopython doesn't support Python 2.5
                  anymore, we can probably just import "json" instead of
                  "simplejson" in feature.py and helpers.py.<br>
                  <br>
                </span></span><span class=""><span class=""></span>pyfaidx<span
                  class=""> -- This takes some consideration. <span
                    id="goog_685299984"></span>Since GSoC 2014<span
                    id="goog_685299985"></span>, Biopython can index a
                  genome-scale FASTA file with sqlite3 using its own
                  index format, not the samtools faidx format. I don't
                  see a ton of pygr-style indexing in gffutils beyond
                  just extracting the specified subsequences from a
                  FASTA file, so Biopython's internal solution may
                  suffice. This is not yet merged; the pull request is
                  here:<br>
                  <a moz-do-not-send="true"
                    href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>
                  <br>
                  If reading the .fai file is mandatory but writing it
                  is not, then I can contribute a minimal ~100-line
                  implementation of that (which could alternatively go
                  into Biopython if we prefer):<br>
                  <a moz-do-not-send="true"
href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a><br>
                  <br>
                </span></span></div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex"> There are two main
              tricky parts to working with GFF/GTF: parsing the
              attributes and inferring the hierarchy of parent/child
              relationships.<br>
              <br>
              The parsing is mostly self-contained in gffutils.parser.
              It borrows the idea of a "dialect" from the built-in
              Python csv module, and the kinds of trickiness we see in
              Brad's pathological cases are encoded in the fields of the
              dialect (see comments in the gfftutils.constants.dialect
              dictionary).<br>
            </blockquote>
            <div><br>
            </div>
            <div>This looks valuable to have in Biopython even without
              inferring parent-child relationships. Would it be possible
              to start by extracting and merging the GFF3 parser, and
              work on the parent-child relationships separately?<br>
              <br>
            </div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px
              0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex"> The relationships are
              by far the hardest. I could write a lot about the
              difficulties of GFF vs GTF, but let's just say a sqlite3
              db is the most portable and performant way I've found to
              use both GFF and GTF and interconvert between them. The
              bulk of gffutils' code and complexity is for working on
              this task.<br>
              <br>
              Converting GFF to BioPython objects while reliably keeping
              track of parent/child relations requires parsing the
              entire file, creating a database, and then querying the db
              for the relations. gffutils does this, and currently
              creates SeqFeatures objects. Any additional
              CompoundLocation stuff can easily be added, as long as
              there's a gffutils database to get relationship info from.
              Likewise, assuming presence of a db, Brad's scripts can
              easily be ported. I can certainly work on this.<br>
              <br>
              So I guess the big question is if you want to introduce
              all the sqlite3 machinery to BioPython in order to access
              relationship info, or just use the parser.<br>
            </blockquote>
          </div>
          <br>
        </div>
        <div class="gmail_extra">I think we're happy to use sqlite3
          wherever it's a sensible engineering choice, since it's part
          of the standard library. Biopython users may want the option
          to skip the database if parent-child relationships are not
          needed, or keep it in RAM to avoid hitting the disk.<br>
          <br>
          <br>
        </div>
        <div class="gmail_extra">-Eric<br>
        </div>
      </div>
    </blockquote>
    <br>
    Hi Eric - <br>
    <br>
    Regarding the dependencies, I'm pretty sure we can drop them all
    (I'll address them individually below) for integration with
    Biopython. I'm imagining gffutils will remain as an independent
    project, with a subset of useful parts shared with Biopython. In
    this case, it would be important to minimize the "merge barrier" to
    facilitate bugfixes/improvements between code bases in both
    directions.<br>
    <br>
    With that in mind:<br>
    <br>
    <blockquote type="cite"><span class=""><span class=""></span>six<span
          class=""> -- Could port using Bio._py3k, straightforward but
          monotonous work.</span></span></blockquote>
    <br>
    I think the way to address this is to make a copy of Bio._py3k in
    gffutils and use that to replace six in the main gffutils repo. Then
    picking-and-choosing pieces of gffutils to put in Biopython will be
    straightforward: just edit the import from "gffutils._py3k" to
    "Bio._py3k".<br>
    <br>
    <blockquote type="cite"><span class="">argh<span class="">,
          argcomplete -- Only for argument parsing in the "gffutils-cli"
          script, maybe not needed in Biopython.</span></span></blockquote>
    <br>
    Agreed, nothing targeted for Biopython integration needs these.<br>
    <br>
    <blockquote type="cite"><span class="">simplejson<span class=""> --
          I think this is roughly the same code as the "json" module in
          the standard library of Python 2.6+. Since Biopython doesn't
          support Python 2.5 anymore, we can probably just import "json"
          instead of "simplejson" in feature.py and helpers.py.</span></span></blockquote>
    <br>
    simplejson won in some performance benchmarks I had done, but not by
    a huge amount. It's a drop-in replacement for json, so this should
    be a straightforward fix.<br>
    <br>
    <blockquote type="cite"><span class="">pyfaidx<span class=""> --
          This takes some consideration. <span id="goog_685299984"></span>Since

          GSoC 2014<span id="goog_685299985"></span>, Biopython can
          index a genome-scale FASTA file with sqlite3 using its own
          index format, not the samtools faidx format. I don't see a ton
          of pygr-style indexing in gffutils beyond just extracting the
          specified subsequences from a FASTA file, so Biopython's
          internal solution may suffice. This is not yet merged; the
          pull request is here:<br>
          <a href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>
          <br>
          If reading the .fai file is mandatory but writing it is not,
          then I can contribute a minimal ~100-line implementation of
          that (which could alternatively go into Biopython if we
          prefer):<br>
          <a class="moz-txt-link-freetext"
href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a></span></span></blockquote>
    <br>
    Right, it's only used for extracting subsequences from a FASTA.
    Could either drop this functionality altogether, or use the options
    you suggest. I would prefer the Biopython internal solution, since a
    read-only approach limits the user to pre-constructed indexes -- or
    requires them to install additional dependencies to create their
    own.<br>
    <br>
    <blockquote type="cite">Would it be possible to start by extracting
      and merging the GFF3 parser, and work on the parent-child
      relationships separately?</blockquote>
    <br>
    Certainly. I think this is a good strategy. I think it would be good
    to hold off on the sequence extraction for now as well.<br>
    <br>
    <blockquote type="cite">I think we're happy to use sqlite3 wherever
      it's a sensible engineering choice, since it's part of the
      standard library. Biopython users may want the option to skip the
      database if parent-child relationships are not needed, or keep it
      in RAM to avoid hitting the disk.<br>
    </blockquote>
    <br>
    If parent-child relationships are not needed, then maybe all you
    need to do is parse:<br>
    <br>
    from gffutils.iterators import DataIterator<br>
    for feature in DataIterator('annotations.gff'):<br>
        # do something with feature<br>
    <br>
    The attributes-field-parsing machinery ported from Brad's code
    handles the pathological attributes fields, but other than that it's
    pretty trivial and like any line-by-line parser uses very little
    RAM.<br>
    <br>
    Creating parent-child relationships is a different beast. This takes
    a lot longer and is a lot more complex:<br>
    <br>
    from gffutils import create_db<br>
    db = create_db('annotations.gff', 'annotations.db')<br>
    <br>
    I should point out that using ":memory:" for the database name puts
    it in RAM. The downside is no persistence, so the time cost of
    parsing/constructing the db (~10 mins for 1.2M-feature human GENCODE
    GTF) has to be spent every time.<br>
    <br>
    Anyway, I think the next step is to get a draft PR going to iron out
    the details of parser integration. Where do you want this this live?
    Bio.GFF? Bio.GTF? <br>
    <br>
    -ryan<br>
  </body>
</html>