<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <div class="moz-cite-prefix">On 06/24/2015 05:54 PM, Eric Talevich

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAMC681=6NOf=r7Tjw94qYwvWwjBdLvPTex9LDJ944pow=s1JPA@mail.gmail.com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <div dir="ltr">

        <div class="gmail_extra">

          <div class="gmail_quote">On Wed, Jun 10, 2015 at 12:15 PM,

            Ryan Dale <span dir="ltr">&lt;<a moz-do-not-send="true"

                href="mailto:dalerr@niddk.nih.gov" target="_blank">dalerr@niddk.nih.gov</a>&gt;</span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

              0.8ex;border-left:1px solid

              rgb(204,204,204);padding-left:1ex">

              <div class="">

                <div class="h5"><br>

                  On 06/10/2015 01:44 PM, Brad Chapman wrote:<br>

                  <blockquote class="gmail_quote" style="margin:0px 0px

                    0px 0.8ex;border-left:1px solid

                    rgb(204,204,204);padding-left:1ex"> Eric and Peter;<br>

                    Thanks again for moving this forward. cc'ing in Ryan

                    as well, in case he<br>

                    hasn't seen this discussion.<br>

                    <br>

                    <blockquote class="gmail_quote" style="margin:0px

                      0px 0px 0.8ex;border-left:1px solid

                      rgb(204,204,204);padding-left:1ex"> So I suppose

                      the remaining tasks are, in no particular order:<br>

                      <br>

                      - Add/port Brad's GFF-GenBank converters and tests

                      to Biopython. Ensure all<br>

                      the tests pass.<br>

                    </blockquote>

                    I'd suggest moving those scripts to use gffutils,

                    rather than rely on<br>

                    bcbb/gff. Ryan's implementation is better and I'd

                    prefer to deprecate<br>

                    mine and move forward with his work.<br>

                    <br>

                    <blockquote class="gmail_quote" style="margin:0px

                      0px 0px 0.8ex;border-left:1px solid

                      rgb(204,204,204);padding-left:1ex"> - Enable GFF3

                      support by merging or porting from Brad's branch,

                      bcbb/gff,<br>

                      or gffutils?<br>

                    </blockquote>

                    My vote is for gffutils.<br>

                    <br>

                    <blockquote class="gmail_quote" style="margin:0px

                      0px 0px 0.8ex;border-left:1px solid

                      rgb(204,204,204);padding-left:1ex"> What to add

                      for parent/child relationships between features is<br>

                      <blockquote class="gmail_quote" style="margin:0px

                        0px 0px 0.8ex;border-left:1px solid

                        rgb(204,204,204);padding-left:1ex"> yet to be

                        decided.<br>

                      </blockquote>

                      I wonder if we can follow the lead of one of the

                      GFF implementations<br>

                      mentioned above.<br>

                      <br>

                      Has this been discussed in a more recent thread

                      that I didn't link<br>

                      here?<br>

                    </blockquote>

                    I lost this as well so am not sure the best starting

                    place. I don't have<br>

                    a strong opinion and open to doing whatever y'all

                    think is best.<br>

                    <br>

                    Thanks again,<br>

                    Brad<br>

                  </blockquote>

                  <br>

                </div>

              </div>

              Hi all -<br>

              <br>

              Brad, thanks for the CC.  I'd be happy to help out getting

              any/all of gffutils into BioPython. Let me give a

              high-level overview so you can decide what makes sense to

              bring into BioPython . . .<br>

            </blockquote>

            <div><br>

            </div>

            <div>Awesome. (Sorry for the lag.) I've looked through the

              gffutils code to see how this might work.<br>

              <br>

              Starting with the most mundane, I see that gffutils has

              these dependencies (Biopython aims for a functioning

              dependency-free installation):<br>

              <br>

              <span class=""><span class=""></span>six<span class=""> --

                  Could port using Bio._py3k, straightforward but

                  monotonous work.<br>

                </span></span><span class=""><span class=""></span><br>

                argh<span class="">, argcomplete -- Only for argument

                  parsing in the "gffutils-cli" script, maybe not needed

                  in Biopython.<br>

                </span></span><span class=""><span class=""></span><span

                  class=""><br>

                </span></span><span class=""><span class=""></span>simplejson<span

                  class=""> -- I think this is roughly the same code as

                  the "json" module in the standard library of Python

                  2.6+. Since Biopython doesn't support Python 2.5

                  anymore, we can probably just import "json" instead of

                  "simplejson" in feature.py and helpers.py.<br>

                  <br>

                </span></span><span class=""><span class=""></span>pyfaidx<span

                  class=""> -- This takes some consideration. <span

                    id="goog_685299984"></span>Since GSoC 2014<span

                    id="goog_685299985"></span>, Biopython can index a

                  genome-scale FASTA file with sqlite3 using its own

                  index format, not the samtools faidx format. I don't

                  see a ton of pygr-style indexing in gffutils beyond

                  just extracting the specified subsequences from a

                  FASTA file, so Biopython's internal solution may

                  suffice. This is not yet merged; the pull request is

                  here:<br>

                  <a moz-do-not-send="true"

                    href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>

                  <br>

                  If reading the .fai file is mandatory but writing it

                  is not, then I can contribute a minimal ~100-line

                  implementation of that (which could alternatively go

                  into Biopython if we prefer):<br>

                  <a moz-do-not-send="true"

href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a><br>

                  <br>

                </span></span></div>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

              0.8ex;border-left:1px solid

              rgb(204,204,204);padding-left:1ex"> There are two main

              tricky parts to working with GFF/GTF: parsing the

              attributes and inferring the hierarchy of parent/child

              relationships.<br>

              <br>

              The parsing is mostly self-contained in gffutils.parser.

              It borrows the idea of a "dialect" from the built-in

              Python csv module, and the kinds of trickiness we see in

              Brad's pathological cases are encoded in the fields of the

              dialect (see comments in the gffutils.constants.dialect

              dictionary).<br>

            </blockquote>

            <div><br>

            </div>

            <div>This looks valuable to have in Biopython even without

              inferring parent-child relationships. Would it be possible

              to start by extracting and merging the GFF3 parser, and

              work on the parent-child relationships separately?<br>

              <br>

            </div>

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

              0.8ex;border-left:1px solid

              rgb(204,204,204);padding-left:1ex"> The relationships are

              by far the hardest. I could write a lot about the

              difficulties of GFF vs GTF, but let's just say a sqlite3

              db is the most portable and performant way I've found to

              use both GFF and GTF and interconvert between them. The

              bulk of gffutils' code and complexity is for working on

              this task.<br>

              <br>

              Converting GFF to BioPython objects while reliably keeping

              track of parent/child relations requires parsing the

              entire file, creating a database, and then querying the db

              for the relations. gffutils does this, and currently

              creates SeqFeatures objects. Any additional

              CompoundLocation stuff can easily be added, as long as

              there's a gffutils database to get relationship info from.

              Likewise, assuming presence of a db, Brad's scripts can

              easily be ported. I can certainly work on this.<br>

              <br>

              So I guess the big question is if you want to introduce

              all the sqlite3 machinery to BioPython in order to access

              relationship info, or just use the parser.<br>

            </blockquote>

          </div>

          <br>

        </div>

        <div class="gmail_extra">I think we're happy to use sqlite3

          wherever it's a sensible engineering choice, since it's part

          of the standard library. Biopython users may want the option

          to skip the database if parent-child relationships are not

          needed, or keep it in RAM to avoid hitting the disk.<br>

          <br>

          <br>

        </div>

        <div class="gmail_extra">-Eric<br>

        </div>

      </div>

    </blockquote>

    <br>

    Hi Eric - <br>

    <br>

    Regarding the dependencies, I'm pretty sure we can drop them all

    (I'll address them individually below) for integration with

    Biopython. I'm imagining gffutils will remain as an independent

    project, with a subset of useful parts shared with Biopython. In

    this case, it would be important to minimize the "merge barrier" to

    facilitate bugfixes/improvements between code bases in both

    directions.<br>

    <br>

    With that in mind:<br>

    <br>

    <blockquote type="cite"><span class=""><span class=""></span>six<span

          class=""> -- Could port using Bio._py3k, straightforward but

          monotonous work.</span></span></blockquote>

    <br>

    I think the way to address this is to make a copy of Bio._py3k in

    gffutils and use that to replace six in the main gffutils repo. Then

    picking-and-choosing pieces of gffutils to put in Biopython will be

    straightforward: just edit the import from "gffutils._py3k" to

    "Bio._py3k".<br>

    <br>

    <blockquote type="cite"><span class="">argh<span class="">,

          argcomplete -- Only for argument parsing in the "gffutils-cli"

          script, maybe not needed in Biopython.</span></span></blockquote>

    <br>

    Agreed, nothing targeted for Biopython integration needs these.<br>

    <br>

    <blockquote type="cite"><span class="">simplejson<span class=""> --

          I think this is roughly the same code as the "json" module in

          the standard library of Python 2.6+. Since Biopython doesn't

          support Python 2.5 anymore, we can probably just import "json"

          instead of "simplejson" in feature.py and helpers.py.</span></span></blockquote>

    <br>

    simplejson won in some performance benchmarks I had done, but not by

    a huge amount. It's a drop-in replacement for json, so this should

    be a straightforward fix.<br>
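    <br>
    As a rough sketch of one way to handle it: the fallback import below
    is only my assumption about how we might do it, not something that
    is in gffutils today (a plain "import json" would work just as
    well), and the attributes dictionary is made up for illustration:<br>
    <pre>
```python
# Prefer simplejson when it happens to be installed, otherwise fall back
# to the standard-library json module. Both expose dumps() and loads().
try:
    import simplejson as json  # optional third-party package
except ImportError:
    import json  # standard library since Python 2.6

# Hypothetical attributes dictionary, for illustration only.
attributes = {"ID": ["gene1"], "Name": ["abc"]}

# Round-trip the dictionary through a JSON string.
encoded = json.dumps(attributes, sort_keys=True)
decoded = json.loads(encoded)
```
</pre>
    Either way the calling code stays the same, so nothing else should
    need to change.<br>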

    <br>

    <blockquote type="cite"><span class="">pyfaidx<span class=""> --

          This takes some consideration. <span id="goog_685299984"></span>Since

          GSoC 2014<span id="goog_685299985"></span>, Biopython can

          index a genome-scale FASTA file with sqlite3 using its own

          index format, not the samtools faidx format. I don't see a ton

          of pygr-style indexing in gffutils beyond just extracting the

          specified subsequences from a FASTA file, so Biopython's

          internal solution may suffice. This is not yet merged; the

          pull request is here:<br>

          <a href="https://github.com/biopython/biopython/pull/356">https://github.com/biopython/biopython/pull/356</a><br>

          <br>

          If reading the .fai file is mandatory but writing it is not,

          then I can contribute a minimal ~100-line implementation of

          that (which could alternatively go into Biopython if we

          prefer):<br>

          <a class="moz-txt-link-freetext"

href="https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py">https://github.com/etal/cnvkit/blob/master/cnvlib/ngfrills/faidx.py</a></span></span></blockquote>

    <br>

    Right, it's only used for extracting subsequences from a FASTA.

    Could either drop this functionality altogether, or use the options

    you suggest. I would prefer the Biopython internal solution, since a

    read-only approach limits the user to pre-constructed indexes -- or

    requires them to install additional dependencies to create their

    own.<br>

    <br>

    <blockquote type="cite">Would it be possible to start by extracting

      and merging the GFF3 parser, and work on the parent-child

      relationships separately?</blockquote>

    <br>

    Certainly, I think this is a good strategy. It would also be good
    to hold off on the sequence extraction for now.<br>

    <br>

    <blockquote type="cite">I think we're happy to use sqlite3 wherever

      it's a sensible engineering choice, since it's part of the

      standard library. Biopython users may want the option to skip the

      database if parent-child relationships are not needed, or keep it

      in RAM to avoid hitting the disk.<br>

    </blockquote>

    <br>

    If parent-child relationships are not needed, then maybe all you

    need to do is parse:<br>

    <br>

    <pre>from gffutils.iterators import DataIterator

for feature in DataIterator('annotations.gff'):
    # do something with feature, e.g. print it
    print(feature)</pre>

    <br>

    The attributes-field-parsing machinery ported from Brad's code

    handles the pathological attributes fields, but other than that it's
    pretty trivial and, like any line-by-line parser, uses very little
    RAM.<br>
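    <br>
    To give a feel for what the simple case involves, here is a minimal
    stand-alone sketch using only the standard library (Python 3). To be
    clear, this is not the gffutils parser: the function names are
    hypothetical, and it only copes with well-formed GFF3 attributes,
    none of the pathological cases:<br>
    <pre>
```python
from urllib.parse import unquote


def parse_attributes(field):
    """Split a well-formed GFF3 attributes column into a dict of lists."""
    attributes = {}
    if field.strip() == ".":
        return attributes  # "." means the column is empty
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = item.partition("=")
        # GFF3 percent-encodes reserved characters, e.g. ";" becomes %3B
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


def iter_features(lines):
    """Yield one dict per feature line; only the current line is held in RAM."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a nine-column feature line
        yield {
            "seqid": fields[0],
            "featuretype": fields[2],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "strand": fields[6],
            "attributes": parse_attributes(fields[8]),
        }
```
</pre>
    Since iter_features() accepts any iterable of lines, an open file
    handle can be passed in directly and is consumed one line at a time.<br>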

    <br>

    Creating parent-child relationships is a different beast. This takes

    a lot longer and is a lot more complex:<br>

    <br>

    from gffutils import create_db<br>

    db = create_db('annotations.gff', 'annotations.db')<br>

    <br>

    I should point out that using ":memory:" for the database name puts

    it in RAM. The downside is no persistence, so the time cost of

    parsing/constructing the db (~10 mins for 1.2M-feature human GENCODE

    GTF) has to be spent every time.<br>
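    <br>
    For anyone who hasn't used it, here is a toy illustration of the
    ":memory:" idea with nothing but the standard-library sqlite3
    module. The two-column table, the function names, and the feature
    IDs are all hypothetical and much simpler than the real gffutils
    schema; it also assumes an SQLite new enough (3.8.3+) to support
    recursive queries:<br>
    <pre>
```python
import sqlite3

# Hypothetical (child, parent) pairs, as might be collected from the
# Parent= attributes of a GFF3 file.
PAIRS = [("mRNA1", "gene1"), ("exon1", "mRNA1"), ("exon2", "mRNA1")]


def build_relations(dbfn=":memory:"):
    """Create a toy relations table; ":memory:" keeps the database in RAM."""
    conn = sqlite3.connect(dbfn)
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    conn.executemany(
        "INSERT INTO relations (parent, child) VALUES (?, ?)",
        [(parent, child) for child, parent in PAIRS],
    )
    conn.commit()
    return conn


def descendants(conn, feature_id):
    """Return the IDs of all children, grandchildren, etc. of feature_id."""
    query = """
        WITH RECURSIVE found(id) AS (
            SELECT child FROM relations WHERE parent = ?
            UNION
            SELECT r.child FROM relations AS r JOIN found ON r.parent = found.id
        )
        SELECT id FROM found ORDER BY id
    """
    return [row[0] for row in conn.execute(query, (feature_id,))]


conn = build_relations()
gene_parts = descendants(conn, "gene1")
mrna_parts = descendants(conn, "mRNA1")
```
</pre>
    Passing a filename instead of ":memory:" to build_relations() would
    persist the table on disk, which is the trade-off described above.<br>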

    <br>

    Anyway, I think the next step is to get a draft PR going to iron out

    the details of parser integration. Where do you want this to live?

    Bio.GFF? Bio.GTF? <br>

    <br>

    -ryan<br>

  </body>

</html>