[Biopython-dev] [Biopython] Update: call for Google Summer of Code project ideas

Eric Talevich eric.talevich at gmail.com
Thu Mar 1 17:49:19 UTC 2012


On Thu, Mar 1, 2012 at 7:02 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Mon, Feb 27, 2012 at 4:24 PM, Robert Buels <rbuels at gmail.com> wrote:
> > Hi all,
> >
> > As kindly pointed out by Reece Hart, the previous email I sent out
> > calling for Google Summer of Code project ideas had the wrong due
> > date for project ideas in it.
> >
> > I actually want them to all be in place by Friday, March 2, which is this
> > coming Friday.
> >
>
> See
> http://lists.open-bio.org/pipermail/biopython/2012-February/007726.html
> for the original complete email.
>
> That deadline is upon us (tomorrow), so where are we with GSoC 2012 ideas?
> http://biopython.org/wiki/Google_Summer_of_Code
>
> Are any of the areas touched on in the "Biopython 1.60 plans and beyond"
> thread suitable?
>

Perhaps:

Bio.Struct
----------

We have a lot of ideas and incomplete pieces of code from
previous GSoCs that could be sorted out in one summer.
However, taking on another GSoC student might just add to
the heap; this might need to be Eric and João's Summer of
Code instead.

Here's one semi-coherent project idea that could fly:

Overhaul Biopython's parsing infrastructure for protein
primary, secondary and tertiary structures

- Refactor PDBParser and parse_pdb_header to allow parsing
  amino-acid sequences from SEQRES lines (header) and ATOM
  records (body) without building the PDB structure object,
  i.e. without using numpy
- Write a pure-Python replacement for parsing mmCIF files.
  (The module MMCIF2Dict already does almost all the work;
  lex+yacc just manages a fairly simple state machine for
  recognizing comments, special sub-sections, etc.)
- Wrap the parsers for PDB, PDBML and mmCIF under a common
  I/O interface under the Bio.Struct namespace
- Add parsing support for protein secondary structures,
  based on the relevant PDB records or (perhaps) DSSP
  output. (Note that João did some work on this already.)
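
To make the first bullet concrete, here is a minimal sketch of
reading per-chain sequences straight from SEQRES lines with no
numpy and no structure object. The function name
seqres_sequences and the lookup table are hypothetical, not
existing Biopython API; mapping unrecognized residue names to
"X" is also just an assumption for the sketch.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(lines):
    """Return {chain_id: one-letter sequence} from PDB SEQRES lines.

    Hypothetical helper: relies only on the fixed-column SEQRES
    layout (chain ID in column 12, residue names in columns 20-70).
    """
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]
        residues = line[19:70].split()
        chains.setdefault(chain_id, []).extend(residues)
    # Assumption: anything outside the 20 standard residues becomes "X".
    return {
        chain_id: "".join(THREE_TO_ONE.get(name, "X") for name in names)
        for chain_id, names in chains.items()
    }
```

Since it accepts any iterable of lines, the same function would
work on an open file handle or a list of header lines.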



Variants
--------

So, from the Biopython 1.60 thread:

- James Casbon has offered to merge PyVCF into Biopython, right?
- BCF, the binary form of VCF (via blocked gzip), may also
  be worthwhile to support
- GVF, the Genome Variation Format, appears to be intended
  to be competitive with VCF. It's probably at least as well
  thought-out as VCF, sight unseen. It's based on GFF.

Synthesizing the above, we have a GSoC project that looks like:

- Help merge PyVCF into Biopython (w/ James's support -- I
  don't mean to volunteer him for this in absentia)?
- Write a GVF parser that emits the same object type as
  PyVCF, potentially also using existing GFF code
- Time permitting, look into blocked gzip support for VCF
  (BCF), also looking at SAM/BAM for inspiration and
  reusable code.
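
A rough sketch of the GVF bullet, under some assumptions: GVF
reuses the nine tab-separated GFF3 columns and carries the
alleles in the Reference_seq and Variant_seq attributes. The
Variant tuple below is a hypothetical stand-in for whatever
record type PyVCF ends up exposing, not its actual class, and
parse_gvf is an illustrative name only.

```python
from collections import namedtuple

# Hypothetical record type, standing in for a shared VCF/GVF object.
Variant = namedtuple("Variant", "chrom pos id ref alt")


def parse_gvf(lines):
    """Yield one Variant per GVF feature line (illustrative only)."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments and "##" pragmas.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError("expected 9 tab-separated columns: %r" % line)
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if field
        )
        if "Variant_seq" in attrs:
            alt = attrs["Variant_seq"].split(",")
        else:
            alt = []
        yield Variant(
            chrom=cols[0],
            pos=int(cols[3]),
            id=attrs.get("ID"),
            ref=attrs.get("Reference_seq"),
            alt=alt,
        )
```

A real implementation would presumably build on the existing
GFF code rather than splitting columns by hand.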



> SearchIO?
> ---------
>
> I'm wondering if a Biopython SearchIO would make a good project,
> that I might supervise. This name is obviously based on BioPerl. I
> would be aiming for iterator based parser/writer framework (like SeqIO
> and AlignIO) for pairwise 'sequence' searches initially, but have also
> been thinking about indexing - at least by query, ideally also by match,
> to allow random access akin to what Bio.SeqIO.index offers.
>
> In some cases the results would also be pairwise sequence alignments,
> in which case some code can be shared/linked with AlignIO. In other
> cases all you get is co-ordinates of the query and match plus some
> kind of score. Therefore this could include a hierarchical SearchIO
> result object structure for minimal matches up to full pairwise alignments.
>
> I'd hope to cover BLAST XML, BLAST tabular, HMMER tabular (not
> really sequence vs sequence, but HMM vs sequence), RPS-BLAST
> (again not really sequence vs sequence). Perhaps this could also tie
> into the Bio.Motif code as well (if we consider things like PSSM vs
> sequence in the same framework).
>
> You can already do some of this in Biopython (e.g. BLAST XML
> parsing, and there is some HMMER work on branches), but I'm
> hoping for a unified API here.
>
>
Interesting. It would be very nice if the objects emitted by SearchIO could
be easily fed into GenomeDiagram.
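
For illustration, the iterator style Peter describes might look
roughly like this for BLAST tabular output. This is only a
sketch: it assumes the default 12-column tabular layout, and
Hit and parse_blast_tab are hypothetical names, not a proposed
API.

```python
from collections import namedtuple
from itertools import groupby

# Assumed column order: the default BLAST+ tabular fields.
Hit = namedtuple(
    "Hit",
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)
_CASTS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_blast_tab(handle):
    """Yield (query_id, [Hit, ...]) pairs, one per query (sketch)."""

    def hits():
        for line in handle:
            line = line.strip()
            # Skip blank lines and the "#" comment lines of commented output.
            if not line or line.startswith("#"):
                continue
            parts = line.split("\t")
            if len(parts) != 12:
                raise ValueError("expected 12 columns: %r" % line)
            yield Hit(*(cast(value) for cast, value in zip(_CASTS, parts)))

    # BLAST writes all hits for a query consecutively, so grouping
    # adjacent lines by query ID is enough; no full read is needed.
    for query_id, group in groupby(hits(), key=lambda hit: hit.qseqid):
        yield query_id, list(group)
```

Each yielded hit carries query and subject coordinates, which
is about what a GenomeDiagram feature would need.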

-Eric



