[Biopython-dev] Project ideas for GSoC (or other student projects)

Thu Mar 21 17:42:19 UTC 2013

On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> > I like Michiel's idea, and I'll suggest two more:
> >
> > 1. Codon alignment & analysis:
> > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a
> > protein sequence alignment to a codon alignment. (Previously discussed)
>
> e.g.
> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py

Well, check you out. Would you be interested in mentoring this project?
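
For applicants who want something concrete, the core of the PAL2NAL-style
step can be sketched in a few lines of plain Python. This is only an
illustration, not the code from the script linked above: the function name
thread_codons is made up here, and it assumes the nucleotide sequence is
ungapped, in frame, and has no trailing stop codon. It does not check that
the codons actually translate to the protein residues.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Thread an ungapped nucleotide sequence onto one gapped protein row.

    Hypothetical helper for illustration; not part of Biopython.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError(
            "expected %d nucleotides, got %d" % (3 * residues, len(nucleotides))
        )
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            # A protein gap becomes a whole-codon gap.
            codons.append(gap * 3)
        else:
            # Consume the next three nucleotides for this residue.
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(codons)


codon_row = thread_codons("M-K", "ATGAAA")
```

Applying this to every row of the protein alignment gives rows of equal
length (three times the protein alignment length), which is the codon
alignment.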

> > - dN/dS and the related functions needed to calculate it.
> > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage
> > of codon alignments, including validation (testing for frame shifts etc.)
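
On the validation point, a rough sketch of what a per-row frame check could
look like. The name check_codon_row is hypothetical (nothing like it exists
in AlignIO or MultipleSeqAlignment as far as this thread goes), and it only
looks at row length and at gaps that split a codon:

```python
def check_codon_row(row, gap="-"):
    """Return a list of frame problems found in one codon-alignment row.

    Hypothetical helper for illustration; not part of Biopython.
    """
    problems = []
    leftover = len(row) % 3
    if leftover:
        problems.append("row length %d is not a multiple of 3" % len(row))
    # Walk the complete codons only; any leftover tail was reported above.
    for start in range(0, len(row) - leftover, 3):
        codon = row[start:start + 3]
        if codon.count(gap) not in (0, 3):
            problems.append(
                "partial gap in codon %r at column %d" % (codon, start)
            )
    return problems


ok = check_codon_row("ATG---AAA")
shifted = check_codon_row("AT-GAA")
```

An empty list means the row passed both checks; a one-base gap shows up as
a partial-gap message for the codon it falls in.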
>
>
> http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis
>
> I see you've started fleshing this idea out on the wiki, which is great.
> Right now it seems a little on the lightweight side - or is that
> deliberate (to see if a student can take this idea and come up with a
> solid project proposal in this area)? Things like model selection might
> be a fun extension - I can think of a local expert who would be
> great to get involved on the science side if he's interested.
>

I put up a quick sketch to avoid locking the wiki page for too long, but
also deliberately left it vague to see where the applicants take it. Model
selection would be cool, so I added it. A local expert would be great, too.

> Alternatively this could include doing some more general work
> on the alignment object - for instance per-column-annotation
> for things like a consensus sequence - or an array-of-char
> implementation as an alternative to the list-of-SeqRecords
> we have now (with its poor column access speed).
>
> Peter
>

I wonder if that's something we could just do incrementally -- change the
MultipleSeqAlignment class to store a list-of-lists-of-chars (or
list-of-strings), a list of SeqRecord-like husks (all the annotations, but
without the Seq itself) for each row, a list of column annotations, and a
single alphabet for the whole alignment.
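
To make that layout concrete, here is a minimal sketch using the
list-of-strings variant. The class name ColumnarAlignment, the row_info
dicts standing in for the SeqRecord-like husks, and the column() method are
all assumptions made for illustration; only get_alignment_length is
borrowed from the existing MultipleSeqAlignment naming.

```python
class ColumnarAlignment:
    """Hypothetical container: rows as plain strings, metadata kept apart."""

    def __init__(self, rows, row_info=None, column_annotations=None,
                 alphabet=None):
        rows = [str(r) for r in rows]
        if len(set(len(r) for r in rows)) > 1:
            raise ValueError("all rows must have the same length")
        self.rows = rows
        # One "husk" per row: id, description, annotations, but no sequence.
        if row_info is None:
            row_info = [{} for _ in rows]
        self.row_info = list(row_info)
        if len(self.row_info) != len(rows):
            raise ValueError("need exactly one metadata entry per row")
        # Per-column annotations, e.g. {"consensus": "ACGT"}.
        self.column_annotations = dict(column_annotations or {})
        width = self.get_alignment_length()
        for key, value in self.column_annotations.items():
            if len(value) != width:
                raise ValueError("annotation %r has the wrong length" % key)
        # A single alphabet shared by the whole alignment.
        self.alphabet = alphabet

    def __len__(self):
        return len(self.rows)

    def get_alignment_length(self):
        return len(self.rows[0]) if self.rows else 0

    def column(self, index):
        """Return one column as a string by indexing each row string."""
        return "".join(row[index] for row in self.rows)


aln = ColumnarAlignment(
    ["ACGT", "A-GT", "ACGA"],
    row_info=[{"id": "a"}, {"id": "b"}, {"id": "c"}],
    column_annotations={"consensus": "ACGT"},
)
second_column = aln.column(1)
```

Column access here is still one Python-level index per row, so it avoids
going through each SeqRecord and Seq but is not free.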

How do you suppose the speed of that would compare to the current
list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would the
speed improvement be significant enough both to justify replacing the
current implementation and to make the NumPy approach less tempting (given
PyPy's progress toward including a compliant NumPy implementation)?
Alternatively, we could post a GSoC project for creating a separate
TurboAlignment class/module based on NumPy which would be mostly
interchangeable and interconvertible with the pure-Python version in the
Biopython core.
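
As a starting point for such a proposal, a very small sketch of the
NumPy-backed idea. Everything beyond the TurboAlignment name is an
assumption: the method names, the one-byte-per-character "S1" matrix, and
the ASCII-only restriction are choices made for this example, and it needs
NumPy installed.

```python
import numpy as np


class TurboAlignment:
    """Hypothetical NumPy-backed alignment: one byte per character."""

    def __init__(self, rows):
        rows = [str(r) for r in rows]
        if not rows:
            raise ValueError("need at least one row")
        width = len(rows[0])
        if any(len(r) != width for r in rows):
            raise ValueError("all rows must have the same length")
        flat = "".join(rows).encode("ascii")
        # View the concatenated rows as a (n_rows, width) matrix of bytes.
        self._matrix = np.frombuffer(flat, dtype="S1").reshape(len(rows), width)

    def column(self, index):
        """Return one column as a string via a strided slice."""
        return self._matrix[:, index].tobytes().decode("ascii")

    def row(self, index):
        """Return one row as a string."""
        return self._matrix[index].tobytes().decode("ascii")

    def to_rows(self):
        """Convert back to a plain list of strings."""
        return [self.row(i) for i in range(self._matrix.shape[0])]


turbo = TurboAlignment(["ACGT", "A-GT", "ACGA"])
second_column = turbo.column(1)
```

The to_rows() round trip is the "interconvertible" part: the resulting
strings could be handed back to whatever builds the pure-Python alignment.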

Speaking of which, should we also post the idea of storing sequences as an
efficient byte array, BioJava-style?
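
For the sake of discussion, roughly what that could look like with the
built-in bytearray. The ByteSeq class below is purely hypothetical (it is
not how Seq works today, and makes no claim about BioJava's internals); it
assumes ASCII sequence data and only complements unambiguous DNA letters.

```python
class ByteSeq:
    """Hypothetical sequence type storing one byte per letter."""

    # Translation table for unambiguous DNA; other bytes pass through as-is.
    _COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")

    def __init__(self, data):
        if isinstance(data, str):
            data = data.encode("ascii")
        self._data = bytearray(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns a new ByteSeq with its own copy of the bytes.
            return ByteSeq(self._data[index])
        return chr(self._data[index])

    def __str__(self):
        return self._data.decode("ascii")

    def reverse_complement(self):
        """Complement with a byte-level translate, then reverse."""
        return ByteSeq(self._data.translate(self._COMPLEMENT)[::-1])


seq = ByteSeq("ATGC")
rc = str(seq.reverse_complement())
```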

-Eric