[Biopython-dev] Project ideas for GSoC (or other student projects)

Thu Mar 21 17:59:10 UTC 2013

On Thu, Mar 21, 2013 at 5:42 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> On Thu, Mar 21, 2013 at 12:55 PM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
>>
>> On Wed, Mar 13, 2013 at 6:32 PM, Eric Talevich <eric.talevich at gmail.com>
>> wrote:
>> > I like Michiel's idea, and I'll suggest two more:
>> >
>> > 1. Codon alignment & analysis:
>> > - PAL2NAL-style conversion of unaligned nucleic acid sequences and a
>> > protein sequence alignment to a codon alignment. (Previously discussed)
>>
>> e.g.
>> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
>
> Well, check you out. Would you be interested in mentoring this project?
>

If I'm not primary mentor on another project, I'd be open to co-mentoring
something on the alignment side.
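
For anyone new to the idea, here is a minimal sketch of the PAL2NAL-style
conversion listed in item 1: thread each unaligned nucleotide sequence onto
its gapped protein row, three bases per residue. The function name
back_translate and its arguments are hypothetical, not an existing Biopython
API, and the sketch only checks lengths; it does not verify that each codon
actually translates to the aligned residue.

```python
def back_translate(aligned_protein, nucleotides, gap="-"):
    """Thread an unaligned nucleotide sequence onto a gapped protein row.

    Hypothetical helper for illustration: each residue consumes one codon,
    each protein gap becomes three nucleotide gap characters.
    """
    ungapped_len = len(aligned_protein) - aligned_protein.count(gap)
    if len(nucleotides) != 3 * ungapped_len:
        raise ValueError(
            "nucleotide length is not 3x the ungapped protein length"
        )
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)  # keep the reading frame across gaps
        else:
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(codons)


codon_row = back_translate("M-K", "ATGAAA")
```

Applying this to every row of a protein alignment gives a codon alignment
whose length is three times the protein alignment length. A real
implementation would also need to handle stop codons and frame shifts.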

>> > - dN/dS and the related functions needed to calculate it.
>> > - Possible AlignIO or MultipleSeqAlignment tweaks to take full advantage
>> > of codon alignments, including validation (testing for frame shifts etc.)
>>
>>
>> http://biopython.org/wiki/Google_Summer_of_Code#Codon_alignment_and_analysis
>>
>> I see you've started fleshing this idea out on the wiki, which is great.
>> Right now it seems a little on the lightweight side - or is that
>> deliberate (to see if a student can take this idea and come up with a solid
>> project proposal in this area)? Things like model selection might
>> be a fun extension - I can think of a local expert who would be
>> great to get involved on the science side if he's interested.
>
>
> I put up a quick sketch to avoid locking the wiki page for too long, but
> also deliberately left it vague to see where the applicants take it. Model
> selection would be cool, I added it. Local expert, also great.

If he's available and willing, yes. I've not mentioned this to him
yet so no promises - the idea only occurred to me while writing
that email ;)

>>
>> Alternatively this could include doing some more general work
>> on the alignment object - for instance per-column-annotation
>> for things like a consensus sequence - or an array-of-char
>> implementation as an alternative to the list-of-SeqRecords
>> we have now (with its poor column access speed).
>>
>> Peter
>
>
> I wonder if that's something we could just do incrementally -- change the
> MultipleSeqAlignment class to store a list-of-lists-of-chars (or
> list-of-strings), a list of SeqRecord-like husks (all the annotations, but
> without the Seq itself) for each row, a list of column annotations, and a
> single alphabet for the whole alignment.
>
> How do you suppose the speed of that would compare to the current
> list-of-SeqRecords, and also to that of a wrapped NumPy matrix? Would it be
> a significant enough speed improvement to justify both replacing the current
> implementation, and to make the NumPy approach less tempting (given PyPy's
> progress toward including a compliant implementation)? Alternatively, we
> could post a GSoC project for creating a separate TurboAlignment
> class/module based on NumPy which would be mostly interchangeable and
> interconvertible with the pure-Python version in the Biopython core.

When I said array-of-char I did have NumPy in mind, and PyPy does now
cope with two or more dimensional arrays in NumPyPy. Note that NumPy
handles both row- and column-orientated arrays with a simple class init
option, so this can easily be set up to favour row or column access.
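
A rough sketch of what that could look like, assuming the alignment is held
as a 2D NumPy array of single bytes. The "order" argument of numpy.array is
the init option in question; the toy rows and variable names are made up for
illustration.

```python
import numpy as np

# Toy alignment rows (made up for illustration).
rows = ["ACGT-A", "ACGTTA", "AC-TTA"]
chars = [list(r) for r in rows]

# order="C" keeps each row contiguous in memory (favours row access);
# order="F" keeps each column contiguous (favours column access).
by_row = np.array(chars, dtype="S1", order="C")
by_col = np.array(chars, dtype="S1", order="F")

# Indexing is identical either way; only the memory layout differs.
column_4 = by_col[:, 4].tobytes()
row_1 = by_row[1, :].tobytes()
```

Whether the layout makes a measurable difference would need benchmarking on
realistic alignment sizes.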

Last time I did anything with the alignment object where column access
was a bottleneck (calculating mutual information between columns), I
just loaded all the columns into memory as a list of strings, and computed
on that. It worked very nicely.
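
Something along these lines, as a from-memory sketch rather than the
original script. It works on plain row strings so that it is self-contained;
the helper names are hypothetical. Mutual information is computed in bits
from the observed character frequencies in the two columns.

```python
from collections import Counter
from math import log2


def columns_as_strings(rows):
    """Transpose equal-length row strings into a list of column strings."""
    return ["".join(col) for col in zip(*rows)]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), with the n's simplified
        mi += (count / n) * log2(count * n / (freq_a[a] * freq_b[b]))
    return mi


columns = columns_as_strings(["ACC", "ACT", "GTC", "GTT"])
mi_linked = mutual_information(columns[0], columns[1])
mi_unlinked = mutual_information(columns[0], columns[2])
```

Here columns 0 and 1 vary together while column 2 varies independently of
column 0. Gap handling and small-sample corrections are left out.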

> Speaking of which, should we also post the idea of storing sequences as an
> efficient byte array, BioJava-style?

I'd wondered about that (in combination with the discussion about strict
alphabet checking), but is there enough for a whole GSoC project?
Related to this one could look at something with k-mer hashes...
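
To make the two ideas concrete, here is one possible sketch: packing
unambiguous DNA at two bits per base into a bytearray, and computing integer
k-mer codes with a rolling update. The encoding table and function names are
assumptions for illustration, not a description of what BioJava does, and
ambiguity codes such as N are not handled.

```python
# Hypothetical 2-bit encoding for unambiguous DNA.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq):
    """Pack a DNA string into a bytearray, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return packed


def kmer_codes(seq, k):
    """Return the integer code of every k-mer, updated in O(1) per base."""
    mask = (1 << (2 * k)) - 1
    code = 0
    out = []
    for i, base in enumerate(seq):
        # Shift in the new base and drop the one that left the window.
        code = ((code << 2) | CODES[base]) & mask
        if i >= k - 1:
            out.append(code)
    return out


packed = pack_2bit("ACGTA")
dimers = kmer_codes("ACGT", 2)
```

The packed length would need to be stored separately, since the last byte
may be only partly used.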

(It's good to see lots of possible project ideas bouncing around)

Peter