[GSoC] GSoC 2013 is ON

Ketil Malde ketil at malde.org
Tue Apr 2 08:03:58 UTC 2013


[CC everybody including the biohaskell list. Let me know if any of you
want off. :-) ]

Pjotr Prins <pjotr2010 at thebird.nl> writes:

>   http://www.open-bio.org/wiki/Google_Summer_of_Code

> For Biopython (3x), BioRuby (5x) and BioJava (4x) I found project ideas.

> The others are missing.

> There is still a (rather small) window of opportunity for adding
> ideas.

I have one thing that might work well as a SOC project, if the right
student could be found.

Basically, I and a colleague recently developed and published a method
and implementation for more sensitive pairwise alignments.  The paper is
here, I think (PLoS ONE seems to be down atm):
  http://dx.plos.org/10.1371/journal.pone.0054422

I'm really happy with the results; if nothing else, check the SCOP
benchmark.  Although it's difficult to construct a good test case using
more complex methods (training sets for HMMs and whatnot), I don't know
of anything that is as good as this.  We're using it for annotation of
genes.

The current implementation is in Haskell, and although it works
correctly, it is a bit slow and, more problematically, it consumes too
much memory (so going multi-threaded, although pretty easy, won't be of
any help).

I would like to make this into a less resource-intensive (and thus more
practical) tool, and there are two ways I can think of to go about this:

1) Optimize the Haskell program

2) Reimplement the algorithm (or parts of it) in a different language

Advantages of 1:

* I already have a working program, and the type system makes it easy
  to refactor without introducing errors.
* Haskell supports lots of good multi-threaded programming models (like
  STM)
* I know Haskell pretty well, and will hopefully be able to mentor.

Disadvantages:

* Haskell has some good debugging tools, but they tend to work really
  poorly for programs that use a lot of memory (i.e. it takes a long
  time to generate profiles)
* Needs somebody with a bit (or a lot) of experience optimizing Haskell,
  and good knowledge of high-perf libraries (like vector)

Advantages of 2:

* Easier to get a student with adequate skills.
* More predictable performance models in other languages.
* Easier to compile and install for many users.

Disadvantages:

* Ideally, the student should know enough Haskell to read and
  understand the existing code.
* Likely needs a co-mentor with knowledge of the language in question.

Is this something I could or should submit as a task?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


