[Biojava-l] Questions about Summer of Code Project

Singer Ma sma.hmc at gmail.com
Thu Apr 8 10:38:41 UTC 2010


So, my questions were generated from looking past just the Summer of
Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as
part of its proposal, lists:

Make methods parallel-aware and take advantage of this when possible,
and provide a global variable to specify how much parallelisation can
take place.

on http://www.biojava.org/wiki/BioJava3_Proposal

How important is this to incorporate into the Summer of Code project?
Obviously anything that is already concurrent can remain that way, but
for the new code in multiple sequence alignment, does this need to be
parallel-aware? Clearly, in a multiple sequence alignment, certain
things can be made parallel such as the initial distance matrix
calculation, parts of the neighbor joining algorithm, etc. If I were
to contribute, I would want to uphold the agreed upon standards as
much as possible. I am just unsure of my capability to make multiple
sequence alignment parallel-aware.
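
To make the question concrete, here is a minimal sketch of what a parallel-aware distance matrix step might look like using only java.util.concurrent, written in Java 6 style. Everything named here is an assumption for illustration and not part of BioJava: the class ParallelDistanceMatrix, the setMaxThreads setting standing in for the proposal's "global variable", and the placeholder distance function (a simple mismatch fraction, where a real implementation would presumably use pairwise alignment scores or k-mer counts).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDistanceMatrix {

    // Hypothetical global setting: upper bound on worker threads.
    private static volatile int maxThreads =
            Runtime.getRuntime().availableProcessors();

    public static void setMaxThreads(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one thread");
        }
        maxThreads = n;
    }

    // Placeholder distance (an assumption): mismatches over the shared
    // prefix plus the length difference, divided by the longer length.
    static double distance(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0;
        }
        int mismatches = longer - shorter;
        for (int i = 0; i < shorter; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / longer;
    }

    // Computes the symmetric distance matrix. Each row is an independent
    // task; tasks only read the input list and return a fresh array, so
    // no mutable state is shared between threads.
    public static double[][] compute(final List<String> seqs)
            throws InterruptedException, ExecutionException {
        final int n = seqs.size();
        double[][] matrix = new double[n][n];
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<double[]>> rows = new ArrayList<Future<double[]>>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                rows.add(pool.submit(new Callable<double[]>() {
                    public double[] call() {
                        // Distances from sequence 'row' to every later one.
                        double[] d = new double[n - row - 1];
                        for (int j = row + 1; j < n; j++) {
                            d[j - row - 1] =
                                    distance(seqs.get(row), seqs.get(j));
                        }
                        return d;
                    }
                }));
            }
            // Only the calling thread writes into the result matrix.
            for (int i = 0; i < n; i++) {
                double[] d = rows.get(i).get();
                for (int k = 0; k < d.length; k++) {
                    int j = i + 1 + k;
                    matrix[i][j] = d[k];
                    matrix[j][i] = d[k];
                }
            }
        } finally {
            pool.shutdown();
        }
        return matrix;
    }

    public static void main(String[] args) throws Exception {
        setMaxThreads(2);
        double[][] m = compute(Arrays.asList("ACGT", "ACGA", "TCGA"));
        for (double[] r : m) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

The pairwise distances are independent of each other, which is why this step seems like the easiest one to hand to a thread pool; the neighbor joining step has dependencies between iterations and would need more thought.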

Singer

On Thu, Apr 8, 2010 at 3:23 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
> Hi Singer,
>
> To add a bit more information to Andreas' comments. Java has a very mature concurrent execution library (java.util.concurrent), which was introduced in version 1.5. BioJava is a 1.6 project, so I would expect any concurrent code to be using it. Extensions are available for this, most notably the Google Guava project, the Actor model found in Scala (with more pure Java implementations available) and the Map/Reduce paradigm first white-papered by Google. The big rules about concurrency are:
>
> * Mutable objects are the work of the devil & should be avoided
> * Tasks & Futures are quite lightweight things to produce; threads are not
> * Multiple tasks can be given to a queue to be processed by a number of threads in a pool
> * Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
> * Assume that things will fail
> * Write your program with a view to being concurrent; do not force concurrency on an already written program
>
> Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/).
>
> Andy
>
> On 7 Apr 2010, at 20:30, Andreas Prlic wrote:
>
>> Hi Singer,
>>
>>> I had previously sent this, but was not part of the mailing list, so I
>>> can only assume it got lost in a spam loop.
>>
>> You need to be subscribed in order to be able to post...
>>
>>> I was interested in applying for the All-Java Multiple Sequence
>>> Alignment Google Summer of Code project.
>>
>> Several students have expressed their interest in this project.
>> Depending on the funding situation, at most one will be
>> able to work on this... There is also a 2nd BioJava related project or
>> you could propose your own ideas...
>> http://biojava.org/wiki/Google_Summer_of_Code
>>
>>
>>> I wanted to create a project
>>> plan but had some questions about the package as it stands now.
>>>
>>> 1. What exactly has changed with the transition to BioJava 3? From
>>> what I've read on the BioJava 3 proposal page, it seems that the
>>> changes are to the organization of the code. Additionally there are
>>> some new standards to follow. Java 6 usage is desired, but I am unsure
>>> which of the new features could be used in modifying pairwise
>>> sequence alignments.
>>
>> BioJava is more modular in version 3. There is a new module for
>> working with sequences. The current alignment module is still based on
>> the old version of BioJava though.
>>
>>>
>>> 2. Is the Neighbor Joining Algorithm really the best for this? Are
>>> other multiple alignment implementations desired? I have implemented
>>> the neighbor joining algorithm very inefficiently in Python; it was
>>> not particularly difficult.
>>
>> NJ is a clustering technique, but there are also others.
>> http://en.wikipedia.org/wiki/Neighbor-joining
>> Another online lecture that might be useful is:
>> http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html
>>
>>> This step seems like it will not take very
>>> long. As for parallelism, I have no experience with it
>>> in Java and will only have some experience with it in C. Will that be
>>> an issue?
>>
>> I have never written multi threaded code in C, but I would guess it is
>> much much easier in Java...
>>
>>> 3. Is there a specific paper with the exact algorithm that should be
>>> implemented here?
>>
>> We have only 3 months for this project so having a modular core
>> algorithm that can be extended would be a priority. I recommend
>> reading the ClustalW, T-Coffee and MUSCLE papers.
>>
>>> General: Will use cases be provided? Will test data be provided? These
>>> would both be useful in coding the test cases which seem to be coded
>>> first.
>>
>> I can provide plenty of data for that.
>>
>>
>>> Additionally, I have access to my current Windows machine as well as
>>> a Linux machine for testing, but no Mac. In theory, with Java,
>>> if it works on one platform then it works on another, and if
>>> it works on Linux it should especially be fine on Mac, but should I be
>>> worried about strange peculiarities?
>>
>> From my experience Java works pretty fine on any platform. There might
>> be issues with user interfaces that require testing, but we are not
>> going to do user interfaces here...
>>
>> Andreas
>>
>>
>>>
>>> Thanks,
>>> Singer Ma
>>> Harvey Mudd College 2011
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>>
>>
>> --
>> -----------------------------------------------------------------------
>> Dr. Andreas Prlic
>> Senior Scientist, RCSB PDB Protein Data Bank
>> University of California, San Diego
>> (+1) 858.246.0526
>> -----------------------------------------------------------------------
>>
>
> --
> Andrew Yates                   Ensembl Genomes Engineer
> EMBL-EBI                       Tel: +44-(0)1223-492538
> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/



