[Biojava-l] Questions about Summer of Code Project

Andy Yates ayates at ebi.ac.uk
Thu Apr 8 10:46:15 UTC 2010


Ahhh okay. So when we wrote this section it was with a view towards being able to do things in a concurrent manner as and when that framework appears. BioJava3 is still in an incubation phase; a lot of code is in place, but we are all having to do this alongside work commitments (in my case a Perl project, so my work-time BioJava contributions are very limited).

Anyway, to go back to the question about being "framework" standard: the MSA algorithm would be the first case we would have to make concurrent (as far as I am aware, but Scooter is a better person to confirm this), so the framework for building a concurrent application would come from this project. If the code is written against the standard concurrent library interfaces (java.util.concurrent), then it should be possible to transplant it into any concurrent Java framework, and that's really the important thing here.
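
To make the point about standard interfaces concrete, here is a minimal sketch of the initial distance matrix step written only against java.util.concurrent (ExecutorService, Callable, Future). Nothing in it is BioJava code: the class name DistanceMatrixSketch, the mismatch-fraction distance() and the threads parameter (standing in for the "how much parallelisation" setting) are all assumptions made up for illustration. Each row of the matrix is an independent task that only reads shared data, and the calling thread is the only one that writes the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration only; not part of BioJava. */
final class DistanceMatrixSketch {

    /** Stand-in distance: fraction of mismatched positions in two equal-length sequences. */
    static double distance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / a.length();
    }

    /** Computes the symmetric pairwise distance matrix using a pool of the given size. */
    static double[][] compute(List<String> sequences, int threads) {
        // Defensive copy: the tasks only ever read this list, never modify it.
        final List<String> seqs = new ArrayList<String>(sequences);
        final int n = seqs.size();
        double[][] matrix = new double[n][n];

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One lightweight task per row; each returns the distances to the later sequences.
            List<Callable<double[]>> tasks = new ArrayList<Callable<double[]>>();
            for (int i = 0; i < n; i++) {
                final int row = i;
                tasks.add(new Callable<double[]>() {
                    @Override
                    public double[] call() {
                        double[] out = new double[n - row - 1];
                        for (int k = 0; k < out.length; k++) {
                            out[k] = distance(seqs.get(row), seqs.get(row + 1 + k));
                        }
                        return out;
                    }
                });
            }

            // invokeAll blocks until every task is done and keeps the task order.
            List<Future<double[]>> results = pool.invokeAll(tasks);
            for (int i = 0; i < n; i++) {
                double[] rowValues = results.get(i).get();
                for (int k = 0; k < rowValues.length; k++) {
                    int j = i + 1 + k;
                    matrix[i][j] = rowValues[k];
                    matrix[j][i] = rowValues[k];
                }
            }
            return matrix;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while computing distances", e);
        } catch (ExecutionException e) {
            // Assume that things will fail: surface the task's own exception as the cause.
            throw new IllegalStateException("a distance task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] d = compute(Arrays.asList("ACGT", "ACGA", "TTTT"), 2);
        System.out.println(Arrays.deepToString(d));
    }
}
```

Because the pool is only ever touched through the ExecutorService interface, swapping in a different executor (or whatever framework BioJava 3 eventually settles on) should not require changing the tasks themselves.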

Andy

On 8 Apr 2010, at 11:38, Singer Ma wrote:

> So, my questions were generated from looking past just the Summer of
> Code proposal and into what BioJava 3 is supposed to do. BioJava 3, as
> part of its proposal, lists:
> 
> Make methods parallel-aware and take advantage of this when possible,
> and provide a global variable to specify how much parallelisation can
> take place.
> 
> on http://www.biojava.org/wiki/BioJava3_Proposal
> 
> How important is this to incorporate into the Summer of Code project?
> Obviously anything that is already concurrent can remain that way, but
> for the new code in multiple sequence alignment, does this need to be
> parallel-aware? Clearly, in a multiple sequence alignment, certain
> things can be made parallel such as the initial distance matrix
> calculation, parts of the neighbor joining algorithm, etc. If I were
> to contribute, I would want to uphold the agreed upon standards as
> much as possible. I am just unsure of my capability to make multiple
> sequence alignment parallel-aware.
> 
> Singer
> 
> On Thu, Apr 8, 2010 at 3:23 AM, Andy Yates <ayates at ebi.ac.uk> wrote:
>> Hi Singer,
>> 
>> To add a bit more information to Andreas' comments. Java has a very mature concurrent execution library (java.util.concurrent), which was introduced in version 1.5. BioJava is a 1.6 project, so I would expect any concurrent code in it to use this library. Extensions and alternatives are available, most notably the Google Guava project, the Actor model found in Scala (with more pure-Java implementations available) and the Map/Reduce paradigm first white-papered by Google. The big rules about concurrency are:
>> 
>> * Mutable objects are the work of the devil & should be avoided
>> * Tasks & Futures are quite lightweight things to produce; threads are not
>> * Multiple tasks can be given to a queue to be processed by a number of threads in a pool
>> * Assume a non-linear execution pipeline and attempt to pass messages/jobs into queues when data is processed
>> * Assume that things will fail
>> * Write your program with a view to being concurrent; do not force concurrency onto an already written program
>> 
>> Concurrent programs are very hard things to write and normally fail because what they attempt to do is too complex or too simple. Getting the balance right is hard but do-able. I can also recommend Brian Goetz's Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/).
>> 
>> Andy
>> 
>> On 7 Apr 2010, at 20:30, Andreas Prlic wrote:
>> 
>>> Hi Singer,
>>> 
>>>> I had previously sent this, but was not part of the mailing list, so I
>>>> can only assume it got lost in a spam loop.
>>> 
>>> You need to be subscribed in order to be able to post...
>>> 
>>>> I was interested in applying for the All-Java Multiple Sequence
>>>> Alignment Google Summer of Code project.
>>> 
>>> Several students have expressed their interest in this project.
>>> Depending on the funding situation, at most one will be able to
>>> work on this... There is also a 2nd BioJava-related project, or
>>> you could propose your own ideas...
>>> http://biojava.org/wiki/Google_Summer_of_Code
>>> 
>>> 
>>>> I wanted to create a project
>>>> plan but had some questions about the package as it stands now.
>>>> 
>>>> 1. What exactly has changed with the transition to BioJava 3? From
>>>> what I've read on the BioJava 3 proposal page, it seems that the
>>>> changes are to the organization of the code. Additionally there are
>>>> some new standards to follow. Java 6 usage is desired, but I am unsure
>>>> which of the new features could be used in modifying pairwise
>>>> sequence alignments.
>>> 
>>> BioJava is more modular in version 3. There is a new module for
>>> working with sequences. The current alignment module is still based on
>>> the old version of BioJava though.
>>> 
>>>> 
>>>> 2. Is the Neighbor Joining Algorithm really the best for this? Are
>>>> other multiple alignment implementations desired? I have implemented
>>>> the neighbor joining algorithm very inefficiently in Python; it was
>>>> not particularly difficult.
>>> 
>>> NJ is a clustering technique, but there are also others.
>>> http://en.wikipedia.org/wiki/Neighbor-joining
>>> Another online lecture that might be useful is:
>>> http://www.mbio.ncsu.edu/MB451/lecture/trees/lecture.html
>>> 
>>>> This step seems like it will not take very
>>>> long. Additionally, on parallelism: I have no experience with it
>>>> in Java and will only have some experience with it in C. Will that be
>>>> an issue?
>>> 
>>> I have never written multithreaded code in C, but I would guess it is
>>> much, much easier in Java...
>>> 
>>>> 3. Is there a specific paper with the exact algorithm that should be
>>>> implemented here?
>>> 
>>> We have only 3 months for this project so having a modular core
>>> algorithm that can be extended would be a priority. I recommend
>>> reading the ClustalW, T-Coffee and MUSCLE papers.
>>> 
>>>> General: Will use cases be provided? Will test data be provided? These
>>>> would both be useful in coding the test cases which seem to be coded
>>>> first.
>>> 
>>> I can provide plenty of data for that.
>>> 
>>> 
>>>> Additionally, I have access to my current Windows machine as well as
>>>> a Linux machine for testing, but no Mac. While in theory with Java,
>>>> if it works on one, then it works on another, and especially if
>>>> it works on Linux, it should be fine on Mac, should I be worried about
>>>> strange peculiarities?
>>> 
>>> From my experience Java works pretty well on any platform. There might
>>> be issues with user interfaces that require testing, but we are not
>>> going to do user interfaces here...
>>> 
>>> Andreas
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> Singer Ma
>>>> Harvey Mudd College 2011
>>>> _______________________________________________
>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>> 
>> 
>> --
>> Andrew Yates                   Ensembl Genomes Engineer
>> EMBL-EBI                       Tel: +44-(0)1223-492538
>> Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
>> Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/

-- 
Andrew Yates                   Ensembl Genomes Engineer
EMBL-EBI                       Tel: +44-(0)1223-492538
Wellcome Trust Genome Campus   Fax: +44-(0)1223-494468
Cambridge CB10 1SD, UK         http://www.ensemblgenomes.org/







