[Biojava-dev] Plans for next biojava release - modularization

Scooter Willis HWillis at scripps.edu
Tue May 12 13:34:51 UTC 2009


Mark

 

It is a challenge knowing where to draw the line. Allowing both options
is a reasonable approach. The implementation of the algorithm is key to
allowing it to be multi-threaded or run in parallel. One approach is to
provide a standard interface where process() runs in the parent thread
and waits for the result/return value. To run the algorithm in a thread
you can have a startProcess() where you add yourself as a progress
listener, and when the complete() method is called you can call
getResults(). You can then also have a corresponding stopProcess() which
would set an internal flag to cause all threads to quit. There are lots
of ways to tackle the problem; the key is to start talking about it and
at minimum take advantage of multiple cores, with the external code
setting the number of cores to use. You can get a dual quad-core machine
these days for < $1000 but most software implementations are not
designed to take advantage of it.
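
Something like the following rough sketch is what I have in mind; the
names (LongRunningTask, ProgressListener) are just placeholders, not
anything that exists in BioJava today:

    // Hypothetical sketch only; none of these types exist in BioJava yet.
    interface ProgressListener {
        void progress(int percentComplete, String message); // periodic updates for a GUI
        void complete();                                     // fired once results are ready
    }

    interface LongRunningTask<T> {
        T process();                                  // blocking call, runs in the caller's thread
        void startProcess();                          // runs asynchronously in worker thread(s)
        void stopProcess();                           // sets an internal flag asking workers to quit
        void addProgressListener(ProgressListener l); // register for progress/complete callbacks
        T getResults();                               // valid after complete() has been fired
        void setNumberOfThreads(int n);               // external code chooses how many cores to use
    }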

 

The real question is what exists today in the BioJava API that is
considered long running in a normal use case and thus is a candidate to
be run in parallel. It may not be an issue in existing BioJava code.
When I first started using BioJava I went looking for BLAST code only to
find a BLAST parser. I wanted to do a Multiple Sequence Alignment and it
turns out that the BioJava code calls CLUSTALW as an external process
under the covers. I also needed code to construct trees from an MSA and
found the Summer of Code project that was only focused on representing
the tree.

 

It would be nice to have a BLAST implementation in Java optimized to run
on a cluster, but who has time to rewrite BLAST in Java when you can do
a BLAST search via the web and focus on parsing the results? BioJava
needs a BLAST API that makes a web service call to an external service
and returns structured results in core BioJava data structures. It is
probably not difficult to do a Java version of CLUSTALW, but again we
can push the work out to
http://www.ebi.ac.uk/Tools/webservices/services/clustalw and get the
results back in BioJava structures.

 

I can sign up for doing the BLAST web service -> BioJava and CLUSTALW
web service -> BioJava code. I haven't done the research, but it seems
that http://www.ebi.ac.uk/Tools/webservices/ has done a fair amount of
work to expose common computational biology services. If multiple
external services offer BLAST via web services, each with a different
underlying implementation, then BioJava could provide an abstraction
over the different services.
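
Roughly, the abstraction could look like this (BlastService, BlastHit
and EbiBlastService are invented names for illustration, not existing
BioJava classes):

    import java.util.List;

    // Hypothetical abstraction over different remote BLAST providers.
    interface BlastService {
        // Submit one query sequence and block until structured hits come back.
        List<BlastHit> search(String querySequence, String database) throws Exception;
    }

    // A minimal structured result; real code would map onto core BioJava types.
    class BlastHit {
        final String subjectId;
        final double eValue;
        BlastHit(String subjectId, double eValue) {
            this.subjectId = subjectId;
            this.eValue = eValue;
        }
    }

    // One implementation per provider behind the same interface.
    class EbiBlastService implements BlastService {
        public List<BlastHit> search(String querySequence, String database) {
            // call the remote web service and parse the response into BlastHit objects
            throw new UnsupportedOperationException("illustrative stub");
        }
    }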

 

Thanks


Scooter

 

From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com] 
Sent: Tuesday, May 12, 2009 1:27 AM
To: Scooter Willis
Cc: Andreas Prlic; biojava-dev
Subject: Re: [Biojava-dev] Plans for next biojava release -
modularization

 


Hi - 

This was one thing we discussed previously with respect to biojava 3.
Generally I support the idea because almost all computers are now
multi-core and as you say cloud or utility computing is already a
reality. 

However, I tend to think that BioJava should not control threading or
concurrency. This should be done by the developer. This is because
sometimes multithreading can be fast on a slow computer but slow on a
fast computer (due to the overhead of spawning threads), so programs
need to be tunable. Also, Java app servers and things like Sun Grid
Engine, EC2 etc. don't like people attempting to control their own
threads.  What BioJava should do is expose granular and thread-safe
operations that can be threaded or form discrete tasks on a utility grid
or complete in SessionBeans on an app server.  For example, it would be
better if BioJava had a single-threaded method to calculate the GC of a
single sequence rather than a multi-threaded method that calculates the
GC of multiple sequences.  This would let the developer make a
multithreaded version if desired or distribute multiple tasks based on
the single-threaded version to a compute cloud (and let the cloud manage
all the tasks). 
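
To illustrate, something along these lines, where the fine-grained
method is single threaded and the caller does the parallelising with a
plain ExecutorService (the class and method names are made up for this
example):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class GcExample {

        // Fine-grained, single-threaded operation: GC fraction of one sequence.
        static double gcContent(String seq) {
            int gc = 0;
            for (int i = 0; i < seq.length(); i++) {
                char c = Character.toUpperCase(seq.charAt(i));
                if (c == 'G' || c == 'C') gc++;
            }
            return seq.length() == 0 ? 0.0 : (double) gc / seq.length();
        }

        // The caller decides whether and how to parallelise over many sequences.
        static List<Double> gcContents(List<String> seqs, int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<Double>> futures = new ArrayList<Future<Double>>();
                for (final String s : seqs) {
                    futures.add(pool.submit(new Callable<Double>() {
                        public Double call() { return gcContent(s); }
                    }));
                }
                List<Double> results = new ArrayList<Double>();
                for (Future<Double> f : futures) {
                    results.add(f.get()); // blocks until each task is done
                }
                return results;
            } finally {
                pool.shutdown();
            }
        }
    }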

Possibly the best situation would be to have the single-threaded,
fine-grained operations that let developers or grid engines control
threading, and then higher-level APIs that do it for you (or good
cookbook examples that show you how to do it).  Another idea that was
discussed was the use of properties files to allow people to set how
many CPUs they want to make available to the JVM, or to name packages
that can or cannot use threading. 
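
A small sketch of how such a properties file might be read; the file
name biojava.properties and the key biojava.threads are just assumptions
for illustration:

    import java.io.FileInputStream;
    import java.util.Properties;

    class ThreadConfig {
        // Returns the number of threads configured by the user, falling back to
        // the number of available processors if nothing is configured.
        static int configuredThreads() {
            Properties props = new Properties();
            try {
                props.load(new FileInputStream("biojava.properties"));
            } catch (Exception e) {
                // no file (or unreadable file): fall through to the defaults below
            }
            String value = props.getProperty("biojava.threads");
            if (value != null) {
                try {
                    return Integer.parseInt(value.trim());
                } catch (NumberFormatException ignored) {
                    // malformed value: fall back to the default
                }
            }
            return Runtime.getRuntime().availableProcessors();
        }
    }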

Finally, there are lots of times when it is highly desirable to use Java
beans because they play well with dozens of Java APIs; however, beans
don't work well with threads because they have public setter methods.  I
would like to see a lot more bean use in a future BioJava because it
would make life so much easier, but a lot of care would need to be taken
to make sure thread safety is preserved.  There are many patterns that
can be used, such as synchronization locks etc., to make things thread
safe, so I think this can be achieved as long as we are disciplined and
consider that all methods may be used in a multi-threaded application
(even if we write the method as single-threaded).  If there are code
checkers that make suggestions on thread safety it would be great to
have these as part of the standard build process.  Good documentation
would go a long way as well.  Are there unit test patterns that can
catch these problems as well?  Suggestions would be great. 
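
One of the simpler patterns, sketched here with an invented SequenceBean
class, is to keep the conventional getter/setter shape but synchronize
every access to the mutable state:

    // Invented example class; synchronizing all accessors is one of the simpler
    // ways to keep a mutable bean safe when it might be shared between threads.
    class SequenceBean {
        private String name;
        private String residues;

        public synchronized String getName() { return name; }
        public synchronized void setName(String name) { this.name = name; }

        public synchronized String getResidues() { return residues; }
        public synchronized void setResidues(String residues) { this.residues = residues; }

        // Reads that combine several properties must also hold the lock so the
        // caller sees a consistent snapshot.
        public synchronized String toFasta() {
            return ">" + name + "\n" + residues;
        }
    }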

Progress Listener patterns are good but it depends on the situation and
might be better handled in high-level APIs or left to the developer.
For example, in your NJ code a progress listener would be good if
someone fed 1000 sequences into the method but not if they only put in
10. Also, code running on an old machine might need a progress listener
but the same problem on a new machine may complete almost instantly.
Probably a pluggable listener would be the way to go.  Also, it might be
possible to do this using the new JDK APIs that let you take a peek at
the stack trace. Even if your NJ method didn't allow for a progress
listener a developer could still make one by looking at the method calls
in the stack. As long as your NJ method called other methods internally
for each sequence (quite likely) it would be possible to observe the
cycle of method calls from the stack.  This might make it possible to
have a very general BioJava progress listener that can be told to count
the number of times a method is called in the stack; the name of the
method would be the argument.  If the application runs in a Java app
server you can also do this very easily with a method Interceptor. 
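
A rough sketch of the stack-peeking idea using the standard
Thread.getStackTrace() API: a monitor thread periodically samples a
worker thread's stack and reports whenever the named method is on it.
Sampling can only approximate how often the method is really called, and
all the names here are illustrative:

    // Illustrative sketch: periodically peek at another thread's stack trace and
    // report whenever the named method shows up. Sampling only approximates the
    // real call count.
    class StackSamplingMonitor implements Runnable {
        private final Thread worker;
        private final String methodName;
        private volatile boolean running = true;

        StackSamplingMonitor(Thread worker, String methodName) {
            this.worker = worker;
            this.methodName = methodName;
        }

        public void stop() { running = false; }

        public void run() {
            int observations = 0;
            while (running && worker.isAlive()) {
                for (StackTraceElement frame : worker.getStackTrace()) {
                    if (frame.getMethodName().equals(methodName)) {
                        observations++;
                        System.out.println(methodName + " seen on the stack ("
                                + observations + " samples so far)");
                        break;
                    }
                }
                try {
                    Thread.sleep(500); // sampling interval
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }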

- Mark 

biojava-dev-bounces at lists.open-bio.org wrote on 05/11/2009 09:50:58 PM:

> Andreas
> 
> Another theme that should be considered is providing a multi-thread
> version of any module with long run time. This would have a couple
> elements. A progress listener interface should be standard where core
> code would update progress messages to listeners that can be used by
> external code to display feedback to the user. I did this with the
> Neighbor Joining code for tree construction and it provides needed
> feedback in a GUI. If not the user gets frustrated because they don't
> know the code they are about to execute may take 10 minutes or 8 hours
> to complete and they think the software is not working. The reverse is
> also true for canceling an operation where you want to have core code
> stop processing a long running loop. Once the code has completed then
> the listener interface for process complete is called allowing the next
> step in the external code to continue. The developer would have the
> choice to call the "process" method or run it in a thread and wait for
> the callback complete method to be called. 
> 
> This is the first step in the ability to have the core/long running
> processes take advantage of multiple threads to complete the
> computational task faster. Not all code can be parallelized easily but
> if the algorithm can take advantage of running in parallel then it
> should. This then opens up a couple of cloud computing frameworks that
> extend the multi-threaded concepts in Java across a cluster
> http://www.terracotta.org/. If we put an emphasis on having code that
> runs well in a thread we are one step closer to an architecture that can
> run in a cloud. The computational problems are only going to get bigger
> and with Amazon EC2 and http://www.eucalyptus.com/ approaches
> computational IO cycles are going to be cheap as long as the
> software/libraries can easily take advantage of it.
> 
> Thanks
> 
> Scooter
> 
> -----Original Message-----
> From: biojava-dev-bounces at lists.open-bio.org
> [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas
> Prlic
> Sent: Monday, May 11, 2009 12:27 AM
> To: biojava-dev
> Subject: [Biojava-dev] Plans for next biojava release - modularization
> 
> Hi biojava-devs,
> 
> It is time to start working on the next biojava release.  I  would
> like to modularize the current code base and apply some of the ideas
> that have emerged around Richard's "biojava 3" code. In principle the
> idea is that all changes should be backwards compatible with the
> interfaces provided by the current biojava 1.7 release.  Backwards
> compatibility shall only be broken if the functionality is being
> replaced with something that works better, and gets documented
> accordingly. For the build functionality I would suggest to stick with
> what Richard's biojava 3 code base already is providing. Since we will
> try to be backwards compatible all code development should be part of
> the biojava-trunk and the first step will be to move the ant-build
> scripts to a maven build process. Following this procedure will allow
> to use e.g. the code refactoring tools provided by Eclipse, which
> should come in handy.
> 
> The modules I would like to see should provide self-contained
> functionality and cross dependencies should be restricted to a
> minimum. I would suggest to have the following modules:
> 
> biojava-core: Contains everything that can not easily be modularized
> or nobody volunteers to become a module maintainer.
> biojava-phylogeny: Scooter expressed some interested to provide such a
> module and become package maintainer for it.
> biojava-structure: Everything protein structure related. I would be
> package maintainer.
> biojava-blast: Blast parsing is a frequently requested functionality
> and it would be good to have this code self-contained. A package
> maintainer for this still will need to be nominated at a later stage.
> Any suggestions for other modules?
> 
> Let me know what you think about this.
> 
> Andreas
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev
