[Biojava-dev] Plans for next biojava release - modularization

Wed May 13 00:23:30 UTC 2009

Andreas

A follow-up point related to Mark's comment: parsing BLAST output would be unnecessary, or at least less important, if we provided a clean BioJava API that makes the web service call with BioJava data structures as input and returns BioJava data structures as output. This saves the user from doing the web query, saving the file, parsing it, and so on. It would be interesting to know how many users run their own BLAST server for privacy reasons.
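
A minimal sketch of what such an API boundary might look like, assuming nothing about the eventual BioJava design. BlastHit, BlastService and CannedBlastService are hypothetical names invented for illustration, and plain strings stand in for BioJava sequence objects. The point is only that the caller passes a sequence and gets structured hits back, and that a public web service or an in-house server could sit behind the same interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical hit record; a real module would return BioJava structures instead. */
final class BlastHit {
    final String subjectId;
    final double eValue;

    BlastHit(String subjectId, double eValue) {
        this.subjectId = subjectId;
        this.eValue = eValue;
    }
}

/** Hypothetical abstraction: sequence in, structured hits out, no file parsing by the caller. */
interface BlastService {
    List<BlastHit> search(String querySequence, String database);
}

/** Stand-in backend returning canned hits; a real one would call a remote or in-house server. */
final class CannedBlastService implements BlastService {
    private final List<BlastHit> canned;

    CannedBlastService(List<BlastHit> canned) {
        // Defensive copy so later changes to the caller's list do not leak in.
        this.canned = new ArrayList<BlastHit>(canned);
    }

    @Override
    public List<BlastHit> search(String querySequence, String database) {
        if (querySequence == null || querySequence.isEmpty()) {
            throw new IllegalArgumentException("query sequence must not be empty");
        }
        return Collections.unmodifiableList(canned);
    }
}
```

Code written against BlastService would not need to change if the backend were swapped, which is also where a privately hosted server could be plugged in.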

Scooter

-----Original Message-----
From: Scooter Willis
Sent: Tue 5/12/2009 8:13 PM
To: Andreas Prlic
Cc: biojava-dev
Subject: RE: [Biojava-dev] Plans for next biojava release - modularization

Andreas

The goal for BioJava could be to provide a wrapper for the http://www.ebi.ac.uk/Tools/webservices/clients/blastpgp Java code so that inputs and outputs are BioJava data structures. I think they are using Axis for the client web services code. If BioJava 3 is going to require Java 6 as a minimum, then it is easier to use the Java 6 SOAP processing capabilities by pointing to the WSDL and generating the client-side Java code. This cuts down on the external third-party libraries that are required.

I try to stay out of the legacy file parsing business whenever possible. 

Scooter 

-----Original Message-----
From: andreas.prlic at gmail.com on behalf of Andreas Prlic
Sent: Tue 5/12/2009 7:59 PM
To: Scooter Willis
Cc: biojava-dev
Subject: Re: [Biojava-dev] Plans for next biojava release - modularization

Hi Scooter,

About your suggestion for the BLAST web service client code: in
principle I like the idea, and we have had questions on the mailing
list regarding this in the past. The only thing is that I think there
is already some Java client code available:
http://www.ebi.ac.uk/Tools/webservices/clients/blastpgp
but I am not sure how good that Java client library is.

Besides this, our BLAST parser library needs work, and if you are
interested in working on that you are welcome. As I mentioned, I think
this should become its own module, due to the popularity of that code.

Andreas

On Tue, May 12, 2009 at 6:34 AM, Scooter Willis <HWillis at scripps.edu> wrote:
> Mark
>
>
>
> It is a challenge to know where to draw the line. Allowing both options
> is a reasonable approach. The implementation of the algorithm is key to
> allowing it to be multi-threaded or to run in parallel. One approach
> is to provide a standard interface in which process() waits for the
> result/return value and runs in the parent thread. To run the algorithm in a
> thread you can have a startProcess() where you can add yourself as a
> progress listener, and when the complete() method is called you can call
> getResults(). You can then also have the corresponding stopProcess(), which
> would set an internal value to cause all threads to quit. There are lots of
> ways to tackle the problem; the key is to start talking about it and at
> minimum take advantage of multiple cores, where the external code can set
> the number of cores to use. You can get a dual quad-core machine these days
> for < $1000, but most software implementations are not designed to take
> advantage of it.
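
The process() / startProcess() / stopProcess() / getResults() idea above can be sketched roughly as follows. This is an illustration only, not existing BioJava code: ProgressListener and SumTask are hypothetical names, summing an array stands in for a real long-running algorithm, and awaitCompletion() is an extra convenience that the message does not mention.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical callback interface for progress and completion. */
interface ProgressListener {
    void progress(int done, int total);
    void complete();
}

/** Toy long-running task (summing an array) standing in for a real algorithm. */
final class SumTask {
    private final int[] data;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private volatile long result;
    private volatile Thread worker;

    SumTask(int[] data) {
        this.data = data.clone();
    }

    /** Blocking variant: runs in the caller's thread and returns the result. */
    long process() {
        return compute(null);
    }

    /** Non-blocking variant: runs in a new thread and reports to the listener. */
    void startProcess(final ProgressListener listener) {
        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                compute(listener);
            }
        });
        worker = t;
        t.start();
    }

    /** Asks the running loop to quit at its next iteration. */
    void stopProcess() {
        stopRequested.set(true);
    }

    /** Result of the last run; partial if the run was stopped early. */
    long getResults() {
        return result;
    }

    /** Waits for a started worker thread to finish. */
    void awaitCompletion() {
        Thread t = worker;
        if (t == null) {
            return;
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private long compute(ProgressListener listener) {
        long sum = 0;
        for (int i = 0; i < data.length && !stopRequested.get(); i++) {
            sum += data[i];
            if (listener != null) {
                listener.progress(i + 1, data.length);
            }
        }
        result = sum;
        if (listener != null) {
            listener.complete();
        }
        return sum;
    }
}
```

In this sketch complete() is also called after a stop request, so a listener would need to check whether it asked for cancellation before trusting getResults(); a real design might report cancellation through a separate callback.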
>
>
>
> The real question is what exists today in the BioJava API that is considered
> long running in normal use cases and thus is a candidate to be run in
> parallel. It may not be an issue in existing BioJava code. When I first
> started using BioJava I went looking for BLAST code only to find a BLAST
> parser. I wanted to do a multiple sequence alignment (MSA), and it turns out
> that BioJava code calls CLUSTALW as an external process under the covers. I
> also needed code to construct trees from an MSA and found the Summer of Code
> project, which was focused only on representing the tree.
>
>
>
> It would be nice to have a BLAST implementation in Java optimized to run on
> a cluster, but who has time to rewrite BLAST in Java when you can do a BLAST
> search via the web and focus on parsing the results? BioJava needs a BLAST
> API that makes a web services call to an external service and returns
> structured results in core BioJava structures. It is probably not difficult
> to do a Java version of CLUSTALW, but again we can push the work out to
> http://www.ebi.ac.uk/Tools/webservices/services/clustalw and get the results
> back in BioJava structures.
>
>
>
> I can sign up for doing BLAST web service -> BioJava and CLUSTALW web
> service -> BioJava code. I haven't done the research, but it seems that
> http://www.ebi.ac.uk/Tools/webservices/ has done a fair amount of work to
> expose common computational biology services. If multiple external services
> are offering BLAST via web services, each with a different implementation,
> then BioJava could provide an abstraction over the different services.
>
>
>
> Thanks
>
> Scooter
>
>
>
> From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
> Sent: Tuesday, May 12, 2009 1:27 AM
> To: Scooter Willis
> Cc: Andreas Prlic; biojava-dev
> Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
>
>
>
> Hi -
>
> This was one thing we discussed previously with respect to biojava 3.
>  Generally I support the idea because almost all computers are now
> multi-core and as you say cloud or utility computing is already a reality.
>
> However, I tend to think that biojava should not control threading or
> concurrency. This should be done by the developer. This is because sometimes
> multithreading can be fast on a slow computer but slow on a fast computer
> (due to the overhead in spawning threads), so programs need to be tunable.
> Also, Java app servers and things like Sun Grid Engine, EC2, etc. don't like
> people attempting to control their own threads.  What BioJava should do is
> expose granular and thread-safe operations that can be threaded or form
> discrete tasks on a utility grid or complete in SessionBeans on an app
> server.  For example, it would be better if BioJava had a single-threaded
> method to calculate the GC content of a single sequence rather than a
> multi-threaded method that calculates the GC content of multiple sequences.
> This would let the developer make a multithreaded version if desired or
> distribute multiple tasks based on the single-threaded version to a compute
> cloud (and let the cloud manage all the tasks).
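
The GC example above might look like this in outline. It is a sketch under stated assumptions rather than BioJava code: GcContent and GcBatch are hypothetical names, and plain strings stand in for BioJava sequences. The library side is a stateless single-sequence method; the thread pool and its size belong to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Library side: a fine-grained, stateless operation that is safe to call from any thread. */
final class GcContent {
    private GcContent() {}

    /** Fraction of G and C residues in one sequence; 0.0 for an empty sequence. */
    static double gcFraction(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }
}

/** Developer side: the caller decides whether and how to parallelize. */
final class GcBatch {
    private GcBatch() {}

    /** Runs gcFraction on each sequence using a pool whose size the caller chooses. */
    static List<Double> gcOfAll(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<Future<Double>>();
            for (final String sequence : sequences) {
                futures.add(pool.submit(new Callable<Double>() {
                    @Override
                    public Double call() {
                        return GcContent.gcFraction(sequence);
                    }
                }));
            }
            // Collect in submission order so results line up with the input list.
            List<Double> fractions = new ArrayList<Double>();
            for (Future<Double> future : futures) {
                fractions.add(future.get());
            }
            return fractions;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for GC tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("GC task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because gcFraction keeps no shared state, the same method could just as well be wrapped as one task per sequence on a grid or in an app server, with no threads created by the library itself.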
>
> Possibly the best situation would be to have single-threaded, fine-grained
> operations that let developers or grid engines control threading, and then
> higher-level APIs that do it for you (or good cookbook examples that show
> you how to do it).  Another idea that was discussed was the use of
> properties files to allow people to set how many CPUs they wanted to make
> available to the JVM or name packages that can or cannot use threading.
>
> Finally, there are lots of times when it is highly desirable to use Java
> beans because they play well with dozens of Java APIs. However, beans don't
> work well with threads because they have public setter methods.  I would
> like to see a lot more bean use in a future BioJava because it would make
> life so much easier, but a lot of care would need to be taken to make sure
> thread safety is preserved.  There are many patterns that can be used, such
> as synchronization locks, to make things thread safe, so I think this can
> be achieved as long as we are disciplined and consider that all methods may
> be used in a multi-threaded application (even if we write the method as a
> single thread).  If there are code checkers that make suggestions on thread
> safety it would be great to have these as part of the standard build
> process.  Good documentation would go a long way as well.  Are there unit
> test patterns that can catch these problems as well?  Suggestions would be
> great.
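
One of the patterns mentioned above, guarding a bean's accessors with a lock, can be sketched as follows. SequenceBean is a hypothetical class invented for illustration, and synchronizing every accessor is only one option among several (immutable value objects are another).

```java
/** Hypothetical bean whose state is guarded by its own monitor. */
final class SequenceBean {
    private String name = "";
    private String residues = "";

    public SequenceBean() {}

    public synchronized String getName() {
        return name;
    }

    public synchronized void setName(String name) {
        this.name = name;
    }

    public synchronized String getResidues() {
        return residues;
    }

    public synchronized void setResidues(String residues) {
        this.residues = residues;
    }

    /** Updates both fields under one lock so no reader sees a half-updated record. */
    public synchronized void setAll(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }

    /** Reads both fields under the same lock, giving a consistent snapshot. */
    public synchronized String toFasta() {
        return ">" + name + "\n" + residues;
    }
}
```

Synchronized getters and setters alone do not make a sequence of calls atomic: calling setName() and then setResidues() from one thread can still interleave with a reader in another. That is why the sketch adds setAll() and toFasta() as compound operations under a single lock.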
>
> Progress Listener patterns are good but it depends on the situation and
> might be better handled in high level APIs or left to the developer.  For
> example in your NJ code a progress listener would be good if someone fed
> 1000 sequences into the method but not if they only put in 10. Also code
> running on an old machine might need a progress listener but the same
> problem on a new machine may complete almost instantly.  Probably a
> pluggable listener would be the way to go.  Also it might be possible to do
> this using the new JDK APIs that let you take a peek at the stack trace.
> Even if your NJ method didn't allow for a progress listener a developer
> could still make one by looking at the method calls in the stack. As long as
> your NJ method called other methods internally for each sequence (quite
> likely) it would be possible to observe the cycle of method calls from the
> stack.  This might make it possible to have a very general BioJava progress
> listener that can be told to count the number of times a method is called in
> the stack. The name of the method would be the argument.  If the application
> runs in a Java App server you can also do this very easily with a method
> Interceptor.
>
> - Mark
>
> biojava-dev-bounces at lists.open-bio.org wrote on 05/11/2009 09:50:58 PM:
>
>> Andreas
>>
>> Another theme that should be considered is providing a multi-threaded
>> version of any module with a long run time. This would have a couple of
>> elements. A progress listener interface should be standard, where core
>> code would update progress messages to listeners that can be used by
>> external code to display feedback to the user. I did this with the
>> Neighbor Joining code for tree construction and it provides needed
>> feedback in a GUI. If not, the user gets frustrated because they don't
>> know whether the code they are about to execute may take 10 minutes or
>> 8 hours to complete, and they think the software is not working. The reverse is
>> also true for canceling an operation where you want to have core code
>> stop processing a long running loop. Once the code has completed then
>> the listener interface for process complete is called allowing the next
>> step in the external code to continue. The developer would have the
>> choice to call the "process" method or run it in a thread and wait for
>> the callback complete method to be called.
>>
>> This is the first step in the ability to have the core/long running
>> processes take advantage of multiple threads to complete the
>> computational task faster. Not all code can be parallelized easily but
>> if the algorithm can take advantage of running in parallel then it
>> should. This then opens up a couple of cloud computing frameworks that
>> extend the multi-threaded concepts in Java across a cluster
>> (e.g. http://www.terracotta.org/). If we put an emphasis on having code that
>> runs well in a thread, we are one step closer to an architecture that can
>> run in a cloud. The computational problems are only going to get bigger,
>> and with Amazon EC2 and http://www.eucalyptus.com/ approaches,
>> computational IO cycles are going to be cheap as long as the
>> software/libraries can easily take advantage of them.
>>
>> Thanks
>>
>> Scooter
>>
>> -----Original Message-----
>> From: biojava-dev-bounces at lists.open-bio.org
>> [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas
>> Prlic
>> Sent: Monday, May 11, 2009 12:27 AM
>> To: biojava-dev
>> Subject: [Biojava-dev] Plans for next biojava release - modularization
>>
>> Hi biojava-devs,
>>
>> It is time to start working on the next biojava release. I would
>> like to modularize the current code base and apply some of the ideas
>> that have emerged around Richard's "biojava 3" code. In principle the
>> idea is that all changes should be backwards compatible with the
>> interfaces provided by the current biojava 1.7 release.  Backwards
>> compatibility shall only be broken if the functionality is being
>> replaced with something that works better, and gets documented
>> accordingly. For the build functionality I would suggest sticking with
>> what Richard's biojava 3 code base already provides. Since we will
>> try to be backwards compatible, all code development should be part of
>> the biojava-trunk, and the first step will be to move the ant build
>> scripts to a maven build process. Following this procedure will allow
>> us to use e.g. the code refactoring tools provided by Eclipse, which
>> should come in handy.
>>
>> The modules I would like to see should provide self-contained
>> functionality, and cross dependencies should be restricted to a
>> minimum. I would suggest having the following modules:
>>
>> biojava-core: Contains everything that cannot easily be modularized
>> or for which nobody volunteers to become a module maintainer.
>> biojava-phylogeny: Scooter expressed some interest in providing such a
>> module and becoming package maintainer for it.
>> biojava-structure: Everything protein structure related. I would be
>> package maintainer.
>> biojava-blast: Blast parsing is a frequently requested functionality
>> and it would be good to have this code self-contained. A package
>> maintainer for this still will need to be nominated at a later stage.
>> Any suggestions for other modules?
>>
>> Let me know what you think about this.
>>
>> Andreas
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev