[Biojava-dev] Plans for next biojava release - modularization

Andreas Prlic andreas at sdsc.edu
Wed May 13 00:45:54 UTC 2009


The point about the auto-generated code actually raises another
question for me: how should we deal with auto-generated code?

I also have some code that is currently not part of BioJava but
might be useful for other people: it parses UniProt XML files and
serializes / de-serializes the resulting objects to a database using
EJBs and Hibernate.

How far should BioJava go in supporting such auto-generated or
semi-auto-generated code?
A


On Tue, May 12, 2009 at 5:09 PM,  <mark.schreiber at novartis.com> wrote:
>
> A while back I gave Richard some code that uses JAXB to objectify (and
> deobjectify) BLAST XML output. This might be useful for parsing BLAST
> results from the webservices which normally use BLAST XML. I could probably
> dig it up again if needed (it was autogenerated anyway).
>
> It would probably be a good object model for BLAST output if people want to
> parse other types of BLAST output (such as flatfile, but who would want to
> do that!).  The BLAST XML seems to accommodate strange flavours of BLAST
> such as PSI-BLAST etc and also has been much more stable than the default
> flat file output.
>
> - Mark
>
>
>
> Andreas Prlic <andreas at sdsc.edu>
> Sent by: biojava-dev-bounces at lists.open-bio.org
> 05/13/2009 08:02 AM
>
> To: Scooter Willis <HWillis at scripps.edu>
> cc: biojava-dev <biojava-dev at lists.open-bio.org>
> Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
>
>
> Hi Scooter,
>
> About your suggestion for the BLAST web service client code: in
> principle I like the idea, and we have had questions about this on
> the mailing list in the past. The only thing is that I think some
> Java client code is already available:
> http://www.ebi.ac.uk/Tools/webservices/clients/blastpgp
> but I am not sure how good that Java client library is.
>
> Besides this, our BLAST parser library needs work, and if you are
> interested in working on that you are welcome. As I mentioned, I
> think this should become its own module, due to the popularity of
> that code.
>
> Andreas
>
>
>
>
> On Tue, May 12, 2009 at 6:34 AM, Scooter Willis <HWillis at scripps.edu> wrote:
>> Mark
>>
>>
>>
>> It is a challenge to know where to draw the line. Allowing both options
>> is a reasonable approach. The implementation of the algorithm is key to
>> allowing it to be multi-threaded or to run in parallel. One approach is
>> to provide a standard interface in which process() waits for the
>> result/return value and runs in the parent thread. To run the algorithm
>> in a thread you can have a startProcess(), where you add yourself as a
>> progress listener, and when the complete() method is called you can call
>> getResults(). You can then also have the corresponding stopProcess(),
>> which would set an internal value to cause all threads to quit. There
>> are lots of ways to tackle the problem; the key is to start talking
>> about it and at minimum take advantage of multiple cores, where the
>> external code can set the number of cores to use. You can get a dual
>> quad-core machine these days for < $1000, but most software
>> implementations are not designed to take advantage of it.
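
A minimal sketch of the interface described above, for discussion only.
The method names process(), startProcess(), stopProcess(), complete() and
getResults() come from the paragraph; everything else (ProgressListener,
LongRunningTask, SumTask, awaitCompletion()) is a hypothetical name made
up for illustration and is not existing BioJava API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical listener interface (an assumption, not an existing BioJava type).
interface ProgressListener<R> {
    void progress(int done, int total);
    void complete(R result);
}

// Hypothetical base class: process() blocks, startProcess() runs in a thread.
abstract class LongRunningTask<R> {
    private final List<ProgressListener<R>> listeners = new CopyOnWriteArrayList<>();
    private volatile boolean stopRequested = false;
    private volatile R result;
    private volatile Thread worker;

    void addProgressListener(ProgressListener<R> listener) {
        listeners.add(listener);
    }

    // Subclasses do the real work here and should poll isStopRequested().
    protected abstract R compute();

    protected boolean isStopRequested() {
        return stopRequested;
    }

    protected void fireProgress(int done, int total) {
        for (ProgressListener<R> l : listeners) {
            l.progress(done, total);
        }
    }

    // Blocking variant: runs in the caller's thread and returns the result.
    R process() {
        result = compute();
        for (ProgressListener<R> l : listeners) {
            l.complete(result);
        }
        return result;
    }

    // Non-blocking variant: listeners get complete() when the work is done.
    void startProcess() {
        Thread t = new Thread(this::process);
        worker = t;
        t.start();
    }

    // Sets the internal flag; compute() decides how quickly it reacts.
    void stopProcess() {
        stopRequested = true;
    }

    R getResults() {
        return result;
    }

    // Convenience added for this sketch: wait for a started worker to finish.
    void awaitCompletion() {
        Thread t = worker;
        if (t == null) {
            return;
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

// Toy task standing in for a long-running algorithm: sums 1..n.
class SumTask extends LongRunningTask<Long> {
    private final int n;

    SumTask(int n) {
        this.n = n;
    }

    @Override
    protected Long compute() {
        long sum = 0;
        for (int i = 1; i <= n && !isStopRequested(); i++) {
            sum += i;
            fireProgress(i, n);
        }
        return sum; // partial sum if a stop was requested
    }
}
```

In this sketch a stopped task still calls complete() with whatever partial
result it has; whether that is the right contract is one of the things that
would need to be agreed on.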
>>
>>
>>
>> The real question is what exists today in the BioJava API that is
>> considered long-running in a normal use case and thus is a candidate to
>> be run in parallel. It may not be an issue in existing BioJava code.
>> When I first started using BioJava I went looking for BLAST code, only
>> to find a BLAST parser. I wanted to do a Multiple Sequence Alignment
>> (MSA), and it turns out that the BioJava code calls CLUSTALW as an
>> external process under the covers. I also needed code to construct
>> trees from an MSA and found the Summer of Code project, which was only
>> focused on representing the tree.
>>
>>
>>
>> It would be nice to have a BLAST implementation in Java optimized to run
>> on a cluster, but who has time to rewrite BLAST in Java when you can do a
>> BLAST search via the web and focus on parsing the results? BioJava needs
>> a BLAST API that makes a web services call to an external service and
>> returns structured results in core BioJava structures. It is probably
>> not difficult to do a Java version of CLUSTALW, but again we can push
>> the work out to
>> http://www.ebi.ac.uk/Tools/webservices/services/clustalw and get the
>> results back in BioJava structures.
>>
>>
>>
>> I can sign up for doing BLAST web service -> BioJava and CLUSTALW web
>> service -> BioJava code. I haven’t done the research, but it seems that
>> http://www.ebi.ac.uk/Tools/webservices/ has done a fair amount of work to
>> expose common computational biology services. If multiple external
>> services offer BLAST via web services, each with a different
>> implementation, then BioJava could provide an abstraction over the
>> different services.
>>
>>
>>
>> Thanks
>>
>> Scooter
>>
>>
>>
>> From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
>> Sent: Tuesday, May 12, 2009 1:27 AM
>> To: Scooter Willis
>> Cc: Andreas Prlic; biojava-dev
>> Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
>>
>>
>>
>> Hi -
>>
>> This was one thing we discussed previously with respect to biojava 3.
>> Generally I support the idea because almost all computers are now
>> multi-core and, as you say, cloud or utility computing is already a
>> reality.
>>
>> However, I tend to think that BioJava should not control threading or
>> concurrency. This should be done by the developer. This is because
>> multithreading can sometimes be fast on a slow computer but slow on a
>> fast computer (due to the overhead in spawning threads), so programs
>> need to be tunable. Also, Java app servers and things like Sun Grid
>> Engine, EC2, etc. don't like people attempting to control their own
>> threads. What BioJava should do is expose granular and thread-safe
>> operations that can be threaded, form discrete tasks on a utility grid,
>> or complete in SessionBeans on an app server. For example, it would be
>> better if BioJava had a single-threaded method to calculate the GC of a
>> single sequence rather than a multi-threaded method that calculates the
>> GC of multiple sequences. This would let the developer make a
>> multithreaded version if desired, or distribute multiple tasks based on
>> the single-threaded version to a compute cloud (and let the cloud manage
>> all the tasks).
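
To make the GC example concrete, here is a rough sketch under stated
assumptions: GcContent and GcBatch are hypothetical names, sequences are
plain Strings rather than BioJava sequence objects, and "GC" is taken to
mean the fraction of G and C residues. The library-side method is
single-threaded and stateless; the caller decides whether and how to
parallelize, here with a standard java.util.concurrent thread pool whose
size the caller picks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical library-side utility: one sequence, one thread, no shared state.
final class GcContent {
    private GcContent() {
    }

    // Fraction of G/C characters (case-insensitive); 0.0 for an empty sequence.
    static double of(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }
}

// Hypothetical caller-side code: the developer chooses the thread count.
final class GcBatch {
    private GcBatch() {
    }

    // Returns GC fractions in the same order as the input sequences.
    static List<Double> gcOfAll(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String s : sequences) {
                Callable<Double> task = () -> GcContent.of(s);
                futures.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The same GcContent.of call could just as well be wrapped in a grid task or
a session bean; nothing in it assumes a particular threading model.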
>>
>> Possibly the best situation would be to have single-threaded,
>> fine-grained operations that let developers or grid engines control
>> threading, and then higher-level APIs that do it for you (or good
>> cookbook examples that show you how to do it).  Another idea that was
>> discussed was the use of properties files to allow people to set how
>> many CPUs they wanted to make available to the JVM, or to name packages
>> that can or cannot use threading.
>>
>> Finally, there are lots of times when it is highly desirable to use Java
>> beans because they play well with dozens of Java APIs; however, beans
>> don't work well with threads because they have public setter methods. I
>> would like to see a lot more bean use in a future BioJava because it
>> would make life so much easier, but a lot of care would need to be taken
>> to make sure thread safety is preserved. There are many patterns, such
>> as synchronization locks, that can be used to make things thread safe,
>> so I think this can be achieved as long as we are disciplined and
>> consider that all methods may be used in a multi-threaded application
>> (even if we write the method as a single thread). If there are code
>> checkers that make suggestions on thread safety, it would be great to
>> have these as part of the standard build process. Good documentation
>> would go a long way as well. Are there unit test patterns that can catch
>> these problems as well? Suggestions would be great.
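
One possible shape for a bean that stays usable from several threads,
together with a simple stress-style check, is sketched below. SequenceBean
and BeanStressCheck are hypothetical names for illustration. The sketch
uses plain synchronized methods (one of the locking patterns mentioned
above) and adds a combined setter so that two related properties change
under a single lock.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical bean: every accessor takes the same intrinsic lock.
class SequenceBean {
    private String name = "";
    private String residues = "";

    public synchronized String getName() {
        return name;
    }

    public synchronized void setName(String name) {
        this.name = name;
    }

    public synchronized String getResidues() {
        return residues;
    }

    public synchronized void setResidues(String residues) {
        this.residues = residues;
    }

    // Compound update: readers never see a name paired with the wrong residues.
    public synchronized void setNameAndResidues(String name, String residues) {
        this.name = name;
        this.residues = residues;
    }

    // Compound read of both properties under one lock.
    public synchronized String describe() {
        return name + ":" + residues.length();
    }
}

// Hypothetical stress check: one writer flips between two valid states
// while one reader verifies it only ever observes one of those states.
final class BeanStressCheck {
    private BeanStressCheck() {
    }

    static boolean run(int iterations) {
        SequenceBean bean = new SequenceBean();
        bean.setNameAndResidues("a", "A");
        AtomicBoolean consistent = new AtomicBoolean(true);

        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                if (i % 2 == 0) {
                    bean.setNameAndResidues("bb", "BB");
                } else {
                    bean.setNameAndResidues("a", "A");
                }
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                String d = bean.describe();
                if (!d.equals("a:1") && !d.equals("bb:2")) {
                    consistent.set(false);
                }
            }
        });

        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consistent.get();
    }
}
```

A check like this can expose a race but cannot prove there is none, so it
complements rather than replaces code review and static checkers.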
>>
>> Progress listener patterns are good, but it depends on the situation, and
>> they might be better handled in high-level APIs or left to the
>> developer. For example, in your NJ code a progress listener would be
>> good if someone fed 1000 sequences into the method but not if they only
>> put in 10. Also, code running on an old machine might need a progress
>> listener, but the same problem on a new machine may complete almost
>> instantly. Probably a pluggable listener would be the way to go. It
>> might also be possible to do this using the new JDK APIs that let you
>> take a peek at the stack trace. Even if your NJ method didn't allow for
>> a progress listener, a developer could still make one by looking at the
>> method calls in the stack. As long as your NJ method called other
>> methods internally for each sequence (quite likely), it would be
>> possible to observe the cycle of method calls from the stack. This might
>> make it possible to have a very general BioJava progress listener that
>> can be told to count the number of times a method is called in the
>> stack. The name of the method would be the argument. If the application
>> runs in a Java app server you can also do this very easily with a method
>> Interceptor.
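
The stack-peeking idea could look roughly like the sketch below. It relies
only on Thread.getStackTrace() and StackTraceElement.getMethodName() from
the JDK; StackProgressProbe and ToyAlgorithm are hypothetical names, and
the polling approach is an assumption about how one might do it. Because
the stack is sampled, calls shorter than the polling interval can be
missed, so the count is a lower bound rather than an exact call count.

```java
// Hypothetical helper: watches another thread's stack for a named method.
final class StackProgressProbe {
    private StackProgressProbe() {
    }

    // True if a frame with the given method name is currently on the stack.
    static boolean isOnStack(Thread thread, String methodName) {
        for (StackTraceElement frame : thread.getStackTrace()) {
            if (frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    // Polls an already started worker until it terminates and counts how
    // often the method was seen to appear after having been absent.
    static int countObservedEntries(Thread worker, String methodName, long pollMillis) {
        int entries = 0;
        boolean wasOnStack = false;
        while (worker.isAlive()) {
            boolean onStack = isOnStack(worker, methodName);
            if (onStack && !wasOnStack) {
                entries++;
            }
            wasOnStack = onStack;
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return entries;
            }
        }
        return entries;
    }
}

// Stand-in for an algorithm that calls one method per sequence.
final class ToyAlgorithm {
    private ToyAlgorithm() {
    }

    static void processOneSequence(long millis) {
        sleepQuietly(millis);
    }

    static void runAll(int sequences, long millisPerSequence) {
        for (int i = 0; i < sequences; i++) {
            processOneSequence(millisPerSequence);
            sleepQuietly(millisPerSequence); // time spent outside the method
        }
    }

    // Checks the probe from inside a method whose name is known.
    static boolean selfCheck() {
        return StackProgressProbe.isOnStack(Thread.currentThread(), "selfCheck");
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A caller would start the worker thread first, for example
new Thread(() -> ToyAlgorithm.runAll(1000, 5)), and then hand it to
countObservedEntries together with the method name "processOneSequence".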
>>
>> - Mark
>>
>> biojava-dev-bounces at lists.open-bio.org wrote on 05/11/2009 09:50:58 PM:
>>
>>> Andreas
>>>
>>> Another theme that should be considered is providing a multi-thread
>>> version of any module with long run time. This would have a couple
>>> elements. A progress listener interface should be standard where core
>>> code would update progress messages to listeners that can be used by
>>> external code to display feedback to the user. I did this with the
>>> Neighbor Joining code for tree construction and it provides needed
>>> feedback in a GUI. If not the user gets frustrated because they don't
>>> know the code they are about to execute may take 10 minutes or 8 hours
>>> to complete and they think the software is not working. The reverse is
>>> also true for canceling an operation where you want to have core code
>>> stop processing a long running loop. Once the code has completed then
>>> the listener interface for process complete is called allowing the next
>>> step in the external code to continue. The developer would have the
>>> choice to call the "process" method or run it in a thread and wait for
>>> the callback complete method to be called.
>>>
>>> This is the first step in the ability to have the core/long running
>>> processes take advantage of multiple threads to complete the
>>> computational task faster. Not all code can be parallelized easily, but
>>> if the algorithm can take advantage of running in parallel then it
>>> should. This then opens up a couple of cloud computing frameworks that
>>> extend the multi-threaded concepts in Java across a cluster
>>> (e.g. http://www.terracotta.org/). If we put an emphasis on having code
>>> that runs well in a thread, we are one step closer to an architecture
>>> that can run in a cloud. The computational problems are only going to
>>> get bigger, and with Amazon EC2 and http://www.eucalyptus.com/
>>> approaches, computational IO cycles are going to be cheap as long as
>>> the software/libraries can easily take advantage of them.
>>>
>>> Thanks
>>>
>>> Scooter
>>>
>>> -----Original Message-----
>>> From: biojava-dev-bounces at lists.open-bio.org
>>> [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas
>>> Prlic
>>> Sent: Monday, May 11, 2009 12:27 AM
>>> To: biojava-dev
>>> Subject: [Biojava-dev] Plans for next biojava release - modularization
>>>
>>> Hi biojava-devs,
>>>
>>> It is time to start working on the next biojava release.  I would
>>> like to modularize the current code base and apply some of the ideas
>>> that have emerged around Richard's "biojava 3" code. In principle the
>>> idea is that all changes should be backwards compatible with the
>>> interfaces provided by the current biojava 1.7 release.  Backwards
>>> compatibility shall only be broken if the functionality is being
>>> replaced with something that works better, and the change gets
>>> documented accordingly. For the build functionality I would suggest
>>> sticking with what Richard's biojava 3 code base already provides.
>>> Since we will try to be backwards compatible, all code development
>>> should be part of the biojava-trunk, and the first step will be to move
>>> from the ant build scripts to a maven build process. Following this
>>> procedure will allow us to use e.g. the code refactoring tools provided
>>> by Eclipse, which should come in handy.
>>>
>>> The modules I would like to see should provide self-contained
>>> functionality, and cross dependencies should be restricted to a
>>> minimum. I would suggest the following modules:
>>>
>>> biojava-core: Contains everything that cannot easily be modularized
>>> or for which nobody volunteers to become a module maintainer.
>>> biojava-phylogeny: Scooter expressed some interest in providing such a
>>> module and becoming package maintainer for it.
>>> biojava-structure: Everything protein structure related. I would be
>>> package maintainer.
>>> biojava-blast: Blast parsing is a frequently requested functionality
>>> and it would be good to have this code self-contained. A package
>>> maintainer for this will still need to be nominated at a later stage.
>>> Any suggestions for other modules?
>>>
>>> Let me know what you think about this.
>>>
>>> Andreas
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev



