[Biojava-dev] Plans for next biojava release - modularization

mark.schreiber at novartis.com mark.schreiber at novartis.com
Wed May 13 02:15:27 UTC 2009


Hi -

I think it depends if the code is going to be auto-generated at each build 
or only once.  I have autogenerated Entity classes for BioSQL tables. My 
recommendation would be that these be used for JPA mapping to BioSQL from 
BioJava.  I think these only need be generated once (unless the BioSQL 
schema changes), especially as the autogeneration didn't quite catch some 
of the subtleties of the schema.  They can also be in their own module, 
not the core.

Classes that map to XML or webservice clients can be autogenerated from 
XML schema, DTD or WSDL once or at every build (automatically from Ant and 
probably Maven).  In these cases it may pay to do it with every build 
because these classes are completely boilerplate code and should never 
need to be manually modified.  It also means the code for these utility 
classes will never be in the code base, so it will not be possible for 
someone to change it accidentally (and the code base will be smaller). 
Only the XSD or WSDL will be in subversion (and any higher level code that 
makes use of the boilerplate client code).  Improvements in the 
boilerplate code or changes that come with updates to JAXB and similar 
will automatically appear at the next build (when we change JAXB 
versions).

Conceptually the BLAST XML parsing module may consist of only the BLAST 
XSD (or DTD) and a high-level biojava class like the following:

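import java.io.Serializable;
import java.net.URL;

public interface BlastParser {
        // Parses BLAST XML fetched from a URL; an implementation would
        // call the generated boilerplate code.
        Serializable[] parseBlast(URL url);

        // Parses BLAST XML held in a string; an implementation would
        // call the generated boilerplate code.
        Serializable[] parseBlast(String blastXMLOutput);
}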

The code for the bit that does the JAXB unmarshalling etc could be generated 
at build time.  The Serializable array would be the objects that JAXB 
generates. Probably they would be a more specific stub that implements 
Serializable (e.g. BlastResult or similar depending on the XSD).
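
To make the shape of that high-level class a bit more concrete, here is a 
minimal sketch.  It is an illustration only: BlastHit and DomBlastParser 
are hypothetical names, and the JDK's built-in DOM parser stands in for 
the JAXB classes that would really be generated from the XSD or DTD at 
build time.  It assumes the usual BLAST XML element names Hit, Hit_id and 
Hit_def, and it does not configure any DTD handling for the DOCTYPE that 
real BLAST output carries.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical result stub; generated JAXB classes would take its place.
class BlastHit implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String definition;

    BlastHit(String id, String definition) {
        this.id = id;
        this.definition = definition;
    }
}

// Hypothetical hand-written stand-in for the generated parsing code.
class DomBlastParser {

    // Fetches BLAST XML from a URL and parses it.
    Serializable[] parseBlast(URL url) {
        try (InputStream in = url.openStream()) {
            return parse(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses BLAST XML that is already held in memory.
    Serializable[] parseBlast(String blastXMLOutput) {
        byte[] bytes = blastXMLOutput.getBytes(StandardCharsets.UTF_8);
        return parse(new ByteArrayInputStream(bytes));
    }

    // Collects one BlastHit per <Hit> element in the document.
    private Serializable[] parse(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList hits = doc.getElementsByTagName("Hit");
            List<BlastHit> result = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element hit = (Element) hits.item(i);
                result.add(new BlastHit(text(hit, "Hit_id"),
                                        text(hit, "Hit_def")));
            }
            return result.toArray(new Serializable[0]);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse BLAST XML", e);
        }
    }

    // Returns the text of the first nested element with this tag, or "".
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }
}
```

With generated JAXB classes the private parse method would shrink to a 
single unmarshal call, and BlastHit would be replaced by whatever types 
the schema compiler produces.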

I think it really comes down to a question of how much the generated code 
is boilerplate code that will never be changed. If it is not 'modifiable' 
then it can be generated at build. If the autogenerated code is an outline 
of a class where method bodies need to be filled in or customized then 
they should not be autogenerated at build time.  A good example would be 
JUnit classes that can be autogenerated to give you a template that will 
compile and run but probably will not perform a sensible test.  The 
developer of the test could autogenerate the template but would then need 
to make the test sensible. At that point the test should be in the code 
base and should not be regenerated at build time.

- Mark

biojava-dev-bounces at lists.open-bio.org wrote on 05/13/2009 08:45:54 AM:

> The point with the auto-generated code raises actually another
> question to me: How shall we deal with auto-generated code?
> 
> I also have some code that is currently not part of BioJava, but it
> might be useful for other people: It parses uniprot XML files and
> serializes / de-serializes the objects to a database using EJBs,
> hibernate and the uniprot XML files.
> 
> How far should biojava go in supporting such auto generated or
> semi-auto generated code?
> A
> 
> 
> On Tue, May 12, 2009 at 5:09 PM,  <mark.schreiber at novartis.com> wrote:
> >
> > A while back I gave Richard some code that uses JAXB to objectify (and
> > deobjectify) BLAST XML output. This might be useful for parsing BLAST
> > results from the webservices which normally use BLAST XML. I could
> > probably dig it up again if needed (it was autogenerated anyway).
> >
> > It would probably be a good object model for BLAST output if people
> > want to parse other types of BLAST output (such as flatfile, but who
> > would want to do that!).  The BLAST XML seems to accommodate strange
> > flavours of BLAST such as PSI-BLAST etc and also has been much more
> > stable than the default flat file output.
> >
> > - Mark
> >
> >
> >
> > Andreas Prlic <andreas at sdsc.edu>
> > Sent by: biojava-dev-bounces at lists.open-bio.org
> >
> > 05/13/2009 08:02 AM
> >
> > To
> > Scooter Willis <HWillis at scripps.edu>
> > cc
> > biojava-dev <biojava-dev at lists.open-bio.org>
> > Subject
> > Re: [Biojava-dev] Plans for next biojava release - modularization
> >
> >
> >
> >
> > Hi Scooter,
> >
> > about your suggestion for the blast webservice client code: In
> > principle I like the idea and we have had questions on the mailing
> > list regarding this in the past. Only thing is I think there is
> > already some client code in java available:
> > http://www.ebi.ac.uk/Tools/webservices/clients/blastpgp
> > but I am not sure how good that Java client library is....
> >
> > Besides this, there is the need for work on our blast parser library
> > and if you are interested in working on that you are welcome. As I
> > mentioned, I think this should become its own module, due to the
> > popularity of that code.
> >
> > Andreas
> >
> >
> >
> >
> > On Tue, May 12, 2009 at 6:34 AM, Scooter Willis <HWillis at scripps.edu> wrote:
> >> Mark
> >>
> >>
> >>
> >> It is a challenge to know where to draw the line. Allowing both
> >> options is a reasonable approach. The implementation of the algorithm
> >> is key to allowing it to be multi-threaded or to run in parallel. One
> >> approach is to provide a standard interface in which process() would
> >> wait for the result/return value and run in the parent thread. To run
> >> the algorithm in a thread you can have a startProcess() where you can
> >> add yourself as a progress listener, and when the complete() method is
> >> called you can call getResults(). You can then also have the
> >> corresponding stopProcess(), which would set an internal value to
> >> cause all threads to quit.  There are lots of ways to tackle the
> >> problem; the key is to start talking about it and at minimum take
> >> advantage of multiple cores, where the external code can set the
> >> number of cores to use. You can get a dual quad core machine these
> >> days for < $1000 but most software implementations are not designed to
> >> take advantage of it.
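
The process() / startProcess() / stopProcess() / getResults() idea quoted 
above can be sketched roughly as follows.  This is only one possible 
reading of it: ProgressListener and LongRunningTask are hypothetical 
names, not existing BioJava API, and the summing loop is a placeholder for 
real work.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical listener; none of these names are existing BioJava API.
interface ProgressListener {
    void progress(int done, int total);
    void complete();
}

// Sketch of a long-running task with blocking and threaded entry points.
class LongRunningTask {
    private final int steps;
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private volatile ProgressListener listener;
    private volatile int result;

    LongRunningTask(int steps) {
        this.steps = steps;
    }

    void setProgressListener(ProgressListener l) {
        listener = l;
    }

    // Blocking variant: runs in the caller's thread and returns the result.
    int process() {
        int sum = 0;
        for (int i = 1; i <= steps && !stopRequested.get(); i++) {
            sum += i; // placeholder for one unit of real work
            ProgressListener l = listener;
            if (l != null) l.progress(i, steps);
        }
        result = sum;
        ProgressListener l = listener;
        if (l != null) l.complete();
        return sum;
    }

    // Threaded variant: returns immediately; wait for complete().
    Thread startProcess() {
        Thread t = new Thread(this::process, "long-running-task");
        t.start();
        return t;
    }

    // Asks the running loop to quit at its next iteration.
    void stopProcess() {
        stopRequested.set(true);
    }

    // Meaningful once complete() has been signalled.
    int getResults() {
        return result;
    }
}
```

Note that in this sketch the caller still decides whether a thread is 
created at all: process() never spawns one, and startProcess() is just a 
convenience wrapper around it.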
> >>
> >>
> >>
> >> The real question is what exists today in the BioJava API that is
> >> considered long running in the normal use case and thus is a candidate
> >> to be run in parallel. It may not be an issue in existing BioJava
> >> code. When I first started using BioJava I went looking for BLAST code
> >> only to find a BLAST parser. I wanted to do a Multiple Sequence
> >> Alignment and it turns out that the BioJava code calls CLUSTALW as an
> >> external process under the covers.  I also needed code to construct
> >> trees from an MSA and found the summer of code project that was only
> >> focused on representing the tree.
> >>
> >>
> >>
> >> It would be nice to have a BLAST implementation in Java optimized to
> >> run on a cluster, but who has time to rewrite BLAST in Java when you
> >> can do a BLAST search via the web and focus on parsing the results?
> >> BioJava needs a BLAST API that makes a web services call to an
> >> external service and returns structured results in core BioJava
> >> structures. It is probably not difficult to do a Java version of
> >> CLUSTALW, but again we can push the work out to
> >> http://www.ebi.ac.uk/Tools/webservices/services/clustalw and get the
> >> results back in BioJava structures.
> >>
> >>
> >>
> >> I can sign up for doing BLAST web service -> BioJava and CLUSTALW web
> >> service -> BioJava code. I haven't done the research but it seems that
> >> http://www.ebi.ac.uk/Tools/webservices/ has done a fair amount of work
> >> to expose common biology computational services. If multiple external
> >> services are offering BLAST via web services where each picked a
> >> different implementation then BioJava could provide an abstraction
> >> over the different services.
> >>
> >>
> >>
> >> Thanks
> >>
> >> Scooter
> >>
> >>
> >>
> >> From: mark.schreiber at novartis.com [mailto:mark.schreiber at novartis.com]
> >> Sent: Tuesday, May 12, 2009 1:27 AM
> >> To: Scooter Willis
> >> Cc: Andreas Prlic; biojava-dev
> >> Subject: Re: [Biojava-dev] Plans for next biojava release - modularization
> >>
> >>
> >>
> >> Hi -
> >>
> >> This was one thing we discussed previously with respect to biojava 3.
> >> Generally I support the idea because almost all computers are now
> >> multi-core and as you say cloud or utility computing is already a
> >> reality.
> >>
> >> However, I tend to think that biojava should not control threading or
> >> concurrency. This should be done by the developer. This is because
> >> sometimes multithreading can be fast on a slow computer but slow on a
> >> fast computer (due to the overhead in spawning threads) so programs
> >> need to be tunable. Also Java app servers and things like Sun Grid
> >> Engine, EC2 etc don't like people attempting to control their own
> >> threads.  What BioJava should do is expose granular and thread-safe
> >> operations that can be threaded or form discrete tasks on a utility
> >> grid or complete in SessionBeans on an App server.  For example it
> >> would be better if BioJava had a single threaded method to calculate
> >> the GC of a single sequence rather than a multi-threaded method that
> >> calculates the GC of multiple sequences.  This would let the developer
> >> make a multithreaded version if desired or distribute multiple tasks
> >> based on the single threaded version to a compute cloud (and let the
> >> cloud manage all the tasks).
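
As a rough sketch of the GC example above: the library would offer only 
the single-sequence method, and threading would live in caller code.  
GcContent is a hypothetical class, not an existing BioJava one, and 
sequences are plain Strings here purely to keep the example 
self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical utility; not an existing BioJava class.
class GcContent {

    // Library side: single-threaded and stateless, hence thread-safe.
    // Returns the fraction of G and C symbols (0.0 for an empty sequence).
    static double gcContent(String sequence) {
        if (sequence.isEmpty()) return 0.0;
        int gc = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if (c == 'G' || c == 'C') gc++;
        }
        return (double) gc / sequence.length();
    }

    // Caller side: the developer, not the library, chooses the pool size.
    static List<Double> gcContentAll(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String s : sequences) {
                Callable<Double> task = () -> gcContent(s);
                futures.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get()); // results keep the input order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("GC calculation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The same gcContent method could just as well be called from a grid task or 
a session bean, with the container managing the threads instead of an 
ExecutorService.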
> >>
> >> Possibly the best situation would be to have the single threaded fine
> >> grain operations that let developers or grid engines control threading
> >> and then higher level APIs that do it for you (or good cookbook
> >> examples that show you how to do it).  Another idea that was discussed
> >> was the use of properties files to allow people to set how many CPUs
> >> they wanted to make available to the JVM or name packages that can or
> >> cannot use threading.
> >>
> >> Finally, there are lots of times when it is highly desirable to use
> >> Java beans because they play well with dozens of Java APIs; however,
> >> beans don't work well with threads because they have public setter
> >> methods.  I would like to see a lot more bean use in a future BioJava
> >> because it would make life so much easier but a lot of care would need
> >> to be taken to make sure thread safety is preserved.  There are many
> >> patterns that can be used such as synchronization locks etc to make
> >> things thread safe so I think this can be achieved as long as we are
> >> disciplined and consider that all methods may be used in a
> >> multi-threaded application (even if we write the method as a single
> >> thread).  If there are code checkers that make suggestions on thread
> >> safety it would be great to have these as part of the standard build
> >> process.  Good documentation would go a long way as well.  Are there
> >> unit test patterns that can catch these problems as well?  Suggestions
> >> would be great.
> >>
> >> Progress Listener patterns are good but it depends on the situation
> >> and might be better handled in high level APIs or left to the
> >> developer.  For example in your NJ code a progress listener would be
> >> good if someone fed 1000 sequences into the method but not if they
> >> only put in 10. Also code running on an old machine might need a
> >> progress listener but the same problem on a new machine may complete
> >> almost instantly.  Probably a pluggable listener would be the way to
> >> go.  Also it might be possible to do this using the new JDK APIs that
> >> let you take a peek at the stack trace. Even if your NJ method didn't
> >> allow for a progress listener a developer could still make one by
> >> looking at the method calls in the stack. As long as your NJ method
> >> called other methods internally for each sequence (quite likely) it
> >> would be possible to observe the cycle of method calls from the stack.
> >> This might make it possible to have a very general BioJava progress
> >> listener that can be told to count the number of times a method is
> >> called in the stack. The name of the method would be the argument.  If
> >> the application runs in a Java App server you can also do this very
> >> easily with a method Interceptor.
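
A minimal sketch of the stack-peeking idea, assuming Thread.getStackTrace() 
is the JDK API meant above.  StackProgressProbe is a hypothetical helper, 
and recurse() exists only so that the example has a predictable stack to 
look at.

```java
// Hypothetical helper; not an existing BioJava class.
class StackProgressProbe {

    // Counts the frames in the thread's current stack whose method name
    // matches the argument. This is a snapshot of one instant only.
    static int countFrames(Thread thread, String methodName) {
        int count = 0;
        for (StackTraceElement frame : thread.getStackTrace()) {
            if (frame.getMethodName().equals(methodName)) count++;
        }
        return count;
    }

    // Demo target: recurses 'depth' levels deep, then inspects its own
    // stack and reports how many recurse() frames are on it.
    static int recurse(int depth) {
        if (depth <= 1) {
            return countFrames(Thread.currentThread(), "recurse");
        }
        return recurse(depth - 1);
    }
}
```

A progress monitor built on this would call countFrames on the worker 
thread from a separate thread at intervals.  Because each call is only a 
sample, it can show which method the worker is currently in, but it would 
approximate progress rather than count every call exactly.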
> >>
> >> - Mark
> >>
> >> biojava-dev-bounces at lists.open-bio.org wrote on 05/11/2009 09:50:58 PM:
> >>
> >>> Andreas
> >>>
> >>> Another theme that should be considered is providing a multi-thread
> >>> version of any module with long run time. This would have a couple
> >>> elements. A progress listener interface should be standard where
> >>> core code would update progress messages to listeners that can be
> >>> used by external code to display feedback to the user. I did this
> >>> with the Neighbor Joining code for tree construction and it provides
> >>> needed feedback in a GUI. If not the user gets frustrated because
> >>> they don't know the code they are about to execute may take 10
> >>> minutes or 8 hours to complete and they think the software is not
> >>> working. The reverse is also true for canceling an operation where
> >>> you want to have core code stop processing a long running loop. Once
> >>> the code has completed then the listener interface for process
> >>> complete is called allowing the next step in the external code to
> >>> continue. The developer would have the choice to call the "process"
> >>> method or run it in a thread and wait for the callback complete
> >>> method to be called.
> >>>
> >>> This is the first step in the ability to have the core/long running
> >>> processes take advantage of multiple threads to complete the
> >>> computational task faster. Not all code can be parallelized easily
> >>> but if the algorithm can take advantage of running in parallel then
> >>> it should. This then opens up a couple of cloud computing frameworks
> >>> that extend the multi-threaded concepts in Java across a cluster
> >>> http://www.terracotta.org/. If we put an emphasis on having code
> >>> that runs well in a thread we are one step closer to an architecture
> >>> that can run in a cloud. The computational problems are only going
> >>> to get bigger and with Amazon EC2 and http://www.eucalyptus.com/
> >>> approaches computational IO cycles are going to be cheap as long as
> >>> the software/libraries can easily take advantage of it.
> >>>
> >>> Thanks
> >>>
> >>> Scooter
> >>>
> >>> -----Original Message-----
> >>> From: biojava-dev-bounces at lists.open-bio.org
> >>> [mailto:biojava-dev-bounces at lists.open-bio.org] On Behalf Of Andreas
> >>> Prlic
> >>> Sent: Monday, May 11, 2009 12:27 AM
> >>> To: biojava-dev
> >>> Subject: [Biojava-dev] Plans for next biojava release - modularization
> >>>
> >>> Hi biojava-devs,
> >>>
> >>> It is time to start working on the next biojava release.  I would
> >>> like to modularize the current code base and apply some of the ideas
> >>> that have emerged around Richard's "biojava 3" code. In principle
> >>> the idea is that all changes should be backwards compatible with the
> >>> interfaces provided by the current biojava 1.7 release.  Backwards
> >>> compatibility shall only be broken if the functionality is being
> >>> replaced with something that works better, and gets documented
> >>> accordingly. For the build functionality I would suggest sticking
> >>> with what Richard's biojava 3 code base already provides. Since we
> >>> will try to be backwards compatible all code development should be
> >>> part of the biojava-trunk and the first step will be to move the
> >>> ant-build scripts to a maven build process. Following this procedure
> >>> will allow us to use e.g. the code refactoring tools provided by
> >>> Eclipse, which should come in handy.
> >>>
> >>> The modules I would like to see should provide self-contained
> >>> functionality and cross dependencies should be restricted to a
> >>> minimum. I would suggest the following modules:
> >>>
> >>> biojava-core: Contains everything that can not easily be modularized
> >>> or for which nobody volunteers to become a module maintainer.
> >>> biojava-phylogeny: Scooter expressed some interest in providing such
> >>> a module and becoming package maintainer for it.
> >>> biojava-structure: Everything protein structure related. I would be
> >>> package maintainer.
> >>> biojava-blast: Blast parsing is a frequently requested functionality
> >>> and it would be good to have this code self-contained. A package
> >>> maintainer for this still will need to be nominated at a later
> >>> stage.
> >>> Any suggestions for other modules?
> >>>
> >>> Let me know what you think about this.
> >>>
> >>> Andreas
> >>> _______________________________________________
> >>> biojava-dev mailing list
> >>> biojava-dev at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
> >>
> >
> >
> >
> 



