[Biojava-l] BioInformatics toolbox.

Simon Brocklehurst simon.brocklehurst@CambridgeAntibody.com
Mon, 15 Apr 2002 19:50:12 +0100

Patrick McConnell wrote:
> We (Duke Bioinformatics Shared Resource) are going to make a concerted
> effort to create a software package that allows the piping of web services.
> We are going to focus on bioinformatics web services, though the project
> should be generally applicable to all web services.  We are now in the
> process of applying for a grant.
> A few questions:
> 1. We plan for these tools to be open source.  Is there a place for this
> under BioJava?  Of some concern: we will be defining some bioinformatics
> data types with java interfaces and XML schemas for common bioinformatics
> programs.  We will also be providing interfaces for web services, as well
> as implementations (implementations will most likely have abstract
> functions for returning installation dependent information e.g. path to
> blast databases).  Will this sort of thing fit under BioJava?  I believe
> the GUI and mechanics of piping would fall under BioJava with no problem
> (let me know if I am wrong in this assumption), but I was unsure whether
> application-specific code would fit.


For information, in case you don't know, there exists in biojava:

o A software framework that makes it easy to build SAX2 drivers for
"legacy" bioinformatics file formats i.e. non-XML file formats.  It's
far from perfect, rather it is a pragmatic implementation that allows
people to knock out new SAX drivers rapidly.

o Mappings from the output of common bioinformatics programs into
common XML formats, built on the SAX2 framework mentioned above.  For
example, the framework maps NCBI Blast, Wu-Blast, and HMMER all into a
single, common XML format; and ClustalW and Needle into a single, common
XML format.  It also defines a format for 3-D structure (currently only
PDB format is supported).  The benefits of sharing common XML formats
across bioinformatics programs are that it facilitates re-use of Java
objects within/across (but particularly *within*) given bespoke Java
applications that use multiple bioinformatics programs, and that it
reduces the learning curve.
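
To make the "SAX2 driver for a legacy format" idea concrete, here is a
minimal sketch.  It does not use the biojava classes themselves: the
class name TabularHitDriver, the tab-separated "id<TAB>score" input, and
the HitList/Hit element names are all made up for illustration.  The only
real API involved is the standard org.xml.sax ContentHandler interface,
which the driver calls as if it were an XML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical driver: reads "id<TAB>score" lines and emits SAX2 events. */
class TabularHitDriver {
    private final ContentHandler handler;

    TabularHitDriver(ContentHandler handler) {
        this.handler = handler;
    }

    void parse(BufferedReader in) throws IOException, SAXException {
        handler.startDocument();
        handler.startElement("", "HitList", "HitList", new AttributesImpl());
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines in this sketch
            }
            // Each record becomes one empty <Hit id="..." score="..."/> element.
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "id", "id", "CDATA", fields[0]);
            atts.addAttribute("", "score", "score", "CDATA", fields[1]);
            handler.startElement("", "Hit", "Hit", atts);
            handler.endElement("", "Hit", "Hit");
        }
        handler.endElement("", "HitList", "HitList");
        handler.endDocument();
    }
}

class TabularHitDriverDemo {
    public static void main(String[] args) throws Exception {
        // Any ContentHandler can consume the events; this one just prints them.
        ContentHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("Hit".equals(qName)) {
                    System.out.println(atts.getValue("id") + " " + atts.getValue("score"));
                }
            }
        };
        String legacy = "P12345\t87.5\nQ67890\t12.0\n";
        new TabularHitDriver(printer).parse(new BufferedReader(new StringReader(legacy)));
    }
}
```

The point of the design is that the code consuming the events never sees
the legacy format; a second driver for a different program could emit
the same element names and the same handler would work unchanged.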

The underlying idea behind the SAX2 framework is that, for many
real-world use cases, re-use of Java objects/interfaces is difficult to
achieve *across* distinct use cases. The need to parse file outputs from
a variety of programs, however, doesn't go away simply because people
have different requirements for objects in different use cases.  Thus,
event-based parsing is a good idea for reasons of improved productivity
- writing an XML Content Handler is orders of magnitude faster than
writing a parser.  And more than just event-based parsing, the biojava
framework is based on an existing standard (SAX2) which reduces the
learning curve for people. That is, serious developers either will or
should be familiar with SAX2 already.
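
As a rough illustration of why writing a Content Handler is quick, the
sketch below builds application-specific objects (here just a list of
ids) from SAX2 events.  The HitList/Hit element names, the id and score
attributes, and the class names are assumptions made for the example,
not the actual biojava XML format.  The events come from the JDK's own
SAX parser reading an XML string, standing in for whatever driver would
produce them in practice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps the ids of hits scoring at or above a cutoff. */
class HitCollector extends DefaultHandler {
    private final double minScore;
    private final List<String> ids = new ArrayList<String>();

    HitCollector(double minScore) {
        this.minScore = minScore;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (!"Hit".equals(qName)) {
            return; // ignore every element this use case does not care about
        }
        String score = atts.getValue("score");
        if (score != null && Double.parseDouble(score) >= minScore) {
            ids.add(atts.getValue("id"));
        }
    }

    List<String> getIds() {
        return ids;
    }
}

class HitCollectorDemo {
    static List<String> collect(String xml, double minScore) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        HitCollector collector = new HitCollector(minScore);
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.getIds();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<HitList>"
                + "<Hit id=\"P12345\" score=\"87.5\"/>"
                + "<Hit id=\"Q67890\" score=\"12.0\"/>"
                + "</HitList>";
        System.out.println(collect(xml, 50.0));
    }
}
```

A different use case would write a different handler of about the same
size against the same events, which is the sense in which the parsing
effort is shared even when the objects are not.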

Finally, if you're interested in developing standards based on Java and
XML, you may be interested in getting involved in I3C (www.i3c.org).
We (CAT) are not involved in I3C ourselves, so I'm afraid I can't
comment on whether "web services" is their bag or not.

Good luck with your project, by the way; it sounds like you're set to have
some fun!

Simon M. Brocklehurst, Ph.D.
Head of Bioinformatics & Advanced IS
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK