[Biojava-l] BioJavaX ready for testing

Fri Nov 4 05:29:00 EST 2005

Richard has done a really excellent job of writing some pretty 
comprehensive docs here with lots of examples. You should be able to use 
them to take BioJavaX out for a spin!

- Mark

"Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
Sent by: biojava-l-bounces at portal.open-bio.org
10/31/2005 05:28 PM

        To:     <biojava-l at biojava.org>
        cc:     Biosql <biosql-l at open-bio.org>, (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] BioJavaX ready for testing

Hello people!

Mark is away so I'm taking the liberty of sneaking this one out... :)

I've cross-posted this to both BioJava and BioSQL as much of what is new 
in BioJavaX will probably be of interest to BioSQL users too.

We've been doing a lot of work recently on creating some extensions to 
BioJava called BioJavaX. Primarily the purpose of these extensions is to 
provide better interaction with BioSQL databases, which has been achieved 
using Hibernate (www.hibernate.org). You can now fully interact with every 
column of every table in BioSQL, using Hibernate's own HQL language to 
construct queries that result in sets of BioJavaX objects. Selects, 
inserts, updates, primary key assignment, foreign key relations, and 
deletes are all handled transparently by Hibernate, removing the need for 
any SQL at all to be included in BioJavaX.

As a side effect of constructing a Hibernate-compatible extension to the 
BioJava object model, we were required to define objects that hold much 
more detailed information about themselves. For instance, a Sequence 
object cannot tell you which namespace it belongs to in the BioSQL 
database, but our extension to it, RichSequence, can. As RichSequence 
extends Sequence and doesn't replace it, you can pass the new objects to 
your existing code without having to cast them.
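
As a rough illustration of the "extend, don't replace" point, here is a 
minimal sketch using toy stand-in types. ToySequence, ToyRichSequence, 
ToyRichSequenceImpl and LegacyCode are made-up names for this example and 
are not the real BioJava or BioJavaX interfaces; only the shape of the 
relationship is the same.

```java
// Toy stand-ins for illustration only; these are NOT the real BioJava or
// BioJavaX interfaces, just the same "extend, don't replace" shape.
interface ToySequence {
    String getName();
    String seqString();
}

// The richer type adds detail (here, a namespace) on top of the base type.
interface ToyRichSequence extends ToySequence {
    String getNamespace();
}

final class ToyRichSequenceImpl implements ToyRichSequence {
    private final String namespace;
    private final String name;
    private final String residues;

    ToyRichSequenceImpl(String namespace, String name, String residues) {
        this.namespace = namespace;
        this.name = name;
        this.residues = residues;
    }

    @Override public String getNamespace() { return namespace; }
    @Override public String getName() { return name; }
    @Override public String seqString() { return residues; }
}

final class LegacyCode {
    // "Existing code": written against the base type only.
    static String describe(ToySequence seq) {
        return seq.getName() + ":" + seq.seqString().length();
    }
}

class SubtypeDemo {
    public static void main(String[] args) {
        ToyRichSequence rich = new ToyRichSequenceImpl("lcl", "seq1", "ACGT");
        System.out.println(LegacyCode.describe(rich)); // accepted without a cast
        System.out.println(rich.getNamespace());       // extra detail still available
    }
}
```

The method written against the base type accepts the richer object as-is, 
while code that knows about the richer type can still ask for the 
namespace.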

To be able to load information from files into these new RichSequence 
objects in a meaningful way, we had to create a more detailed 
SeqIOListener, called RichSeqIOListener. Then, we had to create new file 
parsers for the common file formats which were able to extract more 
detailed information than before in order to satisfy the 
RichSeqIOListener. 
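
To give a feel for the parser/listener split described above, here is a 
small sketch with toy types. ToySeqListener, ToyRichSeqListener, 
CollectingListener and ToyFastaParser are hypothetical names invented for 
this example; they are not the real SeqIOListener or RichSeqIOListener 
APIs, and the parser handles only a single FASTA record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Toy stand-ins; NOT the real SeqIOListener / RichSeqIOListener interfaces.
interface ToySeqListener {
    void setName(String name);
    void addSymbols(String residues);
}

// The richer listener accepts one more kind of event than the basic one.
interface ToyRichSeqListener extends ToySeqListener {
    void setDescription(String description);
}

// Collects whatever the parser reports.
final class CollectingListener implements ToyRichSeqListener {
    String name;
    String description;
    final StringBuilder residues = new StringBuilder();

    @Override public void setName(String name) { this.name = name; }
    @Override public void setDescription(String description) { this.description = description; }
    @Override public void addSymbols(String residues) { this.residues.append(residues); }
}

final class ToyFastaParser {
    // Reads a single FASTA record and reports each piece to the listener.
    static void parse(BufferedReader in, ToyRichSeqListener listener) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                // Header: first token is the name, the rest is the description.
                String header = line.substring(1).trim();
                int space = header.indexOf(' ');
                listener.setName(space < 0 ? header : header.substring(0, space));
                listener.setDescription(space < 0 ? "" : header.substring(space + 1).trim());
            } else {
                listener.addSymbols(line);
            }
        }
    }

    // Convenience wrapper for in-memory text.
    static CollectingListener parseString(String text) {
        CollectingListener listener = new CollectingListener();
        try {
            parse(new BufferedReader(new StringReader(text)), listener);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return listener;
    }
}

class ListenerDemo {
    public static void main(String[] args) {
        CollectingListener result =
            ToyFastaParser.parseString(">seq1 test sequence\nACGT\nTTAA\n");
        System.out.println(result.name + " | " + result.description + " | " + result.residues);
    }
}
```

The point of the sketch is only that a richer listener needs a parser that 
extracts more from the file; a parser that reports just the name and the 
residues would leave the description unset.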

It's pretty safe to say that the file parsers in BioJavaX are leagues 
ahead of the existing ones in BioJava, even if I do say so myself. :P The 
downside of this extra detail though is that the parsers are much more 
sensitive and will not play well at all with incomplete or incorrectly 
formed files. If someone can edit them to be less sensitive whilst still 
retaining the level of detail required, that'd be great.

We've included parsers for FASTA, GenBank, EMBL, UniProt, INSDseq, 
EMBLxml, UniProtXML, and an extra one for parsing NCBI Taxonomy data.

Do note that BioJavaX cannot fully convert sequences created using the old 
BioJava model into the new BioJavaX model. It'll do its best, but the 
RichSequence object you'll end up with will have lots of properties set to 
null and a tonne of annotations instead, pretty much the same as the 
original Sequence object I suppose. So it's best to try to avoid 
conversions and deal with RichSequence objects from the ground up. This is 
particularly important to consider when converting a BioSQL database 
previously used with BioJava into one for use with BioJavaX. You'll also 
find that if you pass a converted old-style Sequence object to one of the 
new file parsers for writing, it may fail or produce output with lots of 
missing fields, as it will not find the information it is looking for in 
the places it expects. 

The whole lot is specifically designed to mimic and be compatible with 
BioSQL, but you don't need to have a BioSQL database to use it. Everything 
is standalone and will work just fine without a backing data source. Also 
there is no reason why you couldn't create a new set of Hibernate mappings 
that map the BioJavaX object model to some other relational database 
schema of your choice.

The upshot of it all is the org.biojavax package, which you can find in 
the biojava-live branch on CVS. Development is pretty much complete, and it 
now needs some serious testing.

We need volunteers to:

  a) test the BioSQL interaction via Hibernate with the various database 
     flavours supported (HSQL, Oracle, MySQL, PostgreSQL)
  b) test the various file formats, particularly looking for special-case 
     exceptions which the parsers may not be aware of yet
  c) do some load-testing and help us find ways to improve it if it turns 
     out to be too slow when under pressure

Documentation of the new features can be found in DocBook XML format in 
docs/docbook/BioJavaX.xml in the biojava-live branch of CVS. It's as 
detailed as I could make it without getting bored to death writing it. 
I've never been the world's best documentation writer, so if anyone would 
like to help improve it you're more than welcome.

Our plan is to make all this an official part of BioJava come the 1.5 
release, whenever that may be. For now though it is very very much a 
testing-stage thing, not even an alpha release.

Questions on a postcard to either Mark or myself. Feedback most welcome.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: hollandr at gis.a-star.edu.sg
