[Biojava-l] BioJavaX ready for testing

Mon Oct 31 04:28:09 EST 2005

Hello people!

Mark is away so I'm taking the liberty of sneaking this one out... :)

I've cross-posted this to both BioJava and BioSQL as much of what is new in BioJavaX will probably be of interest to BioSQL users too.

We've been doing a lot of work recently on creating some extensions to BioJava called BioJavaX. The primary purpose of these extensions is to provide better interaction with BioSQL databases, which has been achieved using Hibernate (www.hibernate.org). You can now fully interact with every column of every table in BioSQL, using Hibernate's own query language, HQL, to construct queries that return sets of BioJavaX objects. Selects, inserts, updates, primary key assignment, foreign key relations, and deletes are all handled transparently by Hibernate, so BioJavaX does not need to contain any SQL at all.

As a side effect of constructing a Hibernate-compatible extension to the BioJava object model, we were required to define objects that hold much more detailed information about themselves. For instance, a Sequence object cannot tell you which namespace it lives in within the BioSQL database, but our extension to it, RichSequence, can. As RichSequence extends Sequence rather than replacing it, you can use the new objects with your existing code without having to cast them.
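
As a rough illustration of that last point, here is a minimal sketch. The types below (SketchSequence, SketchRichSequence, SimpleRichSequence, LegacyCode) are simplified stand-ins invented for this example, not the real BioJava or BioJavaX interfaces, and the namespace value "lcl" is just a placeholder. It only shows the shape of the idea: code written against the base type accepts the richer type unchanged.

```java
// Simplified stand-ins for illustration only; NOT the real
// org.biojava / org.biojavax interfaces.
interface SketchSequence {
    String getName();
    String seqString();
}

// The "rich" type adds information but is still a SketchSequence.
interface SketchRichSequence extends SketchSequence {
    String getNamespace();
}

final class SimpleRichSequence implements SketchRichSequence {
    private final String namespace;
    private final String name;
    private final String residues;

    SimpleRichSequence(String namespace, String name, String residues) {
        this.namespace = namespace;
        this.name = name;
        this.residues = residues;
    }

    public String getNamespace() { return namespace; }
    public String getName() { return name; }
    public String seqString() { return residues; }
}

final class LegacyCode {
    // Existing code that only knows about the base type.
    static String describe(SketchSequence s) {
        return s.getName() + ":" + s.seqString().length();
    }
}

class ExtensionDemo {
    public static void main(String[] args) {
        SketchRichSequence rich = new SimpleRichSequence("lcl", "seq1", "ACGT");
        // No cast needed: a rich sequence is accepted wherever a base one is.
        System.out.println(LegacyCode.describe(rich));
        // The extra detail is only visible through the rich type.
        System.out.println(rich.getNamespace());
    }
}
```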

To be able to load information from files into these new RichSequence objects in a meaningful way, we had to create a more detailed SeqIOListener, called RichSeqIOListener. We then had to write new parsers for the common file formats that extract more detailed information than before, in order to satisfy the RichSeqIOListener.

It's pretty safe to say that the file parsers in BioJavaX are leagues ahead of the existing ones in BioJava, even if I do say so myself. :P The downside of this extra detail though is that the parsers are much more sensitive and will not play well at all with incomplete or incorrectly formed files. If someone can edit them to be less sensitive whilst still retaining the level of detail required, that'd be great.

We've included parsers for FASTA, GenBank, EMBL, UniProt, INSDseq, EMBLxml, UniProtXML, and an extra one for parsing NCBI Taxonomy data.

Do note that BioJavaX cannot fully convert sequences created using the old BioJava model into the new BioJavaX model. It'll do its best, but the RichSequence object you end up with will have lots of properties set to null and a tonne of annotations instead, pretty much the same as the original Sequence object I suppose. So it's best to avoid conversions and deal with RichSequence objects from the ground up. This is particularly important to consider when converting a BioSQL database previously used with BioJava into one for use with BioJavaX. You'll also find that if you pass a converted old-style Sequence object to one of the new file parsers for writing, it may fail or produce output with lots of missing fields, as it will not find the information it is looking for in the places it expects.
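
To make the conversion problem concrete, here is a small sketch of the effect described above. Everything in it (OldStyleRecord, RichRecord, ConversionSketch, and the field and annotation names) is hypothetical and made up for this example; it is not how BioJavaX actually performs the conversion, only a picture of why a converted object ends up with null properties and a pile of annotations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; not BioJava or BioJavaX classes.
final class OldStyleRecord {
    final String name;
    // Old-style objects keep most details as free-form annotations.
    final Map<String, String> annotations = new HashMap<>();

    OldStyleRecord(String name) { this.name = name; }
}

final class RichRecord {
    final String name;
    // Typed properties that a richer model expects to be filled in.
    String namespace;
    String accession;
    Integer version;
    final Map<String, String> annotations = new HashMap<>();

    RichRecord(String name) { this.name = name; }
}

final class ConversionSketch {
    // A naive conversion: annotations are carried over as-is, but nothing
    // knows how to turn them into the typed properties, so those stay null.
    static RichRecord convert(OldStyleRecord old) {
        RichRecord rich = new RichRecord(old.name);
        rich.annotations.putAll(old.annotations);
        return rich;
    }

    // A writer that relies on the typed properties would find them missing.
    static boolean hasFieldsWriterNeeds(RichRecord r) {
        return r.namespace != null && r.accession != null && r.version != null;
    }
}
```

In this sketch the accession survives only as an annotation entry, so anything that looks for it in the typed accession property comes up empty, which is the same kind of mismatch that trips up the new parsers when writing.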

The whole lot is specifically designed to mimic and be compatible with BioSQL, but you don't need to have a BioSQL database to use it. Everything is standalone and will work just fine without a backing data source. Also there is no reason why you couldn't create a new set of Hibernate mappings that map the BioJavaX object model to some other relational database schema of your choice.

The upshot of it all is the org.biojavax package, which you can find in the biojava-live branch on CVS. Development is pretty much complete, and it now needs some serious testing.

We need volunteers to:

	a) test the BioSQL interaction via Hibernate with the various database flavours supported (HSQL, Oracle, MySQL, PostgreSQL)
	b) test the various file formats, particularly looking for special-case exceptions which the parsers may not be aware of yet
	c) do some load-testing and help us find ways to improve it if it turns out to be too slow when under pressure

Documentation of the new features can be found in DocBook XML format in docs/docbook/BioJavaX.xml in the biojava-live branch of CVS. It's as detailed as I could make it without getting bored to death writing it. I've never been the world's best documentation writer, so if anyone would like to help improve it you're more than welcome.

Our plan is to make all this an official part of BioJava come the 1.5 release, whenever that may be. For now though it is very very much a testing-stage thing, not even an alpha release.

Questions on a postcard to either Mark or myself. Feedback most welcome.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: hollandr at gis.a-star.edu.sg