[Biojava-l] Anybody working on a Java binding to CHADO?

Fri Apr 2 07:20:13 EST 2004

On Friday 02 Apr 2004 12:42 pm, Richard Bruskiewich wrote:
> Hi folks (especially Matt and Thomas...),
>
> I've finally decided to subscribe to this list since we're getting more
> heavily into Java here at IRRI.
>
> First question: who is working on a Java binding to CHADO?
>
Hi Richard,
I worked on one for some time before suspending my efforts.

The main problem is that the fit between the Chado data model and the 
BioJava one is particularly poor.  BioJava has a hierarchical feature 
model in which features have subfeatures etc.  All features are associated 
with a location on a sequence.  This made sense at the time BJ was being 
developed and is also inherent in a BioSQL type of world.

Chado, however, unlinks features from sequences and from each other.  Now, 
features can be associated with multiple locations on different sequences.  
Also, features have relationships to each other that are potentially 
non-hierarchical (DAG for example).  The Chado data model is probably more 
powerful and expressive.

I made one attempt to shoehorn Chado's predecessor into the BioJava model 
by forcing particular relationships into particular positions in the BJ 
hierarchy, but the results were ugly and fragile.  With Chado being more 
generalised than its predecessor, the results have been even messier.  
For example, BJ uses getParent() to move up the hierarchy, but an attempt 
to use this in Chado would be ambiguous - by what relationship would the 
other object be a parent?  Similarly, all our feature selection through 
filter() also fails.  How do we traverse the feature relationships if they 
are not a hierarchy?  Are we filtering by applying the FeatureFilter (FF) 
on each individual feature or by traversing the hierarchy of a specific 
relationship?
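
To make the getParent() ambiguity concrete, here is a minimal sketch of 
features linked by typed relationships.  It is not BioJava or Chado code: 
the class TypedFeatureGraph, its methods, and the feature names are all 
made up for illustration, and the relationship names are only examples.  
The point is that "parent" has no answer until you name a relationship.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are BioJava or Chado APIs.
public class TypedFeatureGraph {
    // relationship type -> (child feature -> parent features)
    private final Map<String, Map<String, List<String>>> parents = new HashMap<>();

    // Record that `child` stands in relationship `type` to `parent`.
    public void relate(String child, String type, String parent) {
        parents.computeIfAbsent(type, t -> new HashMap<>())
               .computeIfAbsent(child, c -> new ArrayList<>())
               .add(parent);
    }

    // A feature may have several parents, and only per relationship type,
    // so a single-valued getParent() cannot be defined here.
    public List<String> parentsOf(String child, String type) {
        return parents.getOrDefault(type, Collections.emptyMap())
                      .getOrDefault(child, Collections.emptyList());
    }

    public static void main(String[] args) {
        TypedFeatureGraph g = new TypedFeatureGraph();
        g.relate("exon1", "part_of", "transcriptA");
        g.relate("exon1", "part_of", "transcriptB");
        g.relate("proteinA", "derives_from", "transcriptA");

        System.out.println(g.parentsOf("exon1", "part_of"));      // [transcriptA, transcriptB]
        System.out.println(g.parentsOf("exon1", "derives_from")); // []
    }
}
```

Any filter over such a graph would likewise have to be told which 
relationship to walk, which is exactly what the BJ1 interfaces have no 
place for.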

The inadequacies of our data model will need to be addressed in BioJava 2 
(and may that day come soon!) but for now, I would not suggest trying to 
bridge BJ1 to Chado.  When BJ2 becomes available, I will certainly revisit 
Chado for BioJava.

One aspect of Chado is that because it is so heavily normalised, 
performance can be a real issue if you are attempting to use it as a 
datasource rather than as a reference data repository.  Almost everything 
you could possibly want to do involves a large number of joins (e.g. 
feature -> feature type: 1 join; -> location: 1 join; through a 
relationship: 1 join; etc.).  As such I think you will need to explicitly 
denormalise it if you are intending to use it for analysis.  You may want 
to investigate FlyMine (flymine.org) - they have had to Java-ize Chado for 
their own use.  They have a query optimisation that is transparent to the 
user and may be of help to you.  You could try asking their liaison 
officer (Rachel Lyne) (rachel [at] flymine.org) for more details.
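
As a rough illustration of what I mean by denormalising, the sketch below 
resolves the type and location lookups once and keeps flat rows for 
analysis.  It is an in-memory toy, not a Chado client: FlattenFeatures and 
its nested classes are hypothetical and cover only a tiny slice of what 
the real schema holds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified stand-ins for normalised tables.
public class FlattenFeatures {
    public static final class Feature {
        public final int id; public final String name; public final int typeId;
        public Feature(int id, String name, int typeId) {
            this.id = id; this.name = name; this.typeId = typeId;
        }
    }

    public static final class Location {
        public final int featureId; public final String sequence;
        public final int start; public final int end;
        public Location(int featureId, String sequence, int start, int end) {
            this.featureId = featureId; this.sequence = sequence;
            this.start = start; this.end = end;
        }
    }

    // One flat row per (feature, location) pair, with the type name inlined.
    public static final class FlatRow {
        public final String name, type, sequence;
        public final int start, end;
        public FlatRow(String name, String type, String sequence, int start, int end) {
            this.name = name; this.type = type; this.sequence = sequence;
            this.start = start; this.end = end;
        }
        @Override public String toString() {
            return name + "\t" + type + "\t" + sequence + ":" + start + "-" + end;
        }
    }

    // Do the two "joins" (feature -> type, location -> feature) once, up front.
    public static List<FlatRow> flatten(List<Feature> features,
                                        Map<Integer, String> typeNames,
                                        List<Location> locations) {
        Map<Integer, Feature> byId = new HashMap<>();
        for (Feature f : features) byId.put(f.id, f);

        List<FlatRow> rows = new ArrayList<>();
        for (Location loc : locations) {
            Feature f = byId.get(loc.featureId);
            if (f == null) continue; // location without a known feature: skip
            rows.add(new FlatRow(f.name, typeNames.get(f.typeId),
                                 loc.sequence, loc.start, loc.end));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Feature> features = new ArrayList<>();
        features.add(new Feature(1, "geneX", 10));
        Map<Integer, String> types = new HashMap<>();
        types.put(10, "gene");
        List<Location> locs = new ArrayList<>();
        locs.add(new Location(1, "chr1", 100, 500));
        locs.add(new Location(1, "contig7", 20, 420)); // same feature, second sequence
        for (FlatRow row : flatten(features, types, locs)) System.out.println(row);
    }
}
```

The same feature appears in two rows because it has two locations; that 
duplication is the price of not having to join at query time.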

Regards,
David Huen

-- 

David Huen, Ph.D.              Email: smh1008 at cus.cam.ac.uk
Dept. of Genetics              Fax  : +44 1223 333992
University of Cambridge        Phone: +44 1223 333982/766748
Cambridge, CB2 3UH
U.K.