[Biojava-l] Anybody working on a Java binding to CHADO?
smh1008 at cus.cam.ac.uk
Fri Apr 2 07:20:13 EST 2004
On Friday 02 Apr 2004 12:42 pm, Richard Bruskiewich wrote:
> Hi folks (especially Matt and Thomas...),
> I've finally decided to subscribe to this list since we're getting more
> heavily into Java here at IRRI.
> First question: who is working on a Java binding to CHADO?
I worked on one for some time before suspending my efforts.
The main problem is that the fit between the Chado data model and the
BioJava one is particularly poor. BioJava has a hierarchical feature
model in which features have subfeatures etc. All features are associated
with a location on a sequence. This made sense at the time BJ was being
developed and is also inherent in a BioSQL type of world.
Chado, however, unlinks features from sequences and from each other. Now,
features can be associated with multiple locations on different sequences.
Also, features have relationships to each other that are potentially
non-hierarchical (DAG for example). The Chado data model is probably more
powerful and expressive.
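
To make the contrast concrete, here is a minimal sketch of the two shapes. It is not BioJava or Chado code: every class and field name below (HierFeature, GraphFeature, Location and so on) is hypothetical and only mirrors the description above.

```java
import java.util.ArrayList;
import java.util.List;

final class FeatureModels {
    /** Hierarchical style: one sequence, one location, at most one parent. */
    static final class HierFeature {
        final String type;
        final String sequenceId;
        final int start, end;
        final HierFeature parent;                       // null for a top-level feature
        final List<HierFeature> children = new ArrayList<>();

        HierFeature(String type, String sequenceId, int start, int end, HierFeature parent) {
            this.type = type;
            this.sequenceId = sequenceId;
            this.start = start;
            this.end = end;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    /** A placement of a feature on some sequence (hypothetical stand-in). */
    static final class Location {
        final String sequenceId;
        final int start, end;

        Location(String sequenceId, int start, int end) {
            this.sequenceId = sequenceId;
            this.start = start;
            this.end = end;
        }
    }

    /** Unlinked style: the feature exists on its own and may have many locations. */
    static final class GraphFeature {
        final String name, type;
        final List<Location> locations = new ArrayList<>();

        GraphFeature(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    public static void main(String[] args) {
        HierFeature gene = new HierFeature("gene", "chr2L", 100, 900, null);
        HierFeature exon = new HierFeature("exon", "chr2L", 100, 250, gene);
        System.out.println(exon.type + " sits under one " + exon.parent.type);

        GraphFeature e = new GraphFeature("exon1", "exon");
        e.locations.add(new Location("chr2L", 100, 250));   // genomic placement
        e.locations.add(new Location("cDNA_1", 1, 151));    // same feature on another sequence
        System.out.println(e.name + " has " + e.locations.size() + " locations");
    }
}
```

The second shape has no single sequence or parent to hang a feature from, which is where the mapping onto the first one starts to break down.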
I made one attempt to shoehorn Chado's predecessor into BioJava by
forcing particular relationships into particular positions in the BJ
hierarchy, but the results were ugly and fragile. With Chado being more
generalised than its predecessor, the results have been even messier.
For example, BJ uses getParent() to move up the hierarchy, but an attempt to
use this in Chado would be ambiguous: by what relationship would the other
object be a parent? Similarly, all our feature selection through filter()
also fails. How do we traverse the feature relationships if they are not a
hierarchy? Are we filtering by applying the FeatureFilter (FF) to each
individual feature or by traversing the hierarchy of a specific relationship?
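
One way to picture the ambiguity is a graph of typed edges, where "parent" only means something once a relationship type is named. The sketch below is an illustration under that assumption, not an existing BioJava or Chado API: FeatureGraph, relate(), parents() and ancestors() are hypothetical names, and the relationship labels ("part_of", "derives_from") are just example strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class FeatureGraph {
    /** One typed, directed edge: subject -(type)-> object. */
    private static final class Edge {
        final String subject, type, object;

        Edge(String subject, String type, String object) {
            this.subject = subject;
            this.type = type;
            this.object = object;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void relate(String subject, String type, String object) {
        edges.add(new Edge(subject, type, object));
    }

    /** Direct "parents" of a feature, but only along the named relationship. */
    List<String> parents(String subject, String type) {
        List<String> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.subject.equals(subject) && e.type.equals(type)) result.add(e.object);
        }
        return result;
    }

    /** Everything reachable upwards along one relationship type. */
    Set<String> ancestors(String subject, String type) {
        Set<String> seen = new LinkedHashSet<>();   // a DAG can reach a node twice
        Deque<String> todo = new ArrayDeque<>();
        todo.push(subject);
        while (!todo.isEmpty()) {
            for (String p : parents(todo.pop(), type)) {
                if (seen.add(p)) todo.push(p);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        FeatureGraph g = new FeatureGraph();
        g.relate("exon1", "part_of", "mRNA_A");
        g.relate("exon1", "part_of", "mRNA_B");      // two parents: not a tree
        g.relate("mRNA_A", "part_of", "gene1");
        g.relate("mRNA_B", "part_of", "gene1");
        g.relate("exon1", "derives_from", "clone7"); // a different kind of "parent"
        System.out.println(g.parents("exon1", "part_of"));
        System.out.println(g.parents("exon1", "derives_from"));
        System.out.println(g.ancestors("exon1", "part_of"));
    }
}
```

A no-argument getParent() has no sensible answer for exon1 here, and a filter has to say whether it is testing each node in isolation or walking one relationship type.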
The inadequacies of our data model will need to be addressed in BioJava 2
(and may that day come soon!), but for now I would not suggest trying to
bridge BJ1 to Chado. When BJ2 becomes available, I will certainly revisit
Chado for BioJava.
One aspect of Chado is that because it is so heavily normalised,
performance can be a real issue if you are attempting to use it as a
datasource rather than as a reference data repository. Almost everything
you could possibly want to do involves a large number of joins (e.g.
feature -> feature type: 1 join; -> location: 1 join; through a
relationship: 1 join; etc.). As such I think you will need to explicitly
denormalise it if you are intending to use it for analysis. You may want to
investigate Flymine (flymine.org) - they have had to Java-ize Chado for
their own use.
They have a transparent (to user) query optimisation that may be of help to
you. You could try asking their liaison officer (Rachel Lyne) (rachel [at]
flymine.org) for more details.
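
As a rough illustration of what explicit denormalisation means here, the sketch below resolves the feature, type and location lookups once and keeps flat rows for analysis. The maps are in-memory stand-ins for normalised tables, not the real Chado schema, and this is not a description of what Flymine does; all names (Denormalise, Loc, FlatFeature, flatten) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Denormalise {
    /** Stand-in for a location row that refers to its feature by id. */
    static final class Loc {
        final int featureId;
        final String sequence;
        final int start, end;

        Loc(int featureId, String sequence, int start, int end) {
            this.featureId = featureId;
            this.sequence = sequence;
            this.start = start;
            this.end = end;
        }
    }

    /** One pre-joined row: everything an analysis needs, with no further lookups. */
    static final class FlatFeature {
        final String name, type, sequence;
        final int start, end;

        FlatFeature(String name, String type, String sequence, int start, int end) {
            this.name = name;
            this.type = type;
            this.sequence = sequence;
            this.start = start;
            this.end = end;
        }
    }

    /** Performs the feature -> type and feature -> location lookups once, up front. */
    static List<FlatFeature> flatten(Map<Integer, String> featureName,
                                     Map<Integer, Integer> featureType,
                                     Map<Integer, String> typeName,
                                     List<Loc> locs) {
        List<FlatFeature> flat = new ArrayList<>();
        for (Loc l : locs) {
            String type = typeName.get(featureType.get(l.featureId));
            flat.add(new FlatFeature(featureName.get(l.featureId), type,
                                     l.sequence, l.start, l.end));
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new HashMap<>();
        Map<Integer, Integer> types = new HashMap<>();
        Map<Integer, String> typeNames = new HashMap<>();
        names.put(1, "exon1");
        types.put(1, 10);
        typeNames.put(10, "exon");

        List<Loc> locs = new ArrayList<>();
        locs.add(new Loc(1, "chr2L", 100, 250));
        locs.add(new Loc(1, "cDNA_1", 1, 151));

        for (FlatFeature f : flatten(names, types, typeNames, locs)) {
            System.out.println(f.name + " " + f.type + " " + f.sequence
                    + ":" + f.start + "-" + f.end);
        }
    }
}
```

The trade-off is the usual one: the flat copy is quick to scan but has to be rebuilt whenever the underlying data changes.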
David Huen, Ph.D. Email: smh1008 at cus.cam.ac.uk
Dept. of Genetics Fax : +44 1223 333992
University of Cambridge Phone: +44 1223 333982/766748
Cambridge, CB2 3UH