[Bioperl-l] Re: BioSQL or chado

Tue Jul 29 21:31:30 EDT 2003

[x-posting to GMOD-schema]

On Tue, 29 Jul 2003, Nathan (Nat) Goodman wrote:

> I'm thinking about converting our homegrown relational schema to one of the
> emerging BioPerl-friendly "standard" schemas.  I'm looking for something
> that (1) works now, and (2) is likely to be popular in the BioPerl world for
> some time to come.
>
> I think the choices are BioSQL and chado.  Are there others?  Is one of
> these the obvious right choice?

Ensembl and GUS are the other main choices.

I originally viewed BioSQL as a way of doing relational queries over data
slurped from EMBL/GenBank and SwissProt. It has since evolved into
something more generic and resembles chado in many ways. Hilmar and Dave
Block use it at GNF all the time for a lot more than slurping GenBank.

Chado encompasses more than BioSQL - genetics, expression, publications.
But I'm assuming it's the sequence part you are interested in here? There
is nothing to stop BioSQL moving into this area.

BioSQL certainly has the tightest integration with bioperl (and the other
bio* projects). This is through an O/R layer.

chado has no direct integration with bioperl. I don't think there is any
O/R layer or OO API planned (although some biojava folks have expressed an
interest in this). Scott Cain has written a chado adapter for gbrowse
(which uses bioperl objects), and it could be extracted to form an API in
its own right, although it is currently limited to the kind of API
calls you need to make a genome viewer.

Many of the chado developers favour XML over objects. The Chado-XML DTD is
derived directly from the relational schema. The chado developers at
Harvard have written a generic XML<->DB tool, which can be used in place
of an API or O/R mapping. Of course, we still want to be able to use
bioperl objects, so there are Bio::SeqIO::chadoxml classes being
developed. The most likely route will be DB<->ChadoXML<->bioperl.

BioSQL is semantically almost identical to the bioperl object model,
whereas there are some differences with chado, specifically with respect
to locations.

chado does not allow discontinuous/split feature locations
chado does not support the full fuzzy GenBank model
chado allows multiple redundant locations
  (eg a SNP on a protein vs genomic; features on clone and chromosome)
chado uses interbase coordinates
chado uses a different mechanism for 'remote' locations
  the source feature (ie the one which start/end is relative to)
  is part of the location in chado, unlike BioSQL
chado abandons the artificial distinction between 'sequence' and 'feature',
  there is only one entity 'feature' in the _logical_ model
chado has no equivalent of biosql.bioentry (other than 'feature')
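
To make the interbase and source-feature points concrete, here is a minimal
Python sketch of converting between 1-based, inclusive start/end positions
(GenBank style) and interbase positions, with the source feature carried as
part of the location. The class, function and feature names are hypothetical,
made up for illustration; they are not taken from the chado DDL, BioSQL or
bioperl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterbaseLoc:
    """Hypothetical interbase location: positions count the gaps between bases."""
    srcfeature: str  # the feature that fmin/fmax are relative to
    fmin: int        # boundary to the left of the first base (0-based)
    fmax: int        # boundary to the right of the last base
    strand: int = 1


def base_to_interbase(srcfeature, start, end, strand=1):
    """Convert a 1-based, inclusive start/end into an interbase location."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return InterbaseLoc(srcfeature, start - 1, end, strand)


def interbase_to_base(loc):
    """Convert back to a 1-based, inclusive (start, end) pair.

    A zero-length location (fmin == fmax, e.g. an insertion site) has no
    1-based inclusive equivalent, so it is rejected here.
    """
    if loc.fmin == loc.fmax:
        raise ValueError("zero-length location has no 1-based equivalent")
    return loc.fmin + 1, loc.fmax


# Two redundant locations for the same SNP, each relative to a different
# source feature (assumed names).
snp_on_genome = base_to_interbase("chromosome_2L", 1000, 1000)
snp_on_protein = base_to_interbase("some_protein", 42, 42)
```

In this representation the length of a location is simply fmax - fmin, and a
point between two bases can be expressed with fmin equal to fmax.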

These aspects of chado are more fully documented in the SQL DDL, and in a
document which is... Stan/Dave... whereabouts is that doc?

Other than that, there are more similarities than differences. For example,
both allow arbitrary feature graphs (preferably conforming to SO partonomies),
and features are typed by an ontology. I'm sure migrating data one way or
the other wouldn't be too much of a problem.
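
As a rough illustration of what a typed feature graph means, the Python
sketch below stores features as nodes with an ontology-style type and links
them with part_of edges, then walks the edges upwards. All identifiers and
the in-memory layout are assumptions for the example; neither schema is
being quoted here.

```python
from collections import defaultdict

# Each feature is typed by a term name (assumed, SO-like names).
feature_type = {
    "geneA": "gene",
    "mRNA1": "mRNA",
    "mRNA2": "mRNA",
    "exon1": "exon",
    "exon2": "exon",
}

# (child, relationship type, parent) edges. exon2 has two parents, so this
# is a graph rather than a strict tree.
edges = [
    ("exon1", "part_of", "mRNA1"),
    ("exon2", "part_of", "mRNA1"),
    ("exon2", "part_of", "mRNA2"),
    ("mRNA1", "part_of", "geneA"),
    ("mRNA2", "part_of", "geneA"),
]

parents = defaultdict(list)
for child, rel, parent in edges:
    parents[child].append((rel, parent))


def ancestors(feature):
    """Return every feature reachable by following edges upwards."""
    seen = set()
    stack = [feature]
    while stack:
        for _rel, parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Checking a graph against a partonomy would then amount to verifying that each
(child type, relationship, parent type) triple is one the ontology permits.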

Ensembl is a different kettle of fish altogether. The main difference is
that typing is enforced at the relational layer in ensembl. This has many
advantages and disadvantages, which have been discussed to death; it
depends on your project really.
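
The contrast can be sketched with Python's sqlite3 module: one style gives
each feature type its own table, the other keeps a single generic table whose
rows point at an ontology term. The table and column names below are
simplified assumptions for illustration, not the actual ensembl, chado or
BioSQL DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Style 1: type enforced by the relational layer (one table per type).
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE exon (
    exon_id   INTEGER PRIMARY KEY,
    gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL
);

-- Style 2: one generic table; the type is a reference to an ontology term.
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE generic_feature (
    feature_id INTEGER PRIMARY KEY,
    type_id    INTEGER NOT NULL REFERENCES term(term_id),
    name       TEXT
);
""")

# Style 1 data: the table a row lives in says what it is.
conn.execute("INSERT INTO gene (gene_id, name) VALUES (1, 'geneA')")
conn.executemany(
    "INSERT INTO exon (gene_id, start_pos, end_pos) VALUES (?, ?, ?)",
    [(1, 100, 200), (1, 300, 400)],
)

# Style 2 data: the type is itself data.
conn.executemany(
    "INSERT INTO term (term_id, name) VALUES (?, ?)",
    [(1, "gene"), (2, "exon")],
)
conn.executemany(
    "INSERT INTO generic_feature (type_id, name) VALUES (?, ?)",
    [(1, "geneA"), (2, "geneA-exon1"), (2, "geneA-exon2")],
)

# Style 1: the table name carries the type.
typed_exons = conn.execute("SELECT COUNT(*) FROM exon").fetchone()[0]

# Style 2: the query joins to the term table to select by type.
generic_exons = conn.execute(
    "SELECT COUNT(*) FROM generic_feature f "
    "JOIN term t ON t.term_id = f.type_id WHERE t.name = ?",
    ("exon",),
).fetchone()[0]
```

In the first style the database itself can insist that an exon has a gene and
coordinates, but a new feature type needs a new table. In the second style a
new type is just a new row in the term table, and constraints that depend on
the type have to be checked elsewhere.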

Ensembl is the most mature, and chado is the new kid on the block.
However, chado 1_01 has just been frozen, and that's what most apps will
be targeting.

chado has a lot riding on it right now; flybase will become completely
chado-dependent for its genome annotation data by the end of this year.

> Thanks,
> Nat

Cheers
Chris