[Biojava-l] database for biojava

Thomas Down td2@sanger.ac.uk
Wed, 22 Nov 2000 15:19:57 +0000

Just found this languishing at the end of my INBOX -- sorry...

On Thu, Nov 16, 2000 at 05:43:45PM +1300, McCulloch, Alan wrote:
> Does anybody have any tips on the right approach to setting up a database
> on top of which would sit biojava ?
> The platform will be Oracle 8 and I am very keen to NOT do my
> own data model (in the same way I'm keen to not do my own api/object
> design which is why I want to use something like biojava !) - I want to 
> use a standard model if possible, if there is such a thing.
> Can a relational data model of some sort be derived from biojava ?

It certainly should be possible to build a new relational model
based on BioJava.  Our basic model (simple sequence data,
hierarchical features) is really pretty simple -- the only
problems I can see might be:

  - Sparse locations -- it'll be a little bit of extra work to
    store these in the relational model.  I guess I'd go for
    having a `span' table:

      create table location_span (
        location_id         int not null,
        min_pos             int not null,
        max_pos             int not null
      ) ;

    So each location is modeled by one or more location_span
    rows.  Of course, the BioJava interfaces don't actually
    /require/ you to store sparse locations -- only implement
    this if you're actually going to need it.

  - Polymorphic features -- I guess the easiest way might be to
    have a separate table for each class of Feature object you
    want to store, but this means hardwiring the supported
    feature classes at a fairly low level.  Another approach
    would be to have a table like:

      create table feature (
          id               sequence,
          sequence_id      int not null,
          parent_id        int,
          location_id      int not null,
          type             text,
          source           text,
          biojava_feature  blob
      ) ;

    so you're storing the `universal' properties of the feature,
    and then serializing the whole feature object and dumping it
    in the blob. 
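
To make the two ideas above concrete, here is a rough Java sketch of how
they might fit together.  It is only an illustration under assumptions:
FeatureStoreSketch, SpanRow and DemoFeature are hypothetical stand-ins
invented for this example, not BioJava classes, and the database itself
is left out (the byte array is what would be written to the
biojava_feature column, and each SpanRow is what would become one
location_span row).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT BioJava classes: they only mirror the
// location_span and feature tables sketched above.
class FeatureStoreSketch {

    /** One row of location_span: a contiguous block from minPos to maxPos. */
    static final class SpanRow implements Serializable {
        private static final long serialVersionUID = 1L;
        final int locationId;
        final int minPos;
        final int maxPos;

        SpanRow(int locationId, int minPos, int maxPos) {
            this.locationId = locationId;
            this.minPos = minPos;
            this.maxPos = maxPos;
        }
    }

    /** A toy feature: the "universal" type and source columns plus its spans. */
    static final class DemoFeature implements Serializable {
        private static final long serialVersionUID = 1L;
        final String type;
        final String source;
        final List<SpanRow> spans;

        DemoFeature(String type, String source, List<SpanRow> spans) {
            this.type = type;
            this.source = source;
            this.spans = new ArrayList<>(spans);
        }
    }

    /** Turn blocks given as {min, max} pairs into location_span rows. */
    static List<SpanRow> toSpanRows(int locationId, int[][] blocks) {
        List<SpanRow> rows = new ArrayList<>();
        for (int[] block : blocks) {
            rows.add(new SpanRow(locationId, block[0], block[1]));
        }
        return rows;
    }

    /** Overall extent of a location: smallest min_pos, largest max_pos. */
    static int[] extent(List<SpanRow> rows) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (SpanRow row : rows) {
            min = Math.min(min, row.minPos);
            max = Math.max(max, row.maxPos);
        }
        return new int[] {min, max};
    }

    /** Serialize the whole feature; these bytes would go in the blob column. */
    static byte[] toBlob(DemoFeature feature) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(feature);
        }
        return bytes.toByteArray();
    }

    /** Rebuild the feature object from the bytes read back out of the blob. */
    static DemoFeature fromBlob(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (DemoFeature) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A sparse location made of two blocks, stored as two span rows.
        List<SpanRow> rows = toSpanRows(7, new int[][] {{100, 200}, {300, 450}});
        DemoFeature feature = new DemoFeature("exon_set", "demo", rows);
        DemoFeature copy = fromBlob(toBlob(feature));
        int[] ext = extent(copy.spans);
        System.out.println(copy.type + " " + ext[0] + ".." + ext[1]);
    }
}
```

The trade-off with the blob approach is that the database can only
query the plain columns (type, source, location); anything else lives
inside the serialized bytes, and those can only be read back by Java
code with a compatible version of the feature class.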

But before you start implementing from scratch, you might like
to take a look at what the EnsEMBL people have been doing
(http://www.ensembl.org).  They've got a fairly sophisticated
schema for storing genomic data in a relational database (currently
using MySQL, but I've had the main tables running on PostgreSQL,
and I know someone is working on an Oracle port).  The EnsEMBL
tables are more closely geared towards one specific application
than the BioJava model is, but it might be worth looking to
see if your data will fit into this model.

I've been working on some Java interfaces for EnsEMBL -- all
experimental code at the moment.  Feel free to take a look
at the following CVS modules if you're interested (in the main
BioJava repository):

  ensembl           Lightweight Java wrappers round the ensembl
                    SQL tables (largely complete for reading, maybe
                    40-50% done for writing)

  biojava-ensembl   Bridge which allows EnsEMBL databases to be
                    viewed as BioJava SequenceDBs  (currently
                    pretty experimental)

Hope this helps,

``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett