[Biojava-l] Schema and Docs for BioSQL

Wed, 20 Feb 2002 23:06:37 +0000

On Wed, Feb 20, 2002 at 01:52:10PM -0500, Marc Colosimo wrote:
> Hi,
> 
> Is there any information about using the BioSQL classes in BioJava, such
> as the schema for the database or examples of using it? I am interested in
> using PostgreSQL and BioJava to store lots of sequence data.

BioSQL is based on bioperl-db.  There's a little bit about
it in the document from the first (O'Reilly) hackathon meeting:

   http://www.technophage.com/open-bio-database.pdf

The BioJava code's quite new -- I've got a little tutorial
planned, but I'm afraid (ahem) it's not written yet.

In the meantime, the code is integrated into the main
trunk version of biojava-live (although it didn't quite
make it into 1.2), and hopefully shouldn't be too
problematic to use (touch wood!).

You can get schemas (MySQL and PostgreSQL) from:

   http://www.biojava.org/download/biosql/

At the moment there are actually two PostgreSQL schemas --
one was auto-generated from the MySQL one, the other was
hand-edited by me (identified by the -thomasd suffix).
For now I'd advise using the hand-edited version, but it
should go away in future once the automated conversion has
been perfected.

If you're using PostgreSQL, note the following:

  - You need at least version 7.1 -- previous versions didn't
    support storing large strings in normal table attributes.

  - There's a file of stored procedures (biosqlprocs.sql)
    which you can load into the database after loading the
    schema.  These are auto-detected by the BioJava code,
    and can increase write performance significantly
    (by a factor of about 3 on my test setup).

On the BioJava side, there isn't really a separate API for BioSQL
as such: you use it through the normal SequenceDB interface.  You
can just do something like:

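  import org.biojava.bio.seq.db.SequenceDB;
  import org.biojava.bio.seq.db.biosql.BioSQLSequenceDB;

  SequenceDB seqs = new BioSQLSequenceDB(
      "jdbc:postgresql://dbbox.mydomain.org/biosql_db",  // JDBC URL
      "username",       // database user
      "password",       // database password
      "database-name",  // logical database (namespace) to use
      true              // create the namespace if it doesn't exist
  );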

The first three arguments are just standard JDBC-style database
connection details.  There's a `database name' parameter because
BioSQL allows each `physical' SQL database to contain a number of
`logical' databases.  Perhaps namespace would be a better term
for these (but hey, I didn't write the original schema).  The final
argument specifies whether the namespace should be created if it
doesn't already exist.  Note that right now, the BioJava code
won't create the actual SQL database, or load the schema, for you.
You'll have to do this manually using your database's normal tools.
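
To make the namespace idea concrete, here is a toy in-memory model
in plain Java.  It is not BioJava code: the class NamespaceStore and
its open method are made-up names, and a HashMap stands in for the
SQL database.  The only thing taken from the description above is
the behaviour of the final boolean argument.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one physical database holding several logical databases
// (namespaces). Hypothetical names; this is not the BioJava implementation.
class NamespaceStore {
    // namespace name -> (sequence ID -> sequence string)
    private final Map<String, Map<String, String>> namespaces = new HashMap<>();

    // Return the named namespace. If it is missing, create it when
    // 'create' is true, otherwise fail.
    Map<String, String> open(String name, boolean create) {
        Map<String, String> ns = namespaces.get(name);
        if (ns == null) {
            if (!create) {
                throw new IllegalArgumentException("No such namespace: " + name);
            }
            ns = new HashMap<>();
            namespaces.put(name, ns);
        }
        return ns;
    }

    // Number of namespaces currently held in this store.
    int namespaceCount() {
        return namespaces.size();
    }
}
```

Opening the same name twice gives you the same namespace, and two
different names never see each other's entries, which is roughly
what the database-name argument buys you.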

Having connected to the database, you can write complete
Sequence entries using the addSequence(Sequence) method.

You can retrieve sequences by ID using the getSequence(String)
method.  Objects extracted by this method retain live connections
to the database.  Alterations to the sequence (for instance,
using the createFeature(Feature.Template) method) are immediately
reflected in the database (in a transactionally safe manner, if
the database supports this -- PostgreSQL does).  So they're true
persistent implementations of the BioJava interfaces.
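
As a rough illustration of what "live" means here, the following is
a toy write-through model in plain Java.  It does not use BioJava
or a real database: a static map stands in for the SQL tables, and
the method names merely echo the real ones (the signatures are
simplified and hypothetical).  The point is that the object handed
back holds no data of its own, so every change lands in the backing
store at once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy write-through model; hypothetical names, not the BioJava classes.
class LiveSequenceDemo {
    // Stand-in for the database: sequence ID -> list of feature types.
    static final Map<String, List<String>> table = new HashMap<>();

    // A handle that keeps only the ID. Every read and write goes
    // straight to the backing store.
    static class LiveSequence {
        private final String id;

        LiveSequence(String id) {
            this.id = id;
        }

        void createFeature(String type) {
            table.get(id).add(type);
        }

        int countFeatures() {
            return table.get(id).size();
        }
    }

    // Store a new, feature-less entry under the given ID.
    static void addSequence(String id) {
        table.put(id, new ArrayList<>());
    }

    // Hand back a live handle for an existing entry.
    static LiveSequence getSequence(String id) {
        if (!table.containsKey(id)) {
            throw new IllegalArgumentException("Unknown ID: " + id);
        }
        return new LiveSequence(id);
    }
}
```

Two handles fetched for the same ID see each other's changes
straight away, because neither keeps a private copy.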

The aim is to have everything work just like in-memory
SequenceDB, Sequence, and Feature objects.  For many purposes,
BioSQL is now pretty close to this ideal.

Basic BioSQL doesn't support hierarchical features, so these
get flattened when adding a sequence to a database (and attempts
to create new child features on a BioSQL sequence will fail).
However, I've got an /experimental/ extension for handling
this.  There's an extra table (seqfeature_hierarchy) in my
schema.  Once again, this is auto-detected by the client code
and used if available.
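
For anyone unsure what flattening does to a feature tree, here is a
small plain-Java sketch.  It is only my illustration of the idea,
not the BioJava code: the Feature class is a made-up stand-in with
just a type and a list of children, and I am assuming that
flattening means every feature is kept but the parent/child links
are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of flattening a feature hierarchy.
// Hypothetical classes; not the BioJava Feature interface.
class FeatureFlattener {
    static class Feature {
        final String type;
        final List<Feature> children = new ArrayList<>();

        Feature(String type) {
            this.type = type;
        }

        // Attach a child and return this feature, so calls can be chained.
        Feature add(Feature child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk that returns the type of every feature in the
    // tree as one flat list, losing the parent/child structure.
    static List<String> flatten(List<Feature> topLevel) {
        List<String> out = new ArrayList<>();
        for (Feature f : topLevel) {
            collect(f, out);
        }
        return out;
    }

    private static void collect(Feature f, List<String> out) {
        out.add(f.type);
        for (Feature child : f.children) {
            collect(child, out);
        }
    }
}
```

A gene with two exon children comes out as three sibling entries,
which is the information an extra hierarchy table would need to
record separately.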

Let me know how you get on,

    Thomas.