[Biojava-l] SQL-backed persistent Biojava sequence/feature objects

David Huen <smh1008@cus.cam.ac.uk>
Mon, 6 Aug 2001 12:22:14 +0100 (BST)


I've had a look at the schema and it does seem very reminiscent of
EMBL/Genbank files in the type of features it captures and the general way
they are laid out.

I think it is also possible to graft hierarchical features onto this
schema in the manner Thomas indicated - use a separate table to represent
the parent_of relationship and yet another to identify the root features.
Additional fields and tables could be added to accommodate fields that are
Biojava specific.
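
As a rough illustration of the parent_of idea (not the actual schema, and
not Biojava or BioPerl-db code), the sketch below models the flat feature
table, the parent_of table and the root table as plain in-memory lists and
rebuilds the nested features from them.  All class, field and column names
(ParentOfSketch, FeatureRow, ParentOfRow, rebuild) are assumptions made up
for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for three tables: feature, parent_of, root. */
final class ParentOfSketch {

    /** One row of the flat feature table (column names are assumptions). */
    static final class FeatureRow {
        final int id;
        final String type;
        final int start;
        final int end;

        FeatureRow(int id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** One row of the parent_of table: parentId is the direct parent of childId. */
    static final class ParentOfRow {
        final int parentId;
        final int childId;

        ParentOfRow(int parentId, int childId) {
            this.parentId = parentId;
            this.childId = childId;
        }
    }

    /** A node of the rebuilt hierarchy: a flat row plus its child nodes. */
    static final class Node {
        final FeatureRow row;
        final List<Node> children = new ArrayList<Node>();

        Node(FeatureRow row) {
            this.row = row;
        }
    }

    /** Rebuilds the nested features from the three flat "tables". */
    static List<Node> rebuild(List<FeatureRow> features,
                              List<ParentOfRow> parentOf,
                              List<Integer> rootIds) {
        // Index every flat row by its id.
        Map<Integer, Node> byId = new LinkedHashMap<Integer, Node>();
        for (FeatureRow f : features) {
            byId.put(f.id, new Node(f));
        }
        // Each parent_of row attaches one child to one parent.
        for (ParentOfRow link : parentOf) {
            lookup(byId, link.parentId).children.add(lookup(byId, link.childId));
        }
        // The root table says where each hierarchy starts.
        List<Node> roots = new ArrayList<Node>();
        for (int id : rootIds) {
            roots.add(lookup(byId, id));
        }
        return roots;
    }

    private static Node lookup(Map<Integer, Node> byId, int id) {
        Node node = byId.get(id);
        if (node == null) {
            throw new IllegalArgumentException("unknown feature id: " + id);
        }
        return node;
    }
}
```

A reader that ignores the parent_of and root lists still sees every
feature row, which is the flat view; only rebuild() needs the extra two.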

Nevertheless, it does have some disadvantages for both sides.  Principally,
if we did so, we would be projecting a hierarchical feature space onto a
flat feature space and, at least from the BioPerl side, the feature table
could look like a hopeless jumble of features.  On flattening the gene,
exon, feature nesting onto a flat feature space, it could be difficult to
discern from the flat space side what the relationships are between
all the disparate features.  It could be worse with alternative
splices/exons. There may also be a cost in terms of additional lookups.
I'm not sure what the advantage of this kind of interoperability would
really amount to since it would project what may be incompatible models
onto each other.

It is possible that the BioPerl db moves to accommodate a hierarchical
space too but then it loses what is its most attractive aspect, a clean
direct mapping onto the principal sequence file formats.  I can see
the advantages to that.  In the course of implementing Ragbag, I had
intended to implement mixed file formats to be used to construct
assemblies and the parsing and mapping works as advertised.  But because
each format is parsed with a different parser and maps to a slightly
different hierarchy with different feature type names for what is much
the same thing, the composite hierarchy isn't very readily used (Plea to
Biojava ppl: can we try to get an agreed mapping of these formats to
particular hierarchies/labels please?). Can I assume that your parsers
would parse Genbank/EMBL to a common flat feature structure irrespective
of whether the data came from Genbank or EMBL?  If so, it may be
advantageous to use BioPerl to establish databases from EMBL or Genbank or
other flat file databases for lookup from both BioPerl and BioJava.

Might it not work out better if we proceed along these lines:-
1) write an interface to allow Biojava to access and manipulate BioPerl-db
databases in a manner consistent with current BioPerl semantics.  The
interface enforces a flat feature space.  It immediately establishes
BioPerl-BioJava interoperability without breaking anything on the BioPerl
side.  It can even be done with MySQL to avoid needing any changes on the
BioPerl end.  In any situation where we want to use both Biojava and
BioPerl, we use this interface to work with what are in effect BioPerl-db
databases.  I
suspect people will want to do this in most instances when they want to
get a database that is a subset of one of the sequence databases.

2) develop a lightweight Biojava persistent object package where the goal
is different, namely to allow persistent objects to be built that, from the
user side, work and respond precisely like the existing Biojava in-memory
objects.  A major reason for my wanting to write this beast is data
caching for a Biojava application - I shut down my application and
when I restart it, the entire set of objects is still there like when I
last left it which makes a big difference to startup time for my
Ragbag-based DAS service (as this parses GAME files which have
hierarchical features anyway, the Biojava objects are a good fit).  It
seems likely that this schema would mainly be used to maintain internal
states rather than exporting data.
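
To make the caching use case in 2) concrete, here is a minimal sketch of
"save the object tree at shutdown, get the same tree back at startup".
It deliberately swaps in plain Java serialization to a file as a stand-in
for the SQL backing discussed here, purely so that the example is
self-contained.  FeatureCacheSketch, CachedFeature, save and load are
hypothetical names, not Biojava classes.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cache that keeps hierarchical features across restarts. */
final class FeatureCacheSketch {

    /** Minimal hierarchical feature; stands in for a real in-memory object. */
    static final class CachedFeature implements Serializable {
        private static final long serialVersionUID = 1L;

        final String type;
        final int start;
        final int end;
        final List<CachedFeature> children = new ArrayList<CachedFeature>();

        CachedFeature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Called at shutdown: writes the root features and everything below them. */
    static void save(Path file, List<CachedFeature> roots) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            // Copy into an ArrayList so the written list is known to be serializable.
            out.writeObject(new ArrayList<CachedFeature>(roots));
        }
    }

    /** Called at startup: restores the same nesting without re-parsing anything. */
    @SuppressWarnings("unchecked")
    static List<CachedFeature> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<CachedFeature>) in.readObject();
        }
    }
}
```

Because the nesting is stored as it is, nothing has to be flattened or
looked up again on restart, which is the property wanted for internal
state.  A SQL-backed version would need the same round trip to hold.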

If it helps any, I could have a go at writing both - it might get us the
best of both worlds!
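
For option 1), a rough sketch of what "the interface enforces a flat
feature space" could look like from the Java side follows.  The store
type has no notion of child features at all, and the conversion helper
refuses nested features instead of silently flattening them.  Everything
here (FlatStoreSketch, FlatFeatureStore, InMemoryStore, toFlat) is a
hypothetical name; a real version would sit on top of JDBC and the
BioPerl-db tables rather than a HashMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical flat-only access layer; not an existing Biojava or BioPerl API. */
final class FlatStoreSketch {

    /** A feature with a type and a location, and no notion of children. */
    static final class FlatFeature {
        final String type;
        final int start;
        final int end;

        FlatFeature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** The access interface: only flat features can go in or come out. */
    interface FlatFeatureStore {
        void addFeature(String sequenceId, FlatFeature feature);

        List<FlatFeature> featuresOf(String sequenceId);
    }

    /** In-memory stand-in for the database, for illustration only. */
    static final class InMemoryStore implements FlatFeatureStore {
        private final Map<String, List<FlatFeature>> bySequence =
            new HashMap<String, List<FlatFeature>>();

        @Override
        public void addFeature(String sequenceId, FlatFeature feature) {
            List<FlatFeature> list = bySequence.get(sequenceId);
            if (list == null) {
                list = new ArrayList<FlatFeature>();
                bySequence.put(sequenceId, list);
            }
            list.add(feature);
        }

        @Override
        public List<FlatFeature> featuresOf(String sequenceId) {
            List<FlatFeature> list = bySequence.get(sequenceId);
            if (list == null) {
                return Collections.emptyList();
            }
            return Collections.unmodifiableList(list);
        }
    }

    /** Minimal hierarchical feature as a caller might hold it in memory. */
    static final class NestedFeature {
        final String type;
        final int start;
        final int end;
        final List<NestedFeature> children = new ArrayList<NestedFeature>();

        NestedFeature(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Converts a childless feature; rejects anything that still has children. */
    static FlatFeature toFlat(NestedFeature feature) {
        if (!feature.children.isEmpty()) {
            throw new IllegalArgumentException(
                "flat store cannot hold a feature with child features: " + feature.type);
        }
        return new FlatFeature(feature.type, feature.start, feature.end);
    }
}
```

Rejecting nested features at the boundary, rather than flattening them,
keeps anything written from the Java side looking exactly like what the
BioPerl side expects to read.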

Best wishes,
David Huen