[Bioperl-l] Bioperl and BioSQL status

Tue Apr 1 16:47:41 EST 2003

Hi William

Very good questions!

Will your db primarily be for integrating and querying data from a
variety of sources - or will it be primarily for the management and
tracking of in-house data - or both? The answer to this will give an idea
of how much redundancy and denormalisation you can afford in your
database.

Looks like you have a firm grasp of the issues involved with using a
generic schema like BioSQL. As you say, the schema can theoretically
accommodate different types of information. I would say this is true for
all the types you mention below, with the possible exception of pathway
(this can be modeled simplistically using feature graphs but really you
need a more expressive graph formalism with nodes connected to arcs).

You mention extending the schema to simplify updates (I would also say to
simplify querying). This would have certain repercussions depending on how
this is done.

You could extend the schema with entirely orthogonal types of data - eg
pathways. You could still use the feature component of BioSQL as-is, and
use the whole existing bioperl architecture (adapters plus objects)
without modification. Of course, you would need your own code for getting
at your types.
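
To make the orthogonal-extension idea concrete, here is a rough sketch of
a pathway module with nodes connected by arcs, sitting beside a generic
feature table. Everything in it is an assumption for illustration: the
table and column names (pathway_node, pathway_arc, a cut-down feature
table) are hypothetical and are not the real BioSQL DDL, and the
downstream() helper just stands in for "your own code for getting at your
types".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for the generic feature layer (not the real BioSQL DDL).
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
-- Hypothetical orthogonal pathway module: nodes connected by typed arcs.
CREATE TABLE pathway_node (
    node_id    INTEGER PRIMARY KEY,
    label      TEXT NOT NULL,
    feature_id INTEGER REFERENCES feature(feature_id)  -- optional link
);
CREATE TABLE pathway_arc (
    src_id INTEGER NOT NULL REFERENCES pathway_node(node_id),
    dst_id INTEGER NOT NULL REFERENCES pathway_node(node_id),
    kind   TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [(1, "geneA"), (2, "geneB")])
conn.executemany("INSERT INTO pathway_node VALUES (?, ?, ?)",
                 [(1, "A protein", 1), (2, "B protein", 2),
                  (3, "metabolite X", None)])
conn.executemany("INSERT INTO pathway_arc VALUES (?, ?, ?)",
                 [(1, 2, "activates"), (2, 3, "produces")])


def downstream(conn, label):
    """Return (label, arc kind) for nodes one arc downstream of `label`."""
    return conn.execute("""
        SELECT n2.label, a.kind
        FROM pathway_node n1
        JOIN pathway_arc a   ON a.src_id = n1.node_id
        JOIN pathway_node n2 ON n2.node_id = a.dst_id
        WHERE n1.label = ?
        ORDER BY n2.label
    """, (label,)).fetchall()


next_from_a = downstream(conn, "A protein")
```

The pathway tables only point at feature rows, so nothing that reads or
writes the feature table needs to know they exist.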

It gets trickier if the data types overlap the data types BioSQL is
intended to store. For instance, if you added your own table for promoters
the standard BioSQL architecture would not be aware of this table and
would omit these features. You could modify the architecture, but I
imagine that would be tricky, and even trickier to do without introducing
a fork in the bioperl-db code. Hilmar is the expert here, he can give you a
more detailed answer as to how much disruption new tables, new columns etc
would cause.

It's easier if you allow redundancy. One pattern that has worked for me in
building a db to facilitate mining SNP data was to use BioSQL as-is for
slurping all feature data from a source such as NCBI. I added
variation-specific and disease-specific tables, which made updates, querying etc
easier. Because SNPs are a subset of the datatypes BioSQL is designed for,
I introduced redundancy into the database, with the possibility of things
getting out of sync which could potentially be horrible. You have to be
super careful here. But the advantage is that I could use the BioSQL perl
architecture without modification for loading data. For querying the data,
I eschewed objects and complex O/R architectures and used a simple
XML<->DB approach. Overall this worked quite nicely.
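
As a rough illustration of the redundancy pattern and of why you have to
be careful with it, here is a sketch in which a SNP-specific table
duplicates a position already held in the generic feature table, plus a
check that reports rows that have drifted apart. The table names, columns
and the out_of_sync() helper are all hypothetical; this is not the schema
I actually used, nor the real BioSQL DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for the generic feature layer (not the real BioSQL DDL).
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL
);
-- Hypothetical SNP-specific table; `position` is deliberately redundant.
CREATE TABLE snp (
    snp_id     INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    position   INTEGER NOT NULL,
    allele     TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)",
                 [(1, "variation", 100, 100),
                  (2, "variation", 250, 250),
                  (3, "gene", 10, 500)])
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?)",
                 [(1, 1, 100, "A/G"), (2, 2, 250, "C/T")])


def out_of_sync(conn):
    """Return snp_ids whose copied position disagrees with the feature row,
    or whose feature row is missing altogether."""
    rows = conn.execute("""
        SELECT s.snp_id
        FROM snp s
        LEFT JOIN feature f ON f.feature_id = s.feature_id
        WHERE f.feature_id IS NULL OR f.start_pos <> s.position
        ORDER BY s.snp_id
    """).fetchall()
    return [r[0] for r in rows]


in_sync_before = out_of_sync(conn)

# Simulate a reload of the generic layer that moves a feature
# without anyone updating the redundant snp table.
conn.execute("UPDATE feature SET start_pos = 251, end_pos = 251 "
             "WHERE feature_id = 2")
drifted = out_of_sync(conn)
```

Running a check like this after every load of the generic layer is one
cheap way to catch drift before it reaches anyone's query results.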

There are advantages and disadvantages to a generic design; we've been
back and forth on this question forever on this list with no real
resolution. Note that BioSQL need not be limited to the typing of the
feature model - it could be used in conjunction with ontologies to give an
extra layer of typing. The advantage is you have two layers, a generic
loosely typed layer which a certain subset of applications can target
(freeing you from a lot of awful code+schema synchronisation issues), and
a more strongly typed layer. The disadvantage is that the typing isn't
enforced at the relational layer, and the required code and/or SQL is more
complex and non-standard.
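
A small sketch of the two-layer idea, and of the more complex SQL it
requires: features carry only a reference to an ontology term, and "give
me all features of this type, including subtypes" becomes a recursive
query over is_a links. The tables (term, term_isa, feature) and the toy
term hierarchy are assumptions made up for this example; they are not the
BioSQL ontology tables and not the real sequence ontology hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical ontology layer: terms plus is_a links between them.
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT UNIQUE NOT NULL
);
CREATE TABLE term_isa (
    child_id  INTEGER NOT NULL REFERENCES term(term_id),
    parent_id INTEGER NOT NULL REFERENCES term(term_id)
);
-- Generic, loosely typed layer: the type is just a pointer to a term.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES term(term_id)
);
""")
conn.executemany("INSERT INTO term VALUES (?, ?)",
                 [(1, "transcript"), (2, "mRNA"), (3, "ncRNA"),
                  (4, "promoter"), (5, "tRNA")])
# Toy hierarchy: mRNA and ncRNA are transcripts, tRNA is an ncRNA.
conn.executemany("INSERT INTO term_isa VALUES (?, ?)",
                 [(2, 1), (3, 1), (5, 3)])
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                 [(1, "f1", 2), (2, "f2", 5), (3, "f3", 4), (4, "f4", 1)])


def features_of_type(conn, type_name):
    """Return names of features typed as `type_name` or any of its subtypes."""
    rows = conn.execute("""
        WITH RECURSIVE subtype(term_id) AS (
            SELECT term_id FROM term WHERE name = ?
            UNION
            SELECT i.child_id
            FROM term_isa i
            JOIN subtype s ON i.parent_id = s.term_id
        )
        SELECT f.name
        FROM feature f
        JOIN subtype s ON f.type_id = s.term_id
        ORDER BY f.name
    """, (type_name,)).fetchall()
    return [r[0] for r in rows]


transcripts = features_of_type(conn, "transcript")
promoters = features_of_type(conn, "promoter")
```

Adding a new feature type here is a row insert into term rather than a
schema change, but nothing at the relational layer stops a feature from
pointing at a term that makes no sense for it.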

You could go ahead and build a non-generic schema with explicit tables for
all the datatypes you mention. You could do this either in discussion with
us BioSQL people (and wait for committee paralysis to set in as we argue
what a gene, allele etc actually is); or you could go it alone and come
back to us with something that we could incorporate as part of BioSQL (and
get disappointed when people don't use it because the schema has a
particular notion of transcription/translation embedded throughout that
doesn't take into account some weird biological stuff that someone is
really interested in...).

Perhaps there is room for both - two schemas, generic and specific (or
rather n specific schemas, each with their own view of biology
predominant). I don't think it would be that hard to come up with some
declarative mapping between the two - perhaps at the SQL/view layer (can
get kind of hard) or at some kind of XML-y transform layer. I think a
mapping like this really has to be declarative, otherwise you get into all
kinds of problems.
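
One way to picture a declarative mapping at the SQL/view layer is a view
that presents rows of the generic schema as if they lived in a specific
promoter table. This is only a sketch under assumptions: the feature and
feature_property tables, the 'regulated_gene' key and the promoter view
are all invented for the example and are not part of BioSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down generic layer: features plus key/value properties.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL
);
CREATE TABLE feature_property (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    key        TEXT NOT NULL,
    value      TEXT NOT NULL
);
-- The mapping itself: a specific "promoter" relation declared over
-- the generic tables, with one property pivoted into a column.
CREATE VIEW promoter AS
SELECT f.feature_id,
       f.start_pos,
       f.end_pos,
       (SELECT p.value
        FROM feature_property p
        WHERE p.feature_id = f.feature_id
          AND p.key = 'regulated_gene') AS regulated_gene
FROM feature f
WHERE f.type = 'promoter';
""")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)",
                 [(1, "promoter", 100, 150), (2, "gene", 151, 900)])
conn.executemany("INSERT INTO feature_property VALUES (?, ?, ?)",
                 [(1, "regulated_gene", "geneA"), (2, "symbol", "geneA")])

# Client code queries the view as though it were a strongly typed table.
promoter_rows = conn.execute(
    "SELECT feature_id, start_pos, end_pos, regulated_gene "
    "FROM promoter ORDER BY feature_id"
).fetchall()
```

Because the view is the whole mapping, loading still goes through the
generic tables and there is no second copy of the data to keep in sync.
The hard part is that real mappings need many such pivots and joins.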

The best example of a schema with the modeling all explicit at the
relational layer (and object layer, I guess) is of course Ensembl. This
would be another option. I'm not close enough to the Ensembl schema or
codebase these days to figure out the difficulties involved in extending
it to your needs. For instance, adding extra tracks to the genome browser
is easy, as is adding new external data sources (I think this is how they
deal with SNPs etc). But I imagine the implications of changing the core
model as to what constitutes a gene may be gnarlier, as this percolates
all the way through the architecture and client code. I don't know what
your project requirements are and whether this may be an issue.

There is also the Chado schema, which has a lot of similarities to BioSQL.
It's a little more modular in its design, and in theory you can take the
modules you like and plug in your own, say, pathway module. No one has
actually done this in practice yet.

My own interest is in weird nonconformist biology which by definition
forces you to constantly change your model (dicistronic genes,
modification at transcriptional and translational levels, trans-splicing,
new properties of noncoding RNA etc). For this sort of stuff I prefer a
generic schema and keeping the typing in an ontology. This is based on the
assumption that schema changes are more disruptive than changes to your
ontology.

You can find out about Chado at:
http://www.gmod.org

You may also be interested in the Sequence Ontology:
http://song.sf.net

Good luck!
Chris

On Tue, 1 Apr 2003, William Hsiao wrote:

> Hi,
>   My lab is interested in adapting BioSQL as the basis
> for a functional genomic database that will support
> microarray analysis we wish to perform in the near
> future.  The types of information we wish to include
> in the database include pathway, signal peptide,
> transcription factors, binding sites, promotors,
> protein domains, signal peptides, subcellular
> localization information, etc.  The database will need
> to accommodate both eukaryotic and prokaryotic
> genomes/genes, and will need to be flexible to
> accommodate future analysis results that we may
> perform.  I have taken a look at the BioSQL schema,
> and from the available documentation, the schema,
> theoretically, can accommodate the different types of
> information.  However, it might be more reasonable to
> extend the current schema to suit the specific types
> of information to simplify insertion and update.
> First, I am wondering if anyone has any suggestions on
> the best (or good) approach to house the data types I
> mentioned above (i.e. the current BioSQL schema design
> is not strong typed, is it better to use a strong
> typed database (e.g. GUS) for storing such
> information)?  More specifically, I am wondering if we
> decided to add additional tables to the schema (but
> keep the original tables in tact), will that break the
> bioperl modules (bioperl-db, etc) that are associated
> with BioSQL?  Second, if we add more columns (fields)
> to the existing tables, will that break bioperl-db?
> Is BioPerl adaptor for BioSQL designed to accommodate
> the possibility that the actual schema might be
> expanded?
>
> Thank you
>
> William Hsiao
> Brinkman Laboratory, SFU