[Biojava-dev] bjv2 alpha 3

Matthew Pocock matthew_pocock at yahoo.co.uk
Wed Jun 23 14:28:34 EDT 2004


Michael Heuer wrote:

>Hello Matthew,
>
>Could you give a bit more explanation or motivation for a couple of
>things, one is the introduction of Anchors in with Locations and Features,
>the other being foreign key references between objects.
>
>   michael
>
>

I should probably write this up properly. Here goes for the (not so)
quick explanation.

Sequences and Features
----------------------

The sequence/feature model at the moment looks like this:

key:
  --> has-a
  -[collection]-> has-a <collection> of
  -[as query]-> this arc is like a 'materialised view' of queries over 
other arcs
 

Feature <-- Location -[list]-> Anchor --> Sequence
                                      --> start/end/strand

Feature -[Set as query]-> Sequence
Sequence -[Set as query]-> Feature


The idea is that you have a single Feature instance that represents a 
biological feature. For example, the human gene SLUR_HUMAN on
chromosome 8. This would have some system-wide ID.  You could write 
algorithms that do stuff with the SLUR_HUMAN feature without any 
sequence data being involved. E.g., mine one of the disease databases.

Now, you load in the relevant ensembl region (seq_ens), an embl file for 
the region (seq_embl), and also make a sub-sequence of the embl region 
(seq_sub). We will now have in memory one Feature and three sequences.
There will also be three Location instances that link SLUR_HUMAN to the 
three sequences. Each Location will have a list of Anchor instances, for 
example, for each exon. Remote features are handled by having different 
Sequence fields for Anchor elements in the same Location.

This model handles trans-splicing and other nasties. A transcript could 
have anchors for each bit of the trans-splice.
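
As a rough illustration of the arcs in the diagram above, here is a
minimal Java sketch. It is not the bjv2 API: every type and method name
(Sequence, Feature, Anchor, Location, featuresOn, sequencesOf) is a
hypothetical stand-in, the coordinates are made up, and the two
"as query" arcs are computed by a plain scan rather than by any real
query machinery.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these types are hypothetical stand-ins, not the bjv2 API.
final class AnchorModelSketch {
    /** A sequence, reduced here to just an identifier. */
    record Sequence(String id) {}

    /** A biological feature with a system-wide ID and no sequence data of its own. */
    record Feature(String id) {}

    /** One contiguous stretch: Anchor --> Sequence, plus start/end/strand. */
    record Anchor(Sequence sequence, int start, int end, int strand) {}

    /** Feature <-- Location -[list]-> Anchor. */
    record Location(Feature feature, List<Anchor> anchors) {}

    /** Sequence -[Set as query]-> Feature, computed by scanning the locations. */
    static List<Feature> featuresOn(Sequence seq, List<Location> locations) {
        List<Feature> hits = new ArrayList<>();
        for (Location loc : locations) {
            boolean touches = loc.anchors().stream()
                    .anyMatch(a -> a.sequence().equals(seq));
            if (touches && !hits.contains(loc.feature())) {
                hits.add(loc.feature());
            }
        }
        return hits;
    }

    /** Feature -[Set as query]-> Sequence, the same scan in the other direction. */
    static List<Sequence> sequencesOf(Feature feature, List<Location> locations) {
        List<Sequence> hits = new ArrayList<>();
        for (Location loc : locations) {
            if (!loc.feature().equals(feature)) continue;
            for (Anchor a : loc.anchors()) {
                if (!hits.contains(a.sequence())) hits.add(a.sequence());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Feature slur = new Feature("SLUR_HUMAN");
        Sequence seqEns = new Sequence("seq_ens");
        Sequence seqEmbl = new Sequence("seq_embl");
        Sequence seqSub = new Sequence("seq_sub");

        // Coordinates are made up; each anchor stands for one exon.
        List<Location> locations = List.of(
            new Location(slur, List.of(
                new Anchor(seqEns, 1000, 1200, 1),
                new Anchor(seqEns, 1500, 1800, 1))),
            new Location(slur, List.of(
                new Anchor(seqEmbl, 100, 300, 1),
                new Anchor(seqEmbl, 600, 900, 1))),
            new Location(slur, List.of(
                new Anchor(seqSub, 1, 201, 1))));

        System.out.println(sequencesOf(slur, locations));
        System.out.println(featuresOn(seqEmbl, locations));
    }
}
```

A remote feature or a trans-spliced transcript would simply be a
Location whose Anchor list mixes different Sequence values; nothing else
in the sketch has to change.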

For models where one level of features have genomic locations and others 
don't (e.g. ensembl genes don't but transcripts do), we can represent 
this natively. Features can have relationships between one another,
which you can choose to model as RDF relationships (I'm still working on
integrating RDF to bjv2 properly).

With the addition of some extra logic (e.g. define that the location of 
an ensembl gene is the extent of all of its transcripts), the missing
Locator instances can be 'magiced' into existence (again, these data
views are things I haven't coded yet, but they are on the list of things 
to do).
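
The "extent of all of its transcripts" rule can be sketched in a few
lines of Java. This is only an assumption about what such a data view
might compute, since the real thing is not coded yet: the Anchor record
and the extentOn method are hypothetical, strand is ignored, and the
extent is taken per sequence as the smallest start and largest end.

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: a hypothetical derived-location rule, not bjv2 code.
final class DerivedExtentSketch {
    /** A stretch on a sequence, identified here by the sequence's ID. */
    record Anchor(String sequenceId, int start, int end) {}

    /**
     * Derives a gene-level anchor on one sequence as the extent (smallest
     * start, largest end) of the transcript anchors on that sequence.
     * Returns empty when no transcript is anchored there.
     */
    static Optional<Anchor> extentOn(String sequenceId, List<Anchor> transcriptAnchors) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        boolean found = false;
        for (Anchor a : transcriptAnchors) {
            if (!a.sequenceId().equals(sequenceId)) continue;
            min = Math.min(min, a.start());
            max = Math.max(max, a.end());
            found = true;
        }
        return found
                ? Optional.of(new Anchor(sequenceId, min, max))
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up transcript anchors for a single gene.
        List<Anchor> transcripts = List.of(
            new Anchor("seq_ens", 100, 200),
            new Anchor("seq_ens", 150, 400),
            new Anchor("seq_embl", 1, 10));
        System.out.println(extentOn("seq_ens", transcripts));
    }
}
```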


Foreign Keys
------------

BJV2 has an API for data-integration. Let's face it, that's most of what 
we do all day. This uses the following model:

Raw Data: this is like rows in a relational database - these are java 
beans that are very dumb. Users of the library shouldn't see these. 
Their class names usually look like FooData. People publishing data 
(e.g. writing an ensembl bridge) may choose to hand-craft raw data beans,
or hand-craft the lifecycle management of them. Parsers may well spit 
out streams or collections of raw data beans.

User Data: these are fully populated objects - these are java beans, but 
you only provide the interface for them. Users should not ever implement 
these interfaces. Developers shouldn't either. There will be one 
collection of user data beans for each domain e.g. genomics.

Integrator: takes a collection of pots of raw data and creates mature 
user data beans. These implement the user data interfaces. If you write 
your own integrator you deserve everything you get. Most people most of 
the time won't see these things.

Introspector: used by the Integrator to map properties and classes from 
the raw data arena into the user data space. Nobody should see these 
unless they are doing integrator magic.

To do the linking, raw-data entities need to get hooked up to each 
other. This is all done through introspection magic. To help this along, 
I've introduced the idea of properties that contain values that match 
the primary key of the object they are referring to. For Identifiable
objects, that will usually be the Identifier.
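
To make the raw data / user data split and the foreign-key idea a bit
more concrete, here is a minimal Java sketch. All names (GeneData,
TranscriptData, Gene, Transcript, integrate) are hypothetical, and the
linking is done with an explicit map keyed on the identifier rather than
by introspection, so this shows only the shape of the idea, not how the
bjv2 Integrator actually works.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, and an explicit map in place of
// the introspection that would normally discover the keys.
final class ForeignKeySketch {
    // Raw data: dumb beans, like rows in a relational table.
    record GeneData(String identifier, String name) {}

    // geneIdentifier holds the primary key of the GeneData it refers to.
    record TranscriptData(String identifier, String geneIdentifier) {}

    // User data: interfaces only; the integrator supplies implementations.
    interface Gene { String getName(); }
    interface Transcript { String getIdentifier(); Gene getGene(); }

    /** Resolves each transcript's foreign key against the genes' primary keys. */
    static List<Transcript> integrate(List<GeneData> genes,
                                      List<TranscriptData> transcripts) {
        Map<String, Gene> genesByKey = new HashMap<>();
        for (GeneData g : genes) {
            genesByKey.put(g.identifier(), () -> g.name());
        }
        List<Transcript> result = new ArrayList<>();
        for (TranscriptData t : transcripts) {
            Gene gene = genesByKey.get(t.geneIdentifier());
            if (gene == null) {
                throw new IllegalStateException(
                        "Dangling foreign key: " + t.geneIdentifier());
            }
            result.add(new Transcript() {
                @Override public String getIdentifier() { return t.identifier(); }
                @Override public Gene getGene() { return gene; }
            });
        }
        return result;
    }

    public static void main(String[] args) {
        // "G1" and "T1" are made-up identifiers.
        List<Transcript> linked = integrate(
            List.of(new GeneData("G1", "SLUR_HUMAN")),
            List.of(new TranscriptData("T1", "G1")));
        Transcript t = linked.get(0);
        System.out.println(t.getIdentifier() + " -> " + t.getGene().getName());
    }
}
```

The caller only ever sees Gene and Transcript; the FooData beans and the
key lookup stay behind the integrate call.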

Once I've finished the next round of annoying modifications to tools,
most of the support structure will get generated by the compiler from 
source-code annotations and other magic, so /nearly/ everybody will just 
see the user bean APIs.


Data Querying
-------------

Because there is meta-data throughout bjv2 about what data we are 
working with, we also have an integrated query language. This is the 
primary way that a user accesses data. The queries themselves get passed 
through the integration layers, and potentially turn into sane SQL
statements, or hash table lookups where possible. It is up to people 
writing data providing plugins to do sane things with queries they can 
work with. Once RDF is plugged in, all bjv2 data will be queryable as if 
it was a big RDF triple store. At least, that's the hope.
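
As a hedged sketch of what "doing sane things with queries" might look
like for an in-memory plugin: the provider below answers an equality
query by hash table lookup when the queried property is the key it has
indexed, and falls back to a scan otherwise. The names (FeatureData,
EqualsQuery, InMemoryProvider, lastPlan) are hypothetical and the query
is far simpler than a real query language; a database-backed provider
would presumably emit SQL at the same point.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical provider, not the bjv2 query API.
final class QueryDispatchSketch {
    record FeatureData(String identifier, String type) {}

    /** A single "property equals value" condition. */
    record EqualsQuery(String property, String value) {}

    static final class InMemoryProvider {
        private final List<FeatureData> rows;
        private final Map<String, FeatureData> byIdentifier = new HashMap<>();
        /** Records which strategy the last select used, for inspection. */
        String lastPlan = "";

        InMemoryProvider(List<FeatureData> rows) {
            this.rows = rows;
            for (FeatureData r : rows) byIdentifier.put(r.identifier(), r);
        }

        List<FeatureData> select(EqualsQuery q) {
            List<FeatureData> hits = new ArrayList<>();
            if (q.property().equals("identifier")) {
                // The primary key is indexed, so this is one hash lookup.
                lastPlan = "hash lookup";
                FeatureData hit = byIdentifier.get(q.value());
                if (hit != null) hits.add(hit);
            } else if (q.property().equals("type")) {
                // No index on this property, so check every row.
                lastPlan = "scan";
                for (FeatureData r : rows) {
                    if (r.type().equals(q.value())) hits.add(r);
                }
            } else {
                throw new IllegalArgumentException(
                        "Unknown property: " + q.property());
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        // Made-up rows.
        InMemoryProvider provider = new InMemoryProvider(List.of(
            new FeatureData("F1", "gene"),
            new FeatureData("F2", "transcript"),
            new FeatureData("F3", "transcript")));
        System.out.println(provider.select(new EqualsQuery("identifier", "F2"))
                + " via " + provider.lastPlan);
        System.out.println(provider.select(new EqualsQuery("type", "transcript"))
                + " via " + provider.lastPlan);
    }
}
```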


Does that help?

Matthew

