[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Thomas Down td2@sanger.ac.uk
Wed, 1 May 2002 23:40:08 +0100


On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
>
> > [Alignment representations]
> 
> Let's outline them, using a shortened tuple notation:
> 
> 1) feature pair links two seqfeatures
>
> # maps cleanly to the bioperl object model
> # join-heavy
> # lots of rows instantiated

# Maps pretty nicely to BioJava, too.

I must admit I like this.  Particularly because it uses seqfeatures,
and we're already storing tag-value stuff on seqfeatures.  This
is the natural place to put all kinds of alignment-type-dependent
information like reading frame.

Have you thought about extending this to allow n:n relationships,
for modelling alignments of >2 sequences?

> 2) use the existing sf_relationship table to make feature pairs

I think we should look into this more closely.  I'm not
altogether convinced that you /do/ need as many different
features as you're suggesting.  For the simple case of
pairwise similarity, you could just have two features,
linked by a single homologous_to line in the
seqfeature_relationship table.  But this doesn't scale
well for >2 sequences.
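
To make the two-feature idea concrete, here is a minimal sketch using
Python's sqlite3 module.  The table and column names are simplified
stand-ins that I'm assuming for illustration, not the actual BioSQL
definitions, and the helper names (make_db, add_pairwise_hit, homologs)
are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a cut-down, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE seqfeature (
            seqfeature_id INTEGER PRIMARY KEY,
            bioentry_id   INTEGER NOT NULL,
            seq_start     INTEGER NOT NULL,
            seq_end       INTEGER NOT NULL
        );
        CREATE TABLE seqfeature_relationship (
            seqfeature1_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
            seqfeature2_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
            relationship   TEXT NOT NULL
        );
    """)
    return conn


def add_feature(conn, bioentry_id, start, end):
    """Insert one seqfeature row and return its id."""
    cur = conn.execute(
        "INSERT INTO seqfeature (bioentry_id, seq_start, seq_end) VALUES (?, ?, ?)",
        (bioentry_id, start, end),
    )
    return cur.lastrowid


def add_pairwise_hit(conn, query, target):
    """Store a pairwise similarity as two features plus ONE relationship row."""
    q = add_feature(conn, *query)
    t = add_feature(conn, *target)
    conn.execute(
        "INSERT INTO seqfeature_relationship VALUES (?, ?, 'homologous_to')",
        (q, t),
    )
    return q, t


def homologs(conn, seqfeature_id):
    """Return the ids of features linked to this one, in either direction."""
    rows = conn.execute(
        """
        SELECT seqfeature2_id FROM seqfeature_relationship
         WHERE seqfeature1_id = ? AND relationship = 'homologous_to'
        UNION
        SELECT seqfeature1_id FROM seqfeature_relationship
         WHERE seqfeature2_id = ? AND relationship = 'homologous_to'
        """,
        (seqfeature_id, seqfeature_id),
    )
    return sorted(r[0] for r in rows)
```

With this layout, linking every pair among n aligned features takes
n*(n-1)/2 relationship rows, which is the scaling problem for >2
sequences.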

> 3) similaritypair links two biosequences

Argh, pleasepleaseplease no.  See below for my reasoning on
this.

> 4) similaritypair links two bioentries
> 
> simpair(simpair_id, bioentry1_id, start1, end1,
>                     strand1, frame1,
>                     bioentry2_id, start2, end2,
>                     strand2, frame2,
>                     e-value,
>                     score)
> simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
>                         qualifier_value)
> 
> similar to (3). has the advantage over 3 that a lot of the time you don't
> need to join bioentry/biosequence, as a lot of the time the information
> you need (ie display_id/accession) is in bioentry. Even so, a lot of times
> you will need to make that extra join.

Note that this doesn't scale at all to alignments of >2 sequences,
although it's not too hard to come up with a similar scheme
which does.
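
One possible shape for such a scheme, sketched with Python's sqlite3
module: move the per-sequence columns out of simpair into a member
table, one row per aligned sequence.  The table names (alignment,
alignment_member), their columns, and the helper functions are all
hypothetical; nothing like this has been proposed in the thread so far.

```python
import sqlite3

# Assumed schema: one row per alignment, one row per participating sequence.
SCHEMA = """
CREATE TABLE alignment (
    alignment_id INTEGER PRIMARY KEY,
    score        REAL
);
CREATE TABLE alignment_member (
    alignment_id INTEGER NOT NULL REFERENCES alignment(alignment_id),
    member_rank  INTEGER NOT NULL,
    bioentry_id  INTEGER NOT NULL,
    seq_start    INTEGER NOT NULL,
    seq_end      INTEGER NOT NULL,
    PRIMARY KEY (alignment_id, member_rank)
);
"""


def store_alignment(conn, score, members):
    """Store an alignment of any number of (bioentry_id, start, end) members."""
    cur = conn.execute("INSERT INTO alignment (score) VALUES (?)", (score,))
    alignment_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO alignment_member VALUES (?, ?, ?, ?, ?)",
        [
            (alignment_id, rank, bioentry_id, start, end)
            for rank, (bioentry_id, start, end) in enumerate(members, start=1)
        ],
    )
    return alignment_id


def aligned_entries(conn, alignment_id):
    """Return the bioentry ids taking part in an alignment, in rank order."""
    rows = conn.execute(
        "SELECT bioentry_id FROM alignment_member"
        " WHERE alignment_id = ? ORDER BY member_rank",
        (alignment_id,),
    )
    return [r[0] for r in rows]
```

A pairwise hit is then just the two-member case, at the cost of one
extra join compared with the flat simpair table.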

My slight concern about this is that it's got lots of fields
whose values may be doubtful.  For example,
most pairwise alignments don't /really/ have a strand on both query
and target -- there's just one bit of information, to say whether
the sequences are parallel or antiparallel.  Similarly, I'm suspicious
of frame.  Unless I'm missing something, this is only meaningful
in protein <--> nucleic acid alignments, and even then there's only
one frame (on the nucleic acid side).
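
The strand point can be shown in a few lines of Python.  The function
name and the +1/-1 strand encoding are my assumptions for illustration;
the point is only that the two strand columns collapse to a single bit.

```python
def relative_orientation(strand1, strand2):
    """Collapse two strand values (+1 or -1) into one orientation.

    Returns +1 when the sequences are parallel, -1 when antiparallel.
    """
    if strand1 not in (1, -1) or strand2 not in (1, -1):
        raise ValueError("strands must be +1 or -1")
    return strand1 * strand2


# Four possible (strand1, strand2) combinations, but only two distinct answers.
combos = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
orientations = [relative_orientation(a, b) for a, b in combos]
```

Here (+1, +1) and (-1, -1) both describe a parallel alignment, so
storing both columns records the same fact in two different ways.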

These objections aside, though, it's certainly quite implementable.

> I think my preference is for (4), it's fastest. It doesn't map directly to
> the bioperl object model but the manual mapping in the adapters isn't too
> hard.
> 
> Note I'm only mentioning the bioperl object model to provoke otehrs, eg
> Thomas to chime in.
> 
> I have to say, I'm still confused by the need for a bioentry/biosequence
> split. I can't see any situations in which you would want to store a tuple
> of one without the other.

I don't know exactly what the original rationale was for this
split.  However, it does provide the possibility of an extremely
powerful polymorphism.  I think of bioentry as an abstract base
class for annotated sequences (yes, yes, I know the relational model
isn't anything like object orientation -- but in this particular
case the analogy works well).  A biosequence is one concrete
subclass, which specifies storage of the sequence data in a single
`text' column of the database.  Other `subclasses' could include

  - assembly

  - shredded sequence (for efficient storage of large sequences
    in cases where you don't know/want to bother with the assembly)

The suggested assembly schema which I came up with a while back
works in exactly this way.  The code for implementing it turned
out to be very simple and clean.

But for this to work, it does mean that all the annotation needs
to stay joined onto bioentry.  Hence my `argh' when you suggested
joining simpair to biosequence.
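
Taking the object-oriented analogy literally, the idea looks roughly
like the Python sketch below.  This only illustrates the analogy; it is
not the assembly schema mentioned above.  The class and method names
are hypothetical, and the Assembly here simply concatenates its
components, ignoring overlaps and orientation.

```python
from abc import ABC, abstractmethod


class Bioentry(ABC):
    """Annotated sequence.  All annotation hangs off this level."""

    def __init__(self, accession):
        self.accession = accession
        self.features = []  # (start, end, label) tuples, 1-based inclusive

    @abstractmethod
    def sequence(self):
        """Return the full sequence, however it happens to be stored."""

    def feature_sequence(self, start, end):
        """Extract a region without caring which subclass stores the data."""
        return self.sequence()[start - 1:end]


class Biosequence(Bioentry):
    """Concrete case: the sequence lives in a single text value."""

    def __init__(self, accession, text):
        super().__init__(accession)
        self._text = text

    def sequence(self):
        return self._text


class Assembly(Bioentry):
    """Concrete case: the sequence is built from other bioentries."""

    def __init__(self, accession, components):
        super().__init__(accession)
        self._components = list(components)

    def sequence(self):
        # Simplifying assumption: components abut end to end.
        return "".join(c.sequence() for c in self._components)
```

Code that works with features only ever talks to Bioentry, so it behaves
the same for a plain biosequence and for an assembly.  Joining simpair
to biosequence would tie it to one storage choice and lose that.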

> Thanks for being the first non-genbank-roundtrip use-case biosql beta
> tester!

So my Ensembl-in-BioSQL doesn't count? ;-)

    Thomas.