[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Wed, 1 May 2002 17:25:36 -0700 (PDT)

On Wed, 1 May 2002, Thomas Down wrote:

> On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
> >
> > > [Alignment representations]
> >
> > Let's outline them, using a shortened tuple notation:
> >
> > 1) feature pair links two seqfeatures
> >
> > # maps cleanly to the bioperl object model
> > # join-heavy
> > # lots of rows instantiated
>
> # Maps pretty nicely to BioJava, too.
>
> I must admit I like this.  Particularly because it uses seqfeatures,
> and we're already storing tag-value stuff on seqfeatures.  This
> is the natural place to put all kinds of alignment-type-dependent
> information like reading frame.

except we may want to attach the overall score for a group of HSPs at the
featurepair level, which breaks the everything-on-a-seqfeature model

> Have you thought about extending this to allow n:n relationships,
> for modelling alignments of >2 sequences?

this is similar to (2), below

> > 2) use the existing sf_relationship table to make feature pairs
>
> I think we should look into this more closely.  I'm not
> altogether convinced that you /do/ need as many different
> features as you're suggesting.  For the simple case of
> pairwise similarity, you could just have two features,
> each linked by a single homologous_to line in the
> seqfeature_relationship table.  But this doesn't scale
> well for >2 sequences.

so this is for ungapped pairwise alignments only?

also you'd lose the ability to treat pairwise and multiple
alignments the same way so you'd be as well having separate tables.

m sequences, each with n HSPs/ungapped blocks. using a 3 level hierarchy
(hit, HSP/block, extent), each alignment/hit will require:

bioentry:     m
seqfeature:   mn + n + 1   = n(m + 1) + 1
sf_location:  mn + n + 1   = n(m + 1) + 1
sf_rel:       mn + n       = n(m + 1)
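
A minimal sketch of that row count, for anyone who wants to plug in numbers. It assumes the hierarchy is read as one hit feature, n HSP/block features, and one extent feature per sequence per block (m*n extents), with one sf_rel row linking each extent to its block and each block to the hit. The function name and dict keys are illustrative only, not part of any schema.

```python
def rows_per_hit(m: int, n: int) -> dict:
    """Rows needed to store one alignment/hit of m sequences with n blocks.

    Assumed reading of the 3-level hierarchy (hit, HSP/block, extent):
    1 hit feature, n block features, m*n extent features.
    """
    hits = 1
    blocks = n
    extents = m * n
    features = hits + blocks + extents
    return {
        "bioentry": m,
        "seqfeature": features,
        "sf_location": features,      # one location per feature
        "sf_rel": extents + blocks,   # extent->block links plus block->hit links
    }


if __name__ == "__main__":
    # A pairwise hit (m=2) with 3 HSPs
    print(rows_per_hit(2, 3))
```

Under this reading, the per-hit cost grows with the product of sequences and blocks, which is the "lots of rows instantiated" concern in concrete terms.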

> > 3) similaritypair links two biosequences
>
> Argh, pleasepleaseplease no.  See below for my reasoning on
> this.
>
> > 4) similaritypair links two bioentries
> >
> > simpair(simpair_id, bioentry1_id, start1, end1,
> >                     strand1, frame1,
> >                     bioentry2_id, start2, end2,
> >                     strand2, frame2,
> >                     e-value,
> >                     score)
> > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
> >                         qualifier_value)
> >
> > similar to (3). has the advantage over 3 that a lot of the time you don't
> > need to join bioentry/biosequence, as a lot of the time the information
> > you need (ie display_id/accession) is in bioentry. Even so, a lot of times
> > you will need to make that extra join.
>
> Note that this doesn't scale at all to alignments of >2 sequences,
> although it's not too hard to come up with a similar scheme
> which does.
>
> My slight concern about this is that it's got lots of fields
> where the values may be potentially doubtful.  For example,
> most pairwise alignments don't /really/ have a strand on both query
> and target -- there's just one bit of information, to say whether
> the sequences are parallel or antiparallel.  Similarly, I'm suspicious
> of frame.  Unless I'm missing something, this is only meaningful
> in Protein <--> nucleic acid alignments, and even then there's only
> one frame (on the nucleic acid side).

and tblastx (2 frames)

it was just a suggested set of nullable columns, i think it should contain
either all of the generally useful ones (including frame) OR everything
should go over into qualifier_value
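
To make the nullable-column option concrete, here is a rough sketch using Python's sqlite3 module and an in-memory database. The column types, the NOT NULL choices, the sample values, and the rename of e-value to evalue (a hyphen is not valid in a bare SQL identifier) are all my assumptions; only the column names come from the tuple notation above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL,
    start1       INTEGER NOT NULL,
    end1         INTEGER NOT NULL,
    strand1      INTEGER,            -- nullable
    frame1       INTEGER,            -- nullable: only set for translated searches
    bioentry2_id INTEGER NOT NULL,
    start2       INTEGER NOT NULL,
    end2         INTEGER NOT NULL,
    strand2      INTEGER,            -- nullable
    frame2       INTEGER,            -- nullable
    evalue       REAL,
    score        REAL
);
CREATE TABLE simpair_qualifier_value (
    simpair_id       INTEGER NOT NULL REFERENCES simpair(simpair_id),
    ontology_term_id INTEGER NOT NULL,
    qualifier_rank   INTEGER NOT NULL,
    qualifier_value  TEXT
);
"""


def build_demo() -> sqlite3.Connection:
    """Create the two tables and load two made-up similarity pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # tblastx-style pair: a frame on both sides
    conn.execute(
        "INSERT INTO simpair VALUES (1, 10, 1, 300, 1, 2, 20, 51, 350, -1, 3, 1e-30, 250.0)"
    )
    # nucleotide-vs-nucleotide pair: both frame columns stay NULL
    conn.execute(
        "INSERT INTO simpair VALUES (2, 10, 1, 100, 1, NULL, 30, 5, 104, 1, NULL, 1e-10, 90.0)"
    )
    # anything less general goes into the qualifier table
    conn.execute("INSERT INTO simpair_qualifier_value VALUES (2, 7, 1, 'example value')")
    return conn


def framed_pairs(conn: sqlite3.Connection) -> list:
    """Ids of pairs that carry a frame on both sides."""
    rows = conn.execute(
        "SELECT simpair_id FROM simpair "
        "WHERE frame1 IS NOT NULL AND frame2 IS NOT NULL ORDER BY simpair_id"
    )
    return [r[0] for r in rows]
```

The alternative would drop strand/frame from simpair entirely and store them as rows in simpair_qualifier_value, at the cost of an extra join whenever they are needed.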

> These objections aside, though, it's certainly quite implementable.
>
> > I think my preference is for (4), it's fastest. It doesn't map directly to
> > the bioperl object model but the manual mapping in the adapters isn't too
> > hard.
> >
> > Note I'm only mentioning the bioperl object model to provoke others, eg
> > Thomas to chime in.
> >
> > I have to say, I'm still confused by the need for a bioentry/biosequence
> > split. I can't see any situations in which you would want to store a tuple
> > of one without the other.
>
> I don't know exactly what the original rationale was for this
> split.  However, it does provide the possibility of an extremely
> powerful polymorphism.  I think of bioentry as an abstract base
> class for annotated sequences (yes, yes, I know the relational model
> isn't anything like object orientation -- but in this particular
> case the analogy works well).  A biosequence is one concrete
> subclass, which specifies storage of the sequence data in a single
> `text' column of the database.  Other `subclasses' could include
>
>   - assembly
>
>   - shredded sequence (for efficient storage of large sequences
>     in cases where you don't know/want to bother with the assembly)

another example could be PFAM type 'subclasses', e.g. sequence models

this would allow you to store pfam/interpro hits on peptide sequences
using whatever relations we decide on for storing alignments

ok, the bioentry-as-base-class explanation works for me (not necessarily
abstract)

this is essentially how i do it in GadFly, except everything is collapsed
into the 'seq' table, including shredded sequence, assembly and even
interpro "virtual seqs"

It is elegant, but I do worry about the performance hit. In
collapsing the bioentry table with its 'subclass' tables, you have the
disadvantage of the nulled biosequence_str column sitting around doing
nothing. But other than that it seems simpler and faster.
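
A small sketch of the two layouts side by side, again with sqlite3 in memory. The table definitions are heavily cut down and the accessions are made up; the collapsed table is called seq here only by analogy with the GadFly table mentioned above, and its columns are my assumption, not the real GadFly schema.

```python
import sqlite3


def split_layout() -> list:
    """bioentry as the 'base class', biosequence as one 'subclass' table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            accession   TEXT NOT NULL
        );
        CREATE TABLE biosequence (
            bioentry_id     INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
            biosequence_str TEXT NOT NULL
        );
        INSERT INTO bioentry VALUES (1, 'SEQ1');
        INSERT INTO bioentry VALUES (2, 'ASM1');   -- an assembly: no biosequence row
        INSERT INTO biosequence VALUES (1, 'ACGT');
    """)
    # The sequence string costs one join; entries without one come back as None.
    return conn.execute("""
        SELECT e.accession, s.biosequence_str
        FROM bioentry e LEFT JOIN biosequence s ON s.bioentry_id = e.bioentry_id
        ORDER BY e.bioentry_id
    """).fetchall()


def collapsed_layout() -> list:
    """Everything in one table; non-sequence entries carry a NULL column."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE seq (
            seq_id          INTEGER PRIMARY KEY,
            accession       TEXT NOT NULL,
            biosequence_str TEXT                   -- NULL for assemblies etc.
        );
        INSERT INTO seq VALUES (1, 'SEQ1', 'ACGT');
        INSERT INTO seq VALUES (2, 'ASM1', NULL);
    """)
    return conn.execute(
        "SELECT accession, biosequence_str FROM seq ORDER BY seq_id"
    ).fetchall()
```

Both functions return the same rows; the difference is only whether the sequence string is reached through a join or sits in a nullable column.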

Ok, related question. Thomas, when you store an ensembl gene object in
bioSQL, do you instantiate biosequence/bioentry rows for transcript and
translation objects? if so, how are they linked?

> The suggested assembly schema which I came up with a while back
> works in exactly this way.  The code for implementing it turned
> out very simple and clean.
>
> But for this to work, it does mean that all the annotation needs
> to stay joined onto bioentry.  Hence my `argh' when you suggested
> joining simpair to biosequence.

sure, i see. it's just my instinct to treat the biosequence table as you
treat bioentry

> > Thanks for being the first non-genbank-roundtrip use-case biosql beta
> > tester!
>
> So my Ensembl-in-BioSQL doesn't count? ;-)

Oops, forgot about that!

>     Thomas.
>