[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Lincoln Stein lstein@cshl.org
Thu, 2 May 2002 08:11:31 -0400


The AceDB model for handling gapped alignments might be worth looking
at.  There is a single Homol_data object to represent the overall
alignment between two sequences.  The Homol_data object contains the
details of the alignments of the gapped HSPs.  How similar is this to
what you're talking about?

Lincoln

Chris Mungall writes:
 > 
 > 
 > On Wed, 1 May 2002, Thomas Down wrote:
 > 
 > > On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
 > > >
 > > > > [Alignment representations]
 > > >
 > > > Let's outline them, using a shortened tuple notation:
 > > >
 > > > 1) feature pair links two seqfeatures
 > > >
 > > > # maps cleanly to the bioperl object model
 > > > # join-heavy
 > > > # lots of rows instantiated
 > >
 > > # Maps pretty nicely to BioJava, too.
 > >
 > > I must admit I like this.  Particularly because it uses seqfeatures,
 > > and we're already storing tag-value stuff on seqfeatures.  This
 > > is the natural place to put all kinds of alignment-type-dependent
 > > information like reading frame.
 > 
 > except we may want to attach the overall score for a group of HSPs at the
 > featurepair level, which breaks the everything-on-a-seqfeature model
 > 
 > > Have you thought about extending this to allow n:n relationships,
 > > for modelling alignments of >2 sequences?
 > 
 > this is similar to (2), below
 > 
 > > > 2) use the existing sf_relationship table to make feature pairs
 > >
 > > I think we should look into this more closely.  I'm not
 > > altogether convinced that you /do/ need as many different
 > > features as you're suggesting.  For the simple case of
 > > pairwise similarity, you could just have two features,
 > > each linked by a single homologous_to line in the
 > > seqfeature_relationship table.  But this doesn't scale
 > > well for >2 sequences.
 > 
 > so this is for ungapped pairwise alignments only?
 > 
 > also you'd lose the ability to treat pairwise and multiple
 > alignments the same way, so you might as well have separate tables.
 > 
 > m sequences, each with n HSPs/ungapped blocks. using a 3 level hierarchy
 > (hit, HSP/block, extent), each alignment/hit will require:
 > 
 > bioentry:	m
 > seqfeature:	mn + n + 1	= n(m + 1) + 1
 > sf_location:	mn + n + 1	= n(m + 1) + 1
 > sf_rel:		mn + n		= n(m + 1)
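
As a quick sanity check on the counts above, here is a small Python sketch that tallies the rows for one alignment under the 3-level hierarchy (hit, HSP/block, extent). It assumes one extent per sequence per HSP/block, one sf_location row per seqfeature, and one sf_rel row linking each non-root feature to its parent. The function name `rows_per_alignment` is made up for illustration and is not part of any schema.

```python
def rows_per_alignment(m: int, n: int) -> dict:
    """Row counts for one alignment of m sequences with n HSPs/blocks each."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    extents = m * n   # one extent per sequence per HSP/block (assumed)
    blocks = n        # one HSP/block feature grouping its extents
    hits = 1          # one top-level hit feature
    features = extents + blocks + hits
    return {
        "bioentry": m,
        "seqfeature": features,       # mn + n + 1
        "sf_location": features,      # one location per feature (assumed)
        "sf_rel": extents + blocks,   # each non-root feature links to its parent
    }


# A pairwise hit (m = 2) with three HSPs (n = 3).
print(rows_per_alignment(2, 3))
```

For a pairwise hit with three HSPs this gives 10 seqfeature rows, 10 sf_location rows and 9 sf_rel rows.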
 > 
 > > > 3) similaritypair links two biosequences
 > >
 > > Argh, pleasepleaseplease no.  See below for my reasoning on
 > > this.
 > >
 > > > 4) similaritypair links two bioentries
 > > >
 > > > simpair(simpair_id, bioentry1_id, start1, end1,
 > > >                     strand1, frame1,
 > > >                     bioentry2_id, start2, end2,
 > > >                     strand2, frame2,
 > > >                     e-value,
 > > >                     score)
 > > > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
 > > >                         qualifier_value)
 > > >
 > > > similar to (3). has the advantage over (3) that you often don't need
 > > > to join bioentry/biosequence, because the information you need
 > > > (ie display_id/accession) is usually in bioentry. Even so, there will
 > > > be plenty of times when you need to make that extra join.
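
To make option (4) concrete, here is a rough sketch of the two tables using Python's built-in sqlite3 module. The column types, the cut-down bioentry table and the sample rows are all assumptions made for illustration. `e-value` is spelled `evalue` because a hyphen is not valid in an unquoted SQL column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down bioentry: only the columns mentioned in the discussion.
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    display_id  TEXT,
    accession   TEXT
);
-- Option (4): one row per similarity pair, keyed on bioentry.
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER, strand1 INTEGER, frame1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER, strand2 INTEGER, frame2 INTEGER,
    evalue REAL,
    score  REAL
);
CREATE TABLE simpair_qualifier_value (
    simpair_id       INTEGER NOT NULL REFERENCES simpair(simpair_id),
    ontology_term_id INTEGER NOT NULL,
    qualifier_rank   INTEGER NOT NULL,
    qualifier_value  TEXT
);
""")

# Made-up sample data: two entries and one hit between them.
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?, ?)",
    [(1, "QUERY_1", "X00001"), (2, "TARGET_1", "X00002")],
)
conn.execute(
    "INSERT INTO simpair VALUES "
    "(1, 1, 10, 90, 1, NULL, 2, 200, 280, 1, NULL, 1e-20, 150.0)"
)

# Display ids come straight from bioentry; biosequence is never joined.
rows = conn.execute("""
    SELECT q.display_id, t.display_id, s.score
    FROM simpair s
    JOIN bioentry q ON q.bioentry_id = s.bioentry1_id
    JOIN bioentry t ON t.bioentry_id = s.bioentry2_id
""").fetchall()
print(rows)
```

The frame columns are left NULL here, which is how a nucleotide-nucleotide hit would look if frame is treated as an optional column.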
 > >
 > > Note that this doesn't scale at all to alignments of >2 sequences,
 > > although it's not too hard to come up with a similar scheme
 > > which does.
 > >
 > > My slight concern about this is that it's got lots of fields
 > > where the values may be potentially doubtful.  For example,
 > > most pairwise alignments don't /really/ have a strand on both query
 > > and target -- there's just one bit of information, to say whether
 > > the sequences are parallel or antiparallel.  Similarly, I'm suspicious
 > > of frame.  Unless I'm missing something, this is only meaningful
 > > in Protein <--> nucleic acid alignments, and even then there's only
 > > one frame (on the nucleic acid side).
 > 
 > and tblastx (2 frames)
 > 
 > it was just a suggested set of nullable columns, i think it should contain
 > either all of the generally useful ones (including frame) OR everything
 > should go over into qualifier_value
 > 
 > > These objections aside, though, it's certainly quite implementable.
 > >
 > > > I think my preference is for (4), it's fastest. It doesn't map directly to
 > > > the bioperl object model but the manual mapping in the adapters isn't too
 > > > hard.
 > > >
 > > > Note I'm only mentioning the bioperl object model to provoke others, eg
 > > > Thomas to chime in.
 > > >
 > > > I have to say, I'm still confused by the need for a bioentry/biosequence
 > > > split. I can't see any situations in which you would want to store a tuple
 > > > of one without the other.
 > >
 > > I don't know exactly what the original rationale was for this
 > > split.  However, it does provide the possibility of an extremely
 > > powerful polymorphism.  I think of bioentry as an abstract base
 > > class for annotated sequences (yes, yes, I know the relational model
 > > isn't anything like object orientation -- but in this particular
 > > case the analogy works well).  A biosequence is one concrete
 > > subclass, which specifies storage of the sequence data in a single
 > > `text' column of the database.  Other `subclasses' could include
 > >
 > >   - assembly
 > >
 > >   - shredded sequence (for efficient storage of large sequences
 > >     in cases where you don't know/want to bother with the assembly)
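
The base-class analogy can be sketched in Python as below. This is only an illustration of the analogy, not of how BioSQL or any adapter implements it: in the database the 'subclasses' would be separate tables joined onto bioentry. The class and field names are hypothetical, and the assembly here simply concatenates its components, ignoring overlaps, gaps and orientation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bioentry:
    """Base 'class': identity and annotation hang off this."""
    display_id: str
    accession: str

    def sequence(self) -> str:
        raise NotImplementedError("subclasses decide how residues are stored")


@dataclass
class Biosequence(Bioentry):
    """Sequence stored whole, like a single text column."""
    seq_text: str = ""

    def sequence(self) -> str:
        return self.seq_text


@dataclass
class ShreddedSequence(Bioentry):
    """Large sequence stored as ordered chunks."""
    chunks: List[str] = field(default_factory=list)

    def sequence(self) -> str:
        return "".join(self.chunks)


@dataclass
class Assembly(Bioentry):
    """Sequence built from other entries (simplified: plain concatenation)."""
    components: List[Bioentry] = field(default_factory=list)

    def sequence(self) -> str:
        return "".join(c.sequence() for c in self.components)


asm = Assembly("asm1", "A0001", [
    Biosequence("c1", "C0001", "ACGT"),
    ShreddedSequence("c2", "C0002", ["GG", "TT"]),
])
print(asm.display_id, asm.sequence())
```

Anything that only needs display_id or accession can work with any of these without caring how the residues are stored, which is the point of keeping annotation joined onto bioentry.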
 > 
 > another example could be PFAM type 'subclasses', e.g. sequence models
 > 
 > this would allow you to store pfam/interpro hits on peptide sequences
 > using whatever relations we decide on for storing alignments
 > 
 > ok, the bioentry-as-base-class explanation works for me (not necessarily
 > abstract)
 > 
 > this is essentially how i do it in GadFly, except everything is collapsed
 > into the 'seq' table, including shredded sequence, assembly and even
 > interpro "virtual seqs"
 > 
 > It is elegant, but I do worry about the performance hit. In
 > collapsing the bioentry table with its 'subclass' tables, you have the
 > disadvantage of the nulled biosequence_str column sitting around doing
 > nothing. But other than that it seems simpler and faster.
 > 
 > Ok, related question. Thomas, when you store an ensembl gene object in
 > bioSQL, do you instantiate biosequence/bioentry rows for transcript and
 > translation objects? if so, how are they linked?
 > 
 > > The suggested assembly schema which I came up with a while back
 > > works in exactly this way.  The code for implementing it turned
 > > out very simple and clean.
 > >
 > > But for this to work, it does mean that all the annotation needs
 > > to stay joined onto bioentry.  Hence my `argh' when you suggested
 > > joining simpair to biosequence.
 > 
 > sure, i see. it's just my instinct to treat the biosequence table as you
 > treat bioentry
 > 
 > > > Thanks for being the first non-genbank-roundtrip use-case biosql beta
 > > > tester!
 > >
 > > So my Ensembl-in-BioSQL doesn't count? ;-)
 > 
 > Oops, forgot about that!
 > 
 > >     Thomas.
 > >
 > 
 > 

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================