[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Lincoln Stein lstein@cshl.org
Thu, 2 May 2002 08:41:46 -0400


Another thing to add is that the SNP-finding code, which we use for
the TSC project, records the boundaries of the overall alignment and
then encodes the details of the gaps into a compact binary structure
that is stored as a BLOB.  This might be similar in spirit to the
EnsEMBL "cigars" but I haven't looked at that code.
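
For illustration only, here is a minimal sketch of the general idea of
packing gap details into a compact binary value that could be stored as
a BLOB next to the alignment boundaries.  The operation codes (M, I, D)
and the 5-byte record layout are assumptions made up for this example;
they are not the actual TSC format, nor the EnsEMBL cigar encoding.

```python
import struct

# Hypothetical record layout: 1-byte op code + 4-byte big-endian length.
RECORD = ">cI"


def encode_gaps(ops):
    """Pack a list of (op, length) pairs into a bytes object."""
    return b"".join(
        struct.pack(RECORD, op.encode("ascii"), length) for op, length in ops
    )


def decode_gaps(blob):
    """Unpack bytes produced by encode_gaps back into (op, length) pairs."""
    return [
        (op.decode("ascii"), length)
        for op, length in struct.iter_unpack(RECORD, blob)
    ]


def to_cigar(ops):
    """Render the same information as a cigar-like string."""
    return "".join(f"{length}{op}" for op, length in ops)


# Example: 120 aligned columns, a 3-base insertion, 45 aligned columns,
# a 2-base deletion, then 30 aligned columns.
example_ops = [("M", 120), ("I", 3), ("M", 45), ("D", 2), ("M", 30)]
example_blob = encode_gaps(example_ops)
```

The overall start/end of the alignment would live in ordinary columns,
so range queries never need to decode the BLOB.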

Lincoln

Lincoln Stein writes:
 > The AceDB model for handling gapped alignments might be worth looking
 > at.  There is a single Homol_data object to represent the overall
 > alignment between two sequences.  The Homol_data object contains the
 > details of the alignments of the gapped HSPs.  How similar is this to
 > what you're talking about?
 > 
 > Lincoln
 > 
 > Chris Mungall writes:
 >  > 
 >  > 
 >  > On Wed, 1 May 2002, Thomas Down wrote:
 >  > 
 >  > > On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
 >  > > >
 >  > > > > [Alignment representations]
 >  > > >
 >  > > > Let's outline them, using a shortened tuple notation:
 >  > > >
 >  > > > 1) feature pair links two seqfeatures
 >  > > >
 >  > > > # maps cleanly to the bioperl object model
 >  > > > # join-heavy
 >  > > > # lots of rows instantiated
 >  > >
 >  > > # Maps pretty nicely to BioJava, too.
 >  > >
 >  > > I must admit I like this.  Particularly because it uses seqfeatures,
 >  > > and we're already storing tag-value stuff on seqfeatures.  This
 >  > > is the natural place to put all kinds of alignment-type-dependent
 >  > > information like reading frame.
 >  > 
 >  > except we may want to attach the overall score for a group of HSPs at the
 >  > featurepair level, which breaks the everything-on-a-seqfeature model
 >  > 
 >  > > Have you thought about extending this to allow n:n relationships,
 >  > > for modelling alignments of >2 sequences?
 >  > 
 >  > this is similar to (2), below
 >  > 
 >  > > > 2) use the existing sf_relationship table to make feature pairs
 >  > >
 >  > > I think we should look into this more closely.  I'm not
 >  > > altogether convinced that you /do/ need as many different
 >  > > features as you're suggesting.  For the simple case of
 >  > > pairwise similarity, you could just have two features,
 >  > > each linked by a single homologous_to line in the
 >  > > seqfeature_relationship table.  But this doesn't scale
 >  > > well for >2 sequences.
 >  > 
 >  > so this is for ungapped pairwise alignments only?
 >  > 
 >  > also you'd lose the ability to treat pairwise and multiple
 >  > alignments the same way so you'd be as well having separate tables.
 >  > 
 >  > m sequences, each with n HSPs/ungapped blocks. using a 3 level hierarchy
 >  > (hit, HSP/block, extent), each alignment/hit will require:
 >  > 
 >  > bioentry:	m
 >  > seqfeature:	mn + n + 1	= n(m + 1) + 1
 >  > sf_location:	mn + n + 1	= n(m + 1) + 1
 >  > sf_rel:		mn + n		= n(m + 1)
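
The counts above can be checked with a small sketch.  It assumes one hit
feature per alignment, n block features under it, and one extent feature
per block per sequence (mn in total), with exactly one location per
feature and one relationship row per child-to-parent link.  The function
name and dictionary keys are only illustrative.

```python
def rows_per_alignment(m, n):
    """Rows needed for one alignment of m sequences with n ungapped blocks,
    using a hit -> block -> extent hierarchy of seqfeatures."""
    hits = 1
    blocks = n
    extents = m * n                       # one extent per block per sequence
    features = extents + blocks + hits    # mn + n + 1
    relationships = extents + blocks      # each non-root feature links to its parent
    return {
        "bioentry": m,
        "seqfeature": features,
        "sf_location": features,          # assumes one location per feature
        "sf_rel": relationships,
    }


# A pairwise hit (m = 2) with three HSPs (n = 3).
pairwise = rows_per_alignment(2, 3)
```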
 >  > 
 >  > > > 3) similaritypair links two biosequences
 >  > >
 >  > > Argh, pleasepleaseplease no.  See below for my reasoning on
 >  > > this.
 >  > >
 >  > > > 4) similaritypair links two bioentries
 >  > > >
 >  > > > simpair(simpair_id, bioentry1_id, start1, end1,
 >  > > >                     strand1, frame1,
 >  > > >                     bioentry2_id, start2, end2,
 >  > > >                     strand2, frame2,
 >  > > >                     e-value,
 >  > > >                     score)
 >  > > > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
 >  > > >                         qualifier_value)
 >  > > >
 >  > > > similar to (3). has the advantage over 3 that a lot of the time you don't
 >  > > > need to join bioentry/biosequence, as a lot of the time the information
 >  > > > you need (ie display_id/accession) is in bioentry. Even so, a lot of times
 >  > > > you will need to make that extra join.
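
As a rough sketch of option (4), the tuple above could be tried out in an
in-memory SQLite database.  The column types, the reduced bioentry table,
the spelling evalue (a hyphen is not valid in an unquoted column name) and
the accessions QUERY1/SUBJ1 are all assumptions for the example, not part
of the BioSQL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER, strand1 INTEGER, frame1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER, strand2 INTEGER, frame2 INTEGER,
    evalue REAL,
    score  REAL
);
""")

# Placeholder entries; frame columns stay NULL for a nucleotide alignment.
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?)", [(1, "QUERY1"), (2, "SUBJ1")]
)
conn.execute(
    "INSERT INTO simpair VALUES "
    "(1, 1, 10, 110, 1, NULL, 2, 500, 600, -1, NULL, 1e-30, 182.0)"
)

# Accessions come straight from bioentry; biosequence is never joined.
rows = conn.execute("""
    SELECT q.accession, s.accession,
           sp.start1, sp.end1, sp.start2, sp.end2, sp.score
    FROM simpair sp
    JOIN bioentry q ON q.bioentry_id = sp.bioentry1_id
    JOIN bioentry s ON s.bioentry_id = sp.bioentry2_id
""").fetchall()
```

Each HSP is a single row here, which is where the speed advantage over
the seqfeature-based options comes from.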
 >  > >
 >  > > Note that this doesn't scale at all to alignments of >2 sequences,
 >  > > although it's not too hard to come up with a similar scheme
 >  > > which does.
 >  > >
 >  > > My slight concern about this is that it's got lots of fields
 >  > > where the values may be potentially doubtful.  For example,
 >  > > most pairwise alignments don't /really/ have a strand on both query
 >  > > and target -- there's just one bit of information, to say whether
 >  > > the sequences are parallel or antiparallel.  Similarly, I'm suspicious
 >  > > of frame.  Unless I'm missing something, this is only meaningful
 >  > > in Protein <--> nucleic acid alignments, and even then there's only
 >  > > one frame (on the nucleic acid side).
 >  > 
 >  > and tblastx (2 frames)
 >  > 
 >  > it was just a suggested set of nullable columns, i think it should contain
 >  > either all of the generally useful ones (including frame) OR everything
 >  > should go over into qualifier_value
 >  > 
 >  > > These objections aside, though, it's certainly quite implementable.
 >  > >
 >  > > > I think my preference is for (4), it's fastest. It doesn't map directly to
 >  > > > the bioperl object model but the manual mapping in the adapters isn't too
 >  > > > hard.
 >  > > >
 >  > > > Note I'm only mentioning the bioperl object model to provoke others, eg
 >  > > > Thomas to chime in.
 >  > > >
 >  > > > I have to say, I'm still confused by the need for a bioentry/biosequence
 >  > > > split. I can't see any situations in which you would want to store a tuple
 >  > > > of one without the other.
 >  > >
 >  > > I don't know exactly what the original rationale was for this
 >  > > split.  However, it does provide the possibility of an extremely
 >  > > powerful polymorphism.  I think of bioentry as an abstract base
 >  > > class for annotated sequences (yes, yes, I know the relational model
 >  > > isn't anything like object orientation -- but in this particular
 >  > > case the analogy works well).  A biosequence is one concrete
 >  > > subclass, which specifies storage of the sequence data in a single
 >  > > `text' column of the database.  Other `subclasses' could include
 >  > >
 >  > >   - assembly
 >  > >
 >  > >   - shredded sequence (for efficient storage of large sequences
 >  > >     in cases where you don't know/want to bother with the assembly)
 >  > 
 >  > another example could be PFAM type 'subclasses', e.g. sequence models
 >  > 
 >  > this would allow you to store pfam/interpro hits on peptide sequences
 >  > using whatever relations we decide on for storing alignments
 >  > 
 >  > ok, the bioentry-as-base-class explanation works for me (not necessarily
 >  > abstract)
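
The base-class analogy can be sketched in a few lines.  This is only an
object-model illustration under assumed names (BioEntry, BioSequence,
ShreddedSequence, subsequence, shred); it does not describe any existing
BioSQL, bioperl or BioJava API.  Coordinates are 1-based and inclusive.

```python
class BioEntry:
    """Base 'class': everything annotation attaches to."""

    def __init__(self, accession):
        self.accession = accession

    def subsequence(self, start, end):
        raise NotImplementedError("this entry type stores no residues")


class BioSequence(BioEntry):
    """Sequence held in a single text value."""

    def __init__(self, accession, seq_str):
        super().__init__(accession)
        self.seq_str = seq_str

    def subsequence(self, start, end):
        return self.seq_str[start - 1:end]


class ShreddedSequence(BioEntry):
    """Sequence held as fixed-size chunks, fetched only as needed."""

    def __init__(self, accession, chunks, chunk_size):
        super().__init__(accession)
        self.chunks = chunks
        self.chunk_size = chunk_size

    def subsequence(self, start, end):
        first = (start - 1) // self.chunk_size
        last = (end - 1) // self.chunk_size
        joined = "".join(self.chunks[first:last + 1])
        offset = first * self.chunk_size
        return joined[start - 1 - offset:end - offset]


def shred(seq_str, chunk_size):
    """Split a sequence string into chunk_size pieces."""
    return [seq_str[i:i + chunk_size] for i in range(0, len(seq_str), chunk_size)]


whole = BioSequence("X1", "ACGTACGTAC")
shredded = ShreddedSequence("X1", shred("ACGTACGTAC", 4), 4)
```

Code that holds a reference to the base type can ask either object for a
region without knowing how the residues are stored.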
 >  > 
 >  > this is essentially how i do it in GadFly, except everything is collapsed
 >  > into the 'seq' table, including shredded sequence, assembly and even
 >  > interpro "virtual seqs"
 >  > 
 >  > It is elegant, but I do worry about the performance hit. In
 >  > collapsing the bioentry table with its 'subclass' tables, you have the
 >  > disadvantage of the nulled biosequence_str column sitting around doing
 >  > nothing. But other than that it seems simpler and faster.
 >  > 
 >  > Ok, related question. Thomas, when you store an ensembl gene object in
 >  > bioSQL, do you instantiate biosequence/bioentry rows for transcript and
 >  > translation objects? If so, how are they linked?
 >  > 
 >  > > The suggested assembly schema which I came up with a while back
 >  > > works in exactly this way.  The code for implementing it turned
 >  > > out very simple and clean.
 >  > >
 >  > > But for this to work, it does mean that all the annotation needs
 >  > > to stay joined onto bioentry.  Hence my `argh' when you suggested
 >  > > joining simpair to biosequence.
 >  > 
 >  > sure, i see. it's just my instinct to treat the biosequence table as you
 >  > treat bioentry
 >  > 
 >  > > > Thanks for being the first non-genbank-roundtrip use-case biosql beta
 >  > > > tester!
 >  > >
 >  > > So my Ensembl-in-BioSQL doesn't count? ;-)
 >  > 
 >  > Oops, forgot about that!
 >  > 
 >  > >     Thomas.
 >  > >
 > 
 > -- 
 > ========================================================================
 > Lincoln D. Stein                           Cold Spring Harbor Laboratory
 > lstein@cshl.org			                  Cold Spring Harbor, NY
 > Positions available at my lab: see http://stein.cshl.org/#hire
 > ========================================================================
