[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings
to assemblies
Lincoln Stein
lstein@cshl.org
Thu, 2 May 2002 08:11:31 -0400
The AceDB model for handling gapped alignments might be worth looking
at. There is a single Homol_data object to represent the overall
alignment between two sequences. The Homol_data object contains the
details of the alignments of the gapped HSPs. How similar is this to
what you're talking about?
Lincoln
Chris Mungall writes:
>
>
> On Wed, 1 May 2002, Thomas Down wrote:
>
> > On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
> > >
> > > > [Alignment representations]
> > >
> > > Let's outline them, using a shortened tuple notation:
> > >
> > > 1) feature pair links two seqfeatures
> > >
> > > # maps cleanly to the bioperl object model
> > > # join-heavy
> > > # lots of rows instantiated
> >
> > # Maps pretty nicely to BioJava, too.
> >
> > I must admit I like this. Particularly because it uses seqfeatures,
> > and we're already storing tag-value stuff on seqfeatures. This
> > is the natural place to put all kinds of alignment-type-dependent
> > information like reading frame.
>
> except we may want to attach the overall score for a group of HSPs at the
> featurepair level, which breaks the everything-on-a-seqfeature model
>
> > Have you thought about extending this to allow n:n relationships,
> > for modelling alignments of >2 sequences?
>
> this is similar to (2), below
>
> > > 2) use the existing sf_relationship table to make feature pairs
> >
> > I think we should look into this more closely. I'm not
> > altogether convinced that you /do/ need as many different
> > features as you're suggesting. For the simple case of
> > pairwise similarity, you could just have two features,
> > linked by a single homologous_to line in the
> > seqfeature_relationship table. But this doesn't scale
> > well for >2 sequences.
>
> so this is for ungapped pairwise alignments only?
>
> also you'd lose the ability to treat pairwise and multiple
> alignments the same way, so you might as well have separate tables.
>
> m sequences, each with n HSPs/ungapped blocks. using a 3 level hierarchy
> (hit, HSP/block, extent), each alignment/hit will require:
>
> bioentry: m
> seqfeature: mn + n + 1 = 2nm + 1
> sf_location: mn + n + 1 = 2nm + 1
> sf_rel: mn + n = 2nm
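
The row counts above can be sketched as a small calculation. This is only an illustration under one reading of the 3-level hierarchy: one hit feature, n HSP/block features shared by all sequences, and one extent per sequence per HSP/block. Under that assumption the totals match the mn + n + 1 form; the 2nm + 1 form agrees with it only when m = 1. The function name is hypothetical.

```python
def rows_per_hit(m: int, n: int) -> dict:
    """Rows needed for one alignment/hit over m sequences with n HSPs/blocks."""
    hit_features = 1
    hsp_features = n           # assumption: one HSP/block feature shared by all m sequences
    extent_features = m * n    # one extent per sequence per HSP/block
    seqfeature = hit_features + hsp_features + extent_features
    return {
        "bioentry": m,
        "seqfeature": seqfeature,                   # mn + n + 1
        "sf_location": seqfeature,                  # one location per feature
        "sf_rel": hsp_features + extent_features,   # each non-root feature links to its parent
    }


# Pairwise alignment (m = 2) with three HSPs
counts = rows_per_hit(2, 3)
print(counts)
```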
>
> > > 3) similaritypair links two biosequences
> >
> > Argh, pleasepleaseplease no. See below for my reasoning on
> > this.
> >
> > > 4) similaritypair links two bioentries
> > >
> > > simpair(simpair_id, bioentry1_id, start1, end1,
> > > strand1, frame1,
> > > bioentry2_id, start2, end2,
> > > strand2, frame2,
> > > e-value,
> > > score)
> > > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
> > > qualifier_value)
> > >
> > > similar to (3), but with the advantage that you often don't need to
> > > join bioentry/biosequence, because the information you need (i.e.
> > > display_id/accession) is already in bioentry. Even so, you will often
> > > need to make that extra join.
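
To make option (4) concrete, here is a minimal sketch of the simpair table using Python's built-in sqlite3 module. The column types, the cut-down bioentry table, the spelling `evalue` (since `e-value` is not a plain SQL identifier) and the sample accessions are all assumptions for illustration, not the BioSQL schema itself. The query shows that the accession is available by joining to bioentry alone, without touching biosequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (              -- cut-down, hypothetical version
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    display_id  TEXT
);
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER, strand1 INTEGER, frame1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER, strand2 INTEGER, frame2 INTEGER,
    evalue REAL,                     -- nullable, like strand and frame
    score  REAL
);
""")

# Made-up sample data: two entries and one similarity between them.
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?, ?)",
    [(1, "ACC_QUERY", "query"), (2, "ACC_TARGET", "target")],
)
conn.execute(
    "INSERT INTO simpair (bioentry1_id, start1, end1, strand1,"
    " bioentry2_id, start2, end2, strand2, evalue, score)"
    " VALUES (1, 10, 110, 1, 2, 500, 600, -1, 1e-20, 95.0)"
)

# Accessions come from bioentry; no join to biosequence is needed.
rows = conn.execute("""
    SELECT b1.accession, s.start1, s.end1,
           b2.accession, s.start2, s.end2, s.score
    FROM simpair s
    JOIN bioentry b1 ON b1.bioentry_id = s.bioentry1_id
    JOIN bioentry b2 ON b2.bioentry_id = s.bioentry2_id
""").fetchall()
print(rows)
```

The frame columns are simply left NULL here, which is the "nullable columns" approach rather than moving everything into simpair_qualifier_value.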
> >
> > Note that this doesn't scale at all to alignments of >2 sequences,
> > although it's not too hard to come up with a similar scheme
> > which does.
> >
> > My slight concern about this is that it's got lots of fields
> > where the values may be potentially doubtful. For example,
> > most pairwise alignments don't /really/ have a strand on both query
> > and target -- there's just one bit of information, to say whether
> > the sequences are parallel or antiparallel. Similarly, I'm suspicious
> > of frame. Unless I'm missing something, this is only meaningful
> > in Protein <--> nucleic acid alignments, and even then there's only
> > one frame (on the nucleic acid side).
>
> and tblastx (2 frames)
>
> it was just a suggested set of nullable columns, i think it should contain
> either all of the generally useful ones (including frame) OR everything
> should go over into qualifier_value
>
> > These objections aside, though, it's certainly quite implementable.
> >
> > > I think my preference is for (4), it's fastest. It doesn't map directly to
> > > the bioperl object model but the manual mapping in the adapters isn't too
> > > hard.
> > >
> > > Note I'm only mentioning the bioperl object model to provoke others, eg
> > > Thomas to chime in.
> > >
> > > I have to say, I'm still confused by the need for a bioentry/biosequence
> > > split. I can't see any situations in which you would want to store a tuple
> > > of one without the other.
> >
> > I don't know exactly what the original rationale was for this
> > split. However, it does provide the possibility of an extremely
> > powerful polymorphism. I think of bioentry as an abstract base
> > class for annotated sequences (yes, yes, I know the relational model
> > isn't anything like object orientation -- but in this particular
> > case the analogy works well). A biosequence is one concrete
> > subclass, which specifies storage of the sequence data in a single
> > `text' column of the database. Other `subclasses' could include
> >
> > - assembly
> >
> > - shredded sequence (for efficient storage of large sequences
> > in cases where you don't know/want to bother with the assembly)
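
The bioentry-as-base-class analogy can be sketched in object terms. This is only an illustration of the idea: the class names, the `residues()` method and the simplification that an assembly is an ordered list of components with no gaps or overlaps are assumptions, not part of any actual schema or toolkit. The point is that annotation hangs off the base class, so it works the same way whichever storage "subclass" is used.

```python
from dataclasses import dataclass, field


@dataclass
class BioEntry:
    """Base class: identity plus annotation, independent of sequence storage."""
    accession: str
    features: list = field(default_factory=list)

    def residues(self) -> str:
        raise NotImplementedError


@dataclass
class Biosequence(BioEntry):
    """Sequence stored whole in a single text field."""
    seq: str = ""

    def residues(self) -> str:
        return self.seq


@dataclass
class ShreddedSequence(BioEntry):
    """Large sequence stored as consecutive chunks."""
    chunks: list = field(default_factory=list)

    def residues(self) -> str:
        return "".join(self.chunks)


@dataclass
class Assembly(BioEntry):
    """Sequence defined by other entries (simplified: ordered, no gaps or overlaps)."""
    components: list = field(default_factory=list)

    def residues(self) -> str:
        return "".join(c.residues() for c in self.components)


contig_a = Biosequence("A", seq="ACGT")
contig_b = ShreddedSequence("B", chunks=["GG", "CC"])
asm = Assembly("ASM", components=[contig_a, contig_b])

# Annotation attaches to the base class, whatever the storage.
asm.features.append("similarity hit")
print(asm.residues())
```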
>
> another example could be PFAM type 'subclasses', e.g. sequence models
>
> this would allow you to store pfam/interpro hits on peptide sequences
> using whatever relations we decide on for storing alignments
>
> ok, the bioentry-as-base-class explanation works for me (not necessarily
> abstract)
>
> this is essentially how i do it in GadFly, except everything is collapsed
> into the 'seq' table, including shredded sequence, assembly and even
> interpro "virtual seqs"
>
> It is elegant, but I do worry about the performance hit. In
> collapsing the bioentry table with its 'subclass' tables, you have the
> disadvantage of the nulled biosequence_str column sitting around doing
> nothing. But other than that it seems simpler and faster.
>
> Ok, related question. Thomas, when you store an ensembl gene object in
> bioSQL, do you instantiate biosequence/bioentry rows for transcript and
> translation objects? if so, how are they linked?
>
> > The suggested assembly schema which I came up with a while back
> > works in exactly this way. The code for implementing it turned
> > out very simple and clean.
> >
> > But for this to work, it does mean that all the annotation needs
> > to stay joined onto bioentry. Hence my `argh' when you suggested
> > joining simpair to biosequence.
>
> sure, i see. it's just my instinct to treat the biosequence table as you
> treat bioentry
>
> > > Thanks for being the first non-genbank-roundtrip use-case biosql beta
> > > tester!
> >
> > So my Ensembl-in-BioSQL doesn't count? ;-)
>
> Oops, forgot about that!
>
> > Thomas.
> >
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
Positions available at my lab: see http://stein.cshl.org/#hire
========================================================================