[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings
to assemblies
Lincoln Stein
lstein@cshl.org
Thu, 2 May 2002 08:41:46 -0400
Another thing to add is that the SNP-finding code, which we use for
the TSC project, records the boundaries of the overall alignment and
then encodes the details of the gaps into a compact binary structure
that is stored as a BLOB. This might be similar in spirit to the
EnsEMBL "cigars" but I haven't looked at that code.
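
To make the idea concrete, here is a minimal Python sketch of packing gap details into a byte string that could be stored as a BLOB. This is not the TSC code: the run-length layout, the op codes and the names pack_gaps, unpack_gaps and aligned_span are all assumptions made up for illustration. The overall alignment boundaries would stay in ordinary columns; only the gap details go into the BLOB.

```python
import struct

# Hypothetical op codes (an assumption, not the TSC format):
# M = aligned block, I = bases present only in the query,
# D = bases present only in the target.
OPS = "MID"

def pack_gaps(runs):
    """Pack [(op, length), ...] as 1-byte op code + 4-byte big-endian length per run."""
    return b"".join(struct.pack(">BI", OPS.index(op), length) for op, length in runs)

def unpack_gaps(blob):
    """Inverse of pack_gaps: decode the bytes back into [(op, length), ...]."""
    return [(OPS[code], length) for code, length in struct.iter_unpack(">BI", blob)]

def aligned_span(runs):
    """Return (query_length, target_length) covered by the runs."""
    query_len = sum(n for op, n in runs if op in "MI")
    target_len = sum(n for op, n in runs if op in "MD")
    return query_len, target_len

# Made-up example alignment: three aligned blocks separated by two gaps.
runs = [("M", 120), ("I", 3), ("M", 45), ("D", 1), ("M", 80)]
blob = pack_gaps(runs)
```

Each run costs five bytes here, so the example above packs into 25 bytes. A real format could be tighter (variable-length integers, for instance), but the round trip is the point.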
Lincoln
Lincoln Stein writes:
> The AceDB model for handling gapped alignments might be worth looking
> at. There is a single Homol_data object to represent the overall
> alignment between two sequences. The Homol_data object contains the
> details of the alignments of the gapped HSPs. How similar is this to
> what you're talking about?
>
> Lincoln
>
> Chris Mungall writes:
> >
> >
> > On Wed, 1 May 2002, Thomas Down wrote:
> >
> > > On Wed, May 01, 2002 at 01:38:21PM -0700, Chris Mungall wrote:
> > > >
> > > > > [Alignment representations]
> > > >
> > > > Let's outline them, using a shortened tuple notation:
> > > >
> > > > 1) feature pair links two seqfeatures
> > > >
> > > > # maps cleanly to the bioperl object model
> > > > # join-heavy
> > > > # lots of rows instantiated
> > >
> > > # Maps pretty nicely to BioJava, too.
> > >
> > > I must admit I like this. Particularly because it uses seqfeatures,
> > > and we're already storing tag-value stuff on seqfeatures. This
> > > is the natural place to put all kinds of alignment-type-dependent
> > > information like reading frame.
> >
> > except we may want to attach the overall score for a group of HSPs at the
> > featurepair level, which breaks the everything-on-a-seqfeature model
> >
> > > Have you thought about extending this to allow n:n relationships,
> > > for modelling alignments of >2 sequences?
> >
> > this is similar to (2), below
> >
> > > > 2) use the existing sf_relationship table to make feature pairs
> > >
> > > I think we should look into this more closely. I'm not
> > > altogether convinced that you /do/ need as many different
> > > features as you're suggesting. For the simple case of
> > > pairwise similarity, you could just have two features,
> > > each linked by a single homologous_to line in the
> > > seqfeature_relationship table. But this doesn't scale
> > > well for >2 sequences.
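
The simple pairwise case just described can be sketched with SQLite from Python. This is a rough illustration only: the column names are assumptions, and the locations are folded into the seqfeature table for brevity rather than kept in a separate location table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified, hypothetical tables (not the real BioSQL definitions).
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL,
    seq_start     INTEGER NOT NULL,
    seq_end       INTEGER NOT NULL
);
CREATE TABLE seqfeature_relationship (
    parent_seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    child_seqfeature_id  INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    relationship_type    TEXT NOT NULL
);
""")

# One feature on each of two sequences (made-up ids and coordinates).
conn.executemany(
    "INSERT INTO seqfeature VALUES (?, ?, ?, ?)",
    [(1, 10, 100, 250), (2, 20, 5, 155)],
)
# A single relationship row ties the two features together.
conn.execute("INSERT INTO seqfeature_relationship VALUES (1, 2, 'homologous_to')")

# Recovering the pair takes two joins back onto seqfeature.
pairs = conn.execute("""
    SELECT a.bioentry_id, a.seq_start, a.seq_end,
           b.bioentry_id, b.seq_start, b.seq_end
    FROM seqfeature_relationship r
    JOIN seqfeature a ON a.seqfeature_id = r.parent_seqfeature_id
    JOIN seqfeature b ON b.seqfeature_id = r.child_seqfeature_id
    WHERE r.relationship_type = 'homologous_to'
""").fetchall()
```

Two feature rows and one relationship row per ungapped pair is cheap; the cost shows up in the joins needed to read a pair back.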
> >
> > so this is for ungapped pairwise alignments only?
> >
> > also you'd lose the ability to treat pairwise and multiple
> > alignments the same way so you'd be as well having separate tables.
> >
> > m sequences, each with n HSPs/ungapped blocks. using a 3 level hierarchy
> > (hit, HSP/block, extent), each alignment/hit will require:
> >
> > bioentry: m
> > seqfeature: mn + n + 1 = 2nm + 1
> > sf_location: mn + n + 1 = 2nm + 1
> > sf_rel: mn + n = 2nm
> >
> > > > 3) similaritypair links two biosequences
> > >
> > > Argh, pleasepleaseplease no. See below for my reasoning on
> > > this.
> > >
> > > > 4) similaritypair links two bioentries
> > > >
> > > > simpair(simpair_id, bioentry1_id, start1, end1,
> > > > strand1, frame1,
> > > > bioentry2_id, start2, end2,
> > > > strand2, frame2,
> > > > e-value,
> > > > score)
> > > > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
> > > > qualifier_value)
> > > >
> > > > similar to (3). has the advantage over 3 that a lot of the time you don't
> > > > need to join bioentry/biosequence, as a lot of the time the information
> > > > you need (ie display_id/accession) is in bioentry. Even so, a lot of times
> > > > you will need to make that extra join.
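
As a rough sketch of option (4), the simpair tuple can be tried out with SQLite from Python. Assumptions are labeled in the comments: the column types, the stripped-down bioentry table, the spelling evalue (e-value is not a plain SQL identifier) and the accessions, which are made up. The simpair_qualifier_value table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stripped-down stand-in for bioentry (assumption: only the columns needed here).
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
-- simpair as outlined in the tuple notation; strand and frame are nullable.
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER, strand1 INTEGER, frame1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER, strand2 INTEGER, frame2 INTEGER,
    evalue REAL,
    score  REAL
);
""")

# Made-up entries: a nucleotide record and a protein record.
conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                 [(1, "AC000001"), (2, "P00001")])
# A translated hit: strand and frame on the nucleotide side, NULL on the protein side.
conn.execute(
    "INSERT INTO simpair VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, 1, 100, 399, 1, 2, 2, 1, 100, None, None, 1e-30, 210.0),
)

# Accessions come straight from bioentry, with no join to biosequence.
rows = conn.execute("""
    SELECT q.accession, s.start1, s.end1, t.accession, s.start2, s.end2, s.score
    FROM simpair s
    JOIN bioentry q ON q.bioentry_id = s.bioentry1_id
    JOIN bioentry t ON t.bioentry_id = s.bioentry2_id
""").fetchall()
```

The whole pair sits in one row, so the only joins are the two lookups onto bioentry for the accessions.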
> > >
> > > Note that this doesn't scale at all to alignments of >2 sequences,
> > > although it's not too hard to come up with a similar scheme
> > > which does.
> > >
> > > My slight concern about this is that it's got lots of fields
> > > where the values may be potentially doubtful. For example,
> > > most pairwise alignments don't /really/ have a strand on both query
> > > and target -- there's just one bit of information, to say whether
> > > the sequences are parallel or antiparallel. Similarly, I'm suspicious
> > > of frame. Unless I'm missing something, this is only meaningful
> > > in Protein <--> nucleic acid alignments, and even then there's only
> > > one frame (on the nucleic acid side).
> >
> > and tblastx (2 frames)
> >
> > it was just a suggested set of nullable columns, i think it should contain
> > either all of the generally useful ones (including frame) OR everything
> > should go over into qualifier_value
> >
> > > These objections aside, though, it's certainly quite implementable.
> > >
> > > > I think my preference is for (4), it's fastest. It doesn't map directly to
> > > > the bioperl object model but the manual mapping in the adapters isn't too
> > > > hard.
> > > >
> > > > Note I'm only mentioning the bioperl object model to provoke others, eg
> > > > Thomas to chime in.
> > > >
> > > > I have to say, I'm still confused by the need for a bioentry/biosequence
> > > > split. I can't see any situations in which you would want to store a tuple
> > > > of one without the other.
> > >
> > > I don't know exactly what the original rationale was for this
> > > split. However, it does provide the possibility of an extremely
> > > powerful polymorphism. I think of bioentry as an abstract base
> > > class for annotated sequences (yes, yes, I know the relational model
> > > isn't anything like object orientation -- but in this particular
> > > case the analogy works well). A biosequence is one concrete
> > > subclass, which specifies storage of the sequence data in a single
> > > `text' column of the database. Other `subclasses' could include
> > >
> > > - assembly
> > >
> > > - shredded sequence (for efficient storage of large sequences
> > > in cases where you don't know/want to bother with the assembly)
> >
> > another example could be PFAM type 'subclasses', e.g. sequence models
> >
> > this would allow you to store pfam/interpro hits on peptide sequences
> > using whatever relations we decide on for storing alignments
> >
> > ok, the bioentry-as-base-class explanation works for me (not necessarily
> > abstract)
> >
> > this is essentially how i do it in GadFly, except everything is collapsed
> > into the 'seq' table, including shredded sequence, assembly and even
> > interpro "virtual seqs"
> >
> > It is elegant, but I do worry about the performance hit. In
> > collapsing the bioentry table with its 'subclass' tables, you have the
> > disadvantage of the nulled biosequence_str column sitting around doing
> > nothing. But other than that it seems simpler and faster.
> >
> > Ok, related question. Thomas, when you store an ensembl gene object in
> > bioSQL, do you instantiate biosequence/bioentry rows for transcript and
> > translation objects? if so, how are they linked?
> >
> > > The suggested assembly schema which I came up with a while back
> > > works in exactly this way. The code for implementing it turned
> > > out very simple and clean.
> > >
> > > But for this to work, it does mean that all the annotation needs
> > > to stay joined onto bioentry. Hence my `argh' when you suggested
> > > joining simpair to biosequence.
> >
> > sure, i see. it's just my instinct to treat the biosequence table as you
> > treat bioentry
> >
> > > > Thanks for being the first non-genbank-roundtrip use-case biosql beta
> > > > tester!
> > >
> > > So my Ensembl-in-BioSQL doesn't count? ;-)
> >
> > Oops, forgot about that!
> >
> > > Thomas.
> > >
> >
>
> --
> ========================================================================
> Lincoln D. Stein Cold Spring Harbor Laboratory
> lstein@cshl.org Cold Spring Harbor, NY
> Positions available at my lab: see http://stein.cshl.org/#hire
> ========================================================================