[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappingsto assemblies

Chris Mungall cjm@bdgp.lbl.gov
Wed, 1 May 2002 16:28:04 -0700 (PDT)


On Wed, 1 May 2002, Hilmar Lapp wrote:

>
>
> > -----Original Message-----
> > From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> > Sent: Wednesday, May 01, 2002 1:38 PM
> > To: Hilmar Lapp
> > Cc: GMOD Devel (E-mail); OBDA BioSQL (E-mail)
> > Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes &
> > features&mappingsto assemblies
> >
> [...]
> >
> > Let's outline them, using a shortened tuple notation:
> >
> > 1) feature pair links two seqfeatures
> >
> > sfpair(sfpair_id, seqfeature1_id, seqfeature2_id, e-value, score)
> > sfpair_qualifier_value(sfpair_id, ontology_term_id, qualifier_rank,
> >                        qualifier_value)
> >
> > # maps cleanly to the bioperl object model
> > # join-heavy
> > # lots of rows instantiated
> >
> > 2) use the existing sf_relationship table to make feature pairs
> >
> > this would be *very* join heavy, and would require lots of rows just
> > to represent, say, one blast hit - it would be a 3 level hierarchy,
> > i.e. HitFeature <-> HSP <-> query/subject feature
> >
> > this does have the advantage of being neutral with respect to the
> > directionality of the blast/analysis (which sequence is query and
> > which is subject), and also cleanly extends to >2 sequences.
> >
> > but overall i think it would be too awkward / slow
> >
> > 3) similaritypair links two biosequences
> >
> > simpair(simpair_id, biosequence1_id, start1, end1,
> >                     strand1, frame1,
> >                     biosequence2_id, start2, end2,
> >                     strand2, frame2,
> >                     e-value,
> >                     score)
> > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
> >                         qualifier_value)
> >
> > advantages - fairly lean in terms of rows instantiated, which is good,
> > and fewer joins.
> >
> > you still have to instantiate bioentries unless you're fine with
> > completely anonymous sequences (and what's the point in that), which
> > means extra joins in retrieval.
> >
> > 4) similaritypair links two bioentries
> >
> > simpair(simpair_id, bioentry1_id, start1, end1,
> >                     strand1, frame1,
> >                     bioentry2_id, start2, end2,
> >                     strand2, frame2,
> >                     e-value,
> >                     score)
> > simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
> >                         qualifier_value)
> >
> > similar to (3). has the advantage over (3) that a lot of the time you
> > don't need to join bioentry/biosequence, as a lot of the time the
> > information you need (i.e. display_id/accession) is in bioentry. Even
> > so, a lot of times you will need to make that extra join.
> >
> > I think my preference is for (4), it's fastest. It doesn't map
> > directly to the bioperl object model but the manual mapping in the
> > adapters isn't too hard.
> >
>
> Thanks for this excellent analysis Chris. After my previous email, 3
> things came to my mind in addition:
>
> o How do you represent HSPs versus a hit? Would you split a hit into its
> HSPs and store each HSP as a simpair? If so, would one want to keep
> track of which of the simpairs originally formed the hit, and if so,
> how? (I guess so in order to determine coverage; but that could be
> reconstructed, too, or maybe not necessarily?)

You're right - (3) and (4) above require some way of grouping the
simpairs. I meant to add a seqfeature_id, as per (1).
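
To make the grouping concrete, here is a minimal sketch of option (4) with
the extra seqfeature_id column, written against Python's sqlite3 module
purely for illustration. The column names hit_seqfeature_id and evalue, the
cut-down bioentry and seqfeature tables, and the sample rows are all
assumptions, not an agreed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id)
);
-- Option (4): one row per HSP, linking two bioentries directly.
-- hit_seqfeature_id (hypothetical name) groups the HSPs of one hit.
CREATE TABLE simpair (
    simpair_id        INTEGER PRIMARY KEY,
    hit_seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER, strand1 INTEGER, frame1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER, strand2 INTEGER, frame2 INTEGER,
    evalue REAL,
    score  REAL
);
""")

# Made-up sample data: one hit (seqfeature 10) consisting of two HSPs.
conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                 [(1, "QUERY1"), (2, "SUBJ1")])
conn.execute("INSERT INTO seqfeature VALUES (10, 1)")
conn.executemany(
    "INSERT INTO simpair VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (100, 10, 1, 1, 50, 1, 0, 2, 201, 250, 1, 0, 1e-20, 95.0),
        (101, 10, 1, 80, 120, 1, 0, 2, 300, 340, 1, 0, 1e-10, 60.0),
    ],
)

# Reassemble hits from their HSPs; accessions come straight from bioentry,
# so no join to a biosequence table is needed here.
hsps_per_hit = conn.execute("""
    SELECT sp.hit_seqfeature_id, b1.accession, b2.accession,
           COUNT(*), MAX(sp.score)
    FROM simpair sp
    JOIN bioentry b1 ON b1.bioentry_id = sp.bioentry1_id
    JOIN bioentry b2 ON b2.bioentry_id = sp.bioentry2_id
    GROUP BY sp.hit_seqfeature_id, b1.accession, b2.accession
""").fetchall()
print(hsps_per_hit)
```

Coverage per hit could be computed from the same grouping by merging the
start/end intervals of the grouped rows.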

> o How do you represent multiple alignments (of which a pair really is
> only a special case)? You mentioned that above for option 2), but how
> with option 4)? One could have an alignment entity with
> bioentry-associations to it, but the question is then how to identify an
> alignment entry.

I think multiple alignments and pairwise alignments are sufficiently
different to warrant different tables. But this was more a speed
consideration. It seems there is some support for a single generic table.
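
For comparison, a single generic table might look roughly like the sketch
below: one alignment row plus one member row per aligned sequence, so that a
pairwise hit is just an alignment with two members. The table and column
names (alignment, alignment_member, start_pos, end_pos) and the sample rows
are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alignment (
    alignment_id INTEGER PRIMARY KEY,
    score        REAL
);
-- One row per sequence taking part in an alignment.
CREATE TABLE alignment_member (
    alignment_id INTEGER NOT NULL REFERENCES alignment(alignment_id),
    bioentry_id  INTEGER NOT NULL,
    start_pos INTEGER, end_pos INTEGER, strand INTEGER
);
""")

# Made-up data: alignment 1 is pairwise, alignment 2 has three members.
conn.executemany("INSERT INTO alignment VALUES (?, ?)",
                 [(1, 95.0), (2, 310.0)])
conn.executemany(
    "INSERT INTO alignment_member VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 50, 1), (1, 2, 201, 250, 1),
        (2, 1, 1, 50, 1), (2, 2, 201, 250, 1), (2, 3, 10, 59, -1),
    ],
)

member_counts = conn.execute("""
    SELECT alignment_id, COUNT(*)
    FROM alignment_member
    GROUP BY alignment_id
    ORDER BY alignment_id
""").fetchall()
print(member_counts)
```

The speed concern shows up directly: a pairwise hit costs one alignment row
and two member rows here, versus a single simpair row in (3) or (4).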

> o How do you represent sequence clusters? This is maybe almost the same
> as the one before, especially since frequently you will want to have a
> multiple alignment for the cluster.

Hmm, I'm presuming Elia is on one of these lists; I know he's been
thinking about modeling this sort of thing in bioSQL.

> I actually was inclined towards option 4) too, I only started wondering
> how to best encompass more-than-pairwise alignments with this.
>
>
> > Note I'm only mentioning the bioperl object model to provoke others,
> > e.g. Thomas, to chime in.
> >
> > I have to say, I'm still confused by the need for a
> > bioentry/biosequence split. I can't see any situations in which you
> > would want to store a tuple of one without the other.
>
> You may encounter the wish to have a lightweight bioentry without an
> actual sequence; in fact that's what I'll do because I replaced the
> dbxref with a FK to bioentry in an association table.

Hmmm, I still think it'd be easier with a nullable biosequence_str field.
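
A minimal sketch of what the nullable-field variant could look like, again
using sqlite3 for illustration only. The single merged bioentry table, its
columns other than biosequence_str, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE bioentry (
    bioentry_id     INTEGER PRIMARY KEY,
    accession       TEXT NOT NULL,
    display_id      TEXT,
    biosequence_str TEXT  -- NULL for lightweight entries without sequence
)
""")

# Made-up rows: one entry with sequence, one lightweight (xref-like) entry.
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?, ?, ?)",
    [(1, "ACC1", "full_entry", "ACGTACGT"),
     (2, "ACC2", "xref_only", None)],
)

# Lightweight entries are simply those with no sequence stored.
lightweight = [row[0] for row in conn.execute(
    "SELECT accession FROM bioentry WHERE biosequence_str IS NULL")]

# No join needed to get at the sequence when it is there.
lengths = conn.execute("""
    SELECT accession, LENGTH(biosequence_str)
    FROM bioentry
    ORDER BY bioentry_id
""").fetchall()
print(lightweight, lengths)
```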

I like Thomas's explanation, it makes sense. It'll take a while for me to
get used to though; I think in a very seqfeature+seq+graph centric way.

> Also, for db performance reasons on the table level you should have the
> two separated, because then, depending on the actual low-level storage
> mechanism of the RDBMS, full-table scans on the bioentry table don't
> have to read or seek over the sequences (these are potentially big).
> Although nowadays I'm hearing that most RDBMSs store BLOBs such that
> that doesn't need to bother you.

Yep, the blobs are generally stored separately, so the extra joins will
hit you more than the table scans would.

> >
> > My recommendation to people like Hilmar, who aren't constrained by
> > MySQL, is to code to a simpler set of relations and database
> > procedures. The important thing is to be sure that these relations
> > and functions map simply (at the DBMS level) to the bioSQL core.
> > These views/functions could then form satellite modules around
> > bioSQL, or they could remain a bridge layer within GNF.
>
> What I'm heading for right now is a table design that deviates slightly
> from BioSQL (right now it is stricter), but that I can, as you said,
> straightforwardly map to BioSQL (via views) such that an application
> working off biosql would work off our db as well.
>
> I actually don't know how others view biosql, but generally whether you
> consider it being the supposed low-level implementation or rather an API
> specification does have some impact on the design. With the present,
> very generic, ontology/value design of biosql there is a problem in
> treating it as an API from an application level perspective: there are a
> thousand ways to implement semantics, but a particular application will
> have a hard time understanding all thousand ways. That is, an
> application tagged 'runs off biosql' is not guaranteed at all to actually
> run off any biosql-compliant instance, unless 'biosql-compliant'
> includes a certain defined ontology being adhered to. Defining that
> ontology may not be an easy task, and it could be a controversial one,
> too. But it's probably necessary in order to establish Biosql as an API
> for sequence database / feature browsers.

very good point, I see it as allowing for a layered semantics

for instance, there needs to be some level of agreement on the
seqfeature types. this could be handled by Michael Ashburner's SO ontology
- see
ftp://ftp.geneontology.org/pub/go/gobo/sequence.ontology

different applications could target different layers in the semantic
stack. For instance, a DAS-style viewer wouldn't care about the semantics,
it would just report the seqfeature locations as is.

other applications, e.g. apollo and editorial tools, would require a
minimum level of agreement on certain things, such as what codes and what
doesn't.
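
As a rough sketch of the layering, assuming a cut-down seqfeature table
typed by an ontology_term table (the term_name column, the two helper
functions and the sample terms are all made up for illustration): a
location-only consumer never looks at the types, while a type-aware tool
first checks that the terms it relies on are present in the instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ontology_term (
    ontology_term_id INTEGER PRIMARY KEY,
    term_name        TEXT NOT NULL UNIQUE
);
CREATE TABLE seqfeature (
    seqfeature_id    INTEGER PRIMARY KEY,
    ontology_term_id INTEGER NOT NULL
                     REFERENCES ontology_term(ontology_term_id),
    start_pos INTEGER, end_pos INTEGER, strand INTEGER
);
""")
conn.executemany("INSERT INTO ontology_term VALUES (?, ?)",
                 [(1, "gene"), (2, "exon")])
conn.executemany("INSERT INTO seqfeature VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 100, 900, 1), (2, 2, 100, 300, 1)])


def feature_locations(conn):
    """Bottom layer: report locations as-is, ignoring what the types mean."""
    return conn.execute(
        "SELECT start_pos, end_pos, strand FROM seqfeature "
        "ORDER BY seqfeature_id").fetchall()


def missing_terms(conn, required):
    """Higher layer: list the required term names this instance lacks."""
    present = {name for (name,) in
               conn.execute("SELECT term_name FROM ontology_term")}
    return sorted(set(required) - present)


print(feature_locations(conn))
print(missing_terms(conn, ["gene", "exon", "CDS"]))
```

An editing tool could refuse to start, or fall back to the location-only
view, when missing_terms returns a non-empty list.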

> 	-hilmar
>