[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappingsto assemblies

Wed, 1 May 2002 15:07:53 -0700

> -----Original Message-----
> From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> Sent: Wednesday, May 01, 2002 1:38 PM
> To: Hilmar Lapp
> Cc: GMOD Devel (E-mail); OBDA BioSQL (E-mail)
> Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes &
> features&mappingsto assemblies
> 
[...]
> 
> Let's outline them, using a shortened tuple notation:
> 
> 1) feature pair links two seqfeatures
> 
> sfpair(sfpair_id, seqfeature1_id, seqfeature2_id, e-value, score)
> sfpair_qualifier_value(sfpair_id, ontology_term_id, qualifier_rank,
>                        qualifier_value)
> 
> # maps cleanly to the bioperl object model
> # join-heavy
> # lots of rows instantiated
> 
> 2) use the existing sf_relationship table to make feature pairs
> 
> this would be *very* join heavy, and would require lots of rows just to
> represent, say, one blast hit - it would be a 3 level hierarchy, i.e.
> HitFeature <-> HSP <-> query/subject feature
> 
> this does have the advantage of being neutral with respect to the
> directionality of the blast/analysis (which sequence is query and which
> is subject), and also cleanly extends to >2 sequences.
> 
> but overall i think it would be too awkward / slow
> 
> 3) similaritypair links two biosequences
> 
> simpair(simpair_id, biosequence1_id, start1, end1,
>                     strand1, frame1,
>                     biosequence2_id, start2, end2,
>                     strand2, frame2,
>                     e-value,
>                     score)
> simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
>                         qualifier_value)
> 
> advantages - fairly lean in terms of rows instantiated, which is good,
> less joins.
> 
> you still have to instantiate bioentrys unless you're fine with
> completely anonymous sequences (and what's the point in that), which
> means extra joins in retrieval.
> 
> 4) similaritypair links two bioentries
> 
> simpair(simpair_id, bioentry1_id, start1, end1,
>                     strand1, frame1,
>                     bioentry2_id, start2, end2,
>                     strand2, frame2,
>                     e-value,
>                     score)
> simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
>                         qualifier_value)
> 
> similar to (3). has the advantage over 3 that a lot of the time you
> don't need to join bioentry/biosequence, as a lot of the time the
> information you need (ie display_id/accession) is in bioentry. Even so,
> a lot of times you will need to make that extra join.
> 
> I think my preference is for (4), it's fastest. It doesn't map directly
> to the bioperl object model but the manual mapping in the adapters
> isn't too hard.
> 

Thanks for this excellent analysis, Chris. Since my previous email, three more things have come to mind:

o How do you represent HSPs versus a hit? Would you split a hit into its HSPs and store each HSP as a simpair? If so, would one want to keep track of which simpairs originally formed the hit, and how? (I guess one would, in order to determine coverage; though perhaps that could be reconstructed too, or perhaps not always?)

o How do you represent multiple alignments (of which a pair is really only a special case)? You mentioned this above for option 2), but how would it work with option 4)? One could have an alignment entity with bioentry associations to it, but the question is then how to identify an alignment entry.

o How do you represent sequence clusters? This is almost the same question as the previous one, especially since you will frequently want a multiple alignment for the cluster.
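
On the first question above, one possible arrangement can be sketched with Python's sqlite3 module. Everything in it is an assumption made for illustration: the hit_id grouping column is hypothetical, e-value is spelled evalue so that it is a legal column name, and the rows are invented. It stores each HSP as one simpair row in the shape of option 4) and reconstructs query coverage for a hit by merging the HSP intervals.

```python
import sqlite3

# Hypothetical layout: option 4) simpair plus an assumed hit_id column
# that records which HSP rows originally formed one hit.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE simpair (
        simpair_id   INTEGER PRIMARY KEY,
        hit_id       INTEGER NOT NULL,  -- assumption, not in the proposal
        bioentry1_id INTEGER NOT NULL,
        start1       INTEGER NOT NULL,
        end1         INTEGER NOT NULL,
        bioentry2_id INTEGER NOT NULL,
        start2       INTEGER NOT NULL,
        end2         INTEGER NOT NULL,
        evalue       REAL,
        score        REAL
    )""")

# Invented example data: three HSPs that together form hit 1.
hsps = [
    (1, 1, 100, 1, 100, 200, 11, 110, 1e-30, 180.0),
    (2, 1, 100, 80, 150, 200, 95, 165, 1e-12, 90.0),
    (3, 1, 100, 300, 350, 200, 400, 450, 1e-05, 40.0),
]
conn.executemany("INSERT INTO simpair VALUES (?,?,?,?,?,?,?,?,?,?)", hsps)


def query_coverage(conn, hit_id):
    """Count positions on sequence 1 covered by any HSP of the given hit.

    Coordinates are treated as inclusive; overlapping HSPs count once.
    """
    rows = conn.execute(
        "SELECT start1, end1 FROM simpair WHERE hit_id = ? "
        "ORDER BY start1, end1",
        (hit_id,),
    ).fetchall()
    covered = 0
    cur_start = cur_end = None
    for start, end in rows:
        if cur_end is None or start > cur_end + 1:
            # Gap before this HSP: close the previous merged interval.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


print(query_coverage(conn, 1))
```

Without some grouping key of this kind, the same computation would have to guess which simpair rows belong together, for instance from the pair of bioentry ids, which may not be reliable when the same two sequences produce more than one hit.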

I was actually inclined towards option 4) too; I only started wondering how best to encompass more-than-pairwise alignments with it.
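
The "alignment entity with bioentry associations" idea could look roughly like the following sqlite3 sketch. The table and column names (alignment, alignment_member, member_rank, start_pos, end_pos) are hypothetical and the data is invented. A pairwise hit is simply an alignment with two members, and a cluster with a multiple alignment is one with more. The alignment is identified only by a surrogate key here, so the question of how to identify an alignment entry stays open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL
    );
    -- Hypothetical alignment entity, identified by a surrogate key only.
    CREATE TABLE alignment (
        alignment_id INTEGER PRIMARY KEY
    );
    -- Hypothetical association table: one row per aligned sequence.
    CREATE TABLE alignment_member (
        alignment_id INTEGER NOT NULL REFERENCES alignment(alignment_id),
        bioentry_id  INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
        member_rank  INTEGER NOT NULL,
        start_pos    INTEGER,
        end_pos      INTEGER,
        strand       INTEGER,
        PRIMARY KEY (alignment_id, member_rank)
    );
""")

# Invented example data: alignment 10 is pairwise, alignment 11 has three members.
conn.executemany("INSERT INTO bioentry VALUES (?, ?)",
                 [(1, "ACC1"), (2, "ACC2"), (3, "ACC3")])
conn.executemany("INSERT INTO alignment VALUES (?)", [(10,), (11,)])
conn.executemany(
    "INSERT INTO alignment_member VALUES (?, ?, ?, ?, ?, ?)",
    [
        (10, 1, 1, 1, 100, 1),
        (10, 2, 2, 5, 104, 1),
        (11, 1, 1, 1, 90, 1),
        (11, 2, 2, 3, 92, 1),
        (11, 3, 3, 10, 99, -1),
    ],
)


def members_of(conn, alignment_id):
    """Return the accessions in one alignment, ordered by member_rank."""
    rows = conn.execute(
        "SELECT b.accession FROM alignment_member m "
        "JOIN bioentry b ON b.bioentry_id = m.bioentry_id "
        "WHERE m.alignment_id = ? ORDER BY m.member_rank",
        (alignment_id,),
    ).fetchall()
    return [r[0] for r in rows]


def alignments_with(conn, accession):
    """Return the ids of all alignments that contain the given accession."""
    rows = conn.execute(
        "SELECT DISTINCT m.alignment_id FROM alignment_member m "
        "JOIN bioentry b ON b.bioentry_id = m.bioentry_id "
        "WHERE b.accession = ? ORDER BY m.alignment_id",
        (accession,),
    ).fetchall()
    return [r[0] for r in rows]


print(members_of(conn, 11))
print(alignments_with(conn, "ACC3"))
```

The cost relative to option 4) is one extra row per aligned sequence and a join to get from one member to the others, which is the same trade-off noted for option 2).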

> Note I'm only mentioning the bioperl object model to provoke others,
> eg Thomas to chime in.
> 
> I have to say, I'm still confused by the need for a bioentry/biosequence
> split. I can't see any situations in which you would want to store a
> tuple of one without the other.

You may want a lightweight bioentry without an actual sequence; in fact that's what I'll do, because I replaced the dbxref with an FK to bioentry in an association table.

Also, for db performance at the table level you should keep the two separate: depending on the low-level storage mechanism of the RDBMS, full-table scans on the bioentry table then don't have to read or seek over the sequences (which are potentially big). That said, I'm hearing that most RDBMSs nowadays store BLOBs in such a way that you don't need to worry about this.
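
A minimal sqlite3 sketch of the split, with simplified column names that are assumptions rather than the actual BioSQL DDL (biosequence_str in particular is hypothetical), shows both points: a bioentry can exist without a biosequence row, and a query that needs only accessions never touches the table holding the sequences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, assumed columns; not the actual BioSQL definitions.
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY,
        accession   TEXT NOT NULL,
        display_id  TEXT
    );
    CREATE TABLE biosequence (
        bioentry_id     INTEGER PRIMARY KEY
                        REFERENCES bioentry(bioentry_id),
        biosequence_str TEXT NOT NULL
    );
""")

# Invented data: entry 1 has a sequence, entry 2 is a lightweight bioentry.
conn.executemany("INSERT INTO bioentry VALUES (?, ?, ?)",
                 [(1, "ACC1", "entry_one"), (2, "ACC2", "entry_two")])
conn.execute("INSERT INTO biosequence VALUES (?, ?)", (1, "ACGT" * 1000))


def accessions(conn):
    """List accessions using the bioentry table only."""
    rows = conn.execute(
        "SELECT accession FROM bioentry ORDER BY bioentry_id")
    return [r[0] for r in rows]


def sequence_of(conn, accession):
    """Return the sequence for an accession, or None if none is stored."""
    row = conn.execute(
        "SELECT s.biosequence_str FROM bioentry b "
        "LEFT JOIN biosequence s ON s.bioentry_id = b.bioentry_id "
        "WHERE b.accession = ?",
        (accession,),
    ).fetchone()
    return None if row is None else row[0]


print(accessions(conn))
print(sequence_of(conn, "ACC2"))
```

Whether the separation actually saves I/O on a full-table scan depends on how the particular RDBMS stores large values, as noted above; the sketch only shows the logical layout.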

> 
> My recommendation to people like Hilmar, who aren't constrained by
> MySQL, is to code to a simpler set of relations and database
> procedures. The important thing is to be sure that these relations and
> functions map simply (at the DBMS level) to the bioSQL core. These
> views/functions could then form satellite modules around bioSQL, or
> they could remain a bridge layer within GNF.

What I'm heading for at the moment is a table design that deviates slightly from BioSQL (it is currently stricter), but that I can, as you said, map straightforwardly to BioSQL via views, so that an application working off biosql would work off our db as well.
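
As an illustration of that kind of mapping, here is a small sqlite3 sketch. The stricter local table (local_simpair with a typed percent_identity column) and the term name are invented for the example; the view reproduces the generic simpair_qualifier_value shape from the tuple notation quoted earlier in this thread, not an actual BioSQL table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stricter local design: a typed, mandatory column
    -- instead of a free-form qualifier row.
    CREATE TABLE local_simpair (
        simpair_id       INTEGER PRIMARY KEY,
        bioentry1_id     INTEGER NOT NULL,
        bioentry2_id     INTEGER NOT NULL,
        percent_identity REAL NOT NULL
    );
    CREATE TABLE ontology_term (
        ontology_term_id INTEGER PRIMARY KEY,
        term_name        TEXT NOT NULL UNIQUE
    );
    -- View exposing the strict column in the generic term/value shape.
    CREATE VIEW simpair_qualifier_value AS
        SELECT s.simpair_id,
               t.ontology_term_id,
               1 AS qualifier_rank,
               CAST(s.percent_identity AS TEXT) AS qualifier_value
        FROM local_simpair s
        JOIN ontology_term t ON t.term_name = 'percent_identity';
""")

# Invented example data.
conn.execute("INSERT INTO ontology_term VALUES (1, 'percent_identity')")
conn.execute("INSERT INTO local_simpair VALUES (1, 100, 200, 97.5)")


def qualifiers(conn, simpair_id):
    """Read qualifiers through the generic view, as a generic client would."""
    return conn.execute(
        "SELECT t.term_name, q.qualifier_rank, q.qualifier_value "
        "FROM simpair_qualifier_value q "
        "JOIN ontology_term t ON t.ontology_term_id = q.ontology_term_id "
        "WHERE q.simpair_id = ?",
        (simpair_id,),
    ).fetchall()


print(qualifiers(conn, 1))
```

A view like this covers reading only. Writing through the generic interface would need updatable views, triggers, or stored procedures, depending on what the RDBMS offers.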

I don't actually know how others view biosql, but whether you consider it the low-level implementation or rather an API specification does have some impact on the design. With biosql's present, very generic ontology/value design, there is a problem in treating it as an API from an application-level perspective: there are a thousand ways to implement the semantics, and a particular application will have a hard time understanding all of them. That is, an application tagged 'runs off biosql' is not at all guaranteed to actually run off every biosql-compliant instance, unless 'biosql-compliant' includes adherence to a certain defined ontology. Defining that ontology may not be an easy task, and it could be a controversial one, too. But it's probably necessary in order to establish biosql as an API for sequence database / feature browsers.
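
The problem can be made concrete with a toy example in plain Python. The term names, the two instances, and the agreed vocabulary are all invented: both instances store the same fact as a generic term/value pair, but an application written against one term name finds nothing in the other unless both adhere to a shared set of terms.

```python
# Two hypothetical instances that both follow the generic term/value
# design but chose different term names for the same semantics.
instance_a = [("percent_identity", "97.5")]
instance_b = [("pct_id", "97.5")]

# Hypothetical agreed vocabulary that compliance could be defined against.
AGREED_TERMS = {"percent_identity"}


def find_qualifier(rows, term_name):
    """Return all values stored under exactly the given term name."""
    return [value for term, value in rows if term == term_name]


def unknown_terms(rows, agreed=AGREED_TERMS):
    """Return term names used by an instance that are not in the vocabulary."""
    return sorted({term for term, _ in rows} - set(agreed))


# The application asks for the term it knows about.
print(find_qualifier(instance_a, "percent_identity"))
print(find_qualifier(instance_b, "percent_identity"))
print(unknown_terms(instance_b))
```

A check along the lines of unknown_terms is only possible once a vocabulary has been agreed on, which is the potentially hard and controversial part.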

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------