[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Matthew Pocock matthew_pocock@yahoo.co.uk
Wed, 01 May 2002 23:47:21 +0100


Hi Chris,

Is there any easy way we could represent pairs of homologies as a 
special case of an n-sequence alignment? Something like a homology table 
with an id, and a homology_sequence table with an id, start, end? 
Adapters could deduce from the sequence id what kind of sequence a 
partner was (dna, protein, exotic). The homology entity defines the 
co-ordinate system of the alignment and is where you put homology-level 
annotation (like the blast or msf arguments). The homology_sequence 
entities are where you put individual sequence arguments like target/query.

Not sure how well this scales to vast numbers of homologies, but it does 
naturally model gaps (multiple homology_sequence hits with the same 
homology and sequence, but different start-end) and the split between 
homology annotation and per-sequence annotation.
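
As a minimal sketch of the two-table idea above, here is one way it could look, using Python's sqlite3 module purely for illustration. All column names beyond id/start/end (program, arguments, role), the sample ids and the sample blast arguments are assumptions made up for this example, not an agreed schema, and the alignment's own co-ordinate system is not modelled here.

```python
import sqlite3


def build_demo():
    """Create an in-memory database holding one gapped pairwise homology."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Homology-level annotation lives here (hypothetical columns).
        CREATE TABLE homology (
            homology_id INTEGER PRIMARY KEY,
            program     TEXT,
            arguments   TEXT
        );
        -- One row per ungapped stretch of one participating sequence.
        CREATE TABLE homology_sequence (
            homology_sequence_id INTEGER PRIMARY KEY,
            homology_id INTEGER NOT NULL REFERENCES homology(homology_id),
            sequence_id INTEGER NOT NULL,
            role        TEXT,              -- e.g. 'query' or 'target'
            seq_start   INTEGER NOT NULL,
            seq_end     INTEGER NOT NULL
        );
    """)
    # Sample values are invented for the example.
    conn.execute("INSERT INTO homology VALUES (1, 'blastn', '-e 1e-5')")
    rows = [
        # A gap in sequence 100: same homology and sequence, two ranges.
        (1, 100, "query", 1, 50),
        (1, 100, "query", 61, 120),
        (1, 200, "target", 1000, 1110),
    ]
    conn.executemany(
        "INSERT INTO homology_sequence "
        "(homology_id, sequence_id, role, seq_start, seq_end) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def segments(conn, homology_id, sequence_id):
    """Return the (start, end) ranges of one sequence in one homology."""
    cur = conn.execute(
        "SELECT seq_start, seq_end FROM homology_sequence "
        "WHERE homology_id = ? AND sequence_id = ? ORDER BY seq_start",
        (homology_id, sequence_id),
    )
    return cur.fetchall()


def partners(conn, homology_id):
    """Return the distinct sequence ids taking part in a homology."""
    cur = conn.execute(
        "SELECT DISTINCT sequence_id FROM homology_sequence "
        "WHERE homology_id = ? ORDER BY sequence_id",
        (homology_id,),
    )
    return [row[0] for row in cur]
```

A pairwise hit is then just a homology with two distinct partners, and an n-sequence alignment is the same structure with more partners. Nothing in the sketch addresses how this would perform with vast numbers of homologies.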

Matthew

Chris Mungall wrote:
> 
> On Wed, 1 May 2002, Hilmar Lapp wrote:
> 
> 
>>
>>>-----Original Message-----
>>>From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
>>>Sent: Wednesday, May 01, 2002 9:45 AM
>>>To: Hilmar Lapp
>>>Cc: GMOD Devel (E-mail); OBDA BioSQL (E-mail)
>>>Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes &
>>>features&mappings to assemblies
>>>
>>>
>>
>>[...]
>>
>>>Ignoring assemblies for a second, genomic alignments of EST or RefSeq
>>>sequences should use the (not yet present) featurepair table,
>>>(or whatever we decide to name this table).
>>
>>The uncertainty about how you guys envision this table is probably part
>>of my confusion. Is it in your opinion supposed to link two bioentries,
>>two features, a feature and a bioentry, or any combination of them? I'm
>>undecided as to what would be best; all of them seem to have advantages
>>and limitations.
> 
> 
> Let's outline them, using a shortened tuple notation:
> 
> 1) feature pair links two seqfeatures
> 
> sfpair(sfpair_id, seqfeature1_id, seqfeature2_id, e-value, score)
> sfpair_qualifier_value(sfpair_id, ontology_term_id, qualifier_rank,
>                        qualifier_value)
> 
> # maps cleanly to the bioperl object model
> # join-heavy
> # lots of rows instantiated
> 
> 2) use the existing sf_relationship table to make feature pairs
> 
> this would be *very* join heavy, and would require lots of rows just to
> represent, say, one blast hit - it would be a 3 level hierarchy, i.e.
> HitFeature <-> HSP <-> query/subject feature
> 
> this does have the advantage of being neutral with respect to the
> directionality of the blast/analysis (which sequence is query and which is
> subject), and also cleanly extends to >2 sequences.
> 
> but overall i think it would be too awkward / slow
> 
> 3) similaritypair links two biosequences
> 
> simpair(simpair_id, biosequence1_id, start1, end1,
>                     strand1, frame1,
>                     biosequence2_id, start2, end2,
>                     strand2, frame2,
>                     e-value,
>                     score)
> simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
>                         qualifier_value)
> 
> advantages - fairly lean in terms of rows instantiated, which is good,
> fewer joins.
> 
> you still have to instantiate bioentries unless you're fine with completely
> anonymous sequences (and what's the point in that), which means extra
> joins in retrieval.
> 
> 4) similaritypair links two bioentries
> 
> simpair(simpair_id, bioentry1_id, start1, end1,
>                     strand1, frame1,
>                     bioentry2_id, start2, end2,
>                     strand2, frame2,
>                     e-value,
>                     score)
> simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
>                         qualifier_value)
> 
> similar to (3). has the advantage over (3) that much of the time you don't
> need to join bioentry/biosequence, because the information you need
> (i.e. display_id/accession) is in bioentry. Even so, you will often
> need to make that extra join.
> 
> I think my preference is for (4), it's fastest. It doesn't map directly to
> the bioperl object model but the manual mapping in the adapters isn't too
> hard.
> 
> Note I'm only mentioning the bioperl object model to provoke others, eg
> Thomas to chime in.
> 
> I have to say, I'm still confused by the need for a bioentry/biosequence
> split. I can't see any situations in which you would want to store a tuple
> of one without the other.
> 
> My recommendation to people like Hilmar, who aren't constrained by MySQL,
> is to code to a simpler set of relations and database procedures. The
> important thing is to be sure that these relations and functions map simply
> (at the DBMS level) to the bioSQL core. These views/functions could then
> form satellite modules around bioSQL, or they could remain a bridge layer
> within GNF.
> 
> 
>>Since our local use case here may be isolated (although I don't believe
>>so), I'd be happy to see you come forward with an initial feature_pair
>>design that we'll adopt. Otherwise I'll settle on something and see how
>>it works out in practice.
> 
> 
> I don't think your case is so isolated
> 
> 
>>Thanks a lot Chris for your long response.
> 
> 
> Thanks for being the first non-genbank-roundtrip use-case biosql beta
> tester!
> 
> 
>>	-hilmar
>>
> 
> 