[GMOD-devel] Re: [Open-bio-l] Schema for genes & features&mappings to assemblies

Chris Mungall cjm@bdgp.lbl.gov
Wed, 1 May 2002 13:38:21 -0700 (PDT)


On Wed, 1 May 2002, Hilmar Lapp wrote:

>
>
> > -----Original Message-----
> > From: Chris Mungall [mailto:cjm@bdgp.lbl.gov]
> > Sent: Wednesday, May 01, 2002 9:45 AM
> > To: Hilmar Lapp
> > Cc: GMOD Devel (E-mail); OBDA BioSQL (E-mail)
> > Subject: RE: [GMOD-devel] Re: [Open-bio-l] Schema for genes &
> > features&mappings to assemblies
> >
> >
> [...]
> >
> > Ignoring assemblies for a second, genomic alignments of EST or RefSeq
> > sequences should use the (not yet present) featurepair table,
> > (or whatever we decide to name this table).
>
> The uncertainty about how you guys envision this table is probably part
> of my confusion. Is it in your opinion supposed to link two bioentries,
> two features, a feature and a bioentry, or any combination of them? I'm
> undecided as to what would be best; all of them seem to have advantages
> and limitations.

Let's outline them, using a shortened tuple notation:

1) feature pair links two seqfeatures

sfpair(sfpair_id, seqfeature1_id, seqfeature2_id, e-value, score)
sfpair_qualifier_value(sfpair_id, ontology_term_id, qualifier_rank,
                       qualifier_value)

# maps cleanly to the bioperl object model
# join-heavy
# lots of rows instantiated
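
A rough sketch of what (1) costs for a single alignment, using Python's
sqlite3 module. The seqfeature and seqfeature_location columns are trimmed
down assumptions rather than the real bioSQL definitions, and "evalue"
stands in for e-value because the hyphen is not legal in a bare column name.

```python
import sqlite3

# All table and column names here are illustrative assumptions, not the
# actual bioSQL definitions; "evalue" stands in for "e-value".
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL
);
CREATE TABLE seqfeature_location (
    seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    seq_start     INTEGER NOT NULL,
    seq_end       INTEGER NOT NULL,
    seq_strand    INTEGER NOT NULL
);
CREATE TABLE sfpair (
    sfpair_id      INTEGER PRIMARY KEY,
    seqfeature1_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    seqfeature2_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    evalue         REAL,
    score          REAL
);
""")

# One alignment between bioentry 10 and bioentry 20 (made-up ids).
con.execute("INSERT INTO seqfeature VALUES (1, 10)")
con.execute("INSERT INTO seqfeature VALUES (2, 20)")
con.execute("INSERT INTO seqfeature_location VALUES (1, 1500, 1800, 1)")
con.execute("INSERT INTO seqfeature_location VALUES (2, 1, 301, 1)")
con.execute("INSERT INTO sfpair VALUES (1, 1, 2, 1e-50, 540.0)")

# Count every row that had to be created for this one alignment.
rows_per_hit = sum(
    con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("seqfeature", "seqfeature_location", "sfpair")
)

# Recovering both coordinate ranges takes four joins.
pair = con.execute("""
SELECT f1.bioentry_id, l1.seq_start, l1.seq_end,
       f2.bioentry_id, l2.seq_start, l2.seq_end, p.score
FROM sfpair p
JOIN seqfeature f1          ON f1.seqfeature_id = p.seqfeature1_id
JOIN seqfeature_location l1 ON l1.seqfeature_id = f1.seqfeature_id
JOIN seqfeature f2          ON f2.seqfeature_id = p.seqfeature2_id
JOIN seqfeature_location l2 ON l2.seqfeature_id = f2.seqfeature_id
WHERE p.sfpair_id = 1
""").fetchone()
```

Under these assumptions one alignment instantiates five rows across three
tables, before any qualifier values are added.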

2) use the existing sf_relationship table to make feature pairs

this would be *very* join heavy, and would require lots of rows just to
represent, say, one blast hit - it would be a 3 level hierarchy, i.e.
HitFeature <-> HSP <-> query/subject feature

this does have the advantage of being neutral with respect to the
directionality of the blast/analysis (which sequence is query and which is
subject), and also cleanly extends to >2 sequences.

but overall i think it would be too awkward / slow
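
For comparison, a sketch of the 3 level hierarchy in (2), again with
sqlite3. The parent_id/child_id columns and the "kind" label are
hypothetical; the real sf_relationship table may look different. Location
rows are left out, so the real row count would be higher still.

```python
import sqlite3

# Hypothetical minimal tables; the real sf_relationship columns may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    kind          TEXT NOT NULL      -- 'hit', 'hsp' or 'span'
);
CREATE TABLE sf_relationship (
    parent_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    child_id  INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id)
);
""")

# One blast hit with a single HSP covering two sequences. Neither leaf is
# marked as query or subject; a third sequence would just be another leaf.
con.executemany("INSERT INTO seqfeature VALUES (?, ?)",
                [(1, "hit"), (2, "hsp"), (3, "span"), (4, "span")])
con.executemany("INSERT INTO sf_relationship VALUES (?, ?)",
                [(1, 2), (2, 3), (2, 4)])

n_rows = (
    con.execute("SELECT COUNT(*) FROM seqfeature").fetchone()[0]
    + con.execute("SELECT COUNT(*) FROM sf_relationship").fetchone()[0]
)

# Walking from the hit down to its leaf features takes four joins.
leaves = con.execute("""
SELECT leaf.seqfeature_id, leaf.kind
FROM seqfeature hit
JOIN sf_relationship r1 ON r1.parent_id = hit.seqfeature_id
JOIN seqfeature hsp     ON hsp.seqfeature_id = r1.child_id
JOIN sf_relationship r2 ON r2.parent_id = hsp.seqfeature_id
JOIN seqfeature leaf    ON leaf.seqfeature_id = r2.child_id
WHERE hit.seqfeature_id = 1
ORDER BY leaf.seqfeature_id
""").fetchall()
```

Seven rows for one single-HSP hit in this toy layout, and the walk down
the hierarchy has to be repeated for every hit retrieved.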

3) similaritypair links two biosequences

simpair(simpair_id, biosequence1_id, start1, end1,
                    strand1, frame1,
                    biosequence2_id, start2, end2,
                    strand2, frame2,
                    e-value,
                    score)
simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
                        qualifier_value)

advantages - fairly lean in terms of rows instantiated, which is good,
and fewer joins.

you still have to instantiate bioentries unless you're fine with completely
anonymous sequences (and what's the point in that), which means extra
joins in retrieval.

4) similaritypair links two bioentries

simpair(simpair_id, bioentry1_id, start1, end1,
                    strand1, frame1,
                    bioentry2_id, start2, end2,
                    strand2, frame2,
                    e-value,
                    score)
simpair_qualifier_value(simpair_id, ontology_term_id, qualifier_rank,
                        qualifier_value)

similar to (3). has the advantage over (3) that you often don't need to
join bioentry/biosequence, since the information you need (i.e.
display_id/accession) is usually in bioentry. Even so, there will be
plenty of cases where you need to make that extra join.
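
The difference between (3) and (4) on retrieval can be sketched as below.
The bioentry and biosequence tables are cut down to the columns needed,
the biosequence -> bioentry foreign key is an assumption, simpair_seq and
simpair_entry are just labels for the two variants, and the accessions
are made up.

```python
import sqlite3

# Assumed, heavily trimmed bioentry/biosequence tables; strand, frame and
# e-value are dropped from both simpair variants to keep the sketch short.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE biosequence (
    biosequence_id INTEGER PRIMARY KEY,
    bioentry_id    INTEGER NOT NULL REFERENCES bioentry(bioentry_id)
);
CREATE TABLE simpair_seq (      -- option (3): keyed on biosequence
    simpair_id      INTEGER PRIMARY KEY,
    biosequence1_id INTEGER, start1 INTEGER, end1 INTEGER,
    biosequence2_id INTEGER, start2 INTEGER, end2 INTEGER,
    score           REAL
);
CREATE TABLE simpair_entry (    -- option (4): keyed on bioentry
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER, start1 INTEGER, end1 INTEGER,
    bioentry2_id INTEGER, start2 INTEGER, end2 INTEGER,
    score        REAL
);
""")

con.executemany("INSERT INTO bioentry VALUES (?, ?)",
                [(10, "GENOMIC_A"), (20, "EST_B")])
con.executemany("INSERT INTO biosequence VALUES (?, ?)",
                [(100, 10), (200, 20)])
con.execute("INSERT INTO simpair_seq VALUES (1, 100, 1500, 1800, 200, 1, 301, 540.0)")
con.execute("INSERT INTO simpair_entry VALUES (1, 10, 1500, 1800, 20, 1, 301, 540.0)")

# Option (3): two hops per side to reach the accession (four joins).
via_seq = con.execute("""
SELECT e1.accession, p.start1, p.end1, e2.accession, p.start2, p.end2
FROM simpair_seq p
JOIN biosequence s1 ON s1.biosequence_id = p.biosequence1_id
JOIN bioentry e1    ON e1.bioentry_id = s1.bioentry_id
JOIN biosequence s2 ON s2.biosequence_id = p.biosequence2_id
JOIN bioentry e2    ON e2.bioentry_id = s2.bioentry_id
""").fetchone()

# Option (4): one hop per side (two joins).
via_entry = con.execute("""
SELECT e1.accession, p.start1, p.end1, e2.accession, p.start2, p.end2
FROM simpair_entry p
JOIN bioentry e1 ON e1.bioentry_id = p.bioentry1_id
JOIN bioentry e2 ON e2.bioentry_id = p.bioentry2_id
""").fetchone()
```

Both queries return the same tuple; (4) just gets there with half the
joins whenever the accession is all that is needed.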

I think my preference is for (4); it's the fastest. It doesn't map directly
to the bioperl object model, but the manual mapping in the adapters isn't
too hard.

Note I'm only mentioning the bioperl object model to provoke others, e.g.
Thomas, to chime in.

I have to say, I'm still confused by the need for a bioentry/biosequence
split. I can't see any situations in which you would want to store a tuple
of one without the other.

My recommendation to people like Hilmar, who aren't constrained by MySQL,
is to code to a simpler set of relations and database procedures. The
important thing is to be sure that these relations and functions map simply
(at the DBMS level) to the bioSQL core. These views/functions could then
form satellite modules around bioSQL, or they could remain a bridge layer
within GNF.
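
One way such a bridge layer might look, sketched with a view over the
option (4) simpair table. The view name "alignment" and its column names
are hypothetical, the tables are trimmed to the columns used, and sqlite3
is used here only to keep the example self-contained.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Trimmed core tables (assumed columns only).
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE simpair (
    simpair_id   INTEGER PRIMARY KEY,
    bioentry1_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start1 INTEGER, end1 INTEGER,
    bioentry2_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    start2 INTEGER, end2 INTEGER,
    score  REAL
);
-- Hypothetical bridge view: the simpler relation that client code sees.
CREATE VIEW alignment AS
SELECT e1.accession AS seq1_acc, p.start1 AS seq1_start, p.end1 AS seq1_end,
       e2.accession AS seq2_acc, p.start2 AS seq2_start, p.end2 AS seq2_end,
       p.score
FROM simpair p
JOIN bioentry e1 ON e1.bioentry_id = p.bioentry1_id
JOIN bioentry e2 ON e2.bioentry_id = p.bioentry2_id;
""")

con.executemany("INSERT INTO bioentry VALUES (?, ?)",
                [(10, "GENOMIC_A"), (20, "EST_B")])
con.execute("INSERT INTO simpair VALUES (1, 10, 1500, 1800, 20, 1, 301, 540.0)")

# Client code queries the view and never touches the core tables directly.
hits = con.execute(
    "SELECT seq2_acc, seq1_start, seq1_end, score "
    "FROM alignment WHERE seq1_acc = ?",
    ("GENOMIC_A",),
).fetchall()
```

If the core tables change later, only the view definition has to follow;
queries written against "alignment" stay as they are.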

> Since our local use case here may be isolated (although I don't believe
> so), I'd be happy to see you come forward with an initial feature_pair
> design that we'll adopt. Otherwise I'll settle on something and see how
> it works out in practice.

I don't think your case is so isolated.

> Thanks a lot Chris for your long response.

Thanks for being the first non-genbank-roundtrip use-case biosql beta
tester!

> 	-hilmar
>