[GMOD-devel] Re: [Open-bio-l] Schema for genes & features
&mappings to assemblies
Chris Mungall
cjm@bdgp.lbl.gov
Wed, 1 May 2002 09:44:55 -0700 (PDT)
On Tue, 30 Apr 2002, Hilmar Lapp wrote:
> I like this proposal, but ...
>
> > 9. project-centric column names like "chromosome" are avoided; eg
> > drosophila has chromosome arms as top level sequences
really this is just a column naming issue, it doesn't affect the structure
of the schema at all.
> So, I'm still confused about how I am supposed to store gene
> predictions, EST, RefSeq, or whatever mappings to chromosomes in an
> assembly, such that I can answer queries like 'show me all exons of
> genes and their lines of evidence that map between markers X and Y on
> chromosome 5 of mouse. Next, show how human genes map to this region,
> and which human chromosomes.'
Ignoring assemblies for a second, genomic alignments of EST or RefSeq
sequences should use the (not yet present) featurepair table, (or whatever
we decide to name this table).
The exons would be stored as seqfeatures, as should the markers (presuming
you've mapped them to base pair coordinates)
assuming flat features i.e. no-assembly, you just select all seqfeatures
and featurepairs between X and Y, which will give you mouse genes and
associated evidence.
Amongst these feature pairs will be hits to RefSeq human. Let's say you're
storing both mouse and human in the same db, and you've run blast or
whatever on both mouse & human chromosomes vs RefSeq. You can do a query
joining bioentry to taxon (human only), and 2 joins of featurepair to get
the human sequences that match to both human and mouse.
It's a bit of a pain going through bioentry all the time, there is an
argument for collapsing bioentry and biosequence into the same table, it
would reduce a lot of joins.
That's one way; there are multiple ways to do it depending on how your
project works. Maybe you store RefSeq IDs as dbxrefs linked to your
seqfeature gene entries. There are a number of ways of determining the
'how human genes map to this region' part of your example query (e.g.
large scale synteny blocks, blastp results on the peptide level, curated
homologies...), I think all of them are possible in BioSQL in one way or
another.
Now if we throw assemblies into the equation, it gets a little trickier.
With the solution I outlined in an earlier email, you can do the exact
same query as you would above, using views to dynamically project your
features-on-contigs to features-on-chromosomes.
This could be slow, and it wouldn't be possible in MySQL.
However, if you were prepared to do all your data mining type queries on a
warehouse db recreated nightly, you could still use MySQL, you just need
to materialise the views. This is effectively what i do with GadFly
(MySQL) - the production db has features mapped to assembly units, the
public/warehouse copy gets flattened to chromosome arm coordinates, and
the adapter code is given a hint not to bother checking assembly units.
If you want to use MySQL *and* you want to use assemblies *and* you want
to do your query on a production database *then* you're forced to
either write really complex SQL *or* do it via adapters, in which case you
lose the ability to specify complex queries declaratively.
Somewhere is this multidimensional matrix of use-cases and DBMSs and
langauges there's cells that are unsatisfactory or just impossible, I
don't know whether BioSQL should be aiming to fill in as much of this
matrix as possible or just to focus on certain areas.
> Maybe someone can help me lifting my confusion.
>
> How is this done in GMOD and Ensembl, and how does that map to BioSQL
> with the assembly proposal below?
>
> -hilmar
>