[Open-bio-l] RE: Schema for genes & features & mappings to assemblies

Hilmar Lapp hlapp@gnf.org
Tue, 23 Apr 2002 15:36:37 -0700


Hi all, first off, thanks for all the responses.

The bottom lines seem to be the following.

1) BioSQL has pretty wide acceptance, but lacks clear support for assemblies and gene structure interpretation. There is, however, interest in adding assemblies.
2) There is going to be a new schema in Ensembl which may be worth looking at (Ewan, do you have a pointer to DDL or even an ERD?).
3) GGB can run off any schema for which one writes an adaptor for Bio::DasI (caveat: the tag/values must not break the logic in the given aggregators, or one has to provide one's own; Lincoln, is that all roughly correct?)

So would someone be willing to drive assemblies in BioSQL? Eventually I will develop something if no one else does, but I don't feel I'm the best person to do this in a generic way that serves more than just ourselves well. Does cutting that part out of Ensembl sound like a good idea, or rather not?

As for supporting gene structures, we'll actually need that. I agree that the logic of aggregating certain features into genes needs to sit somewhere, but I also don't like confusing the model with the view. If gene structures are part of your model, that's great so long as their definition is more or less static. I just don't think that's the case yet, and then your model, and hence your database and applications, all break as soon as you have to adapt the gene structure model to advancements in science. I wouldn't want to define a gene structure entity if you don't even know exactly what belongs there. Aggregation on the software layer is one way of implementing a view on the model; sadly enough, MySQL completely dismissed the concept of models and views, but with Oracle you can implement any view you want in the database layer.
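
To illustrate aggregation on the software layer, here is a minimal sketch in which the stored model stays a flat list of features and the gene structure exists only as a computed view. Everything in it is hypothetical (the row layout, the "gene" tag, and the aggregate_genes function are assumptions made for illustration, not BioSQL, Ensembl, or Bio::DasI structures).

```python
from collections import defaultdict

# Hypothetical flat feature rows: (feature_id, type, start, end, tag/values).
# Nothing here is a real schema; it only stands in for "the model".
FEATURES = [
    (1, "exon", 100, 200, {"gene": "geneA"}),
    (2, "exon", 300, 450, {"gene": "geneA"}),
    (3, "exon", 900, 950, {"gene": "geneB"}),
    (4, "repeat", 500, 600, {}),
]


def aggregate_genes(features, part_type="exon", group_tag="gene"):
    """Build a gene-structure view by grouping features on a tag value.

    The aggregation rule (which type, which tag) lives here, in software,
    so it can change without touching the stored features.
    """
    parts = defaultdict(list)
    for _fid, ftype, start, end, tags in features:
        if ftype == part_type and group_tag in tags:
            parts[tags[group_tag]].append((start, end))

    genes = {}
    for name, spans in parts.items():
        spans.sort()
        genes[name] = {
            "start": spans[0][0],
            "end": max(end for _start, end in spans),
            "parts": spans,
        }
    return genes


if __name__ == "__main__":
    for name, gene in aggregate_genes(FEATURES).items():
        print(name, gene["start"], gene["end"], gene["parts"])
```

If the notion of what belongs to a gene changes, only the aggregation rule changes; the feature rows themselves stay as they are.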

What we ("we" referring to our group here) will need for assemblies is to represent existing ones such that we can stick in all the mappings (of features, genes, markers, etc.). I thought that the assembly would then provide the underlying sequence for all mapped entities; i.e., in order to obtain a feature's sequence you need to specify the feature /and/ the assembly (assuming you have a mapping for that assembly). This means a gene's CDS sequence can differ from one assembly to another. The open question is whether or not you still need a fixed sequence for that feature (e.g., in order to map it). Does this make some sense, or does it sound like a stupid idea?
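
A small sketch of that idea, under stated assumptions: the assembly names, the toy sequences, the 1-based inclusive coordinates, and the feature_sequence function are all invented for illustration and do not come from any existing schema.

```python
# Hypothetical toy assemblies; real ones would live in the database.
ASSEMBLIES = {
    "asm1": "AAACCCGGGTTT",
    "asm2": "AAACCGGGGTTTA",
}

# (feature_id, assembly) -> (start, end, strand), 1-based inclusive coordinates.
MAPPINGS = {
    ("cds1", "asm1"): (4, 9, 1),
    ("cds1", "asm2"): (4, 9, 1),
    ("cds2", "asm1"): (10, 12, -1),
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def feature_sequence(feature_id, assembly):
    """Return a feature's sequence as derived from one specific assembly."""
    key = (feature_id, assembly)
    if key not in MAPPINGS:
        raise KeyError(f"no mapping of {feature_id!r} onto {assembly!r}")
    start, end, strand = MAPPINGS[key]
    seq = ASSEMBLIES[assembly][start - 1:end]
    if strand < 0:
        # Reverse-complement features mapped to the minus strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


if __name__ == "__main__":
    print(feature_sequence("cds1", "asm1"))
    print(feature_sequence("cds1", "asm2"))
```

The same feature with the same coordinates yields a different sequence on each assembly, and a feature without a mapping for the requested assembly has no sequence at all.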

As for the ideas that were mentioned, I'm not sure how we (GNF) would want to exploit an n-depth representation of nested contigs, but others may well do so (as a remote idea, could you use that for in-silico SNP detection?). I disagree with Ewan's stance that the mere possibility of nested assemblies would necessarily require every application to handle them: you could just test for a flat assembly and exit gracefully if it's nested and you can't handle that. The impact of not allowing something that some people need (or want) appears to be worse to me.
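
The "test for flat and exit gracefully" approach could look roughly like the following sketch. The nested-dict representation of an assembly, the NestedAssemblyError class, and the function names are assumptions made for illustration only.

```python
class NestedAssemblyError(Exception):
    """Raised by code that only supports flat (one-level) assemblies."""


def depth(component):
    """Number of nesting levels below this component (0 for a leaf contig)."""
    children = component.get("children", [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)


def is_flat(assembly):
    """True if the assembly consists of leaf contigs only."""
    return depth(assembly) <= 1


def contig_names(assembly):
    """Example of an application routine that handles flat assemblies only."""
    if not is_flat(assembly):
        raise NestedAssemblyError(
            f"assembly {assembly['name']!r} is nested; only flat ones are supported"
        )
    return [child["name"] for child in assembly.get("children", [])]


if __name__ == "__main__":
    flat = {"name": "chr1", "children": [{"name": "c1"}, {"name": "c2"}]}
    nested = {
        "name": "chr2",
        "children": [{"name": "super1", "children": [{"name": "c3"}]}],
    }
    print(contig_names(flat))
    try:
        contig_names(nested)
    except NestedAssemblyError as err:
        print("skipping:", err)
```

An application that cannot deal with nesting pays only for the check, while the schema stays open to those who need deeper structures.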

The zero-level approach sounds appealing to me; but wouldn't that require that the chromosome lengths be all known?

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp@gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------