[DAS] RFC for feature data model

David Block dblock@gnf.org
Thu, 22 Aug 2002 16:35:21 -0700


This looks kind of similar to SymGene's 'gene-centric' database schema.  
In fact, we wanted to decouple location from features, so that a feature 
could exist on multiple assemblies, and so that users could traverse 
links between interesting features without ever consulting a genomic 
sequence.
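
As a rough illustration of that decoupling (a minimal sketch, not SymGene's actual schema; every class, field, and identifier below is hypothetical), a feature can carry only its identity and links, while a separate placement record binds it to each assembly:

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A biological entity with no coordinates of its own (hypothetical model)."""
    feature_id: str
    kind: str
    # relation name -> list of linked feature ids
    links: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Placement:
    """Binds a feature id to a range on one assembly (hypothetical model)."""
    feature_id: str
    assembly: str
    seq_id: str
    start: int
    end: int
    strand: int = 0


# Made-up example data: one gene placed on two assemblies.
features = {
    "gene:ADH1": Feature("gene:ADH1", "gene", {"has_transcript": ["tx:ADH1-001"]}),
    "tx:ADH1-001": Feature("tx:ADH1-001", "transcript"),
}

placements = [
    Placement("gene:ADH1", "assembly_A", "chr4", 1000, 5000, -1),
    Placement("gene:ADH1", "assembly_B", "chr4", 1200, 5200, -1),
]


def placements_of(feature_id):
    """Return every placement of a feature, across all assemblies."""
    return [p for p in placements if p.feature_id == feature_id]


def linked(feature_id, relation):
    """Follow a named link between features without touching any sequence."""
    return [features[i] for i in features[feature_id].links.get(relation, [])]
```

Here `linked("gene:ADH1", "has_transcript")` walks from gene to transcript without consulting any placement, and the transcript is a perfectly valid feature even though it has no location at all.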

We are also able to hang evidence on the relationships between 
entities - so Gene A is orthologous to Gene B, as evidenced by paper C.  
The evidence would otherwise be awkwardly linked to both Gene A and Gene 
B.  Attaching it to the relationship makes the assertion it supports 
explicit.
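
A minimal sketch of evidence hanging on a relationship rather than on either entity (the `Relationship` class and `evidence_for` helper are hypothetical names for illustration, not our actual tables):

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    """An assertion between two entities, carrying its own evidence."""
    subject: str
    predicate: str
    obj: str
    evidence: list = field(default_factory=list)


def evidence_for(relationships, subject, predicate, obj):
    """Collect the evidence attached to one specific assertion."""
    found = []
    for rel in relationships:
        if (rel.subject, rel.predicate, rel.obj) == (subject, predicate, obj):
            found.extend(rel.evidence)
    return found


# The example from the text: Gene A is orthologous to Gene B, per paper C.
relationships = [
    Relationship("Gene A", "orthologous_to", "Gene B", evidence=["paper C"]),
]
```

Asking for the evidence behind "Gene A orthologous_to Gene B" returns paper C directly; neither gene record has to hold a reference to the paper.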

Just my $.02...


On Thursday, August 22, 2002, at 03:16 PM, Matthew Pocock wrote:

> Hi all,
>
> There is some discussion on the biojava-dev list at the moment about 
> changing our core feature/sequence model. It would be nice to be able 
> to work with gene objects totally without the need for genomic data 
> available. Also, really the same gene instance should be used regardless 
> of the coordinate system in place for the sequence you have hold of - 
> if you are viewing a contig, or a chromosome, or an EMBL dump of the 
> region. Potentially, you could have a single LINE repeat object, and 
> bind it to the genome every place RepeatMasker calls a repeat. This 
> decouples the biological inheritance hierarchy (or ontology) from the 
> sequence/location stuff, which is probably a good thing all round.
>
> Matthew
>
> The proposal
> ------------
> Any format/data-model we use to annotate interesting regions of a 
> sequence should store all information necessary to mark a region of 
> sequence as being covered by some sort of feature entity (e.g. a list 
> of ranges - possibly 1 range element in length, and optional strand 
> info) and a link, id, URN/URI, ontology term or whatever giving the 
> actual feature at that location (gene, exon, etc. ad nauseam). A second 
> service may be used to resolve the link, id or whatever to the feature 
> entity itself.
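
The split proposed above might be sketched roughly as follows. This is only an illustration under assumed names: `RegionService`, `EntityService`, the URN, and the entity fields are all hypothetical and are not part of DAS or BioJava.

```python
class RegionService:
    """Stores only ranges, optional strand, and an opaque feature reference."""

    def __init__(self):
        self._regions = []  # (seq_id, ranges, strand, feature_ref)

    def annotate(self, seq_id, ranges, feature_ref, strand=None):
        """Mark a list of inclusive (start, end) ranges as covered by a feature."""
        self._regions.append((seq_id, list(ranges), strand, feature_ref))

    def overlapping(self, seq_id, start, end):
        """Return (ranges, strand, feature_ref) for regions touching [start, end]."""
        hits = []
        for sid, ranges, strand, ref in self._regions:
            if sid == seq_id and any(s <= end and e >= start for s, e in ranges):
                hits.append((ranges, strand, ref))
        return hits


class EntityService:
    """Resolves a feature reference to the rich feature entity."""

    def __init__(self, entities):
        self._entities = dict(entities)

    def resolve(self, feature_ref):
        """Return the entity for a reference, or None if it is unknown."""
        return self._entities.get(feature_ref)


# Made-up example data.
ADH_REF = "urn:example:gene:ADH-human"
regions = RegionService()
regions.annotate("chr4", [(100, 200), (300, 400)], ADH_REF, strand=-1)
entities = EntityService({ADH_REF: {"type": "gene", "name": "ADH", "species": "human"}})
```

A client would call `regions.overlapping("chr4", 150, 160)` first and then pass each returned reference to `entities.resolve(...)`, which is the second transaction counted under the costs below.
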
>
> Possible costs
> --------------
> 1) double the number of transactions - one for region data and one for 
> feature objects.
> 2) writing and maintaining two services rather than one.
>
> Possible Solutions/Counter Arguments
> ------------------------------------
> 1) If the XML schemas are written sensibly, then inline rich objects 
> could be used interchangeably with linked-in rich objects.
> 2) We have to write the code to serialize/deserialize this info 
> anyway - all we're doing is giving the user another access point.
>
> Possible benefits
> -----------------
> 1) the region handling service becomes very simple and regular in the 
> info it serves (all complex fluffy objects are in the other service).
> 2) different data producers could link to different world views of 
> entity types fairly painlessly.
> 3) the entity service can be re-used in different bioinformatics 
> domains e.g. the genome entity services could be used totally 
> independently of chromosomal information for things like:
>     * GO editing
>     * annotating micro array spots
> 4) info relevant to the rich entity lives on that entity, info relevant 
> to its instance or projection lives on the projection (e.g. you could 
> annotate the region with BLAST scores and link to the (ADH,human) gene 
> entity, which contains rich annotation about ADH in human and 
> presumably has links to both ADH and human if you want to find out 
> more).
> 5) the rich objects could be stored on a totally different server, 
> allowing better reuse of complex concepts.
>
> -- BioJava Consulting LTD - Support and training for BioJava
> http://www.biojava.co.uk
>
--
David Block                                  dblock@gnf.org
GNF - San Diego, CA             http://www.gnf.org
Genome Informatics / Enterprise Programming
Weblog:      http://radio.weblogs.com/0104507/