[DAS] RFC for feature data model

Thu, 22 Aug 2002 23:16:04 +0100

Hi all,

There is some discussion on the biojava-dev list at the moment about 
changing our core feature/sequence model. It would be nice to be able to 
work with gene objects without needing any genomic data to be 
available. Also, the same gene instance really should be used regardless 
of the coordinate system in place for the sequence you have hold of, 
whether you are viewing a contig, a chromosome, or an EMBL dump of the 
region. Potentially, you could have a single LINE repeat object and bind 
it to the genome at every place RepeatMasker calls a repeat. This 
decouples the biological inheritance hierarchy (or ontology) from the 
sequence/location stuff, which is probably a good thing all round.

Matthew

The proposal
------------
Any format/data-model we use to annotate interesting regions of a 
sequence should store all the information necessary to mark a region of 
sequence as being covered by some sort of feature entity (e.g. a list of 
ranges, possibly only one range long, plus optional strand info) 
and a link, id, URN/URI, ontology term or whatever identifying the actual 
feature at that location (gene, exon, etc. ad nauseam). A second service 
may be used to resolve the link, id or whatever to the feature entity 
itself.
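
As a rough illustration of the split, here is a minimal Java sketch. 
Every name in it (Range, RegionAnnotation, FeatureEntity, 
FeatureResolver, the urn:example: ids) is hypothetical and is not part 
of BioJava or DAS. It also assumes 1-based inclusive coordinates and an 
in-memory map standing in for the second service.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration only; not BioJava or DAS API.

/** Strand of a region; UNKNOWN covers the "optional strand info" case. */
enum Strand { PLUS, MINUS, UNKNOWN }

/** One contiguous range, assumed here to be 1-based and inclusive. */
final class Range {
    final int start;
    final int end;

    Range(int start, int end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("bad range " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }
}

/** Region record: where something is, plus a reference to what it is. */
final class RegionAnnotation {
    final String sequenceId;
    final List<Range> ranges;   // at least one range
    final Strand strand;
    final URI featureRef;       // link/id/URN resolved by a second service

    RegionAnnotation(String sequenceId, List<Range> ranges, Strand strand, URI featureRef) {
        if (ranges.isEmpty()) {
            throw new IllegalArgumentException("need at least one range");
        }
        this.sequenceId = sequenceId;
        this.ranges = Collections.unmodifiableList(new ArrayList<>(ranges));
        this.strand = strand;
        this.featureRef = featureRef;
    }
}

/** The rich feature entity; it carries no coordinates at all. */
final class FeatureEntity {
    final URI id;
    final String type;
    final String description;

    FeatureEntity(URI id, String type, String description) {
        this.id = id;
        this.type = type;
        this.description = description;
    }
}

/** The "second service": turns a reference into the entity itself. */
interface FeatureResolver {
    FeatureEntity resolve(URI ref);   // returns null if the reference is unknown
}

/** In-memory stand-in for a remote entity service. */
final class InMemoryResolver implements FeatureResolver {
    private final Map<URI, FeatureEntity> entities = new HashMap<>();

    void register(FeatureEntity entity) {
        entities.put(entity.id, entity);
    }

    @Override
    public FeatureEntity resolve(URI ref) {
        return entities.get(ref);
    }
}

class RegionDemo {
    public static void main(String[] args) {
        URI line = URI.create("urn:example:repeat:LINE1");
        InMemoryResolver resolver = new InMemoryResolver();
        resolver.register(new FeatureEntity(line, "repeat", "LINE repeat"));

        // The same entity bound to two coordinate systems.
        List<RegionAnnotation> regions = Arrays.asList(
            new RegionAnnotation("contig1", Arrays.asList(new Range(100, 250)), Strand.PLUS, line),
            new RegionAnnotation("chr1", Arrays.asList(new Range(5100, 5250)), Strand.PLUS, line));

        for (RegionAnnotation region : regions) {
            FeatureEntity entity = resolver.resolve(region.featureRef);
            System.out.println(region.sequenceId + " -> " + entity.description);
        }
    }
}
```

Both region records resolve to the same FeatureEntity instance, which 
is the property the single LINE repeat example relies on.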

Possible costs
--------------
1) double the number of transactions - one for region data and one for 
feature objects.
2) writing and maintaining two services rather than one.

Possible Solutions/Counter Arguments
------------------------------------
1) If the XML schemas are written sensibly, then inline rich objects 
could be used interchangeably with linked-in rich objects.
2) We have to write the code to serialize/deserialize this info anyway - 
all we're doing is giving the user another access point.
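
One way to picture counter argument 1 on the client side is sketched 
below. FeatureSlot and its methods are hypothetical names, and the 
resolver is just a function from id to a text body; no real schema or 
service is implied. The point is only that a caller can ask for the 
body without caring whether it arrived inline or has to be fetched.

```java
import java.util.function.Function;

// Hypothetical type for illustration only; not BioJava or DAS API.

/** A feature slot that holds either the rich object inline or only its id. */
final class FeatureSlot {
    private final String id;
    private final String inlineBody;   // null when the object is linked-in

    private FeatureSlot(String id, String inlineBody) {
        this.id = id;
        this.inlineBody = inlineBody;
    }

    /** The rich object was shipped together with the region data. */
    static FeatureSlot inline(String id, String body) {
        if (body == null) {
            throw new IllegalArgumentException("inline body must not be null");
        }
        return new FeatureSlot(id, body);
    }

    /** Only the id was shipped; the body lives in the entity service. */
    static FeatureSlot linked(String id) {
        return new FeatureSlot(id, null);
    }

    boolean isInline() {
        return inlineBody != null;
    }

    /** Same call either way; the resolver is consulted only for linked slots. */
    String body(Function<String, String> resolver) {
        return isInline() ? inlineBody : resolver.apply(id);
    }
}

class SlotDemo {
    public static void main(String[] args) {
        // Stand-in for the second transaction to the entity service.
        Function<String, String> resolver = id -> "fetched body for " + id;

        FeatureSlot inline = FeatureSlot.inline("urn:example:gene:ADH-human", "ADH gene, human");
        FeatureSlot linked = FeatureSlot.linked("urn:example:gene:ADH-human");

        System.out.println(inline.body(resolver));   // no lookup needed
        System.out.println(linked.body(resolver));   // one lookup
    }
}
```

With this shape, a producer that worries about the doubled transaction 
count can inline the objects it expects clients to need, and link the 
rest.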

Possible benefits
-----------------
1) the region handling service becomes very simple and regular in the 
info it serves (all complex fluffy objects are in the other service).
2) different data producers could link to different world views of 
entity types fairly painlessly.
3) the entity service can be re-used in different bioinformatics domains 
e.g. the genome entity services could be used totally independently of 
chromosomal information for things like:
     * GO editing
     * annotating micro array spots
4) info relevant to the rich entity lives on that entity, info relevant 
to its instance or projection lives on the projection (e.g. you could 
annotate the region with blast scores and link to the (ADH,human) gene 
entity which contains rich annotation about ADH in human, and presumably 
has links to both ADH and human if you want to find more stuff out).
5) the rich objects could be stored on a totally different server, 
allowing better reuse of complex concepts.

-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk
