[DAS] Re: Our identifier doc and proposal

Brian Gilman gilmanb@genome.wi.mit.edu
Thu, 6 Dec 2001 20:43:33 -0500 (EST)


Hello,

	Sorry for the late response to the doc! We're trying to get a
release out the door. 

	I don't see any problem with using DAML+OIL and UDDI together; I
think they are complementary. As I understand it, DAML+OIL lets the
ontologist describe the semantic relationships among entities, while UDDI
(Universal Description, Discovery and Integration) is used to find the
repository that holds the entities of interest. You can't really have one
without the other.

	I agree with Thomas and Matthew in their assessment of the wording
of the document and the practical matter of being able to define your
own local ontology. When a client asks for an explicit entity in the
database, the server should give it back without having to go and ask the
base ontology what the user was asking for. I don't see this as a
technical challenge; it comes back to the identification of entities in
the database and a common way of representing them. Might I suggest the
following approach: we settle on a UUID scheme for entities in the
database over e-mail, then try to hold a face-to-face meeting to hash out
a skeleton ontology for genomics. I would be happy to set up this meeting
at Whitehead. I think this kind of discussion needs more bandwidth and a
whiteboard. Or we could have a meeting at the O'Reilly conference at the
end of January? Perhaps we could send Lincoln/the list our draft DAML+OIL
documents to get a start on the problem and then discuss them at the
face-to-face meeting? Do others agree with this, or am I out in left
field?
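
	To make the UUID idea concrete, here is a minimal sketch, not a
proposal for the actual scheme. It assumes that each provider has a short
name and a local identifier for every entity; the namespace string, the
provider names and the entity_uuid() helper are all made up for
illustration.

```python
import uuid

# Hypothetical shared namespace; a real scheme would have to agree on one.
DAS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "biodas.org")


def entity_uuid(provider: str, local_id: str) -> uuid.UUID:
    """Derive a stable, name-based UUID from a provider and its local id."""
    return uuid.uuid5(DAS_NAMESPACE, f"{provider}/{local_id}")


# The same provider and local id always yield the same UUID, so a client
# can ask for an explicit entity without consulting any ontology.
a = entity_uuid("site_a", "gene:0001")
b = entity_uuid("site_a", "gene:0001")
c = entity_uuid("site_b", "gene:0001")
print(a == b, a == c)  # True False
```

	Name-based UUIDs are used here only because they can be recomputed
by anyone who knows the provider and the local id; random UUIDs plus a
lookup table would serve the same purpose.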

	Perhaps a little domain engineering is needed here. As Lincoln has
stated, people are thinking of DAS as a "cure all" for the problems in
genomics. Well folks, it's not, plain and simple, BUT IT HAS GREAT
POTENTIAL. People are now viewing it as the panacea or lingua franca of
bioinformatics when in reality it leaves a large portion of the
bioinformatics/biology community out to dry. It also seems to me that we
have not, as a community of DAS users/providers, defined the problems that
we are trying to address. Might it be more productive to first define a
few problems that each of us is facing, to see if we are on the same
page?

	I will try and list mine below:

************************************************************************

	A list of genes is suspected to be involved in a disease pathway.
The researcher wants to retrieve all or a subset of the annotations for
this list of genes. These annotations may suggest a deeper analysis of the
region, i.e. a literature search for related annotations or further
computational analysis, which may lead to the discovery of a functional
transcript. Finally, the researcher identifies a possible transcript and
protein expression analysis may begin.


	Sequence data has been obtained for a region of interest through
sequencing techniques. This sequence is BLASTed against the latest
physical map of the genome to find its genomic coordinates. These
coordinates are used to poll annotation databases for annotations of
interest. In some cases the sequence is BLASTed against a protein database
to look for protein identity or family membership.

	
	The physical coordinate of a feature on the genome is known, and the
feature is suspected to take part in a metabolic or disease pathway.
Other annotations are then pulled off the web in order to gain a better
understanding of the region. This evidence narrows the physical
coordinates of the region of interest. If no annotations exist, then a
computational annotation may be used in place of a biological one. This
may lead to the discovery of a novel protein. If no protein is found, the
region may be re-searched with other computational tools.

*************************************************************************
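
	The second scenario above ends with a coordinate-based query against
annotation servers. Below is a minimal sketch of building such a request.
It assumes the DAS 1.x style "features" command with a
segment=REF:start,stop argument and optional type filters; the server URL,
the data source name and the features_url() helper are made up for
illustration.

```python
from urllib.parse import quote


def features_url(base, dsn, ref, start, stop, types=()):
    """Build a DAS-1.x-style features request for one genomic segment."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    # DAS 1.x separates query arguments with ';' rather than '&'.
    query = f"segment={quote(ref)}:{start},{stop}"
    for feature_type in types:
        query += f";type={quote(feature_type)}"
    return f"{base.rstrip('/')}/das/{dsn}/features?{query}"


# Hypothetical server and data source name.
url = features_url("http://das.example.org", "human", "chr7", 1000, 2000, ["exon"])
print(url)
# http://das.example.org/das/human/features?segment=chr7:1000,2000;type=exon
```

	The coordinates themselves would come from the BLAST step, which is
not shown here.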

	From an informatics perspective, the following problems exist:

	1) Identifiers are not standardized in this domain: searching is
	very hard
	2) There are 400 different file formats to parse
	3) Protocols do not exist to query biological datastores
	4) Many names exist for the same thing
	5) There is no easy way to do a literature search
	6) Can't see local annotations in the context of a curator's
	database
	7) No naming conventions exist in this domain
	8) Hard to find other annotators/annotations outside a literature
	search
	9) Biological entities may be difficult to visualize


	Another Common Use Case:

	A biologist has a list of features (genes) which are suspected to
take part in a disease or pathway. They want to gain a deep understanding
of these features by performing laboratory analysis. In order to do this
they must perform the following steps:

	1) Find the region of interest in the golden path by the feature's
	identifier (gene name, exon name, contig #, id, etc.)
	2) Obtain the annotations associated with this region
	3) Filter the annotations based upon submitter or physical
	criteria (how far apart the features are, whether a feature is in
	an exon, etc.)
	4) Send this data through a laboratory processing pipeline
	5) See new annotations in the context of curated information and/or
	other collaborators' information
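
	Step 3 above (filtering by submitter or by physical criteria) can
be sketched as follows. This is only an illustration under simple
assumptions: annotations are plain records with 1-based inclusive
coordinates, and the Annotation class, its field names and the sample data
are all invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    ref: str        # reference sequence, e.g. a chromosome or contig
    start: int      # 1-based, inclusive
    stop: int       # 1-based, inclusive
    type: str
    submitter: str


def distance(a, b):
    """Bases strictly between two features; 0 if they touch or overlap."""
    if a.ref != b.ref:
        raise ValueError("features are on different reference sequences")
    return max(0, max(a.start, b.start) - min(a.stop, b.stop) - 1)


def inside_any(feature, containers):
    """True if the feature lies entirely within one of the containers."""
    return any(
        c.ref == feature.ref and c.start <= feature.start and feature.stop <= c.stop
        for c in containers
    )


# Invented sample data.
exon = Annotation("chr7", 1000, 1500, "exon", "curator")
snp = Annotation("chr7", 1200, 1200, "snp", "lab_a")
far = Annotation("chr7", 5000, 5000, "snp", "lab_b")
annotations = [exon, snp, far]

exons = [a for a in annotations if a.type == "exon"]
in_exon = [a for a in annotations if a.type == "snp" and inside_any(a, exons)]
from_lab_b = [a for a in annotations if a.submitter == "lab_b"]

print(in_exon == [snp], from_lab_b == [far], distance(snp, far))
# True True 3799
```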
	

	Hopefully, this will start a discussion about what exactly we are
trying to solve and whether we are on the same page. If we are, all the
better, but I suspect that we may not be, and we may want to start here to
define our scope.

					Best, 

					-Brian


-----------------------
Brian Gilman <gilmanb@genome.wi.mit.edu>
Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617  252 1069 / fax +1 617 252 1902


On Thu, 6 Dec 2001, Lincoln Stein wrote:

> OK, so that argues that we need to develop a common ontology to work from, 
> right?  I was beginning to think that the sentiment was that DAS should *not* 
> develop an ontology of annotation types.
> 
> Lincoln
> 
> On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > Hi Chris,
> > >
> > > If you have four different similar but not identical ontologies expressed
> > > in DAML+OIL, how does a third party provide the equivalence
> > > relationships?  Do you envision him providing an equivalence mapping for
> > > each of the 6 pairs, or mapping them all to a single common ontology?
> >
> > The all by all approach would rapidly get out of hand. I think your idea
> > of mapping to a skeleton ontology is best. One can imagine all kinds of
> > different topologies but that would be getting ahead of ourselves. The
> > important point is that the level of conformance should be optional.
> >
> > > Lincoln
> > >
> > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > I think we're going to find that the features form a DAG and not a
> > > > > > hierarchy.  Otherwise you're going to have problems classifying
> > > > > > things like "genes".  In the context of genetics, a gene is a type
> > > > > > of complementation group.  In the context of genomics, a gene is a
> > > > > > subclass of transcription features, translation features, and
> > > > > > regulatory features.
> > > > >
> > > > > Bugger.
> > > > >
> > > > > You are right. I'm glad you are going to sort out how to have an
> > > > > extensible distributed DAG system that is easy to use. ;)
> > > >
> > > > Thankfully it's already been done - cf semantic web, RDF(S), DAML+OIL
> > > > etc
> > > >
> > > > The nice thing about this is if someone doesn't like "ontology
> > > > politburo"'s classes, they can add in their own.
> > > >
> > > > Two people can develop their own similar class hierarchies without
> > > > speaking to one another, and a third person can provide equivalence
> > > > relationships mapping between the concepts; or logical rules for
> > > > inferring one from the other.
> > > >
> > > > the DAGs can be as complex as you like, giving computable semantics for
> > > > terms like "tRNA" or just a flat vocabulary, whatever you like.
> > > > Complements DAS beautifully.
> > > >
> > > > All designed to be anarchic and distributed and not an OMG committee in
> > > > sight
> > > >
> > > > > DAS mailing list
> > > > > DAS@biodas.org
> > > > > http://biodas.org/mailman/listinfo/das
> > >
> > > --
> > > ========================================================================
> > > Lincoln D. Stein                           Cold Spring Harbor Laboratory
> > > lstein@cshl.org			                  Cold Spring Harbor, NY
> > >
> > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > PLEASE WRITE FOR DETAILS.
> > > ========================================================================