[DAS] Re: Our identifier doc and proposal

Brian Gilman gilmanb@genome.wi.mit.edu
Mon, 10 Dec 2001 17:48:01 -0500 (EST)


Our docs are in VERY rough shape. I can try and get something to you after
the holidays. And yes, I'll be at the conference. 

			-B

-----------------------
Brian Gilman <gilmanb@genome.wi.mit.edu>
Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617  252 1069 / fax +1 617 252 1902


On Mon, 10 Dec 2001, Lincoln Stein wrote:

> Hi Brian,
> 
> I'd love to see your DAML+OIL draft documents.  Were these developed for the 
> I3C?  
> 
> We need to decide which entities to model before going any further.  DAS/1 
> has two objects: the map and the feature.  Brian's use cases below beg for 
> more entities, like "submitter" and "literature reference".  We could make a 
> first draft of these entities by pulling the top level objects out of BioJava 
> and BioPerl and then deciding which ones are in the DAS scope.  This would 
> also help down the road in creating Bio{Java,Perl}-compatible APIs.  Does 
> this sound like a reasonable approach?
> 
> The use cases are very enlightening.  Do you, or others on the mailing list, 
> have more?
> 
> I will be at the O'Reilly conference, as will Ewan and (I think) Brian.  How 
> many people from the DAS mailing list will also be there?  This would be a 
> good opportunity to nail down the plan.
> 
> > 	I agree with Thomas and Matthew in their assessment of the wording
> > of the document and the practical matter of being able to define your
> > own local ontology. 
> 
> As explained earlier, this was just poor wording in the document.  We're in 
> agreement on how the local ontologies should work.
> 
> Lincoln
> 
> On Thursday 06 December 2001 20:43, Brian Gilman wrote:
> > Hello,
> >
> > 	Sorry for the late response to the doc! We're trying to get a
> > release out the door.
> >
> > 	I don't see any problem with DAML+OIL and UDDI; I think they are
> > complementary. The way I understand DAML+OIL is that it is trying to allow
> > the ontologist to describe the semantic relationships amongst entities.
> > UDDI (Universal Description, Discovery and Integration) is used to find the
> > repository which holds the entities of interest. You can't really have one
> > without the other.
> >
> > 	I agree with Thomas and Matthew in their assessment of the wording
> > of the document and the practical matter of being able to define your
> > own local ontology. When a client asks for an explicit entity in the
> > database it should give it back without having to go and ask the base
> > ontology what the user was asking for. I don't see this as a technical
> > challenge; it comes back to the identification of entities in the database
> > and a common way of representing them. Might I suggest the following
> > approach: we first settle on a UUID scheme for entities in the database
> > over e-mail, then try to hold a face-to-face meeting to hash out a
> > skeleton ontology for genomics. I would be happy to set up this
> > meeting at Whitehead. I think this kind of discussion needs more bandwidth
> > and a whiteboard. Or we could have a meeting at the O'Reilly conference at
> > the end of January? Perhaps we could try and send Lincoln/the list  our
> > draft DAML+OIL documents to get a start on the problem and then discuss at
> > the face to face meeting?  Do others agree with this or am I out of left
> > field?
> >
> > 	Perhaps a little domain engineering is needed here: As Lincoln has
> > stated, people are thinking of DAS as a "cure all" for the problems in
> > genomics. Well folks, it's not, plain and simple BUT IT HAS GREAT
> > POTENTIAL. People are now viewing it as the panacea or lingua franca of
> > bioinformatics when in reality it leaves a large portion of the
> > bioinformatics/biology community out to dry. It also seems to me that we
> > have not, as a community of DAS users/providers, defined the problems that
> > we are trying to address. Might it be more productive to first define a
> > few problems that each of us is faced with, to see if we are on the
> > same page?
> >
> > 	I will try and list mine below:
> >
> > ************************************************************************
> >
> > 	A list of genes is suspected to be involved in a disease pathway.
> > The researcher wants to retrieve all or a subset of the annotations for
> > this list of genes. These annotations may suggest a deeper analysis of
> > the region, i.e., a literature search for related annotations or further
> > computational analysis, which may lead to the discovery of a functional
> > transcript. Finally, the researcher identifies a possible transcript, and
> > protein expression analysis may begin.
> >
> >
> > Sequence data has been obtained for a region of interest through
> > sequencing techniques. This sequence is BLASTed against the latest
> > physical map of the genome to find its genomic coordinates. These
> > coordinates are used to poll annotation databases for annotations of
> > interest. In some cases the sequence is BLASTed against a protein
> > database to look for protein identity or family membership.
> >
> >
> > The physical coordinate is known for a feature on the genome that is
> > suspected to take part in a metabolic or disease pathway.
> > Other annotations are then pulled off the web in order to gain
> > a better understanding of the region. This evidence
> > narrows the physical coordinates of the region of interest. If no
> > annotations exist, then a computational annotation may be used in place
> > of a biological one. This may lead to the discovery of a novel protein.
> > If no protein is found, the region may be re-searched with other
> > computational tools.
> >
> > *************************************************************************
> >
> > 	From an informatics perspective, the following problems exist:
> >
> > 	1) Identifiers are not standardized in this domain: Searching is
> > 	very hard
> > 	2) There are 400 different file formats to parse
> > 	3) Protocols do not exist to query biological datastores
> > 	4) Many names exist for the same thing
> > 	5) There is no easy way to do a literature search
> > 	6) Can't see local annotations in the context of the curator's database
> > 	7) No naming conventions exist in this domain
> > 	8) Hard to find other annotators/annotations outside literature
> > 	search
> > 	9) Biological entities may be difficult to visualize
> >
> >
> > 	Other Common Use Case:
> >
> > 	A biologist has a list of features (genes) which are suspected to
> > take part in a disease or pathway. They want to gain a deep understanding
> > of these features by performing laboratory analysis. In order to do this
> > they must perform the following functions:
> >
> > 	1) Find the region of interest in the golden path by the feature's
> > 	identifier (gene name, exon name, contig #, id, etc.)
> > 	2) Obtain the annotations associated with this region
> > 	3) Filter the annotations based upon submitter or physical
> > 	criteria:
> > 	(how far apart are the features. Is this feature in an exon etc.)
> > 	4) Send this data through a laboratory processing pipeline
> > 	5) See new annotations in context of curated information and/or
> > 	other collaborators' information
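
Step 3 of the list above (filtering annotations by submitter or by physical criteria) can be sketched roughly as follows. This is only an illustration: the `Annotation` record, its field names, the `filter_annotations` helper, and the sample data are all hypothetical and are not part of DAS or of any proposal in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; field names are assumptions."""
    feature_id: str
    start: int
    end: int
    submitter: str


def filter_annotations(annotations, submitter=None, region=None):
    """Keep annotations from `submitter` (if given) that overlap `region` (if given).

    `region` is an inclusive (start, end) pair of coordinates.
    """
    kept = []
    for ann in annotations:
        if submitter is not None and ann.submitter != submitter:
            continue
        if region is not None:
            region_start, region_end = region
            # Drop annotations that lie entirely outside the region.
            if ann.end < region_start or ann.start > region_end:
                continue
        kept.append(ann)
    return kept


# Made-up sample data, for illustration only.
annotations = [
    Annotation("exon1", 100, 200, "lab_a"),
    Annotation("exon2", 500, 650, "lab_b"),
    Annotation("snp7", 150, 150, "lab_b"),
]
in_region = filter_annotations(annotations, region=(90, 300))
from_lab_b = filter_annotations(annotations, submitter="lab_b", region=(90, 300))
```

Here `in_region` keeps the two annotations that overlap coordinates 90 to 300, and `from_lab_b` narrows that further to a single submitter.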
> >
> >
> > 	Hopefully, this will start discussions as to what exactly we are
> > trying to solve and whether we are on the same page. If we are, all the
> > better, but I suspect that we may not be, and we may want to start here
> > to define our scope.
> >
> > 					Best,
> >
> > 					-Brian
> >
> >
> > -----------------------
> > Brian Gilman <gilmanb@genome.wi.mit.edu>
> > Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> > One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> > phone +1 617  252 1069 / fax +1 617 252 1902
> >
> > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > OK, so that argues that we need to develop a common ontology to work
> > > from, right?  I was beginning to think that the sentiment was that DAS
> > > should *not* develop an ontology of annotation types.
> > >
> > > Lincoln
> > >
> > > On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > > Hi Chris,
> > > > >
> > > > > If you have four different similar but not identical ontologies
> > > > > expressed in DAML+OIL, how does a third party provide the equivalence
> > > > > > relationships?  Do you envision him providing an equivalence mapping
> > > > > for each of the 6 pairs, or mapping them all to a single common
> > > > > ontology?
> > > >
> > > > The all by all approach would rapidly get out of hand. I think your
> > > > idea of mapping to a skeleton ontology is best. One can imagine all
> > > > > kinds of different topologies but that would be getting ahead of
> > > > ourselves. The important point is that the level of conformance should
> > > > be optional.
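
The counting behind this exchange can be sketched as below: four ontologies need six pairwise mappings, but only four mappings if each is mapped once to a shared skeleton ontology. The ontology names and the two helper functions are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def pairwise_mappings(ontologies):
    """All-by-all: one equivalence mapping per unordered pair of ontologies."""
    return list(combinations(ontologies, 2))


def hub_mappings(ontologies, hub="skeleton"):
    """Hub approach: one mapping from each ontology to a shared skeleton."""
    return [(name, hub) for name in ontologies]


ontologies = ["A", "B", "C", "D"]  # hypothetical ontology names
all_pairs = pairwise_mappings(ontologies)
via_hub = hub_mappings(ontologies)
```

The pairwise count grows as n(n-1)/2 while the hub count grows as n, which is why the all-by-all approach gets out of hand as more ontologies appear.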
> > > >
> > > > > Lincoln
> > > > >
> > > > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > > > I think we're going to find that the features form a DAG and
> > > > > > > > not a hierarchy.  Otherwise you're going to have problems
> > > > > > > > classifying things like "genes".  In the context of genetics, a
> > > > > > > > gene is a type of complementation group.  In the context of
> > > > > > > > genomics, a gene is a subclass of transcription features,
> > > > > > > > translation features, and regulatory features.
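
A minimal sketch of why this calls for a DAG rather than a tree: "gene" has several direct parents at once. The term names follow the paragraph above, but the root term "feature", the `PARENTS` table, and the helper functions are assumptions made for illustration and do not come from any existing ontology.

```python
# Hypothetical feature-type graph: each term lists its direct parents.
PARENTS = {
    "feature": [],  # assumed common root, not named in the thread
    "complementation_group": ["feature"],
    "transcription_feature": ["feature"],
    "translation_feature": ["feature"],
    "regulatory_feature": ["feature"],
    # More than one parent: impossible in a strict hierarchy, fine in a DAG.
    "gene": [
        "complementation_group",
        "transcription_feature",
        "translation_feature",
        "regulatory_feature",
    ],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_a(term, candidate, parents=PARENTS):
    """True if `term` is `candidate` or a descendant of it."""
    return term == candidate or candidate in ancestors(term, parents)
```

With this table, `is_a("gene", "regulatory_feature")` and `is_a("gene", "complementation_group")` both hold, which a single-parent tree could not express.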
> > > > > > >
> > > > > > > Bugger.
> > > > > > >
> > > > > > > You are right. I'm glad you are going to sort out how to have an
> > > > > > > extensible distributed DAG system that is easy to use. ;)
> > > > > >
> > > > > > Thankfully it's already been done - cf semantic web, RDF(S),
> > > > > > DAML+OIL etc
> > > > > >
> > > > > > The nice thing about this is if someone doesn't like "ontology
> > > > > > politburo"'s classes, they can add in their own.
> > > > > >
> > > > > > Two people can develop their own similar class hierarchies without
> > > > > > speaking to one another, and a third person can provide equivalence
> > > > > > relationships mapping between the concepts; or logical rules for
> > > > > > inferring one from the other.
> > > > > >
> > > > > > > The DAGs can be as complex as you like, giving computable semantics
> > > > > > for terms like "tRNA" or just a flat vocabulary, whatever you like.
> > > > > > Complements DAS beautifully.
> > > > > >
> > > > > > All designed to be anarchic and distributed and not an OMG
> > > > > > committee in sight
> > > > > >
> > > > > > > DAS mailing list
> > > > > > > DAS@biodas.org
> > > > > > > http://biodas.org/mailman/listinfo/das
> > > > >
> > > > > --
> > > > > =====================================================================
> > > > >=== Lincoln D. Stein                           Cold Spring Harbor
> > > > > Laboratory lstein@cshl.org			                  Cold Spring Harbor, NY
> > > > >
> > > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > > PLEASE WRITE FOR DETAILS.
> > > > > =====================================================================
> > > > >===
> > >
> > > --
> > > ========================================================================
> > > Lincoln D. Stein                           Cold Spring Harbor Laboratory
> > > lstein@cshl.org			                  Cold Spring Harbor, NY
> > >
> > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > PLEASE WRITE FOR DETAILS.
> > > ========================================================================
> 
>