[DAS] Re: Our identifier doc and proposal

Lincoln Stein lstein@cshl.org
Mon, 10 Dec 2001 16:28:23 -0500


Hi Brian,

I'd love to see your DAML+OIL draft documents.  Were these developed for the 
I3C?  

We need to decide which entities to model before going any further.  DAS/1 
has two objects: the map and the feature.  Brian's use cases below beg for 
more entities, like "submitter" and "literature reference".  We could make a 
first draft of these entities by pulling the top level objects out of BioJava 
and BioPerl and then deciding which ones are in the DAS scope.  This would 
also help down the road in creating Bio{Java,Perl}-compatible APIs.  Does 
this sound like a reasonable approach?
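
To make the question concrete, here is a rough sketch (Python purely for 
illustration) of what a first-draft entity list might look like.  Only the 
map and the feature come from DAS/1.  Submitter and LiteratureReference are 
the candidates suggested by the use cases.  Every field name, and the helper 
features_by_submitter, is a hypothetical placeholder and is not taken from 
BioJava, BioPerl, or the DAS spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submitter:
    """Candidate entity: who provided an annotation (hypothetical fields)."""
    id: str
    name: str


@dataclass
class LiteratureReference:
    """Candidate entity: a citation attached to a feature (hypothetical fields)."""
    id: str
    citation: str


@dataclass
class Map:
    """Existing DAS/1 object: the reference map that features are placed on."""
    id: str
    length: int


@dataclass
class Feature:
    """Existing DAS/1 object, extended here with the two candidate entities."""
    id: str
    map_id: str
    start: int
    end: int
    type: str
    submitter: Optional[Submitter] = None
    references: List[LiteratureReference] = field(default_factory=list)


def features_by_submitter(features: List[Feature], submitter_id: str) -> List[Feature]:
    """Return the features whose submitter has the given id."""
    return [
        f for f in features
        if f.submitter is not None and f.submitter.id == submitter_id
    ]
```

Whether the submitter and the literature reference belong inside the feature 
like this, or stand alone as top-level objects, is exactly the kind of scope 
decision that needs to be made.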

The use cases are very enlightening.  Do you, or others on the mailing list, 
have more?

I will be at the O'Reilly conference, as will Ewan and (I think) Brian.  How 
many people from the DAS mailing list will also be there?  This would be a 
good opportunity to nail down the plan.

> 	I agree with Thomas and Matthew in their assessment of the wording
> of the document and the practical matter of being able to define your
> own local ontology. 

As explained earlier, this was just poor wording in the document.  We're in 
agreement on how the local ontologies should work.

Lincoln

On Thursday 06 December 2001 20:43, Brian Gilman wrote:
> Hello,
>
> 	Sorry for the late response to the doc! We're trying to get a
> release out the door.
>
> 	I don't see any problem with DAML+OIL and UDDI; I think they are
> complementary. The way I understand DAML+OIL is that it is trying to allow
> the ontologist to describe the semantic relationships amongst entities.
> UDDI (Universal Description, Discovery and Integration) is used to find the
> repository which holds the entities of interest. You can't really have one
> without the other.
>
> 	I agree with Thomas and Matthew in their assessment of the wording
> of the document and the practical matter of being able to define your
> own local ontology. When a client asks for an explicit entity in the
> database it should give it back without having to go and ask the base
> ontology what the user was asking for. I don't see this as a technical
> challenge; it comes back to the identification of entities in the database
> and a common way of representing them. Might I suggest the following
> approach: we settle on a UUID scheme for entities in the database
> over e-mail, then try to hold a face-to-face meeting to hash out a
> skeleton ontology for genomics. I would be happy to set up this
> meeting at Whitehead. I think this kind of discussion needs more bandwidth
> and a whiteboard. Or we could have a meeting at the O'Reilly conference at
> the end of January? Perhaps we could send Lincoln/the list our
> draft DAML+OIL documents to get a start on the problem and then discuss at
> the face-to-face meeting?  Do others agree with this or am I out of left
> field?
>
> 	Perhaps a little domain engineering is needed here: As Lincoln has
> stated, people are thinking of DAS as a "cure all" for the problems in
> genomics. Well folks, it's not, plain and simple BUT IT HAS GREAT
> POTENTIAL. People are now viewing it as the panacea or lingua franca of
> bioinformatics when in reality it leaves a large portion of the
> bioinformatics/biology community out to dry. It also seems to me that we
> have not, as a community of DAS users/providers, defined the problems that
> we are trying to address. Might it be more productive to first define a
> few problems that each of us is faced with, to see if we are on the
> same page?
>
> 	I will try and list mine below:
>
> ************************************************************************
>
> 	A list of genes is suspected to be involved in a disease pathway.
> The researcher wants to retrieve all or a subset of annotations for this
> list of genes.  These annotations may suggest a deeper analysis of the
> region, i.e. a literature search for related annotations or further
> computational analysis, which may lead to the discovery of a functional
> transcript.  Finally, the researcher identifies a possible transcript, and
> protein expression analysis may begin.
>
>
> Sequence data has been obtained for a region of interest through
> sequencing techniques. This sequence is BLASTed against the latest
> physical map of the genome to find its genomic coordinates. These
> coordinates are used to poll annotation databases for annotations of
> interest. In some cases the sequence is BLASTed against a protein
> database to look for protein identity or family membership.
>
>
> The physical coordinate is known for a feature on the genome that is
> suspected to take part in a metabolic or disease pathway.
> Other annotations are then pulled off the web in order to gain
> a better understanding of the region.  This evidence narrows the
> physical coordinates of the region of interest. If no annotations
> exist, then a computational annotation may be used instead of a
> biological one. This may lead to the discovery of a novel protein. If no
> protein is found, the region may be re-searched with other computational
> tools.
>
> *************************************************************************
>
> 	From an informatics perspective, the following problems exist:
>
> 	1) Identifiers are not standardized in this domain: searching is
> 	very hard
> 	2) There are 400 different file formats to parse
> 	3) Protocols do not exist to query biological datastores
> 	4) Many names exist for the same thing
> 	5) There is no easy way to do a literature search
> 	6) Can't see local annotations in the context of the curators' database
> 	7) No naming conventions exist in this domain
> 	8) Hard to find other annotators/annotations outside literature
> 	search
> 	9) Biological entities may be difficult to visualize
>
>
> 	Other Common Use Case:
>
> 	A biologist has a list of features (genes) which are suspected to
> take part in a disease or pathway. They want to gain a deep understanding
> of these features by performing laboratory analysis. In order to do this
> they must perform the following functions:
>
> 	1) Find the region of interest in the golden path by the feature's
> 	identifier (gene name, exon name, contig #, id, etc.)
> 	2) Obtain the annotations associated with this region
> 	3) Filter the annotations based upon submitter or physical
> 	criteria (how far apart are the features? is this feature in an
> 	exon? etc.)
> 	4) Send this data through a laboratory processing pipeline
> 	5) See new annotations in the context of curated information and/or
> 	other collaborators' information
>
>
> 	Hopefully, this will start discussions as to what exactly we are
> trying to solve and whether we are on the same page. If we are, all the
> better, but I suspect that we may not be, and we may want to start here to
> define our scope.
>
> 					Best,
>
> 					-Brian
>
>
> -----------------------
> Brian Gilman <gilmanb@genome.wi.mit.edu>
> Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> phone +1 617  252 1069 / fax +1 617 252 1902
>
> On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > OK, so that argues that we need to develop a common ontology to work
> > from, right?  I was beginning to think that the sentiment was that DAS
> > should *not* develop an ontology of annotation types.
> >
> > Lincoln
> >
> > On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > Hi Chris,
> > > >
> > > > If you have four different similar but not identical ontologies
> > > > expressed in DAML+OIL, how does a third party provide the equivalence
> > > > > relationships?  Do you envision him providing an equivalence mapping
> > > > for each of the 6 pairs, or mapping them all to a single common
> > > > ontology?
> > >
> > > The all by all approach would rapidly get out of hand. I think your
> > > idea of mapping to a skeleton ontology is best. One can imagine all
> > > > kinds of different topologies but that would be getting ahead of
> > > ourselves. The important point is that the level of conformance should
> > > be optional.
> > >
> > > > Lincoln
> > > >
> > > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > > I think we're going to find that the features form a DAG and
> > > > > > > not a hierarchy.  Otherwise you're going to have problems
> > > > > > > classifying things like "genes".  In the context of genetics, a
> > > > > > > gene is a type of complementation group.  In the context of
> > > > > > > genomics, a gene is a subclass of transcription features,
> > > > > > > translation features, and regulatory features.
> > > > > >
> > > > > > Bugger.
> > > > > >
> > > > > > You are right. I'm glad you are going to sort out how to have an
> > > > > > extensible distributed DAG system that is easy to use. ;)
> > > > >
> > > > > > > Thankfully it's already been done - cf. the semantic web, RDF(S),
> > > > > > > DAML+OIL, etc.
> > > > >
> > > > > The nice thing about this is if someone doesn't like "ontology
> > > > > politburo"'s classes, they can add in their own.
> > > > >
> > > > > Two people can develop their own similar class hierarchies without
> > > > > speaking to one another, and a third person can provide equivalence
> > > > > relationships mapping between the concepts; or logical rules for
> > > > > inferring one from the other.
> > > > >
> > > > > > > The DAGs can be as complex as you like, giving computable semantics
> > > > > for terms like "tRNA" or just a flat vocabulary, whatever you like.
> > > > > Complements DAS beautifully.
> > > > >
> > > > > All designed to be anarchic and distributed and not an OMG
> > > > > committee in sight
> > > > >
> > > > > > DAS mailing list
> > > > > > DAS@biodas.org
> > > > > > http://biodas.org/mailman/listinfo/das
> > > >
> > > > --
> > > > =====================================================================
> > > >=== Lincoln D. Stein                           Cold Spring Harbor
> > > > Laboratory lstein@cshl.org			                  Cold Spring Harbor, NY
> > > >
> > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > PLEASE WRITE FOR DETAILS.
> > > > =====================================================================
> > > >===
> >
> > --
> > ========================================================================
> > Lincoln D. Stein                           Cold Spring Harbor Laboratory
> > lstein@cshl.org			                  Cold Spring Harbor, NY
> >
> > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > PLEASE WRITE FOR DETAILS.
> > ========================================================================

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================