[DAS] Re: Our identifier doc and proposal
Brian Gilman
gilmanb@genome.wi.mit.edu
Mon, 10 Dec 2001 17:48:01 -0500 (EST)
Our docs are in VERY rough shape. I can try and get something to you after
the holidays. And yes, I'll be at the conference.
-B
-----------------------
Brian Gilman <gilmanb@genome.wi.mit.edu>
Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617 252 1069 / fax +1 617 252 1902
On Mon, 10 Dec 2001, Lincoln Stein wrote:
> Hi Brian,
>
> I'd love to see your DAML+OIL draft documents. Were these developed for the
> I3C?
>
> We need to decide which entities to model before going any further. DAS/1
> has two objects: the map and the feature. Brian's use cases below beg for
> more entities, like "submitter" and "literature reference". We could make a
> first draft of these entities by pulling the top level objects out of BioJava
> and BioPerl and then deciding which ones are in the DAS scope. This would
> also help down the road in creating Bio{Java,Perl}-compatible APIs. Does
> this sound like a reasonable approach?
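
As a rough illustration of what a first draft of those entities might look like, here is a minimal sketch. It assumes nothing about the actual DAS/1, BioJava, or BioPerl object models: every class and field name below is hypothetical and chosen only to show "map" and "feature" alongside the proposed "submitter" and "literature reference".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submitter:
    # Hypothetical entity: who contributed an annotation.
    name: str
    institution: str = ""


@dataclass
class LiteratureReference:
    # Hypothetical entity: a citation attached to a feature.
    citation: str


@dataclass
class Feature:
    # One of the two existing DAS/1 objects; field names are illustrative only.
    feature_id: str
    feature_type: str
    start: int
    end: int
    submitter: Optional[Submitter] = None
    references: List[LiteratureReference] = field(default_factory=list)


@dataclass
class Map:
    # The other existing DAS/1 object: a coordinate system holding features.
    name: str
    features: List[Feature] = field(default_factory=list)

    def features_by_submitter(self, submitter_name: str) -> List[Feature]:
        """Return the features contributed by the named submitter."""
        return [
            f for f in self.features
            if f.submitter is not None and f.submitter.name == submitter_name
        ]
```

Modelling the submitter as its own object rather than a string on the feature is one possible choice; it would let a client filter by submitter without parsing free text.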
>
> The use cases are very enlightening. Do you, or others on the mailing list,
> have more?
>
> I will be at the O'Reilly conference, as will Ewan and (I think) Brian. How
> many people from the DAS mailing list will also be there? This would be a
> good opportunity to nail down the plan.
>
> > I agree with Thomas and Matthew in their assessment of the wording
> > of the document and the practical matter of being able to define your
> > own local ontology.
>
> As explained earlier, this was just poor wording in the document. We're in
> agreement on how the local ontologies should work.
>
> Lincoln
>
> On Thursday 06 December 2001 20:43, Brian Gilman wrote:
> > Hello,
> >
> > Sorry for the late response to the doc! We're trying to get a
> > release out the door.
> >
> > I don't see any problem with DAML+OIL and UDDI; I think they are
> > complementary. The way I understand DAML+OIL is that it is trying to allow
> > the ontologist to describe the semantic relationships amongst entities.
> > UDDI (Universal Description, Discovery and Integration) is used to find
> > the repository which holds the entities of interest. You can't really have
> > one without the other.
> >
> > I agree with Thomas and Matthew in their assessment of the wording
> > of the document and the practical matter of being able to define your
> > own local ontology. When a client asks for an explicit entity in the
> > database, the database should return it without having to go and ask the
> > base ontology what the user was asking for. I don't see this as a
> > technical challenge; it comes back to the identification of entities in
> > the database and a common way of representing them. Might I suggest the
> > following approach: we settle on a UUID scheme for entities in the
> > database over e-mail, then try to hold a face-to-face meeting to hash out
> > a skeleton ontology for genomics. I would be happy to set up this meeting
> > at Whitehead. I think this kind of discussion needs more bandwidth and a
> > whiteboard. Or we could have a meeting at the O'Reilly conference at the
> > end of January? Perhaps we could send Lincoln/the list our draft DAML+OIL
> > documents to get a start on the problem and then discuss at the
> > face-to-face meeting? Do others agree with this or am I out of left
> > field?
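
One possible reading of a "UUID scheme for entities" is a name-based UUID derived from the data source and the local identifier, so that the same entity always maps to the same UUID without a central registry. The sketch below uses Python's standard uuid module; the namespace, the function name entity_uuid, and the "source/local_id" naming convention are all assumptions made for illustration, not anything agreed on the list.

```python
import uuid

# Hypothetical namespace: a real scheme would need an agreed-upon value.
DAS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "biodas.org")


def entity_uuid(source: str, local_id: str) -> uuid.UUID:
    """Derive a deterministic (version 5) UUID for an entity.

    The same (source, local_id) pair always yields the same UUID, so two
    servers can compute identical identifiers independently.
    """
    return uuid.uuid5(DAS_NAMESPACE, f"{source}/{local_id}")


if __name__ == "__main__":
    # Example identifiers are made up for the demonstration.
    print(entity_uuid("example-lab", "contig:1234"))
```

A random (version 4) UUID would also give uniqueness, but it would have to be stored and exchanged; the name-based form can be recomputed by anyone who knows the source and local identifier.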
> >
> > Perhaps a little domain engineering is needed here: As Lincoln has
> > stated, people are thinking of DAS as a "cure all" for the problems in
> > genomics. Well folks, it's not, plain and simple BUT IT HAS GREAT
> > POTENTIAL. People are now viewing it as the panacea or lingua franca of
> > bioinformatics when in reality it leaves a large portion of the
> > bioinformatics/biology community out to dry. It also seems to me that we
> > have not, as a community of DAS users/providers, defined the problems that
> > we are trying to address. Might it be more productive to first define a
> > few problems that each of us faces, to see whether we are on the same
> > page?
> >
> > I will try and list mine below:
> >
> > ************************************************************************
> >
> > A list of genes is suspected to be involved in a disease pathway.
> > The researcher wants to retrieve all or a subset of the annotations for
> > this list of genes. These annotations may suggest a deeper analysis of the
> > region, i.e. a literature search for related annotations or further
> > computational analysis, which may lead to the discovery of a functional
> > transcript. Finally, the researcher identifies a possible transcript, and
> > protein expression analysis may begin.
> >
> >
> > Sequence data has been obtained for a region of interest through
> > sequencing techniques. This sequence is BLASTed against the latest
> > physical map of the genome to find its genomic coordinates. These
> > coordinates are used to poll annotation databases for annotations of
> > interest. In some cases the sequence is BLASTed against a protein
> > database to look for protein identity or family membership.
> >
> >
> > The physical coordinate is known for a feature on the genome that is
> > suspected to take part in a metabolic or disease pathway. Other
> > annotations are then pulled off the web in order to gain a better
> > understanding of the region. This evidence narrows the physical
> > coordinates of the region of interest. If no annotations exist, then a
> > computational annotation may be used instead of a biological one. This may
> > lead to the discovery of a novel protein. If no protein is found, the
> > region may be re-searched with other computational tools.
> >
> > *************************************************************************
> >
> > From an informatics perspective, the following problems exist:
> >
> > 1) Identifiers are not standardized in this domain, so searching is
> >    very hard
> > 2) There are 400 different file formats to parse
> > 3) Protocols do not exist to query biological datastores
> > 4) Many names exist for the same thing
> > 5) There is no easy way to do a literature search
> > 6) Local annotations can't be seen in the context of a curator's database
> > 7) No naming conventions exist in this domain
> > 8) It is hard to find other annotators/annotations outside of a
> >    literature search
> > 9) Biological entities may be difficult to visualize
> >
> >
> > Another Common Use Case:
> >
> > A biologist has a list of features (genes) which are suspected to
> > take part in a disease or pathway. They want to gain a deep understanding
> > of these features by performing laboratory analysis. In order to do this
> > they must perform the following functions:
> >
> > 1) Find the region of interest in the golden path by the feature's
> >    identifier (gene name, exon name, contig #, id, etc.)
> > 2) Obtain the annotations associated with this region
> > 3) Filter the annotations based upon submitter or physical criteria
> >    (how far apart the features are, whether a feature is in an exon, etc.)
> > 4) Send this data through a laboratory processing pipeline
> > 5) See new annotations in the context of curated information and/or
> >    other collaborators' information
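
Step 3 of the list above (filtering by submitter or by physical criteria) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Annotation record, its field names, and the helper functions are hypothetical and do not come from any DAS specification, and coordinates are treated as simple integer positions on one sequence.

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    # Hypothetical record; the field names are illustrative only.
    name: str
    kind: str
    start: int
    end: int
    submitter: str


def distance(a: Annotation, b: Annotation) -> int:
    """Distance between the nearest ends of two annotations; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def is_within(inner: Annotation, outer: Annotation) -> bool:
    """True if `inner` lies entirely inside `outer` (e.g. a feature in an exon)."""
    return outer.start <= inner.start and inner.end <= outer.end


def filter_annotations(
    annotations: List[Annotation],
    submitter: Optional[str] = None,
    inside: Optional[Annotation] = None,
) -> List[Annotation]:
    """Keep annotations matching every criterion that was supplied."""
    kept = []
    for ann in annotations:
        if submitter is not None and ann.submitter != submitter:
            continue
        if inside is not None and not is_within(ann, inside):
            continue
        kept.append(ann)
    return kept
```

For example, passing an exon as `inside` together with a submitter name would keep only that submitter's annotations that fall within the exon.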
> >
> >
> > Hopefully, this will start discussions as to what exactly we are
> > trying to solve and whether we are on the same page. If we are, all the
> > better, but I suspect that we may not be, and we may want to start here to
> > define our scope.
> >
> > Best,
> >
> > -Brian
> >
> >
> > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > OK, so that argues that we need to develop a common ontology to work
> > > from, right? I was beginning to think that the sentiment was that DAS
> > > should *not* develop an ontology of annotation types.
> > >
> > > Lincoln
> > >
> > > On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > > Hi Chris,
> > > > >
> > > > > If you have four different similar but not identical ontologies
> > > > > expressed in DAML+OIL, how does a third party provide the equivalence
> > > > > > relationships? Do you envision him providing an equivalence mapping
> > > > > for each of the 6 pairs, or mapping them all to a single common
> > > > > ontology?
> > > >
> > > > The all by all approach would rapidly get out of hand. I think your
> > > > idea of mapping to a skeleton ontology is best. One can imagine all
> > > > > kinds of different topologies but that would be getting ahead of
> > > > ourselves. The important point is that the level of conformance should
> > > > be optional.
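
The arithmetic behind this exchange: n ontologies need n(n-1)/2 pairwise mappings (6 for four ontologies, as Lincoln notes) but only n mappings if each one maps to a shared skeleton. The sketch below shows the counts and a translation that goes through the skeleton. The ontology names, terms, and the TO_SKELETON table are invented for illustration; a term with no skeleton mapping simply fails to translate, which is one way to read "the level of conformance should be optional".

```python
from typing import Dict, Optional


def pairwise_mapping_count(n: int) -> int:
    """Number of mappings if every pair of ontologies is mapped directly."""
    return n * (n - 1) // 2


def skeleton_mapping_count(n: int) -> int:
    """Number of mappings if each ontology maps only to a shared skeleton."""
    return n


# Hypothetical local-term -> skeleton-term tables for two providers.
TO_SKELETON: Dict[str, Dict[str, str]] = {
    "lab_a": {"mRNA_transcript": "transcript", "coding_exon": "exon"},
    "lab_b": {"tx": "transcript"},  # lab_b chose not to map its exon terms
}


def translate(
    term: str,
    source: str,
    target: str,
    to_skeleton: Dict[str, Dict[str, str]] = TO_SKELETON,
) -> Optional[str]:
    """Translate a local term from `source` to `target` via the skeleton.

    Returns None when either side has no mapping for the concept.
    """
    shared = to_skeleton.get(source, {}).get(term)
    if shared is None:
        return None
    for local_term, skeleton_term in to_skeleton.get(target, {}).items():
        if skeleton_term == shared:
            return local_term
    return None
```

With ten ontologies the direct approach would need 45 mappings against 10 for the skeleton, which is the "rapidly get out of hand" growth described above.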
> > > >
> > > > > Lincoln
> > > > >
> > > > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > > > I think we're going to find that the features form a DAG and
> > > > > > > > not a hierarchy. Otherwise you're going to have problems
> > > > > > > > classifying things like "genes". In the context of genetics, a
> > > > > > > > gene is a type of complementation group. In the context of
> > > > > > > > genomics, a gene is a subclass of transcription features,
> > > > > > > > translation features, and regulatory features.
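
The difference between a hierarchy and a DAG here is that a term may have more than one parent. A minimal sketch, using the "gene" example from the paragraph above: the four parents of "gene" come from that text, while the two root terms ("genetic feature", "genomic feature") and the function names are assumptions added only to make the example complete.

```python
from typing import Dict, List, Set

# Each term lists its direct parents; more than one parent makes this a DAG.
PARENTS: Dict[str, List[str]] = {
    "gene": [
        "complementation group",
        "transcription feature",
        "translation feature",
        "regulatory feature",
    ],
    "complementation group": ["genetic feature"],    # hypothetical root
    "transcription feature": ["genomic feature"],    # hypothetical root
    "translation feature": ["genomic feature"],
    "regulatory feature": ["genomic feature"],
    "genetic feature": [],
    "genomic feature": [],
}


def ancestors(term: str, parents: Dict[str, List[str]] = PARENTS) -> Set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_a(term: str, candidate: str, parents: Dict[str, List[str]] = PARENTS) -> bool:
    """True if `candidate` is the term itself or one of its ancestors."""
    return term == candidate or candidate in ancestors(term, parents)
```

In a strict tree, "gene" would have to sit under exactly one of these parents; storing a list of parents per term is what lets it be classified under genetics and genomics at once.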
> > > > > > >
> > > > > > > Bugger.
> > > > > > >
> > > > > > > You are right. I'm glad you are going to sort out how to have an
> > > > > > > extensible distributed DAG system that is easy to use. ;)
> > > > > >
> > > > > > Thankfully it's already been done - cf semantic web, RDF(S),
> > > > > > DAML+OIL etc
> > > > > >
> > > > > > The nice thing about this is if someone doesn't like "ontology
> > > > > > politburo"'s classes, they can add in their own.
> > > > > >
> > > > > > Two people can develop their own similar class hierarchies without
> > > > > > speaking to one another, and a third person can provide equivalence
> > > > > > relationships mapping between the concepts; or logical rules for
> > > > > > inferring one from the other.
> > > > > >
> > > > > > > The DAGs can be as complex as you like, giving computable semantics
> > > > > > for terms like "tRNA" or just a flat vocabulary, whatever you like.
> > > > > > Complements DAS beautifully.
> > > > > >
> > > > > > All designed to be anarchic and distributed and not an OMG
> > > > > > committee in sight
> > > > > >
> > > > > > > DAS mailing list
> > > > > > > DAS@biodas.org
> > > > > > > http://biodas.org/mailman/listinfo/das
> > > > >
> > > > > --
> > > > > =====================================================================
> > > > >=== Lincoln D. Stein Cold Spring Harbor
> > > > > Laboratory lstein@cshl.org Cold Spring Harbor, NY
> > > > >
> > > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > > PLEASE WRITE FOR DETAILS.
> > > > > =====================================================================
> > > > >===
> > >
>