[DAS] Re: Our identifier doc and proposal
Lincoln Stein
lstein@cshl.org
Tue, 11 Dec 2001 12:38:18 -0500
Even in rough shape I think they'll be very useful. And they'll probably
make great holiday reading too!
Lincoln
On Monday 10 December 2001 17:48, Brian Gilman wrote:
> Our docs are in VERY rough shape. I can try and get something to you after
> the holidays. And yes, I'll be at the conference.
>
> -B
>
> -----------------------
> Brian Gilman <gilmanb@genome.wi.mit.edu>
> Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> phone +1 617 252 1069 / fax +1 617 252 1902
>
> On Mon, 10 Dec 2001, Lincoln Stein wrote:
> > Hi Brian,
> >
> > I'd love to see your DAML+OIL draft documents. Were these developed for
> > the I3C?
> >
> > We need to decide which entities to model before going any further.
> > DAS/1 has two objects: the map and the feature. Brian's use cases below
> > beg for more entities, like "submitter" and "literature reference". We
> > could make a first draft of these entities by pulling the top level
> > objects out of BioJava and BioPerl and then deciding which ones are in
> > the DAS scope. This would also help down the road in creating
> > Bio{Java,Perl}-compatible APIs. Does this sound like a reasonable
> > approach?
> >
> > The use cases are very enlightening. Do you, or others on the mailing
> > list, have more?
> >
> > I will be at the O'Reilly conference, as will Ewan and (I think) Brian.
> > How many people from the DAS mailing list will also be there? This would
> > be a good opportunity to nail down the plan.
> >
> > > I agree with Thomas and Matthew in their assessment of the wording
> > > of the document and the practical matter of being able to define your
> > > own local ontology.
> >
> > As explained earlier, this was just poor wording in the document. We're
> > in agreement on how the local ontologies should work.
> >
> > Lincoln
> >
> > On Thursday 06 December 2001 20:43, Brian Gilman wrote:
> > > Hello,
> > >
> > > Sorry for the late response to the doc! We're trying to get a
> > > release out the door.
> > >
> > > I don't see any problem with DAML+OIL and UDDI; I think they are
> > > complementary. The way I understand DAML+OIL is that it is trying to
> > > allow the ontologist to describe the semantic relationships amongst
> > > entities. UDDI (Universal Description, Discovery and Integration) is used
> > > to find the repository which holds the entities of interest. You can't
> > > really have one without the other.
> > >
> > > I agree with Thomas and Matthew in their assessment of the wording
> > > of the document and the practical matter of being able to define your
> > > own local ontology. When a client asks for an explicit entity in the
> > > database it should give it back without having to go and ask the base
> > > ontology what the user was asking for. I don't see this as a technical
> > > challenge; it comes back to the identification of entities in the
> > > database and a common way of representing them. Might I suggest the
> > > following approach: we settle on a UUID scheme for entities in the
> > > database over e-mail, then try to hold a face-to-face meeting to hash
> > > out a skeleton ontology for genomics? I would be happy to set up this
> > > meeting at Whitehead. I think this kind of discussion
> > > needs more bandwidth and a whiteboard. Or we could have a meeting at
> > > the O'Reilly conference at the end of January? Perhaps we could try and
> > > send Lincoln/the list our draft DAML+OIL documents to get a start on
> > > the problem and then discuss at the face to face meeting? Do others
> > > agree with this or am I out of left field?
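A minimal sketch of what a UUID scheme for database entities could look like. The thread does not specify one, so this is an assumption: name-based (version 5) UUIDs derived from a source name plus that source's local identifier, so the same entity always yields the same UUID. The namespace, `entity_uuid`, and the example identifiers are all hypothetical.

```python
import uuid

# Hypothetical namespace for illustration only; not an agreed DAS authority.
DAS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def entity_uuid(source: str, local_id: str) -> uuid.UUID:
    """Derive a stable UUID from a data source name and its local identifier."""
    return uuid.uuid5(DAS_NAMESPACE, f"{source}/{local_id}")


# The same (source, local_id) pair always maps to the same UUID,
# while the same local id at a different source gets a different one.
a = entity_uuid("siteA", "gene42")
b = entity_uuid("siteA", "gene42")
c = entity_uuid("siteB", "gene42")
```

A name-based scheme lets two servers compute the same identifier independently; randomly generated UUIDs would instead need a shared registry to agree.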
> > >
> > > Perhaps a little domain engineering is needed here: As Lincoln has
> > > stated, people are thinking of DAS as a "cure all" for the problems in
> > > genomics. Well folks, it's not, plain and simple BUT IT HAS GREAT
> > > POTENTIAL. People are now viewing it as the panacea or lingua franca of
> > > bioinformatics when in reality it leaves a large portion of the
> > > bioinformatics/biology community out to dry. It also seems to me that
> > > we have not, as a community of DAS users/providers, defined the
> > > problems that we are trying to address. Might it be more productive to
> > > first define a few problems that each of us is faced with, to see
> > > if we are on the same page?
> > >
> > > I will try and list mine below:
> > >
> > > ***********************************************************************
> > >
> > > A list of genes is suspected to be involved in a disease pathway.
> > > The researcher wants to retrieve all or a subset of the annotations for
> > > this list of genes. These annotations may suggest a deeper analysis of
> > > the region, i.e. a literature search for related annotations or further
> > > computational analysis, which may lead to the discovery of a functional
> > > transcript. Finally, the researcher identifies a possible transcript, and
> > > protein expression analysis may begin.
> > >
> > >
> > > Sequence data has been obtained for a region of interest through
> > > sequencing techniques. This sequence is BLASTed against the latest
> > > physical map of the genome to find its genomic coordinates. These
> > > coordinates are used to poll annotation databases for annotations of
> > > interest. In some cases the sequence is BLASTed against a protein
> > > database to look for protein identity or family membership.
> > >
> > >
> > > The physical coordinate is known for a feature on the genome that is
> > > suspected to take part in a metabolic or disease pathway. Other
> > > annotations are then pulled off the web in order to gain a better
> > > understanding of the region. This evidence narrows the physical
> > > coordinates of the region of interest. If no annotations exist, then a
> > > computational annotation may be used in place of a biological one. This
> > > may lead to the discovery of a novel protein. If no protein is found,
> > > the region may be re-searched with other computational tools.
> > >
> > > ***********************************************************************
> > >
> > > From an informatics perspective, the following problems exist:
> > >
> > > 1) Identifiers are not standardized in this domain: Searching is
> > > very hard
> > > 2) There are 400 different file formats to parse
> > > 3) Protocols do not exist to query biological datastores
> > > 4) Many names exist for the same thing
> > > 5) There is no easy way to do a literature search
> > > 6) Can't see local annotations in the context of a curator's database
> > > 7) No naming conventions exist in this domain
> > > 8) Hard to find other annotators/annotations outside literature
> > > search
> > > 9) Biological entities may be difficult to visualize
> > >
> > >
> > > Other Common Use Case:
> > >
> > > A biologist has a list of features (genes) which are suspected to
> > > take part in a disease or pathway. They want to gain a deep
> > > understanding of these features by performing laboratory analysis. In
> > > order to do this they must perform the following functions:
> > >
> > > 1) Find the region of interest in the golden path by the feature's
> > > identifier (gene name, exon name, contig #, id, etc.)
> > > 2) Obtain the annotations associated with this region
> > > 3) Filter the annotations based upon submitter or physical
> > > criteria
> > > (How far apart are the features? Is this feature in an exon? etc.)
> > > 4) Send this data through a laboratory processing pipeline
> > > 5) See new annotations in the context of curated information and/or
> > > other collaborators' information
> > >
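Steps 2 and 3 of the use case above (obtain the annotations for a region, then filter by submitter or physical criteria) can be sketched as follows. This is an illustration only: the `Annotation` record, its field names, and the 1-based inclusive coordinate convention are assumptions, not part of DAS/1 or of anything agreed in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; coordinates are 1-based, inclusive."""
    name: str
    start: int
    end: int
    submitter: str


def overlaps(a: Annotation, start: int, end: int) -> bool:
    """True if the annotation shares at least one base with [start, end]."""
    return a.start <= end and a.end >= start


def gap(a: Annotation, b: Annotation) -> int:
    """Number of bases strictly between two features; 0 if they overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def filter_annotations(annotations: List[Annotation], start: int, end: int,
                       submitter: Optional[str] = None) -> List[Annotation]:
    """Keep annotations in the region, optionally from a single submitter."""
    return [a for a in annotations
            if overlaps(a, start, end)
            and (submitter is None or a.submitter == submitter)]


# Made-up example data.
annotations = [
    Annotation("exon1", 100, 200, "whitehead"),
    Annotation("exon2", 301, 400, "whitehead"),
    Annotation("snp7", 150, 150, "cshl"),
    Annotation("far_gene", 5000, 6000, "cshl"),
]
in_region = filter_annotations(annotations, 1, 1000)
by_submitter = filter_annotations(annotations, 1, 1000, submitter="cshl")
```

"Is this feature in an exon" reduces to the same `overlaps` test applied to an exon's coordinates, and "how far apart are the features" is `gap`.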
> > >
> > > Hopefully, this will start discussions as to what exactly we are
> > > trying to solve and whether we are on the same page. If we are, all the
> > > better, but I suspect that we may not be, and we may want to start
> > > here to define our scope.
> > >
> > > Best,
> > >
> > > -Brian
> > >
> > >
> > > -----------------------
> > > Brian Gilman <gilmanb@genome.wi.mit.edu>
> > > Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> > > One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> > > phone +1 617 252 1069 / fax +1 617 252 1902
> > >
> > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > OK, so that argues that we need to develop a common ontology to work
> > > > from, right? I was beginning to think that the sentiment was that
> > > > DAS should *not* develop an ontology of annotation types.
> > > >
> > > > Lincoln
> > > >
> > > > On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > > > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > > > Hi Chris,
> > > > > >
> > > > > > If you have four different similar but not identical ontologies
> > > > > > expressed in DAML+OIL, how does a third party provide the
> > > > > > equivalence relationships? Do you envision him providing an
> > > > > > > equivalence mapping for each of the 6 pairs, or mapping them all
> > > > > > to a single common ontology?
> > > > >
> > > > > The all by all approach would rapidly get out of hand. I think your
> > > > > idea of mapping to a skeleton ontology is best. One can imagine all
> > > > > > kinds of different topologies, but that would be getting ahead of
> > > > > ourselves. The important point is that the level of conformance
> > > > > should be optional.
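The arithmetic behind the exchange above can be sketched in a few lines: four ontologies need 6 pairwise mappings but only 4 mappings to a shared skeleton, and the gap widens as ontologies are added. The site names, terms, and the `translate` helper are hypothetical; returning `None` for an unmapped term is one way to model conformance being optional.

```python
from itertools import combinations


def pairwise_mappings(ontologies):
    """Every unordered pair needs its own mapping: n * (n - 1) / 2 of them."""
    return list(combinations(ontologies, 2))


def hub_mappings(ontologies, hub="skeleton"):
    """Each ontology maps once to the shared skeleton: n mappings."""
    return [(name, hub) for name in ontologies]


# Hypothetical local-term -> skeleton-term tables for two sites.
TO_HUB = {
    "siteA": {"mRNA": "transcript"},
    "siteB": {"messenger_rna": "transcript"},
}
# Inverse tables: skeleton-term -> local-term.
FROM_HUB = {
    site: {shared: local for local, shared in table.items()}
    for site, table in TO_HUB.items()
}


def translate(term, source, target):
    """Translate a term between two sites via the skeleton; None if unmapped."""
    shared = TO_HUB.get(source, {}).get(term)
    if shared is None:
        return None
    return FROM_HUB.get(target, {}).get(shared)


names = ["A", "B", "C", "D"]
n_pairs = len(pairwise_mappings(names))
n_hub = len(hub_mappings(names))
```

With ten ontologies the pairwise count is 45 against 10 for the skeleton approach.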
> > > > >
> > > > > > Lincoln
> > > > > >
> > > > > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > > > > I think we're going to find that the features form a DAG
> > > > > > > > > and not a hierarchy. Otherwise you're going to have
> > > > > > > > > problems classifying things like "genes". In the context
> > > > > > > > > of genetics, a gene is a type of complementation group. In
> > > > > > > > > the context of genomics, a gene is a subclass of
> > > > > > > > > transcription features, translation features, and
> > > > > > > > > regulatory features.
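A small sketch of why a DAG is needed rather than a strict hierarchy: in the genomics example above, "gene" has three direct parents, which a single-parent tree cannot express. The type names and the `PARENTS` table are illustrative assumptions, not an agreed ontology.

```python
# Hypothetical feature-type DAG: each type maps to its direct parents.
PARENTS = {
    "gene": ["transcription_feature", "translation_feature",
             "regulatory_feature"],
    "transcription_feature": ["feature"],
    "translation_feature": ["feature"],
    "regulatory_feature": ["feature"],
    "feature": [],
}


def ancestors(term, parents=PARENTS):
    """Return every type reachable by following parent links from term."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_a(term, candidate, parents=PARENTS):
    """True if term is candidate or a descendant of it in the DAG."""
    return term == candidate or candidate in ancestors(term, parents)
```

A query such as `is_a("gene", "regulatory_feature")` succeeds along one parent path while `is_a("gene", "translation_feature")` succeeds along another, which is the behaviour a tree would force you to give up.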
> > > > > > > >
> > > > > > > > Bugger.
> > > > > > > >
> > > > > > > > You are right. I'm glad you are going to sort out how to have
> > > > > > > > an extensible distributed DAG system that is easy to use. ;)
> > > > > > >
> > > > > > > Thankfully it's already been done - cf semantic web, RDF(S),
> > > > > > > DAML+OIL etc
> > > > > > >
> > > > > > > The nice thing about this is if someone doesn't like "ontology
> > > > > > > politburo"'s classes, they can add in their own.
> > > > > > >
> > > > > > > Two people can develop their own similar class hierarchies
> > > > > > > without speaking to one another, and a third person can provide
> > > > > > > equivalence relationships mapping between the concepts; or
> > > > > > > logical rules for inferring one from the other.
> > > > > > >
> > > > > > > > The DAGs can be as complex as you like, giving computable
> > > > > > > semantics for terms like "tRNA" or just a flat vocabulary,
> > > > > > > whatever you like. Complements DAS beautifully.
> > > > > > >
> > > > > > > All designed to be anarchic and distributed and not an OMG
> > > > > > > committee in sight
> > > > > > >
> > > > > > > > DAS mailing list
> > > > > > > > DAS@biodas.org
> > > > > > > > http://biodas.org/mailman/listinfo/das
> > > > > >
> > > > > > --
> > > > > > =================================================================
> > > > > >==== === Lincoln D. Stein Cold Spring
> > > > > > Harbor Laboratory lstein@cshl.org Cold Spring
> > > > > > Harbor, NY
> > > > > >
> > > > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > > > PLEASE WRITE FOR DETAILS.
> > > > > > =================================================================
> > > > > >==== ===
> > > >
> >
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
PLEASE WRITE FOR DETAILS.
========================================================================