[DAS] Re: Our identifier doc and proposal
Brian
gilmanb@Jforge.net
Tue, 11 Dec 2001 03:54:50 -0500 (EST)
If you think they'd help then I'd be happy to submit them but I have to
warn you that the entity relationships may not be right based upon my
/very poor understanding of DAML+OIL to date/.
On a different note, I'd like to announce that we will be
releasing a 0.5 version of OmniGene this Thursday. I will be contributing
this freeze to the DAS project, and will put up some test DAS/SOAP
servers for people to play with 2 weeks after this. These will include a
dbSNP server, an Ensembl server, a GenBank server and a Golden Path server.
We will also provide some internal data to groups through an authentication
layer that we are cooking up.
Best,
-B
On Tue, 11 Dec 2001, Lincoln Stein wrote:
> Even in rough shape I think they'll be very useful. And they'll probably
> make great holiday reading too!
>
> Lincoln
>
> On Monday 10 December 2001 17:48, Brian Gilman wrote:
> > Our docs are in VERY rough shape. I can try and get something to you after
> > the holidays. And yes, I'll be at the conference.
> >
> > -B
> >
> > -----------------------
> > Brian Gilman <gilmanb@genome.wi.mit.edu>
> > Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> > One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> > phone +1 617 252 1069 / fax +1 617 252 1902
> >
> > On Mon, 10 Dec 2001, Lincoln Stein wrote:
> > > Hi Brian,
> > >
> > > I'd love to see your DAML+OIL draft documents. Were these developed for
> > > the I3C?
> > >
> > > We need to decide which entities to model before going any further.
> > > DAS/1 has two objects: the map and the feature. Brian's use cases below
> > > beg for more entities, like "submitter" and "literature reference". We
> > > could make a first draft of these entities by pulling the top level
> > > objects out of BioJava and BioPerl and then deciding which ones are in
> > > the DAS scope. This would also help down the road in creating
> > > Bio{Java,Perl}-compatible APIs. Does this sound like a reasonable
> > > approach?
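
One way to picture the entity list under discussion is as a handful of plain record types. The sketch below is only an illustration in Python: the map and the feature come from DAS/1, "submitter" and "literature reference" are the extra entities suggested by the use cases, and every field name is an assumption, not something taken from DAS, BioJava, or BioPerl.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submitter:
    # Hypothetical fields; the thread only names the entity.
    name: str
    institution: str = ""


@dataclass
class LiteratureReference:
    # Hypothetical field; the thread only names the entity.
    citation: str


@dataclass
class Feature:
    # A feature positioned on a map (all field names are assumptions).
    feature_id: str
    feature_type: str
    start: int
    end: int
    submitter: Optional[Submitter] = None
    references: List[LiteratureReference] = field(default_factory=list)


@dataclass
class Map:
    # A coordinate system that carries features.
    map_id: str
    features: List[Feature] = field(default_factory=list)


# Invented example data.
chr_map = Map("chr1")
chr_map.features.append(
    Feature("f1", "exon", 100, 250, submitter=Submitter("example lab"))
)
```

Writing the candidates down this way makes it easier to argue about which ones are in scope: dropping an entity means deleting a class and seeing which fields no longer make sense.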
> > >
> > > The use cases are very enlightening. Do you, or others on the mailing
> > > list, have more?
> > >
> > > I will be at the O'Reilly conference, as will Ewan and (I think) Brian.
> > > How many people from the DAS mailing list will also be there? This would
> > > be a good opportunity to nail down the plan.
> > >
> > > > > I agree with Thomas and Matthew in their assessment of the wording
> > > > of the document and the practical matter of being able to define your
> > > > own local ontology.
> > >
> > > As explained earlier, this was just poor wording in the document. We're
> > > in agreement on how the local ontologies should work.
> > >
> > > Lincoln
> > >
> > > On Thursday 06 December 2001 20:43, Brian Gilman wrote:
> > > > Hello,
> > > >
> > > > Sorry for the late response to the doc! We're trying to get a
> > > > release out the door.
> > > >
> > > > I don't see any problem with DAML+OIL and UDDI; I think they are
> > > > complementary. The way I understand DAML+OIL is that it is trying to
> > > > allow the ontologist to describe the semantic relationships amongst
> > > > entities. UDDI (Universal Description, Discovery and Integration) is
> > > > used to find the repository which holds the entities of interest. You
> > > > can't really have one without the other.
> > > >
> > > > I agree with Thomas and Matthew in their assessment of the wording
> > > > of the document and the practical matter of being able to define your
> > > > own local ontology. When a client asks for an explicit entity in the
> > > > database it should give it back without having to go and ask the base
> > > > ontology what the user was asking for. I don't see this as a technical
> > > > challenge; it comes back to the identification of entities in the
> > > > database and a common way of representing them. Might I suggest the
> > > > following approach: We set our sights on a UUID scheme for entities in
> > > > the database done over e-mail. Then try and hold a face to face meeting
> > > > to hash out a skeleton ontology for genomics? I would be happy to set
> > > > up this meeting at Whitehead. I think this kind of discussion
> > > > needs more bandwidth and a whiteboard. Or we could have a meeting at
> > > > the O'Reilly conference at the end of January? Perhaps we could try and
> > > > send Lincoln/the list our draft DAML+OIL documents to get a start on
> > > > the problem and then discuss at the face to face meeting? Do others
> > > > agree with this or am I out of left field?
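
The message proposes a UUID scheme without spelling out details. One possible reading, sketched below in Python, is a name-based UUID (version 5) derived from the source database and that database's local identifier, so the same entity always maps to the same UUID. The namespace, the function name `entity_uuid`, and the sample identifier are all assumptions made for illustration.

```python
import uuid

# Hypothetical namespace for the scheme; any fixed UUID would do.
DAS_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://biodas.org/")


def entity_uuid(source: str, local_id: str) -> uuid.UUID:
    """Derive a stable UUID from a source name and that source's local ID."""
    return uuid.uuid5(DAS_NAMESPACE, f"{source}:{local_id}")


# "rs12345" is a made-up identifier used only to exercise the function.
a = entity_uuid("dbSNP", "rs12345")
b = entity_uuid("dbSNP", "rs12345")
c = entity_uuid("Ensembl", "rs12345")

print(a == b)  # same source and ID give the same UUID
print(a == c)  # a different source gives a different UUID
```

A name-based scheme needs no central registry, but it only works if everyone agrees on the source names and on how local identifiers are written.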
> > > >
> > > > Perhaps a little domain engineering is needed here: As Lincoln has
> > > > stated, people are thinking of DAS as a "cure all" for the problems in
> > > > genomics. Well folks, it's not, plain and simple BUT IT HAS GREAT
> > > > POTENTIAL. People are now viewing it as the panacea or lingua franca of
> > > > bioinformatics when in reality it leaves a large portion of the
> > > > bioinformatics/biology community out to dry. It also seems to me that
> > > > we have not, as a community of DAS users/providers, defined the
> > > > problems that we are trying to address. Might it be more productive to
> > > > first define a few problems that each of us is faced with, to see
> > > > if we are on the same page?
> > > >
> > > > I will try and list mine below:
> > > >
> > > > ************************************************************************
> > > >
> > > > A list of genes is suspected to be involved in a disease pathway.
> > > > The researcher wants to retrieve all or a subset of annotations for
> > > > this list of genes. These annotations may suggest a deeper analysis of
> > > > the region, i.e. a literature search for related annotations or further
> > > > computational analysis, which may lead to the discovery of a functional
> > > > transcript. Finally, a researcher identifies a possible transcript and
> > > > protein expression analysis may begin.
> > > >
> > > >
> > > > Sequence data has been obtained for a region of interest through
> > > > sequencing techniques. This sequence is BLASTed against the latest
> > > > physical map of the genome to find its genomic coordinates. These
> > > > coordinates are used to poll annotation databases for annotations of
> > > > interest. In some cases the sequence is BLASTed against a protein
> > > > database to look for protein identity or family
> > > > membership.
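
The step "coordinates are used to poll annotation databases" comes down to an interval-overlap query. A minimal sketch in Python, assuming annotations are (start, end, label) tuples on a single sequence with inclusive coordinates; the data is invented and does not come from any real server.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, label), inclusive coordinates


def overlapping(annotations: List[Annotation], start: int, end: int) -> List[Annotation]:
    """Return annotations that share at least one base with [start, end]."""
    return [a for a in annotations if a[0] <= end and a[1] >= start]


# Invented example data.
annotations = [
    (100, 200, "exon A"),
    (150, 400, "repeat B"),
    (500, 600, "SNP cluster C"),
]

hits = overlapping(annotations, 180, 450)
print([label for _, _, label in hits])
```

A linear scan like this is fine for a sketch; a real server would presumably index features by position.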
> > > >
> > > >
> > > > The physical coordinate is known for a feature on the genome that is
> > > > suspected to take part in a metabolic or disease pathway.
> > > > Other annotations are then pulled off the web in order to gain
> > > > a better understanding of the region. This evidence
> > > > narrows the physical coordinate of the region of interest. If no
> > > > annotations exist then a computational annotation may be used over a
> > > > biological one. This may lead to the discovery of a novel protein. If
> > > > no protein is found the region may be re-searched with other
> > > > computational tools.
> > > >
> > > > ************************************************************************
> > > >
> > > > From an informatics perspective, the following problems exist:
> > > >
> > > > 1) Identifiers are not standardized in this domain: Searching is
> > > > very hard
> > > > 2) There are 400 different file formats to parse
> > > > 3) Protocols do not exist to query biological datastores
> > > > 4) Many names exist for the same thing
> > > > 5) There is no easy way to do a literature search
> > > > 6) Can't see local annotations in the context of a curator's database
> > > > 7) No naming conventions exist in this domain
> > > > 8) Hard to find other annotators/annotations outside literature
> > > > search
> > > > 9) Biological entities may be difficult to visualize
> > > >
> > > >
> > > > Other Common Use Case:
> > > >
> > > > A biologist has a list of features (genes) which are suspected to
> > > > take part in a disease or pathway. They want to gain a deep
> > > > understanding of these features by performing laboratory analysis. In
> > > > order to do this they must perform the following functions:
> > > >
> > > > 1) Find the region of interest in the golden path by the feature's
> > > > identifier (gene name, exon name, contig #, id, etc.)
> > > > 2) Obtain the annotations associated with this region
> > > > 3) Filter the annotations based upon submitter or physical
> > > > criteria:
> > > > (How far apart are the features? Is this feature in an exon? etc.)
> > > > 4) Send this data through a laboratory processing pipeline
> > > > 5) See new annotations in context of curated information and/or
> > > > other collaborators' information
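
Step 3 of the list above (filtering by submitter or by physical criteria) can be sketched in a few lines of Python. The `Annotation` record, its field names, and the sample data are all assumptions made for this illustration; none of them come from DAS or OmniGene.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    # All field names are assumptions made for this sketch.
    label: str
    kind: str
    start: int
    end: int
    submitter: str


def by_submitter(annos: List[Annotation], submitter: str) -> List[Annotation]:
    """Keep only the annotations from one submitter."""
    return [a for a in annos if a.submitter == submitter]


def gap(a: Annotation, b: Annotation) -> int:
    """Bases between two features; 0 if they touch or overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end) - 1)


def inside_exon(a: Annotation, annos: List[Annotation]) -> bool:
    """True if the feature lies entirely within some other exon annotation."""
    return any(
        e.kind == "exon" and e.start <= a.start and a.end <= e.end
        for e in annos
        if e is not a
    )


# Invented data for illustration only.
annos = [
    Annotation("exon1", "exon", 100, 300, "curator"),
    Annotation("snp1", "snp", 150, 150, "lab_x"),
    Annotation("snp2", "snp", 400, 400, "lab_x"),
]

print([a.label for a in by_submitter(annos, "lab_x")])
print(gap(annos[1], annos[2]))
print(inside_exon(annos[1], annos), inside_exon(annos[2], annos))
```

Each criterion is a small independent function, so a client could combine them in whatever order the laboratory pipeline in step 4 needs.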
> > > >
> > > >
> > > > Hopefully, this will start discussions as to what exactly we are
> > > > trying to solve and whether we are on the same page. If we are, all
> > > > the better, but I suspect that we may not be, and we may want to start
> > > > here to define our scope.
> > > >
> > > > Best,
> > > >
> > > > -Brian
> > > >
> > > >
> > > > -----------------------
> > > > Brian Gilman <gilmanb@genome.wi.mit.edu>
> > > > Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> > > > One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> > > > phone +1 617 252 1069 / fax +1 617 252 1902
> > > >
> > > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > > OK, so that argues that we need to develop a common ontology to work
> > > > > from, right? I was beginning to think that the sentiment was that
> > > > > DAS should *not* develop an ontology of annotation types.
> > > > >
> > > > > Lincoln
> > > > >
> > > > > On Thursday 06 December 2001 17:07, Chris Mungall wrote:
> > > > > > On Thu, 6 Dec 2001, Lincoln Stein wrote:
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > If you have four different similar but not identical ontologies
> > > > > > > expressed in DAML+OIL, how does a third party provide the
> > > > > > > equivalence relationships? Do you envision him providing an
> > > > > > > > equivalence mapping for each of the 6 pairs, or mapping them all
> > > > > > > to a single common ontology?
> > > > > >
> > > > > > The all by all approach would rapidly get out of hand. I think your
> > > > > > idea of mapping to a skeleton ontology is best. One can imagine all
> > > > > > > kinds of different topologies but that would be getting ahead of
> > > > > > ourselves. The important point is that the level of conformance
> > > > > > should be optional.
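
The arithmetic behind "6 pairs" for four ontologies, and why the all-by-all approach gets out of hand, can be checked with a short Python sketch. It compares n(n-1)/2 pairwise mappings with the n mappings needed when each ontology is mapped once to a shared skeleton; the function names are invented for this example.

```python
from math import comb


def pairwise_mappings(n: int) -> int:
    """Mappings needed if every ontology is mapped to every other one."""
    return comb(n, 2)


def skeleton_mappings(n: int) -> int:
    """Mappings needed if each ontology is mapped once to a shared skeleton."""
    return n


for n in (4, 10, 50):
    print(n, pairwise_mappings(n), skeleton_mappings(n))
```

For four ontologies this gives 6 pairwise mappings against 4 skeleton mappings; at 50 ontologies it is 1225 against 50.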
> > > > > >
> > > > > > > Lincoln
> > > > > > >
> > > > > > > On Friday 30 November 2001 18:12, Chris Mungall wrote:
> > > > > > > > On Thu, 29 Nov 2001, Ewan Birney wrote:
> > > > > > > > > On Wed, 28 Nov 2001, Lincoln Stein wrote:
> > > > > > > > > > I think we're going to find that the features form a DAG
> > > > > > > > > > and not a hierarchy. Otherwise you're going to have
> > > > > > > > > > problems classifying things like "genes". In the context
> > > > > > > > > > of genetics, a gene is a type of complementation group. In
> > > > > > > > > > the context of genomics, a gene is a subclass of
> > > > > > > > > > transcription features, translation features, and
> > > > > > > > > > regulatory features.
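
The difference between a hierarchy and a DAG is that a term may have more than one parent. A minimal Python sketch, using the parents of "gene" named in the message above; the dictionary representation and the two top-level terms are assumptions added for illustration.

```python
from typing import Dict, List, Set

# Each term maps to its direct parents. "gene" has several parents,
# which a strict tree cannot express.
PARENTS: Dict[str, List[str]] = {
    "gene": [
        "complementation group",
        "transcription feature",
        "translation feature",
        "regulatory feature",
    ],
    "complementation group": ["genetic entity"],   # hypothetical parent
    "transcription feature": ["genomic feature"],  # hypothetical parent
    "translation feature": ["genomic feature"],    # hypothetical parent
    "regulatory feature": ["genomic feature"],     # hypothetical parent
}


def ancestors(term: str) -> Set[str]:
    """Collect every term reachable by following parent links."""
    seen: Set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


print(sorted(ancestors("gene")))
```

The `seen` set is what makes the traversal safe on a DAG: "genomic feature" is reachable by three paths but is visited only once.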
> > > > > > > > >
> > > > > > > > > Bugger.
> > > > > > > > >
> > > > > > > > > You are right. I'm glad you are going to sort out how to have
> > > > > > > > > an extensible distributed DAG system that is easy to use. ;)
> > > > > > > >
> > > > > > > > Thankfully it's already been done - cf semantic web, RDF(S),
> > > > > > > > DAML+OIL etc
> > > > > > > >
> > > > > > > > The nice thing about this is if someone doesn't like "ontology
> > > > > > > > politburo"'s classes, they can add in their own.
> > > > > > > >
> > > > > > > > Two people can develop their own similar class hierarchies
> > > > > > > > without speaking to one another, and a third person can provide
> > > > > > > > equivalence relationships mapping between the concepts; or
> > > > > > > > logical rules for inferring one from the other.
> > > > > > > >
> > > > > > > > > The DAGs can be as complex as you like, giving computable
> > > > > > > > semantics for terms like "tRNA" or just a flat vocabulary,
> > > > > > > > whatever you like. Complements DAS beautifully.
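
The idea of a third person publishing equivalences between two independently developed vocabularies can be sketched with a plain lookup table. This Python example does not model DAML+OIL itself; the ontology names "lab_a" and "lab_b", the terms, and the `translate` helper are all invented for illustration.

```python
from typing import Dict, Optional, Tuple

Term = Tuple[str, str]  # (ontology name, term)

# Hypothetical equivalences published by a third party.
EQUIVALENT: Dict[Term, Term] = {
    ("lab_a", "tRNA gene"): ("lab_b", "transfer_RNA"),
    ("lab_a", "exon"): ("lab_b", "exonic_region"),
}

# Equivalence is symmetric, so index the table in both directions.
LOOKUP: Dict[Term, Term] = {**EQUIVALENT, **{v: k for k, v in EQUIVALENT.items()}}


def translate(ontology: str, term: str) -> Optional[Term]:
    """Return the equivalent term in the other ontology, if one is declared."""
    return LOOKUP.get((ontology, term))


print(translate("lab_a", "exon"))
print(translate("lab_b", "transfer_RNA"))
print(translate("lab_a", "intron"))  # no equivalence declared
```

Neither lab has to change its own vocabulary; the mapping lives in a separate table that anyone can publish or replace.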
> > > > > > > >
> > > > > > > > All designed to be anarchic and distributed and not an OMG
> > > > > > > > committee in sight
> > > > > > > >
> > > > > > > > > DAS mailing list
> > > > > > > > > DAS@biodas.org
> > > > > > > > > http://biodas.org/mailman/listinfo/das
> > > > > > >
> > > > > > > --
> > > > > > > =================================================================
> > > > > > >==== === Lincoln D. Stein Cold Spring
> > > > > > > Harbor Laboratory lstein@cshl.org Cold Spring
> > > > > > > Harbor, NY
> > > > > > >
> > > > > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > > > > PLEASE WRITE FOR DETAILS.
> > > > > > > =================================================================
> > > > > > >==== ===
> > > > >
> > > > > --
> > > > > =====================================================================
> > > > >=== Lincoln D. Stein Cold Spring Harbor
> > > > > Laboratory lstein@cshl.org Cold Spring Harbor, NY
> > > > >
> > > > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > > > PLEASE WRITE FOR DETAILS.
> > > > > =====================================================================
> > > > >===
> > >
> > > --
> > > ========================================================================
> > > Lincoln D. Stein Cold Spring Harbor Laboratory
> > > lstein@cshl.org Cold Spring Harbor, NY
> > >
> > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > PLEASE WRITE FOR DETAILS.
> > > ========================================================================
>
>
--
----------------
Brian Gilman <gilmanb@jforge.net>