[DAS] Re: [Call to action] Retrieval of positions from feature identifiers

Lincoln Stein lstein@cshl.org
Tue, 20 Nov 2001 15:48:02 -0500


Brian,

Yes, use cases would be helpful.  Let me explain my internal use case:

   DAS/1 was designed to act as an adjunct to a genome annotation
   web site like Ensembl.  Users navigate to a region of interest
   using any of the query mechanisms available from the web site
   (following GO terms, drilling down into a cytogenetic map, BLAST
   searches).  Once the region is in focus, the web site launches the DAS
   browser applet, servlet or CGI script, allowing the user to layer
   other third-party groups' annotations on the same genomic segment.
   Each annotation contains an optional note that provides some
   information about it, and an optional URL that allows the user to
   visit the third-party annotator's site for more detailed information.

This use case places the burden of managing biological objects and
managing meaningful queries across those objects on the primary web
site.  DAS is specialized to deal with genomic intervals, and not to
worry much about what those intervals mean.  Note that the originating 
web site does not necessarily equate with the DAS "reference server".

You (and many others, myself included) want to go the next step, and
allow DAS/2 to handle feature lookups based on their identifiers and
attributes.  Since you are currently banging your head against this
very problem, can you describe the use cases that Omnigene is designed
to address?

Lincoln


Brian writes:
 > As you said it's a can of worms but something that cannot be
 > ignored. This is a huge problem in DAS 1.0 and absolutely needs to be
 > addressed in the next version of DAS. 
 > 
 > 	I agree with Ewan and David Huen, we should not "futz" with the
 > specification today. I was wrong about having optional features in the
 > current specification. Let's focus on this for the next version of the
 > specification and make sure it solves the most common use cases. 
 > 
 > 	This brings up another good point: Use Cases in Genomics. Have
 > they been defined by the group? Where do you find the most common use
 > cases and their solutions using DAS? I think our user community would
 > appreciate these, and they would help users understand exactly where
 > DAS fits into their daily lives. More importantly it will help the DAS
 > implementors and visionaries focus their attentions on the "right" part of
 > the new specification. 
 > 
 > 	It seems to me that there are a few groups who are playing in the
 > DAS domain. TIGR, Ensembl, Whitehead, Cold Spring Harbor, Wash U, and
 > UCSC. But there is still one missing piece to the puzzle: NCBI. 
 > 
 > 	My users cannot do their daily work without touching NCBI. It
 > looks like the NCI is making significant advances in implementing 
 > webservices. I am particularly excited to see their architecture overview
 > because it looks so much like my own. This raises the question again of
 > interoperability in bioinformatics. Without a unified notion of the
 > messages and interfaces which we are talking with/to we will be heading
 > down a road of interoperability nightmare. Is there anyone on this list
 > who is part of NCBI? If so can you tell us where NCBI is headed? Where DAS
 > fits into the organization's overall plan? What this or other user
 > community can do to fully engage NIH and incorporate the data from your
 > organization? If not then we need to engage this community. The data which
 > they maintain is vitally important to the biological community as a whole. 
 > 
 > 	We are at a turning point in solving interoperability problems in
 > bioinformatics. We must first solve some of the infrastructure problems
 > before we can start seeing truly great discoveries in
 > biology/comp-bio. Integration of data from disparate data sources is a
 > technically achievable goal. We must remember that this
 > domain is too large and distributed to think we can solve the domain
 > problems alone. I know this has gotten long winded but, I feel that this
 > goal should hit home for this group. Without collaboration and the immense
 > support by outside organizations DAS would be nothing more than a
 > theoretical paper. 
 > 
 > 	With this in mind, I think we should do a little domain
 > engineering, start from the RFC's, determine the common use cases, specify
 > the protocol and architecture (as much as we can), implementing parts as
 > we go to discover where we went wrong. The implementations should drive
 > amendments to the specification. Once we feel that we have "solved" at
 > least 80% or more of the use cases we can try and push the architecture
 > further. But, without at least some notion of the vision for DAS and what
 > it is trying to solve, we will find that we have written hacky bits which
 > stick for a little while but will ultimately fall apart in the end.
 > 
 > 				-Brian
 > 
 > 
 > On Mon, 19 Nov 2001, Lincoln Stein wrote:
 > 
 > > Hi Matthew,
 > > 
 > > Colons in identifiers are disallowed, period.  But if the client
 > > happens to screw up, the LDAS server's particular implementation will
 > > forgive it, and try to do the right thing.  That's no different from
 > > what most browsers do with all the bad HTML out there.  Undocumented
 > > features are there for testing purposes, as is occurring between
 > > Brian's Omnigene and LDAS, but are not to be relied on.
 > > 
 > > What a can of worms!  I was just trying to be helpful to Brian.
 > > 
 > > Lincoln
 > > 
 > > Matthew Pocock writes:
 > >  > Lincoln Stein wrote:
 > >  > 
 > >  > 
 > >  > > However, it wouldn't be a bad idea to robustify your parsing code so
 > >  > > that it won't be tripped up by clients that aren't following the spec
 > >  > > exactly.  The colon character is very common in identifiers, so when I
 > >  > > parse out the segment, I only pay attention to the rightmost colon.
 > >  > > Everything to the left of it is the identifier.
 > >  > > 
 > >  > 
 > >  > 
 > >  > I'm probably being pedantic, but specs are there to tell us how to 
 > >  > behave and what is right/wrong. The spec clearly states that colons 
 > >  > (along with tab, newline and a few other characters) are not allowed in 
 > >  > sequence IDs. Commas are not excluded from IDs. Currently, foo:3,5 is 
 > >  > not ambiguous (id,start,stop), and neither is foo3,5 (id).
 > >  > 
 > >  > By condoning colons, things do become ambiguous. Is a request of the 
 > >  > form "foo:3,5bar" an ID, or did the request get mis-typed? "foo:3,5" can 
 > >  > now be either a single ID or an id,start,stop tuple. There are several 
 > >  > solutions, none of which are drastic enough to require all code 
 > >  > everywhere to be torn up. Here are just 3:
 > >  > 
 > >  > 1) Disallow commas from IDs. How long until this gets violated in the 
 > >  > name of a quick fix?
 > >  > 
 > >  > 2) Provide an escape character. It seems rude to me to exclude 
 > >  > characters and then not provide an alternative way to encode them.
 > >  > 
 > >  > 3) Specify a complete grammar for these expressions that is 
 > >  > unambiguously parsable, e.g. as a regexp, which explicitly allows all 
 > >  > legal combinations of valid IDs and optional locators, and excludes all 
 > >  > others.
 > >  > 
 > >  > 
 > >  > Sorry to rant, but what's the point of a spec if it is a guideline, not 
 > >  > a normative declaration? What other bits are 'nearly always' to be 
 > >  > adhered to?
 > >  > 
 > >  > Matthew
 > > 
 > > 
 > 
 > -- 
 > ----------------
 > Brian Gilman <gilmanb@jforge.net>
 > 
 > 
 > 
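
The two parsing approaches discussed in the quoted thread can be
contrasted with a short sketch.  This is only an illustration in
Python, not code from LDAS or Omnigene.  The names Segment,
parse_strict and parse_lenient are hypothetical.  The strict pattern
assumes that only colon, tab and newline are excluded from IDs (the
thread mentions "a few other characters" without listing them), and
the lenient version assumes that a string whose tail is not
"start,stop" should be treated as a bare ID.

```python
import re
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    ident: str
    start: Optional[int]
    stop: Optional[int]


# Assumed grammar: ID, optionally followed by ":start,stop".
# Only colon, tab and newline are excluded from IDs here (an assumption).
_STRICT = re.compile(r"([^:\t\n]+)(?::([0-9]+),([0-9]+))?")
_RANGE = re.compile(r"([0-9]+),([0-9]+)")


def parse_strict(text: str) -> Segment:
    """Accept only strings that match the assumed grammar exactly."""
    m = _STRICT.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid segment: {text!r}")
    ident, start, stop = m.groups()
    if start is None:
        return Segment(ident, None, None)
    return Segment(ident, int(start), int(stop))


def parse_lenient(text: str) -> Segment:
    """Split on the rightmost colon; everything to its left is the ID."""
    ident, colon, tail = text.rpartition(":")
    m = _RANGE.fullmatch(tail) if colon else None
    if m is None or not ident:
        # No usable "start,stop" tail: treat the whole string as an ID.
        return Segment(text, None, None)
    return Segment(ident, int(m.group(1)), int(m.group(2)))
```

With these definitions, "foo:3,5" parses the same way under both
functions, while "a:b:3,5" is rejected by parse_strict and read as ID
"a:b" with range 3,5 by parse_lenient.  That difference is the
ambiguity the thread is arguing about.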

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================