From Steve_Chervitz at affymetrix.com Thu Nov 3 19:24:53 2005
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Thu, 3 Nov 2005 16:24:53 -0800
Subject: [DAS2] DAS/2 weekly meeting notes
Message-ID:

Notes from the weekly DAS/2 teleconference, 3 Nov 2005.

$Id: das2-teleconf-2005-11-03.txt,v 1.2 2005/11/04 00:23:27 sac Exp $

Attendees:
  Affy: Steve Chervitz, Ed Erwin, Gregg Helt
  UCLA: Brian O'Connor, Mark Carlson

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2005. Instructions on how to access this repository
are at http://biodas.org

Status Reports
--------------
Gregg:
* A lot happened last week:
  - Major IGB public release (4.02) last Friday (10/28)
  - Attended and presented IGB demo at CSHL Genome Informatics meeting
    on Sunday (10/30)
  - Finished and submitted DAS/2 continuation grant on Tue (11/1).
* Held a DAS/2 BOF (birds of a feather meeting) at CSHL. Good discussion
  and turnout (15). Collected feedback from EBI/Sanger folks. Asked people
  to download the client (IGB) and hit the servers (Affy, UCLA), so be
  looking for more traffic soon.
* TODO: Monitor DAS/2 traffic, collect usage stats for both servers:
    http://netaffxdas.affymetrix.com
    http://biopackages.net
  Especially check for performance degradation under load. Need to parse
  apache and server logs for things like: # users, typical query times, etc.
* IGB demo went well. People were impressed with speed. Requests for
  Gregg's in-memory java DAS/2 server, but code is not yet ready for
  public consumption.

Ed:
* Reviewing various technologies of possible interest:
  - HTTP communication protocol, necessary commands.
  - Using a bean-based property editor for IGB
* Spent time answering user questions on IGB forum (only 1 person posted
  trouble with installing data for use with new IGB release -- not bad).
  Gregg adds: Also no negative feedback from internal release.

Steve:
* Spec work: Posted message about types and features issues in the
  retrieval spec last Thurs (10/26).
  Mentioned Lincoln's response (doing away with xml:base and going with
  his namespace scheme). Gregg talked with Lincoln about this at CSHL and
  clarified that xml:base is for resolving relative URLs in attributes or
  CDATA elements, whereas xmlns is for resolving names of attributes and
  elements. Steve will post a response to the DAS/2 discussion list about
  this.
* Tested the IGB release on OS X last week prior to release. Noted the
  display bug that Gregg knows about (disappearing view when you select a
  new DAS/2 annotation source). Found trouble with a quickload synonym on
  the Affy internal server. Ed fixed.
* Installed new assembly (Human Nov 2002) available via quickload and
  DAS/2. Gregg says: Use DAS/1 for new genomes at this stage.
* DAS/2 discussion list troubleshooting. Problem with open-bio sendmail,
  DNS.

Brian, Mark:
* Using the DAS/2 layer from the IGB code base and extending it for their
  assay and ontology namespaces. Want to put this new code in separate
  packages to avoid stepping on other IGB functionality. DAS/2 layer is
  currently in com.affymetrix.igb.das2. Options:
  1. Add subpackages to com.affymetrix.igb.das2.
  2. Move das2 out from under igb to com.affymetrix.das2.
  3. Move das2 out of com.affymetrix to be totally separate. Then
     com.affymetrix.igb.das2 and the assay/ontology code would depend on it.
  Brian is fine with #2. Gregg will check and remove any dependencies of
  the das2 package on IGB code.
* Plan to release their code internally in December. Code is in their own
  CVS repository now. Genoviz/IGB code has not been committed to SF yet.

---------------------------
TODO
* Summarize CSHL genome informatics meeting happenings relevant to DAS/2
  when others who were there are dialed in.
* Move teleconf meeting to a more UK-friendly time. US is now on standard
  time. 9am PST = 12pm EST = 17:00 GMT. How does this work for folks?
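The xml:base vs. xmlns distinction discussed in the notes above can be sketched in a few lines. This is an illustrative aside, not part of the original notes: it uses Python's standard urllib.parse.urljoin for RFC 3986 relative-reference resolution, and the volvox base URL is a hypothetical example in the style of the spec discussion on this list.

```python
from urllib.parse import urljoin

# xml:base supplies the base URL against which relative URLs appearing
# in attribute values or character data are resolved (standard RFC 3986
# resolution, which urljoin implements).
# Hypothetical base URL, in the spirit of the volvox examples:
base = "http://www.wormbase.org/das/genome/volvox/1/"

# A relative URL such as might appear in a ptype or xlink:href attribute:
print(urljoin(base, "type/curated_exon"))
# -> http://www.wormbase.org/das/genome/volvox/1/type/curated_exon

# xmlns, by contrast, only maps element/attribute *name* prefixes
# (das:, xlink:, ...) to namespace URIs; it plays no role in resolving
# URL-valued content like the above.
```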
From Steve_Chervitz at affymetrix.com Fri Nov 4 15:32:22 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 04 Nov 2005 12:32:22 -0800 Subject: [DAS2] Spec issues In-Reply-To: <200510270941.30528.lstein@cshl.edu> Message-ID: As Gregg noted in this week's DAS/2 meeting, xml:base and XML namespace (xmlns) are complementary technologies: * xml:base is for resolving relative URLs occurring within attribute values or CDATA elements * xmlns is for resolving names of attributes and elements. So bearing this in mind, here's my take: On Thursday 27 October 2005, Lincoln Stein wrote: > > On Wednesday 26 October 2005 07:29 pm, Chervitz, Steve wrote: > > > > > > > > Next issue: Feature properties example (only showing relevant attributes): > > > > Description: Properties are typed using the ptype attribute. The value of > > the property may be indicated by a URL given by the href attribute, or may > > be given inline as the CDATA content of the section. > > > > > > > type="type/curated_exon"> > > 29 > > 2 > > > href="/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > > > > > So in contrast to the TYPE properties which are restricted to being simple > > string-based key:value pairs, FEATURE properties can be more complex, which > > seems reasonable, given the wild world of features. We might consider using > > 'key' rather than 'ptype' for FEATURE properties, for consistency with TYPE > > prop elements (however, read on). > > I'm not so happy with "key" since it is nondescript. Originally this was > "type" but the word collided with feature type. > > I am getting uncomfortable with the dichotomy we've (I've?) created between > XML base keys/properties and namespace-based keys/properties. It seems nasty > to have the ptype attribute be either a relative URI > (property/genefinder-score), or a controlled vocabulary member (das:phase). > Is there any reason we shouldn't choose one or the other? > > For example, does this work? 
> > xmlns:dasprop="http://www.biodas.org/ns/das/genome/2.00/properties" > xmlns:type="http://www.wormbase.org/das/genome/volvox/1/type" > xmlns:id="http://www.wormbase.org/das/genome/volvox/1/feature"> > xmlns:prop="http://www.wormbase.org/das/genome/volvox/1/property"> > das:type="type:curated_exon"> > 29 > 2 > das:href="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" /> > > > This looks so much cleaner to me. Here's a new version of this example using xml:base, a default xmlns, and a special attribute to define the URL for the controlled vocabulary of DAS property keys. I'm also using xlink for the href: 29 2 > Cc: Steve Chervitz > Subject: Re: New problem with content-type header in DAS/2 server responses! > > Looks like the cache server. FYI, I have updated the server to use all > "text/xml" Content-Type for all xml response types. This was approved by > Lincoln so that web browsers could be pointed at the das server and "just > work". I thought these changes had already made their way into the spec, > but apparently not. > > The table below summarizes what the server should be giving back. The > left column shows the command and format request, and the right side shows > the response Content-Type. > > 'das/das2xml' => 'text/xml', > 'domain/das2xml' => 'text/xml', > 'domain/compact' => 'text/plain', > 'feature/das2xml' => 'text/xml', > 'feature/chain' => 'text/plain', #LOOK > 'property/das2xml' => 'text/xml', > 'region/das2xml' => 'text/xml', > 'region/compact' => 'text/plain', > 'sequence/das2xml' => 'text/plain', #LOOK > 'sequence/fasta' => 'text/plain', > 'source/das2xml' => 'text/xml', > 'source/compact' => 'text/plain', > 'type/das2xml' => 'text/xml', > 'type/compact' => 'text/plain', > 'type/obo' => 'text/plain', > 'type/rdf' => 'text/xml', > 'versionedsource/das2xml' => 'text/xml', > > As you can see, the text/plain response to the /feature command is NOT > being given by the server, but somehow being mangled by the cache. 
Is
> this going to severely impact your demo? If so I can disable the cache
> module. It will be slow though. An alternative to the cache would be to
> use our squid proxy. Brian can probably set you up to use it very
> quickly.
>
> Let me know what needs to be done ASAP.
>
> -Allen
>
> On Fri, 28 Oct 2005, Helt,Gregg wrote:
>
>> I just tried accessing the biopackages DAS/2 server from IGB, with this
>> query:
>>
>> http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr21/26027736:26068042;type=SO:mRNA
>>
>> and I'm getting back a message where the XML looks fine but here are the
>> headers:
>>
>> HTTP/1.1 200 OK
>> Date: Sat, 29 Oct 2005 05:49:46 GMT
>> Server: Apache/2.0.51 (Fedora)
>> X-DAS-Status: 200
>> Warning: 113 Heuristic expiration
>> Content-Type: text/plain; charset=UTF-8
>> Age: 259582
>> Content-Length: 6004
>> Keep-Alive: timeout=15, max=100
>> Connection: Keep-Alive
>>
>> But according to the spec the content type header needs to be:
>> Content-Type: text/x-das-features+xml
>> I'm using this in the IGB DAS/2 client to parse responses based on the
>> content type. With "text/plain; charset=UTF-8" IGB doesn't know what
>> parser to use and gives up. So right now I can't visualize annotations
>> from the biopackages server. I'm pretty sure the server was setting the
>> content-type header correctly on Wednesday -- did anything change since
>> then that could be causing this? Could the server-side cache be doing
>> this for some reason?
>>
>> Thanks,
>> Gregg

From dalke at dalkescientific.com Tue Nov 8 19:27:42 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 9 Nov 2005 01:27:42 +0100
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To:
References:
Message-ID:

My apologies for not tracking what's been going on in the last few
months. I'm back now and have time for the next few months to work
on things. So I'll start with this exchange.
I can't find the discussion in the mailing list history.

Why the decision to use "text/xml" for all xml responses? I read that it
is so "web browsers can 'just work'".

What are they supposed to do? Display the XML as some sort of tree
structure? Is that the only thing?

One thing Allen and I talked about, and he tested, was the ability to
insert a stylesheet declaration in the XML. Is this part of the reason
to switch to using "text/xml"?

Is there anything I'm missing?

Since it looks like I'm going to be more in charge of the spec
development, I would like to start collecting use cases and recording
these sorts of decisions.

I think having different content-types is an important feature. For
example, it lets a DAS browser figure out what it's looking at before
doing any parsing. Here's my use case.

I want someone to send an email to someone else along the lines of
"What do you think about http://blah.blah/das/genome/blah/blah"
with the URL of the object included in the email.

Paste that into a DAS browser and it should be able to figure out that
this is a sequence, a feature, a whatever. With the old content-types
there was enough information to do that right away. With this new one a
DAS browser needs to parse the XML to figure out what's in it.
Autodetection of XML formats? I don't want to go there.

That's also the reason for Gregg's opposition.

You (Allen) and Lincoln, on the other hand, want that user to be able to
go to a web browser and paste the URL in, to get a basic idea of what's
there. I think that's also important.

I think there are other solutions. One is "if the server sees a web
browser then return the XML data streams as 'text/xml'".

For example:

  if "Mozilla" in headers["User-Agent"]:
      ... this is IE, Mozilla, Firefox, and a few others ...

That catches most of the browsers anyone here cares about. As another
solution, look at the "Accept" header sent by the browser.
Here's what Firefox sends:

  Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Here's Safari and "links" (a text browser):

  Accept: */*

Another rule of thumb might be:

  if asking_for_xml_format and "*/*" in headers["Accept"]:
      ... return it as "text/xml" ...

Though a better version is to make sure the client doesn't know about
the expected content type:

  if asking_for_xml_format:
      return_content_type = ... whatever is appropriate ...

      if (return_content_type not in headers["Accept"]
              and "*/*" in headers["Accept"]):
          return_content_type = "text/xml"
          .... optionally insert style sheet ....

Another solution is to send a "what kind of DAS object are you?" request
to the URL (e.g., tack on a ? query or tell the server that the client
will "Accept: application/x-das-autodiscovery").

I think that's clumsy, but I mention it as another way to support both
DAS client app and human browser requests of the same URL.

>> From: Allen Day
>> Looks like the cache server. FYI, I have updated the server to use all
>> "text/xml" Content-Type for all xml response types. This was approved by
>> Lincoln so that web browsers could be pointed at the das server and
>> "just work". I thought these changes had already made their way into the
>> spec, but apparently not.

>> On Fri, 28 Oct 2005, Helt,Gregg wrote:
>>> But according to the spec the content type header needs to be:
>>> Content-Type: text/x-das-features+xml
>>> I'm using this in the IGB DAS/2 client to parse responses based on the
>>> content type. With "text/plain; charset=UTF-8" IGB doesn't know what
>>> parser to use and gives up. So right now I can't visualize annotations
>>> from the biopackages server. I'm pretty sure the server was setting the
>>> content-type header correctly on Wednesday -- did anything change since
>>> then that could be causing this? Could the server-side cache be doing
>>> this for some reason?
Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Tue Nov 8 19:49:27 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 9 Nov 2005 01:49:27 +0100
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To:
References:
Message-ID: <7e9e19f6885240c668ac677b6ea98ff0@dalkescientific.com>

P.S. Gregg mentioned one need for wanting more selective content-types.
Here's another.

I expect most of the XML data we return will change. We may add an
element field or change the meaning of an element. When that happens,
how does a client know that a "text/xml" is for one version or another
of a given document type? I expect that will be done by returning
something like

  Content-Type: text/das2xml; version=2

This, btw, suggests a third solution to the problem of letting DAS/2 and
web browser clients both point to the same object - use

  Content-Type: text/xml; das-type=das2xml

But that's ugly. A 4th is to go back to the "add a das-content-type
header" solution from DAS/1. I don't want that.

Note, btw, that if a given URL can return different MIME types for the
same request then it needs a "Vary: Accept" in the response headers so
caching works correctly.

Andrew
dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Tue Nov 8 20:58:07 2005
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Tue, 08 Nov 2005 17:58:07 -0800
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To:
Message-ID:

Andrew,

Andrew Dalke wrote on 8 Nov 2005:

> My apologies for not tracking what's been going on in the last few
> months. I'm back now and have time for the next few months to work
> on things.

Great to have you back. I have been focusing on the spec for the past
several weeks but would be glad to have you take the lead on it.
We've been making the retrieval spec a priority and should really focus on getting it nailed down as soon as possible to allow others to start implementing clients and servers against it and providing feedback. We haven't talked about a freeze or release date for it, but maybe we should. I started going through the open bugs in bugzilla, but only resolved one (#1796). While going through and cleaning up the retrieval spec, I ran into other issues that were not in bugzilla that seemed important. One was this content-type issue that you address here. I raised some other issues regarding types and feature properties etc. a couple of weeks ago that I'd like you to chime in on: http://portal.open-bio.org/pipermail/das2/2005-October/000271.html The latest message on this thread is: http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > So I'll start with this exchange. I can't find the discussion in the > mailing list history. > > Why the decision to use "text/xml" for all xml responses? I read it > it is so "web browsers can 'just work'". > > What are they supposed to do? Display the XML as some sort of tree > structure? Is that the only thing? > > One thing Allen and I talked about, and he tested, was the ability to > insert a stylesheet declaration in the XML. Is this part of the > reason to switch to using "text/xml"? Here's the relevant thread for reference: http://portal.open-bio.org/pipermail/das2/2005-July/000227.html In your other email on this thread, you said: > This, btw, suggests a third solution to the problem of letting DAS/2 > and web browser clients both point to the same object - se > > Content-Type: text/xml; das-type=das2xml > > But that's ugly. This seems like a good solution (and not too ugly IMHO). The das-type value could be more detailed (e.g., x-das-features+xml). However, I recall that there were possible problems with this syntax, but can't remember the details at the moment. 
Whatever the solution we decide, we should strive for simplicity. If we
ask too much of servers and clients, that will be an impediment to
implementation and maintenance.

Steve

From allenday at ucla.edu Tue Nov 8 21:21:51 2005
From: allenday at ucla.edu (Allen Day)
Date: Tue, 8 Nov 2005 18:21:51 -0800 (PST)
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To:
References:
Message-ID:

To be even more concise, there are two use cases being presented here:

1) DAS/2 content should be viewable in a web browser, and doing so
requires an HTTP Content-Type header to have value 'text/xml'.

2) DAS/2 content should be viewable in a specialized DAS/2 browser, and
be able to rely on HTTP headers to determine visualization mode, as
XML/DTD/Schema sniffing is undesirable.

The solution proposed in the referenced thread, or perhaps only on a
conference call, is to use the Content-Type header to address (1),
providing information to web browsers, as they are less flexible than a
specialized DAS/2 client. (2) is addressed using a DAS/2-specific
X-Das-Content-Type header, e.g.

==================
% GET -e 'http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr22/1000000:2000000;type=SO:mRNA' | head -100
Connection: close
Date: Wed, 09 Nov 2005 02:15:24 GMT
Server: Apache/2.0.51 (Fedora)
Content-Type: text/xml
Expires: Thu, 09 Nov 2006 02:15:24 GMT
Client-Date: Wed, 09 Nov 2005 02:19:16 GMT
Client-Peer: 164.67.183.101:80
Client-Response-Num: 1
Client-Transfer-Encoding: chunked
X-DAS-Content-Type: text/x-das-feature+xml
X-DAS-Server: GMOD/0.0
X-DAS-Status: 200
X-DAS-Version: DAS/2.0
==================

This also has the added benefit of already being implemented for a few
months. Are there objections to this solution?

-Allen

On Wed, 9 Nov 2005, Andrew Dalke wrote:

> My apologies for not tracking what's been going on in the last few
> months. I'm back now and have time for the next few months to work
> on things.
> > So I'll start with this exchange. I can't find the discussion in the > mailing list history. > > Why the decision to use "text/xml" for all xml responses? I read it > it is so "web browsers can 'just work'". > > What are they supposed to do? Display the XML as some sort of tree > structure? Is that the only thing? > > One thing Allen and I talked about, and he tested, was the ability to > insert a stylesheet declaration in the XML. Is this part of the > reason to switch to using "text/xml"? > > Is there anything I'm missing? > > Since it looks like I'm going to be more in charge of the spec > development, > I would like to start collecting use cases and recording these sorts of > decisions. > > I think having different content-types is an important feature. For > example, it lets a DAS browser figure out what it's looking at before > doing any parsing. Here's my use case. > > I want someone to send an email to someone else along the lines of > "What do you think about http://blah.blah/das/genome/blah/blah" > with the URL of the object included in the email. > > Paste that into a DAS browser and it should be able to figure out that > this is a sequence, a feature, a whatever. With the old content-types > there was enough information to do that right away. With this new > one a DAS browser needs to parse the XML to figure out what's in it. > Autodetection of XML formats? I don't want to go there. > > That's also the reason for Gregg's opposition. > > > You (Allen) and Lincoln, on the other hand, want that user to be able to > go to a web browser and paste the URL in, to get a basic idea of what's > there. > > I think that's also important. > > I think there are other solutions. One is "if the server sees a web > browser then return the XML data streams as a 'text/xml'". > > For example: > if "Mozilla" in headers["User-Agent"]: > ... this is IE, Mozilla, Firefox, and a few others .. > > That catches most of the browsers anyone here cares about. 
As > another solution, look at the "Accept" header sent by the browser. > Here's what Firefox sends: > > Accept: text/xml,application/xml,application/xhtml+xml,text/html; > q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' > > Here's Safari and "links" (a text browser): > > Accept: */* > > Another rule them might be > > if asking_for_xml_format and "*/*" in headers["Accept"]: > ... return it as "text/xml" ... > > Though a better version is to make sure the client doesn't know about > the expected content type: > > > if asking_for_xml_format: > return_content_type = ... whatever is appropriate ... > > if (return_content_type not in headers["Accept"] > and "*/*" in headers["Accept"]): > > return_content_type = "text/xml" > .... optionally insert style sheet .... > > > > Another solution is to send a "what kind of DAS object are you?" request > to the URL (eg, tack on a ? query or tell the server that the client > will > "Accept: application/x-das-autodiscovery"). > > > I think that's clumsy, but I mention it as another way to support > both DAS client app and human browser requests of the same URL. > > > >> From: Allen Day > > >> Looks like the cache server. FYI, I have updated the server to use > >> all > >> "text/xml" Content-Type for all xml response types. This was > >> approved by > >> Lincoln so that web browsers could be pointed at the das server and > >> "just > >> work". I thought these changes had already made their way into the > >> spec, > >> but apparently not. > > >> On Fri, 28 Oct 2005, Helt,Gregg wrote: > >>> But according to the spec the content type header needs to be: > >>> Content-Type: text/x-das-features+xml > >>> I'm using this in the IGB DAS/2 client to parse responses based on > >>> the > >>> content type. With "text/plain; charset=UTF-8" IGB doesn't know what > >>> parser to use and gives up. So right now I can't visualize > >>> annotations > >>> from the biopackages server. 
I'm pretty sure the server was setting > >>> the > >>> content-type header correctly on Wednesday -- did anything change > >>> since > >>> then that could be causing this? Could the server-side cache be > >>> doing > >>> this for some reason? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > From dalke at dalkescientific.com Wed Nov 9 12:37:21 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 9 Nov 2005 18:37:21 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: Steve: > Here's the relevant thread for reference: > http://portal.open-bio.org/pipermail/das2/2005-July/000227.html Ahh, it's the one I half remembered, from July. Allen said: > Not sure how much value there is in > this, but here is a very simple graphical display of regions on the > server, and their relative sizes. I think it's useful to have web browserability, as it were, but I think it's a secondary goal. To me the ability to transform the XML via the stylesheet is something that's technology driven and not user driven. That is, nothing in the previous work, including the DAS/2 proposals from others, mentioned that as a need. On the other hand, being able to get the content type of what's coming back from the server is a design goal, and we have an existing need -- Gregg's example -- for it. I would rather therefore put the onus on the data provider to be clever in sniffing out the client than in the DAS/2 client in sniffing out the data. Steve: > In your other email on this thread, you said: > >> This, btw, suggests a third solution to the problem of letting DAS/2 >> and web browser clients both point to the same object - se >> >> Content-Type: text/xml; das-type=das2xml >> >> But that's ugly. > > This seems like a good solution (and not too ugly IMHO). 
The das-type
> value could be more detailed (e.g., x-das-features+xml). However, I
> recall that there were possible problems with this syntax, but can't
> remember the details at the moment.

We have discussed this on-and-off for a while now, eh? Here's the
previous thread on it:
http://portal.open-bio.org/pipermail/das2/2004-December/000019.html

I need to do a bit more research. I don't like the idea of making new
headers and I don't like the idea of using a modified content-type like
that. The first because we aren't doing anything unusual compared to
other projects and the second because I don't have any experience with
that.

I suspect the answer will be:
- by default, if no "?format=" is specified then return "text/xml"
- if the client sends an "Accept: text/x-das-features+xml" then return
  the document with the proper content-type information

In that way, if someone pastes a "http://.../blah?format=xyz" and they
get a bunch of garbage, they can manually chop off the obvious "format="
part of the query.

But that doesn't agree with my use case, where the DAS/2 client gets a
random URL. It would need to send "Accept: ..." where the "..." is a
list of all the possible DAS content-types.

I'll think about this some more while I'm out salsa dancing this
evening. :)

Andrew
dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Wed Nov 9 20:25:48 2005
From: Steve_Chervitz at affymetrix.com (Chervitz, Steve)
Date: Wed, 09 Nov 2005 17:25:48 -0800
Subject: [DAS2] Agenda for weekly teleconference
Message-ID:

Time & Day: 12:00 Noon PST, Thursday 11 Nov 2005
Tel (US): 800-531-3250
Tel (Int'l): 303-928-2693
ID: 2879055

Agenda
------
* Decide on Europe-friendly time for this teleconference. Proposals:
  - Thu 9am PST = 12pm EST = 17:00 GMT
  - Wed 9am PST
  - Mon 9am PST
* DAS/2 get spec issues:
  - Content-type: text/xml vs.
text/x-das-blah+xml
    http://portal.open-bio.org/pipermail/das2/2005-November/000287.html
  - XML encoding of type and feature properties:
    http://portal.open-bio.org/pipermail/das2/2005-November/000278.html

Time and people permitting:
* Summarize CSHL genome informatics meeting happenings relevant to DAS/2
  (Allen, Gregg, Suzi, Lincoln).
* Introduction to Apollo (Suzi)
* DAS/2 validation (Andrew)

From dalke at dalkescientific.com Wed Nov 9 20:34:28 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 10 Nov 2005 02:34:28 +0100
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To:
References:
Message-ID: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com>

Allen:
> To be even more concise, there are two use cases being presented here:
>
> 1) DAS/2 content should be viewable in a web browser, and doing so
> requires an HTTP Content-Type header to have value 'text/xml'.
>
> 2) DAS/2 content should be viewable in a specialized DAS/2 browser,
> and be able to rely on HTTP headers to determine visualization mode,
> as XML/DTD/Schema sniffing is undesirable.

A use case describes what the user wants to do, from the user's
perspective and not the implementation perspective. Sometimes they are
the same, as when the user mandates certain technical decisions, but
that's not the case here. The wikipedia has a good definition, at
http://en.wikipedia.org/wiki/Use_case .

To make use cases read nicely I've found it useful to have a name better
than "the user". There will be many users of different aspects of a DAS
system. Some are:
- a person making the database/DAS adapter
- an annotator
- a molecular biologist

The use case we're talking about here is to let person X (either an
annotator or a molecular biologist) communicate with person Y. Rather
than saying "X" and "Y" I'll say "Bill" and "Jim".

Bill sends Jim an email saying "I think there's a problem with this
annotation; it looks like it's off-by-one.
Could you take a look at it for me?" (Make up your own explanation :)

Jim gets the email, sees the URL, and pastes it into his browser. If Jim
is an annotator this will probably be a specialized DAS/2 client. If
he's not, then more likely it will be a web browser. Both should "do the
right thing", that is, provide meaningful information about the given
entity and options for more exploration and analysis.

This use case suggests several functional details:

- There needs to be a way to exchange DAS details via normal text, for
  inclusion in email. DAS uses URLs so we should build on those. This
  means they'll also likely be used in generic web pages. Because the
  specific consumer of a URL isn't known, it's not possible to put a
  "?format=" field on the end of the URL. Thus these URLs must not
  specify the format.

- DAS/2 clients (web browsers and specialized apps) should have some way
  to get (and easily get) the URL for a given annotation, region,
  feature type, etc.

- specialized DAS clients (IGB) need a way for users to enter an
  arbitrary DAS URL.

If one or more of these won't happen then there's no problem. For
example, if IGB etc. all don't support entering an arbitrary DAS URL
then there's no need to handle both classes of clients. If there's no
demand for direct visualization in a web browser then there's also no
problem.

I'm going to ask about the last. The whole point of this change is to
support the ability for a generic web browser to go to a given URL and
show something of interest.

1) Who needs that? Can any of us point to a group of people who would
use a direct web interface to a given DAS/2 URL? If so, why didn't it
come up in earlier discussions?

2) Why can't they go to a DAS/2 web app elsewhere and from there tell it
"now link in the data from this URL"? That is, view the URL through an
intermediary.
3) Why can't we tell people "stick a 'format=html' at the end to see it
in HTML", if you want to make a web link to it, and if the server
supports HTML displays?

4) Who wants to make a DAS/2 web app based directly on the DAS/2 data
structure? Yes, that makes it trivial to have a first pass web app, but
that app will suck. It'll only support browsing the server's data
structure via a tree. It won't support, say, the ability to incorporate
more or alternate records in a view, fancy AJAX GUIs, etc. There will be
no way to merge records from different servers because the annotation
server only understands annotations on that server.

My view now is that having the default MIME type for a DAS/2 entity be
"text/xml", for the purpose of supporting direct web browser
visualization of that entity, is not driven by a realistic use case and
is interesting mostly for technical reasons. As such, we shouldn't do
that. We should leave the return documents as distinct MIME types.

That leads me to the result of more research. The relevant spec for the
MIME type for XML documents is RFC 3023, at
http://www.ietf.org/rfc/rfc3023.txt

For commentary also see:
  http://www.xml.com/lpt/a/2004/07/21/dive.html
  http://diveintomark.org/archives/2004/02/13/xml-media-types

These say we have lots of things to worry about. For example, "text/xml"
requires that the content-type include the charset declaration, else the
spec says to assume the document is in US-ASCII. There is no way for the
XML itself to override that. If we go the "text/xml" route we mandate
that either:
- all servers include a charset in the content-type
- those that don't must only serve ASCII data.

The proper MIME type is under "application", as "application/x-das-*+xml":

> then the character encoding is determined in this order:
>
>  * the encoding given in the charset parameter of the Content-Type
>    HTTP header, or
>  * the encoding given in the encoding attribute of the XML declaration
>    within the document, or
>  * utf-8.
(quoting from http://www.xml.com/lpt/a/2004/07/21/dive.html )

Apparently some ISPs, e.g. in Russia and Japan, will transcode text/xml documents at the HTTP level, ignoring the encoding information in the XML itself. This can lead to problems. As the author of those commentaries says, "XML is tough."
  http://diveintomark.org/archives/2004/07/06/tough

> The solution proposed in the referenced thread, or perhaps only on a
> conference call, is to use the Content-Type header to address (1),
> providing information to web browsers, as they are less flexible than a
> specialized DAS/2 client. (2) is addressed using a DAS/2 specific
> X-Das-Content-Type header, e.g.

It must have been a conference call. I don't see mention of that in my back emails. I'm thankful to Steve for doing the writeups.

To emphasize what I said earlier, what will happen in the case of (1)? Who will implement it? What will users expect from it? Why can't those users go through some intermediate DAS web app to better view that data? Why can't we say "add a 'format=html' for interactive viewing"?

As for (2), I don't want a new header. I know I talk about conneg and other neat features in HTTP but in re-reading appendix A of RFC 3023
  http://www.ietf.org/rfc/rfc3023.txt
it talks about over a dozen other solutions to the problem and why they were excluded. These include:

> A.10 How about using a conneg tag instead (e.g., accept-features:
> (syntax=xml))?
>
> When the conneg protocol is fully defined, this may potentially be a
> reasonable thing to do. But given the limited current state of
> conneg[RFC2703] development, it is not a credible replacement for a
> MIME-based solution.

In this case I'm willing to let people experiment with the idea before baking it into the spec.

> A.9 How about a new Alternative-Content-Type header?
> > This is better than Appendix A.8, in that no extra functionality > needs to be added to a MIME registry to support dispatching of > information other than standard content types. However, it still > requires both sender and receiver to be upgraded, and it will also > fail in many cases (e.g., web hosting to an outsourced server), > where > the user can set MIME types (often through implicit mapping to file > extensions), but has no way of adding arbitrary HTTP headers. How much control will DAS/2 data providers have over their server? I know I want to support people who provide data as a set of files through Apache, though that's not driven by any use case. (This use case would involve a user who has different requirement than either Jim or Bob.) mod_mime is designed for that. I don't know how to add other headers for this case. The data providers we have now have control over all the headers. If that will essentially always be the case then adding a new header isn't a problem. Then again, if this is always the case then we can go ahead with conneg since an argument against conneg is it puts more work on the server implementations. In this too I'll be conservative - DAS/2 pushes no new ground for a web app development project; there should be no reason to invent a new header. > A.6 How about labeling with parameters in the other direction (e.g., > application/xml; Content-Feature=iotp)? > > This proposal fails under the simplest case, of a user with neither > knowledge of XML nor an XML-capable MIME dispatcher. In that case, > the user's MIME dispatcher is likely to dispatch the content to an > XML processing application when the correct default behavior should > be to dispatch the content to the application responsible for the > content type (e.g., an ecommerce engine for > application/iotp+xml[RFC2801], once this media type is registered). 
>
> Note that even if the user had already installed the appropriate
> application (e.g., the ecommerce engine), and that installation had
> updated the MIME registry, many operating system level MIME
> registries such as .mailcap in Unix and HKEY_CLASSES_ROOT in Windows
> do not currently support dispatching off a parameter, and cannot
> easily be upgraded to do so. And, even if the operating system were
> upgraded to support this, each MIME dispatcher would also separately
> need to be upgraded.

> X-DAS-Content-Type: text/x-das-feature+xml
> X-DAS-Server: GMOD/0.0
> X-DAS-Status: 200
> X-DAS-Version: DAS/2.0
> ==================
>
> This also has the added benefit of already being implemented for a few
> months. Are there objections to this solution?

Yes. Several.

When did "X-DAS-Status" come back into the picture? I thought we talked about this in spring and nixed it because it doesn't provide anything more useful than the existing HTTP-level error code. Or perhaps it was fall of last year? I think I remember raking leaves at the time.

More useful, for example, would be a document (html, xml, or otherwise) which accompanies the error response and gives more information about what occurred.

What does the "X-DAS-Server" get you that the normal "Server:" doesn't get you? What's the use case?

Why is the "X-DAS-Version" at all important? What's important is the data content. It's the document return type/version that's important and not the server version.
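For what it's worth, the RFC 3023 precedence quoted earlier reduces to a few lines of client code. This is a sketch only -- it ignores BOMs and the text/* US-ASCII default, and the function name is made up:

```python
import re

def xml_body_encoding(content_type, body):
    """Resolve the encoding of an application/*+xml response using the
    RFC 3023 precedence: Content-Type charset parameter first, then the
    encoding attribute of the XML declaration, then UTF-8."""
    # 1. The charset parameter of the Content-Type HTTP header
    m = re.search(r'charset="?([\w.-]+)"?', content_type, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 2. The encoding attribute of the XML declaration in the document
    m = re.match(rb"""<\?xml[^>]*encoding=["']([\w.-]+)["']""", body)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Default for application/*+xml
    return "utf-8"
```

Note that with "text/xml" the second step would be skipped entirely, which is exactly the trap described above.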
But I mentioned most of these over a year ago
  http://portal.open-bio.org/pipermail/das/2004-September/000814.html

In summary:
- no support for direct web browser access to a URL, except with a likely use case;
- keep the default response in an XML format
- change that XML content-type to "application/x-das-*+xml" instead of "text/*"
- have no requirement for new, DAS-specific headers

Andrew
dalke at dalkescientific.com

From allenday at ucla.edu Wed Nov 9 21:18:23 2005
From: allenday at ucla.edu (Allen Day)
Date: Wed, 9 Nov 2005 18:18:23 -0800 (PST)
Subject: [DAS2] Agenda for weekly teleconference
In-Reply-To: References: Message-ID:

Missing this week, I'm in Rio de Janeiro. I'm giving a talk on DAS tomorrow though, so I'm still contributing! :)

-Allen

On Wed, 9 Nov 2005, Chervitz, Steve wrote:
> Time & Day: 12:00 Noon PST, Thursday 11 Nov 2005
> Tel (US): 800-531-3250
> Tel (Int'l): 303-928-2693
> ID: 2879055
>
> Agenda
> ------
>
> * Decide on Europe-friendly time for this teleconference.
>   Proposals:
>   - Thu 9am PST = 12pm EST = 17:00 GMT
>   - Wed 9am PST
>   - Mon 9am PST
>
> * DAS/2 get spec issues:
>   - Content-type: text/xml vs. text/x-das-blah+xml
>     http://portal.open-bio.org/pipermail/das2/2005-November/000287.html
>
>   - XML encoding of type and feature properties:
>     http://portal.open-bio.org/pipermail/das2/2005-November/000278.html
>
> Time and people permitting:
>
> * Summarize CSHL genome informatics meeting happenings relevant to
>   DAS/2 (Allen, Gregg, Suzi, Lincoln).
>
> * Introduction to Apollo (Suzi)
>
> * DAS/2 validation (Andrew)
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2
>

From ed_erwin at affymetrix.com Thu Nov 10 13:33:58 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Thu, 10 Nov 2005 10:33:58 -0800
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com>
References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com>
Message-ID: <43739296.4030307@affymetrix.com>

Andrew Dalke wrote:
>
>> X-DAS-Content-Type: text/x-das-feature+xml
>> X-DAS-Server: GMOD/0.0
>> X-DAS-Status: 200
>> X-DAS-Version: DAS/2.0
>> ==================
>>
>> This also has the added benefit of already being implemented for a few
>> months. Are there objections to this solution?
>
> Yes. Several.
>
> When did "X-DAS-Status" come back into the picture? I thought
> we talked about this in spring and nixed it because it doesn't provide
> anything more useful than the existing HTTP-level error code. Or perhaps
> it was fall of last year? I think I remember raking leaves at the time.
>
> More useful, for example, would be a document (html, xml, or otherwise)
> which accompanies the error response and gives more information about
> what occurred.
>

Using the HTTP-level error codes can cause problems.

For a user (let's call her Varla) using IE, the browser will intercept some error codes and present her with some IE-specific garbage, throwing away any content that was sent back in addition to the error code.

Even for a user (Marla this time) using IGB, firewalls and/or caching and/or apache port-forwarding mechanisms can throw out anything with a status code in the error range.

(I did test having the NetAffx DAS server send HTTP status codes, and I did have problems with that in IGB, though I've forgotten the specifics. It was about a year ago....)

I don't care if status code is indicated with a header like "X-DAS-Status: 200" or with some XML content, or with both. But I think the HTTP status code has to be a separate thing, and will usually be "200" indicating that the user (sorry, I meant to say LeRoy) successfully communicated with the DAS server.
Ed From dalke at dalkescientific.com Thu Nov 10 14:49:18 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 20:49:18 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> Ed: > Using the HTTP-level error codes can cause problems. > I don't care if status code is indicated with a header like > "X-DAS-Status: 200" or with some XML content, or with both. But I > think the HTTP status code has to be a separate thing, and will > usually be "400" indicating that the user (sorry, I meant to say > LeRoy) successfully communicated with the DAS server. Okay, sounds like using HTTP codes for this causes problems in practice. What about returning a different content-type for that case? 200 Ok Content-Type: application/x-das-error Something bad happened. Pros: - doesn't add a new header - just as easy to detect in the client - easier to support on the server for some use cases Andrew dalke at dalkescientific.com From lstein at cshl.edu Thu Nov 10 14:34:51 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 10 Nov 2005 14:34:51 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <200511101434.51966.lstein@cshl.edu> I didn't know that X-DAS-Status had ever been deprecated. I strongly feel that the DAS status codes are separate from the HTTP codes and should not try to piggyback on the HTTP status line. 
Lincoln On Thursday 10 November 2005 01:33 pm, Ed Erwin wrote: > Andrew Dalke wrote: > >> X-DAS-Content-Type: text/x-das-feature+xml > >> X-DAS-Server: GMOD/0.0 > >> X-DAS-Status: 200 > >> X-DAS-Version: DAS/2.0 > >> ================== > >> > >> This also has the added benefit of already being implemented for a few > >> months. Are there objections to this solution? > > > > Yes. Several. > > > > When did "X-DAS-Status" come back into the picture? I thought > > we talked about this in spring and nixed it because it doesn't provide > > anything useful than the existing HTTP-level error code. Or perhaps > > it was fall of last year? I think I remember raking leaves at the time. > > > > More useful, for example, would be a document (html, xml, or otherwise) > > which accompanies the error response and gives more information about > > what occurred. > > Using the HTTP-level error codes can cause problems. > > For a user (let's call her Varla) using IE, the browser will intercept > some error codes and present her with some IE-specific garbage, throwing > away any content that was sent back in addition to the error code. > > Even for a user (Marla this time) using IGB, firewalls and/or caching > and/or apache port-forwarding mechanisms can throw out anything with a > status code in the error range. > > (I did test having the NetAffx DAS server send HTTP status codes, and I > did have problems with that in IGB, though I've forgotten the specifics. > It was about a year ago....) > > I don't care if status code is indicated with a header like > "X-DAS-Status: 200" or with some XML content, or with both. But I think > the HTTP status code has to be a separate thing, and will usually be > "400" indicating that the user (sorry, I meant to say LeRoy) > successfully communicated with the DAS server. > > Ed > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu

From ed_erwin at affymetrix.com Thu Nov 10 14:56:12 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Thu, 10 Nov 2005 11:56:12 -0800
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com>
References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com>
Message-ID: <4373A5DC.3070102@affymetrix.com>

Andrew Dalke wrote:
> Okay, sounds like using HTTP codes for this causes problems in
> practice.
>
> What about returning a different content-type for that case?
>
> 200 Ok
> Content-Type: application/x-das-error
>
> Something bad happened.
>

That seems fine to me. There is still the separate issue of whether the content is "application/x-das-error" or simply "text/xml". But that is another discussion that is already ongoing and to which I have nothing to add.

From dalke at dalkescientific.com Thu Nov 10 15:01:45 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 10 Nov 2005 21:01:45 +0100
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <200511101434.51966.lstein@cshl.edu>
References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <200511101434.51966.lstein@cshl.edu>
Message-ID: <7fd7a40582a6d8ccdc694c2a91b6f8b7@dalkescientific.com>

Lincoln:
> I didn't know that X-DAS-Status had ever been deprecated. I strongly
> feel that the DAS status codes are separate from the HTTP codes and
> should not try to piggyback on the HTTP status line.

I'm okay with the assertion "something happened at the DAS level" not being in the HTTP status code. Not ecstatic, but real world trumps purity.
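To make the proposed convention concrete, here's a sketch of the client-side check, assuming the hypothetical "application/x-das-error" type from above (the function and exception names are illustrative, not from the spec):

```python
class DasError(Exception):
    """Raised for both transport-level and DAS-level failures."""

def check_das_response(status, content_type, body):
    """Apply the two checks a client needs under the proposal above:
    the HTTP status code, then the returned content type."""
    if status != 200:
        # Transport-level failure (e.g. Apache's own 404 for a bad path)
        raise DasError("HTTP error %d" % status)
    if content_type.split(";")[0].strip() == "application/x-das-error":
        # DAS-level failure: the payload explains what went wrong
        raise DasError(body)
    return body  # a normal DAS document; hand it to the XML parser
```

The point being that this stays at two checks: nothing new is read out of the headers beyond what any HTTP client already reads.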
I don't like the idea of adding new HTTP headers for this information. In my client code I need to do the following:
- was there an HTTP error code?
- is the return content-type correct?

Having another header means I write:
- was there an HTTP error code?
- was there a DAS error code?
- is the return content-type correct?

I would rather have one less bit of code to do wrong.

As I also mentioned, I would like to support DAS annotations made available through a basic Apache install and a set of files, likely used by someone who just wants to provide annotations. This is not one of the current design goals; should it be, or should we require that everyone have more control over the server?

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Nov 10 15:10:14 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 10 Nov 2005 21:10:14 +0100
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <43739296.4030307@affymetrix.com>
References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com>
Message-ID: <81b4c8e3062e94b2032e37995f26b588@dalkescientific.com>

Ed:
> For a user (let's call her Varla) using IE, the browser will intercept
> some error codes and present her with some IE-specific garbage,
> throwing away any content that was sent back in addition to the error
> code.

Here's the question I had earlier. Will people be using a DAS/2 annotation server directly through a web browser? As far as I'm aware there's no demand for this. None of the proposals mentioned it and the current discussion started from a technical discussion at ISMB; that is, because it could, and not because it is needed.

I thought most people using IE/Moz/etc. would go to a DAS application server, which integrates views from different DAS annotation servers. All this discussion is about returning pages back from an annotation server in a form directly viewable by a web browser.
I don't see that as being useful. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Nov 10 16:45:09 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 22:45:09 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Further refining this from today's phone meeting Ed: > For a user (let's call her Varla) using IE, the browser will intercept > some error codes and present her with some IE-specific garbage, > throwing away any content that was sent back in addition to the error > code. The case Ed came across was from an in-house group using a Windows call out to IE as a background process to fetch a web page. In that case (as I understand it) it would convert HTTP error responses into its own error messages. Ed couldn't during the conversation recall if it was possible to get ahold of the error code at all. Did they have to parse the output? > Even for a user (Marla this time) using IGB, firewalls and/or caching > and/or apache port-forwarding mechanisms can throw out anything with a > status code in the error range. 404 gets through, yes? All of those are supposed to be transparent to error codes, or at the very least translate them from (say) 404 to 400. Can anyone point me to some reports of one of these mishaps? We definitely need to have some tie-ins with the HTTP error codes. Consider these two implementations for getting http://example.com/das2/genome/dazypus/1.43/ (Note the typo "dazypus" -> "dasypus") A) One system might have all "/das2" URLs forwarded to a DAS server. B) Another might have a handler only for "/das2/genome/dasypus" and let Apache do the rest. In case A) the DAS server sees that the given resource doesn't exist. It needs to return an error. 
It can return either "200 Ok" followed by a DAS error payload, or return a "404 Not Found" at the HTTP level. In case B) the request never gets to the DAS handler because of the typo. Apache sees there's nothing for the resource so returns a "404 Not Found". The client code is easier if it can check the HTTP error code and stop on failure. This means it's best for case A) for the DAS/2 server to return an HTTP error code of 404, and perhaps an optional ignorable payload. > (I did test having the NetAffx DAS server send HTTP status codes, and > I did have problems with that in IGB, though I've forgotten the > specifics. It was about a year ago....) Do you have the specifics perhaps in an old email somewhere? Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Thu Nov 10 17:43:02 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 14:43:02 -0800 Subject: [DAS2] Re: how do I load probe sets into IGB now? In-Reply-To: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> References: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Message-ID: <4373CCF6.9060508@affymetrix.com> Hi, The old DAS loading mechanism is still there, in exactly the same place it used to be: File->Load DAS Features. The new "DAS/2" tab at the bottom is for "DAS/2" servers, of which there are only a few at the moment, and which are still experimental. Ed Ann Loraine wrote: > Hi, > > Congratulations everybody on the new release of IGB! > > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. 
Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? > > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org From ed_erwin at affymetrix.com Thu Nov 10 17:49:47 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 14:49:47 -0800 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Message-ID: <4373CE8B.3000302@affymetrix.com> Andrew Dalke wrote: > Further refining this from today's phone meeting > > Ed: > >> For a user (let's call her Varla) using IE, the browser will intercept >> some error codes and present her with some IE-specific garbage, >> throwing away any content that was sent back in addition to the error >> code. > > > The case Ed came across was from an in-house group using a Windows call > out to IE as a background process to fetch a web page. In that case > (as I understand it) it would convert HTTP error responses into its own > error messages. > > Ed couldn't during the conversation recall if it was possible to > get ahold of the error code at all. Did they have to parse the output? Here is some info from microsoft about these "friendly HTTP error messages": http://support.microsoft.com/kb/q218155/ Note that whether the real error message gets through seems to depend on both the error code, and the length of the content. How is that friendly? >> (I did test having the NetAffx DAS server send HTTP status codes, and >> I did have problems with that in IGB, though I've forgotten the >> specifics. It was about a year ago....) 
> > > Do you have the specifics perhaps in an old email somewhere? > I can look around when I get back from vacation, which I'm on all next week. Ed From Gregg_Helt at affymetrix.com Thu Nov 10 17:46:23 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Thu, 10 Nov 2005 14:46:23 -0800 Subject: [DAS2] RE: how do I load probe sets into IGB now? Message-ID: That data is on a DAS/1 server. The new "Data Access" tab is just for QuickLoad and DAS/2 servers. DAS/1 servers are still accessible via the "File --> Load DAS Features" menu item. In the near term the plan is to soon move the DAS/1 access into the "Data Access" tab as a DAS/1 subtab alongside the QuickLoad and DAS/2 subtabs, but this wasn't ready in time for the current release. In the longer term the probe data will be hosted on both DAS/1 and DAS/2 servers. gregg > -----Original Message----- > From: Ann Loraine [mailto:aloraine at gmail.com] > Sent: Thursday, November 10, 2005 2:30 PM > To: das2 at portal.open-bio.org > Cc: Helt,Gregg; Erwin, Ed > Subject: how do I load probe sets into IGB now? > > Hi, > > Congratulations everybody on the new release of IGB! > > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do > it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? 
> > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org From dalke at dalkescientific.com Thu Nov 10 18:19:51 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 11 Nov 2005 00:19:51 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <4373CE8B.3000302@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> <4373CE8B.3000302@affymetrix.com> Message-ID: <0cc693a86af103c99b668e5f6db2c9e6@dalkescientific.com> > Here is some info from microsoft about these "friendly HTTP error > messages": > > http://support.microsoft.com/kb/q218155/ > > Note that whether the real error message gets through seems to depend > on both the error code, and the length of the content. How is that > friendly? Indeed. >> Internet Explorer 5 and later provides a replacement for the HTML >> template for the following friendly error messages: >> >> 400, 403, 404, 405, 406, 408, 409, 410, 500, 501, 505 I've marked them with ***. The only ones I think we might use, were we to piggyback, are 409 (for locking?), 415 (for servers that don't support a requested format) and 416 (for unsupported range requests?). 
*** 400: ('Bad request', 'Bad request syntax or unsupported method'),
    401: ('Unauthorized', 'No permission -- see authorization schemes'),
    402: ('Payment required', 'No payment -- see charging schemes'),
*** 403: ('Forbidden', 'Request forbidden -- authorization will not help'),
*** 404: ('Not Found', 'Nothing matches the given URI'),
*** 405: ('Method Not Allowed', 'Specified method is invalid for this server.'),
*** 406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with this proxy before proceeding.'),
*** 408: ('Request Time-out', 'Request timed out; try again later.'),
*** 409: ('Conflict', 'Request conflict.'),
*** 410: ('Gone', 'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable', 'Cannot satisfy request range.'),
    417: ('Expectation Failed', 'Expect condition could not be satisfied.'),
*** 500: ('Internal error', 'Server got itself in trouble'),
*** 501: ('Not Implemented', 'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service temporarily overloaded', 'The server cannot process the request due to a high load'),
    504: ('Gateway timeout', 'The gateway server did not receive a timely response'),
*** 505: ('HTTP Version not supported', 'Cannot fulfill request.'),

> I can look around when I get back from vacation, which I'm on all next
> week.

Enjoy!
Andrew dalke at dalkescientific.com From aloraine at gmail.com Thu Nov 10 17:29:48 2005 From: aloraine at gmail.com (Ann Loraine) Date: Thu, 10 Nov 2005 16:29:48 -0600 Subject: [DAS2] how do I load probe sets into IGB now? Message-ID: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Hi, Congratulations everybody on the new release of IGB! I have a question about the new Quickload/DAS tab. I'm trying to load some probe sets via DAS but can't figure out how to do it. I used to be able to get them by using the "DAS" menu item, which opened a widget containing a menu of DAS servers. I would select the one labeled AffyDas (or something like that) and then I would get to pick the chip (more often, chips) I wanted to see. Then IGB would query the server and get me the probe set design sequence alignments for the currently-shown region. I can't find this in the new interface. Can you help? -Ann -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From allenday at ucla.edu Thu Nov 10 20:39:36 2005 From: allenday at ucla.edu (Allen Day) Date: Thu, 10 Nov 2005 17:39:36 -0800 (PST) Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> Message-ID: > What does the "X-DAS-Server" get you that the normal "Server:" doesn't > get you? What's the use case? I don't know. The absence of this header was actually reported by Dasypus output sent to me by you on May 26, 2005. Here's a snippet of the Dasypus diagnostics, followed by a comment from you: "Date: Thu, 26 May 2005 12:29:32 -0600 From: Andrew Dalke To: DAS/2 Subject: [DAS2] dasypus status [...] WARNING: Adding X-DAS-Server header 'gmod/0.0' The prototype doesn't mention the DAS server used. I stick one in based on the host name. 
[...]"

> Why is the "X-DAS-Version" at all important? What's important is the
> data content. It's the document return type/version that's important
> and not the server version.

It was actually originally (as far as I can tell from my email archive) discussed, along with X-DAS-Status, in an email from Lincoln on May 21, 2004, and forwarded to me on August 12, 2004:

"-----Original Message-----
From: Lincoln Stein [mailto:lstein at cshl.edu]
Sent: Friday, May 21, 2004 1:22 PM
To: edgrif at sanger.ac.uk; Gregg_Helt at affymetrix.com; avc at sanger.ac.uk; gilmanb at mac.com; dalke at dalkescientific.com
Cc: lstein at cshl.edu; allen.day at ucla.edu
Subject: DAS/2 notes

[...]
In addition to the standard HTTP response headers, DAS servers return the following HTTP headers:
X-DAS-Version: DAS/2.0
X-DAS-Status: XXX status code
[...]"

> But I mentioned most of these over a year ago
> http://portal.open-bio.org/pipermail/das/2004-September/000814.html
>
> In summary:
> - no support for direct web browser access to a URL, except with a
>   likely use case;
> - keep the default response in an XML format
> - change that XML content-type to "application/x-das-*+xml" instead
>   of "text/*"
> - have no requirement for new, DAS-specific headers

This discussion suggests we need a more formal process of modifying the client and server implementations, e.g. modify spec first and commit, then update code.

-Allen

From td2 at sanger.ac.uk Fri Nov 11 04:24:52 2005
From: td2 at sanger.ac.uk (Thomas Down)
Date: Fri, 11 Nov 2005 09:24:52 +0000
Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> Message-ID: <8C869723-601C-4236-B9FA-88F6D6401016@sanger.ac.uk> On 10 Nov 2005, at 19:49, Andrew Dalke wrote: > Ed: > >> Using the HTTP-level error codes can cause problems. >> > > >> I don't care if status code is indicated with a header like >> "X-DAS-Status: 200" or with some XML content, or with both. But I >> think the HTTP status code has to be a separate thing, and will >> usually be "400" indicating that the user (sorry, I meant to say >> LeRoy) successfully communicated with the DAS server. >> > > Okay, sounds like using HTTP codes for this causes problems in > practice. > > What about returning a different content-type for that case? > > 200 Ok > Content-Type: application/x-das-error > > > Something bad happened. > That looks reasonable, but could we add a bit of structure: 407 The sky is falling (There's also a possible argument for using textual, rather than numeric, error codes -- but it would be good to keep at least one part of the error response using a well-defined vocabulary for the benefit of clients that want to respond to different error conditions in different ways). Thomas. From Steve_Chervitz at affymetrix.com Fri Nov 11 16:24:50 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 11 Nov 2005 13:24:50 -0800 Subject: [DAS2] how do I load probe sets into IGB now? In-Reply-To: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Message-ID: Ann, Go to File -> Load DAS Features. There should be a DAS server named 'NetAffx-Align' that will give you what you want. Steve > From: Ann Loraine > Date: Thu, 10 Nov 2005 16:29:48 -0600 > To: > Cc: , "Helt,Gregg" > Subject: [DAS2] how do I load probe sets into IGB now? > > Hi, > > Congratulations everybody on the new release of IGB! 
> > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? > > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Fri Nov 11 19:51:41 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 11 Nov 2005 16:51:41 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 10 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 10 Nov 2005. $Id: das2-teleconf-2005-11-10.txt,v 1.1 2005/11/12 00:48:39 sac Exp $ Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt UCLA: Brian O'connor CSHL: Lincoln Stein UCBerkeley: Suzi Lewis Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org Agenda Items ------------ * New Euro-friendly meeting time It was decided to change the time for this weekly teleconference to Monday 9:30 AM PST (12:30 PM EST, 17:30 UK). [A] New teleconf time starts next week (Monday 14 Nov) * Spec Issues Gregg expressed a need to dedicate some of these weekly meetings to be focused on resolving spec issues. We will do this for next week's meeting. 
[A] Everyone come prepared to talk about retrieval spec issues on 11/14. Content-type issue: - Should we use text/xml or application/x-das-blah+xml? - Consensus: use application/x-das-blah+xml - [A] Steve will roll back changes made to the retrieval spec. - Andrew acknowledges that text/xml may be handy for visual debugging and other presentation tricks, but is not a user-driven need; it's a technical issue. - Lincoln: XML handling is very browser-dependent: o Firefox - nice DOM tree structure o Safari, Konqueror - no special rendering o MSIE - "Cannot be displayed" - Gregg: Now we just need to ensure that we're actually implementing the correct content-type for given responses, which brings up the next topic... * Validation - Gregg: we'd like to start using dasypus locally to verify client/server compliance with the spec. What state is it in? - Andrew: Just getting back to it now. [A] Andrew will talk with Chris D. to set up a web interface at biodas.org * Apollo Suzi: Can't talk about Apollo now. Will wait until Nomi is available. [A] Nomi will present Apollo at the 28 Nov DAS/2 weekly meeting. Status Reports -------------- Gregg: * CSHL Genome Informatics meeting summary of DAS/2-relevant things. - Gave talk about DAS/2 and demoed IGB. Went well. - Held a DAS BOF that was well-attended (n=15). Questions people had about DAS/2 have already been addressed. [A] Gregg will write up his CSHL DAS BOF notes and post. Discussion centered around what Sanger & EBI are doing with DAS. o There are lots of DAS-related projects there. o We'd like to have tighter linkage between DAS folks in the States and those in the UK. [A] Andrew will visit the UK DAS folks more often.
Ideas: + Help them transition to DAS/2 + Hold "DASathon" or jamboree there o People: Tim Hubbard, Thomas Down, Andreas Prlic o Projects: + Serving up 3D structures using modified DAS/1 server (SPICE) + Serving up protein annotations using modified DAS/1 server + Registry & discovery system for DAS/1 server. This is SOAP-based. We'd like to have a non-SOAP-based system for DAS/2, which follows REST principles. - Andreas could likely create an HTTP-based alternative to his SOAP system, which uses the same core. - [A] Andrew will talk with Andreas P about non-SOAP reg/discovery - [A] DAS/2 grant needs progress on reg/discovery w/in next 6 mos * Grant (DAS/2 continuation) Lots of modifications were made just prior to submitting on 1 Nov. Some of the changes include: - Work closely with Sanger and EBI where they've done lots of work (3D structure and protein DAS). - More of a mechanism will be in place to drive the spec forward: o Andrew = designated 'spec czar' - makes ultimate decisions o Lincoln = designated 'spec godfather' - retains veto power Andrew: * Brought up the header issue from the spec discussion on the list this week. - Doesn't like the idea of 4 additional DAS-specific fields (error code, das version, server name, and something else) - Alternative: server returns content-type: application/x-das-error - Advantages: o no new header o simplified handling -- just check the HTTP error code and the content-type. o easier to implement o enables a flatfile-based server o Fits with REST philosophy of using HTTP as an application protocol, not a transport protocol. - Ed E: Can't we just return an error section in the document? Andrew: We could, but it requires parsing the document and only works for XML formats that we're in control of. - Gregg: The advantages of having metadata in the header outweigh the advantages of enabling a flatfile-based server.
Andrew: We can utilize the existing header Ed E: Piggybacking error codes causes problems with proxy servers (see email on the DAS/2 discussion list). - Decision: [A] Use standard HTTP error codes; use XML to specify error details. E.g., the server returns an HTTP error status with an XML error document as the content. Steve: When reviewing the spec, encountered potential issues surrounding the relationship between HTTP and DAS-specific error codes. Using standard HTTP codes will obviate this issue. Also noted that there's a bugzilla entry regarding error codes (which is now moot): http://bugzilla.open-bio.org/show_bug.cgi?id=1784 - Ed E: MSIE hides or modifies content based on certain HTTP error codes it gets. This has important implications on Windows platforms where IE's behavior can get in the way of other network-aware applications that don't even (knowingly) use IE. From Steve_Chervitz at affymetrix.com Fri Nov 11 20:52:15 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 11 Nov 2005 17:52:15 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 10 Nov 05 In-Reply-To: Message-ID: > Content-type issue: > - Should we use text/xml or application/x-das-blah+xml? > - Consensus: use application/x-das-blah+xml > - [A] Steve will roll back changes made to the retrieval spec. Done, but I noticed that we had been using text/x-das-blah+xml rather than application/x-das-blah+xml. I left it as text for now, although 'application' seems more correct according to the RFC on MIME media types, http://www.rfc-editor.org/rfc/rfc2046.txt which states: text -- textual information. ... Other subtypes [i.e., anything besides 'plain'] are to be used for enriched text in forms where application software may enhance the appearance of the text... application -- some other kind of data, typically either uninterpreted binary data or information to be processed by an application. ...
Steve From dalke at dalkescientific.com Mon Nov 14 06:47:09 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 14 Nov 2005 12:47:09 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: Steve: > I raised some other issues regarding types and feature properties etc. > a > couple of weeks ago that I'd like you to chime in on: > http://portal.open-bio.org/pipermail/das2/2005-October/000271.html > > The latest message on this thread is: > http://portal.open-bio.org/pipermail/das2/2005-November/000278.html I'll take them part by part. That last message suggested 29 2 * the values of the 'das:id', 'das:type', and 'das:ptype' attributes > are URLs relative to xml:base unless they begin with 'das:prop#', in > which case they are relative to the das:prop namespace. And from what I can tell about XML, there's no standard way to implement this using one of the standard XML parsers. How do you get the das:prop namespace for a given element? The parser often does the expansion for you. Eg, in one of the Python XML parsers it does the translations into Clark notation, like {http://www.biodas.org/ns/das/genome/2.00}ptype For more info on XML namespaces, see http://www.jclark.com/xml/xmlns.htm Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Mon Nov 14 08:29:26 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 14 Nov 2005 13:29:26 +0000 Subject: [DAS2] Re: what info is needed for DAS/2 registration? In-Reply-To: <955da4ae7783e60944687d86ec691e51@dalkescientific.com> References: <955da4ae7783e60944687d86ec691e51@dalkescientific.com> Message-ID: <81fdf1e73ee85ae55550f12ddcee13cf@sanger.ac.uk> Hi Andrew! > Looks like I will be more involved with the DAS/2 spec development, > and I'll be visiting the UK more often. good! > I want to make sure that the spec includes more of what's > needed for registration. o.k. 
very good, let's go through your mail: > My thought is to let the registration > system be able to query the DAS/2 server to get most of the fields > it needs, if not all. o.k. > There may still be some need to override the > definitions, The experience from doing the das1 registry tells that some corrections are needed every now and then. It seems to be inevitable that sometimes users make mistakes / inaccuracies, etc. > so at the manual registration level this will be used > more to pre-populate an entry with a default. sounds good. - so this means the configuration for setting up a DAS source will get a little bigger. > In looking at the manual registration page I see the following, > along with comparisons to the existing DAS/2 spec > > ** Title/Nickname used by DAS clients for the display of the das tracks > ** Description for the user to get a quick grasp what the data is about. - we have 60 sources in the registry by now and we expect to be up around 100 soon, so one needs a way to learn which of the sources are serving the data which is of particular interest ... > ** URL for more detailed description a link back to the homepage of the project that provides the data > > DAS/2 does not have this information for the service as a whole. > It does have it for each of the databases, somewhat. Here is > an example from the spec. > > taxon="http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118" > > doc_href="http://www.wormbase.org/documentation/users_guide/ > volvox.html" > > > > Should we add a "title" field to each data source? yes that would be good > Should we > add title/description/url fields to the DAS/2 service as a whole? not sure what you mean by that > ** coordinate system > > Each data source may have 1 or more versions. The version information > looks like > > > > > > In theory that assembly id could be a URL with more detailed > information about the assembly. Right now it's used as a unique > identifier. 
There is nothing there to convert these URLs into something human-readable. Hm, not sure if I am completely convinced by representing a coordinate system as a URL. What if two reference servers provide the same assembly or are mirrors of each other? I would see it in a way where a DAS client would ask the registry "where are all the reference servers for NCBI 35- homo sapiens?" and then gets a list providing e.g. an American and a European mirror server; the client could choose the one which is geographically closer. > > Possible solutions for this are: > - define an "assembly" document, to be put at that URL and > include the authority/version/type/organism data mentioned at > http://das.sanger.ac.uk/registry/help_coordsys.jsp something like that. > ** DAS url > > Yep, DAS/2 has that one. :) :-) > > ** Admin email > > Hmm. Yeah, there should be more information about the service as > a whole. Admin email and perhaps a documentation href, eg, with > information about planned downtime. would be good. > > ** DAS capabilities > > That's handled differently in DAS/2. Did people really use this > information? actually this information is important (for das1) - it is used to distinguish reference servers and annotation servers (on the client side) and needed for validation (on the registry side). "Capabilities" are also related to data-types. E.g. a genome DAS client does not need to query a protein structure, because it can not do 3D... > ** Test access/ segment code labels I think there is a misunderstanding here: the test code is not a "label". The test code is e.g. a chromosomal segment or an accession code for a protein database for which annotations are returned if a feature request is being made. The "label" is used mainly to describe by which project a source is being funded. >> We are currently discussing if the labels should be used to describe >> a DAS source in more detail. e.g. "experimentally verified", >> "computational prediction", etc.
> > These are two different things in one field. Yes, you are very right. Together with the BioSapiens DAS people we recently decided that there should be the possibility to assign gene-ontology evidence codes to each das source, so in the next update of the registry, this will be changed. > > What I'm going to propose is a generic key/value data structure > for just about all records. Some of the key names will be well > defined. Others can add new fields to experiment with / extend > the spec in a semi-constrained fashion. This would let people > try out a new property easily. sounds good. > In summary it sounds like DAS/2 needs: > - a few more pieces of meta data (eg, information about the > service as a whole) > - a bit better defined way to get information about the > reference assembly > I would agree to both of those. Greetings, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Nov 14 12:09:11 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 14 Nov 2005 09:09:11 -0800 Subject: [DAS2] DAS/2 teleconference at 9:30 AM today PST Message-ID: Just a reminder that we've rescheduled the weekly DAS/2 teleconference for Mondays @ 9:30 AM Pacific time, starting today. I'm hoping the new time will give more people a chance to participate. Teleconference numbers: US dialin: 800-531-3250 International dialin: 303-928-2693 Conference ID: 2879055 We're also revising the format to focus, on alternating weeks, on either the DAS/2 specification itself or implementations of the specification. This should allow people who are mainly concerned about one or the other to avoid extra overhead. Today we will focus on spec issues.
thanks, Gregg Helt From lstein at cshl.edu Mon Nov 14 12:23:18 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 14 Nov 2005 12:23:18 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> References: <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Message-ID: <200511141223.19367.lstein@cshl.edu> Well, I give up arguing this one and will go with the way Andrew wants to do it. Therefore I propose the following rules: 1) Return the HTTP 404 error for the case that any component of the DAS2 path is invalid. This would apply to the following situations: Bad namespace Bad data source Unknown object ID 2) Return HTTP 301 and 302 redirects when the requested object has moved. 3) Return HTTP 403 (forbidden) for no-lock errors. 4) Return HTTP 500 when the server crashes. For all errors there should be a text/x-das-error entity returned that describes the error in more detail. Lincoln On Thursday 10 November 2005 04:45 pm, Andrew Dalke wrote: > Further refining this from today's phone meeting > > Ed: > > For a user (let's call her Varla) using IE, the browser will intercept > > some error codes and present her with some IE-specific garbage, > > throwing away any content that was sent back in addition to the error > > code. > > The case Ed came across was from an in-house group using a Windows call > out to IE as a background process to fetch a web page. In that case > (as I understand it) it would convert HTTP error responses into its own > error messages. > > Ed couldn't during the conversation recall if it was possible to > get ahold of the error code at all. Did they have to parse the output? > > > Even for a user (Marla this time) using IGB, firewalls and/or caching > > and/or apache port-forwarding mechanisms can throw out anything with a > > status code in the error range. > > 404 gets through, yes? 
> > All of those are supposed to be transparent to error codes, or at the > very least translate them from (say) 404 to 400. > > Can anyone point me to some reports of one of these mishaps? > > We definitely need to have some tie-ins with the HTTP error codes. > Consider these two implementations for getting > > http://example.com/das2/genome/dazypus/1.43/ > > (Note the typo "dazypus" -> "dasypus") > > A) One system might have all "/das2" URLs forwarded to a DAS server. > > B) Another might have a handler only for "/das2/genome/dasypus" and > let Apache do the rest. > > In case A) the DAS server sees that the given resource doesn't exist. > It needs to return an error. It can return either "200 Ok" followed > by a DAS error payload, or return a "404 Not Found" at the HTTP level. > > In case B) the request never gets to the DAS handler because > of the typo. Apache sees there's nothing for the resource so returns > a "404 Not Found". > > The client code is easier if it can check the HTTP error code and > stop on failure. This means it's best for case A) for the DAS/2 > server to return an HTTP error code of 404, and perhaps an optional > ignorable payload. > > > (I did test having the NetAffx DAS server send HTTP status codes, and > > I did have problems with that in IGB, though I've forgotten the > > specifics. It was about a year ago....) > > Do you have the specifics perhaps in an old email somewhere? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 14 12:28:10 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 14 Nov 2005 12:28:10 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: <200511141228.11358.lstein@cshl.edu> On Monday 14 November 2005 06:47 am, Andrew Dalke wrote: > Steve: > > I raised some other issues regarding types and feature properties etc. > > a > > couple of weeks ago that I'd like you to chime in on: > > http://portal.open-bio.org/pipermail/das2/2005-October/000271.html > > > > The latest message on this thread is: > > http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > > I'll take them part by part. > > That last message suggested > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" > > das:prop="http://www.biodas.org/ns/das/genome/2.00/properties"> > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > > > > I couldn't figure out why the "das:" namespace was needed for the > attributes. Why can't they be in the default namespace? The extra das: prefix is not needed since it is the same namespace as the default namespace. My feeling is that we should be using namespaces in attribute names but not in attribute values (e.g. das:ptype is ok, but "das:prop#phase" is not OK). For attribute values we should be using URIs consistently. Lincoln > The "das:" in the value of an attribute doesn't know anything about > the currently defined namespaces. So this "das:" must be something > completely different from the xmlns:das=... definition.
> > > * the values of the 'das:id', 'das:type', and 'das:ptype' attributes > > are URLs relative to xml:base unless they begin with 'das:prop#', in > > which case they are relative to the das:prop namespace. > > And from what I can tell about XML, there's no standard way to implement > this using one of the standard XML parsers. How do you get the das:prop > namespace for a given element? The parser often does the expansion > for you. Eg, in one of the Python XML parsers it does the translations > into Clark notation, like > > {http://www.biodas.org/ns/das/genome/2.00}ptype > > For more info on XML namespaces, see http://www.jclark.com/xml/xmlns.htm > > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Nov 14 12:30:07 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 14 Nov 2005 18:30:07 +0100 Subject: [DAS2] Spec issues In-Reply-To: References: Message-ID: <05b94e3a6db3e4894af051f22f25dc4c@dalkescientific.com> On Nov 4 Steve wrote: > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > I think we're missing something. This is XML. We can do 29 2 This message brought to you by AT&T The whole point of having namespaces in XML is to keep from needing to define new namespaces like . In doing that, there's no problem in supporting things like "bg:glyph", etc. because the values are expanded as expected by the XML processor. 
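[The namespace expansion discussed above can be seen directly with a standard parser. A minimal sketch, assuming a simplified stand-in for the thread's FEATURE examples (the fragment below is hypothetical, not taken from the spec): Python's xml.etree.ElementTree expands element and attribute *names* into Clark notation, but leaves attribute *values* as literal strings.]

```python
import xml.etree.ElementTree as ET

# Hypothetical DAS/2-like fragment, for illustration only.
doc = """<das:FEATURE
             xmlns:das="http://www.biodas.org/ns/das/genome/2.00"
             das:type="type/curated_exon"/>"""

root = ET.fromstring(doc)

# Element and attribute names come back expanded into Clark notation:
print(root.tag)
# -> {http://www.biodas.org/ns/das/genome/2.00}FEATURE
print(list(root.attrib))
# -> ['{http://www.biodas.org/ns/das/genome/2.00}type']

# ...but attribute values pass through untouched: a value like
# "das:prop#phase" would arrive as that literal string, with no standard
# way to resolve the "das:" prefix against the in-scope declarations.
print(root.attrib["{http://www.biodas.org/ns/das/genome/2.00}type"])
# -> type/curated_exon
```

[This is the asymmetry behind Andrew's objection: a convention like "das:prop#phase" inside an attribute value would have to be resolved by DAS-specific code, not by the XML parser.]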
> Also, we might want to allow some controlled vocabulary terms to be > used for > the value of type.source (e.g., "das:curated"), to ensure that > different > users use the same term to specify that a feature type is produced by > curation. I talked with Andreas Prlic about what other metadata is needed for the registry system. He mentioned Together with the BioSapiens DAS people we recently decided that there should be the possibility to assign gene-ontology evidence codes to each das source, so in the next update of the registry, this will be changed. That's at the source level, but perhaps it's also needed at the annotation level. > The spec also seems alarmed by the existence of a xml:base attribute > in the > TYPE element. The idea is that any relative URL within this element > would be > resolved using that element's xml:base attribute. How would folks be > with > having the DAS/2 spec fully support the XML Base spec ( > http://www.w3.org/TR/xmlbase/ )? The result of this would be to add an > optional xml:base attribute to all elements that contain URLs or > subelements > with URLs. In my reading it seems that xml:base should be included wherever. See http://norman.walsh.name/2005/04/01/xinclude > Ugh. In the short term, I think there's only one answer: update your > schemas to allow xml:base either (a) everywhere or (b) everywhere you > want XInclude to be allowed. I urge you to put it everywhere as your > users are likely to want to do things you never imagined. ? > > Description: Properties are typed using the ptype attribute. The value > of > the property may be indicated by a URL given by the href attribute, or > may > be given inline as the CDATA content of the section. 
> > type="type/curated_exon"> > 29 > 2 > href="/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > So in contrast to the TYPE properties which are restricted to being > simple > string-based key:value pairs, FEATURE properties can be more complex, > which > seems reasonable, given the wild world of features. We might consider > using > 'key' rather than 'ptype' for FEATURE properties, for consistency with > TYPE > prop elements (however, read on). My thoughts on these are: - come up with a more consistent way to store key/value data - the Atom spec has a nice way to say "the data is in this CDATA as text/html/xml" vs. "this text is over there". I want to copy its way of doing things. - I'm still not clear about xlink. Another is the HTML-style link element; Atom uses "rel=" to encode information about the link. For example, the URL to edit a given document is given by such a link. See http://atomenabled.org/developers/api/atom-api-spec.php Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Nov 14 14:29:22 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 14 Nov 2005 11:29:22 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 14 Nov 2005. $Id: das2-teleconf-2005-11-14.txt,v 1.2 2005/11/14 19:20:37 sac Exp $ Attendees: Affy: Steve Chervitz, Gregg Helt CSHL: Lincoln Stein UCBerkeley: Suzi Lewis Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org ---------------------------------- AD talked with A. Prlic about registry service; we want to incorporate what he needs within DAS/2.
What they have: - name (a few words) - for display of das track - title, description (paragraph) - synopsis - url for more info we have desc, id, doc_href, taxon Therefore, we need name attribute Need: - name (mandatory) (done - LS: adding it to spec now) - desc (optional) Coord system reg server: * in das/2 - it's not optional (0 interbase) * they find this important We have confusion between assembly and reference server LS: Need URI that points to assembly, independent of the reference server. GH: Would like to have annot servers that don't know anything about the ref server. LS: Could use the region URI to ID the assembly das/genome/sourceid/region = assembly id/uri GH: The trouble is that NCBI is a ref source for many assemblies, yet they lack a das server. They have no URI. LS: we can just make one up, or use the most appropriate web page LS: When you request versioned source from a server, it should say what assembly coords it's working on and give a uri for that. In this case there's no guarantee you can do a 'get' on that URI. We want to say: 1- what is unique uri for assembly (everyone agrees to share this) 2- das URL for how to fetch it (some server's region url - trusted, faithful copy with what is at ncbi). Diff servers could assert that you can fetch it from various places. GH: assembly could be an attribute since there'd be only one. A list of ref servers that serve up that dna. LS: in versioned source response. new section between capabilities and namespaces called 'reference_sources'. Add 'assembly' attribute to version element: Message-ID: Andrew Dalke wrote on 14 Nov 05: > Steve: >> I raised some other issues regarding types and feature properties etc. >> a >> couple of weeks ago that I'd like you to chime in on: >> http://portal.open-bio.org/pipermail/das2/2005-October/000271.html >> >> The latest message on this thread is: >> http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > > I'll take them part by part.
> > That last message suggested > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" > > das:prop="http://www.biodas.org/ns/das/genome/2.00/properties"> > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > > > > I couldn't figure out why the "das:" namespace was needed for the > attributes. Why can't they be in the default namespace? Attributes don't have a default namespace (though one might think such a thing would be useful). See http://www.w3.org/TR/REC-xml-names/#defaulting This is a point which has been subject to much consternation: http://www.rpbourret.com/xml/NamespacesFAQ.htm#q5_3 http://lists.xml.org/archives/xml-dev/200002/msg00094.html > The "das:" in the value of an attribute doesn't know anything about > the currently defined namespaces. So this "das:" must be something > completely different from the xmlns:das=... definition. No, it refers to the xmlns:das definition in the parent FEATURES element. >> * the values of the 'das:id', 'das:type', and 'das:ptype' attributes >> are URLs relative to xml:base unless they begin with 'das:prop#', in >> which case they are relative to the das:prop namespace. > > And from what I can tell about XML, there's no standard way to implement > this using one of the standard XML parsers. How do you get the das:prop > namespace for a given element? You've identified the key weakness of my proposal: Knowing how to expand 'das:prop' occurring within attribute values would be a DAS-specific convention ('hack') for mapping to a controlled vocabulary for property values. So I'm not quite satisfied with this either. In another message of yours today, you propose an alternative to this: http://portal.open-bio.org/pipermail/das2/2005-November/000313.html See my reply to that for more ideas on this topic. 
Steve From td2 at sanger.ac.uk Tue Nov 15 04:14:01 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Tue, 15 Nov 2005 09:14:01 +0000 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: <21CB947F-FAE3-4D56-A110-CAB9606C9C84@sanger.ac.uk> On 14 Nov 2005, at 19:29, Steve Chervitz wrote: > > Coord system reg server: > * in das/2 - it's not optional (0 interbase) > * they find this important By "coordinate system" we're not really talking about the 0-based vs 1-based issue, we're talking about globally unique names for sets of reference sequences (genome assemblies, protein databases, whatever). It might be possible to come up with a better name (I used to call these "namespaces"). > We have confusion between assembly and reference server > LS: Need URI that points to assembly, independent of the > reference server. > GH: Would like to have annot servers that don't know anything about > the ref server Definitely agree with this. This kind of "opaque assembly identifier" is what we've been calling a coord-system name. > LS: Could use the region URI to ID the assembly > das/genome/sourceid/region = assembly id/uri > > GH: The trouble is that NCBI is a ref source for many assemblies, yet > they lack a das server. They have no URI. > LS: we can just make one up, or use most appropriate web page This is possibly an argument for avoiding the use of URLs for assembly identifiers, if we can't be sure that the organisation that's the authority for a given assembly will be running an authoritative DAS server. URNs would be fine, as would the kind of structured but location-independent identifier that Andreas has been using. > Question: What do they mean by 'coord system'? some confusion here > e.g., Do they mean things like: 'this assembly start at 5000 relative > to this other assembly'? I think the way to provide this kind of information is in the form of a DAS alignment service between two coord-systems.
We love the idea of putting up alignments between NCBI34 and NCBI35 and then having a liftover-like tool which can go off and query the registry to discover this. Thomas. From ap3 at sanger.ac.uk Tue Nov 15 05:24:45 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 15 Nov 2005 10:24:45 +0000 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: Hi! I realized there were a couple of questions regarding the way "coordinate systems" are defined in the DAS-registry, so it would have been good if I had joined yesterday.... I am glad that the conference is now at a time which is better for us Europeans and want to join in the future for some of the topics like registry, coordinate systems, proteins, etc. > > AD: ebi/sanger tracks three fields related to assembly (what they need > per server): > -authority = equiv to our assembly uri > -organism = we have as taxon > -type = ? "type" refers to a "physical dimension" of an object. E.g. a chromosome, a 3D protein structure, a protein sequence. > > Permits people to query things like: find out all servers that offer > ncbi > build 35 for human. > > Question: What do they mean by 'coord system'? some confusion here > e.g., Do they mean things like: 'this assembly start at 5000 relative > to this other assembly'? no, as Thomas already mentioned these "coordinate systems" could also be called "namespace". They should be globally unique descriptors for reference objects / databases. > > For protein DAS, authority typically defines two diff coord systems: > 'pdb resnum, interprot' > It does not permit automated translation between two coord systems. unfortunately this is not that easy in protein space. The mapping from the 3D protein structure to the protein sequence is not straightforward. Think of negative, non-consecutive, and "non-numeric" residue numbers that can appear in the 3D structures.
Therefore we came up with the "alignment" DAS document, which allows mapping one object in one coordinate system to another one. It can also be used to map one assembly to another. > [A] - Andrew will find out what they use it for > > AD: Believes the purpose is intended for human consumption. Not only - the DAS clients usually can display a certain "coordinate system", e.g. Ensembl can do chromosomal ones, but if DAS sources are available that speak the "UniProt, Protein Sequence" coordinate system, it knows how to project these onto the genome - an "intelligent DAS client" :-) Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Wed Nov 16 21:35:32 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 17 Nov 2005 03:35:32 +0100 Subject: [DAS2] (x)link Message-ID: I mentioned having a generic tag, again based on Atom. Steve replied: > Not sure about this one yet. In the Atom API, the value of the rel > attribute is restricted to a controlled vocabulary of link > relationships and available services pertaining to editing and > publishing syndicated content on the web: > http://atomenabled.org/developers/api/atom-api- > spec.php#rfc.section.5.4.1 > > What would a controlled vocab for DAS resources be? I don't think I understand the Atom one. Turns out I was actually looking at the Atom publishing protocol at http://code.blogger.com/archives/atom-docs.html which defines links including: The service.post is the URI where you would send an Entry to post to your blog. The service.feed is the URI where you would make an Atom API request to see the Blog's latest entries.
We could define similar links like: - where to edit and/or lock the given resource - how to get a list of locks - how to get from the given DAS resource to its parent (i.e., how to go "up" in the tree, in the case of a cross-link from another server) These could be done as distinct elements or done as qualifications of an existing element. The advantage of the latter (using a ) is that others may add their own link types. > Skimming through the DAS/2 retrieval spec, our use of hrefs is > simply for pointing at the location of resources on the web > containing some specified content (e.g., documentation, database > entry, image data, etc.). But they are used in different contexts (for human browsing, for machine fetching, for "service" requests). > The next/prev/start idea for Atom might have good applicability in the > DAS world for iterating through versions of annotations or assemblies > (e.g., rel='link-to-gene-on-next-version-of-genome'). One relationship > that would be useful for DAS would be 'latest', to get the latest > version of an annotation. Hmm. So every annotation would have an optional section? In the current scheme do we always get the most recent version of an annotation? I didn't realize there was any way to get another version, except if it's been edited while you weren't looking. > DAS get URLs themselves seem fairly self-documenting (it's clear a > given link is for feature, type, or sequence for example), so having a > separate rel attribute may not provide much additional value for these > links. But it might be handy for versioning and for DAS/2 writebacks. I hadn't thought of versioning; I was thinking more of writebacks and how to find the parent. I was also thinking of structure data where I might want the experimental x-ray density data for a given structure. That might be done like ... That's part of the newly submitted DAS proposal so should not really drive this work. Steve also mentioned xlink.
I've been looking at the spec but still don't understand its implications. There are several^H^Hmany parts to the spec I don't understand, especially in the context of DAS. locator? "arcrole"? "actuate"? Are all our links "simple"? Do we use anything else besides the href? Also, I see no mention in that spec of content-type. One of the things in the Atom spec is support (though not in the spec proper) for alternate or multiple ways to resolve a link or multiple formats (That is, a may contain subelements and these subelements, if in something other than the "das" namespace, are free to add variant meanings.) Andrew dalke at dalkescientific.com From ilari.scheinin at helsinki.fi Fri Nov 18 10:22:47 2005 From: ilari.scheinin at helsinki.fi (Ilari Scheinin) Date: Fri, 18 Nov 2005 17:22:47 +0200 Subject: [DAS2] Getting individual features in DAS/1 Message-ID: This mail is not really about DAS/2, but the web site says the original DAS mailing list is now closed. I am setting up a DAS server that serves CGH data from my database to visualization software, which in my case is gbrowse. I've already set up Dazzle that serves the reference data from a local copy of Ensembl. I need to be able to select individual CGH experiments to be visualized, and as the measurements from a single CGH experiment cover the entire genome, this cannot of course be done by specifying a segment along with the features command. I noticed that there is a feature_id option for getting the features in DAS/1.5, but on a closer look, it seems to work by getting the segment that the specified feature corresponds to, and then getting all features from that segment. My next approach was to use the feature type to distinguish between different CGH experiments. As all my data is of the type CGH, I thought that I could spare this piece of information for identifying purposes. First I tried the generic seqfeature plugin. I created a database for it with some test data.
However, getting features by type does not seem to work. I always get all the features from the segment in question. Next I tried the LDAS plugin. Again I created a compatible database with some test data. I must have done something wrong with the data file I imported into the database, because getting the features does not work. I can get the feature types, but trying to get the features gives me an ERRORSEGMENT error. I thought that before I go further, it might be useful to ask whether my approach seems reasonable, or is there a better way to achieve what I am trying to do? What should I do to be able to visualize individual CGH profiles? I'm grateful for any advice, Ilari From ap3 at sanger.ac.uk Fri Nov 18 11:54:27 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Fri, 18 Nov 2005 16:54:27 +0000 Subject: [DAS2] das registry and das2 Message-ID: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> Hi! I would like to start a discussion of how to provide a proper DAS interface for our das-registration server at http://das.sanger.ac.uk/registry/ Currently it is possible to interact with it using SOAP, or manually via the HTML interface. We should also make it accessible using URL requests. To get this started I would propose the following query syntax. This might also provide another opportunity to have a discussion about the coordinate system descriptions. If some of the used terms are unclear, there is some documentation at http://das.sanger.ac.uk/registry/help_index.jsp Regards, Andreas Request: http://server/registry/list http://server/registry/find?
[keyword,organism,authority,type,capability,label]=searchterm Response: DS_109 myDasSource some free text NCBI 35 chromosome Homo sapiens 9606 4:55349999,55749999 UniProt Protein Sequence P00280 sequence features 2005-Nov-16 about uniprot ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Fri Nov 18 13:00:12 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 18 Nov 2005 19:00:12 +0100 Subject: [DAS2] das registry and das2 In-Reply-To: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> Message-ID: <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com> Andreas Prlic: > I would like to start a discussion of how to provide a proper DAS > interface for > our das-registration server at http://das.sanger.ac.uk/registry/ > > Currently it is possible to interact with it using SOAP, or manually > via the HTML > interface. We should also make it accessible using URL requests. One of the things Gregg and I talked about at ISMB was that the top-level "das-sources" format is, or can be, identical to what's needed for the registry server. As it's structured now the top-level interface to a das2/genome URL returns a list of sources. Based on what you need for the registry, we're going to add support for data about the source itself. The resulting das-sources XML document is effectively identical to what you're looking for. Hence I think the top-level XML format for a DAS/2 service is identical to the XML format for a registry server. A difference is the support for searches across sources. We don't have that in DAS. This is an example, btw, of how a generic element could be useful. Suppose we don't add this in DAS/2.0. The EBI could do something like ... to say that the given URL (which would be the current URL) also supports a registry search interface.
Or we could have all DAS/2 servers implement a search. I don't think that should be a requirement. > http://server/registry/list > http://server/registry/find? > [keyword,organism,authority,type,capability,label]=searchterm My proposal doesn't affect this. Why do "find" and "list" take different URLs? Another possibility is that the same URL returns everything if there are no filters in place. Are multiple search terms allowed? Boolean AND or OR? Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Mon Nov 21 05:55:06 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 21 Nov 2005 10:55:06 +0000 Subject: [DAS2] das registry and das2 In-Reply-To: <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com> References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com> Message-ID: Hi Andrew, > As it's structured now the top-level interface to a das2/genome URL > returns a list of sources. Based on what you need for the registry, > we're going to add support for data about the source itself. > > The resulting das-sources XML document is effectively identical to > what you're looking for. That sounds good. I agree the description should look identical for both the sources and the registry. If the sources are already properly described this also makes it easier to "publish" them. For most of the fields in the registry I think it is rather clear why they are there. The issue that might need the most discussion is how to describe a coordinate system. This information is important because a DAS client usually understands one or multiple coordinate systems. E.g. Ensembl knows about Chromosomes and Clones, but it can also display UniProt annotations in some cases. Similarly, the SPICE DAS client can display annotations served in PDB-residue numbering and UniProt coordinates, but does not know how to deal with genomic coordinates.
Therefore the "coordinate system" or "namespace" is an important part of the description of a DAS source. What I found in the current spec-draft that comes closest to this issue is the different "domains", e.g. http://server/das/genome/source/version/features so I might want to say http://server/das/genome/homosapiens/ncbi35/features http://server/das/genome/musmusculus/ncbim34/features or should it be http://server/das/genome/ncbi/homosapiens35/features http://server/das/genome/ncbi/musmusculus34/features ? Hm. I am not sure, but it seems that one level is missing? - either organism or authority ? The description of the data should ultimately allow the same DAS source to be used in multiple DAS clients. Some validation will be required on the descriptions, to warn people that "homo sapiens" should not be written as "human" or "homo". Or, more complicated: Ensembl does not do assemblies itself. The assembly used is currently NCBI_35. Therefore "Ensembl" cannot be used as an authority for a chromosomal coordinate system. Currently the registry provides a restricted list of allowed coordinate systems, to keep this under control. >> http://server/registry/list >> http://server/registry/find? >> [keyword,organism,authority,type,capability,label]=searchterm > > My proposal doesn't affect this. > > Why do "find" and "list" take different URLs? Another possibility > is that the same URL returns everything if there are no filters > in place. Yes - better to use only one URL; no filters would return all sources. > > Are multiple search terms allowed? Yes. > Boolean AND or OR? We can add a parameter where this can be chosen.
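The single-endpoint registry query discussed above (one URL, all sources returned when no filters are given, plus an operator parameter for AND/OR) could be sketched as follows. The base URL, the field names, and the "operator" parameter name are assumptions drawn from this discussion, not a settled spec.

```python
from urllib.parse import urlencode

# Hypothetical registry search endpoint (assumption, not a published API).
BASE = "http://das.sanger.ac.uk/registry/find"

def registry_query(operator="AND", **filters):
    """Build a registry search URL; with no filters, return all sources."""
    params = dict(filters)
    if params:
        # Combine multiple search terms with a chooseable boolean operator.
        params["operator"] = operator
    return BASE + ("?" + urlencode(params) if params else "")

# No filters: one URL that lists everything.
all_sources_url = registry_query()
# Filtered search across two fields.
url = registry_query(organism="Homo sapiens", authority="NCBI")
```

Calling `registry_query()` with no arguments yields the bare endpoint, matching the "no filters would return all sources" behaviour agreed above.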
Greetings, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Mon Nov 21 12:06:25 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 21 Nov 2005 18:06:25 +0100 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: <90dff63fdc1e5b32ba97f8c18948758e@dalkescientific.com> Going through the back emails to prepare for the conference call in 30 minutes. Andreas, replying to Steve's comment: >> For protein DAS, authority typically defines two diff coord systems: >> 'pdb resnum, interprot' > >> It does not permit automated translation between two coord systems. > > unfortunately this is not that easy in protein space. The mapping from > the 3D > protein structure to the protein sequence is not straightforward. > Think of > negative, non-consecutive, and "non-numeric" residue numbers that can > appear > in the 3D structures. Therefore we came up with the "alignment" DAS - > document > that allows to map one object in one coordinate system to another one. > it can > also be used to map one assembly to another. Regarding the structure mapping, when we visited the PDB in August they said it's not a problem. The mmCIF records have the information needed for the mapping. I've not looked into this though. > not only - the DAS clients usually can display a certain "coordinate > system" e.g. Ensembl can do > Chromosomal ones, but if DAS sources are available that speak the > "UniProt, Protein Sequence" coordinate > system, it knows how to project these onto the genome. - an > "intelligent DAS client" :-) I like the use case of "user wants to merge annotations from different servers. As DAS currently doesn't have liftover support, the DAS client needs to get annotations only from servers using the same reference coordinate system." 
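The merge use case above implies a simple client-side rule: without liftover support, only combine annotation sources that declare the same opaque coordinate-system identifier. A minimal sketch, assuming a source record with a "coords" field (the record layout is illustrative, not from any spec):

```python
# Sketch: keep only annotation sources whose declared coordinate system
# matches the client's reference. Identifiers are treated as opaque
# strings (e.g. "NCBI35"), exactly as described in the discussion.
def mergeable_sources(sources, reference_coords):
    """Return the sources safe to merge with the given reference."""
    return [s for s in sources if s.get("coords") == reference_coords]

sources = [
    {"name": "ensembl-genes", "coords": "NCBI35"},
    {"name": "old-snps", "coords": "NCBI34"},
    {"name": "washu-segdup", "coords": "NCBI35"},
]
names = [s["name"] for s in mergeable_sources(sources, "NCBI35")]
# names == ["ensembl-genes", "washu-segdup"]
```

Note the comparison is pure string equality: no name resolution is performed, which is the point of treating coordinate-system names as opaque.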
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Nov 21 12:08:30 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 21 Nov 2005 18:08:30 +0100 Subject: [DAS2] Getting individual features in DAS/1 In-Reply-To: References: Message-ID: <7f239b885d3eca821639654862770c65@dalkescientific.com> Has anyone answered Ilari's question? I never used DAS/1 enough to answer it myself. If the normal DAS list is closed, is this the right place for DAS/1 questions? On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: > This mail is not really about DAS/2, but the web site says the > original DAS mailing list is now closed. > > I am setting up a DAS server that serves CGH data from my database to > a visualization software, which in my case is gbrowse. I've already > set up Dazzle that serves the reference data from a local copy of > Ensembl. I need to be able to select individual CGH experiments to be > visualized, and as the measurements from a single CGH experiment cover > the entire genome, this cannot of course be done by specifying a > segment along with the features command. > > I noticed that there is a feature_id option for getting the features > in DAS/1.5, but on a closer look, it seems to work by getting the > segment that the specified feature corresponds to, and then getting > all features from that segment. My next approach was to use the > feature type to distinguish between different CGH experiments. As all > my data is of the type CGH, I thought that I could use spare this > piece of information for identifying purposes. > > First I tried the generic seqfeature plugin. I created a database for > it with some test data. However, getting features by type does not > seem to work. I always get all the features from the segment in > question. > > Next I tried the LDAS plugin. Again I created a compatible database > with some test data. 
I must have done something wrong the the data > file I imported to the database, because getting the features does not > work. I can get the feature types, but trying to get the features > gives me an ERRORSEGMENT error. > > I thought that before I go further, it might be useful to ask whether > my approach seems reasonable, or is there a better way to achieve what > I am trying to do? What should I do to be able to visualize individual > CGH profiles? > > I'm grateful for any advice, > Ilari Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Nov 21 12:25:06 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 21 Nov 2005 18:25:06 +0100 Subject: [DAS2] das registry and das2 In-Reply-To: References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com> Message-ID: <21a521b096330a81bfa05b0789d3c92d@dalkescientific.com> Andreas Prlic wrote: > Therefore the "coordinate system" or "namespace" is an important part > of the description of a DAS source. > > What I found in the current spec-draft that comes closest to this > issue is the different "domains" > e.g > > http://server/das/genome/source/version/features > > so I might want to say > http://server/das/genome/homosapiens/ncbi35/features > http://server/das/genome/musmusculus/ncbim34/features > > or should it be > http://server/das/genome/ncbi/homosapiens35/features > http://server/das/genome/ncbi/musmusculus34/features > ? > > Hm. I am not sure, but it seems that one level is missing? - either > organism or authority ? The species information is available from the data source from the 'taxon' attribute, as in It's not available through a URL naming. That's arbitrary in that the data provider can use any term. I think there's nothing to preclude a provider from putting the actual source data one level deeper in the tree. Personally I find that that's over-classification. Who would use it? 
> Currently the registry provides a restricted list of allowed > coordinate systems, to keep this under control. Thomas: > This is possibly an argument for avoiding the use of URLs for assembly > identifiers, if we can't be sure that the organisation that's the > authority for a given assembly will be running an authoritative DAS > server. URNs would be fine, as would the kind of structured but > location-independent identifier that Andreas has been using. I think there's no reason we can't use our own names for these. E.g., http://www.biodas.org/coordinates/NCBI35 or a simple unique id like "NCBI35". Right now those are treated as opaque identifiers. There's no name resolution going on, and the coordinates are (I assume) implicit in that client software doesn't resolve the name, only check that the servers are returning data from the same coordinate system. Perhaps in the future that URL might resolve to something, but there's no current reason to do so. In the renewal grant there is reason to compare different coordinates. When that happens a client needs to pick one reference frame and get the translation information to the other. So the liftover service needs to know about the two coordinate systems. But it can be done through hard-coded information (perhaps with some information that coordinate system X is an alias for Y). I still don't think there's any need to resolve these URLs. Andreas: >> Are multiple search terms allowed? > > yes Then they should likely be along the same lines used for the DAS/2 searching. >> Boolean AND or OR? > > We can add a parameter where this can be chosen. The existing DAS/2 uses an AND search only. Rather, "OR" for multiple fields of the same data type and "AND" across different fields.
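The DAS/2 search semantics Andrew describes above (OR between multiple values of the same field, AND across different fields) can be sketched as a small predicate. The record layout and field names are illustrative assumptions:

```python
# Sketch of "OR within a field, AND across fields" matching.
# `filters` maps a field name to the list of acceptable values.
def matches(record, filters):
    return all(
        record.get(field) in values      # OR within a field via membership,
        for field, values in filters.items()  # AND across fields via all().
    )

records = [
    {"organism": "Homo sapiens", "authority": "NCBI"},
    {"organism": "Mus musculus", "authority": "NCBI"},
]
hits = [r for r in records
        if matches(r, {"organism": ["Homo sapiens", "Mus musculus"],
                       "authority": ["NCBI"]})]
# Both records match: either organism value is accepted (OR), and the
# authority filter must also hold (AND).
```

A record fails as soon as any one field has no acceptable value, which is what makes the cross-field combination an AND.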
Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Nov 21 12:24:37 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 21 Nov 2005 09:24:37 -0800 Subject: [DAS2] Getting individual features in DAS/1 Message-ID: We need to discuss at today's meeting. I don't think the original DAS list should be closed, but rather continue to serve as a list to discuss the DAS/1 protocol and implementations, and the DAS2 mailing list should focus on DAS/2. If we mix DAS/1 and DAS/2 discussions in the same mailing list I think it's going to lead to a lot of confusion. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, November 21, 2005 9:09 AM > To: DAS/2 > Subject: Re: [DAS2] Getting individual features in DAS/1 > > Has anyone answered Ilari's question? > > I never used DAS/1 enough to answer it myself. > > If the normal DAS list is closed, is this the right place for DAS/1 > questions? > > > On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: > > > This mail is not really about DAS/2, but the web site says the > > original DAS mailing list is now closed. > > > > I am setting up a DAS server that serves CGH data from my database to > > a visualization software, which in my case is gbrowse. I've already > > set up Dazzle that serves the reference data from a local copy of > > Ensembl. I need to be able to select individual CGH experiments to be > > visualized, and as the measurements from a single CGH experiment cover > > the entire genome, this cannot of course be done by specifying a > > segment along with the features command. > > > > I noticed that there is a feature_id option for getting the features > > in DAS/1.5, but on a closer look, it seems to work by getting the > > segment that the specified feature corresponds to, and then getting > > all features from that segment. 
My next approach was to use the > > feature type to distinguish between different CGH experiments. As all > > my data is of the type CGH, I thought that I could use spare this > > piece of information for identifying purposes. > > > > First I tried the generic seqfeature plugin. I created a database for > > it with some test data. However, getting features by type does not > > seem to work. I always get all the features from the segment in > > question. > > > > Next I tried the LDAS plugin. Again I created a compatible database > > with some test data. I must have done something wrong the the data > > file I imported to the database, because getting the features does not > > work. I can get the feature types, but trying to get the features > > gives me an ERRORSEGMENT error. > > > > I thought that before I go further, it might be useful to ask whether > > my approach seems reasonable, or is there a better way to achieve what > > I am trying to do? What should I do to be able to visualize individual > > CGH profiles? > > > > I'm grateful for any advice, > > Ilari > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Mon Nov 21 15:15:41 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 21 Nov 2005 12:15:41 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 21 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 21 Nov 2005. $Id: das2-teleconf-2005-11-21.txt,v 1.3 2005/11/21 20:15:28 sac Exp $ Attendees: Affy: Steve Chervitz, Gregg Helt UCLA: Allen Day, Brian O'connor UCBerkeley: Suzi Lewis, Nomi Harris Sweden: Andrew Dalke Sanger: Andreas Prlic Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. 
Instructions on how to access this repository are at http://biodas.org Today's topic: Client-Server implementation issues ---------------------------------------------------- Suzi/Nomi --------- Questions for Gregg: How to communicate styles in DAS/2? GH: Client gets stylesheets from the server that suggest how to render things. AD: EBI uses this a lot. Most of the DAS systems there use stylesheets. [A] Andreas will contact folks at Sanger/EBI for stylesheet example code. GH: The IGB client uses a preference configuration, using java preferences rather than a special XML file. Windows: sets values in the registry. Has been successful. If client can understand DAS/2 stylesheets and client-side prefs, the client-side prefs should override the server styles (others agree). Steve ----- * Reported on some analysis of Affymetrix DAS server weblogs. Lots of google-bot data download. Lots of Spotfire hits, too. BO: Google bots should respect robots.txt [A] Steve will install robots.txt in the relevant locations * Reported on getting Gregg's DAS/2 server to run on top of apache rather than as a stand-alone server. Should be a matter of hooking apache up to tomcat using a tomcat connector. Directive for apache to defer to tomcat for servlet requests. [A] Steve will hook up affy das server to apache/tomcat. Gregg ----- * Regarding Spotfire - they are working on an IGB plugin for Spotfire using http localhost API. This explains our Spotfire hits. Gregg was previously integrating IGB with Spotfire using a java to COM bridge. It works, but the COM bridges aren't free etc. etc. They are interested in driving IGB from Spotfire since they're interested in using IGB to provide genome visualization. Are currently evaluating whether to release it to the public or not. Gregg considered putting this in the grant, but would have required permission, etc. and time was a factor. They may eventually commit to IGB code base directly, but still need to work out legalese.
They will be interested in tracking the interclient API work we are doing (IGB-Apollo). * No major work on DAS this week, just some niggling IGB issues. * Planning another IGB release by end of year that will have improvements to DAS/2 clients. Fixed: access via quickload then access to DAS/2 causes blankout of screen. Fixed: DAS/2 interaction Brian ----- * Marc C has committed stuff to IGB code base (genoviz). Is there a test suite we can use to verify we're not breaking anything? GH: No, but hopefully early next year. Definitely needed. * Also checked in the re-factor - separate namespaces for assay and ontology. [A] Gregg will relocate das2 package to com.affy.das2 & uncouple from IGB GH: There are a few igb dependencies to be unraveled (das2feature...). Don't want to do this in the next release since that's pretty significant given upcoming holidays. GH: Other features to get in: * Persistence of preferences. * Get rid of hardwiring of DAS2 servers. Already do this for DAS/1, just need to replicate for DAS/2. Allen ----- * API for handling ontologies, structures. Communication with Chris Mungall. * Have impl at Stanford for autocompletion of ontology terms related to samples (Gavin Sherlock's group, SMD). What is bioontology group doing for distributing their ontologies, what APIs are going to be made public? SL: Am at Stanford right now to talk about that. Will offer bulk things like at the OBO site, but in terms of interactive API, will respond to community as best we can. Allen: Interested in more integration with bioontology group and with his work with SMD. Suzi: Not content, but tools right? Allen: Yes. Suzi: Work with Chris. Timing couldn't be better. [A] Allen will work with Chris M re: ontology API tools for OBO & SMD * GH: Progress on writeback? Part of grant proposal to get it done by June. Will help funding continuation. Allen: We could start implementing some of that given the refactoring that's now done.
GH: Ed Griffith at Sanger is interested in this. Hoping for his participation. In the short timeframe, your server wouldn't have to implement it as long as there is at least one server available that can do it. Allen: Need to look at work load. There's no lack of work to be done for get requests (faster impls). GH: Would prefer to have just one writeback server and a faster get server rather than having two writeback capable servers. * Allen: Optimizations involving serving files, kind of a report-version of the chado adapters. GH: Regarding your rounding ranges optimization for tiling, can you post it to the list? [A] Allen will post his rounding ranges optimization to DAS/2 list GH: The idea is to help server-side caching by rounding the range requests so you're more likely to hit the same URI (e.g., stop=5010 becomes 6000). Different clients are more likely to hit the cache. Not in the spec, just a convention. Requires more smarts in client: giving more to the user than they asked for, or throwing out what's not asked for. Throwing out what they didn't ask for would be nicer. In theory, this won't be an issue with client caching. SC: Could make client's configuration re: rounding an option. GH: Users want fewer options. * IGB display troubles. Allen had trouble getting it to display anything besides mRNA. GH: IGB expects 2-level or deeper annotations. For single-level annots, should connect all with a line. Allen: May be doing this for SNPs. But also saw some strange responses. GH: Needs a fix. Allen: will it be in the next release? GH: harder to do it generally -- easier to hardwire it for particular data types. Rendering has to guess how deep you want to go. Currently goes to the leaves and then goes 1-level up, rather than top-down. IGB uses one more level than you actually see to keep track of other things (e.g., region in query). Preferences UI: 'nested' can select two-level or one-level deep. Would like to hear what other ones you have problems with.
[A] Gregg will fix IGB display problems for single-level annots. Andrew ------ * Emailed open-bio root list to set up cgi for online verifier. But no response yet. * DAS/1 vs DAS/2 mailing list. GH: Confusion may occur if we combine DAS/1 and DAS/2 discussion. Let's keep DAS/1 for all DAS/1 spec related discussion. [A] Steve will verify whether the DAS/1 list is still alive. [A] Steve will put a link to it on biodas.org for the DAS/1 list * Locking: Plan to talk to EBI about this in January. They are doing work on stylesheets. [A] Andrew will ask Ed G. to join these meetings * Needs test data, mock data set. [A] Allen will point Andrew at some data for testing. Andreas ------- * The current registry implementation: Written in Java; two ways to interact: 1) HTML: can browse available DAS sources, see details, go back to DAS client and activate the DAS source in the DAS client. 2) SOAP: client contacts registry, gets list of available sources. It is open source. [A] Andreas will post link to source code for DAS registry impl. GH: A central registry is good, but companies will want their own. E.g., at Affy there may be 5-7. Andreas: It's possible to have a set of registries, local vs. public. GH: Are you OK with the idea of having an http-based interface? It can run on top of existing core. Andreas: Sure. [A] Andreas will provide http-based interface to Sanger DAS registry Agenda for next week's teleconf ----------------------------- * Talk more about registry spec issues * Retrieval spec issues: - Content-type - DAS/2 headers - Feature and type properties - other things? Andrew: Prefer to have most of the discussion online (DAS/2 list); then the teleconf can be more productive.
[A] Continue discussing spec issues on the list before next teleconf From allenday at ucla.edu Mon Nov 21 15:47:51 2005 From: allenday at ucla.edu (Allen Day) Date: Mon, 21 Nov 2005 12:47:51 -0800 (PST) Subject: [DAS2] tiled queries for performance Message-ID: Hi, I had an idea of how clients may be able to get better response from servers by using a tiled query technique. Here's the basic idea: ClientA wants features in chr1/1010:2020, and issues a request for that range. No other clients have previously requested this range, so the server-side cache faults to the DAS/2 service (slow). ClientB wants features in chr1/1020:2030, and issues a request for that range. Although the intersection of the resulting records with ClientA's query is large, the URIs are different and the server-side cache faults again. If ClientA and ClientB were to each issue two separate "tiled" requests: 1. chr1/1001:2000 2. chr1/2001:3000 ClientB could take advantage of the fact that ClientA had been looking at the same tiles. For this to work, the clients would need to be using the same tile size. The optimal tile size is likely to vary from datasource to datasource, depending on the length and density distributions of the features contained in the datasource. The "sources" or "versioned sources" payload could suggest a tiling size to prospective clients. Servers could also pre-cache all tiles by hitting each tile after an update of the datasource (or the DAS/2 service code). The tradeoff for the performance gains is that clients may now need to do filtering on the returned records to only return those requested by the client's client. -Allen From ap3 at sanger.ac.uk Tue Nov 22 08:54:27 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 22 Nov 2005 13:54:27 +0000 Subject: [DAS2] das registry links Message-ID: Hi! There was a question yesterday where to get the source code from the das-registration server and if it is possible to have a local installation. 
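Allen's tiled-query idea above amounts to snapping a requested range outward to fixed-size tiles, so that different clients issue identical (and therefore cacheable) request URIs. A minimal sketch, where the tile size of 1000 and the 1-based tile arithmetic are assumptions for illustration (the spec proposal is that the server would suggest the tile size in its sources document):

```python
# Sketch of Allen's tiling scheme: map a requested range onto the
# fixed-size tiles that cover it, so overlapping requests from
# different clients hit the same cached tile URIs.
TILE = 1000  # assumed tile size; in practice suggested per-datasource

def tiles_for(start, stop, tile=TILE):
    """Return the 1-based tile ranges covering [start, stop]."""
    first = (start - 1) // tile   # index of the first tile touched
    last = (stop - 1) // tile     # index of the last tile touched
    return [(i * tile + 1, (i + 1) * tile) for i in range(first, last + 1)]

# Both overlapping requests resolve to the same two tiles:
a = tiles_for(1010, 2020)   # ClientA's range
b = tiles_for(1020, 2030)   # ClientB's range
# a == b == [(1001, 2000), (2001, 3000)]
```

As Allen notes, the tradeoff is client-side filtering: after fetching whole tiles, the client must discard the returned records that fall outside the range its own caller actually asked for.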
The source code for the registry is available under the LGPL at

http://www.derkholm.net/svn/repos/dasregistry/trunk/

using subversion. To obtain a local installation, which caches/synchronizes the publicly available data and allows you to add local DAS sources, see the instructions at:

http://www.derkholm.net/svn/repos/dasregistry/trunk/release/install.txt

There is also a das-registry announce mailing list at

http://lists.sanger.ac.uk/mailman/listinfo/das_registry_announce

Regards,
Andreas

-----------------------------------------------------------------------
Andreas Prlic, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK, +44 (0) 1223 49 6891

From ap3 at sanger.ac.uk Tue Nov 22 12:58:08 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 22 Nov 2005 17:58:08 +0000
Subject: [DAS2] ensembl & stylesheet
Message-ID: 

Hi!

Another question yesterday was about ensembl & stylesheet support. An example das source that provides a stylesheet is the following:

http://das.ensembl.org/das/ens_35_segdup_washu/stylesheet

A description of it is at:

http://das.ensembl.org/das/ens_35_segdup_washu/

To show how it is rendered in ensembl, follow this "auto-activation" link:

http://www.ensembl.org/Homo_sapiens/contigview?conf_script=contigview;c=17:14149999.5:1;w=200000;h=;add_das_source=(name=SEGDUP_WASHU+url=http://das.ensembl.org/das+dsn=ens_35_segdup_washu+type=ensembl_location+color=black+strand=r+labelflag=U+stylesheet=Y+group=Y+depth=9999+score=N+active=1)

In terms of source code, ensembl uses the Bio::DASLite perl module for fetching features and stylesheets:

http://search.cpan.org/~rpettett/Bio-DasLite-0.10/

Hope this helps,

Cheers,
Andreas

-----------------------------------------------------------------------
Andreas Prlic, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK, +44 (0) 1223 49 6891

From gilmanb at pantherinformatics.com Mon Nov 21 16:46:25 2005
From: gilmanb at pantherinformatics.com (Brian Gilman)
Date: Mon, 21 Nov 2005 16:46:25 -0500
Subject:
[DAS2] tiled queries for performance
In-Reply-To: 
References: 
Message-ID: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com>

Hello Everyone,

I've been lurking on the list and wanted to say hi. We're looking into this kind of implementation issue ourselves and thought that a bittorrent-like cache makes the most sense, i.e. all servers in the "fabric" are issued the query in a certain "hop adjacency". These servers then send their data to the client, whose job it is to assemble the data.

HTH,

-B

--
Brian Gilman
President
Panther Informatics Inc.
E-Mail: gilmanb at pantherinformatics.com
        gilmanb at jforge.net
AIM: gilmanb1

01000010 01101001 01101111 01001001 01101110 01100110 01101111 01110010 01101101 01100001 01110100 01101001 01100011 01101001 01100001 01101110

On Nov 21, 2005, at 3:47 PM, Allen Day wrote:

> Hi,
>
> I had an idea of how clients may be able to get better response from
> servers by using a tiled query technique. Here's the basic idea:
>
> ClientA wants features in chr1/1010:2020, and issues a request for that
> range. No other clients have previously requested this range, so the
> server-side cache faults to the DAS/2 service (slow).
>
> ClientB wants features in chr1/1020:2030, and issues a request for that
> range. Although the intersection of the resulting records with ClientA's
> query is large, the URIs are different and the server-side cache faults
> again.
>
> If ClientA and ClientB were to each issue two separate "tiled" requests:
>
> 1. chr1/1001:2000
> 2. chr1/2001:3000
>
> ClientB could take advantage of the fact that ClientA had been looking at
> the same tiles.
>
> For this to work, the clients would need to be using the same tile size.
> The optimal tile size is likely to vary from datasource to datasource,
> depending on the length and density distributions of the features
> contained in the datasource. The "sources" or "versioned sources"
> payload could suggest a tiling size to prospective clients.
> Servers could
> also pre-cache all tiles by hitting each tile after an update of the
> datasource (or the DAS/2 service code).
>
> The tradeoff for the performance gains is that clients may now need to do
> filtering on the returned records to only return those requested by the
> client's client.
>
> -Allen
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From Steve_Chervitz at affymetrix.com Wed Nov 23 11:03:55 2005
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 23 Nov 2005 08:03:55 -0800
Subject: [DAS2] Simple Sharing Extensions for RSS and OPML
Message-ID: 

This may have some concepts relevant to DAS/2 writeback:

http://msdn.microsoft.com/xml/rss/sse/

Steve

From allenday at ucla.edu Wed Nov 23 18:50:24 2005
From: allenday at ucla.edu (Allen Day)
Date: Wed, 23 Nov 2005 15:50:24 -0800 (PST)
Subject: [DAS2] tiled queries for performance
In-Reply-To: 
References: 
Message-ID: 

More thoughts on this.

The client can eliminate the redundancy in the records returned by issuing the tiling queries as previously described (query1), then issuing queries for records that are not contained within tiles, but overlap the boundaries of 1 or more tiles (query2).

However, by issuing all the overlaps queries at once, we've just deferred the performance hit one step, because we can't reasonably expect the server to have cached all combinations of tile overlaps queries. I think, to get this tiling optimization to work, the burden needs to be on the client to identify and remove duplicate responses for multiple edge-overlaps queries (query3).

  1000bp          2000bp          3000bp
    |               |               |
    |    ===        |  =====^====   |
    |          ====#=====           |
    |  ============#=============#=====
    |               |               |
    <-----------> query1a
                  <-----------> query1b
           query2
                query3a         query3b

Key:
  | : tile boundary
  = : feature
  ^ : gap between child features
  # : portion of feature overlapping tile boundary.
<-> : client overlaps query
  <.> : client contains query

-Allen

On Mon, 21 Nov 2005, Allen Day wrote:

> Hi,
>
> I had an idea of how clients may be able to get better response from
> servers by using a tiled query technique. Here's the basic idea:
>
> ClientA wants features in chr1/1010:2020, and issues a request for that
> range. No other clients have previously requested this range, so the
> server-side cache faults to the DAS/2 service (slow).
>
> ClientB wants features in chr1/1020:2030, and issues a request for that
> range. Although the intersection of the resulting records with ClientA's
> query is large, the URIs are different and the server-side cache faults
> again.
>
> If ClientA and ClientB were to each issue two separate "tiled" requests:
>
> 1. chr1/1001:2000
> 2. chr1/2001:3000
>
> ClientB could take advantage of the fact that ClientA had been looking at
> the same tiles.
>
> For this to work, the clients would need to be using the same tile size.
> The optimal tile size is likely to vary from datasource to datasource,
> depending on the length and density distributions of the features
> contained in the datasource. The "sources" or "versioned sources"
> payload could suggest a tiling size to prospective clients. Servers could
> also pre-cache all tiles by hitting each tile after an update of the
> datasource (or the DAS/2 service code).
>
> The tradeoff for the performance gains is that clients may now need to do
> filtering on the returned records to only return those requested by the
> client's client.
> > -Allen > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > From Steve_Chervitz at affymetrix.com Wed Nov 23 20:40:13 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 23 Nov 2005 17:40:13 -0800 Subject: [DAS2] Ontology Lookup Service Message-ID: Allen, This looks similar to what you have been working on for SMD: http://www.ebi.ac.uk/ontology-lookup/ Would be interesting to compare it with your ontology DAS-based implementation (e.g., performance, ease of installation, extending, etc.). Steve From dalke at dalkescientific.com Wed Nov 23 21:52:35 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 03:52:35 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: References: Message-ID: Allen: > No other clients have previously requested this range, so the > server-side cache faults to the DAS/2 service (slow). Admittedly I'm curious about this. Why is this slow? What does slow mean? I assume "cannot be returned faster than the network will take it." How many annotations are in the database? Figuring one annotation for every ... 100 bases? gives me 30 million. Shouldn't a range search over < only 30 million be fast? Is this being done in the database? Which database and what's the SQL? If the DB is the bottleneck then pulling it out as a specialized search might be worthwhile. What I'm driving at for this is this. The proposal feels like a workaround for a given implementation. To use it requires more smarts in the client. Why not put that logic on the server? Andrew dalke at dalkescientific.com From allenday at ucla.edu Thu Nov 24 02:10:36 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 23 Nov 2005 23:10:36 -0800 Subject: [DAS2] tiled queries for performance In-Reply-To: References: Message-ID: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> Hi Andrew. 
I'd like to be able to consistently get network-bottlenecked response from the server. The largest (250 megabase) SQL range queries typically take ~30 seconds to complete, returning ~500K features. I'm currently working on getting the templating system (Template Toolkit aka TT2) we use to flush to the client periodically, rather than building the entire response first. This is the current bottleneck; TT2 generation of a 500K record XML document takes many minutes. Regardless of how much more optimization work we put into the server, it's never going to be as fast as serving up pre-queried, pre-rendered content.

I borrowed the idea of tiling from the Google maps application (maps.google.com). In their implementation the server is dumb, and just serves up a static HTML/Javascript document (the application), and static PNG images based on latitude/longitude coordinates (the data). All of the application logic for what to display occurs client side. Classic AJAX.

In the DAS protocol, the application logic is distributed between the client and server, sometimes to ill effect. Requiring both (a) the server to respond to arbitrary range queries, and (b) the client to display arbitrary ranges unnecessarily creates a bifurcation of the View component of the application. Brian was hinting at this when he mentioned the idea of bittorrent blocks earlier in the thread.

We also require code redundancy between client and server to be able to fully use the type and exacttype filters. In this case the Model component has been bifurcated -- the client needs to build a model of the ontology (from who knows where... presumably processing OBO-Edit files) so the user can issue queries, and the server needs to also have some representation of the ontology to generate a response.

Hopefully the ontology DAS extension will help the latter situation outlined above by getting both client and server to be synchronized on the same data model.
As far as the tiling optimization goes, it's likely that I'll implement a preprocessor for the HTTP query so I can break it into tiles -- conceptually very similar to the log10 binning that Lincoln does in the GFF database. -Allen On 11/23/05, Andrew Dalke wrote: > > Allen: > > No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). > > Admittedly I'm curious about this. Why is this slow? What does > slow mean? I assume "cannot be returned faster than the network > will take it." > > How many annotations are in the database? Figuring one annotation > for every ... 100 bases? gives me 30 million. Shouldn't a range > search over < only 30 million be fast? Is this being done in the > database? Which database and what's the SQL? > > If the DB is the bottleneck then pulling it out as a specialized > search might be worthwhile. > > What I'm driving at for this is this. The proposal feels like > a workaround for a given implementation. To use it requires > more smarts in the client. Why not put that logic on the server? > > > Andrew > dalke at dalkescientific.com > > From allenday at ucla.edu Thu Nov 24 02:21:48 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 23 Nov 2005 23:21:48 -0800 Subject: [DAS2] Re: Ontology Lookup Service In-Reply-To: References: Message-ID: <5c24dcc30511232321v70f77dc9y7a1ceef22bcf6edc@mail.gmail.com> Hi Steve. Yes, this is pretty similar to what we're doing. The major differences I see are (a) the query flexibility -- It only lets you retrieve terms from one ontology at a time, and does not support wildcards (b) the display -- it doesn't actually show you the dag structure of the ontology, and (c) using different tech -- Java/SOAP as opposed to Perl/ReST. 
-Allen On 11/23/05, Steve Chervitz wrote: > > Allen, > > This looks similar to what you have been working on for SMD: > > http://www.ebi.ac.uk/ontology-lookup/ > > Would be interesting to compare it with your ontology DAS-based > implementation (e.g., performance, ease of installation, extending, etc.). > > Steve > > From dalke at dalkescientific.com Thu Nov 24 08:28:00 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 14:28:00 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> References: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> Message-ID: <9eb929192db24ad93fb2a7cf423aa9c3@dalkescientific.com> Allen: > I'd like to be able to consistently get network-bottlenecked response > from the server.? The largest (250 megabase) SQL range queries > typically take ~30 seconds to complete, returning ~500K features.? I'm > currently working on getting the templating system (Template Toolkit > aka TT2) we use to flush to the client periodically, rather than > building the entire response first.? This is the current bottleneck; > TT2 generation of a 500K record XML document takes many minutes.? > Regardless of how much more optimization work we put into the server, > it's never going to be as fast as serving up pre-queried, pre-rendered > content. Interesting. So I was right, in that the range search is fast, but wrong in not considering the template generation problem. Could that cause a DoS attack by asking for several large ranges at once? You're building up multi-megabyte strings in memory. (If 1 feature is 1K then that's 500MB.) Ideologically the clean solution might be to have the search return only a list of identifiers and have the client fetch each feature one-by-one. This is a tile size of 1. Implementation-wise this will cause problems unless using HTTP 1.1 pipelining since the act of opening 500K connections takes non-trivial time. 
Adding a "return XML for these ids" service doesn't help either - it brings us back to the same problem. But another solution is to cache all the features as XML, leaving out only the header and footer. Skip the templating system (rather, it's upstream of the caching). Do the search, get the ids, and stream the contents directly from the cache. This would be used in feature lookup and for search results. > In the DAS protocol, the distribution of the application logic is > distributed between the client and server, sometimes to ill effect.? > Requiring both (a) the server to respond to arbitrary range queries, > and (b) the client to display arbitrary ranges unnecessarily creates a > bifurcation of the View component of the application.? Brian was > hinting at this when he mentioned the idea of bittorrent blocks > earlier in the thread. What application logic? There should be many ways to build different applications on top of DAS. DAS is a data model. The client provides the view (or many views). There are two reasons for query support on the server. 1. slow bandwidth and limited client resources - otherwise clients could download and search the data locally 2. easier support for (certain classes of) application developers To make the Google comparison, there's no reason Google searches couldn't take place on your personal machine except that you can't download the Internet and search it in usable time. With Google providing the service others can do things like provide domain-specific web searches via Google, include Google links in a web browser, or make something like Googlefight. > We also require code redundancy between client and server to be able > to fully use the type and exacttype filters.? In this case the Model > component has been bifurcated -- the client needs to build a model the > ontology (from who knows where... 
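The fragment cache sketched a couple of paragraphs up (pre-render each feature's XML once, then answer searches by streaming cached fragments between a fixed header and footer) might look roughly like this. All names below are invented for illustration; this is not server code from the thread.

```python
# Sketch of a per-feature XML fragment cache: the expensive rendering
# step runs once per feature, and responses are streamed by
# concatenating cached fragments. All names are hypothetical.
from typing import Dict, Iterable, Iterator

xml_cache: Dict[str, str] = {}  # feature id -> pre-rendered XML fragment

def render_feature(fid: str) -> str:
    # Stand-in for the slow templating step (e.g. TT2 in Allen's server).
    return f'  <FEATURE id="{fid}" />\n'

def get_fragment(fid: str) -> str:
    if fid not in xml_cache:
        xml_cache[fid] = render_feature(fid)
    return xml_cache[fid]

def stream_response(ids: Iterable[str]) -> Iterator[str]:
    # Yield incrementally so the client starts receiving bytes early.
    yield "<FEATURES>\n"
    for fid in ids:
        yield get_fragment(fid)
    yield "</FEATURES>\n"

# Ids would come from the (fast) range search; f1 is rendered only once.
response = "".join(stream_response(["f1", "f2", "f1"]))
```

The same cache serves both single-feature lookups and large search results, which is the point of skipping the templating system on the hot path.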
presumably processing OBO-Edit > files) so the user can issue queries, and the server needs to also > have some representation of the ontology to generate a response. > > Hopefully the ontology DAS extension will help the latter situation > outlined above by getting both client and server to be synchronized on > the same data model.? As far as the tiling optimization goes, it's > likely that I'll implement a preprocessor for the HTTP query so I can > break it into tiles -- conceptually very similar to the log10 binning > that Lincoln does in the GFF database. I didn't follow this. Code redundancy means what? There's an exchange of data models - in this case the model for a query. But any client/server needs to do this. Take Entrez, for example. It supports many types of search fields, including MeSH (which I think counts as an ontology). A sophisticated client may have a GUI to help people identify MeSH terms. This obviously does some duplicate work as with the server. Is that what you mean? If so, why does it matter? Note also that while Google Maps serves static images only, there's shared logic between the application (in the browser) and the tools that generated those maps. Eg, both have the same code for understanding geography/latitude&longitude. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Nov 24 08:47:26 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 14:47:26 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com> References: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com> Message-ID: <22110007fe53238adbda91041ee1baf2@dalkescientific.com> Hi Brian, > We're looking into this kind of implementation issue ourselves and > thought that a bitorrent like cache makes the most sense. ie. all > servers in the "fabric" are issued the query in a certain "hop > adjacency". 
These servers then send their data to the client who's job > it is to assemble the data. I go back and forth between the "large data set" model and the "large number of entities" model. In the first: - client requests a large data file - server returns it This can be sped up by distributing the file among many sites and using something like BitTorrent to put it together, or something like Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches. But making the code for this is complicated. It's possible to build on BitTorrent and similar systems, but I have no feel for the actual implementation cost, which makes me wary. I've looked into a couple of the P2P toolkits and not gotten the feel that it's any easier than writing HTTP requests directly. Plus, who will set up the alternate servers? In the second: - make query to server - server returns list of N identifiers - make N-n requests (where 'n' is the number of identifiers already resolved) The id resolution can be done in a distributed fashion and is easily supported via web caches, either with well-configured proxies or (again) through Coral. I like the latter model in part because it's more fine grained. Eg, a progress bar can say "downloading feature 4 of 10000", and if a given feature is already present there's no need to refetch it. The downside of the 2nd is the need for HTTP 1.1 pipelining to make it be efficient. I don't know if we want to have that requirement. Gregg came up with the range restrictions because most of the massive results will be from range searches. By being a bit more clever about tracking what's known and not known, a client can get a much smaller results page. These are complementary. Using Gregg's restricted range queries can reduce the number of identifiers returned in a search, making the network overhead even smaller. 
Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Fri Nov 25 10:21:21 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 25 Nov 2005 16:21:21 +0100
Subject: [DAS2] DAS intro
Message-ID: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com>

The front of the DAS doc starts

    DAS 2.0 is designed to address the shortcomings of DAS 1.0, including:

That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. How about this instead, as an overview/introduction.

======

DAS/2 describes a data model for genome annotations. An annotation server provides information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Portions of the assembly may have higher relative accuracy than the assembly as a whole. A reference server may supply these portions as an alternate reference frame.

Annotations are located on the genome with a start and end position. The range may be specified multiple times if there are alternate reference frames. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type.

Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters.

DAS/2 is implemented using a ReST architecture. Each entity (also called a document or object) has a name, which is a URL. Fetching the URL gets information about the entity. The DAS-specific entities are all XML documents. Other entities contain data types with an existing and frequently used file format.
Where possible, a DAS server returns data using existing formats. In some cases a server may describe how to fetch a given entity in several different formats. ====== Andrew dalke at dalkescientific.com From asims at bcgsc.ca Fri Nov 25 14:15:17 2005 From: asims at bcgsc.ca (Asim Siddiqui) Date: Fri, 25 Nov 2005 11:15:17 -0800 Subject: [DAS2] tiled queries for performance Message-ID: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> Hi, I'm a newbie to this list, so apologies if I've missed something critical. I think this is a great idea. I don't see this as a big change to the DAS/2 spec or requiring much in the way of additional smarts on the client side. The change is simply that instead of the client getting exactly what it asks for, it may get more. My 2 cents, Asim -----Original Message----- From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open-bio.org] On Behalf Of Allen Day Sent: Wednesday, November 23, 2005 11:11 PM To: Andrew Dalke; DAS/2 Subject: Re: [DAS2] tiled queries for performance Hi Andrew. I'd like to be able to consistently get network-bottlenecked response from the server. The largest (250 megabase) SQL range queries typically take ~30 seconds to complete, returning ~500K features. I'm currently working on getting the templating system (Template Toolkit aka TT2) we use to flush to the client periodically, rather than building the entire response first. This is the current bottleneck; TT2 generation of a 500K record XML document takes many minutes. Regardless of how much more optimization work we put into the server, it's never going to be as fast as serving up pre-queried, pre-rendered content. I borrowed the idea of tiling from the Google maps application ( maps.google.com). In their implementation the server is dumb, and just serves up a static HTML/Javascript document (the application), and static PNG images based on latitute/longitude coordinates (the data). 
All of the application logic for what to display occurs client side. Classic AJAX. In the DAS protocol, the distribution of the application logic is distributed between the client and server, sometimes to ill effect. Requiring both (a) the server to respond to arbitrary range queries, and (b) the client to display arbitrary ranges unnecessarily creates a bifurcation of the View component of the application. Brian was hinting at this when he mentioned the idea of bittorrent blocks earlier in the thread. We also require code redundancy between client and server to be able to fully use the type and exacttype filters. In this case the Model component has been bifurcated -- the client needs to build a model the ontology (from who knows where... presumably processing OBO-Edit files) so the user can issue queries, and the server needs to also have some representation of the ontology to generate a response. Hopefully the ontology DAS extension will help the latter situation outlined above by getting both client and server to be synchronized on the same data model. As far as the tiling optimization goes, it's likely that I'll implement a preprocessor for the HTTP query so I can break it into tiles -- conceptually very similar to the log10 binning that Lincoln does in the GFF database. -Allen On 11/23/05, Andrew Dalke wrote: > > Allen: > > No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). > > Admittedly I'm curious about this. Why is this slow? What does slow > mean? I assume "cannot be returned faster than the network will take > it." > > How many annotations are in the database? Figuring one annotation for > every ... 100 bases? gives me 30 million. Shouldn't a range search > over < only 30 million be fast? Is this being done in the database? > Which database and what's the SQL? > > If the DB is the bottleneck then pulling it out as a specialized > search might be worthwhile. 
> > What I'm driving at for this is this. The proposal feels like a > workaround for a given implementation. To use it requires more smarts > in the client. Why not put that logic on the server? > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ DAS2 mailing list DAS2 at portal.open-bio.org http://portal.open-bio.org/mailman/listinfo/das2 From suzi at fruitfly.org Fri Nov 25 17:20:29 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Fri, 25 Nov 2005 14:20:29 -0800 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <59fa39752e4d792d2142fe2682813937@fruitfly.org> a few minor in-line edits below. trying to simplify and not confuse, as this is just an intro. On Nov 25, 2005, at 7:21 AM, Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, > including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE > . An annotation > server provides SUCH > information FOR > one or more genome SEQUENCES. > Each GENOMIC SEQUENCE > may have one or more versions. Different versions are usually > based on different assemblies. As an implementation detail an > assembly and corresponding sequence data may be distributed via a > different machine, which is called the reference server. (DELETED LAST 2 SENTENCES). > > Annotations are located on the genome with a start and end position. > The range may be specified mutiple times if there are alternate > SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). 
> An annotation may contain multiple non-continguous
> parts
> (DELETED PHRASE AND SENTENCE)
> Annotations have a type based on terms in SOFA
> (Sequence Ontology for Feature Annotation). Stylesheets contain a set
> of properties used to depict a given type.
>
> Annotations can be searched by range, type, and a properties table
> associated with each annotation. These are called feature filters.
>
> DAS/2 is implemented using a ReST architecture. Each entity (also
> called a document or object) has a name, which is a URL. Fetching the
> URL gets information about the entity. The DAS-specific entities are
> all XML documents. Other entities contain data types with an existing
> and frequently used file format. Where possible, a DAS server returns
> data using existing formats. In some cases a server may describe how
> to fetch a given entity in several different formats.
> ======
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Fri Nov 25 18:43:10 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 26 Nov 2005 00:43:10 +0100
Subject: [DAS2] tiled queries for performance
In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca>
References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca>
Message-ID: <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com>

Asim Siddiqui:
> I think this is a great idea.
>
> I don't see this as a big change to the DAS/2 spec or requiring much in
> the way of additional smarts on the client side.

I agree with Allen on this - in some sense there's no effect on the spec. It ends up being an agreement among the clients to request aligned data, by rounding up/down to the nearest, say, kilobase, and for the server implementers to cache those requests.
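That client-side agreement could be sketched as a small helper. This is a hypothetical illustration, not code from any DAS client; the 1000 bp tile size and 1-based coordinates follow Allen's example earlier in the thread.

```python
# Sketch: snap an arbitrary range request onto fixed-size tiles so that
# overlapping requests from different clients produce identical,
# cacheable tile requests. Hypothetical helper, not part of any client.

TILE_SIZE = 1000  # could instead be suggested by the "sources" payload

def tiles_for_range(start: int, end: int, tile: int = TILE_SIZE):
    """Return the 1-based tile ranges covering [start, end]."""
    first = (start - 1) // tile  # index of the tile containing `start`
    last = (end - 1) // tile     # index of the tile containing `end`
    return [(i * tile + 1, (i + 1) * tile) for i in range(first, last + 1)]

# ClientA's chr1/1010:2020 and ClientB's chr1/1020:2030 both map to the
# same two tiles, so the second set of requests can be served from cache.
print(tiles_for_range(1010, 2020))  # [(1001, 2000), (2001, 3000)]
print(tiles_for_range(1020, 2030))  # [(1001, 2000), (2001, 3000)]
```

Each client then filters the merged tile results back down to the range it actually asked for.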
> The change is simply that instead of the client getting exactly what it
> asks for, it may get more.

While that's another matter - the client makes a request and the server is free to expand the range to something it can handle a bit better. Allen? Were you suggesting this instead? In this case there is a change to the spec, and all clients must be able to filter or otherwise ignore extra results.

I personally think it's an implementation issue related to performance and there are ways to make the results be generated fast enough.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Fri Nov 25 19:35:45 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sat, 26 Nov 2005 01:35:45 +0100
Subject: [DAS2] DAS intro
In-Reply-To: <59fa39752e4d792d2142fe2682813937@fruitfly.org>
References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org>
Message-ID: 

Hi Suzi,

You're supposed to be on holiday - it's Thanksgiving after all. Though I'm not celebrating it until next week. I wonder where I can find pumpkin pie mix here ...

>> DAS/2 describes a data model for genome annotations
> , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE

Changed, along with the other fixes.

> (DELETED LAST 2 SENTENCES).

That was the two lines about

>> Portions of
>> the assembly may have higher relative accuracy than the assembly as a
>> whole. A reference server may supply these portions as an alternate
>> reference frame.

In the intro I want to mention all of the parts of DAS. The problem is that I still don't understand the /region request. These two lines were my best attempt at explaining them. Was the deletion because my understanding is wrong or because it's not needed for the intro?

I think my confusion is related to the concept you mention in:

>> Annotations are located on the genome with a start and end position.
>> The range may be specified mutiple times if there are alternate >> > SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). because I don't understand what I should change. I made up the term 'reference frame' because of my physics training. Is it the correct term here? Does 'reference frame' as it's normally used only refer to the full assembly or does it refer to each "/region" as well? If I give the coordinates on a contig can I say it's in the reference frame of that contig? (Hmm, David Block agrees with me, according to http://open-bio.org/bosc2001/abstracts/lightning/block The presence of a Tiling_Path table allows the loading of any arbitrary length of sequence, in the reference frame of any of the contigs that make up the tiling path. ) I thought it was important to mention that a given annotation may have "several tags if the feature's location can be represented in multiple coordinate systems (e.g. multiple builds of a genome or multiple contigs)" Then again, I don't understand how a given feature can be annotated on multiple builds because I thought that a feature was only associated with a single versioned source, and a versioned source has only one build. I would like to have something in the intro which mentions "/region". I just don't know how to do it. Why does anyone care about regions and not just point directly to the sequence? >> An annotation may contain multiple non-continguous >> parts > > (DELECTED PHRASE AND SENTENCE) The deleted text there was ", making it the parent of those parts. Some parts may have more than one parent." I put it there because I remember we talked a lot about this at CSHL a couple years back and wanted to make sure the data model handled cases where, say, there were two parents to three parts. It seems to me that that structure is important enough that someone who is trying to get a quick understanding of DAS annotations would be interested in it. 
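The two-parents/three-parts case mentioned above is just a small directed acyclic graph rather than a tree. A sketch with invented names (DAS/2 specifies only the XML exchange, not any particular object model):

```python
# Illustrative only: model features whose parts may have more than one
# parent, e.g. two transcripts sharing the same three exons.
class Feature:
    def __init__(self, fid):
        self.id = fid
        self.parents = []  # a part may belong to several parents
        self.parts = []

    def add_part(self, part):
        self.parts.append(part)
        part.parents.append(self)

t1, t2 = Feature("transcript-1"), Feature("transcript-2")
exons = [Feature("exon-%d" % i) for i in (1, 2, 3)]
for exon in exons:
    t1.add_part(exon)
    t2.add_part(exon)

# Each exon now has two parents, which a pure tree could not express:
assert all(len(exon.parents) == 2 for exon in exons)
```

A client data model that assumes one parent per part would silently lose one of the transcripts here, which is why the parts/parents structure matters even in an introduction.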
My internal model for the expected reader is someone like Allen or Gregg - people who have some experience in data models for annotations and would like to know that DAS can handle those sorts of more complicated tree structures. I'm willing to move it further into the text, but I'm not convinced that it makes things less confusing or simpler. Features having parts and parents is an essential part of the DAS data model. Andrew dalke at dalkescientific.com From suzi at fruitfly.org Fri Nov 25 20:44:54 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Fri, 25 Nov 2005 17:44:54 -0800 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> Message-ID: <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> Hi Andrew, so there seem to be 2 questions. it would be good to have both in the intro, but only as long as the description can be clearly stated in just a sentence or two. If it takes more then it is clearly something that requires a fuller description outside of the intro. I'll try to give my understanding (but goodness knows I am peering through different lenses). I don't think in terms of the spec at all, just the information that needs to be conveyed. #1 "reference frame" ========================================= "reference frame", is (to my mind) "reference sequence". at least, that is what i've always called it. First, accuracy has nothing at all to do with it, so we don't want the sentence in there. Second, the region of sequence that is returned is nothing more than that. Think of it as a special type of feature. This is what makes a transformation possible from one coordinate-system to another (by adding the correct offsets) Third, just think of "reference sequence" as a coordinate system. One can have the exact same feature and indicate that: on coordinate-system-A this feature starts and ends here, and on coordinate-system-B it starts and ends there. 
Thus a feature's coordinates may be given both on a chromosome, and on a contig, and on any other coordinate-system that can be derived through a transform from these. So you could change the sentence below to read "A reference server may supply features where the locations (start and end) are relative to either contigs, some other arbitrary region, or to the entire chromosome." #2 "multiple parents" ========================================= It still is easier for me to think of this in terms of sequences. We may know that somewhere out in the world a sequence must exist, but the data/sequence we have collected is fragmentary. For example, thinly sequenced genomes (resulting in many separate contigs) or a pair of ESTs from a cDNA. In either of these cases we need to be able to have the many to many relationships you talk about. This one is perhaps too subtle for the introduction, but if we decide to include it then I think it should first be phrased in terms of the problem (biological sampling) and then in terms of the solution (multiple parents). -S On Nov 25, 2005, at 4:35 PM, Andrew Dalke wrote: > Hi Suzi, > > You're supposed to be on holiday - it's Thanksgiving after all. > > Though I'm not celebrating it until next week. I wonder where > I can find pumpkin pie mix here ... > >>> DAS/2 describes a data model for genome annotations >> , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE > > Changed, along with the other fixes. > >> (DELETED LAST 2 SENTENCES). > > That was the two lines about > >>> Portions of >>> the assembly may have higher relative accuracy than the assembly as a >>> whole. A reference server may supply these portions as an alternate >>> reference frame. > > In the intro I want to mention all of the parts of DAS. The > problem is that I still don't understand the /region request. > These two lines were my best attempt at explaining them. 
> > Was the deletion because my understanding is wrong or because it's > not needed for the intro? > > I think my confusion is related the concept you mention in: >>> Annotations are located on the genome with a start and end position. >>> The range may be specified mutiple times if there are alternate >>> >> SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). > > because I don't understand what I should change. I made up the > term 'reference frame' because of my physics training. Is it > the correct term here? Does 'reference frame' as it's normally > used only refer to the full assembly or does it refer to each > "/region" as well? If I give the coordinates on a contig can > I say it's in the reference frame of that contig? > > (Hmm, David Block agrees with me, according to > http://open-bio.org/bosc2001/abstracts/lightning/block > The presence of a Tiling_Path table allows the loading of > any arbitrary length of sequence, in the reference frame > of any of the contigs that make up the tiling path. ) > > > > I thought it was important to mention that a given annotation > may have "several tags if the feature's location can be > represented in multiple coordinate systems (e.g. multiple builds > of a genome or multiple contigs)" > > Then again, I don't understand how a given feature can be > annotated on multiple builds because I thought that a feature > was only associated with a single versioned source, and a > versioned source has only one build. > > > I would like to have something in the intro which mentions > "/region". I just don't know how to do it. Why does anyone > care about regions and not just point directly to the sequence? > >>> An annotation may contain multiple non-continguous >>> parts >> >> (DELECTED PHRASE AND SENTENCE) > > The deleted text there was ", making it the parent of those parts. > Some parts may have more than one parent." 
> > I put it there because I remember we talked a lot about this > at CSHL a couple years back and wanted to make sure the data > model handled cases where, say, there were two parents to three > parts. I seems to me that that structure is important enough > that someone who is trying to get a quick understanding of > DAS annotations would be interested in it. > > My internal model for the expected reader is someone like > Allen or Gregg - people who have some experience in data > models for annotations and would like to know that DAS > can handle those sorts of more complicated tree structures. > > I'm willing to move it further into the text, but I'm not > convinced that it makes things less confusing or simpler. > Features having parts and parents is an essential part of > the DAS data model. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Sat Nov 26 20:20:24 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 27 Nov 2005 02:20:24 +0100 Subject: [DAS2] DAS intro In-Reply-To: <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> Message-ID: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Suzi: > so there seem to be 2 questions. it would be good to have both in the > intro, but only as long as the description can be clearly stated in > just a sentence or two. If it takes more then it is clearly something > that requires a fuller description outside of the intro. Agreed. > I'll try to give my understanding (but goodness knows I am peering > through different lenses). I don't think in terms of the spec at all, > just the information that needs to be conveyed. 
> > #1 "reference frame" ========================================= > > "reference frame", is (to my mind) "reference sequence". at least, > that is what i've always called it. > First, accuracy has nothing at all to do with it, so we don't want the > sentence in there. I'm fine with that. I've found it best to declare my ignorance early rather than to keep it hidden. > Second, the region of sequence that is returned is nothing more than > that. Think of it as a special type of feature. This is what makes a > transformation possible from one coordinate-system to another (by > adding the correct offsets) I can think of it as a feature just fine. But then shouldn't each region also be a feature? Why wouldn't all contigs be visible as an annotation? Contigs are in SOFA as @is_a at contig ; SO:0000149 @is_a@ assembly_component ; SO:0000143 @part_of@ supercontig ; SO:0000148 What advantage is there to break this feature out at a "/region"? One that I can see is that the reference server provides the regions while the annotation server provides the other features. But if that's the case we could have the reference server also provide the regions as features, and the annotation server makes references to those features rather than to regions. That is, in the current scheme we have: has 0 or more element, where the 'pos' attribute links to region + start/stop range and the optional 'seq' attribute links to the sequence range, as in: is only a link to the sequence and a length, as in: One alternate possibility is to change that so "pos" points to a /feature (instead of a /region) and have features for each contig or other assembly component. The result would look like: ... Doing this, however, means that all features must support subranges. As an alternate solution without ranges, use and then look up the sequence coordinates of feature/AB1234 to figure out where it starts/stops. The other advantage to a region is you can ask for the assembly via the 'agp' format. 
But because of the existing support for formats which are only valid for some feature you can do that by asking for, say, all assembly_component features (via the feature filter) and return the results in 'agp' format. > Third, just think of "reference sequence" as a coordinate system. One > can have the exact same feature and indicate that: on > coordinate-system-A this feature starts and ends here, and on > coordinate-system-B it starts and ends there. Thus a feature's > coordinates may be given both on a chromosome, and on a contig, and on > any other coordinate-system that can be derived through a transform > from these. I believe I understand this. There really is only one reference frame for the entire genome sequence, for a given assembly, and all other coordinate systems are a fixed and definite offset of that single reference frame. I believe this is called the golden path? My reference to accuracy is because I figured that given two features A and B on an assembly component X then the fuzziness in the relative distance between A and B is small if X is also small. That is, smaller terms are less likely to have changes as the golden path changes. > So you could change the sentence below to read "A reference server > may supply features where the locations (start and end) are relative > to either contigs, some other arbitrary region, or to the entire > chromosome." Why not always supply it relative to the chromosome coordinates? The spec now allows that as an optional field. I can't figure out why you would want to do otherwise. Is it because sometimes it's easier to work with, say, a large number of contig reference frames than with one large reference frame? Does that mean we shift the complexity of coordinate translation from the data provider to the data consumer? (Making it easier to generate data than to consume data.) 
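The fixed-offset translation under discussion is the whole transform. A sketch in a few lines; the contig name and its offset on the chromosome are invented for illustration (the coordinate values echo the Chr3 1271:1507 range quoted elsewhere in the thread):

```python
# Assumed lookup table: where each contig starts on its chromosome.
# Real values would come from the assembly (golden path), not from here.
CONTIG_OFFSETS = {"contig-AB1234": 150000}

def contig_to_chromosome(contig, start, stop):
    """Translate a (start, stop) range given in contig coordinates into
    chromosome coordinates by adding the contig's fixed offset."""
    offset = CONTIG_OFFSETS[contig]
    return offset + start, offset + stop

print(contig_to_chromosome("contig-AB1234", 1271, 1507))  # -> (151271, 151507)
```

The simplicity cuts both ways: either the data provider applies this once when serving chromosome coordinates, or every data consumer must carry the offset table and apply it themselves, which is exactly the trade-off raised above.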
> This one is perhaps too subtle for the introduction, but if we decide > to include it then I think it should first be phrased in terms of the > problem (biological sampling) and then in terms of the solution > (multiple parents). Oh, definitely. It's some place where I just don't have the domain knowledge to explain it or even come up with examples. Andrew dalke at dalkescientific.com From suzi at fruitfly.org Sat Nov 26 20:24:07 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Sat, 26 Nov 2005 17:24:07 -0800 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: Lets add this to the agenda for Monday morning. Hopefully that will be faster than via e-mail. On Nov 26, 2005, at 5:20 PM, Andrew Dalke wrote: > Suzi: >> so there seem to be 2 questions. it would be good to have both in the >> intro, but only as long as the description can be clearly stated in >> just a sentence or two. If it takes more then it is clearly something >> that requires a fuller description outside of the intro. > > Agreed. > >> I'll try to give my understanding (but goodness knows I am peering >> through different lenses). I don't think in terms of the spec at all, >> just the information that needs to be conveyed. >> >> #1 "reference frame" ========================================= >> >> "reference frame", is (to my mind) "reference sequence". at least, >> that is what i've always called it. > > >> First, accuracy has nothing at all to do with it, so we don't want >> the sentence in there. > > I'm fine with that. I've found it best to declare my ignorance early > than to keep it hidden. > >> Second, the region of sequence that is returned is nothing more than >> that. Think of it as a special type of feature. 
This is what makes a >> transformation possible from one coordinate-system to another (by >> adding the correct offsets) > > I can think of it as a feature just fine. But then shouldn't each > region > also be a feature? Why wouldn't all contigs be visible as an > annotation? > > Contigs are in SOFA as > > @is_a at contig ; SO:0000149 @is_a@ assembly_component ; > SO:0000143 @part_of@ supercontig ; SO:0000148 > > What advantage is there to break this feature out at a "/region"? > > One that I can see is that the reference server provides the regions > while the annotation server provides the other features. But if > that's the case we could have the reference server also provide the > regions as features, and the annotation server makes references to > those features rather than to regions. > > That is, in the current scheme we have: > > has 0 or more element, where the 'pos' attribute > links to region + start/stop range and the optional 'seq' attribute > links to the sequence range, as in: > > seq="sequence/Chr3/1271:1507:1"/> > > > is only a link to the sequence and a length, as in: > > > > > One alternate possibility is to change that so "pos" points to a > /feature (instead of a /region) and have features for each contig or > other assembly component. The result would look like: > > seq="sequence/Chr3/1271:1507:1"/> > > ... > > Doing this, however, means that all features must support subranges. > > > As an alternate solution without ranges, use > > > > and then look up the sequence coordinates of feature/AB1234 to > figure out where it starts/stops. > > > The other advantage to a region is you can ask for the assembly > via the 'agp' format. But because of the the existing support for > formats which are only valid for some feature you can do that by asking > for, say, all assembly_component features (via the feature filter) and > return > the results in 'agp' format. > >> Third, just think of "reference sequence" as a coordinate system. 
One >> can have the exact same feature and indicate that: on >> coordinate-system-A this feature starts and ends here, and on >> coordinate-system-B it starts and ends there. Thus a feature's >> coordinates may be given both on a chromosome, and on a contig, and >> on any other coordinate-system that can be derived through a >> transform from these. > > I believe I understand this. There really is only one reference frame > for > the entire genome sequence, for a given assembly, and all other > coordinate > systems are a fixed and definite offset of that single reference frame. > I believe this is called the golden path? > > My reference to accuracy is because I figured that given two features > A and B on an assembly component X then the fuzziness in the relative > distance between A and B is small if X is also small. That is, smaller > terms are less likely to have changes as the golden path changes. > > >> So you could change the sentence below to read "A reference server >> may supply features where the locations (start and end) are relative >> to either contigs, some other arbitrary region, or to the entire >> chromosome." > > Why not always supply it relative to the chromosome coordinates? The > spec > now allows that as an optional field. I can't figure out why you would > want to do otherwise. > > Is it because sometimes it's easier to work with, say, a large number > of > contig reference frames than with one large reference frame? Does that > mean we shift the complexity of coordinate translation from the data > provider to the data consumer? (Making it easier to generate data than > to consume data.) > > >> This one is perhaps too subtle for the introduction, but if we decide >> to include it then I think it should first be phrased in terms of the >> problem (biological sampling) and then in terms of the solution >> (multiple parents). > > Oh, definitely. 
It's some place where I just don't have the domain > knowledge to explain it or even come up with examples. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Gregg_Helt at affymetrix.com Mon Nov 28 04:44:18 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 01:44:18 -0800 Subject: [DAS2] tiled queries for performance Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Thursday, November 24, 2005 5:47 AM > To: Brian Gilman > Cc: DAS/2 > Subject: Re: [DAS2] tiled queries for performance > > Hi Brian, > > > We're looking into this kind of implementation issue ourselves and > > thought that a bitorrent like cache makes the most sense. ie. all > > servers in the "fabric" are issued the query in a certain "hop > > adjacency". These servers then send their data to the client who's job > > it is to assemble the data. > > I go back and forth between the "large data set" model and the "large > number > of entities" model. > > In the first: > - client requests a large data file > - server returns it > > This can be sped up by distributing the file among many sites and > using something like BitTorrent to put it together, or something like > Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches. > > But making the code for this is complicated. It's possible to build > on BitTorrent and similar systems, but I have no feel for the actual > implementation cost, which makes me wary. I've looked into a couple > of the P2P toolkits and not gotten the feel that it's any easier than > writing HTTP requests directly. Plus, who will set up the alternate > servers? 
My hope would be that any system like this could be hidden behind a single HTTP GET request and hence require no changes to the DAS/2 protocol. Standard web caches already work this way. I'm less familiar with the BitTorrent approach, but I'm guessing that the client-side code that stitches together the pieces from multiple servers could be encapsulated in a client-side daemon that responds to localhost HTTP calls. > In the second: > - make query to server > - server returns list of N identifiers > - make N-n requests (where 'n' is the number of identifiers already > resolved) > > The id resolution can be done in a distributed fashion and is easily > supported via web caches, either with well-configured proxies or (again) > through Coral. > > I like the latter model in part because it's more fine grained. Eg, > a progress bar can say "downloading feature 4 of 10000", and if a given > feature is already present there's no need to refetch it. > > The downside of the 2nd is the need for HTTP 1.1 pipelining to make it > be efficient. I don't know if we want to have that requirement. I'm wary of this "large number of entities" approach, for several reasons. Due to the overhead for TCP/IP, HTTP headers, and extra XML stuff like doctype and namespace declarations, making an HTTP GET request per feature will increase the total number of bytes that need to be transmitted. It will also increase the parsing overhead on the client side. And if the features contain little information (for example just type, parts/parents, and location) that overhead could easily exceed the time taken to process the "useful" data. As you indicated, some performance problems could be alleviated by HTTP 1.1 pipelining, but that adds additional requirements to both client and server. 
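A rough back-of-the-envelope version of the overhead argument above; every byte count here is an assumption for illustration, not a measurement:

```python
# Assumed fixed costs per HTTP round trip and per XML document, plus an
# assumed payload for a small feature (type, parts/parents, location).
PER_REQUEST_OVERHEAD = 500   # TCP/IP + HTTP request/response headers, bytes
PER_DOC_XML_OVERHEAD = 200   # doctype and namespace declarations, bytes
FEATURE_PAYLOAD = 150        # "useful" bytes per small feature

def total_bytes(n_features, requests):
    """Total bytes moved to fetch n_features spread over that many requests."""
    return (requests * (PER_REQUEST_OVERHEAD + PER_DOC_XML_OVERHEAD)
            + n_features * FEATURE_PAYLOAD)

n = 10000
bulk = total_bytes(n, requests=1)         # one query returning all features
per_feature = total_bytes(n, requests=n)  # one request per feature id
print(per_feature / float(bulk))          # several-fold more traffic
```

Under these (assumed) numbers the per-feature scheme moves more than five times the bytes of the bulk query, before counting extra parsing and disk-seek costs on the client.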
Also, for persistent caching on the local machine when you start splitting up the data into hundreds of thousands of files, I suspect the additional disk seek time will far exceed disk read time and become a serious performance impediment. Having said that, in theory this approach is (almost) testable using the current DAS/2 spec. Create one DAS/2 server that in response to feature queries returns only the minimum required information for "N" features: id and type. And have feature ids returned be URLs on another DAS/2 server that _does_ return full feature information (location, alignment, etc.). Then make "N-n" single-feature queries with those URLs to get full information. Due to the current DAS/2 requirement that any parts / parents referenced also be included in the same XML doc, this would only be a reasonable test for features with no hierarchical structure, such as SNPs. > Gregg > came up with the range restrictions because most of the massive results > will be from range searches. By being a bit more clever about tracking > what's known and not known, a client can get a much smaller results > page. > > > These are complementary. Using Gregg's restricted range queries can > reduce the number of identifiers returned in a search, making the > network overhead even smaller. 
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Gregg_Helt at affymetrix.com Mon Nov 28 05:05:33 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 02:05:33 -0800 Subject: [DAS2] das registry and das2 Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Friday, November 18, 2005 10:00 AM > To: DAS/2 > Subject: Re: [DAS2] das registry and das2 > > Andreas Prlic: > > I would like to start a discussion of how to provide a proper DAS > > interface for > > our das- registration server at http://das.sanger.ac.uk/registry/ > > > > Currently it is possible to interact with it using SOAP, or manually > > via the HTML > > interface. We should also make it accessible using URL requests. > > One of the things Gregg and I talked about at ISMB was that the > top-level > "das-sources" format is, or can be, identical to what's needed for the > registry server. > Some of what we discussed I wrote up in a post earlier this year: http://portal.open-bio.org/pipermail/das2/2005-June/000198.html Another post that might be useful in current discussions is a summary of what was discussed in the DAS/2 registry meeting we had in Hinxton back in September 2004: http://portal.open-bio.org/pipermail/das2/2005-June/000197.html gregg From Gregg_Helt at affymetrix.com Mon Nov 28 05:58:00 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 02:58:00 -0800 Subject: [DAS2] tiled queries for performance Message-ID: The attachment is a PowerPoint slide showing one of the feature query optimizations that the IGB client currently uses, which combines "overlaps" and "inside" filters. When used consistently this guarantees that the same feature is not returned in multiple feature queries. 
However in general I agree that it is the client's responsibility to reasonably handle cases where the same feature is returned multiple times. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Allen Day > Sent: Wednesday, November 23, 2005 3:50 PM > To: das2 at portal.open-bio.org > Subject: Re: [DAS2] tiled queries for performance > > More thoughts on this. The client can eliminate the redundancy in the > records returned by issuing the tiling queries as previously described > (query1), then issuing queries for records that are not contained within > tiles, but overlap the boundaries of 1 or more tiles (query2). > > However, by issuing all the overlaps queries at once, we've just deferred > the performance hit one step, because we can't reasonably expect the > server to have cached all combinations of tile overlaps queries. I think, > to get this tiling optimization to work, the burden needs to be on the > client to identify and remove duplicate responses for multiple > edge-overlaps queries (query3). > > 1000bp 2000bp 3000bp > | | | > | === | =====^==== | > | ====#===== | > | ============#=============#===== > | | | > > <-----------> query1a > <-----------> query1b > query2 > query3a > query3b > > Key: > > | : tile boundary > = : feature > ^ : gap between child features > # : portion of feature overlapping tile boundary. > : client overlaps query > <.> : client contains query > > -Allen > > > > On Mon, 21 Nov 2005, Allen Day wrote: > > > Hi, > > > > I had an idea of how clients may be able to get better response from > > servers by using a tiled query technique. Here's the basic idea: > > > > ClientA wants features in chr1/1010:2020, and issues a request for that > > range. No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). 
> > > > ClientB wants features in chr1/1020:2030, and issues a request for that > > range. Although the intersection of the resulting records with > ClientA's > > query is large, the URIs are different and the server-side cache faults > > again. > > > > If ClientA and ClientB were to each issue two separate "tiled" requests: > > > > 1. chr1/1001:2000 > > 2. chr1/2001:3000 > > > > ClientB could take advantage of the fact that ClientA had been looking > at > > the same tiles. > > > > For this to work, the clients would need to be using the same tile size. > > The optimal tile size is likely to vary from datasource to datasource, > > depending on the length and density distributions of the features > > contained in the datasource. The "sources" or "versioned sources" > > payload could suggest a tiling size to prospective clients. Servers > could > > also pre-cache all tiles by hitting each tile after an update of the > > datasource (or the DAS/2 service code). > > > > The tradeoff for the performance gains is that clients may now need to > do > > filtering on the returned records to only return those requested by the > > client's client. > > > > -Allen > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -------------- next part -------------- A non-text attachment was scrubbed... Name: DAS2_Query_Optimization.ppt Type: application/vnd.ms-powerpoint Size: 287744 bytes Desc: DAS2_Query_Optimization.ppt URL: From ap3 at sanger.ac.uk Mon Nov 28 06:48:03 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 28 Nov 2005 11:48:03 +0000 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Hi! 
> How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations. Can we formulate the start a little more general? something like: DAS/2 is a protocol to share biological data. It provides specifications for how to share annotations of genomes and proteins, assays, ontologies (space for more here...). then I would continue with your text. Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Mon Nov 28 12:10:30 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 28 Nov 2005 18:10:30 +0100 Subject: [DAS2] mtg topics for Nov 28 Message-ID: Here are the spec issues I would like to talk about for today's meeting, culled from the last few weeks of emails and phone calls 1) DAS Status Code in headers The current spec says > X-DAS-Status: XXX status code > > The list of status codes is similar, but not identical, to those used > by DAS/1: > > 200 OK, data follows > 400 Bad namespace > 401 Bad data source > 402 Bad data format > 403 Unknown object ID > 404 Invalid object ID > 405 Region coordinate error > 406 No lock > 407 Access denied > 500 Server error > 501 Unimplemented feature I argued that these are not needed. Some of them are duplicates with HTTP error codes and those which are not can be covered by an error code "300" along with an (optional) XML payload. The major problem with doing this seems to be in how MS IE handles certain error codes. While IE is not a target browser, MS software may use IE as a component for fetching data. From the link Ed dug up, it looks like this won't be a problem. Lincoln's last email on this was a tepid > I give up arguing this one and will go with the way Andrew wants to do > it. 
Therefore I propose the following rules: > > 1) Return the HTTP 404 error for the case that any component of the > DAS2 path > is invalid. This would apply to the following situations: > > Bad namespace > Bad data source > Unknown object ID > > 2) Return HTTP 301 and 302 redirects when the requested object has > moved. > > 3) Return HTTP 403 (forbidden) for no-lock errors. > > 4) Return HTTP 500 when the server crashes. > > For all errors there should be a text/x-das-error entity returned that > describes the error in more detail. The "x-das-error" format must have an invariant string, either an error code or fixed text, and a possible optional explanatory text section. Note the "should" in that last paragraph - this is optional. 2) Content-type There was some discussion about changing the content type to "text/xml" to support viewing DAS results in a browser. We decided that that wasn't a valid use case. In doing the research for this I found that the general recommendation for these sorts of XML documents is to put the document under "application/*" instead of "text/*". One reason is from http://www.ietf.org/rfc/rfc3023.txt If an XML document -- that is, the unprocessed, source XML document -- is readable by casual users, text/xml is preferable to application/xml. MIME user agents (and web user agents) that do not have explicit support for text/xml will treat it as text/plain, for example, by displaying the XML MIME entity as plain text. Application/xml is preferable when the XML MIME entity is unreadable by casual users. Similarly, text/xml-external-parsed-entity is preferable when an external parsed entity is readable by casual users, but application/xml-external-parsed-entity is preferable when a plain text display is inappropriate. NOTE: Users are in general not used to text containing tags such as , and often find such tags quite disorienting or annoying. 
If one is not sure, the conservative principle would suggest using application/* instead of text/* so as not to put information in front of users that they will quite likely not understand. Another is the difference in how application/* and text/* handle character set encodings. We use "text/x-...+xml" - I propose changing this to "application/x-...+xml" I don't think there are any objections to this. The main objection is to the difficulty of ploughing through all the specs related to charsets and unicode. 3) Key/value data As Steve pointed out, the spec is incomplete on how to handle key/value data associated with a record. The main problem is in how it handles namespaces. It mixes an internal attribute value namespace with the xml namespace, which isn't how XML namespaces work. For example, This is a telomeric repeat birx28 This is a telomeric repeat 29 This is a telomeric repeat 29 - "simple extension elements" not in the "atom:" namespace > - "structured extension elements" not in the "atom:" namespace. > > Most of the "atom:" elements share a common structure. For example: > - the type= attribute indicates if the contents are text, escaped > HTML or XHTML; or an explicit content-type like "chemical/x-pdb". > > - the src= attribute indicates that the content of the element is > empty and to go to the given URL instead (apparently the hip > term for URL these days is IRI - Internationalized Resource > Identifiers. > I think we only need to use URLs) > > > These are not always used for all elements; if it's appropriate for a > given field then it's used. > > > Simple extension elements are always of the form > Content goes here > where 'element' is not part of the 'atom:' namespace. Consumers of > this data may treat it as simple key/value data. > > Structured extension elements always have at least an attribute > or a sub-element, so must look like > .. > -or- > .. .. > > If the element isn't known this field may be ignored.
> > These three things provide for: > - a set of well-defined elements, understandable by everyone > - a simple extension for things which can be key/value data > - a way to store or refer to more complex data types 5) xlink and Several places in the spec include or may include links to documents elsewhere. The XLink specification describes a general extensibility mechanism for such links. xlinks have about four properties; the most important are: - where does the link go to - what kind of link is it - what should the browser do with such a link I personally don't understand the xlink spec well enough to want to use it, and I haven't come across examples of it in use. I am wary about specs like that. Another is to use something like the element from HTML 4.0 and in Atom. This looks something like that is, it has: - a category for how the link is related to the given object ('rel') - an optional MIME type (use, eg, if the server has multiple ways to provide data for the same 'rel' category) - an href to the data As implemented in Atom the contents of a are extensible, which allows people to experiment with things like mirroring. In any case we need a way to provide typed links to other documents. Such links may include: - link from a given feature to the versioned source - link from a versioned source to the lock document 6) Source filters This comes from Andreas Prlic. We can support metadata servers via the same document returned from the entry point to a DAS server. However, a metadata server may also support searches, eg, to show only H. sapiens annotations using the build 1234 assembly. Should we make this property searching part of the DAS/2 spec, which means everyone must support it, or should we say it's optional but if implemented it must be done in a standard way? Or leave it for version 2.1, once we have more experience with DAS in real-life? (Though we already have that experience.)
7) /regions Could someone please explain to me the point of the /region subtree? As far as I can tell, a region is just a type of feature. A generic feature is located somewhere on the genome (with respect to a given assembly), and may also say it's on various 'region' features. I don't see the need for a separate namespace for this. 8) Tiled queries Do they need spec changes, or spec recommendations? I think I've mentioned everything to be covered. Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Nov 28 12:14:28 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 09:14:28 -0800 Subject: [DAS2] tiled queries for performance Message-ID: I don't think we should allow servers to return features that do not meet the criteria specified in the query feature filters; it's an invitation for ambiguity. This may seem harmless with just an "overlaps" region filter, but what about "inside", "contains", "identical"? What about "type", etc? If different DAS/2 server implementations contain the same data, they should return the same set of features for a given feature query. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Friday, November 25, 2005 3:43 PM > To: Asim Siddiqui > Cc: DAS/2 > Subject: Re: [DAS2] tiled queries for performance > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request > and the server is free to expand the range to something it can handle > a bit better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must > be able to filter or otherwise ignore extra results. > > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough.
> > Andrew > dalke at dalkescientific.com > From dalke at dalkescientific.com Mon Nov 28 12:14:52 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 28 Nov 2005 18:14:52 +0100 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Andreas Prlic: > Can we formulate the start a little more general? > > something like: > > DAS/2 is a protocol to share biological data. It provides > specifications for how > to share annotations of genomes and proteins, assays, ontologies > (space fore more here...). I thought about that, but the DAS/2.0 spec doesn't include any of those. Perhaps be more definite instead and say this is DAS/2.0? Or say "Other projects (link, link, link) extend DAS/2 to protein, assay and ontology data sets." Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Nov 28 12:24:32 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:24:32 -0500 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: <200511281224.32885.lstein@cshl.edu> > > > > is only a link to the sequence and a length, as in: > > You know, this is still kind of ugly. I hate to revisit this so late in the game, but can't we make sequence retrieval a three-step process? 1) Feature request returns: 2) Region request returns: (where seq= could be an absolute URL if someone else owns the bases) 3) Sequence request then returns the bases Lincoln > > > One alternate possibility is to change that so "pos" points to a > /feature (instead of a /region) and have features for each contig or > other assembly component. The result would look like: > > seq="sequence/Chr3/1271:1507:1"/> > > ... > > Doing this, however, means that all features must support subranges. 
> > > As an alternate solution without ranges, use > > > > and then look up the sequence coordinates of feature/AB1234 to > figure out where it starts/stops. > > > The other advantage to a region is you can ask for the assembly > via the 'agp' format. But because of the existing support for > formats which are only valid for some features you can do that by asking > for, say, all assembly_component features (via the feature filter) and > return > the results in 'agp' format. > > > Third, just think of "reference sequence" as a coordinate system. One > > can have the exact same feature and indicate that: on > > coordinate-system-A this feature starts and ends here, and on > > coordinate-system-B it starts and ends there. Thus a feature's > > coordinates may be given both on a chromosome, and on a contig, and on > > any other coordinate-system that can be derived through a transform > > from these. > > I believe I understand this. There really is only one reference frame > for > the entire genome sequence, for a given assembly, and all other > coordinate > systems are a fixed and definite offset of that single reference frame. > I believe this is called the golden path? > > My reference to accuracy is because I figured that given two features > A and B on an assembly component X then the fuzziness in the relative > distance between A and B is small if X is also small. That is, smaller > terms are less likely to have changes as the golden path changes. > > > So you could change the sentence below to read "A reference server > > may supply features where the locations (start and end) are relative > > to either contigs, some other arbitrary region, or to the entire > > chromosome." > > Why not always supply it relative to the chromosome coordinates? The > spec > now allows that as an optional field. I can't figure out why you would > want to do otherwise.
> > Is it because sometimes it's easier to work with, say, a large number of > contig reference frames than with one large reference frame? Does that > mean we shift the complexity of coordinate translation from the data > provider to the data consumer? (Making it easier to generate data than > to consume data.) > > > This one is perhaps too subtle for the introduction, but if we decide > > to include it then I think it should first be phrased in terms of the > > problem (biological sampling) and then in terms of the solution > > (multiple parents). > > Oh, definitely. It's some place where I just don't have the domain > knowledge to explain it or even come up with examples. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 28 12:08:35 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:08:35 -0500 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <200511281208.36204.lstein@cshl.edu> Yes, this is a better intro. Lincoln On Friday 25 November 2005 10:21 am, Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations. An annotation > server provides information about one or more genome sources. Each > source may have one or more versions. 
Different versions are usually > based on different assemblies. As an implementation detail an > assembly and corresponding sequence data may be distributed via a > different machine, which is called the reference server. Portions of > the assembly may have higher relative accuracy than the assembly as a > whole. A reference server may supply these portions as an alternate > reference frame. > > Annotations are located on the genome with a start and end position. > The range may be specified multiple times if there are alternate > reference frames. An annotation may contain multiple non-contiguous > parts, making it the parent of those parts. Some parts may have more > than one parent. Annotations have a type based on terms in SOFA > (Sequence Ontology for Feature Annotation). Stylesheets contain a set > of properties used to depict a given type. > > Annotations can be searched by range, type, and a properties table > associated with each annotation. These are called feature filters. > > DAS/2 is implemented using a ReST architecture. Each entity (also > called a document or object) has a name, which is a URL. Fetching the > URL gets information about the entity. The DAS-specific entities are > all XML documents. Other entities contain data types with an existing > and frequently used file format. Where possible, a DAS server returns > data using existing formats. In some cases a server may describe how > to fetch a given entity in several different formats. > ====== > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D.
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 28 12:11:24 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:11:24 -0500 Subject: [DAS2] tiled queries for performance In-Reply-To: <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> Message-ID: <200511281211.25239.lstein@cshl.edu> One thing to do is to add to the spec a note that the server is free to return features from a range larger than requested. This way the server is free to expand the range to the 1k boundaries. My preference, however, would be for the server to implement a filter that removes from the precalculated tiled XML output all features that are outside the range. This would be completely transparent to the client. Lincoln On Friday 25 November 2005 06:43 pm, Andrew Dalke wrote: > Asim Siddiqui > > > I think this is a great idea. > > > > I don't see this as a big change to the DAS/2 spec or requiring much in > > the way of additional smarts on the client side. > > I agree with Allen on this - in some sense there's no effect on the > spec. It ends up being an agreement among the clients to request > aligned data, by rounding up/down to the nearest, say, kilobase and > for the server implementers to cache those requests. > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request > and the server is free to expand the range to something it can handle > a bit better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must > be able to filter or otherwise ignore extra results. 
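The client-side cleanup described above (dropping features the server returned beyond the requested range) is a few lines of code. A minimal sketch in Python; the Feature record, the field names, and the half-open coordinate convention are illustrative assumptions, not part of the DAS/2 spec:

```python
from collections import namedtuple

# Hypothetical minimal feature record; real DAS/2 features carry more fields.
Feature = namedtuple("Feature", ["id", "start", "end"])

def overlaps(feature, start, end):
    """True if the feature overlaps the half-open range [start, end)."""
    return feature.start < end and feature.end > start

def filter_to_request(features, start, end):
    """Keep only features that overlap the range the client originally asked for."""
    return [f for f in features if overlaps(f, start, end)]

# Server expanded a 1500..1600 request to a larger tile and returned extras:
returned = [Feature("a", 900, 1000), Feature("b", 1550, 1580), Feature("c", 1900, 1950)]
kept = filter_to_request(returned, 1500, 1600)  # only "b" overlaps the request
```

The same predicate would need variants for the stricter filters Gregg mentions ("inside", "contains", "identical"), which is part of why pushing this work onto clients is contentious.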
> > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Nov 28 12:30:27 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 09:30:27 -0800 Subject: [DAS2] Agenda for today's DAS/2 meeting Message-ID: Today we're going over spec issues. Here's my short list of topics to cover: DAS-specific headers Error codes Feature properties Registry & Discovery Please feel free to add! gregg From td2 at sanger.ac.uk Mon Nov 28 12:27:31 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Mon, 28 Nov 2005 17:27:31 +0000 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <83634851-73AD-454A-B027-644539CF1869@sanger.ac.uk> On 28 Nov 2005, at 17:14, Andrew Dalke wrote: > Andreas Prlic: >> Can we formulate the start a little more general? >> >> something like: >> >> DAS/2 is a protocol to share biological data. It provides >> specifications for how >> to share annotations of genomes and proteins, assays, ontologies >> (space fore more here...). > > I thought about that, but the DAS/2.0 spec doesn't include any of > those. There are pages about assay and ontology retrieval on the website. Are these not part of the spec? Or are they being counted as something else (DAS/2.1?) Thomas. 
From dalke at dalkescientific.com Mon Nov 28 13:09:17 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 28 Nov 2005 19:09:17 +0100 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: References: Message-ID: Here's the email I sent to Steve that I meant to send to everyone. On Nov 17, 2005, at 2:09 AM, Andrew Dalke wrote: > I think I understand the Atom spec better now. In brief, the > Atom document contains sections which are extensible and sections > which are not. > > In an extensible section there are two/three categories of elements: > - those in the "atom:" namespace > - "simple extension elements" not in the "atom:" namespace > - "structured extension elements" not in the "atom:" namespace. > > Most of the "atom:" elements share a common structure. For example: > - the type= attribute indicates if the contents are text, escaped > HTML or XHTML; or an explicit content-type like "chemical/x-pdb". > > - the src= attribute indicates that the content of the element is > empty and to go to the given URL instead (apparently the hip > term for URL these days is IRI - Internationalized Resource > Identifiers. > I think we only need to use URLs) > > > These are not always used for all elements; if it's appropriate for a > given field then it's used. > > > Simple extension elements are always of the form > Content goes here > where 'element' is not part of the 'atom:' namespace. Consumers of > this data may treat it as simple key/value data. > > Structured extension elements always have at least an attribute > or a sub-element, so must look like > .. > -or- > .. .. > > If the element isn't known this field may be ignored.
> > These three things provide for: > - a set of well-defined elements, understandable by everyone > - a simple extension for things which can be key/value data > - a way to store or refer to more complex data types > > > Steve, responding to an earlier posting of mine: >> Interesting, but a problem with this is that it effectively creates a >> new version of the TYPES schema every time a new property is added to >> the DAS properties controlled vocabulary. I would hope for a solution >> that decouples the content of the controlled vocab from the data >> exchange format. > > I looked into that. Relax-NG lets you define a "can be anything > except ...". The Atom spec is defined with the following > > # Simple Extension > > simpleExtensionElement = > element * - atom:* { > text > } > > # Structured Extension > > structuredExtensionElement = > element * - atom:* { > (attribute * { text }+, > (text|anyElement)*) > | (attribute * { text }*, > (text?, anyElement+, (text|anyElement)*)) > } > > The "element * - atom:*" means "Any element except those in > the atom namespace." > > Thus we can validate anything with DAS/2 tags, and ignore > validation of anything not part of DAS/2. And we can say that > extensions are only allowed in certain parts of the spec and > not in others. > > We would need to update the schema when we add new "das:" elements, > but we already need to do that. > > We wouldn't need to change the schema to allow others to develop > their own extensions. Indeed, the schema would still let us > verify that extensions are still well-formed.
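The simple-versus-structured distinction in the quoted Atom rules is easy to check mechanically: a simple extension element has text content only, while a structured one has at least one attribute or sub-element. A sketch that classifies a feature's children roughly that way; the das namespace URI and the x: extension elements are made-up examples, not from the spec:

```python
import xml.etree.ElementTree as ET

# Assumed DAS/2 namespace URI, in ElementTree's "{uri}tag" form.
DAS_NS = "{http://www.biodas.org/ns/das/genome/2.00/}"

def classify(elem):
    """Label an element 'das', 'simple' extension, or 'structured' extension."""
    if elem.tag.startswith(DAS_NS):
        return "das"
    if not elem.attrib and len(elem) == 0:  # text-only: simple extension
        return "simple"
    return "structured"  # has attributes and/or sub-elements

doc = ET.fromstring(
    '<FEATURE xmlns="http://www.biodas.org/ns/das/genome/2.00/" '
    'xmlns:x="http://example.org/ext">'
    '<x:note>telomeric repeat</x:note>'
    '<x:score units="phred">29</x:score>'
    '</FEATURE>'
)
labels = [classify(child) for child in doc]  # ["simple", "structured"]
```

A consumer could fold all "simple" extensions into a key/value table and ignore unknown "structured" ones, which is exactly the behavior the Atom-style rules license.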
> >> Here's my next attempt, which more fully exploits xml:base to achieve >> this decoupling: >> >> > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" >> xml:base="http://www.wormbase.org/das/genome/volvox/1/" >> xmlns:xlink="http://www.w3.org/1999/xlink" >>> >> > das:type="type/curated_exon"> >> >> 29 >> >> > xml:base="http://www.biodas.org/ns/das/genome/2.00/properties"> >> 2 >> > xlink:type="simple" >> >> xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ >> CTEL54X.1" >> /> >> >> > > Vs. > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" > > xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink"> > das:type="type/curated_exon"> > 29 > 2 > src="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" > /> > > > > The main differences are: > - the properties are defined elements in the prop: namespace (though > I think they can just as easily be in the das: namespace) > > - I'm using lower-case since that seems to be the trend these days. > > > >> So now we have the following arrangement: >> >> * the attribute keys 'das:id', 'das:type', and 'das:ptype' are >> defined >> within the xmlns:das namespace (i.e., the full id of 'das:type' is >> derived by appending 'type' to the xmlns:das URL). > > I don't follow why the attributes have full namespaces. Is that > to allow extensibility of element attribute on a per-element basis? > > I kept "das:type" above because "type" already has too many meanings. > >> * the attributes values of 'das:id', 'das:type', and 'das:ptype' are >> URLs relative to xml:base. > > Are all attribute values relative to xml:base or only those three? > > Are xlink:href fields relative to xml:base as well? I assume "yes". 
> >> * The FEATURE element may contain zero or more PROPERTIES >> sub-elements, each with its own xml:base attribute, effectively >> changing what xml:base is used within the contained PROP >> sub-elements. >> >> So in this example, the property >> 'das:ptype="property/genefinder-score"' >> inherits its xml:base from its grandparent FEATURES element and so >> expands to: >> >> http://www.wormbase.org/das/genome/volvox/1/property/genefinder-score >> >> while the 'das:ptype="phase"' and 'das:ptype="protein_translation"' >> properties inherit xml:base from their PROPERTIES parent element and >> so expand to: >> >> http://www.biodas.org/ns/das/genome/2.00/properties/phase >> http://www.biodas.org/ns/das/genome/2.00/properties/ >> protein_translation > > This is also what happens with the "prop:" namespaced elements, just > at the element level instead of the attribute level. > > To keep this on key/value data I've shifted the rest of the reply > to the next email. Andrew dalke at dalkescientific.com From asims at bcgsc.ca Mon Nov 28 14:21:47 2005 From: asims at bcgsc.ca (Asim Siddiqui) Date: Mon, 28 Nov 2005 11:21:47 -0800 Subject: [DAS2] tiled queries for performance Message-ID: <86C6E520C12E52429ACBCB01546DF4D3BE3EF8@xchange1.phage.bcgsc.ca> Agreed - in light of this, my suggestion doesn't make sense, though Allen's idea may be workable through some other means. Asim -----Original Message----- From: Helt,Gregg [mailto:Gregg_Helt at affymetrix.com] Sent: Monday, November 28, 2005 9:14 AM To: Andrew Dalke; Asim Siddiqui Cc: DAS/2 Subject: RE: [DAS2] tiled queries for performance I don't think we should allow servers to return features that do not meet the criteria specified in the query feature filters, it's an invitation for ambiguity. This may seem harmless with just an "overlaps" region filter, but what about "inside", "contains", "identical"? What about "type", etc?
If different DAS/2 server implementations contain the same data, they should return the same set of features for a given feature query. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Friday, November 25, 2005 3:43 PM > To: Asim Siddiqui > Cc: DAS/2 > Subject: Re: [DAS2] tiled queries for performance > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request and the > server is free to expand the range to something it can handle a bit > better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must be > able to filter or otherwise ignore extra results. > > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough. > > Andrew > dalke at dalkescientific.com > From allenday at ucla.edu Mon Nov 28 15:11:59 2005 From: allenday at ucla.edu (Allen Day) Date: Mon, 28 Nov 2005 12:11:59 -0800 (PST) Subject: [DAS2] tiled queries for performance In-Reply-To: <200511281211.25239.lstein@cshl.edu> References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> <200511281211.25239.lstein@cshl.edu> Message-ID: On Mon, 28 Nov 2005, Lincoln Stein wrote: > One thing to do is to add to the spec a note that the server is free to return > features from a range larger than requested. This way the server is free to > expand the range to the 1k boundaries. This would require the returned payload to contain the bounds of the features actually returned. E.g. if client asks for 1500..1600, and server responds with 1001..2000, it needs a way to tell the client what the actual bounds of the response are. 
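The tile arithmetic behind Allen's point is simple; the substantive requirement is that the expanded bounds travel back with the payload so the client can interpret the extras. A hedged sketch, assuming half-open coordinates and an example tile size of 1000 (the spec fixes neither):

```python
TILE = 1000  # example tile size; a real server might advertise this per data source

def expand_to_tiles(start, end, tile=TILE):
    """Round a requested half-open range [start, end) out to tile boundaries.

    The server answers for the expanded range and must report these bounds
    in its response so the client knows why extra features are present.
    """
    tile_start = (start // tile) * tile          # round start down
    tile_end = ((end + tile - 1) // tile) * tile # round end up
    return tile_start, tile_end

# Client asks for 1500..1600; the server actually answers for 1000..2000.
bounds = expand_to_tiles(1500, 1600)  # (1000, 2000)
```

Because every client that rounds this way asks for identical URLs, tile-aligned requests are what make server-side (and proxy) caching effective in the scheme discussed earlier in the thread.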
> > My preference, however, would be for the server to implement a filter that > removes from the precalculated tiled XML output all features that are outside > the range. This would be completely transparent to the client. Yes, this is what I plan to do if we agree to use one of the tiling variants. -Allen > > Lincoln > > On Friday 25 November 2005 06:43 pm, Andrew Dalke wrote: > > Asim Siddiqui > > > > > I think this is a great idea. > > > > > > I don't see this as a big change to the DAS/2 spec or requiring much in > > > the way of additional smarts on the client side. > > > > I agree with Allen on this - in some sense there's no effect on the > > spec. It ends up being an agreement among the clients to request > > aligned data, by rounding up/down to the nearest, say, kilobase and > > for the server implementers to cache those requests. > > > > > The change is simply that instead of the client getting exactly what it > > > asks for, it may get more. > > > > While that's another matter - the client makes a request > > and the server is free to expand the range to something it can handle > > a bit better. Allen? Were you suggesting this instead? > > > > In this case there is a change to the spec, and all clients must > > be able to filter or otherwise ignore extra results. > > > > I personally think it's an implementation issue related to performance > > and there are ways to make the results be generated fast enough. 
> > > > Andrew > > dalke at dalkescientific.com > > > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > From Steve_Chervitz at affymetrix.com Mon Nov 28 17:07:29 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 14:07:29 -0800 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: Message-ID: To give some context to the message that Andrew recently forwarded to the list, below is the message I sent to Andrew that prompted his reply (I also meant to send it to the list instead of to just Andrew). It contains my fix to the 'namespace in attribute values' problem regarding properties which I mentioned in today's conf call, and is, I believe, the only viable alternative to Andrew's Relax-NG based solution. Basically, the trick is to enclose PROP elements that are relative to the same xml:base within a parent PROPERTIES element and then permit multiple PROPERTIES elements within a feature. This way you can allow property attribute URIs that are relative to different xml:bases. To clarify a point of possible confusion, there are really two sets of key-value pairs to keep in mind: 1. The key-value pair for the property type. 2. The key-value pair for the property itself. So in this example: 29 The key for the type is 'das:ptype' and its value is 'property/genefinder-score', and this value is a relative URL based on xml:base in the enclosing PROPERTIES element (or in its grandparent or great-grandparent element, etc.). The value of the property itself is 29 and its key is the whole key-value pair for the type ( das:ptype="property/genefinder-score"). In Andrew's Relax-NG equivalent: 29 the element name contains both the key ('prop:') and the value of the property type ('genefinder-score'), while the element name as a whole serves as the key for the property itself (value=29).
The 'prop:genefinder-score' string is not a relative URL, but is just a namespace-scoped element name, with 'prop:' serving merely to make 'genefinder-score' globally unique, relative to the URI defined by: xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" A potential drawback of the Relax-NG approach, as discussed in today's conf call, is that the value of the property type is not resolvable as in the other approach using the PROPERTIES parent element. Andrew doesn't see a need for resolvability, e.g., for a dynamically discoverable schema fragment. But I thought of another use case besides the one mentioned in today's call (determining data type such as int or float, which isn't of much use in practice). The URL for the type could point to a human readable definition of the term. A user may not need clarification of 'genefinder-score' but might for something like 'softberry-ztuple'. One could still satisfy such a use case under the Relax-NG approach by providing a resolvable URL based on the element name + namespace such as: http://www.biodas.org/ns/das/genome/2.00/properties#genefinder-score True, there's no XML spec that says this is legal, but we could declare that such a convention will hold for all biodas.org-based properties. One problem with the above convention is that it's not obvious what the URL resolves to. So we could have something like: http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&define=true http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&schema=true Just a thought.
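The xml:base expansions walked through in this thread are ordinary RFC 3986 relative-reference resolution, which urllib.parse.urljoin implements. A sketch using the URLs from the examples above; one caveat worth noting is that strict RFC 3986 resolution only yields the '.../properties/phase' expansion shown if the base ends with a slash, so the trailing slash below is an assumption:

```python
from urllib.parse import urljoin

# xml:base declared on the outer FEATURES element (from the example above).
features_base = "http://www.wormbase.org/das/genome/volvox/1/"
# xml:base declared on an inner PROPERTIES element, overriding the outer one.
# Trailing slash added: without it, urljoin would replace the last path segment.
properties_base = "http://www.biodas.org/ns/das/genome/2.00/properties/"

# das:ptype inheriting xml:base from the FEATURES grandparent:
score_url = urljoin(features_base, "property/genefinder-score")
# das:ptype values inheriting xml:base from their PROPERTIES parent:
phase_url = urljoin(properties_base, "phase")
translation_url = urljoin(properties_base, "protein_translation")
```

This also illustrates Gregg's earlier point from the status reports: xml:base governs resolution of relative URLs in attribute values and content, while xmlns only governs the names of elements and attributes.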
Steve > From: Steve Chervitz > Date: Mon, 14 Nov 2005 17:40:28 -0800 > To: Andrew Dalke > Conversation: [DAS2] Spec issues > Subject: Re: [DAS2] Spec issues > > > Andrew Dalke wrote on 14 Nov 2005: >> >> To: DAS/2 >> Subject: Re: [DAS2] Spec issues >> >> On Nov 4 Steve wrote: >>> >> das:type="type/curated_exon"> >>> 29 >>> 2 >>> >> xlink:type="simple" >>> >>> xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ >>> CTEL54X.1 >>> /> >>> >> >> I think we're missing something. This is XML. We can do >> >> >> > ontology="http://song.sf.net/ontologies/sofa#gene" >> source="curated" >> xml:base="gene/"> >> 29 >> 2 >> > xlink:href="http://www.wormbase.org/..." /> >> This message brought to you by >> AT&T >> > >> >> The whole point of having namespaces in XML is to keep from needing >> to define new namespaces like . >> >> In doing that, there's no problem in supporting things like "bg:glyph", >> etc. because the values are expanded as expected by the XML processor. > > Interesting, but a problem with this is that it effectively creates a > new version of the TYPES schema every time a new property is added to > the DAS properties controlled vocabulary. I would hope for a solution > that decouples the content of the controlled vocab from the data > exchange format. 
> > Here's my next attempt, which more fully exploits xml:base to achieve > this decoupling: > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" >> > das:type="type/curated_exon"> > > 29 > > xml:base="http://www.biodas.org/ns/das/genome/2.00/properties"> > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > So now we have the following arrangement: > > * the attribute keys 'das:id', 'das:type', and 'das:ptype' are defined > within the xmlns:das namespace (i.e., the full id of 'das:type' is > derived by appending 'type' to the xmlns:das URL). > > * the attribute values of 'das:id', 'das:type', and 'das:ptype' are > URLs relative to xml:base. > > * The FEATURE element may contain zero or more PROPERTIES > sub-elements, each with its own xml:base attribute, effectively > changing what xml:base is used within the contained PROP > sub-elements. > > So in this example, the property 'das:ptype="property/genefinder-score"' > inherits its xml:base from its grandparent FEATURES element and so > expands to: > > http://www.wormbase.org/das/genome/volvox/1/property/genefinder-score > > while the 'das:ptype="phase"' and 'das:ptype="protein_translation"' > properties inherit xml:base from their PROPERTIES parent element and > so expand to: > > http://www.biodas.org/ns/das/genome/2.00/properties/phase > http://www.biodas.org/ns/das/genome/2.00/properties/protein_translation > > >>> Also, we might want to allow some controlled vocabulary terms to be >>> used for >>> the value of type.source (e.g., "das:curated"), to ensure that >>> different >>> users use the same term to specify that a feature type is produced by >>> curation. >> >> I talked with Andreas Prlic about what other metadata is needed for the >> registry system. 
He mentioned >> >> Together with the BioSapiens DAS people we recently decided that >> there should be the possibility to assign gene-ontology evidence >> codes to each das source, so in the next update of the registry, >> this will be changed. >> >> That's at the source level, but perhaps it's also needed at the >> annotation level. > > I like this idea. Good re-use of GO technology. > >> >> >> My thoughts on these are: >> - come up with a more consistent way to store key/value data >> - the Atom spec has a nice way to say "the data is in this CDATA >> as text/html/xml" vs. "this text is over there". I want to copy its >> way of doing things. >> >> - I'm still not clear about xlink. Another is the HTML-style >> >> >> Atom uses the "rel=" to encoding information about the link. For >> example, the URL to edit a given document is >> >> >> >> See http://atomenabled.org/developers/api/atom-api-spec.php > > Not sure about this one yet. In the Atom API, the value of the rel > attribute is restricted to a controlled vocabulary of link > relationships and available services pertaining to editing and > publishing syndicated content on the web: > http://atomenabled.org/developers/api/atom-api-spec.php#rfc.section.5.4.1 > > What would a controlled vocab for DAS resources be? > > Skimming through the DAS/2 retrieval spec, our use of hrefs is > simply for pointing at the location of resources on the web > containing some specified content (e.g., documentation, database > entry, image data, etc.). > > The next/prev/start idea for Atom might have good applicability in the > DAS world for iterating through versions of annotations or assemblies > (e.g., rel='link-to-gene-on-next-version-of-genome'). One relationship > that would be useful for DAS would be 'latest', to get the latest > version of an annotation. 
> > DAS get URLs themselves seem fairly self-documenting (it's clear a > given link is for feature, type, or sequence for example), so having a > separate rel attribute may not provide much additional value for these > links. But it might be handy for versioning and for DAS/2 writebacks. > > Here's another link about Atom: > http://en.wikipedia.org/wiki/Atom_%28standard%29 > > Steve From ed_erwin at affymetrix.com Mon Nov 28 17:09:23 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 28 Nov 2005 14:09:23 -0800 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: <438B8013.3060107@affymetrix.com> Andrew Dalke wrote: > > I believe I understand this. There really is only one reference frame for > the entire genome sequence, for a given assembly, and all other coordinate > systems are a fixed and definite offset of that single reference frame. No. The coordinate transformations are often more complicated than simple offsets. The coordinate space for features on one contig can be 'backwards' with respect to a different contig, and the coordinate space for a gene may skip over one or more gaps with respect to the genomic sequence. Also, the term 'reference frame' bugs me a bit because 'frame' always makes me think of 'reading frame', which is not what you intend. From Steve_Chervitz at affymetrix.com Mon Nov 28 17:55:28 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 14:55:28 -0800 Subject: [DAS2] DAS/1 vs DAS/2 discussion list In-Reply-To: Message-ID: The DAS/1 list is still open and working. 
I updated biodas.org to reflect this and set up a special page to inform people about which list to use: http://biodas.org/documents/biodas-lists.html Subscribers on the DAS/1 list have not been automatically added to the DAS/2 list. They must actively subscribe themselves here: http://biodas.org/mailman/listinfo/das2 Steve > From: "Helt,Gregg" > Date: Mon, 21 Nov 2005 09:24:37 -0800 > To: Andrew Dalke , DAS/2 > Conversation: [DAS2] Getting individual features in DAS/1 > Subject: RE: [DAS2] Getting individual features in DAS/1 > > We need to discuss at today's meeting. I don't think the original DAS > list should be closed, but rather continue to serve as a list to discuss > the DAS/1 protocol and implementations, and the DAS2 mailing list should > focus on DAS/2. If we mix DAS/1 and DAS/2 discussions in the same > mailing list I think it's going to lead to a lot of confusion. > > gregg > >> -----Original Message----- >> From: das2-bounces at portal.open-bio.org > [mailto:das2-bounces at portal.open- >> bio.org] On Behalf Of Andrew Dalke >> Sent: Monday, November 21, 2005 9:09 AM >> To: DAS/2 >> Subject: Re: [DAS2] Getting individual features in DAS/1 >> >> Has anyone answered Ilari's question? >> >> I never used DAS/1 enough to answer it myself. >> >> If the normal DAS list is closed, is this the right place for DAS/1 >> questions? >> >> >> On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: >> >>> This mail is not really about DAS/2, but the web site says the >>> original DAS mailing list is now closed. >>> >>> I am setting up a DAS server that serves CGH data from my database > to >>> a visualization software, which in my case is gbrowse. I've already >>> set up Dazzle that serves the reference data from a local copy of >>> Ensembl. 
I need to be able to select individual CGH experiments to be >>> visualized, and as the measurements from a single CGH experiment cover >>> the entire genome, this cannot of course be done by specifying a >>> segment along with the features command. >>> >>> I noticed that there is a feature_id option for getting the features >>> in DAS/1.5, but on a closer look, it seems to work by getting the >>> segment that the specified feature corresponds to, and then getting >>> all features from that segment. My next approach was to use the >>> feature type to distinguish between different CGH experiments. As all >>> my data is of the type CGH, I thought that I could spare this >>> piece of information for identifying purposes. >>> >>> First I tried the generic seqfeature plugin. I created a database for >>> it with some test data. However, getting features by type does not >>> seem to work. I always get all the features from the segment in >>> question. >>> >>> Next I tried the LDAS plugin. Again I created a compatible database >>> with some test data. I must have done something wrong with the data >>> file I imported to the database, because getting the features does not >>> work. I can get the feature types, but trying to get the features >>> gives me an ERRORSEGMENT error. >>> >>> I thought that before I go further, it might be useful to ask whether >>> my approach seems reasonable, or is there a better way to achieve what >>> I am trying to do? What should I do to be able to visualize individual >>> CGH profiles? 
>>> >>> I'm grateful for any advice, >>> Ilari >> >> Andrew >> dalke at dalkescientific.com >> >> _______________________________________________ >> DAS2 mailing list >> DAS2 at portal.open-bio.org >> http://portal.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Mon Nov 28 19:01:08 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 01:01:08 +0100 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: References: Message-ID: Steve: > To clarify a point of possible confusion, there are really two sets of > key-value pairs to keep in mind: > > 1. The key-value pair for the property type. > 2. The key-value pair for the property itself. I don't see that #1 is a useful distinction. > So in this example: > > 29 > > The key for the type is 'das:ptype' and it's value is > 'property/genefinder-score' and this value is a relative URL based on > xml:base in the enclosing PROPERTIES element (or in it's grandparent or > great-grandparent element, etc.). The value of the property itself is > 29 and > it's key is the whole key-value pair for the type ( > das:ptype="property/genefinder-score"). How do I make an extension type? For example, I want to add a new property for 3D structure depiction, which can be one of "cartoon", "ribbons", or "wires". Let's say it's under my company web site in http://www.dalkescientific.com/das-types/rep3d How do I write it? I tried but couldn't figure it out. What does that URL resolve, if anything? > In Andrew's Relax-NG equivalent: > > 29 > > the element name contains both the key ('prop:') and the value of the > property type ('genefinder-score'), while the element name as a whole > serves > as the key for the property itself (value=29). 
The > 'prop:genefinder-score' > string is not a relative URL, but is just a namespace-scoped element > name, > with 'prop:' serving merely to make 'genefinder-score' globally unique, > relative to the URI defined by: > > xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" It took me a while to understand XML namespaces. This helped http://www.jclark.com/xml/xmlns.htm He uses (for purposes of explanation) the so-called "Clark notation". An example from that document is maps to <{http://www.cars.com/xml}part/> """The role of the URI in a universal name is purely to allow applications to recognize the name. There are no guarantees about the resource identified by the URI.""" Using Clark notation helps with remembering that, since { and } here are not valid for URLs. The element name "prop:genefinder-score" is a convenient way to write the full element name, and that's all. There is no meaning to the parts of the name. "prop:" is not a key, since given these two namespace definitions <... xmlns:prop="http://www.dalkescientific.com/" xmlns:wash="http://www.dalkescientific.com/"> then these two elements are identical 29 29 I think Steve is saying the same thing as I am - I wanted to rephrase it to make sure. > A potential drawback of the Relax-NG approach, as discussed in today's > conf > call, is that the value of the property type is not resolvable as in > the > other approach using the PROPERTIES parent element. > > Andrew doesn't see a need for resolvability, e.g., for a dynamically > discoverable schema fragment. But I thought of another use case > besides the > one mentioned in today's call (determining data type such as int or > float, > which isn't of much use in practice). The URL for the type could point > to a > human readable definition of the term. A user may not need > clarification of > 'genefinder-score' but might for something like 'softberry-ztuple'. Who is the user that would want the clarification? 
That is, what human will be doing the reading? Once clarified, what does that user do with the information? In my opinion, the only people who care about this are developers, and more specifically, developers who will extend a client to support new data types. Users of, say, the web front end or of IGB don't care. That's a relatively small number of people. And the use case is solved by having the doc_href for the versioned source include a link to any extensions served. Here's another solution. Somewhere early in the results include where the schema includes links for each of the fields, including any extensions. It doesn't need to be a , just something meant as a shout out to developer people. > One could still satisfy such a use case under the Relax-NG approach by > providing a resolvable URL based on the element name + namespace such > as: > > http://www.biodas.org/ns/das/genome/2.00/properties#genefinder-score > > True, there's no XML spec that says this is legal, but we could > declare that > such a convention will hold for all biodas.org-based properties. One > problem > with the above convention is that it's not obvious what the URL > resolves to. > So we could have something like: > > http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder- > score&de > fine=true > > http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder- > score&sc > hema=true We could do this, though it's a bit complicated with some tools which represent element via Clark notation - it needs a bit of string munging. I suggest that the reason why "it's not obvious what the URL resolves to" is because there's nothing which will actually use this. It is easier to just have a human-readable link either on the doc_href page or via some special "if you're a developer, look here" reference, and don't worry about automating it further. 
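Andrew's point that the prefix is only shorthand for the namespace URI can be checked mechanically: Python's ElementTree reports element names in exactly the Clark notation he cites. This is an editor's illustration using the namespace URI from his two-prefix example.

```python
import xml.etree.ElementTree as ET

# Two different prefixes bound to the same namespace URI, as in
# Andrew's example; the elements they name are identical.
doc = """<root xmlns:prop="http://www.dalkescientific.com/"
              xmlns:wash="http://www.dalkescientific.com/">
  <prop:genefinder-score>29</prop:genefinder-score>
  <wash:genefinder-score>29</wash:genefinder-score>
</root>"""

a, b = ET.fromstring(doc)
print(a.tag)           # {http://www.dalkescientific.com/}genefinder-score
assert a.tag == b.tag  # the prefix itself carries no meaning
```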
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Nov 28 19:16:17 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 01:16:17 +0100 Subject: [DAS2] DAS intro In-Reply-To: <438B8013.3060107@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> Message-ID: Ed Erwin: > No. The coordinate transformations are often more complicated than > simple offsets. The coordinate space for features on one contig can > be 'backwards' with respect to a different contig, and the coordinate > space for a gene may skip over one or more gaps with respect to the > genomic sequence. The /region entities in the DAS/2 spec are defined as (zero or more) A top-level region on the genome (similar to the "entry points" of the DAS/1 protocol). id - the URI of the sequence ID length - length of the sequence name (optional) - a human-readable label for use when referring to the region doc_href (optional) - a URL that gives additional information about this region Here is an example This is a very simple definition. As far as I can tell it does not capture the information for, say, skipping. How would you represent "the coordinate space for a gene [that skips] over one or more gaps with respect to the genomic sequence" using the current DAS/2 object model? Or goes backwards? I don't see anything like that. > Also, the term 'reference frame' bugs me a bit because 'frame' always > makes me think of 'reading frame', which is not what you intend. Oh, I agree. It's a bad term. Very very few genomics people use it, according to Google. There's a theory, popular on usenet and in some wikis, that experts rarely write the details because after all they know the topic. 
The best way to get a detailed explanation is to post something in error and wait for the corrections. :) Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Nov 28 22:05:40 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 19:05:40 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 28 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 28 Nov 2005. $Id: das2-teleconf-2005-11-28.txt,v 1.1 2005/11/29 03:06:04 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt CSHL: Lincoln Stein UC Berkeley: Suzi Lewis Sanger: Thomas Down, Andreas Prlic Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Today's topic: Spec issues (for DAS/2 retrievals) ------------------------------------------------- We are following the agenda summary in Andrew's email: http://portal.open-bio.org/pipermail/das2/2005-November/000352.html 1) DAS Status Code in headers ----------------------------- Use http error codes and not das-specific ones. das-error to provide more detail. GH: Do we really need a detailed response document? TD: How do you distinguish different parts of the error-causing request? AD: how detailed do we need to be? LS: If you wish to do error recovery, you could have problems with one part and not another. You give up granularity. 
GH: Willing to give up the granularity in favor of simplicity. AD: Possibilities of error LS: How about everything that can be turned into an http error should be. And have a special section to provide das details. E.g.: client is still going to have to understand das error codes GH, AD: client does need to be there. AD: Using only http error codes reduces complexity - you only need to check one place. Another benefit - you can provide a file-based das server (this was not a use case from the RFCs, just AD's pet idea he envisions as potentially useful). GH: Can't think of DAS/1 clients that did anything meaningful with those das error codes. AD: NCBI entrez server - does lots of extra error support. Don't want to go there with das. TD, LS: DAS error codes can be used to tell client which part of the URL is at fault. Now it will be just '404 not found'. AD: REST API says use the http protocol directly. LS: There are some things in the DAS API that don't translate into http error codes. AD: We can support this with error document. [A] Use HTTP error codes and x-das-error document with code and optional description. 2) Content-type --------------- [A] No objections to using: application/x-das+blah+xml 3) Key/value data ----------------- Three possibilities summarized in Andrew's email. 1) (current spec) using namespace in attrib value. 2) (steve, lincoln) all attribute values are URIs 3) (andrew) Relax-NG based, drop in well-structured XML SC: (clarified proposal #2). For more, see today's post at: http://portal.open-bio.org/pipermail/das2/2005-November/000363.html AD: What's wrong with the Relax-NG based approach? LS: I don't understand it yet. SC: Community lacks experience with Relax-NG in general. TD: Does it let you point to schema fragments for data types? AD: There are ways to define it in the schema, haven't looked at it. LS: This looks great. 
Would propose having a convention that if it's a simple, single-valued key, the value should be encoded in an attribute (value="blah"), not as content of a section (CDATA). Reason: It's more consistent with the rest of the spec, and it's easier to parse. So in the example, genefinder-score is not correctly encoded. AD: That's not in the das: namespace, hence is not under our control. We can use this convention for things in the das namespace. AD: User can put in any xml as long as it's reasonably well-formed. We can define what well-formed is. This is what atom uses. Allows some simple key val data on client as if it were native data. It permits searches without needing to know about complex data. GH: Likes idea of allowing arbitrary xml. SC: Not completely arbitrary since we limit use of das: namespace, and possibly other aspects. LS: So we're going to say we have properties represented as key/val pairs using this syntax. You'll find 'das:' as well as possibly other namespaces. I think that works. What becomes of /property url (ptype)? Does that go away and get replaced by the namespace? AD: Possibly use it for data type (e.g., float). Or we could make it discoverable? LS: Easier to make it part of the spec. TD: If this can work like XML schema, we could have a pointer to an xsi. Is there a way to put a pointer to a schema url? AD: Found this to be useless. Hard coding what is expected is better than having discoverability. TD: With the xsi schema location, you can put multiple schema locations for the das schema, and your extension, separate pointers to both in a single document. AD: Never found dynamically resolved schemas useful for anything LS: In theory they are. Why not? AD: Knowing that something's an int doesn't say what that int is supposed to mean. LS: Right. Let's make sure that the common types of annotation a server would want to return are in the spec from the get go. Anyone that doesn't care about extensions can ignore additional properties. 
No doubt people will make extensions to DAS/2 that are implemented on client and server that are in-house, private extensions that only work in client-server pairs. Should we allow schema fragments to be brought in via xsi? TD: this would be in the top-level element. Or can put it on an enclosing element. AD: Is there a good reason to do it? LS: Let's not seek discoverability. [A] Andrew will flesh out his Relax-NG based property encoding approach. SC: You could put your schema at the url pointed to by 'das:' AD: Don't see a need. I found that many of the DAS/1 schema fragments/documents were invalid. This didn't seem to bother DAS/1 clients and users. LS: In the real world, people don't validate. 5) xlink and ------------------- AD: The official xlink spec is long. Have not fully grokked it. GH: Does anyone else have experience with it? (silence...) Seems like a reason to not go there. AD: Atom uses link to say, "Here's some generic linked out stuff". We could use it to say, "I'm looking for the stylesheet for this thing or the schema for the xml document." GH: We need to draw a line between generic links and specific things. eg. feature ids, all ids are resolvable links, and so could in principle be specified with link tags. AD: Link from feature to versioned source it's a part of. Client can figure out context from url. Use case: DAS user sends email to colleague, 'look at this url for feature X'. The other user enters URL in his das browser, client can identify the das2-versioned source given the feature URL. LS: They would rely on xml:base. Nothing in the current DAS/2 spec says that the xml base is for the versioned source. LS: But it does give you the versioned source. This is absolutely part of the spec. AD: Nothing in the spec that says that features have to be on the same machine as the rest of the data. LS: Why does user want versioned source on the same machine that the feature came from? 
AD: Nothing in the spec says that a feature has to be under 'feature' in the URL. GH: Generalizing the info href element to be more generic, to specify what that link means is fine as long as we don't do this for everything that can be a link. Doc hrefs are fine, not ids. LS: We're not going to demand that people specify links. (Something about giving people enough rope to hang themselves with...) GH: Ids are opaque uris to id the feature. LS: The HTML link tag has been around a long time, and used a total of two times: style sheets, copyright statements. This could have easily been done with a stylesheet tag and copyright tag (without needing a general link tag). [A] Consider the xlink/link tags issue tabled. 6) Source filters ----------------- GH: Use case: DAS/2 client is trying to discover what registry has, query can be the same as for any das server, you can just apply additional filters when dealing with a registry. AP: Client would use tags that a registry server must implement. GH: A non-registry server can implement as well. TD: say filtering is optional in general. AD: I tend to not like optional things. Filtering is required for features. GH: The spec can state the filters that a registry is required to implement on sources query. General DAS/2 servers are not required, but can if they want. What if you send a sources query with filters that it doesn't understand? LS: Return everything GH: Return error AP: Client can filter out what they want GH: It's already important to have search capability in client. Use case: On given genome, show me all gene predictions for this region. You need to go to all servers, which could be many. AD: Can you filter by type of features that can be returned? AP: Can be added. GH: Want to be able to search on ontology term, not just id of the type. AD: Need meta-data server to ask of DAS/2 servers what features do you implement? LS: Does metadata protocol need to be part of das spec, or an additional protocol on top? 
There should be an optional section of DAS/2 that is implemented by metadata servers or registries that allows you to do searches. Shouldn't overload the core server spec. GH: Concerned with the response. It's so close to the same xml, it might as well be the same. Makes it easy for clients to know about both servers and metadata servers. could call it 'sources' or something else. LS: Filtering by feature type, do we need that info that's returned by sources document? GH: No, it's part of the query. LS: Metadata server would have to do a types request. AD: What if there's a mismatch in SOFA version? LS: We're in trouble. AD: Concerned about change in meaning. SL: Not important. LS: Use case: There's a 'restriction site' node in SOFA 1.4 with five terms underneath it. In version 1.5, now there's six terms. A metadata server running off of the old version is using an incomplete node. Metadata engine should always run off the latest version. AP: Registry at Sanger checks every 2 hrs with server. AD: How is this better than having client do it itself? What features do you know with this type and this range? GH: If lots of DAS servers, this will be time intensive AD: Can we wait until there are lots of servers? AP: We have 17. LS: Current paradigm - EBI has many servers that just do one type of feature e.g., there's a server that just does repeat elements. So there are servers that will serve up one or a few feat types. AD: Had not considered that. LS: Happy to have optional filter syntax added to sources request supported by metadata servers. Gregg is right about returning error (unimplemented). Will not change protocol in fundamental way. Just an annex, just optional section supported by metadata servers. GH: Based on Andreas' queries in soap, can we squeeze everything into params on url? filterable? AP: yes AD: optional fields will include species, build#, type, etc. [A] Add optional filter syntax to sources request. Allow unimpl error return. 
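Squeezing the proposed source filters into URL parameters, as Gregg asks about above, could look like the sketch below. This is an editor's illustration only: the endpoint and the parameter names (species, build, type were merely floated on the call) are hypothetical, not spec text.

```python
from urllib.parse import urlencode

# Candidate filter fields mentioned on the call; every name here is a
# placeholder, as is the example server URL.
filters = {"species": "Homo sapiens", "build": "hg16", "type": "repeat"}
url = "http://das.example.org/das2/sources?" + urlencode(filters)
print(url)
```

A server that does not implement filtering would, per the action item, return an "unimplemented" error rather than silently ignoring the parameters.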
7) /regions ----------- LS: In sofa, a feature of type region is root of all other features - everything is a region. Has props - ref sequence it's on, start, strandedness. The reason for region is for retrieving assemblies. SC: Region is also currently the only way to get back a list of available sequence ids without getting all sequence data. The top-level sequence request returns data along with sequence. LS/GH: region could be called 'landmarks' [A] Andrew will work directly with Lincoln on revising region request. 8) Tiled queries ---------------- LS: This doesn't need to be in spec. If client filters features by a range, is there a contract such that server must return exact range he asked for, contained in, or is ok for server to return more? GH: We need to be more strict. LS: Agree. Client should trim it. [A] Tiled queries should not be part of the spec. Other issues ------------ AP: There are still some other issues not addressed in this call. E.g., Not possible to handle situation where protein sequence in a structure varies from genome. Can defer to the next spec discussion conf call. From ed_erwin at affymetrix.com Tue Nov 29 14:30:41 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 11:30:41 -0800 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> Message-ID: <438CAC61.1090104@affymetrix.com> Andrew Dalke wrote: > Ed Erwin: > >> No. The coordinate transformations are often more complicated than >> simple offsets. The coordinate space for features on one contig can >> be 'backwards' with respect to a different contig, and the coordinate >> space for a gene may skip over one or more gaps with respect to the >> genomic sequence. 
> > > The /region entities in the DAS/2 spec are defined as > > (zero or more) > A top-level region on the genome (similar to the "entry points" of > the DAS/1 protocol). > id ? the URI of the sequence ID > length ? length of the sequence > name (optional) ? a human-readable label for use when referring > to the region > doc_href (optional) ? a URL that gives additional information > about this region > > Here is an example > > > I had to go back and look-up the context for this discussion. Here it is: >> [Suzi wrote] >> Third, just think of "reference sequence" as a coordinate system. One >> can have the exact same feature and indicate that: on >> coordinate-system-A this feature starts and ends here, and on >> coordinate-system-B it starts and ends there. Thus a feature's >> coordinates may be given both on a chromosome, and on a contig, and on >> any other coordinate-system that can be derived through a transform >> from these. > > [Andrew wrote] > I believe I understand this. There really is only one reference frame > for the entire genome sequence, for a given assembly, and all other > coordinate systems are a fixed and definite offset of that single > reference frame. I understand this as talking about coordinates in general, not the elements or "pos" attributes in the spec. Suzi specifically mentions chromosomes and contigs; one can definitely be backwards with respect to the other. But top-level regions in an assembly would probably all be chromosomes or all be contigs, rather than a mixture. There is not one single "reference frame" for an assembly: rather there is one coordinate axis for *each* top-level region. If those top-level regions are chromosomes, then there is no relationship between the coordinates on different ones. If those top-level regions are contigs or ESTs (which I believe is allowed by the spec), then positions on one of them can be related to positions on others through various transforms. > This is a very simple definition. 
As far as I can tell it does not > capture the information for, say, skipping. > > How would you represent "the coordinate space for a gene [that skips] > over one or more gaps with respect to the genomic sequence" using the > current DAS/2 object model? > > Or goes backwards? I don't see anything like that. You represent gaps with tag parent-child relationships, and going backwards by specifying "+1" strand on one contig and "-1" strand on the other. The spec does not require a DAS/2 server to know how to perform transformations from one coordinate system to another, but your statement "there really is only one reference frame for the entire genome sequence" is wrong as I understand it. There is one coordinate axis for *each* top-level region. From ed_erwin at affymetrix.com Tue Nov 29 14:36:13 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 11:36:13 -0800 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <438CADAD.8060403@affymetrix.com> Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations. In general I like this better than the original introduction. Thanks for writing it. But I agree with Andreas that the first line is better as: > DAS/2 is a protocol to share biological data. I definitely think of DAS as a protocol first, rather than a data model first.
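Ed's point above -- one coordinate axis per top-level region, with a "-1" strand placement running backwards -- can be made concrete with a small transform. This is an illustrative sketch only, not anything from the spec; the function, its 0-based convention, and the offsets are invented here.

```python
def contig_to_chrom(pos, contig_offset, contig_length, strand):
    """Map a 0-based position on a contig to the chromosome axis.

    On the "+1" strand the contig runs with the chromosome; on "-1"
    it runs backwards, so the contig's base 0 is the *last* chromosome
    base covered by the contig.
    """
    if strand == 1:
        return contig_offset + pos
    elif strand == -1:
        return contig_offset + (contig_length - 1) - pos
    raise ValueError("strand must be +1 or -1")
```

For a 500-base contig placed at chromosome offset 1000 on the "-1" strand, contig base 0 maps to chromosome base 1499 and contig base 499 maps back to 1000 -- the direction flip Ed describes.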
From ed_erwin at affymetrix.com Tue Nov 29 15:16:11 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 12:16:11 -0800 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: References: Message-ID: <438CB70B.4030005@affymetrix.com> Andrew Dalke wrote: > Here are the spec issues I would like to talk about for today's meeting, > culled from the last few weeks of emails and phone calls > > 1) DAS Status Code in headers > > The current spec says > >> X-DAS-Status: XXX status code >> >> The list of status codes is similar, but not identical, to those used >> by DAS/1: >> >> 200 OK, data follows >> 400 Bad namespace >> 401 Bad data source >> 402 Bad data format >> 403 Unknown object ID >> 404 Invalid object ID >> 405 Region coordinate error >> 406 No lock >> 407 Access denied >> 500 Server error >> 501 Unimplemented feature > > > I argued that these are not needed. Some of them are duplicates with > HTTP error codes and those which are not can be covered by an error > code "300" along with an (optional) XML payload. > > The major problem with doing this seems to be in how MS IE handles > certain error codes. While IE is not a target browser, MS software > may use IE as a component for fetching data. From the link Ed dug > up, it looks like this won't be a problem. > I'm not going to argue anymore against moving the X-DAS-Status code up into the HTTP status code. I'm willing to try it and see if it works. But I want to re-iterate why I'm suspicious of this. I have experience trying this in two separate projects and it failed both times. (Still, I think those problems won't occur this time.) 1. I tried this on a project internally at Affymetrix. It didn't work in this case because the client code was (indirectly) using MS IE code, and IE was throwing away the HTTP content when the header had certain error codes. 
This doesn't bother me much now, though, because I doubt many DAS clients will be written that interface with IE, and because I now know that you can force IE to keep the HTTP content as long as you make sure the content is always at least 512 characters long. So if we ever run into this problem, there is an easy work-around. 2. I tried putting the X-DAS-Status codes into the HTTP status code in our internal DAS/1 server about a year ago. (In DAS/1 they are not supposed to be in the HTTP status codes, but I misunderstood the spec.) I ran into problems when I tried that, and that is the main reason I objected to trying that in DAS/2. Unfortunately, I can't remember what those problems were.... The problem might have been: a) the IGB client didn't understand the status codes because they weren't in the expected place. If this is the case, then the problem was benign, because we are now writing new code to support the new spec, so we can make IGB understand whatever we want. b) I use Apache's ".htaccess" files to do some URL re-direction on our DAS/1 client machine. see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html#RewriteRule It is possible that this was causing the original HTTP status code to be replaced with a different one. I'm currently using the "proxy" form of redirect, which seems to keep the status code intact. Earlier I was using the "redirect" form of redirect, which may change the status code to 302. ----- Based on my experience with apache re-direction, I have a vague fear that we may run into cases where firewalls, or html cachers and optimizers may mangle the HTTP status codes for some users at some point. But since I have no confirmed evidence that that will happen, I have no objection to going ahead and trying to use HTTP status codes. 
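The 512-character work-around Ed mentions -- forcing IE-based clients to keep error-response content -- could look like this on the server side. A hedged sketch: the helper name and the XML-comment padding are assumptions here, not anything specified by DAS.

```python
def pad_error_body(body, min_len=512):
    """Pad an error payload so IE-based clients keep the content.

    Older IE components discard the body of responses carrying certain
    HTTP error codes when the body is shorter than ~512 characters.
    Appending the padding as an XML comment crosses that threshold
    while keeping the document well-formed.
    """
    if len(body) >= min_len:
        return body
    return body + "<!--" + " " * (min_len - len(body)) + "-->"
```

A server would apply this only to error responses; well-formed XML parsers simply ignore the trailing comment.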
From Steve_Chervitz at affymetrix.com Tue Nov 29 15:33:29 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 29 Nov 2005 12:33:29 -0800 Subject: [DAS2] DAS intro In-Reply-To: <438CADAD.8060403@affymetrix.com> Message-ID: Ed Erwin wrote: > Andrew Dalke wrote: >> The front of the DAS doc starts >> >> DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: >> >> That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. >> >> How about this instead, as an overview/introduction. >> >> ====== >> >> DAS/2 describes a data model for genome annotations. > > In general I like this better than the original introduction. Thanks > for writing it. > > But I agree with Andreas that the first line is better as: > >> DAS/2 is a protocol to share biological data. > > I definitely think of DAS as a protocol first, rather than a data model > first. I concur. The main aim of DAS is to define an API to allow clients to query servers in order to retrieve bioinformatics data objects in defined response formats. Of course, the writeback facility of DAS/2 will make DAS more of a two-way street so we could say 'sharing and editing', but I think retrieval is more fundamental and probably accounts for the majority of uses. How about this for the first line: DAS is a protocol for sharing biological data. No need to limit it to version 2. This applies to all versions. Use 'DAS/2' when talking about new features in this version, such as writeback. Steve From dalke at dalkescientific.com Tue Nov 29 17:17:02 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 23:17:02 +0100 Subject: [DAS2] DAS intro In-Reply-To: References: Message-ID: Steve: > How about this for the first line: > > DAS is a protocol for sharing biological data. > > No need to limit it to version 2. This applies to all versions. Use > 'DAS/2' > when talking about new features in this version, such as writeback. Done. 
Made a few changes to the CVS intro text to reduce the use of "DAS/2". So that email I just sent is out of date. :) Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Nov 29 19:02:07 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 01:02:07 +0100 Subject: What are regions for? (was Re: [DAS2] DAS intro) In-Reply-To: <438CAC61.1090104@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> Message-ID: <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> Ed: > I understand this as talking about coordinates in general, not the > elements or "pos" attributes in the spec. Suzi specifically > mentions chromosomes and contigs; one can definitely be backwards with > respect to the other. But top-level regions in an assembly would > probably all be chromosomes or all be contigs, rather than a mixture. I'm trying to figure out when people use the /region. In my way of understanding things there is the genomic sequence. That consists of a set of chromosomes, each with a list of bases. A chromosome is assembled from parts. One of these parts is called a 'contig'. I thought I knew what it was, but according to http://staden.sourceforge.net/contig.html there are several meanings. What I understand is that a 'contig' is a sequenced chunk of DNA which has overlaps with other contigs and when combined can be used to deduce the entire sequence (excepting regions of repeats and other ambiguities). The best such deduction is the golden path. For DAS/2 we assume sequenced genomes. When will people use top-level regions which are not chromosomes? Chromosome top-level regions are identical to the /sequence, except for the ability to get the assembly and the sequence data directly. 
Is that correct? The spec allows links from a feature into several different regions. This suggests to me that sometimes there will be regions which are a mixture of contigs and chromosomes. Else why support that ability? There is nothing in the spec (that I know of) which allows any hierarchy to the regions - all regions are top-level. Is this correct? > If those top-level regions are chromosomes, then there is no > relationship between the coordinates on different ones. While I understand that, I did get it wrong when I wrote it down. In my head I was thinking "each base has a 1-to-1 mapping to a number, and if two bases are next to each other then the corresponding two numbers are next to each other." This is invalid because the converse is not true - if one number is the end of a chromosome and the other is the start of the next then the two bases are not next to each other. > If those top-level regions are contigs or ESTs (which I believe is > allowed by the spec), then positions on one of them can be related to > positions on others through various transforms. Those are allowed. Will people use them? What advantage is there to having these be a special category instead of a feature? > You represent gaps with tag parent-child relationships, and > going backwards by specifying "+1" strand on one contig and "-1" > strand on the other. Something like this? (Yes, this is hand-wavy) Here's a (and note, this is NOT a ) with two subfeatures, one on the forward strand and one on the reverse. This I understand just fine. I don't understand why the positions are given in /region spec instead of either: - directly to /sequence space, eg ... -or- - point to a feature of type 'region' which provides the region coordinates ... (Again, hand-wavy. I think best looking at data and code.) 
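Since the discussion keeps asking for data and code, here is one hand-wavy way to spell out the parent/child, mixed-strand case in plain data structures. All field names, IDs, and coordinates are invented for illustration; they are not the spec's XML model.

```python
def make_feature(fid, region=None, start=None, end=None, strand=None, parts=()):
    """Build a minimal feature record; parent features carry child parts."""
    return {"id": fid, "region": region, "start": start,
            "end": end, "strand": strand, "parts": list(parts)}

# A gene whose two parts sit on different contigs, one on the forward
# strand and one on the reverse -- the parent/child recipe Ed describes
# for representing gaps and direction flips.
gene = make_feature(
    "feature/gene-1",
    parts=[
        make_feature("feature/gene-1/exon-a", "region/contig-5", 100, 400, strand=+1),
        make_feature("feature/gene-1/exon-b", "region/contig-6", 0, 250, strand=-1),
    ],
)
```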
> The spec does not require a DAS/2 server to know how to perform > transformations from one coordinate system to another, but your > statement "there really is only one reference frame for the entire > genome sequence" is wrong as I understand it. There is one coordinate > axis for *each* top-level region. Understood. My questions, to summarize, are: - why do we need a /region space when we can 1. point directly to a sequence (for chromosome regions) and/or 2. point to a "contig" or "assembly" or "region" feature type (for other regions) - When would someone have regions which have more than one of contigs, ESTs and chromosomes? Especially given that this is the genome spec, so chromosome-level info is known, at least enough for a rough assembly. In other words, what are regions for? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Nov 29 19:26:41 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 01:26:41 +0100 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: <438CB70B.4030005@affymetrix.com> References: <438CB70B.4030005@affymetrix.com> Message-ID: <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> Ed: > I'm not going to argue anymore against moving the X-DAS-Status code up > into the HTTP status code. I'm willing to try it and see if it works. > > But I want to re-iterate why I'm suspicious of this. I have > experience trying this in two separate projects and it failed both > times. (Still, I think those problems won't occur this time.) > > 1. I tried this on a project internally at Affymetrix. It didn't > work in this case because the client code was (indirectly) using MS IE > code, and IE was throwing away the HTTP content when the header had > certain error codes. This was a two-part problem: - identifying in client code that a given error occurred - extracting the payload when the error occurred As far as I can tell, the problem you are concerned about is the second part.
Personally I don't want an application/x-das-error+xml return document. Several others do. Thing is, when Gregg asked if anyone used the DAS/1 error codes for anything other than "there was an error", no one said anything. I could hear the proverbial crickets chirping (or in my case, snow falling). I am convinced that the actual error content will be server implementation specific and as such non-portable across clients. I will flesh out a document type for this then ask Thomas, Lincoln etc. to provide a list of defined error code extensions that their servers will return. It's likely they'll not be able to agree on it, because their code will do different styles of error checking. I'll also dodge the whole mess by saying that the error document payload is optional, so clients are highly unlikely to read it for anything meaningful. (Except perhaps some text shunted to the user.) That makes more work in the spec implementation for something I can almost guarantee will be ignored by DAS clients. > b) I use Apache's ".htaccess" files to do some URL re-direction on our > DAS/1 client machine. > > see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html#RewriteRule > > It is possible that this was causing the original HTTP status code to > be replaced with a different one. > > I'm currently using the "proxy" form of redirect, which seems to keep > the status code intact. Earlier I was using the "redirect" form of > redirect, which may change the status code to 302. I don't understand how the old one would be a problem in the web clients I'm familiar with. It should be: send request to server get 302 "moved temporarily" response along with new URL repeat until no redirect or reached max redirect limit request new URL get headers/payload back The redirects shouldn't affect the real response code, which would be the last in the chain. If it did, it would also affect 404 and 200 responses. 
> Based on my experience with apache re-direction, I have a vague fear > that we may run into cases where firewalls, or html cachers and > optimizers may mangle the HTTP status codes for some users at some > point. But since I have no confirmed evidence that that will happen, > I have no objection to going ahead and trying to use HTTP status > codes. I know that fear. I've had intermediate web caches misconfigured which cached any HTML page for an hour, making me unable to edit my web site and see the changes. That was with a normal 200 response code, so likely misconfigured caches will affect other response codes. But what's there to do about that? What's the error rate? We're using normal HTTP and if a web cache breaks for us - we aren't doing anything fancy; no content-negotiation, no 'If-Modified-Since', etc - then it will break for anyone doing HTTP. That's anyone exchanging HTML, sending RSS, etc. Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Tue Nov 29 19:34:11 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 16:34:11 -0800 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> References: <438CB70B.4030005@affymetrix.com> <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> Message-ID: <438CF383.5050604@affymetrix.com> >> I'm currently using the "proxy" form of redirect, which seems to keep >> the status code intact. Earlier I was using the "redirect" form of >> redirect, which may change the status code to 302. > > > I don't understand how the old one would be a problem in the > web clients I'm familiar with. It should be: > > send request to server > get 302 "moved temporarily" response along with new URL > repeat until no redirect or reached max redirect limit > request new URL > get headers/payload back Unlike modern web browsers, IGB isn't smart enough to do that. Maybe someday it will need to be, but it isn't there yet.
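The redirect loop Andrew sketches is small enough to write out. Here it is with the fetch step injected as a callable so no network is needed -- a sketch, not the DAS client code; real clients would normally lean on their HTTP library's built-in redirect handling.

```python
def follow_redirects(url, fetch, max_redirects=5):
    """Follow a redirect chain and return the final (status, payload).

    `fetch(url)` is any callable returning (status, value); for the
    redirect statuses the value is the new URL (the Location header),
    otherwise it is the response body. The final status in the chain is
    the one that matters -- intermediate 302s do not replace it.
    """
    for _ in range(max_redirects + 1):
        status, value = fetch(url)
        if status in (301, 302, 307):
            url = value  # go around again with the new URL
            continue
        return status, value
    raise RuntimeError("too many redirects")
```

This mirrors Andrew's point: a 302 along the way never becomes the response code the client finally sees.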
From dalke at dalkescientific.com Tue Nov 29 17:13:49 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 23:13:49 +0100 Subject: [DAS2] DAS intro In-Reply-To: <438CADAD.8060403@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <438CADAD.8060403@affymetrix.com> Message-ID: <24b1a9183d9f344398f80839f4c71b6e@dalkescientific.com> Ed: > I definitely think of DAS as a protocol first, rather than a data > model first. Mmm. I see you all's point. All protocols express a data model, though neither side necessarily must implement it that way. Here's the updated text. This is what I just committed to CVS. Note that it's missing mention of the '/region' section. ===== DAS/2 is a protocol for sharing biological data. This version of the specification describes features located on the genomic sequence. Future extensions will add support for sharing annotations of expression data, protein sequences, 3D structures, and ontologies. A DAS/2 annotation server provides feature information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Annotations are located on the genomic sequence with a start and end position. The range may be specified multiple times if there are alternate reference frames. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type. Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters. DAS/2 is implemented using a ReST architecture.
Each entity (also called a document or object) has a name, which is a URL. Fetching the URL gets information about the entity. The DAS-specific entities are all XML documents. Other entities contain data types with an existing and frequently used file format. Where possible, a DAS server returns data using existing formats. In some cases a server may describe how to fetch a given entity in several different formats. ===== Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Tue Nov 29 19:37:07 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 16:37:07 -0800 Subject: What are regions for? (was Re: [DAS2] DAS intro) In-Reply-To: <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> Message-ID: <438CF433.1020707@affymetrix.com> Andrew Dalke wrote: > My questions, to summarize, are: > - why do we need a /region space when we can > 1. point directly to a sequence (for chromosome regions) and/or > 2. point to a "contig" or "assembly" or "region" feature type > (for other regions) The way I understand it, that is what region is for: to point directly to a location on a sequence and/or contig. > - When would someone have regions which have more than one of > contigs, ESTs and chromosomes? Especially given that this > is the genome spec, so chromosome-level info is known, at > least enough for a rough assembly. I think they do it mainly 1) when the assembly is incomplete or 2) to preserve annotations from the past when the assembly was incomplete. There could be more reasons. 
Here is an example of a DAS/1 server that contains both chromosomes and "other" short sequences as entry points: http://servlet.sanger.ac.uk:8080/das/ensembl_Homo_sapiens_core_28_35a/entry_points See here for some more genomes that are treated similarly: http://servlet.sanger.ac.uk:8080/das > In other words, what are regions for? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Tue Nov 29 20:26:29 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 02:26:29 +0100 Subject: What is /region for? (was Re: [DAS2] DAS intro) In-Reply-To: <438CF433.1020707@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> <438CF433.1020707@affymetrix.com> Message-ID: <6fd85d539c25833e9b6f7f41b3429231@dalkescientific.com> (Changed the Subject line slightly to be a bit clearer. I hope.) On Nov 30, 2005, at 1:37 AM, Ed Erwin wrote: > Andrew Dalke wrote: >> My questions, to summarize, are: >> - why do we need a /region space when we can >> 1. point directly to a sequence (for chromosome regions) and/or >> 2. point to a "contig" or "assembly" or "region" feature type >> (for other regions) > > The way I understand it, that is what region is for: to point directly > to a location on a sequence and/or contig. Am I not asking the question correctly? Am I missing the obvious? Been known to happen before! I know what regions are. I don't know why they are in a distinct /region subtree. I'm happy - enthusiastic - ecstatic - that there are different ways to identify certain regions. 
I fully accept that they are in use every day and widely understood. Why are they special enough to get their own /region subtree? Why can't they be features? Here's my proposal. Leaf node parts of a always point to a /sequence and optionally point to one or more /feature elements which are of type "region". (Or some other part of SOFA - perhaps assembly-component?) Want to know where the feature is on a given "region" feature? Then look up the region to find its /sequence location. Use these two /sequence locations to get the location in the region. Both /sequence locations are in the same "coordinate space" of "identifier + start/end offset" BTW, if regions are a type of features then you can search for them. Eg, search for all top-level regions in the range 100000 to 2000000. Can't do that with the /region container. Can if the region data is in the /feature container. >> - When would someone have regions which have more than one of >> contigs, ESTs and chromosomes? Especially given that this >> is the genome spec, so chromosome-level info is known, at >> least enough for a rough assembly. > > I think they do it mainly 1) when the assembly is incomplete or 2) to > preserve annotations from the past when the assembly was incomplete. > There could be more reasons. > > Here is an example of a DAS/1 server that contains both chromosomes > and "other" short sequences as entry points: Okay, I'm fine with that. Thanks. Is a goal of DAS to support incomplete genomes? Note, btw, that the /sequence subtree does not need to contain only chromosomes. From the spec seqid is the sequence ID, and can correspond to an assembled chromosome, a contig, a clone, or any other accessionable chunk of sequence. Hence for incomplete genomes, put the sequence data as best you can under /sequence and have the /feature subtree point to it. >> In other words, what are regions for? Still don't understand the need for a /region namespace.
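Andrew's "use these two /sequence locations to get the location in the region" reduces to interval subtraction. A sketch, assuming both placements are 0-based, end-exclusive, forward-strand intervals on the same sequence axis -- those conventions are assumptions here, not taken from the spec:

```python
def locate_in_region(feat_start, feat_end, region_start, region_end):
    """Re-express a feature's /sequence coordinates as region-local ones.

    Both intervals live on the same sequence coordinate axis; the
    feature must fall inside the region for the subtraction to make sense.
    """
    if not (region_start <= feat_start and feat_end <= region_end):
        raise ValueError("feature is not contained in the region")
    return feat_start - region_start, feat_end - region_start
```

So a feature at sequence positions 1100-1200, inside a region placed at 1000-2000, sits at 100-200 in region-local coordinates.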
Repeat: I understand regions, I just don't see why they go in their own subtree and aren't part of some other data chunk. Please, someone sketch out some example with hand-waving XML that shows how having a /region is the appropriate solution. That's what I'm worried about now - the representation in XML. Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Tue Nov 29 21:08:47 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Tue, 29 Nov 2005 18:08:47 -0800 Subject: [DAS2] mtg topics for Nov 28 Message-ID: Actually I think by default the java networking library that IGB uses follows most redirections automatically without IGB having to worry about it. I'm not familiar with what different forms of redirection might do to the status codes, but I expect that as long as the redirection is successful the code IGB would actually see would be 200 OK. IGB does have a ways to go to properly respond to all possible HTTP status codes though... gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Ed Erwin > Sent: Tuesday, November 29, 2005 4:34 PM > To: Andrew Dalke > Cc: DAS/2 > Subject: Re: [DAS2] mtg topics for Nov 28 > > > >> I'm currently using the "proxy" form of redirect, which seems to keep > >> the status code intact. Earlier I was using the "redirect" form of > >> redirect, which may change the status code to 302. > > > > > > I don't understand how the old one would be a problem in the > > web clients I'm familiar with. It should be: > > > > send request to server > > get 302 "moved temporarily" response along with new URL > > repeat until no redirect or reached max redirect limit > > request new URL > > get headers/payload back > > Unlike modern web browsers, IGB isn't smart enough to do that. Maybe > someday it will need to be, but it isn't there yet. 
> > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Gregg_Helt at affymetrix.com Tue Nov 29 21:17:24 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Tue, 29 Nov 2005 18:17:24 -0800 Subject: [DAS2] mtg topics for Nov 28 Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Ed Erwin > Sent: Tuesday, November 29, 2005 12:16 PM > To: Andrew Dalke > Cc: DAS/2 > Subject: Re: [DAS2] mtg topics for Nov 28 ... > 2. I tried putting the X-DAS-Status codes into the HTTP status code in > our internal DAS/1 server about a year ago. (In DAS/1 they are not > supposed to be in the HTTP status codes, but I misunderstood the spec.) > I ran into problems when I tried that, and that is the main reason I > objected to trying that in DAS/2. > > Unfortunately, I can't remember what those problems were.... > > The problem might have been: > a) the IGB client didn't understand the status codes because they > weren't in the expected place. > > If this is the case, then the problem was benign, because we are now > writing new code to support the new spec, so we can make IGB understand > whatever we want. I'm pretty sure this was the problem (IGB didn't know where to find the status codes). gregg
From Steve_Chervitz at affymetrix.com Fri Nov 4 20:32:22 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 04 Nov 2005 12:32:22 -0800 Subject: [DAS2] Spec issues In-Reply-To: <200510270941.30528.lstein@cshl.edu> Message-ID: As Gregg noted in this week's DAS/2 meeting, xml:base and XML namespace (xmlns) are complementary technologies: * xml:base is for resolving relative URLs occurring within attribute values or CDATA elements * xmlns is for resolving names of attributes and elements.
So bearing this in mind, here's my take: On Thursday 27 October 2005, Lincoln Stein wrote: > > On Wednesday 26 October 2005 07:29 pm, Chervitz, Steve wrote: > > > > > > > > Next issue: Feature properties example (only showing relevant attributes): > > > > Description: Properties are typed using the ptype attribute. The value of > > the property may be indicated by a URL given by the href attribute, or may > > be given inline as the CDATA content of the section. > > > > > > > type="type/curated_exon"> > > 29 > > 2 > > > href="/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > > > > > So in contrast to the TYPE properties which are restricted to being simple > > string-based key:value pairs, FEATURE properties can be more complex, which > > seems reasonable, given the wild world of features. We might consider using > > 'key' rather than 'ptype' for FEATURE properties, for consistency with TYPE > > prop elements (however, read on). > > I'm not so happy with "key" since it is nondescript. Originally this was > "type" but the word collided with feature type. > > I am getting uncomfortable with the dichotomy we've (I've?) created between > XML base keys/properties and namespace-based keys/properties. It seems nasty > to have the ptype attribute be either a relative URI > (property/genefinder-score), or a controlled vocabulary member (das:phase). > Is there any reason we shouldn't choose one or the other? > > For example, does this work? > > xmlns:dasprop="http://www.biodas.org/ns/das/genome/2.00/properties" > xmlns:type="http://www.wormbase.org/das/genome/volvox/1/type" > xmlns:id="http://www.wormbase.org/das/genome/volvox/1/feature"> > xmlns:prop="http://www.wormbase.org/das/genome/volvox/1/property"> > das:type="type:curated_exon"> > 29 > 2 > das:href="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" /> > > > This looks so much cleaner to me. 
Here's a new version of this example using xml:base, a default xmlns, and a special attribute to define the URL for the controlled vocabulary of DAS property keys. I'm also using xlink for the href: 29 2 > Cc: Steve Chervitz > Subject: Re: New problem with content-type header in DAS/2 server responses! > > Looks like the cache server. FYI, I have updated the server to use all > "text/xml" Content-Type for all xml response types. This was approved by > Lincoln so that web browsers could be pointed at the das server and "just > work". I thought these changes had already made their way into the spec, > but apparently not. > > The table below summarizes what the server should be giving back. The > left column shows the command and format request, and the right side shows > the response Content-Type. > > 'das/das2xml' => 'text/xml', > 'domain/das2xml' => 'text/xml', > 'domain/compact' => 'text/plain', > 'feature/das2xml' => 'text/xml', > 'feature/chain' => 'text/plain', #LOOK > 'property/das2xml' => 'text/xml', > 'region/das2xml' => 'text/xml', > 'region/compact' => 'text/plain', > 'sequence/das2xml' => 'text/plain', #LOOK > 'sequence/fasta' => 'text/plain', > 'source/das2xml' => 'text/xml', > 'source/compact' => 'text/plain', > 'type/das2xml' => 'text/xml', > 'type/compact' => 'text/plain', > 'type/obo' => 'text/plain', > 'type/rdf' => 'text/xml', > 'versionedsource/das2xml' => 'text/xml', > > As you can see, the text/plain response to the /feature command is NOT > being given by the server, but somehow being mangled by the cache. Is > this going to severely impact your demo? If so I can disable the cache > module. It will be slow though. An alternative to the cache would be to > use our squid proxy. Brian can probably set you up to use it very > quickly. > > Let me know what needs to be done ASAP.
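Allen's table maps (command, format) request pairs to response Content-Types. A minimal sketch of that lookup in Python (values copied from the table above; the function name and the text/plain fallback are illustrative assumptions, not the actual server code):

```python
# Command/format -> Content-Type mapping from Allen's table.
# Entries he flagged with "#LOOK" are marked below.
RESPONSE_CONTENT_TYPES = {
    ("das", "das2xml"): "text/xml",
    ("domain", "das2xml"): "text/xml",
    ("domain", "compact"): "text/plain",
    ("feature", "das2xml"): "text/xml",
    ("feature", "chain"): "text/plain",      # LOOK
    ("property", "das2xml"): "text/xml",
    ("region", "das2xml"): "text/xml",
    ("region", "compact"): "text/plain",
    ("sequence", "das2xml"): "text/plain",   # LOOK
    ("sequence", "fasta"): "text/plain",
    ("source", "das2xml"): "text/xml",
    ("source", "compact"): "text/plain",
    ("type", "das2xml"): "text/xml",
    ("type", "compact"): "text/plain",
    ("type", "obo"): "text/plain",
    ("type", "rdf"): "text/xml",
    ("versionedsource", "das2xml"): "text/xml",
}

def content_type_for(command, fmt):
    """Return the Content-Type the server should emit for a request.

    The text/plain fallback for unknown pairs is an assumption for
    illustration; the table does not specify one.
    """
    return RESPONSE_CONTENT_TYPES.get((command, fmt), "text/plain")

print(content_type_for("feature", "das2xml"))  # -> text/xml
```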
> > -Allen > > > On Fri, 28 Oct 2005, Helt,Gregg wrote: > >> I just tried accessing the biopackages DAS/2 server from IGB, with this >> query: >> >> http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr21/26027736:26068042;type=SO:mRNA >> >> and I'm getting back a message where the XML looks fine but here are the >> headers: >> >> HTTP/1.1 200 OK >> Date: Sat, 29 Oct 2005 05:49:46 GMT >> Server: Apache/2.0.51 (Fedora) >> X-DAS-Status: 200 >> Warning: 113 Heuristic expiration >> Content-Type: text/plain; charset=UTF-8 >> Age: 259582 >> Content-Length: 6004 >> Keep-Alive: timeout=15, max=100 >> Connection: Keep-Alive >> >> But according to the spec the content type header needs to be: >> Content-Type: text/x-das-features+xml >> I'm using this in the IGB DAS/2 client to parse responses based on the >> content type. With "text/plain; charset=UTF-8" IGB doesn't know what >> parser to use and gives up. So right now I can't visualize annotations >> from the biopackages server. I'm pretty sure the server was setting the >> content-type header correctly on Wednesday -- did anything change since >> then that could be causing this? Could the server-side cache be doing >> this for some reason? >> >> Thanks, >> Gregg >> >> From dalke at dalkescientific.com Wed Nov 9 00:27:42 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 9 Nov 2005 01:27:42 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: My apologies for not tracking what's been going on in the last few months. I'm back now and have time for the next few months to work on things. So I'll start with this exchange. I can't find the discussion in the mailing list history. Why the decision to use "text/xml" for all xml responses? I read that it is so "web browsers can 'just work'". What are they supposed to do? Display the XML as some sort of tree structure? Is that the only thing?
One thing Allen and I talked about, and he tested, was the ability to insert a stylesheet declaration in the XML. Is this part of the reason to switch to using "text/xml"? Is there anything I'm missing? Since it looks like I'm going to be more in charge of the spec development, I would like to start collecting use cases and recording these sorts of decisions. I think having different content-types is an important feature. For example, it lets a DAS browser figure out what it's looking at before doing any parsing. Here's my use case. I want someone to send an email to someone else along the lines of "What do you think about http://blah.blah/das/genome/blah/blah" with the URL of the object included in the email. Paste that into a DAS browser and it should be able to figure out that this is a sequence, a feature, a whatever. With the old content-types there was enough information to do that right away. With this new one a DAS browser needs to parse the XML to figure out what's in it. Autodetection of XML formats? I don't want to go there. That's also the reason for Gregg's opposition. You (Allen) and Lincoln, on the other hand, want that user to be able to go to a web browser and paste the URL in, to get a basic idea of what's there. I think that's also important. I think there are other solutions. One is "if the server sees a web browser then return the XML data streams as 'text/xml'". For example: if "Mozilla" in headers["User-Agent"]: ... this is IE, Mozilla, Firefox, and a few others ... That catches most of the browsers anyone here cares about. As another solution, look at the "Accept" header sent by the browser. Here's what Firefox sends: Accept: text/xml,application/xml,application/xhtml+xml,text/html; q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 Here's Safari and "links" (a text browser): Accept: */* Another rule then might be if asking_for_xml_format and "*/*" in headers["Accept"]: ... return it as "text/xml" ...
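The two sniffing rules sketched above (User-Agent matching and Accept-header inspection) can be combined into one runnable function. This is an illustrative sketch only, not part of any spec or server; the header names are standard HTTP, but the function and its fallback order are assumptions:

```python
def negotiate_content_type(headers, das_content_type):
    """Pick a Content-Type for an XML response, given request headers.

    headers: dict of HTTP request header names to values.
    das_content_type: the DAS-specific type, e.g. 'text/x-das-features+xml'.
    """
    user_agent = headers.get("User-Agent", "")
    accept = headers.get("Accept", "")
    # Rule 1: "Mozilla" in User-Agent catches IE, Mozilla, Firefox,
    # and a few others -- serve plain text/xml so the browser renders it.
    if "Mozilla" in user_agent:
        return "text/xml"
    # Rule 2: the client didn't ask for the DAS type but accepts
    # anything ("*/*", as Safari and links send) -- fall back to text/xml.
    if das_content_type not in accept and "*/*" in accept:
        return "text/xml"
    # Otherwise the client knows the DAS type; return it unchanged.
    return das_content_type

# A browser-like request falls back to text/xml:
browser = {"User-Agent": "Mozilla/5.0", "Accept": "text/html,*/*;q=0.5"}
print(negotiate_content_type(browser, "text/x-das-features+xml"))
# -> text/xml

# A DAS client asking for the DAS type keeps it:
das_client = {"User-Agent": "IGB", "Accept": "text/x-das-features+xml"}
print(negotiate_content_type(das_client, "text/x-das-features+xml"))
# -> text/x-das-features+xml
```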
Though a better version is to make sure the client doesn't know about the expected content type: if asking_for_xml_format: return_content_type = ... whatever is appropriate ... if (return_content_type not in headers["Accept"] and "*/*" in headers["Accept"]): return_content_type = "text/xml" .... optionally insert style sheet .... Another solution is to send a "what kind of DAS object are you?" request to the URL (eg, tack on a ? query or tell the server that the client will "Accept: application/x-das-autodiscovery"). I think that's clumsy, but I mention it as another way to support both DAS client app and human browser requests of the same URL. >> From: Allen Day >> Looks like the cache server. FYI, I have updated the server to use >> all >> "text/xml" Content-Type for all xml response types. This was >> approved by >> Lincoln so that web browsers could be pointed at the das server and >> "just >> work". I thought these changes had already made their way into the >> spec, >> but apparently not. >> On Fri, 28 Oct 2005, Helt,Gregg wrote: >>> But according to the spec the content type header needs to be: >>> Content-Type: text/x-das-features+xml >>> I'm using this in the IGB DAS/2 client to parse responses based on >>> the >>> content type. With "text/plain; charset=UTF-8" IGB doesn't know what >>> parser to use and gives up. So right now I can't visualize >>> annotations >>> from the biopackages server. I'm pretty sure the server was setting >>> the >>> content-type header correctly on Wednesday -- did anything change >>> since >>> then that could be causing this? Could the server-side cache be >>> doing >>> this for some reason? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Nov 9 00:49:27 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 9 Nov 2005 01:49:27 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! 
In-Reply-To: References: Message-ID: <7e9e19f6885240c668ac677b6ea98ff0@dalkescientific.com> P.S. Gregg mentioned one need for wanting more selective content-types. Here's another. I expect most of the XML data we return will change. We may add an element field or change the meaning of an element. When that happens, how does a client know that a "text/xml" is for one version or another of a given document type? I expect that will be done by returning something like Content-Type: text/das2xml; version=2 This, btw, suggests a third solution to the problem of letting DAS/2 and web browser clients both point to the same object - use Content-Type: text/xml; das-type=das2xml But that's ugly. A 4th is to go back to the "add a das-content-type header" solution from DAS/1. I don't want that. Note, btw, that if a given URL can return different MIME types for the same request then it needs a "Vary: Accept" in the response headers so caching works correctly. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Wed Nov 9 01:58:07 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Tue, 08 Nov 2005 17:58:07 -0800 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: Message-ID: Andrew, Andrew Dalke wrote on 8 Nov 2005: > My apologies for not tracking what's been going on in the last few > months. I'm back now and have time for the next few months to work > on things. Great to have you back. I have been focusing on the spec for the past several weeks but would be glad to have you take the lead on it. We've been making the retrieval spec a priority and should really focus on getting it nailed down as soon as possible to allow others to start implementing clients and servers against it and providing feedback. We haven't talked about a freeze or release date for it, but maybe we should. I started going through the open bugs in bugzilla, but only resolved one (#1796).
While going through and cleaning up the retrieval spec, I ran into other issues that were not in bugzilla that seemed important. One was this content-type issue that you address here. I raised some other issues regarding types and feature properties etc. a couple of weeks ago that I'd like you to chime in on: http://portal.open-bio.org/pipermail/das2/2005-October/000271.html The latest message on this thread is: http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > So I'll start with this exchange. I can't find the discussion in the > mailing list history. > > Why the decision to use "text/xml" for all xml responses? I read that > it is so "web browsers can 'just work'". > > What are they supposed to do? Display the XML as some sort of tree > structure? Is that the only thing? > > One thing Allen and I talked about, and he tested, was the ability to > insert a stylesheet declaration in the XML. Is this part of the > reason to switch to using "text/xml"? Here's the relevant thread for reference: http://portal.open-bio.org/pipermail/das2/2005-July/000227.html In your other email on this thread, you said: > This, btw, suggests a third solution to the problem of letting DAS/2 > and web browser clients both point to the same object - use > > Content-Type: text/xml; das-type=das2xml > > But that's ugly. This seems like a good solution (and not too ugly IMHO). The das-type value could be more detailed (e.g., x-das-features+xml). However, I recall that there were possible problems with this syntax, but can't remember the details at the moment. Whatever solution we decide on, we should strive for simplicity. If we ask too much of servers and clients, that will be an impediment to implementation and maintenance. Steve From allenday at ucla.edu Wed Nov 9 02:21:51 2005 From: allenday at ucla.edu (Allen Day) Date: Tue, 8 Nov 2005 18:21:51 -0800 (PST) Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: References: Message-ID: To be even more concise, there are two use cases being presented here: 1) DAS/2 content should be viewable in a web browser, and doing so requires an HTTP Content-Type header to have value 'text/xml'. 2) DAS/2 content should be viewable in a specialized DAS/2 browser, and be able to rely on HTTP headers to determine visualization mode, as XML/DTD/Schema sniffing is undesirable. The solution proposed in the referenced thread, or perhaps only on a conference call, is to use the Content-Type header to address (1), providing information to web browsers, as they are less flexible than a specialized DAS/2 client. (2) is addressed using a DAS/2-specific X-Das-Content-Type header, e.g. ================== % GET -e 'http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr22/1000000:2000000;type=SO:mRNA' | head -100 Connection: close Date: Wed, 09 Nov 2005 02:15:24 GMT Server: Apache/2.0.51 (Fedora) Content-Type: text/xml Expires: Thu, 09 Nov 2006 02:15:24 GMT Client-Date: Wed, 09 Nov 2005 02:19:16 GMT Client-Peer: 164.67.183.101:80 Client-Response-Num: 1 Client-Transfer-Encoding: chunked X-DAS-Content-Type: text/x-das-feature+xml X-DAS-Server: GMOD/0.0 X-DAS-Status: 200 X-DAS-Version: DAS/2.0 ================== This also has the added benefit of already being implemented for a few months. Are there objections to this solution? -Allen On Wed, 9 Nov 2005, Andrew Dalke wrote: > My apologies for not tracking what's been going on in the last few > months. I'm back now and have time for the next few months to work > on things. > > So I'll start with this exchange. I can't find the discussion in the > mailing list history. > > Why the decision to use "text/xml" for all xml responses? I read that > it is so "web browsers can 'just work'". > > What are they supposed to do? Display the XML as some sort of tree > structure? Is that the only thing?
> > One thing Allen and I talked about, and he tested, was the ability to > insert a stylesheet declaration in the XML. Is this part of the > reason to switch to using "text/xml"? > > Is there anything I'm missing? > > Since it looks like I'm going to be more in charge of the spec > development, > I would like to start collecting use cases and recording these sorts of > decisions. > > I think having different content-types is an important feature. For > example, it lets a DAS browser figure out what it's looking at before > doing any parsing. Here's my use case. > > I want someone to send an email to someone else along the lines of > "What do you think about http://blah.blah/das/genome/blah/blah" > with the URL of the object included in the email. > > Paste that into a DAS browser and it should be able to figure out that > this is a sequence, a feature, a whatever. With the old content-types > there was enough information to do that right away. With this new > one a DAS browser needs to parse the XML to figure out what's in it. > Autodetection of XML formats? I don't want to go there. > > That's also the reason for Gregg's opposition. > > > You (Allen) and Lincoln, on the other hand, want that user to be able to > go to a web browser and paste the URL in, to get a basic idea of what's > there. > > I think that's also important. > > I think there are other solutions. One is "if the server sees a web > browser then return the XML data streams as a 'text/xml'". > > For example: > if "Mozilla" in headers["User-Agent"]: > ... this is IE, Mozilla, Firefox, and a few others .. > > That catches most of the browsers anyone here cares about. As > another solution, look at the "Accept" header sent by the browser. 
> Here's what Firefox sends: > > Accept: text/xml,application/xml,application/xhtml+xml,text/html; > q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5' > > Here's Safari and "links" (a text browser): > > Accept: */* > > Another rule them might be > > if asking_for_xml_format and "*/*" in headers["Accept"]: > ... return it as "text/xml" ... > > Though a better version is to make sure the client doesn't know about > the expected content type: > > > if asking_for_xml_format: > return_content_type = ... whatever is appropriate ... > > if (return_content_type not in headers["Accept"] > and "*/*" in headers["Accept"]): > > return_content_type = "text/xml" > .... optionally insert style sheet .... > > > > Another solution is to send a "what kind of DAS object are you?" request > to the URL (eg, tack on a ? query or tell the server that the client > will > "Accept: application/x-das-autodiscovery"). > > > I think that's clumsy, but I mention it as another way to support > both DAS client app and human browser requests of the same URL. > > > >> From: Allen Day > > >> Looks like the cache server. FYI, I have updated the server to use > >> all > >> "text/xml" Content-Type for all xml response types. This was > >> approved by > >> Lincoln so that web browsers could be pointed at the das server and > >> "just > >> work". I thought these changes had already made their way into the > >> spec, > >> but apparently not. > > >> On Fri, 28 Oct 2005, Helt,Gregg wrote: > >>> But according to the spec the content type header needs to be: > >>> Content-Type: text/x-das-features+xml > >>> I'm using this in the IGB DAS/2 client to parse responses based on > >>> the > >>> content type. With "text/plain; charset=UTF-8" IGB doesn't know what > >>> parser to use and gives up. So right now I can't visualize > >>> annotations > >>> from the biopackages server. 
I'm pretty sure the server was setting > >>> the > >>> content-type header correctly on Wednesday -- did anything change > >>> since > >>> then that could be causing this? Could the server-side cache be > >>> doing > >>> this for some reason? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > From dalke at dalkescientific.com Wed Nov 9 17:37:21 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 9 Nov 2005 18:37:21 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: Steve: > Here's the relevant thread for reference: > http://portal.open-bio.org/pipermail/das2/2005-July/000227.html Ahh, it's the one I half remembered, from July. Allen said: > Not sure how much value there is in > this, but here is a very simple graphical display of regions on the > server, and their relative sizes. I think it's useful to have web browserability, as it were, but I think it's a secondary goal. To me the ability to transform the XML via the stylesheet is something that's technology driven and not user driven. That is, nothing in the previous work, including the DAS/2 proposals from others, mentioned that as a need. On the other hand, being able to get the content type of what's coming back from the server is a design goal, and we have an existing need -- Gregg's example -- for it. I would rather therefore put the onus on the data provider to be clever in sniffing out the client than in the DAS/2 client in sniffing out the data. Steve: > In your other email on this thread, you said: > >> This, btw, suggests a third solution to the problem of letting DAS/2 >> and web browser clients both point to the same object - se >> >> Content-Type: text/xml; das-type=das2xml >> >> But that's ugly. > > This seems like a good solution (and not too ugly IMHO). 
The das-type > value > could be more detailed (e.g., x-das-features+xml). However, I recall > that > there were possible problems with this syntax, but can't remember the > details at the moment. We have discussed this on-and-off for a while now, eh? Here's the previous thread on it: http://portal.open-bio.org/pipermail/das2/2004-December/000019.html I need to do a bit more research. I don't like the idea of making new headers and I don't like the idea of using a modified content-type like that. The first because we aren't doing anything unusual compared to other projects and the second because I don't have any experience with that. I suspect the answer will be: - by default if no "?format=" is specified then return "text/xml" - if the client sends an "Accept: text/x-das-features+xml" then return the document with the proper content-type information In that way if someone pastes a "http://.../blah?format=xyz" and gets a bunch of garbage, they can manually chop off the obvious "format=" part of the query. But that doesn't agree with my use case, where the DAS/2 client gets a random URL. It would need to send "Accept: ..." where the "..." is a list of all the possible DAS content-types. I'll think about this some more while I'm out salsa dancing this evening. :) Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Thu Nov 10 01:25:48 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Wed, 09 Nov 2005 17:25:48 -0800 Subject: [DAS2] Agenda for weekly teleconference Message-ID: Time & Day: 12:00 Noon PST, Thursday 11 Nov 2005 Tel (US): 800-531-3250 Tel (Int'l): 303-928-2693 ID: 2879055 Agenda ------ * Decide on Europe-friendly time for this teleconference. Proposals: - Thu 9am PST = 12pm EST = 17:00 GMT - Wed 9am PST - Mon 9am PST * DAS/2 get spec issues: - Content-type: text/xml vs.
text/x-das-blah+xml http://portal.open-bio.org/pipermail/das2/2005-November/000287.html - XML encoding of type and feature properties: http://portal.open-bio.org/pipermail/das2/2005-November/000278.html Time and people permitting: * Summarize CSHL genome informatics meeting happenings relevant to DAS/2 (Allen, Gregg, Suzi, Lincoln). * Introduction to Apollo (Suzi) * DAS/2 validation (Andrew) From dalke at dalkescientific.com Thu Nov 10 01:34:28 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 02:34:28 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> Allen > To be even more concise, there are two use cases being presented here: > > 1) DAS/2 content should be viewable in a web browser, and doing so > requires an HTTP Content-Type header to have value 'text/xml'. > > 2) DAS/2 content should be viewable in a specialized DAS/2 browser, > and be > able to rely on HTTP headers to determine visualization mode, as > XML/DTD/Schema sniffing is undesirable. A use case describes what the user wants to do, from the user's perspective and not the implementation perspective. Sometimes they are the same, as when the user mandates certain technical decisions, but that's not the case here. Wikipedia has a good definition, at http://en.wikipedia.org/wiki/Use_case . To make use cases read nicely I've found it useful to have a name better than "the user". There will be many users of different aspects of a DAS system. Some are: - a person making the database/DAS adapter - an annotator - a molecular biologist The use case we're talking about here is to let person X (either an annotator or a molecular biologist) communicate with person Y. Rather than saying "X" and "Y" I'll say "Bill" and "Jim". Bill sends Jim an email saying "I think there's a problem with this annotation; it looks like it's off-by-one.
Could you take a look at it for me?" (Make up your own explanation :) Jim gets the email, sees the URL, and pastes it into his browser. If Jim is an annotator this will probably be a specialized DAS/2 client. If he's not, then more likely it will be a web browser. Both should "do the right thing", that is, provide meaningful information about the given entity and options for more exploration and analysis. This use case suggests several functional details: - There needs to be a way to exchange DAS details via normal text, for inclusion in email. DAS uses URLs so we should build on those. This means they'll also likely be used in generic web pages. Because the specific consumer of a URL isn't known it's not possible to put a "?format=" field on the end of the URL. Thus these URLs must not specify the format. - DAS/2 clients (web browsers and specialized apps) should have some way to get (and easily get) the URL for a given annotation, region, feature type, etc. - specialized DAS clients (IGB) need a way for users to enter an arbitrary DAS URL. If one or more of these won't happen then there's no problem. For example, if IGB etc. all don't support entering an arbitrary DAS URL then there's no need to handle both classes of clients. If there's no demand for direct visualization in a web browser then there's also no problem. I'm going to ask about the last. The whole point of this change is to support the ability for a generic web browser to go to a given URL and show something of interest. 1) who needs that? Can any of us point to a group of people who would use a direct web interface to a given DAS/2 URL? If so, why didn't it come up in earlier discussions? 2) why can't they go to a DAS/2 web app elsewhere and from there tell it "now link in the data from this URL." That is, view the URL through an intermediary.
3) why can't we tell people "stick a 'format=html' at the end to see it in HTML", if you want to make a web link to it, and if the server supports HTML displays. 4) Who wants to make a DAS/2 web app based directly on the DAS/2 data structure? Yes, that makes it trivial to have a first-pass web app, but that app will suck. It'll only support browsing the server's data structure via a tree. It won't support, say, the ability to incorporate more or alternate records in a view, fancy AJAX GUIs, etc. There will be no way to merge records from different servers because the annotation server only understands annotations on that server. My view now is that having the default MIME type for a DAS/2 entity be "text/xml", for the purpose of supporting direct web browser visualization of that entity, is not driven by a realistic use case and is interesting mostly for technical reasons. As such, we shouldn't do that. We should leave the return documents as distinct MIME types. That leads me to the result of more research. The relevant spec for the MIME type for XML documents is RFC 3023, at http://www.ietf.org/rfc/rfc3023.txt For commentary also see: http://www.xml.com/lpt/a/2004/07/21/dive.html http://diveintomark.org/archives/2004/02/13/xml-media-types These say we have lots of things to worry about. For example, "text/xml" requires that the content-type include the charset declaration, else the spec says to assume the document is in US-ASCII. There is no way for the XML itself to override that. If we go the "text/xml" route we mandate that either: - all servers include a charset in the content-type - those that don't must only serve ASCII data. The proper MIME type is under "application", as "application/x-das-*+xml" > then the character encoding is determined in this order: > > * the encoding given in the charset parameter of the Content-Type > HTTP header, or > * the encoding given in the encoding attribute of the XML declaration > within the document, or > * utf-8.
(quoting from http://www.xml.com/lpt/a/2004/07/21/dive.html ) Apparently some ISPs, e.g. in Russia and Japan, will transcode text/xml documents at the HTTP level, ignoring the encoding information in the XML itself. This can lead to problems. As the author of those commentaries says, "XML is tough." http://diveintomark.org/archives/2004/07/06/tough > The solution proposed in the referenced thread, or perhaps only on a > conference call, is to use the Content-Type header to address (1), > providing information to web browsers, as they are less flexible than a > specialized DAS/2 client. (2) is addressed using a DAS/2 specific > X-Das-Content-Type header, e.g. It must have been a conference call. I don't see mention of that in my back emails. I'm thankful to Steve for doing the writeups. To emphasize what I said earlier, what will happen in the case of (1)? Who will implement it? What will users expect from it? Why can't those users go through some intermediate DAS web app to better view that data? Why can't we say "add a 'format=html' for interactive viewing"? As for (2), I don't want a new header. I know I talk about conneg and other neat features in HTTP but in re-reading appendix A of RFC 3023 http://www.ietf.org/rfc/rfc3023.txt it talks about over a dozen other solutions to the problem and why they were excluded. These include: > A.10 How about using a conneg tag instead (e.g., accept-features: > (syntax=xml))? > > When the conneg protocol is fully defined, this may potentially be a > reasonable thing to do. But given the limited current state of > conneg[RFC2703] development, it is not a credible replacement for a > MIME-based solution. In this case I'm willing to let people experiment with the idea before baking it into the spec. > A.9 How about a new Alternative-Content-Type header?
> > This is better than Appendix A.8, in that no extra functionality > needs to be added to a MIME registry to support dispatching of > information other than standard content types. However, it still > requires both sender and receiver to be upgraded, and it will also > fail in many cases (e.g., web hosting to an outsourced server), > where > the user can set MIME types (often through implicit mapping to file > extensions), but has no way of adding arbitrary HTTP headers. How much control will DAS/2 data providers have over their server? I know I want to support people who provide data as a set of files through Apache, though that's not driven by any use case. (This use case would involve a user who has different requirements than either Jim or Bill.) mod_mime is designed for that. I don't know how to add other headers for this case. The data providers we have now have control over all the headers. If that will essentially always be the case then adding a new header isn't a problem. Then again, if this is always the case then we can go ahead with conneg since an argument against conneg is it puts more work on the server implementations. In this too I'll be conservative - DAS/2 pushes no new ground for a web app development project; there should be no reason to invent a new header. > A.6 How about labeling with parameters in the other direction (e.g., > application/xml; Content-Feature=iotp)? > > This proposal fails under the simplest case, of a user with neither > knowledge of XML nor an XML-capable MIME dispatcher. In that case, > the user's MIME dispatcher is likely to dispatch the content to an > XML processing application when the correct default behavior should > be to dispatch the content to the application responsible for the > content type (e.g., an ecommerce engine for > application/iotp+xml[RFC2801], once this media type is registered).
> > Note that even if the user had already installed the appropriate > application (e.g., the ecommerce engine), and that installation had > updated the MIME registry, many operating system level MIME > registries such as .mailcap in Unix and HKEY_CLASSES_ROOT in Windows > do not currently support dispatching off a parameter, and cannot > easily be upgraded to do so. And, even if the operating system were > upgraded to support this, each MIME dispatcher would also separately > need to be upgraded. > X-DAS-Content-Type: text/x-das-feature+xml > X-DAS-Server: GMOD/0.0 > X-DAS-Status: 200 > X-DAS-Version: DAS/2.0 > ================== > > This also has the added benefit of already being implemented for a few > months. Are there objections to this solution? Yes. Several. When did "X-DAS-Status" come back into the picture? I thought we talked about this in spring and nixed it because it doesn't provide anything more useful than the existing HTTP-level error code. Or perhaps it was fall of last year? I think I remember raking leaves at the time. More useful, for example, would be a document (html, xml, or otherwise) which accompanies the error response and gives more information about what occurred. What does the "X-DAS-Server" get you that the normal "Server:" doesn't get you? What's the use case? Why is the "X-DAS-Version" at all important? What's important is the data content. It's the document return type/version that's important and not the server version.
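To make that concrete, a client that relies only on the standard status line and Content-Type needs just one small dispatch step. A minimal sketch (hypothetical names and media type, not taken from any existing DAS client):

```python
# Hypothetical sketch: dispatch on standard HTTP fields only, no X-DAS-* headers.
DAS_FEATURE_TYPE = "application/x-das-feature+xml"  # illustrative media type

def classify_response(status, content_type):
    """Classify a DAS/2 response using only the status code and Content-Type."""
    if status != 200:
        return "http-error"  # resource- or transport-level failure
    # Strip any parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == DAS_FEATURE_TYPE:
        return "das-features"  # hand off to the feature-XML parser
    return "unexpected-content"  # not a DAS document at all
```

A client that also had to honor an X-DAS-Status header would need a second, redundant branch in this function.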
But I mentioned most of these over a year ago http://portal.open-bio.org/pipermail/das/2004-September/000814.html In summary: - no support for direct web browser access to a URL, except with a likely use case; - keep the default response in an XML format - change that XML content-type to "application/x-das-*+xml" instead of "text/*" - have no requirement for new, DAS-specific headers Andrew dalke at dalkescientific.com From allenday at ucla.edu Thu Nov 10 02:18:23 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 9 Nov 2005 18:18:23 -0800 (PST) Subject: [DAS2] Agenda for weekly teleconference In-Reply-To: References: Message-ID: Missing this week, I'm in Rio de Janeiro. I'm giving a talk on DAS tomorrow though, so I'm still contributing! :) -Allen On Wed, 9 Nov 2005, Chervitz, Steve wrote: > Time & Day: 12:00 Noon PST, Thursday 10 Nov 2005 > Tel (US): 800-531-3250 > Tel (Int'l): 303-928-2693 > ID: 2879055 > > Agenda > ------ > > * Decide on Europe-friendly time for this teleconference. > Proposals: > - Thu 9am PST = 12pm EST = 17:00 GMT > - Wed 9am PST > - Mon 9am PST > > * DAS/2 get spec issues: > - Content-type: text/xml vs. text/x-das-blah+xml > http://portal.open-bio.org/pipermail/das2/2005-November/000287.html > > - XML encoding of type and feature properties: > http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > > Time and people permitting: > > * Summarize CSHL genome informatics meeting happenings relevant to > DAS/2 (Allen, Gregg, Suzi, Lincoln). > > * Introduction to Apollo (Suzi) > > * DAS/2 validation (Andrew) > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > From ed_erwin at affymetrix.com Thu Nov 10 18:33:58 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 10:33:58 -0800 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> Message-ID: <43739296.4030307@affymetrix.com> Andrew Dalke wrote: > > >> X-DAS-Content-Type: text/x-das-feature+xml >> X-DAS-Server: GMOD/0.0 >> X-DAS-Status: 200 >> X-DAS-Version: DAS/2.0 >> ================== >> >> This also has the added benefit of already being implemented for a few >> months. Are there objections to this solution? > > > Yes. Several. > > When did "X-DAS-Status" come back into the picture? I thought > we talked about this in spring and nixed it because it doesn't provide > anything more useful than the existing HTTP-level error code. Or perhaps > it was fall of last year? I think I remember raking leaves at the time. > > More useful, for example, would be a document (html, xml, or otherwise) > which accompanies the error response and gives more information about > what occurred. > Using the HTTP-level error codes can cause problems. For a user (let's call her Varla) using IE, the browser will intercept some error codes and present her with some IE-specific garbage, throwing away any content that was sent back in addition to the error code. Even for a user (Marla this time) using IGB, firewalls and/or caching and/or apache port-forwarding mechanisms can throw out anything with a status code in the error range. (I did test having the NetAffx DAS server send HTTP status codes, and I did have problems with that in IGB, though I've forgotten the specifics. It was about a year ago....) I don't care if the status code is indicated with a header like "X-DAS-Status: 200" or with some XML content, or with both. But I think the HTTP status code has to be a separate thing, and will usually be "200" indicating that the user (sorry, I meant to say LeRoy) successfully communicated with the DAS server.
Ed From dalke at dalkescientific.com Thu Nov 10 19:49:18 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 20:49:18 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> Ed: > Using the HTTP-level error codes can cause problems. > I don't care if status code is indicated with a header like > "X-DAS-Status: 200" or with some XML content, or with both. But I > think the HTTP status code has to be a separate thing, and will > usually be "400" indicating that the user (sorry, I meant to say > LeRoy) successfully communicated with the DAS server. Okay, sounds like using HTTP codes for this causes problems in practice. What about returning a different content-type for that case? 200 Ok Content-Type: application/x-das-error Something bad happened. Pros: - doesn't add a new header - just as easy to detect in the client - easier to support on the server for some use cases Andrew dalke at dalkescientific.com From lstein at cshl.edu Thu Nov 10 19:34:51 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 10 Nov 2005 14:34:51 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <200511101434.51966.lstein@cshl.edu> I didn't know that X-DAS-Status had ever been deprecated. I strongly feel that the DAS status codes are separate from the HTTP codes and should not try to piggyback on the HTTP status line. 
Lincoln On Thursday 10 November 2005 01:33 pm, Ed Erwin wrote: > Andrew Dalke wrote: > >> X-DAS-Content-Type: text/x-das-feature+xml > >> X-DAS-Server: GMOD/0.0 > >> X-DAS-Status: 200 > >> X-DAS-Version: DAS/2.0 > >> ================== > >> > >> This also has the added benefit of already being implemented for a few > >> months. Are there objections to this solution? > > > > Yes. Several. > > > > When did "X-DAS-Status" come back into the picture? I thought > > we talked about this in spring and nixed it because it doesn't provide > > anything useful than the existing HTTP-level error code. Or perhaps > > it was fall of last year? I think I remember raking leaves at the time. > > > > More useful, for example, would be a document (html, xml, or otherwise) > > which accompanies the error response and gives more information about > > what occurred. > > Using the HTTP-level error codes can cause problems. > > For a user (let's call her Varla) using IE, the browser will intercept > some error codes and present her with some IE-specific garbage, throwing > away any content that was sent back in addition to the error code. > > Even for a user (Marla this time) using IGB, firewalls and/or caching > and/or apache port-forwarding mechanisms can throw out anything with a > status code in the error range. > > (I did test having the NetAffx DAS server send HTTP status codes, and I > did have problems with that in IGB, though I've forgotten the specifics. > It was about a year ago....) > > I don't care if status code is indicated with a header like > "X-DAS-Status: 200" or with some XML content, or with both. But I think > the HTTP status code has to be a separate thing, and will usually be > "400" indicating that the user (sorry, I meant to say LeRoy) > successfully communicated with the DAS server. > > Ed > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From ed_erwin at affymetrix.com Thu Nov 10 19:56:12 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 11:56:12 -0800 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> Message-ID: <4373A5DC.3070102@affymetrix.com> Andrew Dalke wrote: > Okay, sounds like using HTTP codes for this causes problems in > practice. > > What about returning a different content-type for that case? > > 200 Ok > Content-Type: application/x-das-error > > > Something bad happened. > > That seems fine to me. There is still the separate issue of whether the content is "application/x-das-error" or simply "text/xml". But that is another discussion that is already ongoing and to which I have nothing to add. From dalke at dalkescientific.com Thu Nov 10 20:01:45 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 21:01:45 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <200511101434.51966.lstein@cshl.edu> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <200511101434.51966.lstein@cshl.edu> Message-ID: <7fd7a40582a6d8ccdc694c2a91b6f8b7@dalkescientific.com> Lincoln: > I didn't know that X-DAS-Status had ever been deprecated. I strongly > feel that > the DAS status codes are separate from the HTTP codes and should not > try to > piggyback on the HTTP status line. I'm okay with the assertion "something happened at the DAS level" not being in the HTTP status code. Not ecstatic, but real world trumps purity.
I don't like the idea of adding new HTTP headers for this information. In my client code I need to do the following: - was there an HTTP error code? - is the return content-type correct? Having another header means I write: - was there an HTTP error code? - was there a DAS error code? - is the return content-type correct? I would rather have one less bit of code to do wrong. As I also mentioned, I would like to support DAS annotations made available through a basic Apache install and a set of files, likely used by someone who just wants to provide annotations. This is not one of the current design goals; should it be, or should we require that everyone have more control over the server? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Nov 10 20:10:14 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 21:10:14 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <81b4c8e3062e94b2032e37995f26b588@dalkescientific.com> Ed: > For a user (let's call her Varla) using IE, the browser will intercept > some error codes and present her with some IE-specific garbage, > throwing away any content that was sent back in addition to the error > code. Here's the question I had earlier. Will people be using a DAS/2 annotation server directly through a web browser? As far as I'm aware there's no demand for this. None of the proposals mentioned it and the current discussion started from a technical discussion at ISMB; that is, because it could, and not because it is needed. I thought most people using IE/Moz/etc. would go through a DAS application server, which integrates views from different DAS annotation servers. All this discussion is about returning pages back from an annotation server in a form directly viewable by a web browser.
I don't see that as being useful. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Nov 10 21:45:09 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 10 Nov 2005 22:45:09 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <43739296.4030307@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> Message-ID: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Further refining this from today's phone meeting Ed: > For a user (let's call her Varla) using IE, the browser will intercept > some error codes and present her with some IE-specific garbage, > throwing away any content that was sent back in addition to the error > code. The case Ed came across was from an in-house group using a Windows call out to IE as a background process to fetch a web page. In that case (as I understand it) it would convert HTTP error responses into its own error messages. Ed couldn't recall during the conversation whether it was possible to get ahold of the error code at all. Did they have to parse the output? > Even for a user (Marla this time) using IGB, firewalls and/or caching > and/or apache port-forwarding mechanisms can throw out anything with a > status code in the error range. 404 gets through, yes? All of those are supposed to be transparent to error codes, or at the very least translate them from (say) 404 to 400. Can anyone point me to some reports of one of these mishaps? We definitely need to have some tie-ins with the HTTP error codes. Consider these two implementations for getting http://example.com/das2/genome/dazypus/1.43/ (Note the typo "dazypus" -> "dasypus") A) One system might have all "/das2" URLs forwarded to a DAS server. B) Another might have a handler only for "/das2/genome/dasypus" and let Apache do the rest. In case A) the DAS server sees that the given resource doesn't exist. It needs to return an error.
It can return either "200 Ok" followed by a DAS error payload, or return a "404 Not Found" at the HTTP level. In case B) the request never gets to the DAS handler because of the typo. Apache sees there's nothing for the resource so returns a "404 Not Found". The client code is easier if it can check the HTTP error code and stop on failure. This means it's best for case A) for the DAS/2 server to return an HTTP error code of 404, and perhaps an optional ignorable payload. > (I did test having the NetAffx DAS server send HTTP status codes, and > I did have problems with that in IGB, though I've forgotten the > specifics. It was about a year ago....) Do you have the specifics perhaps in an old email somewhere? Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Thu Nov 10 22:43:02 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 14:43:02 -0800 Subject: [DAS2] Re: how do I load probe sets into IGB now? In-Reply-To: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> References: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Message-ID: <4373CCF6.9060508@affymetrix.com> Hi, The old DAS loading mechanism is still there, in exactly the same place it used to be: File->Load DAS Features. The new "DAS/2" tab at the bottom is for "DAS/2" servers, of which there are only a few at the moment, and which are still experimental. Ed Ann Loraine wrote: > Hi, > > Congratulations everybody on the new release of IGB! > > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. 
Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? > > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org From ed_erwin at affymetrix.com Thu Nov 10 22:49:47 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Thu, 10 Nov 2005 14:49:47 -0800 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Message-ID: <4373CE8B.3000302@affymetrix.com> Andrew Dalke wrote: > Further refining this from today's phone meeting > > Ed: > >> For a user (let's call her Varla) using IE, the browser will intercept >> some error codes and present her with some IE-specific garbage, >> throwing away any content that was sent back in addition to the error >> code. > > > The case Ed came across was from an in-house group using a Windows call > out to IE as a background process to fetch a web page. In that case > (as I understand it) it would convert HTTP error responses into its own > error messages. > > Ed couldn't during the conversation recall if it was possible to > get ahold of the error code at all. Did they have to parse the output? Here is some info from microsoft about these "friendly HTTP error messages": http://support.microsoft.com/kb/q218155/ Note that whether the real error message gets through seems to depend on both the error code, and the length of the content. How is that friendly? >> (I did test having the NetAffx DAS server send HTTP status codes, and >> I did have problems with that in IGB, though I've forgotten the >> specifics. It was about a year ago....) 
> > > Do you have the specifics perhaps in an old email somewhere? > I can look around when I get back from vacation, which I'm on all next week. Ed From Gregg_Helt at affymetrix.com Thu Nov 10 22:46:23 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Thu, 10 Nov 2005 14:46:23 -0800 Subject: [DAS2] RE: how do I load probe sets into IGB now? Message-ID: That data is on a DAS/1 server. The new "Data Access" tab is just for QuickLoad and DAS/2 servers. DAS/1 servers are still accessible via the "File --> Load DAS Features" menu item. In the near term the plan is to soon move the DAS/1 access into the "Data Access" tab as a DAS/1 subtab alongside the QuickLoad and DAS/2 subtabs, but this wasn't ready in time for the current release. In the longer term the probe data will be hosted on both DAS/1 and DAS/2 servers. gregg > -----Original Message----- > From: Ann Loraine [mailto:aloraine at gmail.com] > Sent: Thursday, November 10, 2005 2:30 PM > To: das2 at portal.open-bio.org > Cc: Helt,Gregg; Erwin, Ed > Subject: how do I load probe sets into IGB now? > > Hi, > > Congratulations everybody on the new release of IGB! > > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do > it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? 
> > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org From dalke at dalkescientific.com Thu Nov 10 23:19:51 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 11 Nov 2005 00:19:51 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <4373CE8B.3000302@affymetrix.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> <4373CE8B.3000302@affymetrix.com> Message-ID: <0cc693a86af103c99b668e5f6db2c9e6@dalkescientific.com> > Here is some info from microsoft about these "friendly HTTP error > messages": > > http://support.microsoft.com/kb/q218155/ > > Note that whether the real error message gets through seems to depend > on both the error code, and the length of the content. How is that > friendly? Indeed. >> Internet Explorer 5 and later provides a replacement for the HTML >> template for the following friendly error messages: >> >> 400, 403, 404, 405, 406, 408, 409, 410, 500, 501, 505 I've marked them with ***. The only ones I think we might use, were we to piggyback, are 409 (for locking?), 415 (for servers that don't support a requested format) and 416 (for unsupported range requests?). 
*** 400: ('Bad request', 'Bad request syntax or unsupported method'), 401: ('Unauthorized', 'No permission -- see authorization schemes'), 402: ('Payment required', 'No payment -- see charging schemes'), *** 403: ('Forbidden', 'Request forbidden -- authorization will not help'), *** 404: ('Not Found', 'Nothing matches the given URI'), *** 405: ('Method Not Allowed', 'Specified method is invalid for this server.'), *** 406: ('Not Acceptable', 'URI not available in preferred format.'), 407: ('Proxy Authentication Required', 'You must authenticate with ' 'this proxy before proceeding.'), *** 408: ('Request Time-out', 'Request timed out; try again later.'), *** 409: ('Conflict', 'Request conflict.'), *** 410: ('Gone', 'URI no longer exists and has been permanently removed.'), 411: ('Length Required', 'Client must specify Content-Length.'), 412: ('Precondition Failed', 'Precondition in headers is false.'), 413: ('Request Entity Too Large', 'Entity is too large.'), 414: ('Request-URI Too Long', 'URI is too long.'), 415: ('Unsupported Media Type', 'Entity body in unsupported format.'), 416: ('Requested Range Not Satisfiable', 'Cannot satisfy request range.'), 417: ('Expectation Failed', 'Expect condition could not be satisfied.'), *** 500: ('Internal error', 'Server got itself in trouble'), *** 501: ('Not Implemented', 'Server does not support this operation'), 502: ('Bad Gateway', 'Invalid responses from another server/proxy.'), 503: ('Service temporarily overloaded', 'The server cannot process the request due to a high load'), 504: ('Gateway timeout', 'The gateway server did not receive a timely response'), *** 505: ('HTTP Version not supported', 'Cannot fulfill request.'), > I can look around when I get back from vacation, which I'm on all next > week. Enjoy! 
Andrew dalke at dalkescientific.com From aloraine at gmail.com Thu Nov 10 22:29:48 2005 From: aloraine at gmail.com (Ann Loraine) Date: Thu, 10 Nov 2005 16:29:48 -0600 Subject: [DAS2] how do I load probe sets into IGB now? Message-ID: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Hi, Congratulations everybody on the new release of IGB! I have a question about the new Quickload/DAS tab. I'm trying to load some probe sets via DAS but can't figure out how to do it. I used to be able to get them by using the "DAS" menu item, which opened a widget containing a menu of DAS servers. I would select the one labeled AffyDas (or something like that) and then I would get to pick the chip (more often, chips) I wanted to see. Then IGB would query the server and get me the probe set design sequence alignments for the currently-shown region. I can't find this in the new interface. Can you help? -Ann -- Ann Loraine Assistant Professor Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From allenday at ucla.edu Fri Nov 11 01:39:36 2005 From: allenday at ucla.edu (Allen Day) Date: Thu, 10 Nov 2005 17:39:36 -0800 (PST) Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> Message-ID: > What does the "X-DAS-Server" get you that the normal "Server:" doesn't > get you? What's the use case? I don't know. The absence of this header was actually reported by Dasypus output sent to me by you on May 26, 2005. Here's a snippet of the Dasypus diagnostics, followed by a comment from you: "Date: Thu, 26 May 2005 12:29:32 -0600 From: Andrew Dalke To: DAS/2 Subject: [DAS2] dasypus status [...] WARNING: Adding X-DAS-Server header 'gmod/0.0' The prototype doesn't mention the DAS server used. I stick one in based on the host name. 
[...]" > Why is the "X-DAS-Version" at all important? What's important is the > data content. It's the document return type/version that's important > and not the server version. It was actually originally (as far as I can tell from my email archive) discussed, along with X-DAS-Status, in an email from Lincoln on May 21, 2004, and forwarded to me on August 12, 2004: "-----Original Message----- From: Lincoln Stein [mailto:lstein at cshl.edu] Sent: Friday, May 21, 2004 1:22 PM To: edgrif at sanger.ac.uk; Gregg_Helt at affymetrix.com; avc at sanger.ac.uk; gilmanb at mac.com; dalke at dalkescientific.com Cc: lstein at cshl.edu; allen.day at ucla.edu Subject: DAS/2 notes [...] In addition to the standard HTTP response headers, DAS servers return the following HTTP headers: X-DAS-Version: DAS/2.0 X-DAS-Status: XXX status code [...]" > But I mentioned most of these over a year ago > http://portal.open-bio.org/pipermail/das/2004-September/000814.html > > In summary: > - no support for direct web browser access to a URL, except with a > likely use case; > - keep the default response in an XML format > - change that XML content-type to "application/x-das-*+xml" instead > of "text/*" > - have no requirement for new, DAS-specific headers This discussion suggests we need a more formal process of modifying the client and server implementations, e.g. modify spec first and commit, then update code. -Allen From td2 at sanger.ac.uk Fri Nov 11 09:24:52 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Fri, 11 Nov 2005 09:24:52 +0000 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses!
In-Reply-To: <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> References: <2dd2f4224520b2f35add5af1de821729@dalkescientific.com> <43739296.4030307@affymetrix.com> <83d48ca8f7128fb04efecd673ef61459@dalkescientific.com> Message-ID: <8C869723-601C-4236-B9FA-88F6D6401016@sanger.ac.uk> On 10 Nov 2005, at 19:49, Andrew Dalke wrote: > Ed: > >> Using the HTTP-level error codes can cause problems. >> > > >> I don't care if status code is indicated with a header like >> "X-DAS-Status: 200" or with some XML content, or with both. But I >> think the HTTP status code has to be a separate thing, and will >> usually be "400" indicating that the user (sorry, I meant to say >> LeRoy) successfully communicated with the DAS server. >> > > Okay, sounds like using HTTP codes for this causes problems in > practice. > > What about returning a different content-type for that case? > > 200 Ok > Content-Type: application/x-das-error > > > Something bad happened. > That looks reasonable, but could we add a bit of structure: 407 The sky is falling (There's also a possible argument for using textual, rather than numeric, error codes -- but it would be good to keep at least one part of the error response using a well-defined vocabulary for the benefit of clients that want to respond to different error conditions in different ways). Thomas. From Steve_Chervitz at affymetrix.com Fri Nov 11 21:24:50 2005 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 11 Nov 2005 13:24:50 -0800 Subject: [DAS2] how do I load probe sets into IGB now? In-Reply-To: <83722dde0511101429m398c38ebg8e4df3d9b2a8d0da@mail.gmail.com> Message-ID: Ann, Go to File -> Load DAS Features. There should be a DAS server named 'NetAffx-Align' that will give you what you want. Steve > From: Ann Loraine > Date: Thu, 10 Nov 2005 16:29:48 -0600 > To: > Cc: , "Helt,Gregg" > Subject: [DAS2] how do I load probe sets into IGB now? > > Hi, > > Congratulations everybody on the new release of IGB! 
> > I have a question about the new Quickload/DAS tab. > > I'm trying to load some probe sets via DAS but can't figure out how to do it. > > I used to be able to get them by using the "DAS" menu item, which > opened a widget containing a menu of DAS servers. I would select the > one labeled AffyDas (or something like that) and then I would get to > pick the chip (more often, chips) I wanted to see. Then IGB would > query the server and get me the probe set design sequence alignments > for the currently-shown region. > > I can't find this in the new interface. > > Can you help? > > -Ann > > -- > Ann Loraine > Assistant Professor > Section on Statistical Genetics > University of Alabama at Birmingham > http://www.ssg.uab.edu > http://www.transvar.org > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Sat Nov 12 00:51:41 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 11 Nov 2005 16:51:41 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 10 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 10 Nov 2005. $Id: das2-teleconf-2005-11-10.txt,v 1.1 2005/11/12 00:48:39 sac Exp $ Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt UCLA: Brian O'connor CSHL: Lincoln Stein UCBerkeley: Suzi Lewis Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org Agenda Items ------------ * New Euro-friendly meeting time It was decided to change the time for this weekly teleconference to Monday 9:30 AM PST (12:30 PM EST, 17:30 UK). [A] New teleconf time starts next week (Monday 14 Nov) * Spec Issues Gregg expressed a need to dedicate some of these weekly meetings to be focused on resolving spec issues. We will do this for next week's meeting. 
[A] Everyone come prepared to talk about retrieval spec issues on 11/14. Content-type issue: - Should we use text/xml or application/x-das-blah+xml? - Consensus: use application/x-das-blah+xml - [A] Steve will roll back changes made to the retrieval spec. - Andrew acknowledges that text/xml may be handy for visual debugging and other presentation tricks, but is not a user-driven need; it's a technical issue. - Lincoln: XML handling is very browser-dependent: o Firefox - nice DOM tree structure o Safari, Konqueror - no special rendering o MSIE - "Cannot be displayed" - Gregg: Now we just need to ensure that we're actually implementing the correct content-type for given responses, which brings up the next topic... * Validation - Gregg: we'd like to start using dasypus locally to verify client/server compliance with the spec. What state is it in? - Andrew: Just getting back to it now. [A] Andrew will talk with Chris D. to set up a web interface at biodas.org * Apollo Suzi: Can't talk about Apollo now. Will wait until Nomi is available. [A] Nomi will present Apollo at the 28 Nov DAS/2 weekly meeting. Status Reports -------------- Gregg: * CSHL Genome Informatics meeting summary of DAS/2-relevant things. - Gave talk about DAS/2 and demoed IGB. Went well. - Held a DAS BOF that was well-attended (n=15). Questions people had about DAS/2 have already been addressed. [A] Gregg will write up his CSHL DAS BOF notes and post. Discussion centered around what Sanger & EBI are doing with DAS. o There are lots of DAS-related projects there. o We'd like to have tighter linkage between DAS folks in the States and those in the UK. [A] Andrew will visit the UK DAS folks more often.
Ideas: + Help them transition to DAS/2 + Hold "DASathon" or jamboree there o People: Tim Hubbard, Thomas Down, Andreas Prlic o Projects: + Serving up 3D structures using modified DAS/1 server (SPICE) + Serving up protein annotations using modified DAS/1 server + Registry & discovery system for DAS/1 servers. This is SOAP-based. We'd like to have a non-SOAP-based system for DAS/2, which follows REST principles. - Andreas could likely create an HTTP-based alternative to his SOAP system, which uses the same core. - [A] Andrew will talk with Andreas P about non-SOAP reg/discovery - [A] DAS/2 grant needs progress on reg/discovery w/in next 6 mos * Grant (DAS/2 continuation) Lots of modifications were made just prior to submitting on 1 Nov. Some of the changes include: - Work closely with Sanger and EBI where they've done lots of work (3D structure and protein DAS). - More of a mechanism will be in place to drive the spec forward: o Andrew = designated 'spec czar' - makes ultimate decisions o Lincoln = designated 'spec godfather' - retains veto power Andrew: * Brought up the header issue from the spec discussion on the list this week. - Doesn't like the idea of 4 additional DAS-specific fields (error code, das version, server name, and something else) - Alternative: server returns content-type: application/x-das-error - Advantages: o no new header o simplified header -- just check the http error code in the content-type. o easier to implement o enables a flatfile-based server o Fits with REST philosophy of using HTTP as an application protocol, not a transport protocol. - Ed E: Can't we just return an error section in the document? Andrew: We could, but it requires parsing the document and only works for XML formats that we're in control of. - Gregg: The advantages of having metadata in the header outweigh the advantages of enabling a flatfile-based server. 
Andrew: We can utilize the existing header. Ed E: Piggybacking error codes causes problems with proxy servers (see email on the DAS/2 discussion list). - Decision: [A] Use standard HTTP error codes; use XML to specify error details. E.g., server status=200 content= error document Steve: When reviewing spec, encountered potential issues surrounding relationship between HTTP and DAS-specific error codes. Using standard HTTP codes will obviate this issue. Also noted that there's a bugzilla entry regarding error codes (which is now moot): http://bugzilla.open-bio.org/show_bug.cgi?id=1784 - Ed E: MSIE hides or modifies content based on certain HTTP error codes it gets. This has important implications for Windows platforms where IE's behavior can get in the way of other network-aware applications that don't even (knowingly) use IE. From Steve_Chervitz at affymetrix.com Sat Nov 12 01:52:15 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 11 Nov 2005 17:52:15 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 10 Nov 05 In-Reply-To: Message-ID: > Content-type issue: > - Should we use text/xml or application/x-das-blah+xml? > - Consensus: use application/x-das-blah+xml > - [A] Steve will roll back changes made to the retrieval spec. Done, but I noticed that we had been using text/x-das-blah+xml rather than application/x-das-blah+xml. I left it as text for now, although 'application' seems more correct according to the RFC on MIME media types, http://www.rfc-editor.org/rfc/rfc2046.txt which states: text -- textual information. ... Other subtypes [i.e., anything besides 'plain'] are to be used for enriched text in forms where application software may enhance the appearance of the text... application -- some other kind of data, typically either uninterpreted binary data or information to be processed by an application. ... 
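While the text/ vs. application/ question is unsettled, a client can simply tolerate both trees when checking a response's Content-Type. A minimal sketch, assuming the "x-das-blah+xml" naming pattern discussed above; the specific subtype name 'x-das-features+xml' below is an illustrative stand-in, not an official DAS/2 media type:

```python
# Sketch: lenient client-side check of a DAS/2-style Content-Type header.
# Accepts both the 'text/' and 'application/' top-level types while the
# spec decision is pending. The subtype 'x-das-features+xml' is a
# hypothetical instance of the x-das-blah+xml pattern.

def parse_media_type(header_value):
    """Split 'text/x-das-features+xml; charset=utf-8' into
    ('text', 'x-das-features+xml'), dropping parameters."""
    media = header_value.split(";", 1)[0].strip().lower()
    tree, _, subtype = media.partition("/")
    return tree, subtype

def is_das_xml(header_value, expected_kind="features"):
    """True if the header names a DAS XML document of the expected kind,
    in either the text/ or application/ tree."""
    tree, subtype = parse_media_type(header_value)
    if tree not in ("text", "application"):
        return False
    return subtype == "x-das-%s+xml" % expected_kind

print(is_das_xml("text/x-das-features+xml; charset=utf-8"))   # True
print(is_das_xml("application/x-das-features+xml"))           # True
print(is_das_xml("text/xml"))                                 # False
```

Once the spec settles on one tree, the accepted set shrinks to a single value and the check becomes an exact match.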
Steve From dalke at dalkescientific.com Mon Nov 14 11:47:09 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 14 Nov 2005 12:47:09 +0100 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: Steve: > I raised some other issues regarding types and feature properties etc. > a > couple of weeks ago that I'd like you to chime in on: > http://portal.open-bio.org/pipermail/das2/2005-October/000271.html > > The latest message on this thread is: > http://portal.open-bio.org/pipermail/das2/2005-November/000278.html I'll take them part by part. That last message suggested 29 2 * the values of the 'das:id', 'das:type', and 'das:ptype' attributes > are URLs relative to xml:base unless they begin with 'das:prop#', in > which case they are relative to the das:prop namespace. And from what I can tell about XML, there's no standard way to implement this using one of the standard XML parsers. How do you get the das:prop namespace for a given element? The parser often does the expansion for you. Eg, in one of the Python XML parsers it does the translations into Clark notation, like {http://www.biodas.org/ns/das/genome/2.00}ptype For more info on XML namespaces, see http://www.jclark.com/xml/xmlns.htm Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Mon Nov 14 13:29:26 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 14 Nov 2005 13:29:26 +0000 Subject: [DAS2] Re: what info is needed for DAS/2 registration? In-Reply-To: <955da4ae7783e60944687d86ec691e51@dalkescientific.com> References: <955da4ae7783e60944687d86ec691e51@dalkescientific.com> Message-ID: <81fdf1e73ee85ae55550f12ddcee13cf@sanger.ac.uk> Hi Andrew! > Looks like I will be more involved with the DAS/2 spec development, > and I'll be visiting the UK more often. good! > I want to make sure that the spec includes more of what's > needed for registration. o.k. 
very good, let's go through your mail: > My thought is to let the registration > system be able to query the DAS/2 server to get most of the fields > it needs, if not all. o.k. > There may still be some need to override the > definitions, The experience from doing the das1 registry tells us that some corrections are needed every now and then. It seems to be inevitable that sometimes users make mistakes / inaccuracies, etc. > so at the manual registration level this will be used > more to pre-populate an entry with a default. sounds good. - so this means the configuration for setting up a DAS source will get a little bigger. > In looking at the manual registration page I see the following, > along with comparisons to the existing DAS/2 spec > > ** Title/Nickname used by DAS clients for the display of the das tracks > ** Description for the user to get a quick grasp of what the data is about. - we have 60 sources in the registry by now and we expect to be up around 100 soon, so one needs a way to learn which of the sources are serving the data which is of particular interest ... > ** URL for more detailed description a link back to the homepage of the project that provides the data > > DAS/2 does not have this information for the service as a whole. > It does have it for each of the databases, somewhat. Here is > an example from the spec. > > taxon="http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118" > > doc_href="http://www.wormbase.org/documentation/users_guide/ > volvox.html" > > > > Should we add a "title" field to each data source? yes that would be good > Should we > add title/description/url fields to the DAS/2 service as a whole? not sure what you mean by that > ** coordinate system > > Each data source may have 1 or more versions. The version information > looks like > > > > > > In theory that assembly id could be a URL with more detailed > information about the assembly. Right now it's used as a unique > identifier. 
There is nothing there to convert these URLs into something human-readable. Hm, not sure I am completely convinced by representing a coordinate system as a URL. What if two reference servers provide the same assembly or are mirrors of each other? I would see it in a way where a DAS client would ask the registry "where are all the reference servers for NCBI 35- homo sapiens?" and then gets a list providing e.g. an American and a European mirror server; the client could choose the one which is geographically closer. > > Possible solutions for this are: > - define an "assembly" document, to be put at that URL and > include the authority/version/type/organism data mentioned at > http://das.sanger.ac.uk/registry/help_coordsys.jsp something like that. > ** DAS url > > Yep, DAS/2 has that one. :) :-) > > ** Admin email > > Hmm. Yeah, there should be more information about the service as > a whole. Admin email and perhaps a documentation href, eg, with > information about planned downtime. would be good. > > ** DAS capabilities > > That's handled differently in DAS/2. Did people really use this > information? actually this information is important (for das1) - it is used to distinguish reference servers and annotation servers ( on the client side) and needed for validation (on the registry side) "capabilities" are also related to data-types. E.g. a genome DAS client does not need to query a protein structure, because it can not do 3D... > ** Test access/ segment code labels I think there is a misunderstanding here: the test code is not a "label". The test code is e.g. a chromosomal segment or an accession code for a protein database for which annotations are returned if a feature request is being made. The "label" is used mainly to describe by which project a source is being funded. >> We are currently discussing if the labels should be used to describe >> a DAS source in more detail. e.g. "experimentally verified", >> "computational prediction", etc. 
> > These are two different things in one field. yes you are very right. Together with the BioSapiens DAS people we recently decided that there should be the possibility to assign gene-ontology evidence codes to each das source, so in the next update of the registry, this will be changed. > > What I'm going to propose is a generic key/value data structure > for just about all records. Some of the key names will be well > defined. Others can add new fields to experiment with / extend > the spec in a semi-constrained fashion. This would let people > try out a new property easily. sounds good. > In summary it sounds like DAS/2 needs: > - a few more pieces of meta data (eg, information about the > service as a whole) > - a bit better defined way to get information about the > reference assembly > I would agree with both of those Greetings, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Nov 14 17:09:11 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 14 Nov 2005 09:09:11 -0800 Subject: [DAS2] DAS/2 teleconference at 9:30 AM today PST Message-ID: Just a reminder that we've rescheduled the weekly DAS/2 teleconference for Mondays @ 9:30 AM Pacific time, starting today. I'm hoping the new time will give more people a chance to participate. Teleconference numbers: US dialin: 800-531-3250 International dialin: 303-928-2693 Conference ID: 2879055 We're also revising the format to alternate weeks between the DAS/2 specification itself and implementations of the specification. This should allow people who are mainly concerned about one or the other to avoid extra overhead. Today we will focus on spec issues. 
thanks, Gregg Helt From lstein at cshl.edu Mon Nov 14 17:23:18 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 14 Nov 2005 12:23:18 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: <725e762a203211651d1850097ae3fcc0@dalkescientific.com> References: <43739296.4030307@affymetrix.com> <725e762a203211651d1850097ae3fcc0@dalkescientific.com> Message-ID: <200511141223.19367.lstein@cshl.edu> Well, I give up arguing this one and will go with the way Andrew wants to do it. Therefore I propose the following rules: 1) Return the HTTP 404 error for the case that any component of the DAS2 path is invalid. This would apply to the following situations: Bad namespace Bad data source Unknown object ID 2) Return HTTP 301 and 302 redirects when the requested object has moved. 3) Return HTTP 403 (forbidden) for no-lock errors. 4) Return HTTP 500 when the server crashes. For all errors there should be a text/x-das-error entity returned that describes the error in more detail. Lincoln On Thursday 10 November 2005 04:45 pm, Andrew Dalke wrote: > Further refining this from today's phone meeting > > Ed: > > For a user (let's call her Varla) using IE, the browser will intercept > > some error codes and present her with some IE-specific garbage, > > throwing away any content that was sent back in addition to the error > > code. > > The case Ed came across was from an in-house group using a Windows call > out to IE as a background process to fetch a web page. In that case > (as I understand it) it would convert HTTP error responses into its own > error messages. > > Ed couldn't during the conversation recall if it was possible to > get ahold of the error code at all. Did they have to parse the output? > > > Even for a user (Marla this time) using IGB, firewalls and/or caching > > and/or apache port-forwarding mechanisms can throw out anything with a > > status code in the error range. > > 404 gets through, yes? 
> > All of those are supposed to be transparent to error codes, or at the > very least translate them from (say) 404 to 400. > > Can anyone point me to some reports of one of these mishaps? > > We definitely need to have some tie-ins with the HTTP error codes. > Consider these two implementations for getting > > http://example.com/das2/genome/dazypus/1.43/ > > (Note the typo "dazypus" -> "dasypus") > > A) One system might have all "/das2" URLs forwarded to a DAS server. > > B) Another might have a handler only for "/das2/genome/dasypus" and > let Apache do the rest. > > In case A) the DAS server sees that the given resource doesn't exist. > It needs to return an error. It can return either "200 Ok" followed > by a DAS error payload, or return a "404 Not Found" at the HTTP level. > > In case B) the request never gets to the DAS handler because > of the typo. Apache sees there's nothing for the resource so returns > a "404 Not Found". > > The client code is easier if it can check the HTTP error code and > stop on failure. This means it's best for case A) for the DAS/2 > server to return an HTTP error code of 404, and perhaps an optional > ignorable payload. > > > (I did test having the NetAffx DAS server send HTTP status codes, and > > I did have problems with that in IGB, though I've forgotten the > > specifics. It was about a year ago....) > > Do you have the specifics perhaps in an old email somewhere? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 14 17:28:10 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 14 Nov 2005 12:28:10 -0500 Subject: [DAS2] Re: New problem with content-type header in DAS/2 server responses! In-Reply-To: References: Message-ID: <200511141228.11358.lstein@cshl.edu> On Monday 14 November 2005 06:47 am, Andrew Dalke wrote: > Steve: > > I raised some other issues regarding types and feature properties etc. > > a > > couple of weeks ago that I'd like you to chime in on: > > http://portal.open-bio.org/pipermail/das2/2005-October/000271.html > > > > The latest message on this thread is: > > http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > > I'll take them part by part. > > That last message suggested > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" > > das:prop="http://www.biodas.org/ns/das/genome/2.00/properties"> > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > > > > I couldn't figure out why the "das:" namespace was needed for the > attributes. Why can't they be in the default namespace? The extra das: prefix is not needed since it is the same namespace as the default namespace. My feeling is that we should be using namespaces in attribute names but not in attribute values (e.g. das:ptype is ok, but "das:prop#phase" is not OK). For attribute values we should be using URIs consistently. Lincoln > The "das:" in the value of an attribute doesn't know anything about > the currently defined namespaces. So this "das:" must be something > completely different from the xmlns:das=... definition. 
> > > * the values of the 'das:id', 'das:type', and 'das:ptype' attributes > > are URLs relative to xml:base unless they begin with 'das:prop#', in > > which case they are relative to the das:prop namespace. > > And from what I can tell about XML, there's no standard way to implement > this using one of the standard XML parsers. How do you get the das:prop > namespace for a given element? The parser often does the expansion > for you. Eg, in one of the Python XML parsers it does the translations > into Clark notation, like > > {http://www.biodas.org/ns/das/genome/2.00}ptype > > For more info on XML namespaces, see http://www.jclark.com/xml/xmlns.htm > > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From dalke at dalkescientific.com Mon Nov 14 17:30:07 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 14 Nov 2005 18:30:07 +0100 Subject: [DAS2] Spec issues In-Reply-To: References: Message-ID: <05b94e3a6db3e4894af051f22f25dc4c@dalkescientific.com> On Nov 4 Steve wrote: > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > I think we're missing something. This is XML. We can do 29 2 This message brought to you by AT&T The whole point of having namespaces in XML is to keep from needing to define new namespaces like . In doing that, there's no problem in supporting things like "bg:glyph", etc. because the values are expanded as expected by the XML processor. 
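A quick way to see the expansion described above is to run a small namespaced document through a standard parser. A sketch only: the document fragment and the bg: namespace URI below are illustrative (the "bg:glyph" attribute echoes the example in this thread and is not part of the spec):

```python
# Demonstrates that a standard XML parser expands prefixed names into
# Clark notation ({namespace-uri}localname), so no DAS-specific
# 'das:prop#...' convention is required. The bg: URI is hypothetical.
import xml.etree.ElementTree as ET

DAS_NS = "http://www.biodas.org/ns/das/genome/2.00"

doc = (
    '<FEATURES xmlns="%s" xmlns:bg="http://example.org/bio-glyphs">'
    '<FEATURE><PROP bg:glyph="box">29</PROP></FEATURE>'
    '</FEATURES>' % DAS_NS
)

root = ET.fromstring(doc)
prop = root.find("{%s}FEATURE/{%s}PROP" % (DAS_NS, DAS_NS))

# Element names arrive already expanded into Clark notation:
print(prop.tag)     # {http://www.biodas.org/ns/das/genome/2.00}PROP
# Prefixed attribute names are expanded the same way:
print(prop.attrib)  # {'{http://example.org/bio-glyphs}glyph': 'box'}
print(prop.text)    # 29
```

Note that an unprefixed attribute would stay unqualified, since attributes have no default namespace; that is exactly the subtlety debated later in this thread.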
> Also, we might want to allow some controlled vocabulary terms to be > used for > the value of type.source (e.g., "das:curated"), to ensure that > different > users use the same term to specify that a feature type is produced by > curation. I talked with Andreas Prlic about what other metadata is needed for the registry system. He mentioned Together with the BioSapiens DAS people we recently decided that there should be the possibility to assign gene-ontology evidence codes to each das source, so in the next update of the registry, this will be changed. That's at the source level, but perhaps it's also needed at the annotation level. > The spec also seems alarmed by the existence of a xml:base attribute > in the > TYPE element. The idea is that any relative URL within this element > would be > resolved using that element's xml:base attribute. How would folks be > with > having the DAS/2 spec fully support the XML Base spec ( > http://www.w3.org/TR/xmlbase/ )? The result of this would be to add an > optional xml:base attribute to all elements that contain URLs or > subelements > with URLs. In my reading it seems that xml:base should be included wherever. See http://norman.walsh.name/2005/04/01/xinclude > Ugh. In the short term, I think there's only one answer: update your > schemas to allow xml:base either (a) everywhere or (b) everywhere you > want XInclude to be allowed. I urge you to put it everywhere as your > users are likely to want to do things you never imagined. ? > > Description: Properties are typed using the ptype attribute. The value > of > the property may be indicated by a URL given by the href attribute, or > may > be given inline as the CDATA content of the section. 
> > type="type/curated_exon"> > 29 > 2 > href="/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > So in contrast to the TYPE properties which are restricted to being > simple > string-based key:value pairs, FEATURE properties can be more complex, > which > seems reasonable, given the wild world of features. We might consider > using > 'key' rather than 'ptype' for FEATURE properties, for consistency with > TYPE > prop elements (however, read on). My thoughts on these are: - come up with a more consistent way to store key/value data - the Atom spec has a nice way to say "the data is in this CDATA as text/html/xml" vs. "this text is over there". I want to copy its way of doing things. - I'm still not clear about xlink. Another is the HTML-style <link>; Atom uses the "rel=" to encode information about the link. For example, the URL to edit a given document is See http://atomenabled.org/developers/api/atom-api-spec.php Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Nov 14 19:29:22 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 14 Nov 2005 11:29:22 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 14 Nov 2005. $Id: das2-teleconf-2005-11-14.txt,v 1.2 2005/11/14 19:20:37 sac Exp $ Attendees: Affy: Steve Chervitz, Gregg Helt CSHL: Lincoln Stein UCBerkeley: Suzi Lewis Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org ---------------------------------- AD talked with A. Prlic about registry service, we want to incorporate what he needs within DAS/2. 
What they have: - name (a few words) - for display of das track - title, description (paragraph) - synopsis - url for more info we have desc, id, doc_href, taxon Therefore, we need name attribute Need: - name (mandatory) (done - LS: adding it to spec now) - desc (optional) Coord system reg server: * in das/2 - it's not optional (0 interbase) * they find this important We have confusion between assembly and reference server LS: Need URI that points to assembly, independent of the reference server. GH: Would like to have annot servers that don't know anything about the ref server. LS: Could use the region URI to ID the assembly das/genome/sourceid/region = assembly id/uri GH: The trouble is that NCBI is a ref source for many assemblies, yet they lack a das server. They have no URI. LS: we can just make one up, or use most appropriate web page LS: When you request versioned source from a server, it should say what assembly coords it's working on and give a uri for that. In this case there's no guarantee you can do a 'get' on that URI. We want to say: 1- what is unique uri for assembly (everyone agrees to share this) 2- das URL for how to fetch it (some server's region url - trusted, faithful copy of what is at NCBI). Diff servers could assert that you can fetch it from various places. GH: assembly could be an attribute since there'd be only one. A list of ref servers that serve up that dna. LS: in versioned source response. new section between capabilities and namespaces called 'reference_sources'. Add 'assembly' attribute to version element: Message-ID: Andrew Dalke wrote on 14 Nov 05: > Steve: >> I raised some other issues regarding types and feature properties etc. >> a >> couple of weeks ago that I'd like you to chime in on: >> http://portal.open-bio.org/pipermail/das2/2005-October/000271.html >> >> The latest message on this thread is: >> http://portal.open-bio.org/pipermail/das2/2005-November/000278.html > > I'll take them part by part. 
> > That last message suggested > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" > > das:prop="http://www.biodas.org/ns/das/genome/2.00/properties"> > das:type="type/curated_exon"> > 29 > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ > CTEL54X.1 > /> > > > > I couldn't figure out why the "das:" namespace was needed for the > attributes. Why can't they be in the default namespace? Attributes don't have a default namespace (though one might think such a thing would be useful). See http://www.w3.org/TR/REC-xml-names/#defaulting This is a point which has been subject to much consternation: http://www.rpbourret.com/xml/NamespacesFAQ.htm#q5_3 http://lists.xml.org/archives/xml-dev/200002/msg00094.html > The "das:" in the value of an attribute doesn't know anything about > the currently defined namespaces. So this "das:" must be something > completely different from the xmlns:das=... definition. No, it refers to the xmlns:das definition in the parent FEATURES element. >> * the values of the 'das:id', 'das:type', and 'das:ptype' attributes >> are URLs relative to xml:base unless they begin with 'das:prop#', in >> which case they are relative to the das:prop namespace. > > And from what I can tell about XML, there's no standard way to implement > this using one of the standard XML parsers. How do you get the das:prop > namespace for a given element? You've identified the key weakness of my proposal: Knowing how to expand 'das:prop' occurring within attribute values would be a DAS-specific convention ('hack') for mapping to a controlled vocabulary for property values. So I'm not quite satisfied with this either. In another message of yours today, you propose an alternative to this: http://portal.open-bio.org/pipermail/das2/2005-November/000313.html See my reply to that for more ideas on this topic. 
Steve From td2 at sanger.ac.uk Tue Nov 15 09:14:01 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Tue, 15 Nov 2005 09:14:01 +0000 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: <21CB947F-FAE3-4D56-A110-CAB9606C9C84@sanger.ac.uk> On 14 Nov 2005, at 19:29, Steve Chervitz wrote: > > Coord system reg server: > * in das/2 - it's not optional (0 interbase) > * they find this important By "coordinate system" we're not really talking about the 0-based vs. 1-based issue, we're talking about globally unique names for sets of reference sequences (genome assemblies, protein databases, whatever). It might be possible to come up with a better name (I used to call these "namespaces"). > We have confusion between assembly and reference server > LS: Need URI that points to assembly, independent of the > reference server. > GH: Would like to have annot servers that don't know anything about > the ref server Definitely agree with this. This kind of "opaque assembly identifier" is what we've been calling a coord-system name. > LS: Could use the region URI to ID the assembly > das/genome/sourceid/region = assembly id/uri > > GH: The trouble is that NCBI is a ref source for many assemblies, yet > they lack a das server. They have no URI. > LS: we can just make one up, or use most appropriate web page This is possibly an argument for avoiding the use of URLs for assembly identifiers, if we can't be sure that the organisation that's the authority for a given assembly will be running an authoritative DAS server. URNs would be fine, as would the kind of structured but location-independent identifier that Andreas has been using. > Question: What do they mean by 'coord system'? some confusion here > e.g., Do they mean things like: 'this assembly start at 5000 relative > to this other assembly'? I think the way to provide this kind of information is in the form of a DAS alignment service between two coord-systems. 
We love the idea of putting up alignments between NCBI34 and NCBI35 then having a liftover-like tool which can go off and query the registry to discover this. Thomas. From ap3 at sanger.ac.uk Tue Nov 15 10:24:45 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 15 Nov 2005 10:24:45 +0000 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: Hi! I realized there were a couple of questions regarding the way "coordinate systems" are defined in the DAS-registry, so it would have been good if I had joined yesterday. I am glad that the conference is now at a time which is better for us Europeans and want to join in the future for some of the topics like registry, coordinate systems, proteins, etc. > > AD: ebi/sanger tracks three fields related to assembly (what they need > per server): > -authority = equiv to our assembly uri > -organism = we have as taxon > -type = ? "type" refers to a "physical dimension" of an object. E.g. a chromosome, a 3D protein structure, a protein sequence. > > Permits people to query things like: find out all servers that offer > ncbi > build 35 for human. > > Question: What do they mean by 'coord system'? some confusion here > e.g., Do they mean things like: 'this assembly start at 5000 relative > to this other assembly'? no, as Thomas already mentioned these "coordinate systems" could also be called "namespaces". They should be globally unique descriptors for reference objects / databases. > > For protein DAS, authority typically defines two diff coord systems: > 'pdb resnum, interprot' > It does not permit automated translation between two coord systems. unfortunately this is not that easy in protein space. The mapping from the 3D protein structure to the protein sequence is not straightforward. Think of negative, non-consecutive, and "non-numeric" residue numbers that can appear in the 3D structures. 
Therefore we came up with the "alignment" DAS document, which allows one to map an object in one coordinate system to another one. It can also be used to map one assembly to another. > [A] - Andrew will find out what they use it for > > AD: Believes the purpose is intended for human consumption. not only - the DAS clients usually can display a certain "coordinate system" e.g. Ensembl can do Chromosomal ones, but if DAS sources are available that speak the "UniProt, Protein Sequence" coordinate system, it knows how to project these onto the genome. - an "intelligent DAS client" :-) Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Thu Nov 17 02:35:32 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 17 Nov 2005 03:35:32 +0100 Subject: [DAS2] (x)link Message-ID: I mentioned having a generic <link> tag, again based on Atom. Steve replied: > Not sure about this one yet. In the Atom API, the value of the rel > attribute is restricted to a controlled vocabulary of link > relationships and available services pertaining to editing and > publishing syndicated content on the web: > http://atomenabled.org/developers/api/atom-api- > spec.php#rfc.section.5.4.1 > > What would a controlled vocab for DAS resources be? I don't think I understand the Atom one. Turns out I was actually looking at the Atom publishing protocol at http://code.blogger.com/archives/atom-docs.html which defines links including: The service.post is the URI where you would send an Entry to post to your blog. The service.feed is the URI where you would make an Atom API request to see the Blog's latest entries. 
We could define similar links like: - where to edit and/or lock the given resource - how to get a list of locks - how to get from the given DAS resource to its parent (ie, how to go "up" in the tree, in the case of a cross-link from another server) These could be done as distinct elements or done as qualifications of an existing element. The advantage of the latter (using a <link>) is that others may add their own link types. > Skimming through the DAS/2 retrieval spec, our use of hrefs is > simply for pointing at the location of resources on the web > containing some specified content (e.g., documentation, database > entry, image data, etc.). But they are used in different contexts (for human browsing, for machine fetching, for "service" requests). > The next/prev/start idea for Atom might have good applicability in the > DAS world for iterating through versions of annotations or assemblies > (e.g., rel='link-to-gene-on-next-version-of-genome'). One relationship > that would be useful for DAS would be 'latest', to get the latest > version of an annotation. Hmm. So every annotation would have an optional <link> section? In the current scheme do we always get the most recent version of an annotation? I didn't realize there was any way to get another version, except if it's been edited while you weren't looking. > DAS get URLs themselves seem fairly self-documenting (it's clear a > given link is for feature, type, or sequence for example), so having a > separate rel attribute may not provide much additional value for these > links. But it might be handy for versioning and for DAS/2 writebacks. I hadn't thought of versioning; I was thinking more of writebacks and how to find the parent. I was also thinking of structure data where I might want the experimental x-ray density data for a given structure. That might be done like That's part of the newly submitted DAS proposal so should not really drive this work. Steve also mentioned xlink. 
I've been looking at the spec but still don't understand its implications. There are several^H^Hmany parts to the spec I don't understand, especially in the context of DAS. locator? "arcrole"? "actuate"? Are all our links "simple"? Do we use anything besides the href?

Also, I see no mention in that spec of content-type. One of the things in the Atom spec is support (though not in the spec proper) for alternate or multiple ways to resolve a link, or multiple formats. (That is, a link may contain subelements and these subelements, if in something other than the "das" namespace, are free to add variant meanings.)

Andrew
dalke at dalkescientific.com

From ilari.scheinin at helsinki.fi Fri Nov 18 15:22:47 2005
From: ilari.scheinin at helsinki.fi (Ilari Scheinin)
Date: Fri, 18 Nov 2005 17:22:47 +0200
Subject: [DAS2] Getting individual features in DAS/1
Message-ID:

This mail is not really about DAS/2, but the web site says the original DAS mailing list is now closed.

I am setting up a DAS server that serves CGH data from my database to visualization software, which in my case is gbrowse. I've already set up Dazzle that serves the reference data from a local copy of Ensembl. I need to be able to select individual CGH experiments to be visualized, and as the measurements from a single CGH experiment cover the entire genome, this cannot of course be done by specifying a segment along with the features command.

I noticed that there is a feature_id option for getting the features in DAS/1.5, but on a closer look, it seems to work by getting the segment that the specified feature corresponds to, and then getting all features from that segment. My next approach was to use the feature type to distinguish between different CGH experiments. As all my data is of the type CGH, I thought that I could spare this piece of information for identifying purposes.

First I tried the generic seqfeature plugin. I created a database for it with some test data.
However, getting features by type does not seem to work. I always get all the features from the segment in question.

Next I tried the LDAS plugin. Again I created a compatible database with some test data. I must have done something wrong with the data file I imported to the database, because getting the features does not work. I can get the feature types, but trying to get the features gives me an ERRORSEGMENT error.

I thought that before I go further, it might be useful to ask whether my approach seems reasonable, or is there a better way to achieve what I am trying to do? What should I do to be able to visualize individual CGH profiles?

I'm grateful for any advice,
Ilari

From ap3 at sanger.ac.uk Fri Nov 18 16:54:27 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 18 Nov 2005 16:54:27 +0000
Subject: [DAS2] das registry and das2
Message-ID: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk>

Hi!

I would like to start a discussion of how to provide a proper DAS interface for our das-registration server at http://das.sanger.ac.uk/registry/

Currently it is possible to interact with it using SOAP, or manually via the HTML interface. We should also make it accessible using URL requests. To get this started I would propose the following query syntax. This might also provide another opportunity to have a discussion about the coordinate system descriptions. If some of the used terms are unclear, there is some documentation at http://das.sanger.ac.uk/registry/help_index.jsp

Regards,
Andreas

Request:
http://server/registry/list
http://server/registry/find?
[keyword,organism,authority,type,capability,label]=searchterm

Response:
  DS_109
  myDasSource
  some free text
  NCBI 35 chromosome Homo sapiens 9606 4:55349999,55749999
  UniProt Protein Sequence P00280
  sequence features
  2005-Nov-16
  about uniprot

-----------------------------------------------------------------------
Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891

From dalke at dalkescientific.com Fri Nov 18 18:00:12 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 18 Nov 2005 19:00:12 +0100
Subject: [DAS2] das registry and das2
In-Reply-To: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk>
References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk>
Message-ID: <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com>

Andreas Prlic:
> I would like to start a discussion of how to provide a proper DAS
> interface for our das-registration server at
> http://das.sanger.ac.uk/registry/
>
> Currently it is possible to interact with it using SOAP, or manually
> via the HTML interface. We should also make it accessible using URL
> requests.

One of the things Gregg and I talked about at ISMB was that the top-level "das-sources" format is, or can be, identical to what's needed for the registry server.

As it's structured now the top-level interface to a das2/genome URL returns a list of sources. Based on what you need for the registry, we're going to add support for data about the source itself. The resulting das-sources XML document is effectively identical to what you're looking for.

Hence I think the top-level XML format for a DAS/2 service is identical to the XML format for a registry server. A difference is the support for searches across sources. We don't have that in DAS.

This is an example, btw, of how a generic link element could be useful. Suppose we don't add this in DAS/2.0. The EBI could publish a link of that sort to say that the given URL (which would be the current URL) also supports a registry search interface.
Or we could have all DAS/2 servers implement a search. I don't think that should be a requirement.

> http://server/registry/list
> http://server/registry/find?
> [keyword,organism,authority,type,capability,label]=searchterm

My proposal doesn't affect this.

Why do "find" and "list" take different URLs? Another possibility is that the same URL returns everything if there are no filters in place.

Are multiple search terms allowed? Boolean AND or OR?

Andrew
dalke at dalkescientific.com

From ap3 at sanger.ac.uk Mon Nov 21 10:55:06 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Mon, 21 Nov 2005 10:55:06 +0000
Subject: [DAS2] das registry and das2
In-Reply-To: <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com>
References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com>
Message-ID:

Hi Andrew,

> As it's structured now the top-level interface to a das2/genome URL
> returns a list of sources. Based on what you need for the registry,
> we're going to add support for data about the source itself.
>
> The resulting das-sources XML document is effectively identical to
> what you're looking for.

That sounds good. I agree the description should look identical for both the sources and the registry. If the sources are already properly described this also makes it easier to "publish" them.

I think it is rather clear why most of the fields in the registry are there. The issue that might need most discussion is how to describe a coordinate system. This information is important because a DAS client usually understands one or multiple coordinate systems. E.g. Ensembl knows about Chromosomes and Clones, but it can also display UniProt annotations in some cases. Similarly, the SPICE DAS client can display annotations served in PDB-residue numbering and UniProt coordinates, but does not know how to deal with genomic coordinates.
Therefore the "coordinate system" or "namespace" is an important part of the description of a DAS source.

What I found in the current spec-draft that comes closest to this issue is the different "domains", e.g.

http://server/das/genome/source/version/features

so I might want to say

http://server/das/genome/homosapiens/ncbi35/features
http://server/das/genome/musmusculus/ncbim34/features

or should it be

http://server/das/genome/ncbi/homosapiens35/features
http://server/das/genome/ncbi/musmusculus34/features
?

Hm. I am not sure, but it seems that one level is missing? - either organism or authority?

The description of the data should ultimately allow the same DAS source to be used in multiple DAS clients. Some validation will be required on the descriptions, to warn people that "homo sapiens" should not be written as "human" or "homo". Or, more complicated: Ensembl does not do assemblies itself. The assembly used is currently NCBI_35. Therefore "Ensembl" cannot be used as an authority for a chromosomal coordinate system. Currently the registry provides a restricted list of allowed coordinate systems, to keep this under control.

>> http://server/registry/list
>> http://server/registry/find?
>> [keyword,organism,authority,type,capability,label]=searchterm
>
> My proposal doesn't affect this.
>
> Why do "find" and "list" take different URLs? Another possibility
> is that the same URL returns everything if there are no filters
> in place.

Yes - better to use only one URL. No filters would return all sources.

> Are multiple search terms allowed?

yes

> Boolean AND or OR?

We can add a parameter where this can be chosen.
Greetings, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Mon Nov 21 17:06:25 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 21 Nov 2005 18:06:25 +0100 Subject: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 In-Reply-To: References: Message-ID: <90dff63fdc1e5b32ba97f8c18948758e@dalkescientific.com> Going through the back emails to prepare for the conference call in 30 minutes. Andreas, replying to Steve's comment: >> For protein DAS, authority typically defines two diff coord systems: >> 'pdb resnum, interprot' > >> It does not permit automated translation between two coord systems. > > unfortunately this is not that easy in protein space. The mapping from > the 3D > protein structure to the protein sequence is not straightforward. > Think of > negative, non-consecutive, and "non-numeric" residue numbers that can > appear > in the 3D structures. Therefore we came up with the "alignment" DAS - > document > that allows to map one object in one coordinate system to another one. > it can > also be used to map one assembly to another. Regarding the structure mapping, when we visited the PDB in August they said it's not a problem. The mmCIF records have the information needed for the mapping. I've not looked into this though. > not only - the DAS clients usually can display a certain "coordinate > system" e.g. Ensembl can do > Chromosomal ones, but if DAS sources are available that speak the > "UniProt, Protein Sequence" coordinate > system, it knows how to project these onto the genome. - an > "intelligent DAS client" :-) I like the use case of "user wants to merge annotations from different servers. As DAS currently doesn't have liftover support, the DAS client needs to get annotations only from servers using the same reference coordinate system." 
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Nov 21 17:08:30 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 21 Nov 2005 18:08:30 +0100 Subject: [DAS2] Getting individual features in DAS/1 In-Reply-To: References: Message-ID: <7f239b885d3eca821639654862770c65@dalkescientific.com> Has anyone answered Ilari's question? I never used DAS/1 enough to answer it myself. If the normal DAS list is closed, is this the right place for DAS/1 questions? On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: > This mail is not really about DAS/2, but the web site says the > original DAS mailing list is now closed. > > I am setting up a DAS server that serves CGH data from my database to > a visualization software, which in my case is gbrowse. I've already > set up Dazzle that serves the reference data from a local copy of > Ensembl. I need to be able to select individual CGH experiments to be > visualized, and as the measurements from a single CGH experiment cover > the entire genome, this cannot of course be done by specifying a > segment along with the features command. > > I noticed that there is a feature_id option for getting the features > in DAS/1.5, but on a closer look, it seems to work by getting the > segment that the specified feature corresponds to, and then getting > all features from that segment. My next approach was to use the > feature type to distinguish between different CGH experiments. As all > my data is of the type CGH, I thought that I could use spare this > piece of information for identifying purposes. > > First I tried the generic seqfeature plugin. I created a database for > it with some test data. However, getting features by type does not > seem to work. I always get all the features from the segment in > question. > > Next I tried the LDAS plugin. Again I created a compatible database > with some test data. 
> I must have done something wrong the the data
> file I imported to the database, because getting the features does not
> work. I can get the feature types, but trying to get the features
> gives me an ERRORSEGMENT error.
>
> I thought that before I go further, it might be useful to ask whether
> my approach seems reasonable, or is there a better way to achieve what
> I am trying to do? What should I do to be able to visualize individual
> CGH profiles?
>
> I'm grateful for any advice,
> Ilari

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Mon Nov 21 17:25:06 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 21 Nov 2005 18:25:06 +0100
Subject: [DAS2] das registry and das2
In-Reply-To:
References: <240aa4ff660b7427b6c463ffc10b1307@sanger.ac.uk> <4569f5d3ff6658e5ead6b979e8b1fba9@dalkescientific.com>
Message-ID: <21a521b096330a81bfa05b0789d3c92d@dalkescientific.com>

Andreas Prlic wrote:
> Therefore the "coordinate system" or "namespace" is an important part
> of the description of a DAS source.
>
> What I found in the current spec-draft that comes closest to this
> issue is the different "domains"
> e.g
>
> http://server/das/genome/source/version/features
>
> so I might want to say
> http://server/das/genome/homosapiens/ncbi35/features
> http://server/das/genome/musmusculus/ncbim34/features
>
> or should it be
> http://server/das/genome/ncbi/homosapiens35/features
> http://server/das/genome/ncbi/musmusculus34/features
> ?
>
> Hm. I am not sure, but it seems that one level is missing? - either
> organism or authority ?

The species information is available from the data source via the 'taxon' attribute; it's not available through the URL naming. That's arbitrary in that the data provider can use any term. I think there's nothing to preclude a provider from putting the actual source data one level deeper in the tree. Personally I find that that's over-classification. Who would use it?
> Currently the registry provides a restricted list of allowed
> coordinate systems, to keep this under control.

Thomas:
> This is possibly an argument for avoiding the use of URLs for assembly
> identifiers, if we can't be sure that the organisation that's the
> authority for a given assembly will be running an authoritative DAS
> server. URNs would be fine, as would the kind of structured but
> location-independent identifer that Andreas has been using.

I think there's no reason we can't use our own names for these. E.g., http://www.biodas.org/coordinates/NCBI35 or a simple unique id like "NCBI35". Right now those are treated as opaque identifiers. There's no name resolution going on, and the coordinates are (I assume) implicit in that client software doesn't resolve the name, only check that the servers are returning data from the same coordinate system. Perhaps in the future that URL might resolve to something, but there's no current reason to do so.

In the renewal grant there is reason to compare different coordinates. When that happens a client needs to pick one reference frame and get the translation information to the other. So the liftover service needs to know about the two coordinate systems. But it can be done through hard-coded information (perhaps with some information that coordinate system X is an alias for Y). I still don't think there's any need to resolve these URLs.

Andreas:
>> Are multiple search terms allowed?
>
> yes

Then they should likely be along the same lines used for the DAS/2 searching.

>> Boolean AND or OR?
>
> We can add a parameter where this can be chosen.

The existing DAS/2 uses an AND search only. Rather, "OR" for multiple values of the same field and "AND" across different fields.
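As a sketch of the combined search semantics just described (OR across repeated values of one field, AND across different fields, and a filter-less query returning everything), assuming a hypothetical in-memory list of source records rather than any actual registry API:

```python
# Sketch of the registry search semantics discussed above: multiple values
# for the same field are OR'ed, different fields are AND'ed, and an empty
# filter set returns everything ("find" with no filters behaves like "list").
# The record layout and field names are illustrative, not from any spec.

def matches(source, filters):
    """filters maps a field name to a list of acceptable values."""
    for field, wanted in filters.items():
        # AND across fields: every filtered field must match...
        if source.get(field) not in wanted:  # ...OR within one field
            return False
    return True

sources = [
    {"organism": "Homo sapiens", "authority": "NCBI 35", "capability": "features"},
    {"organism": "Mus musculus", "authority": "NCBIM 34", "capability": "features"},
]

# organism = (human OR mouse) AND capability = features
hits = [s for s in sources if matches(s, {
    "organism": ["Homo sapiens", "Mus musculus"],
    "capability": ["features"],
})]
print(len(hits))        # both sources match
print(len([s for s in sources if matches(s, {})]))  # no filters: all sources
```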
Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Nov 21 17:24:37 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 21 Nov 2005 09:24:37 -0800 Subject: [DAS2] Getting individual features in DAS/1 Message-ID: We need to discuss at today's meeting. I don't think the original DAS list should be closed, but rather continue to serve as a list to discuss the DAS/1 protocol and implementations, and the DAS2 mailing list should focus on DAS/2. If we mix DAS/1 and DAS/2 discussions in the same mailing list I think it's going to lead to a lot of confusion. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Monday, November 21, 2005 9:09 AM > To: DAS/2 > Subject: Re: [DAS2] Getting individual features in DAS/1 > > Has anyone answered Ilari's question? > > I never used DAS/1 enough to answer it myself. > > If the normal DAS list is closed, is this the right place for DAS/1 > questions? > > > On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: > > > This mail is not really about DAS/2, but the web site says the > > original DAS mailing list is now closed. > > > > I am setting up a DAS server that serves CGH data from my database to > > a visualization software, which in my case is gbrowse. I've already > > set up Dazzle that serves the reference data from a local copy of > > Ensembl. I need to be able to select individual CGH experiments to be > > visualized, and as the measurements from a single CGH experiment cover > > the entire genome, this cannot of course be done by specifying a > > segment along with the features command. > > > > I noticed that there is a feature_id option for getting the features > > in DAS/1.5, but on a closer look, it seems to work by getting the > > segment that the specified feature corresponds to, and then getting > > all features from that segment. 
My next approach was to use the > > feature type to distinguish between different CGH experiments. As all > > my data is of the type CGH, I thought that I could use spare this > > piece of information for identifying purposes. > > > > First I tried the generic seqfeature plugin. I created a database for > > it with some test data. However, getting features by type does not > > seem to work. I always get all the features from the segment in > > question. > > > > Next I tried the LDAS plugin. Again I created a compatible database > > with some test data. I must have done something wrong the the data > > file I imported to the database, because getting the features does not > > work. I can get the feature types, but trying to get the features > > gives me an ERRORSEGMENT error. > > > > I thought that before I go further, it might be useful to ask whether > > my approach seems reasonable, or is there a better way to achieve what > > I am trying to do? What should I do to be able to visualize individual > > CGH profiles? > > > > I'm grateful for any advice, > > Ilari > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Mon Nov 21 20:15:41 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 21 Nov 2005 12:15:41 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 21 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 21 Nov 2005. $Id: das2-teleconf-2005-11-21.txt,v 1.3 2005/11/21 20:15:28 sac Exp $ Attendees: Affy: Steve Chervitz, Gregg Helt UCLA: Allen Day, Brian O'connor UCBerkeley: Suzi Lewis, Nomi Harris Sweden: Andrew Dalke Sanger: Andreas Prlic Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. 
Instructions on how to access this repository are at http://biodas.org

Today's topic: Client-Server implementation issues
----------------------------------------------------

Suzi/Nomi
---------
Questions for gregg: How to communicate styles in DAS/2?
GH: Client gets stylesheets from the server that suggest how to render things.
AD: EBI uses this a lot. Most of the DAS systems there use stylesheets.
[A] Andreas will contact folks at Sanger/EBI for stylesheet example code.
GH: The IGB client uses a preference configuration, using java preferences rather than a special XML file. Windows: sets values in the registry. Has been successful. If a client can understand DAS/2 stylesheets and client-side prefs, the client-side prefs should override the server styles (others agree).

Steve
-----
* Reported on some analysis of Affymetrix DAS server weblogs. Lots of google-bot data download. Lots of spotfire hits, too.
BO: Google bots should respect robots.txt
[A] Steve will install robots.txt in the relevant locations
* Reported on getting Gregg's DAS/2 server to run on top of apache rather than as a stand-alone server. Should be a matter of hooking apache up to tomcat using a tomcat connector. Directive for apache to defer to tomcat for servlet requests.
[A] Steve will hook up affy das server to apache/tomcat.

Gregg
-----
* Regarding Spotfire - they are working on an IGB plugin to spotfire using the http localhost API. This explains our spotfire hits. Gregg was previously integrating IGB with spotfire using a java-to-COM bridge. It works, but the COM bridges aren't free, etc. They are interested in driving IGB from spotfire since they're interested in using IGB to provide genome visualization. Are currently evaluating whether to release it to the public or not. Gregg considered putting this in the grant, but it would have required permission, etc., and time was a factor. They may eventually commit to the IGB code base directly, but still need to work out legalese.
They will be interested in tracking the interclient API work we are doing (IGB-Apollo).
* No major work on DAS this week, just some niggling IGB issues.
* Planning another IGB release by end of year that will have improvements to DAS/2 clients.
Fixed: access via quickload then access to DAS/2 causes blankout of screen
Fixed: DAS/2 interaction

Brian
-----
* Marc C has committed stuff to IGB code base (genovis). Is there a test suite we can use to verify we're not breaking anything?
GH: No, but hopefully early next year. Definitely needed.
* Also checked in the re-factor - separate namespaces for assay and ontology.
[A] Gregg will relocate das2 package to com.affy.das2 & uncouple from IGB
GH: There are a few igb dependencies to be unraveled (das2feature...). Don't want to do this in the next release since that's pretty significant given upcoming holidays.
GH: Other features to get in:
* Persistence of preferences.
* Get rid of hardwiring of DAS2 servers. Already do this for DAS/1, just need to replicate for DAS/2.

Allen
-----
* API for handling ontologies, structures. Communication with Chris Mungall.
* Have impl at stanford for autocompletion of ontology terms related to samples (Gavin Sherlock's group, SMD). What is the bioontology group doing for distributing their ontologies, and what APIs are going to be made public?
SL: Am at stanford right now to talk about that. Will offer bulk things like at the obo site, but in terms of an interactive API, will respond to the community as best we can.
Allen: Interested in more integration with the bioontology group and with his work with SMD.
Suzi: Not content, but tools, right?
Allen: Yes.
Suzi: Work with chris. Timing couldn't be better.
[A] Allen will work with Chris M re: ontology API tools for OBO & SMD
* GH: Progress on writeback? Part of grant proposal to get it done by june. Will help funding continuation.
Allen: We could start implementing some of that given the refactoring that's now done.
GH: Ed Griffith at sanger is interested in this. Hoping for his participation. In the short timeframe, your server wouldn't have to implement it as long as there is at least one server available that can do it.
Allen: Need to look at work load. There's no lack of work to be done for get requests (faster impls).
GH: Would prefer to have just one writeback server and a faster get server rather than having two writeback-capable servers.
* Allen: Optimizations involving serving files, kind of a report-version of the chado adapters.
GH: Regarding your rounding-ranges optimization for tiling, can you post to the list?
[A] Allen will post his rounding ranges optimization to DAS/2 list
GH: The idea is to help server-side caching by rounding the range requests so you're more likely to hit the same URI (e.g., stop=5010 becomes 6000). Different clients are more likely to hit the cache. Not in the spec, just a convention. Requires more smarts in the client: giving more to the user than they asked for, or throwing out what's not asked for. Throwing out what they didn't ask for would be nicer. In theory, this won't be an issue with client caching.
SC: Could make the client's rounding behavior a configuration option.
GH: Users want fewer options.
* IGB display troubles. Allen had trouble getting it to display anything besides mRNA.
GH: IGB expects 2-level or deeper annotations. For single-level annots, should connect all with a line.
Allen: May be doing this for SNPs. But also saw some strange responses.
GH: Needs a fix.
Allen: Will it be in the next release?
GH: Harder to do it generally -- easier to hardwire it for particular data types. Rendering has to guess how deep you want to go. Currently goes to the leaves and then goes 1 level up, rather than top-down. IGB uses an extra level beyond what you actually see to keep track of other things (e.g., region in query). Preferences UI: 'nested' can select two-level or one-level deep. Would like to hear what other ones you have problems with.
[A] Gregg will fix IGB display problems for single-level annots.

Andrew
------
* Emailed open-bio root list to set up cgi for online verifier. But no response yet.
* DAS/1 vs DAS/2 mailing list.
GH: Confusion may occur if we combine DAS/1 and DAS/2 discussion. Let's keep DAS/1 for all DAS/1 spec related discussion.
[A] Steve will verify whether the DAS/1 list is still alive.
[A] Steve will put a link to the DAS/1 list on biodas.org
* Locking: Plan to talk to EBI about this in January. They are doing work on stylesheets.
[A] Andrew will ask Ed G. to join these meetings
* Needs test data, mock data set.
[A] Allen will point Andrew at some data for testing.

Andreas
-------
* The current registry implementation: Written in java. Two ways to interact:
1) html: can browse available DAS sources, see details, go back to the DAS client and activate the DAS source in the DAS client.
2) soap: client contacts registry, gets list of available sources.
Is open source.
[A] Andreas will post link to source code for DAS registry impl.
GH: A central registry is good, but companies will want their own. E.g., at affy there may be 5-7.
Andreas: It's possible to have a set of registries, local vs. public.
GH: Are you OK with the idea to have an http-based interface? It can run on top of the existing core.
Andreas: Sure.
[A] Andreas will provide http-based interface to Sanger DAS registry

Agenda for next week teleconf
-----------------------------
* Talk more about registry spec issues
* Retrieval spec issues:
  - Content-type
  - DAS/2 headers
  - Feature and type properties
  - other things?
Andrew: Prefer to have most of the discussion online (DAS/2 list); then the teleconf can be more productive.
[A] Continue discussing spec issues on the list before next teleconf

From allenday at ucla.edu Mon Nov 21 20:47:51 2005
From: allenday at ucla.edu (Allen Day)
Date: Mon, 21 Nov 2005 12:47:51 -0800 (PST)
Subject: [DAS2] tiled queries for performance
Message-ID:

Hi,

I had an idea of how clients may be able to get better response from servers by using a tiled query technique. Here's the basic idea:

ClientA wants features in chr1/1010:2020, and issues a request for that range. No other clients have previously requested this range, so the server-side cache faults to the DAS/2 service (slow).

ClientB wants features in chr1/1020:2030, and issues a request for that range. Although the intersection of the resulting records with ClientA's query is large, the URIs are different and the server-side cache faults again.

If ClientA and ClientB were to each issue two separate "tiled" requests:

1. chr1/1001:2000
2. chr1/2001:3000

ClientB could take advantage of the fact that ClientA had been looking at the same tiles.

For this to work, the clients would need to be using the same tile size. The optimal tile size is likely to vary from datasource to datasource, depending on the length and density distributions of the features contained in the datasource. The "sources" or "versioned sources" payload could suggest a tiling size to prospective clients. Servers could also pre-cache all tiles by hitting each tile after an update of the datasource (or the DAS/2 service code).

The tradeoff for the performance gains is that clients may now need to do filtering on the returned records to only return those requested by the client's client.

-Allen

From ap3 at sanger.ac.uk Tue Nov 22 13:54:27 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 22 Nov 2005 13:54:27 +0000
Subject: [DAS2] das registry links
Message-ID:

Hi!

There was a question yesterday where to get the source code from the das-registration server and if it is possible to have a local installation.
The source code for the registry is available under LGPL at http://www.derkholm.net/svn/repos/dasregistry/trunk/ using subversion.

To obtain a local installation, which caches/synchronizes the publicly available data and allows adding local DAS sources, see the instructions at: http://www.derkholm.net/svn/repos/dasregistry/trunk/release/install.txt

There is also a das-registry announce mailing list at http://lists.sanger.ac.uk/mailman/listinfo/das_registry_announce

Regards,
Andreas

-----------------------------------------------------------------------
Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891

From ap3 at sanger.ac.uk Tue Nov 22 17:58:08 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Tue, 22 Nov 2005 17:58:08 +0000
Subject: [DAS2] ensembl & stylesheet
Message-ID:

Hi!

Another question yesterday was about Ensembl & stylesheet support. An example DAS source that provides a stylesheet is the following:

http://das.ensembl.org/das/ens_35_segdup_washu/stylesheet

A description of it is at:

http://das.ensembl.org/das/ens_35_segdup_washu/

To show how it is rendered in Ensembl, follow this "auto-activation" link:

http://www.ensembl.org/Homo_sapiens/contigview?conf_script=contigview;c=17:14149999.5:1;w=200000;h=;add_das_source=(name=SEGDUP_WASHU+url=http://das.ensembl.org/das+dsn=ens_35_segdup_washu+type=ensembl_location+color=black+strand=r+labelflag=U+stylesheet=Y+group=Y+depth=9999+score=N+active=1)

In terms of source code, Ensembl uses the Bio::DasLite perl module for fetching features and stylesheets: http://search.cpan.org/~rpettett/Bio-DasLite-0.10/

Hope this helps,
Cheers,
Andreas

-----------------------------------------------------------------------
Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891

From gilmanb at pantherinformatics.com Mon Nov 21 21:46:25 2005
From: gilmanb at pantherinformatics.com (Brian Gilman)
Date: Mon, 21 Nov 2005 16:46:25 -0500
Subject:
[DAS2] tiled queries for performance In-Reply-To: References: Message-ID: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com> Hello Everyone, I've been lurking on the list and wanted to say hi. We're looking into this kind of implementation issue ourselves and thought that a BitTorrent-like cache makes the most sense, i.e., all servers in the "fabric" are issued the query in a certain "hop adjacency". These servers then send their data to the client, whose job it is to assemble the data. HTH, -B -- Brian Gilman President Panther Informatics Inc. E-Mail: gilmanb at pantherinformatics.com gilmanb at jforge.net AIM: gilmanb1 01000010 01101001 01101111 01001001 01101110 01100110 01101111 01110010 01101101 01100001 01110100 01101001 01100011 01101001 01100001 01101110 On Nov 21, 2005, at 3:47 PM, Allen Day wrote: > Hi, > > I had an idea of how clients may be able to get better response from > servers by using a tiled query technique. Here's the basic idea: > > ClientA wants features in chr1/1010:2020, and issues a request for > that > range. No other clients have previously requested this range, so the > server-side cache faults to the DAS/2 service (slow). > > ClientB wants features in chr1/1020:2030, and issues a request for > that > range. Although the intersection of the resulting records with > ClientA's > query is large, the URIs are different and the server-side cache > faults > again. > > If ClientA and ClientB were to each issue two separate "tiled" > requests: > > 1. chr1/1001:2000 > 2. chr1/2001:3000 > > ClientB could take advantage of the fact that ClientA had been > looking at > the same tiles. > > For this to work, the clients would need to be using the same tile > size. > The optimal tile size is likely to vary from datasource to datasource, > depending on the length and density distributions of the features > contained in the datasource. The "sources" or "versioned sources" > payload could suggest a tiling size to prospective clients.
> Servers could > also pre-cache all tiles by hitting each tile after an update of the > datasource (or the DAS/2 service code). > > The tradeoff for the performance gains is that clients may now need > to do > filtering on the returned records to only return those requested by > the > client's client. > > -Allen > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Wed Nov 23 16:03:55 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 23 Nov 2005 08:03:55 -0800 Subject: [DAS2] Simple Sharing Extensions for RSS and OPML Message-ID: This may have some concept relevant to DAS/2 writeback: http://msdn.microsoft.com/xml/rss/sse/ Steve From allenday at ucla.edu Wed Nov 23 23:50:24 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 23 Nov 2005 15:50:24 -0800 (PST) Subject: [DAS2] tiled queries for performance In-Reply-To: References: Message-ID: More thoughts on this. The client can eliminate the redundancy in the records returned by issuing the tiling queries as previously described (query1), then issuing queries for records that are not contained within tiles, but overlap the boundaries of 1 or more tiles (query2). However, by issuing all the overlaps queries at once, we've just deferred the performance hit one step, because we can't reasonably expect the server to have cached all combinations of tile overlaps queries. I think, to get this tiling optimization to work, the burden needs to be on the client to identify and remove duplicate responses for multiple edge-overlaps queries (query3). 1000bp 2000bp 3000bp | | | | === | =====^==== | | ====#===== | | ============#=============#===== | | | <-----------> query1a <-----------> query1b query2 query3a query3b Key: | : tile boundary = : feature ^ : gap between child features # : portion of feature overlapping tile boundary. 
: client overlaps query <.> : client contains query -Allen On Mon, 21 Nov 2005, Allen Day wrote: > Hi, > > I had an idea of how clients may be able to get better response from > servers by using a tiled query technique. Here's the basic idea: > > ClientA wants features in chr1/1010:2020, and issues a request for that > range. No other clients have previously requested this range, so the > server-side cache faults to the DAS/2 service (slow). > > ClientB wants features in chr1/1020:2030, and issues a request for that > range. Although the intersection of the resulting records with ClientA's > query is large, the URIs are different and the server-side cache faults > again. > > If ClientA and ClientB were to each issue two separate "tiled" requests: > > 1. chr1/1001:2000 > 2. chr1/2001:3000 > > ClientB could take advantage of the fact that ClientA had been looking at > the same tiles. > > For this to work, the clients would need to be using the same tile size. > The optimal tile size is likely to vary from datasource to datasource, > depending on the length and density distributions of the features > contained in the datasource. The "sources" or "versioned sources" > payload could suggest a tiling size to prospective clients. Servers could > also pre-cache all tiles by hitting each tile after an update of the > datasource (or the DAS/2 service code). > > The tradeoff for the performance gains is that clients may now need to do > filtering on the returned records to only return those requested by the > client's client. 
> > -Allen > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > From Steve_Chervitz at affymetrix.com Thu Nov 24 01:40:13 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 23 Nov 2005 17:40:13 -0800 Subject: [DAS2] Ontology Lookup Service Message-ID: Allen, This looks similar to what you have been working on for SMD: http://www.ebi.ac.uk/ontology-lookup/ Would be interesting to compare it with your ontology DAS-based implementation (e.g., performance, ease of installation, extending, etc.). Steve From dalke at dalkescientific.com Thu Nov 24 02:52:35 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 03:52:35 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: References: Message-ID: Allen: > No other clients have previously requested this range, so the > server-side cache faults to the DAS/2 service (slow). Admittedly I'm curious about this. Why is this slow? What does slow mean? I assume "cannot be returned faster than the network will take it." How many annotations are in the database? Figuring one annotation for every ... 100 bases? gives me 30 million. Shouldn't a range search over < only 30 million be fast? Is this being done in the database? Which database and what's the SQL? If the DB is the bottleneck then pulling it out as a specialized search might be worthwhile. What I'm driving at for this is this. The proposal feels like a workaround for a given implementation. To use it requires more smarts in the client. Why not put that logic on the server? Andrew dalke at dalkescientific.com From allenday at ucla.edu Thu Nov 24 07:10:36 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 23 Nov 2005 23:10:36 -0800 Subject: [DAS2] tiled queries for performance In-Reply-To: References: Message-ID: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> Hi Andrew. 
I'd like to be able to consistently get network-bottlenecked response from the server. The largest (250 megabase) SQL range queries typically take ~30 seconds to complete, returning ~500K features. I'm currently working on getting the templating system (Template Toolkit aka TT2) we use to flush to the client periodically, rather than building the entire response first. This is the current bottleneck; TT2 generation of a 500K record XML document takes many minutes. Regardless of how much more optimization work we put into the server, it's never going to be as fast as serving up pre-queried, pre-rendered content. I borrowed the idea of tiling from the Google maps application (maps.google.com). In their implementation the server is dumb, and just serves up a static HTML/Javascript document (the application), and static PNG images based on latitude/longitude coordinates (the data). All of the application logic for what to display occurs client side. Classic AJAX. In the DAS protocol, the application logic is distributed between the client and server, sometimes to ill effect. Requiring both (a) the server to respond to arbitrary range queries, and (b) the client to display arbitrary ranges unnecessarily creates a bifurcation of the View component of the application. Brian was hinting at this when he mentioned the idea of BitTorrent blocks earlier in the thread. We also require code redundancy between client and server to be able to fully use the type and exacttype filters. In this case the Model component has been bifurcated -- the client needs to build a model of the ontology (from who knows where... presumably processing OBO-Edit files) so the user can issue queries, and the server needs to also have some representation of the ontology to generate a response. Hopefully the ontology DAS extension will help the latter situation outlined above by getting both client and server to be synchronized on the same data model.
As far as the tiling optimization goes, it's likely that I'll implement a preprocessor for the HTTP query so I can break it into tiles -- conceptually very similar to the log10 binning that Lincoln does in the GFF database. -Allen On 11/23/05, Andrew Dalke wrote: > > Allen: > > No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). > > Admittedly I'm curious about this. Why is this slow? What does > slow mean? I assume "cannot be returned faster than the network > will take it." > > How many annotations are in the database? Figuring one annotation > for every ... 100 bases? gives me 30 million. Shouldn't a range > search over < only 30 million be fast? Is this being done in the > database? Which database and what's the SQL? > > If the DB is the bottleneck then pulling it out as a specialized > search might be worthwhile. > > What I'm driving at for this is this. The proposal feels like > a workaround for a given implementation. To use it requires > more smarts in the client. Why not put that logic on the server? > > > Andrew > dalke at dalkescientific.com > > From allenday at ucla.edu Thu Nov 24 07:21:48 2005 From: allenday at ucla.edu (Allen Day) Date: Wed, 23 Nov 2005 23:21:48 -0800 Subject: [DAS2] Re: Ontology Lookup Service In-Reply-To: References: Message-ID: <5c24dcc30511232321v70f77dc9y7a1ceef22bcf6edc@mail.gmail.com> Hi Steve. Yes, this is pretty similar to what we're doing. The major differences I see are (a) the query flexibility -- It only lets you retrieve terms from one ontology at a time, and does not support wildcards (b) the display -- it doesn't actually show you the dag structure of the ontology, and (c) using different tech -- Java/SOAP as opposed to Perl/ReST. 
-Allen On 11/23/05, Steve Chervitz wrote: > > Allen, > > This looks similar to what you have been working on for SMD: > > http://www.ebi.ac.uk/ontology-lookup/ > > Would be interesting to compare it with your ontology DAS-based > implementation (e.g., performance, ease of installation, extending, etc.). > > Steve > > From dalke at dalkescientific.com Thu Nov 24 13:28:00 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 14:28:00 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> References: <5c24dcc30511232310p1623ff4dk9088579cdf58e082@mail.gmail.com> Message-ID: <9eb929192db24ad93fb2a7cf423aa9c3@dalkescientific.com> Allen: > I'd like to be able to consistently get network-bottlenecked response > from the server. The largest (250 megabase) SQL range queries > typically take ~30 seconds to complete, returning ~500K features. I'm > currently working on getting the templating system (Template Toolkit > aka TT2) we use to flush to the client periodically, rather than > building the entire response first. This is the current bottleneck; > TT2 generation of a 500K record XML document takes many minutes. > Regardless of how much more optimization work we put into the server, > it's never going to be as fast as serving up pre-queried, pre-rendered > content. Interesting. So I was right, in that the range search is fast, but wrong in not considering the template generation problem. Could that cause a DoS attack by asking for several large ranges at once? You're building up multi-megabyte strings in memory. (If 1 feature is 1K then that's 500MB.) Ideologically the clean solution might be to have the search return only a list of identifiers and have the client fetch each feature one-by-one. This is a tile size of 1. Implementation-wise this will cause problems unless using HTTP 1.1 pipelining since the act of opening 500K connections takes non-trivial time.
Adding a "return XML for these ids" service doesn't help either - it brings us back to the same problem. But another solution is to cache all the features as XML, leaving out only the header and footer. Skip the templating system (rather, it's upstream of the caching). Do the search, get the ids, and stream the contents directly from the cache. This would be used in feature lookup and for search results. > In the DAS protocol, the distribution of the application logic is > distributed between the client and server, sometimes to ill effect. > Requiring both (a) the server to respond to arbitrary range queries, > and (b) the client to display arbitrary ranges unnecessarily creates a > bifurcation of the View component of the application. Brian was > hinting at this when he mentioned the idea of bittorrent blocks > earlier in the thread. What application logic? There should be many ways to build different applications on top of DAS. DAS is a data model. The client provides the view (or many views). There are two reasons for query support on the server. 1. slow bandwidth and limited client resources - otherwise clients could download and search the data locally 2. easier support for (certain classes of) application developers To make the Google comparison, there's no reason Google searches couldn't take place on your personal machine except that you can't download the Internet and search it in usable time. With Google providing the service others can do things like provide domain-specific web searches via Google, include Google links in a web browser, or make something like Googlefight. > We also require code redundancy between client and server to be able > to fully use the type and exacttype filters. In this case the Model > component has been bifurcated -- the client needs to build a model the > ontology (from who knows where...
presumably processing OBO-Edit > files) so the user can issue queries, and the server needs to also > have some representation of the ontology to generate a response. > > Hopefully the ontology DAS extension will help the latter situation > outlined above by getting both client and server to be synchronized on > the same data model. As far as the tiling optimization goes, it's > likely that I'll implement a preprocessor for the HTTP query so I can > break it into tiles -- conceptually very similar to the log10 binning > that Lincoln does in the GFF database. I didn't follow this. Code redundancy means what? There's an exchange of data models - in this case the model for a query. But any client/server needs to do this. Take Entrez, for example. It supports many types of search fields, including MeSH (which I think counts as an ontology). A sophisticated client may have a GUI to help people identify MeSH terms. This obviously duplicates some work done on the server. Is that what you mean? If so, why does it matter? Note also that while Google Maps serves static images only, there's shared logic between the application (in the browser) and the tools that generated those maps. Eg, both have the same code for understanding geography/latitude&longitude. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Nov 24 13:47:26 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 24 Nov 2005 14:47:26 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com> References: <2042BBCD-8490-461D-80C1-1BB4A1FAACB1@pantherinformatics.com> Message-ID: <22110007fe53238adbda91041ee1baf2@dalkescientific.com> Hi Brian, > We're looking into this kind of implementation issue ourselves and > thought that a bitorrent like cache makes the most sense. ie. all > servers in the "fabric" are issued the query in a certain "hop > adjacency".
These servers then send their data to the client who's job > it is to assemble the data. I go back and forth between the "large data set" model and the "large number of entities" model. In the first: - client requests a large data file - server returns it This can be sped up by distributing the file among many sites and using something like BitTorrent to put it together, or something like Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches. But making the code for this is complicated. It's possible to build on BitTorrent and similar systems, but I have no feel for the actual implementation cost, which makes me wary. I've looked into a couple of the P2P toolkits and not gotten the feel that it's any easier than writing HTTP requests directly. Plus, who will set up the alternate servers? In the second: - make query to server - server returns list of N identifiers - make N-n requests (where 'n' is the number of identifiers already resolved) The id resolution can be done in a distributed fashion and is easily supported via web caches, either with well-configured proxies or (again) through Coral. I like the latter model in part because it's more fine grained. Eg, a progress bar can say "downloading feature 4 of 10000", and if a given feature is already present there's no need to refetch it. The downside of the 2nd is the need for HTTP 1.1 pipelining to make it be efficient. I don't know if we want to have that requirement. Gregg came up with the range restrictions because most of the massive results will be from range searches. By being a bit more clever about tracking what's known and not known, a client can get a much smaller results page. These are complementary. Using Gregg's restricted range queries can reduce the number of identifiers returned in a search, making the network overhead even smaller. 
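[Editor's note: the second model above (a search returns a list of identifiers, and the client fetches only the ones it has not already resolved) can be sketched in a few lines of Python. This is a hypothetical illustration; the FeatureCache class and the fetch callback are invented names, not part of any DAS/2 codebase.]

```python
# Hypothetical sketch of the "list of identifiers" model: a search
# returns feature ids (URLs), and the client fetches only the ids it
# has not already resolved. Invented names, not a DAS/2 implementation.

class FeatureCache:
    def __init__(self):
        self._store = {}  # feature id (URL) -> feature record

    def resolve(self, ids, fetch):
        """Return records for ids, fetching only those not yet cached.

        Also returns how many fetches were issued (the N-n requests
        from the model described above)."""
        missing = [fid for fid in ids if fid not in self._store]
        for fid in missing:  # with HTTP 1.1 pipelining these could be batched
            self._store[fid] = fetch(fid)
        return [self._store[fid] for fid in ids], len(missing)
```

[On a second, overlapping query the client pays only for identifiers it has not seen before, which is what makes this model friendly to ordinary web caches.]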
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Nov 25 15:21:21 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 25 Nov 2005 16:21:21 +0100 Subject: [DAS2] DAS intro Message-ID: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> The front of the DAS doc starts DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. How about this instead, as an overview/introduction. ====== DAS/2 describes a data model for genome annotations. An annotation server provides information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Portions of the assembly may have higher relative accuracy than the assembly as a whole. A reference server may supply these portions as an alternate reference frame. Annotations are located on the genome with a start and end position. The range may be specified multiple times if there are alternate reference frames. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type. Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters. DAS/2 is implemented using a ReST architecture. Each entity (also called a document or object) has a name, which is a URL. Fetching the URL gets information about the entity. The DAS-specific entities are all XML documents. Other entities contain data types with an existing and frequently used file format.
Where possible, a DAS server returns data using existing formats. In some cases a server may describe how to fetch a given entity in several different formats. ====== Andrew dalke at dalkescientific.com From asims at bcgsc.ca Fri Nov 25 19:15:17 2005 From: asims at bcgsc.ca (Asim Siddiqui) Date: Fri, 25 Nov 2005 11:15:17 -0800 Subject: [DAS2] tiled queries for performance Message-ID: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> Hi, I'm a newbie to this list, so apologies if I've missed something critical. I think this is a great idea. I don't see this as a big change to the DAS/2 spec or requiring much in the way of additional smarts on the client side. The change is simply that instead of the client getting exactly what it asks for, it may get more. My 2 cents, Asim -----Original Message----- From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open-bio.org] On Behalf Of Allen Day Sent: Wednesday, November 23, 2005 11:11 PM To: Andrew Dalke; DAS/2 Subject: Re: [DAS2] tiled queries for performance Hi Andrew. I'd like to be able to consistently get network-bottlenecked response from the server. The largest (250 megabase) SQL range queries typically take ~30 seconds to complete, returning ~500K features. I'm currently working on getting the templating system (Template Toolkit aka TT2) we use to flush to the client periodically, rather than building the entire response first. This is the current bottleneck; TT2 generation of a 500K record XML document takes many minutes. Regardless of how much more optimization work we put into the server, it's never going to be as fast as serving up pre-queried, pre-rendered content. I borrowed the idea of tiling from the Google maps application ( maps.google.com). In their implementation the server is dumb, and just serves up a static HTML/Javascript document (the application), and static PNG images based on latitute/longitude coordinates (the data). 
All of the application logic for what to display occurs client side. Classic AJAX. In the DAS protocol, the distribution of the application logic is distributed between the client and server, sometimes to ill effect. Requiring both (a) the server to respond to arbitrary range queries, and (b) the client to display arbitrary ranges unnecessarily creates a bifurcation of the View component of the application. Brian was hinting at this when he mentioned the idea of bittorrent blocks earlier in the thread. We also require code redundancy between client and server to be able to fully use the type and exacttype filters. In this case the Model component has been bifurcated -- the client needs to build a model the ontology (from who knows where... presumably processing OBO-Edit files) so the user can issue queries, and the server needs to also have some representation of the ontology to generate a response. Hopefully the ontology DAS extension will help the latter situation outlined above by getting both client and server to be synchronized on the same data model. As far as the tiling optimization goes, it's likely that I'll implement a preprocessor for the HTTP query so I can break it into tiles -- conceptually very similar to the log10 binning that Lincoln does in the GFF database. -Allen On 11/23/05, Andrew Dalke wrote: > > Allen: > > No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). > > Admittedly I'm curious about this. Why is this slow? What does slow > mean? I assume "cannot be returned faster than the network will take > it." > > How many annotations are in the database? Figuring one annotation for > every ... 100 bases? gives me 30 million. Shouldn't a range search > over < only 30 million be fast? Is this being done in the database? > Which database and what's the SQL? > > If the DB is the bottleneck then pulling it out as a specialized > search might be worthwhile. 
> > What I'm driving at for this is this. The proposal feels like a > workaround for a given implementation. To use it requires more smarts > in the client. Why not put that logic on the server? > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ DAS2 mailing list DAS2 at portal.open-bio.org http://portal.open-bio.org/mailman/listinfo/das2 From suzi at fruitfly.org Fri Nov 25 22:20:29 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Fri, 25 Nov 2005 14:20:29 -0800 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <59fa39752e4d792d2142fe2682813937@fruitfly.org> a few minor in-line edits below. trying to simplify and not confuse, as this is just an intro. On Nov 25, 2005, at 7:21 AM, Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, > including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE > . An annotation > server provides SUCH > information FOR > one or more genome SEQUENCES. > Each GENOMIC SEQUENCE > may have one or more versions. Different versions are usually > based on different assemblies. As an implementation detail an > assembly and corresponding sequence data may be distributed via a > different machine, which is called the reference server. (DELETED LAST 2 SENTENCES). > > Annotations are located on the genome with a start and end position. > The range may be specified mutiple times if there are alternate > SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). 
> An annotation may contain multiple non-continguous > parts (DELETED PHRASE AND SENTENCE) > Annotations have a type based on terms in SOFA > (Sequence Ontology for Feature Annotation). Stylesheets contain a set > of properties used to depict a given type. > > Annotations can be searched by range, type, and a properties table > associated with each annotation. These are called feature filters. > > DAS/2 is implemented using a ReST architecture. Each entity (also > called a document or object) has a name, which is a URL. Fetching the > URL gets information about the entity. The DAS-specific entities are > all XML documents. Other entities contain data types with an existing > and frequently used file format. Where possible, a DAS server returns > data using existing formats. In some cases a server may describe how > to fetch a given entity in several different formats. > ====== > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Nov 25 23:43:10 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 26 Nov 2005 00:43:10 +0100 Subject: [DAS2] tiled queries for performance In-Reply-To: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> Message-ID: <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> Asim Siddiqui > I think this is a great idea. > > I don't see this as a big change to the DAS/2 spec or requiring much in > the way of additional smarts on the client side. I agree with Allen on this - in some sense there's no effect on the spec. It ends up being an agreement among the clients to request aligned data, by rounding up/down to the nearest, say, kilobase, and for the server implementers to cache those requests.
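[Editor's note: the rounding agreement described above might look like the following sketch. It is hypothetical: the 1 kb tile size and the function name are assumptions, and per the thread a real datasource could suggest its own tile size. Half-open, 0-based coordinates are used for simplicity.]

```python
# Hypothetical sketch: snap a requested range outward to fixed tile
# boundaries so that clients asking about overlapping regions issue
# identical (and therefore cacheable) requests. The 1 kb tile size is
# an assumption; the "sources" payload could suggest a different one.

TILE = 1000

def tile_requests(start, end, tile=TILE):
    """Expand [start, end) outward to tile boundaries, one request per tile."""
    lo = (start // tile) * tile   # round start down to a boundary
    hi = -(-end // tile) * tile   # round end up to the next boundary
    return [(s, s + tile) for s in range(lo, hi, tile)]
```

[Under such an agreement, ClientA's chr1/1010:2020 and ClientB's chr1/1020:2030 from earlier in the thread collapse to the same two tile requests, so the second client hits a warm cache.]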
> The change is simply that instead of the client getting exactly what it > asks for, it may get more. While that's another matter - the client makes a request and the server is free to expand the range to something it can handle a bit better. Allen? Were you suggesting this instead? In this case there is a change to the spec, and all clients must be able to filter or otherwise ignore extra results. I personally think it's an implementation issue related to performance and there are ways to make the results be generated fast enough. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sat Nov 26 00:35:45 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 26 Nov 2005 01:35:45 +0100 Subject: [DAS2] DAS intro In-Reply-To: <59fa39752e4d792d2142fe2682813937@fruitfly.org> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> Message-ID: Hi Suzi, You're supposed to be on holiday - it's Thanksgiving after all. Though I'm not celebrating it until next week. I wonder where I can find pumpkin pie mix here ... >> DAS/2 describes a data model for genome annotations > , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE Changed, along with the other fixes. > (DELETED LAST 2 SENTENCES). That was the two lines about >> Portions of >> the assembly may have higher relative accuracy than the assembly as a >> whole. A reference server may supply these portions as an alternate >> reference frame. In the intro I want to mention all of the parts of DAS. The problem is that I still don't understand the /region request. These two lines were my best attempt at explaining them. Was the deletion because my understanding is wrong or because it's not needed for the intro? I think my confusion is related to the concept you mention in: >> Annotations are located on the genome with a start and end position.
>> The range may be specified mutiple times if there are alternate >> > SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). because I don't understand what I should change. I made up the term 'reference frame' because of my physics training. Is it the correct term here? Does 'reference frame' as it's normally used only refer to the full assembly or does it refer to each "/region" as well? If I give the coordinates on a contig can I say it's in the reference frame of that contig? (Hmm, David Block agrees with me, according to http://open-bio.org/bosc2001/abstracts/lightning/block The presence of a Tiling_Path table allows the loading of any arbitrary length of sequence, in the reference frame of any of the contigs that make up the tiling path. ) I thought it was important to mention that a given annotation may have "several tags if the feature's location can be represented in multiple coordinate systems (e.g. multiple builds of a genome or multiple contigs)" Then again, I don't understand how a given feature can be annotated on multiple builds because I thought that a feature was only associated with a single versioned source, and a versioned source has only one build. I would like to have something in the intro which mentions "/region". I just don't know how to do it. Why does anyone care about regions and not just point directly to the sequence? >> An annotation may contain multiple non-continguous >> parts > > (DELECTED PHRASE AND SENTENCE) The deleted text there was ", making it the parent of those parts. Some parts may have more than one parent." I put it there because I remember we talked a lot about this at CSHL a couple years back and wanted to make sure the data model handled cases where, say, there were two parents to three parts. It seems to me that that structure is important enough that someone who is trying to get a quick understanding of DAS annotations would be interested in it.
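[Editor's note: the "two parents to three parts" case mentioned above can be made concrete with a minimal sketch. The classes below are invented for illustration, not the DAS/2 schema; since a part may list several parents, the annotation structure is a directed graph rather than a strict tree.]

```python
# Minimal sketch (invented classes, not the DAS/2 schema) of features
# whose parts may have more than one parent, e.g. two parent
# annotations sharing one of three parts.

class Feature:
    def __init__(self, fid):
        self.fid = fid
        self.parents = []
        self.parts = []

    def add_part(self, part):
        """Link a child part; the part records this feature as a parent."""
        self.parts.append(part)
        part.parents.append(self)

# Two parents sharing the middle of three parts:
p1, p2 = Feature("parent1"), Feature("parent2")
a, b, c = Feature("part_a"), Feature("part_b"), Feature("part_c")
p1.add_part(a); p1.add_part(b)
p2.add_part(b); p2.add_part(c)
```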
My internal model for the expected reader is someone like Allen or Gregg - people who have some experience in data models for annotations and would like to know that DAS can handle those sorts of more complicated tree structures. I'm willing to move it further into the text, but I'm not convinced that it makes things less confusing or simpler. Features having parts and parents is an essential part of the DAS data model. Andrew dalke at dalkescientific.com From suzi at fruitfly.org Sat Nov 26 01:44:54 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Fri, 25 Nov 2005 17:44:54 -0800 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> Message-ID: <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> Hi Andrew, so there seem to be 2 questions. it would be good to have both in the intro, but only as long as the description can be clearly stated in just a sentence or two. If it takes more then it is clearly something that requires a fuller description outside of the intro. I'll try to give my understanding (but goodness knows I am peering through different lenses). I don't think in terms of the spec at all, just the information that needs to be conveyed. #1 "reference frame" ========================================= "reference frame", is (to my mind) "reference sequence". at least, that is what i've always called it. First, accuracy has nothing at all to do with it, so we don't want the sentence in there. Second, the region of sequence that is returned is nothing more than that. Think of it as a special type of feature. This is what makes a transformation possible from one coordinate-system to another (by adding the correct offsets) Third, just think of "reference sequence" as a coordinate system. One can have the exact same feature and indicate that: on coordinate-system-A this feature starts and ends here, and on coordinate-system-B it starts and ends there. 
Thus a feature's coordinates may be given both on a chromosome, and on a contig, and on any other coordinate-system that can be derived through a transform from these. So you could change the sentence below to read "A reference server may supply features where the locations (start and end) are relative to either contigs, some other arbitrary region, or to the entire chromosome."

#2 "multiple parents" =========================================

It still is easier for me to think of this in terms of sequences. We may know that somewhere out in the world a sequence must exist, but the data/sequence we have collected is fragmentary. For example, thinly sequenced genomes (resulting in many separate contigs) or a pair of ESTs from a cDNA. In either of these cases we need to be able to have the many to many relationships you talk about. This one is perhaps too subtle for the introduction, but if we decide to include it then I think it should first be phrased in terms of the problem (biological sampling) and then in terms of the solution (multiple parents).

-S

On Nov 25, 2005, at 4:35 PM, Andrew Dalke wrote:
> Hi Suzi,
>
> You're supposed to be on holiday - it's Thanksgiving after all.
>
> Though I'm not celebrating it until next week. I wonder where
> I can find pumpkin pie mix here ...
>
>>> DAS/2 describes a data model for genome annotations
>> , THAT IS, DESCRIPTIONS OF FEATURES LOCATED ON THE GENOMIC SEQUENCE
>
> Changed, along with the other fixes.
>
>> (DELETED LAST 2 SENTENCES).
>
> That was the two lines about
>
>>> Portions of
>>> the assembly may have higher relative accuracy than the assembly as a
>>> whole. A reference server may supply these portions as an alternate
>>> reference frame.
>
> In the intro I want to mention all of the parts of DAS. The
> problem is that I still don't understand the /region request.
> These two lines were my best attempt at explaining them.
> > Was the deletion because my understanding is wrong or because it's > not needed for the intro? > > I think my confusion is related the concept you mention in: >>> Annotations are located on the genome with a start and end position. >>> The range may be specified mutiple times if there are alternate >>> >> SEQUENCES THEY MAY BE PLACED UPON (REFERENCE FRAMES). > > because I don't understand what I should change. I made up the > term 'reference frame' because of my physics training. Is it > the correct term here? Does 'reference frame' as it's normally > used only refer to the full assembly or does it refer to each > "/region" as well? If I give the coordinates on a contig can > I say it's in the reference frame of that contig? > > (Hmm, David Block agrees with me, according to > http://open-bio.org/bosc2001/abstracts/lightning/block > The presence of a Tiling_Path table allows the loading of > any arbitrary length of sequence, in the reference frame > of any of the contigs that make up the tiling path. ) > > > > I thought it was important to mention that a given annotation > may have "several tags if the feature's location can be > represented in multiple coordinate systems (e.g. multiple builds > of a genome or multiple contigs)" > > Then again, I don't understand how a given feature can be > annotated on multiple builds because I thought that a feature > was only associated with a single versioned source, and a > versioned source has only one build. > > > I would like to have something in the intro which mentions > "/region". I just don't know how to do it. Why does anyone > care about regions and not just point directly to the sequence? > >>> An annotation may contain multiple non-continguous >>> parts >> >> (DELECTED PHRASE AND SENTENCE) > > The deleted text there was ", making it the parent of those parts. > Some parts may have more than one parent." 
> > I put it there because I remember we talked a lot about this > at CSHL a couple years back and wanted to make sure the data > model handled cases where, say, there were two parents to three > parts. I seems to me that that structure is important enough > that someone who is trying to get a quick understanding of > DAS annotations would be interested in it. > > My internal model for the expected reader is someone like > Allen or Gregg - people who have some experience in data > models for annotations and would like to know that DAS > can handle those sorts of more complicated tree structures. > > I'm willing to move it further into the text, but I'm not > convinced that it makes things less confusing or simpler. > Features having parts and parents is an essential part of > the DAS data model. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Sun Nov 27 01:20:24 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 27 Nov 2005 02:20:24 +0100 Subject: [DAS2] DAS intro In-Reply-To: <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> Message-ID: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Suzi: > so there seem to be 2 questions. it would be good to have both in the > intro, but only as long as the description can be clearly stated in > just a sentence or two. If it takes more then it is clearly something > that requires a fuller description outside of the intro. Agreed. > I'll try to give my understanding (but goodness knows I am peering > through different lenses). I don't think in terms of the spec at all, > just the information that needs to be conveyed. 
>
> #1 "reference frame" =========================================
>
> "reference frame", is (to my mind) "reference sequence". at least,
> that is what i've always called it.
> First, accuracy has nothing at all to do with it, so we don't want the
> sentence in there.

I'm fine with that. I've found it best to declare my ignorance early rather than to keep it hidden.

> Second, the region of sequence that is returned is nothing more than
> that. Think of it as a special type of feature. This is what makes a
> transformation possible from one coordinate-system to another (by
> adding the correct offsets)

I can think of it as a feature just fine. But then shouldn't each region also be a feature? Why wouldn't all contigs be visible as an annotation?

Contigs are in SOFA as

  @is_a@ contig ; SO:0000149  @is_a@ assembly_component ; SO:0000143  @part_of@ supercontig ; SO:0000148

What advantage is there to break this feature out at a "/region"?

One that I can see is that the reference server provides the regions while the annotation server provides the other features. But if that's the case we could have the reference server also provide the regions as features, and the annotation server makes references to those features rather than to regions.

That is, in the current scheme we have: has 0 or more element, where the 'pos' attribute links to region + start/stop range and the optional 'seq' attribute links to the sequence range, as in: is only a link to the sequence and a length, as in:

One alternate possibility is to change that so "pos" points to a /feature (instead of a /region) and have features for each contig or other assembly component. The result would look like: ...

Doing this, however, means that all features must support subranges.

As an alternate solution without ranges, use and then look up the sequence coordinates of feature/AB1234 to figure out where it starts/stops.

The other advantage to a region is you can ask for the assembly via the 'agp' format.
But because of the existing support for formats which are only valid for some features you can do that by asking for, say, all assembly_component features (via the feature filter) and return the results in 'agp' format.

> Third, just think of "reference sequence" as a coordinate system. One
> can have the exact same feature and indicate that: on
> coordinate-system-A this feature starts and ends here, and on
> coordinate-system-B it starts and ends there. Thus a feature's
> coordinates may be given both on a chromosome, and on a contig, and on
> any other coordinate-system that can be derived through a transform
> from these.

I believe I understand this. There really is only one reference frame for the entire genome sequence, for a given assembly, and all other coordinate systems are a fixed and definite offset of that single reference frame. I believe this is called the golden path?

My reference to accuracy is because I figured that given two features A and B on an assembly component X then the fuzziness in the relative distance between A and B is small if X is also small. That is, smaller terms are less likely to have changes as the golden path changes.

> So you could change the sentence below to read "A reference server
> may supply features where the locations (start and end) are relative
> to either contigs, some other arbitrary region, or to the entire
> chromosome."

Why not always supply it relative to the chromosome coordinates? The spec now allows that as an optional field. I can't figure out why you would want to do otherwise.

Is it because sometimes it's easier to work with, say, a large number of contig reference frames than with one large reference frame? Does that mean we shift the complexity of coordinate translation from the data provider to the data consumer? (Making it easier to generate data than to consume data.)
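The fixed-offset relationship between coordinate systems that Suzi and Andrew are discussing can be sketched roughly as follows. The contig names and offset values are made up for illustration:

```python
# Hypothetical offsets of assembly components on the chromosome's golden path.
CONTIG_OFFSETS = {"ctg123": 1_000_000, "ctg124": 1_050_000}

def contig_to_chrom(contig, start, end):
    """Translate a contig-relative range into chromosome coordinates.
    The 'transform' between the two coordinate systems is just the
    contig's fixed offset on the golden path."""
    off = CONTIG_OFFSETS[contig]
    return off + start, off + end

# The 1271:1507 range, placed on ctg123, expressed on the chromosome.
chrom_range = contig_to_chrom("ctg123", 1271, 1507)
```

This is the whole trick behind giving the same feature's location on "coordinate-system-A" and "coordinate-system-B": both are recoverable from one another by adding or subtracting the component's offset.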
> This one is perhaps too subtle for the introduction, but if we decide > to include it then I think it should first be phrased in terms of the > problem (biological sampling) and then in terms of the solution > (multiple parents). Oh, definitely. It's some place where I just don't have the domain knowledge to explain it or even come up with examples. Andrew dalke at dalkescientific.com From suzi at fruitfly.org Sun Nov 27 01:24:07 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Sat, 26 Nov 2005 17:24:07 -0800 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: Lets add this to the agenda for Monday morning. Hopefully that will be faster than via e-mail. On Nov 26, 2005, at 5:20 PM, Andrew Dalke wrote: > Suzi: >> so there seem to be 2 questions. it would be good to have both in the >> intro, but only as long as the description can be clearly stated in >> just a sentence or two. If it takes more then it is clearly something >> that requires a fuller description outside of the intro. > > Agreed. > >> I'll try to give my understanding (but goodness knows I am peering >> through different lenses). I don't think in terms of the spec at all, >> just the information that needs to be conveyed. >> >> #1 "reference frame" ========================================= >> >> "reference frame", is (to my mind) "reference sequence". at least, >> that is what i've always called it. > > >> First, accuracy has nothing at all to do with it, so we don't want >> the sentence in there. > > I'm fine with that. I've found it best to declare my ignorance early > than to keep it hidden. > >> Second, the region of sequence that is returned is nothing more than >> that. Think of it as a special type of feature. 
This is what makes a >> transformation possible from one coordinate-system to another (by >> adding the correct offsets) > > I can think of it as a feature just fine. But then shouldn't each > region > also be a feature? Why wouldn't all contigs be visible as an > annotation? > > Contigs are in SOFA as > > @is_a at contig ; SO:0000149 @is_a@ assembly_component ; > SO:0000143 @part_of@ supercontig ; SO:0000148 > > What advantage is there to break this feature out at a "/region"? > > One that I can see is that the reference server provides the regions > while the annotation server provides the other features. But if > that's the case we could have the reference server also provide the > regions as features, and the annotation server makes references to > those features rather than to regions. > > That is, in the current scheme we have: > > has 0 or more element, where the 'pos' attribute > links to region + start/stop range and the optional 'seq' attribute > links to the sequence range, as in: > > seq="sequence/Chr3/1271:1507:1"/> > > > is only a link to the sequence and a length, as in: > > > > > One alternate possibility is to change that so "pos" points to a > /feature (instead of a /region) and have features for each contig or > other assembly component. The result would look like: > > seq="sequence/Chr3/1271:1507:1"/> > > ... > > Doing this, however, means that all features must support subranges. > > > As an alternate solution without ranges, use > > > > and then look up the sequence coordinates of feature/AB1234 to > figure out where it starts/stops. > > > The other advantage to a region is you can ask for the assembly > via the 'agp' format. But because of the the existing support for > formats which are only valid for some feature you can do that by asking > for, say, all assembly_component features (via the feature filter) and > return > the results in 'agp' format. > >> Third, just think of "reference sequence" as a coordinate system. 
One >> can have the exact same feature and indicate that: on >> coordinate-system-A this feature starts and ends here, and on >> coordinate-system-B it starts and ends there. Thus a feature's >> coordinates may be given both on a chromosome, and on a contig, and >> on any other coordinate-system that can be derived through a >> transform from these. > > I believe I understand this. There really is only one reference frame > for > the entire genome sequence, for a given assembly, and all other > coordinate > systems are a fixed and definite offset of that single reference frame. > I believe this is called the golden path? > > My reference to accuracy is because I figured that given two features > A and B on an assembly component X then the fuzziness in the relative > distance between A and B is small if X is also small. That is, smaller > terms are less likely to have changes as the golden path changes. > > >> So you could change the sentence below to read "A reference server >> may supply features where the locations (start and end) are relative >> to either contigs, some other arbitrary region, or to the entire >> chromosome." > > Why not always supply it relative to the chromosome coordinates? The > spec > now allows that as an optional field. I can't figure out why you would > want to do otherwise. > > Is it because sometimes it's easier to work with, say, a large number > of > contig reference frames than with one large reference frame? Does that > mean we shift the complexity of coordinate translation from the data > provider to the data consumer? (Making it easier to generate data than > to consume data.) > > >> This one is perhaps too subtle for the introduction, but if we decide >> to include it then I think it should first be phrased in terms of the >> problem (biological sampling) and then in terms of the solution >> (multiple parents). > > Oh, definitely. 
> It's some place where I just don't have the domain
> knowledge to explain it or even come up with examples.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From Gregg_Helt at affymetrix.com Mon Nov 28 09:44:18 2005
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 28 Nov 2005 01:44:18 -0800
Subject: [DAS2] tiled queries for performance
Message-ID:

> -----Original Message-----
> From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open-bio.org] On Behalf Of Andrew Dalke
> Sent: Thursday, November 24, 2005 5:47 AM
> To: Brian Gilman
> Cc: DAS/2
> Subject: Re: [DAS2] tiled queries for performance
>
> Hi Brian,
>
> > We're looking into this kind of implementation issue ourselves and
> > thought that a BitTorrent-like cache makes the most sense. ie. all
> > servers in the "fabric" are issued the query in a certain "hop
> > adjacency". These servers then send their data to the client whose job
> > it is to assemble the data.
>
> I go back and forth between the "large data set" model and the "large
> number of entities" model.
>
> In the first:
> - client requests a large data file
> - server returns it
>
> This can be sped up by distributing the file among many sites and
> using something like BitTorrent to put it together, or something like
> Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches.
>
> But making the code for this is complicated. It's possible to build
> on BitTorrent and similar systems, but I have no feel for the actual
> implementation cost, which makes me wary. I've looked into a couple
> of the P2P toolkits and not gotten the feel that it's any easier than
> writing HTTP requests directly. Plus, who will set up the alternate
> servers?
My hope would be that any system like this could be hidden behind a single HTTP GET request and hence require no changes to the DAS/2 protocol. Standard web caches already work this way. I'm less familiar with the BitTorrent approach, but I'm guessing that the client-side code that stitches together the pieces from multiple servers could be encapsulated in a client-side daemon that responds to localhost HTTP calls. > In the second: > - make query to server > - server returns list of N identifiers > - make N-n requests (where 'n' is the number of identifiers already > resolved) > > The id resolution can be done in a distributed fashion and is easily > supported via web caches, either with well-configured proxies or (again) > through Coral. > > I like the latter model in part because it's more fine grained. Eg, > a progress bar can say "downloading feature 4 of 10000", and if a given > feature is already present there's no need to refetch it. > > The downside of the 2nd is the need for HTTP 1.1 pipelining to make it > be efficient. I don't know if we want to have that requirement. I'm wary of this "large number of entities" approach, for several reasons. Due to the overhead for TCP/IP, HTTP headers, and extra XML stuff like doctype and namespace declarations, making an HTTP GET request per feature will increase the total number of bytes that need to be transmitted. It will also increase the parsing overhead on the client side. And if the features contain little information (for example just type, parts/parents, and location) that overhead could easily exceed the time taken to process the "useful" data. As you indicated, some performance problems could be alleviated by HTTP 1.1 pipelining, but that adds additional requirements to both client and server. 
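The "large number of entities" exchange described above - the server returns a list of N identifiers, and the client fetches only the ones it has not already resolved - can be sketched as follows. `fetch_one` is a hypothetical stand-in for a real per-feature HTTP GET:

```python
def fetch_missing(ids, cache, fetch_one):
    """Resolve feature ids, fetching only those not already cached;
    returns the features in the order the server listed them."""
    for fid in ids:
        if fid not in cache:
            cache[fid] = fetch_one(fid)  # one HTTP GET per unresolved id
    return [cache[fid] for fid in ids]

calls = []
def fake_fetch(fid):  # stand-in for a network request, records each call
    calls.append(fid)
    return {"id": fid}

cache = {"feat/1": {"id": "feat/1"}}  # feat/1 was resolved earlier
feats = fetch_missing(["feat/1", "feat/2"], cache, fake_fetch)
```

The cache makes the "N-n requests" part concrete: already-known ids cost nothing, which is the attraction of the model; the per-request overhead Gregg describes is exactly the cost of each `fetch_one` call that cannot be avoided.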
Also, for persistent caching on the local machine when you start splitting up the data into hundreds of thousands of files, I suspect the additional disk seek time will far exceed disk read time and become a serious performance impediment. Having said that, in theory this approach is (almost) testable using the current DAS/2 spec. Create one DAS/2 server that in response to feature queries returns only the minimum required information for "N" features: id and type. And have feature ids returned be URLs on another DAS/2 server that _does_ return full feature information (location, alignment, etc.). Then make "N-n" single-feature queries with those URLs to get full information. Due to the current DAS/2 requirement that any parts / parents referenced also be included in the same XML doc, this would only be a reasonable test for features with no hierarchical structure, such as SNPs. > Gregg > came up with the range restrictions because most of the massive results > will be from range searches. By being a bit more clever about tracking > what's known and not known, a client can get a much smaller results > page. > > > These are complementary. Using Gregg's restricted range queries can > reduce the number of identifiers returned in a search, making the > network overhead even smaller. 
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From Gregg_Helt at affymetrix.com Mon Nov 28 10:05:33 2005
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 28 Nov 2005 02:05:33 -0800
Subject: [DAS2] das registry and das2
Message-ID:

> -----Original Message-----
> From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open-bio.org] On Behalf Of Andrew Dalke
> Sent: Friday, November 18, 2005 10:00 AM
> To: DAS/2
> Subject: Re: [DAS2] das registry and das2
>
> Andreas Prlic:
> > I would like to start a discussion of how to provide a proper DAS
> > interface for our das- registration server at
> > http://das.sanger.ac.uk/registry/
> >
> > Currently it is possible to interact with it using SOAP, or manually
> > via the HTML interface. We should also make it accessible using URL
> > requests.
>
> One of the things Gregg and I talked about at ISMB was that the
> top-level "das-sources" format is, or can be, identical to what's
> needed for the registry server.

Some of what we discussed I wrote up in a post earlier this year:
http://portal.open-bio.org/pipermail/das2/2005-June/000198.html

Another post that might be useful in current discussions is a summary of what was discussed in the DAS/2 registry meeting we had in Hinxton back in September 2004:
http://portal.open-bio.org/pipermail/das2/2005-June/000197.html

gregg

From Gregg_Helt at affymetrix.com Mon Nov 28 10:58:00 2005
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 28 Nov 2005 02:58:00 -0800
Subject: [DAS2] tiled queries for performance
Message-ID:

The attachment is a PowerPoint slide showing one of the feature query optimizations that the IGB client currently uses, which combines "overlaps" and "inside" filters. When used consistently this guarantees that the same feature is not returned in multiple feature queries.
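A rough sketch of tile-aligned queries with client-side deduplication. The 1000 bp tile size and the "a tile's query claims only features that start inside it" rule are assumptions for illustration, not necessarily the exact overlaps/inside combination in Gregg's slide:

```python
TILE = 1000  # hypothetical tile size, e.g. one suggested by the server

def tiles_for(start, end):
    """Tile-aligned subranges covering [start, end), as in Allen's
    proposal: every client asking near this range issues the same URIs,
    so server-side caching actually gets hits."""
    first, last = start // TILE, (end - 1) // TILE
    return [(t * TILE, (t + 1) * TILE) for t in range(first, last + 1)]

def claimed_by(tile_start, tile_end, features):
    """One possible dedup rule: a tile's query claims only the features
    whose start falls inside that tile, so a feature spanning a tile
    boundary is returned by exactly one tile query."""
    return [f for f in features if tile_start <= f[0] < tile_end]

features = [(950, 1200), (1500, 1600), (1990, 2500)]  # (start, end) pairs
seen = []
for ts, te in tiles_for(1010, 2020):  # the chr1/1010:2020 request
    seen.extend(claimed_by(ts, te, features))
```

Note that the feature starting at 950 overlaps the requested range but starts before the first tile, so the tile queries alone miss it; that gap is what a separate boundary-overlaps pass (Allen's query2) has to cover, with the client discarding any duplicates it produces.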
However in general I agree that it is the client's responsibility to reasonably handle cases where the same feature is returned multiple times. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Allen Day > Sent: Wednesday, November 23, 2005 3:50 PM > To: das2 at portal.open-bio.org > Subject: Re: [DAS2] tiled queries for performance > > More thoughts on this. The client can eliminate the redundancy in the > records returned by issuing the tiling queries as previously described > (query1), then issuing queries for records that are not contained within > tiles, but overlap the boundaries of 1 or more tiles (query2). > > However, by issuing all the overlaps queries at once, we've just deferred > the performance hit one step, because we can't reasonably expect the > server to have cached all combinations of tile overlaps queries. I think, > to get this tiling optimization to work, the burden needs to be on the > client to identify and remove duplicate responses for multiple > edge-overlaps queries (query3). > > 1000bp 2000bp 3000bp > | | | > | === | =====^==== | > | ====#===== | > | ============#=============#===== > | | | > > <-----------> query1a > <-----------> query1b > query2 > query3a > query3b > > Key: > > | : tile boundary > = : feature > ^ : gap between child features > # : portion of feature overlapping tile boundary. > : client overlaps query > <.> : client contains query > > -Allen > > > > On Mon, 21 Nov 2005, Allen Day wrote: > > > Hi, > > > > I had an idea of how clients may be able to get better response from > > servers by using a tiled query technique. Here's the basic idea: > > > > ClientA wants features in chr1/1010:2020, and issues a request for that > > range. No other clients have previously requested this range, so the > > server-side cache faults to the DAS/2 service (slow). 
> > > > ClientB wants features in chr1/1020:2030, and issues a request for that > > range. Although the intersection of the resulting records with > ClientA's > > query is large, the URIs are different and the server-side cache faults > > again. > > > > If ClientA and ClientB were to each issue two separate "tiled" requests: > > > > 1. chr1/1001:2000 > > 2. chr1/2001:3000 > > > > ClientB could take advantage of the fact that ClientA had been looking > at > > the same tiles. > > > > For this to work, the clients would need to be using the same tile size. > > The optimal tile size is likely to vary from datasource to datasource, > > depending on the length and density distributions of the features > > contained in the datasource. The "sources" or "versioned sources" > > payload could suggest a tiling size to prospective clients. Servers > could > > also pre-cache all tiles by hitting each tile after an update of the > > datasource (or the DAS/2 service code). > > > > The tradeoff for the performance gains is that clients may now need to > do > > filtering on the returned records to only return those requested by the > > client's client. > > > > -Allen > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -------------- next part -------------- A non-text attachment was scrubbed... Name: DAS2_Query_Optimization.ppt Type: application/vnd.ms-powerpoint Size: 287744 bytes Desc: DAS2_Query_Optimization.ppt URL: From ap3 at sanger.ac.uk Mon Nov 28 11:48:03 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 28 Nov 2005 11:48:03 +0000 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Hi! 
> How about this instead, as an overview/introduction.
>
> ======
>
> DAS/2 describes a data model for genome annotations.

Can we formulate the start a little more generally? Something like: DAS/2 is a protocol to share biological data. It provides specifications for how to share annotations of genomes and proteins, assays, ontologies (space for more here...). Then I would continue with your text.

Cheers,
Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

From dalke at dalkescientific.com Mon Nov 28 17:10:30 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 28 Nov 2005 18:10:30 +0100
Subject: [DAS2] mtg topics for Nov 28
Message-ID:

Here are the spec issues I would like to talk about for today's meeting, culled from the last few weeks of emails and phone calls.

1) DAS Status Code in headers

The current spec says

> X-DAS-Status: XXX status code
>
> The list of status codes is similar, but not identical, to those used
> by DAS/1:
>
> 200 OK, data follows
> 400 Bad namespace
> 401 Bad data source
> 402 Bad data format
> 403 Unknown object ID
> 404 Invalid object ID
> 405 Region coordinate error
> 406 No lock
> 407 Access denied
> 500 Server error
> 501 Unimplemented feature

I argued that these are not needed. Some of them are duplicates of HTTP error codes, and those which are not can be covered by an error code "300" along with an (optional) XML payload. The major problem with doing this seems to be in how MS IE handles certain error codes. While IE is not a target browser, MS software may use IE as a component for fetching data. From the link Ed dug up, it looks like this won't be a problem.

Lincoln's last email on this was a tepid

> I give up arguing this one and will go with the way Andrew wants to do
> it.
Therefore I propose the following rules: > > 1) Return the HTTP 404 error for the case that any component of the > DAS2 path > is invalid. This would apply to the following situations: > > Bad namespace > Bad data source > Unknown object ID > > 2) Return HTTP 301 and 302 redirects when the requested object has > moved. > > 3) Return HTTP 403 (forbidden) for no-lock errors. > > 4) Return HTTP 500 when the server crashes. > > For all errors there should be a text/x-das-error entity returned that > describes the error in more detail. The "x-das-error" format must have an invariant string, either an error code or fixed text, and a possible optional explanatory text section. Note the "should" in that last paragraph - this is optional. 2) Content-type There was some discussion about changing the content type to "text/xml" to support viewing DAS results in a browser. We decided that that wasn't a valid use case. In doing the research for this I found that the general recommendation for these sorts of XML documents is to put the document under "application/*" instead of "text/*". One reason is from http://www.ietf.org/rfc/rfc3023.txt If an XML document -- that is, the unprocessed, source XML document -- is readable by casual users, text/xml is preferable to application/xml. MIME user agents (and web user agents) that do not have explicit support for text/xml will treat it as text/plain, for example, by displaying the XML MIME entity as plain text. Application/xml is preferable when the XML MIME entity is unreadable by casual users. Similarly, text/xml-external-parsed-entity is preferable when an external parsed entity is readable by casual users, but application/xml-external-parsed-entity is preferable when a plain text display is inappropriate. NOTE: Users are in general not used to text containing tags such as , and often find such tags quite disorienting or annoying. 
If one is not sure, the conservative principle would suggest using application/* instead of text/* so as not to put information in front of users that they will quite likely not understand. Another is the difference in how application/* and text/* handle character set encodings. We use "text/x-...+xml" - I propose changing this to "application/x-...+xml" I don't think there are any objections to this. The main objection is to the difficulty of ploughing through all the specs related to charsets and unicode. 3) Key/value data As Steve pointed out, the spec is incomplete on how to handle key/value data associated with a record. The main problem is in how it handles namespaces. It mixes an internal attribute value namespace with the xml namespace, which doesn't happen. For example, This is a telomeric repeat birx28 This is a telomeric repeat 29 This is a telomeric repeat 29 - "simple extension elements" not in the "atom:" namespace > - "structured extension elements" not in the "atom:" namespace. > > Most of the "atom:" elements share a common structure. For example: > - the type= attribute indicates of the contents are text, escaped > HTML or XHTML; or an explicit content-type like "chemical/x-pdb". > > - the src= attribute indicates that the content of the element is > empty and to go to the given URL instead (apparently the hip > term for URL these days is IRL - internationalized Resource > Identifiers. > I think we only need to use URLs) > > > These are not always used for all elements; if it's appropriate for a > given field then it's used. > > > Simple extension elements are always of the form > Content goes here > where 'element' is not part of the 'atom:' namespace. Consumers of > this data may treat it as simple key/value data. > > Structured extension elements always have at least an attribute > or a sub-element, so must look like > .. > -or- > .. .. > > If the element isn't known this field may be ignored. 
> > These three things provide for: > - a set of well-defined elements, understandable by everyone > - a simple extension for things which can be key/value data > - a way to store or refer to more complex data types 5) xlink and <link> Several places in the spec include or may include links to documents elsewhere. The XLink specification describes a general extensibility mechanism for such links. xlinks have about four properties; the most important are: - where does the link go to - what kind of link is it - what should the browser do with such a link I personally don't understand the xlink spec well enough to want to use it, and I haven't come across examples of it in use. I am wary about specs like that. Another is to use something like the <link> element from HTML 4.0 and in Atom. This looks something like <link rel="..." type="..." href="..."/> that is, it has: - a category for how the link is related to the given object ('rel') - an optional MIME type (use, eg, if the server has multiple ways to provide data for the same 'rel' category) - an href to the data As implemented in Atom the contents of a <link> are extensible, which allows people to experiment with things like mirroring. In any case we need a way to provide typed links to other documents. Such links may include: - link from a given feature to the versioned source - link from a versioned source to the lock document 6) Source filters This comes from Andreas Prlic. We can support metadata servers via the same document returned from the entry point to a DAS server. However, a metadata server may also support searches, eg, to show only H. sapiens annotations using the build 1234 assembly. Should we make this property searching part of the DAS/2 spec, which means everyone must support it, or should we say it's optional but if implemented it must be done in a standard way? Or leave it for version 2.1, once we have more experience with DAS in real-life? (Though we already have that experience.)
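Andreas's source-filter idea above — letting a metadata server answer property searches such as "only H. sapiens on build 1234" — could be sketched as a simple query URL. Everything here is an assumption for illustration: neither the `/sources` path nor these parameter names are defined by the spec.

```python
from urllib.parse import urlencode

def sources_query(base_url, **filters):
    # Build a hypothetical property-search URL against a DAS/2 sources
    # document. Parameter names are illustrative assumptions only; no
    # such query syntax has been agreed on in the spec.
    query = urlencode(sorted(filters.items()))
    return base_url + "/sources" + ("?" + query if query else "")

url = sources_query("http://das.example.org/das2",
                    organism="Homo sapiens", assembly="build1234")
# e.g. http://das.example.org/das2/sources?assembly=build1234&organism=Homo+sapiens
```

Whether such searching is mandatory, optional-but-standardized, or deferred to 2.1 is exactly the open question above.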
7) /regions Could someone please explain to me the point of the /region subtree? As far as I can tell, a region is just a type of feature. A generic feature is located somewhere on the genome (with respect to a given assembly), and may also say it's on various 'region' features. I don't see the need for a separate namespace for this. 8) Tiled queries Do they need spec changes, or spec recommendations? I think I've mentioned everything to be covered. Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Nov 28 17:14:28 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 09:14:28 -0800 Subject: [DAS2] tiled queries for performance Message-ID: I don't think we should allow servers to return features that do not meet the criteria specified in the query feature filters; it's an invitation for ambiguity. This may seem harmless with just an "overlaps" region filter, but what about "inside", "contains", "identical"? What about "type", etc? If different DAS/2 server implementations contain the same data, they should return the same set of features for a given feature query. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Friday, November 25, 2005 3:43 PM > To: Asim Siddiqui > Cc: DAS/2 > Subject: Re: [DAS2] tiled queries for performance > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request > and the server is free to expand the range to something it can handle > a bit better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must > be able to filter or otherwise ignore extra results. > > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough.
> > Andrew > dalke at dalkescientific.com > From dalke at dalkescientific.com Mon Nov 28 17:14:52 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 28 Nov 2005 18:14:52 +0100 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Andreas Prlic: > Can we formulate the start a little more general? > > something like: > > DAS/2 is a protocol to share biological data. It provides > specifications for how > to share annotations of genomes and proteins, assays, ontologies > (space fore more here...). I thought about that, but the DAS/2.0 spec doesn't include any of those. Perhaps be more definite instead and say this is DAS/2.0? Or say "Other projects (link, link, link) extend DAS/2 to protein, assay and ontology data sets." Andrew dalke at dalkescientific.com From lstein at cshl.edu Mon Nov 28 17:24:32 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:24:32 -0500 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: <200511281224.32885.lstein@cshl.edu> > > > > is only a link to the sequence and a length, as in: > > You know, this is still kind of ugly. I hate to revisit this so late in the game, but can't we make sequence retrieval a three-step process? 1) Feature request returns: 2) Region request returns: (where seq= could be an absolute URL if someone else owns the bases) 3) Sequence request then returns the bases Lincoln > > > One alternate possibility is to change that so "pos" points to a > /feature (instead of a /region) and have features for each contig or > other assembly component. The result would look like: > > seq="sequence/Chr3/1271:1507:1"/> > > ... > > Doing this, however, means that all features must support subranges. 
> > > As an alternate solution without ranges, use > > > > and then look up the sequence coordinates of feature/AB1234 to > figure out where it starts/stops. > > > The other advantage to a region is you can ask for the assembly > via the 'agp' format. But because of the existing support for > formats which are only valid for some feature you can do that by asking > for, say, all assembly_component features (via the feature filter) and > return > the results in 'agp' format. > > > Third, just think of "reference sequence" as a coordinate system. One > > can have the exact same feature and indicate that: on > > coordinate-system-A this feature starts and ends here, and on > > coordinate-system-B it starts and ends there. Thus a feature's > > coordinates may be given both on a chromosome, and on a contig, and on > > any other coordinate-system that can be derived through a transform > > from these. > > I believe I understand this. There really is only one reference frame > for > the entire genome sequence, for a given assembly, and all other > coordinate > systems are a fixed and definite offset of that single reference frame. > I believe this is called the golden path? > > My reference to accuracy is because I figured that given two features > A and B on an assembly component X then the fuzziness in the relative > distance between A and B is small if X is also small. That is, smaller > terms are less likely to have changes as the golden path changes. > > > So you could change the sentence below to read "A reference server > > may supply features where the locations (start and end) are relative > > to either contigs, some other arbitrary region, or to the entire > > chromosome." > > Why not always supply it relative to the chromosome coordinates? The > spec > now allows that as an optional field. I can't figure out why you would > want to do otherwise.
> > Is it because sometimes it's easier to work with, say, a large number of > contig reference frames than with one large reference frame? Does that > mean we shift the complexity of coordinate translation from the data > provider to the data consumer? (Making it easier to generate data than > to consume data.) > > > This one is perhaps too subtle for the introduction, but if we decide > > to include it then I think it should first be phrased in terms of the > > problem (biological sampling) and then in terms of the solution > > (multiple parents). > > Oh, definitely. It's some place where I just don't have the domain > knowledge to explain it or even come up with examples. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 28 17:08:35 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:08:35 -0500 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <200511281208.36204.lstein@cshl.edu> Yes, this is a better intro. Lincoln On Friday 25 November 2005 10:21 am, Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations. An annotation > server provides information about one or more genome sources. Each > source may have one or more versions. 
Different versions are usually > based on different assemblies. As an implementation detail an > assembly and corresponding sequence data may be distributed via a > different machine, which is called the reference server. Portions of > the assembly may have higher relative accuracy than the assembly as a > whole. A reference server may supply these portions as an alternate > reference frame. > > Annotations are located on the genome with a start and end position. > The range may be specified multiple times if there are alternate > reference frames. An annotation may contain multiple non-contiguous > parts, making it the parent of those parts. Some parts may have more > than one parent. Annotations have a type based on terms in SOFA > (Sequence Ontology for Feature Annotation). Stylesheets contain a set > of properties used to depict a given type. > > Annotations can be searched by range, type, and a properties table > associated with each annotation. These are called feature filters. > > DAS/2 is implemented using a ReST architecture. Each entity (also > called a document or object) has a name, which is a URL. Fetching the > URL gets information about the entity. The DAS-specific entities are > all XML documents. Other entities contain data types with an existing > and frequently used file format. Where possible, a DAS server returns > data using existing formats. In some cases a server may describe how > to fetch a given entity in several different formats. > ====== > > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D.
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Mon Nov 28 17:11:24 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 28 Nov 2005 12:11:24 -0500 Subject: [DAS2] tiled queries for performance In-Reply-To: <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> Message-ID: <200511281211.25239.lstein@cshl.edu> One thing to do is to add to the spec a note that the server is free to return features from a range larger than requested. This way the server is free to expand the range to the 1k boundaries. My preference, however, would be for the server to implement a filter that removes from the precalculated tiled XML output all features that are outside the range. This would be completely transparent to the client. Lincoln On Friday 25 November 2005 06:43 pm, Andrew Dalke wrote: > Asim Siddiqui > > > I think this is a great idea. > > > > I don't see this as a big change to the DAS/2 spec or requiring much in > > the way of additional smarts on the client side. > > I agree with Allen on this - in some sense there's no effect on the > spec. It ends up being an agreement among the clients to request > aligned data, by rounding up/down to the nearest, say, kilobase and > for the server implementers to cache those requests. > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request > and the server is free to expand the range to something it can handle > a bit better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must > be able to filter or otherwise ignore extra results. 
> > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Nov 28 17:30:27 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 28 Nov 2005 09:30:27 -0800 Subject: [DAS2] Agenda for today's DAS/2 meeting Message-ID: Today we're going over spec issues. Here's my short list of topics to cover: DAS-specific headers Error codes Feature properties Registry & Discovery Please feel free to add! gregg From td2 at sanger.ac.uk Mon Nov 28 17:27:31 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Mon, 28 Nov 2005 17:27:31 +0000 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <83634851-73AD-454A-B027-644539CF1869@sanger.ac.uk> On 28 Nov 2005, at 17:14, Andrew Dalke wrote: > Andreas Prlic: >> Can we formulate the start a little more general? >> >> something like: >> >> DAS/2 is a protocol to share biological data. It provides >> specifications for how >> to share annotations of genomes and proteins, assays, ontologies >> (space fore more here...). > > I thought about that, but the DAS/2.0 spec doesn't include any of > those. There are pages about assay and ontology retrieval on the website. Are these not part of the spec? Or are they being counted as something else (DAS/2.1?) Thomas. 
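Lincoln's suggestion earlier in the thread — have the server keep precalculated tiles but strip features outside the requested range so tiling stays invisible to the client — could look roughly like this. A sketch only: the dict-based feature representation and half-open overlap semantics are assumptions, not spec.

```python
def overlaps(feature, start, end):
    # Overlap test on half-open [start, end) ranges; whether DAS/2
    # coordinates are inclusive is a detail glossed over here.
    return feature["start"] < end and feature["end"] > start

def filter_tile(tile_features, start, end):
    # Serve from a precalculated tile, but remove features falling
    # outside the requested range before returning the payload, so
    # the response is exactly what the feature filter asked for.
    return [f for f in tile_features if overlaps(f, start, end)]

tile = [{"id": "a", "start": 1000, "end": 1400},
        {"id": "b", "start": 1550, "end": 1580},
        {"id": "c", "start": 1900, "end": 2000}]
kept = filter_tile(tile, 1500, 1600)  # client asked for 1500..1600
# only feature "b" survives the filter
```

This is the variant Gregg's objection favors: identical data on different servers yields identical query results.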
From dalke at dalkescientific.com Mon Nov 28 18:09:17 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 28 Nov 2005 19:09:17 +0100 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: References: Message-ID: Here's the email I sent to Steve that I meant to send to everyone. On Nov 17, 2005, at 2:09 AM, Andrew Dalke wrote: > I think I understand the Atom spec better now. In brief, the > Atom document contains sections which are extensible and sections > which are not. > > In an extensible section there are two/three categories of elements: > - those in the "atom:" namespace > - "simple extension elements" not in the "atom:" namespace > - "structured extension elements" not in the "atom:" namespace. > > Most of the "atom:" elements share a common structure. For example: > - the type= attribute indicates of the contents are text, escaped > HTML or XHTML; or an explicit content-type like "chemical/x-pdb". > > - the src= attribute indicates that the content of the element is > empty and to go to the given URL instead (apparently the hip > term for URL these days is IRL - internationalized Resource > Identifiers. > I think we only need to use URLs) > > > These are not always used for all elements; if it's appropriate for a > given field then it's used. > > > Simple extension elements are always of the form > Content goes here > where 'element' is not part of the 'atom:' namespace. Consumers of > this data may treat it as simple key/value data. > > Structured extension elements always have at least an attribute > or a sub-element, so must look like > .. > -or- > .. .. > > If the element isn't known this field may be ignored. 
> > These three things provide for: > - a set of well-define elements, understandable by everyone > - a simple extension for things which can be key/value data > - a way to store or refer to more complex data types > > > Steve, responding to an earlier posting of mine: >> Interesting, but a problem with this is that it effectively creates a >> new version of the TYPES schema every time a new property is added to >> the DAS properties controlled vocabulary. I would hope for a solution >> that decouples the content of the controlled vocab from the data >> exchange format. > > I looked into that. Relax-NG lets you define a "can be anything > except ...". The Atom spec is defined with the following > > # Simple Extension > > simpleExtensionElement = > element * - atom:* { > text > } > > # Structured Extension > > structuredExtensionElement = > element * - atom:* { > (attribute * { text }+, > (text|anyElement)*) > | (attribute * { text }*, > (text?, anyElement+, (text|anyElement)*)) > } > > The "element * - atom:*" means "Any element except those in > the atom namespace." > > Thus we can validate anything with DAS/2 tags, and ignore > validate of anything not part of DAS/2. And we can say that > extensions are only allowed in certain parts of the spec and > not in others. > > We would need to update the schema when we add new "das:" elements, > but we already need to do that. > > We wouldn't need to change the schema to allow others to develop > their own extensions. Indeed, the schema would still let use > verify that extensions are still well-formed. 
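The Atom extension pattern quoted above could carry over to DAS/2 roughly as follows, in the same Relax-NG compact syntax. This is a sketch only; the `das:` namespace binding and the FEATURE content model shown are assumptions, not the actual schema.

```rnc
# Any element outside the das: namespace is an extension,
# mirroring Atom's "element * - atom:*" pattern.
anyElement =
   element * {
      (attribute * { text }
       | text
       | anyElement)*
   }

dasExtensionElement =
   element * - das:* {
      (attribute * { text }*,
       (text | anyElement)*)
   }

# Hypothetical: a FEATURE validates its das: content strictly,
# while permitting (and ignoring) foreign extension elements.
das.feature =
   element das:FEATURE {
      attribute id { text },
      dasExtensionElement*
   }
```

As the quoted email notes, the schema would still need updating for new `das:` elements, but third-party extensions would validate without any schema change.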
> >> Here's my next attempt, which more fully exploits xml:base to achieve >> this decoupling: >> >> > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" >> xml:base="http://www.wormbase.org/das/genome/volvox/1/" >> xmlns:xlink="http://www.w3.org/1999/xlink" >>> >> > das:type="type/curated_exon"> >> >> 29 >> >> > xml:base="http://www.biodas.org/ns/das/genome/2.00/properties"> >> 2 >> > xlink:type="simple" >> >> xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ >> CTEL54X.1" >> /> >> >> > > Vs. > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" > > xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink"> > das:type="type/curated_exon"> > 29 > 2 > src="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" > /> > > > > The main differences are: > - the properties are defined elements in the prop: namespace (though > I think they can just as easily be in the das: namespace) > > - I'm using lower-case since that seems to be the trend these days. > > > >> So now we have the following arrangement: >> >> * the attribute keys 'das:id', 'das:type', and 'das:ptype' are >> defined >> within the xmlns:das namespace (i.e., the full id of 'das:type' is >> derived by appending 'type' to the xmlns:das URL). > > I don't follow why the attributes have full namespaces. Is that > to allow extensibility of element attribute on a per-element basis? > > I kept "das:type" above because "type" already has too many meanings. > >> * the attributes values of 'das:id', 'das:type', and 'das:ptype' are >> URLs relative to xml:base. > > Are all attribute values relative to xml:base or only those three? > > Are xlink:href fields relative to xml:base as well? I assume "yes". 
> >> * The FEATURE element may contain zero or more PROPERTIES >> sub-elements, each with it's own xml:base attribute, effectively >> changing what xml:base is used within the containted PROP >> sub-elements. >> >> So in this example, the property >> 'das:ptype="property/genefinder-score"' >> inherits its xml:base from its grandparent FEATURES element and so >> expands to: >> >> http://www.wormbase.org/das/genome/volvox/1/property/genefinder-score >> >> while the 'das:ptype="phase"' and 'das:ptype="protein_translation"' >> properties inherit xml:base from their PROPERTIES parent element and >> so expand to: >> >> http://www.biodas.org/ns/das/genome/2.00/properties/phase >> http://www.biodas.org/ns/das/genome/2.00/properties/ >> protein_translation > > This is also what happens with the "prop:" namespaced elements, just > at the element level instead of the attribute level. > > To keep this on key/value data I've shifted the rest of the reply > to the next email. Andrew dalke at dalkescientific.com From asims at bcgsc.ca Mon Nov 28 19:21:47 2005 From: asims at bcgsc.ca (Asim Siddiqui) Date: Mon, 28 Nov 2005 11:21:47 -0800 Subject: [DAS2] tiled queries for performance Message-ID: <86C6E520C12E52429ACBCB01546DF4D3BE3EF8@xchange1.phage.bcgsc.ca> Agreed - in light of this, my suggestion doesn't make sense, though Allen's idea may be workable through some other means. Asim -----Original Message----- From: Helt,Gregg [mailto:Gregg_Helt at affymetrix.com] Sent: Monday, November 28, 2005 9:14 AM To: Andrew Dalke; Asim Siddiqui Cc: DAS/2 Subject: RE: [DAS2] tiled queries for performance I don't think we should allow servers to return features than do not meet the criteria specified in the query feature filters, it's an invitation for ambiguity. This may seem harmless with just an "overlaps" region filter, but what about "inside", "contains", "identical"? What about "type", etc? 
If different DAS/2 server implementations contain the same data, they should return the same set of features for a given feature query. gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Friday, November 25, 2005 3:43 PM > To: Asim Siddiqui > Cc: DAS/2 > Subject: Re: [DAS2] tiled queries for performance > > > The change is simply that instead of the client getting exactly what it > > asks for, it may get more. > > While that's another matter - the client makes a request and the > server is free to expand the range to something it can handle a bit > better. Allen? Were you suggesting this instead? > > In this case there is a change to the spec, and all clients must be > able to filter or otherwise ignore extra results. > > I personally think it's an implementation issue related to performance > and there are ways to make the results be generated fast enough. > > Andrew > dalke at dalkescientific.com > From allenday at ucla.edu Mon Nov 28 20:11:59 2005 From: allenday at ucla.edu (Allen Day) Date: Mon, 28 Nov 2005 12:11:59 -0800 (PST) Subject: [DAS2] tiled queries for performance In-Reply-To: <200511281211.25239.lstein@cshl.edu> References: <86C6E520C12E52429ACBCB01546DF4D3BE3E5E@xchange1.phage.bcgsc.ca> <9ec33e6fb3efbbe8b39adc52d2b78db7@dalkescientific.com> <200511281211.25239.lstein@cshl.edu> Message-ID: On Mon, 28 Nov 2005, Lincoln Stein wrote: > One thing to do is to add to the spec a note that the server is free to return > features from a range larger than requested. This way the server is free to > expand the range to the 1k boundaries. This would require the returned payload to contain the bounds of the features actually returned. E.g. if client asks for 1500..1600, and server responds with 1001..2000, it needs a way to tell the client what the actual bounds of the response are. 
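Allen's example above (client asks for 1500..1600, server answers with 1001..2000) amounts to rounding requests outward to tile boundaries and reporting the actual bounds back. A sketch under assumptions: a 1 kb tile size and 0-based half-open coordinates, neither of which is fixed by the thread.

```python
TILE = 1000  # assumed tile size; the thread mentions ~1 kb boundaries

def expand_to_tiles(start, end, tile=TILE):
    # Round the requested range outward to tile boundaries so requests
    # become cacheable. Per Allen's point, the server would then need
    # to report these actual bounds back in the payload so the client
    # knows what it really received.
    return (start // tile) * tile, -(-end // tile) * tile

bounds = expand_to_tiles(1500, 1600)  # -> (1000, 2000)
```

Whether this belongs in the spec or stays a client/server convention is the open question of this sub-thread.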
> > My preference, however, would be for the server to implement a filter that > removes from the precalculated tiled XML output all features that are outside > the range. This would be completely transparent to the client. Yes, this is what I plan to do if we agree to use one of the tiling variants. -Allen > > Lincoln > > On Friday 25 November 2005 06:43 pm, Andrew Dalke wrote: > > Asim Siddiqui > > > > > I think this is a great idea. > > > > > > I don't see this as a big change to the DAS/2 spec or requiring much in > > > the way of additional smarts on the client side. > > > > I agree with Allen on this - in some sense there's no effect on the > > spec. It ends up being an agreement among the clients to request > > aligned data, by rounding up/down to the nearest, say, kilobase and > > for the server implementers to cache those requests. > > > > > The change is simply that instead of the client getting exactly what it > > > asks for, it may get more. > > > > While that's another matter - the client makes a request > > and the server is free to expand the range to something it can handle > > a bit better. Allen? Were you suggesting this instead? > > > > In this case there is a change to the spec, and all clients must > > be able to filter or otherwise ignore extra results. > > > > I personally think it's an implementation issue related to performance > > and there are ways to make the results be generated fast enough. 
> > > > Andrew > > dalke at dalkescientific.com > > > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > From Steve_Chervitz at affymetrix.com Mon Nov 28 22:07:29 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 14:07:29 -0800 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: Message-ID: To give some context to the message that Andrew recently forwarded to the list, below is the message I sent to Andrew that prompted his reply (I also meant to send it to the list instead of to just Andrew). It contains my fix to the 'namespace in attribute values' problem regarding properties which I mentioned in today's conf call, and is, I believe, the only viable alternative to Andrew's Relax-NG based solution. Basically, the trick is to enclose PROP elements that are relative to the same xml:base within a parent PROPERTIES element and then permit multiple PROPERTIES elements within a feature. This way you can allow property attribute URIs that are relative to different xml:bases. To clarify a point of possible confusion, there are really two sets of key-value pairs to keep in mind: 1. The key-value pair for the property type. 2. The key-value pair for the property itself. So in this example: <PROP das:ptype="property/genefinder-score">29</PROP> The key for the type is 'das:ptype' and its value is 'property/genefinder-score', and this value is a relative URL based on xml:base in the enclosing PROPERTIES element (or in its grandparent or great-grandparent element, etc.). The value of the property itself is 29 and its key is the whole key-value pair for the type (das:ptype="property/genefinder-score"). In Andrew's Relax-NG equivalent: <prop:genefinder-score>29</prop:genefinder-score> the element name contains both the key ('prop:') and the value of the property type ('genefinder-score'), while the element name as a whole serves as the key for the property itself (value=29).
The 'prop:genefinder-score' string is not a relative URL, but is just a namespace-scoped element name, with 'prop:' serving merely to make 'genefinder-score' globally unique, relative to the URI defined by: xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" A potential drawback of the Relax-NG approach, as discussed in today's conf call, is that the value of the property type is not resolvable as in the other approach using the PROPERTIES parent element. Andrew doesn't see a need for resolvability, e.g., for a dynamically discoverable schema fragment. But I thought of another use case besides the one mentioned in today's call (determining data type such as int or float, which isn't of much use in practice). The URL for the type could point to a human-readable definition of the term. A user may not need clarification of 'genefinder-score' but might for something like 'softberry-ztuple'. One could still satisfy such a use case under the Relax-NG approach by providing a resolvable URL based on the element name + namespace such as: http://www.biodas.org/ns/das/genome/2.00/properties#genefinder-score True, there's no XML spec that says this is legal, but we could declare that such a convention will hold for all biodas.org-based properties. One problem with the above convention is that it's not obvious what the URL resolves to. So we could have something like: http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&define=true http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&schema=true Just a thought.
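The xml:base mechanics Steve describes can be checked mechanically, since XML Base defers to standard URI resolution. A sketch using the base URL from the volvox example in this thread; note one subtlety worth flagging for the spec: a base without a trailing slash loses its final path segment on resolution.

```python
from urllib.parse import urljoin

# xml:base on the FEATURES element in Steve's example
feature_base = "http://www.wormbase.org/das/genome/volvox/1/"

resolved = urljoin(feature_base, "property/genefinder-score")
# -> http://www.wormbase.org/das/genome/volvox/1/property/genefinder-score

# Caveat: RFC 3986 resolution drops the final path segment of a base
# lacking a trailing slash, so an xml:base of ".../2.00/properties"
# resolves "phase" to ".../2.00/phase", not ".../properties/phase".
no_slash = urljoin("http://www.biodas.org/ns/das/genome/2.00/properties",
                   "phase")
# -> http://www.biodas.org/ns/das/genome/2.00/phase
```

So the PROPERTIES-with-xml:base scheme works as intended only if servers are careful to emit trailing slashes on their base URLs.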
Steve > From: Steve Chervitz > Date: Mon, 14 Nov 2005 17:40:28 -0800 > To: Andrew Dalke > Conversation: [DAS2] Spec issues > Subject: Re: [DAS2] Spec issues > > > Andrew Dalke wrote on 14 Nov 2005: >> >> To: DAS/2 >> Subject: Re: [DAS2] Spec issues >> >> On Nov 4 Steve wrote: >>> >> das:type="type/curated_exon"> >>> 29 >>> 2 >>> >> xlink:type="simple" >>> >>> xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/ >>> CTEL54X.1 >>> /> >>> >> >> I think we're missing something. This is XML. We can do >> >> >> > ontology="http://song.sf.net/ontologies/sofa#gene" >> source="curated" >> xml:base="gene/"> >> 29 >> 2 >> > xlink:href="http://www.wormbase.org/..." /> >> This message brought to you by >> AT&T >> > >> >> The whole point of having namespaces in XML is to keep from needing >> to define new namespaces like . >> >> In doing that, there's no problem in supporting things like "bg:glyph", >> etc. because the values are expanded as expected by the XML processor. > > Interesting, but a problem with this is that it effectively creates a > new version of the TYPES schema every time a new property is added to > the DAS properties controlled vocabulary. I would hope for a solution > that decouples the content of the controlled vocab from the data > exchange format. 
> > Here's my next attempt, which more fully exploits xml:base to achieve > this decoupling: > > xmlns:das="http://www.biodas.org/ns/das/genome/2.00/" > xml:base="http://www.wormbase.org/das/genome/volvox/1/" > xmlns:xlink="http://www.w3.org/1999/xlink" >> > das:type="type/curated_exon"> > > 29 > > xml:base="http://www.biodas.org/ns/das/genome/2.00/properties"> > 2 > xlink:type="simple" > > xlink:href="http://www.wormbase.org/das/protein/volvox/2/feature/CTEL54X.1" /> > > > > So now we have the following arrangement: > > * the attribute keys 'das:id', 'das:type', and 'das:ptype' are defined > within the xmlns:das namespace (i.e., the full id of 'das:type' is > derived by appending 'type' to the xmlns:das URL). > > * the attribute values of 'das:id', 'das:type', and 'das:ptype' are > URLs relative to xml:base. > > * The FEATURE element may contain zero or more PROPERTIES > sub-elements, each with its own xml:base attribute, effectively > changing what xml:base is used within the contained PROP > sub-elements. > > So in this example, the property 'das:ptype="property/genefinder-score"' > inherits its xml:base from its grandparent FEATURES element and so > expands to: > > http://www.wormbase.org/das/genome/volvox/1/property/genefinder-score > > while the 'das:ptype="phase"' and 'das:ptype="protein_translation"' > properties inherit xml:base from their PROPERTIES parent element and > so expand to: > > http://www.biodas.org/ns/das/genome/2.00/properties/phase > http://www.biodas.org/ns/das/genome/2.00/properties/protein_translation > > >>> Also, we might want to allow some controlled vocabulary terms to be >>> used for >>> the value of type.source (e.g., "das:curated"), to ensure that >>> different >>> users use the same term to specify that a feature type is produced by >>> curation. >> >> I talked with Andreas Prlic about what other metadata is needed for the >> registry system.
He mentioned >> >> Together with the BioSapiens DAS people we recently decided that >> there should be the possibility to assign gene-ontology evidence >> codes to each das source, so in the next update of the registry, >> this will be changed. >> >> That's at the source level, but perhaps it's also needed at the >> annotation level. > > I like this idea. Good re-use of GO technology. > >> >> >> My thoughts on these are: >> - come up with a more consistent way to store key/value data >> - the Atom spec has a nice way to say "the data is in this CDATA >> as text/html/xml" vs. "this text is over there". I want to copy its >> way of doing things. >> >> - I'm still not clear about xlink. Another is the HTML-style >> >> >> Atom uses the "rel=" to encode information about the link. For >> example, the URL to edit a given document is >> >> >> >> See http://atomenabled.org/developers/api/atom-api-spec.php > > Not sure about this one yet. In the Atom API, the value of the rel > attribute is restricted to a controlled vocabulary of link > relationships and available services pertaining to editing and > publishing syndicated content on the web: > http://atomenabled.org/developers/api/atom-api-spec.php#rfc.section.5.4.1 > > What would a controlled vocab for DAS resources be? > > Skimming through the DAS/2 retrieval spec, our use of hrefs is > simply for pointing at the location of resources on the web > containing some specified content (e.g., documentation, database > entry, image data, etc.). > > The next/prev/start idea for Atom might have good applicability in the > DAS world for iterating through versions of annotations or assemblies > (e.g., rel='link-to-gene-on-next-version-of-genome'). One relationship > that would be useful for DAS would be 'latest', to get the latest > version of an annotation.
> > DAS get URLs themselves seem fairly self-documenting (it's clear a > given link is for feature, type, or sequence for example), so having a > separate rel attribute may not provide much additional value for these > links. But it might be handy for versioning and for DAS/2 writebacks. > > Here's another link about Atom: > http://en.wikipedia.org/wiki/Atom_%28standard%29 > > Steve From ed_erwin at affymetrix.com Mon Nov 28 22:09:23 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 28 Nov 2005 14:09:23 -0800 Subject: [DAS2] DAS intro In-Reply-To: <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> Message-ID: <438B8013.3060107@affymetrix.com> Andrew Dalke wrote: > > I believe I understand this. There really is only one reference frame for > the entire genome sequence, for a given assembly, and all other coordinate > systems are a fixed and definite offset of that single reference frame. No. The coordinate transformations are often more complicated than simple offsets. The coordinate space for features on one contig can be 'backwards' with respect to a different contig, and the coordinate space for a gene may skip over one or more gaps with respect to the genomic sequence. Also, the term 'reference frame' bugs me a bit because 'frame' always makes me think of 'reading frame', which is not what you intend. From Steve_Chervitz at affymetrix.com Mon Nov 28 22:55:28 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 14:55:28 -0800 Subject: [DAS2] DAS/1 vs DAS/2 discussion list In-Reply-To: Message-ID: The DAS/1 list is still open and working. 
I updated biodas.org to reflect this and set up a special page to inform people about which list to use: http://biodas.org/documents/biodas-lists.html Subscribers on the DAS/1 list have not been automatically added to the DAS/2 list. They must actively subscribe themselves here: http://biodas.org/mailman/listinfo/das2 Steve > From: "Helt,Gregg" > Date: Mon, 21 Nov 2005 09:24:37 -0800 > To: Andrew Dalke , DAS/2 > Conversation: [DAS2] Getting individual features in DAS/1 > Subject: RE: [DAS2] Getting individual features in DAS/1 > > We need to discuss at today's meeting. I don't think the original DAS > list should be closed, but rather continue to serve as a list to discuss > the DAS/1 protocol and implementations, and the DAS2 mailing list should > focus on DAS/2. If we mix DAS/1 and DAS/2 discussions in the same > mailing list I think it's going to lead to a lot of confusion. > > gregg > >> -----Original Message----- >> From: das2-bounces at portal.open-bio.org > [mailto:das2-bounces at portal.open- >> bio.org] On Behalf Of Andrew Dalke >> Sent: Monday, November 21, 2005 9:09 AM >> To: DAS/2 >> Subject: Re: [DAS2] Getting individual features in DAS/1 >> >> Has anyone answered Ilari's question? >> >> I never used DAS/1 enough to answer it myself. >> >> If the normal DAS list is closed, is this the right place for DAS/1 >> questions? >> >> >> On Nov 18, 2005, at 4:22 PM, Ilari Scheinin wrote: >> >>> This mail is not really about DAS/2, but the web site says the >>> original DAS mailing list is now closed. >>> >>> I am setting up a DAS server that serves CGH data from my database > to >>> a visualization software, which in my case is gbrowse. I've already >>> set up Dazzle that serves the reference data from a local copy of >>> Ensembl. 
I need to be able to select individual CGH experiments to > be >>> visualized, and as the measurements from a single CGH experiment > cover >>> the entire genome, this cannot of course be done by specifying a >>> segment along with the features command. >>> >>> I noticed that there is a feature_id option for getting the features >>> in DAS/1.5, but on a closer look, it seems to work by getting the >>> segment that the specified feature corresponds to, and then getting >>> all features from that segment. My next approach was to use the >>> feature type to distinguish between different CGH experiments. As > all >>> my data is of the type CGH, I thought that I could spare this >>> piece of information for identifying purposes. >>> >>> First I tried the generic seqfeature plugin. I created a database > for >>> it with some test data. However, getting features by type does not >>> seem to work. I always get all the features from the segment in >>> question. >>> >>> Next I tried the LDAS plugin. Again I created a compatible database >>> with some test data. I must have done something wrong with the data >>> file I imported to the database, because getting the features does > not >>> work. I can get the feature types, but trying to get the features >>> gives me an ERRORSEGMENT error. >>> >>> I thought that before I go further, it might be useful to ask > whether >>> my approach seems reasonable, or is there a better way to achieve > what >>> I am trying to do? What should I do to be able to visualize > individual >>> CGH profiles?
>>> >>> I'm grateful for any advice, >>> Ilari >> >> Andrew >> dalke at dalkescientific.com >> >> _______________________________________________ >> DAS2 mailing list >> DAS2 at portal.open-bio.org >> http://portal.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Tue Nov 29 00:01:08 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 01:01:08 +0100 Subject: properties and key/value data (was Re: [DAS2] Spec issues) In-Reply-To: References: Message-ID: Steve: > To clarify a point of possible confusion, there are really two sets of > key-value pairs to keep in mind: > > 1. The key-value pair for the property type. > 2. The key-value pair for the property itself. I don't see that #1 is a useful distinction. > So in this example: > > 29 > > The key for the type is 'das:ptype' and its value is > 'property/genefinder-score' and this value is a relative URL based on > xml:base in the enclosing PROPERTIES element (or in its grandparent or > great-grandparent element, etc.). The value of the property itself is > 29 and > its key is the whole key-value pair for the type ( > das:ptype="property/genefinder-score"). How do I make an extension type? For example, I want to add a new property for 3D structure depiction, which can be one of "cartoon", "ribbons", or "wires". Let's say it's under my company web site in http://www.dalkescientific.com/das-types/rep3d How do I write it? I tried but couldn't figure it out. What does that URL resolve to, if anything? > In Andrew's Relax-NG equivalent: > > 29 > > the element name contains both the key ('prop:') and the value of the > property type ('genefinder-score'), while the element name as a whole > serves > as the key for the property itself (value=29).
> The 'prop:genefinder-score' string is not a relative URL, but is just a namespace-scoped element name, with 'prop:' serving merely to make 'genefinder-score' globally unique, relative to the URI defined by: > > xmlns:prop="http://www.biodas.org/ns/das/genome/2.00/properties" It took me a while to understand XML namespaces. This helped: http://www.jclark.com/xml/xmlns.htm He uses (for purposes of explanation) the so-called "Clark notation". An example from that document: an element written as <c:part xmlns:c="http://www.cars.com/xml"/> maps to <{http://www.cars.com/xml}part/> """The role of the URI in a universal name is purely to allow applications to recognize the name. There are no guarantees about the resource identified by the URI.""" Using Clark notation helps with remembering that, since { and } here are not valid for URLs. The element name "prop:genefinder-score" is a convenient way to write the full element name, and that's all. There is no meaning to the parts of the name. "prop:" is not a key, since given these two namespace definitions <... xmlns:prop="http://www.dalkescientific.com/" xmlns:wash="http://www.dalkescientific.com/"> then these two elements are identical: <prop:genefinder-score>29</prop:genefinder-score> <wash:genefinder-score>29</wash:genefinder-score> I think Steve is saying the same thing as I am - I wanted to rephrase it to make sure. > A potential drawback of the Relax-NG approach, as discussed in today's > conf > call, is that the value of the property type is not resolvable as in > the > other approach using the PROPERTIES parent element. > > Andrew doesn't see a need for resolvability, e.g., for a dynamically > discoverable schema fragment. But I thought of another use case > besides the > one mentioned in today's call (determining data type such as int or > float, > which isn't of much use in practice). The URL for the type could point > to a > human readable definition of the term. A user may not need > clarification of > 'genefinder-score' but might for something like 'softberry-ztuple'. Who is the user that would want the clarification?
That is, what human will be doing the reading? Once clarified, what does that user do with the information? In my opinion, the only people who care about this are developers, and more specifically, developers who will extend a client to support new data types. Users of, say, the web front end or of IGB don't care. That's a relatively small number of people. And the use case is solved by having the doc_href for the versioned source include a link to any extensions served. Here's another solution. Somewhere early in the results include where the schema includes links for each of the fields, including any extensions. It doesn't need to be a , just something meant as a shout out to developer people. > One could still satisfy such a use case under the Relax-NG approach by > providing a resolvable URL based on the element name + namespace such > as: > > http://www.biodas.org/ns/das/genome/2.00/properties#genefinder-score > > True, there's no XML spec that says this is legal, but we could > declare that > such a convention will hold for all biodas.org-based properties. One > problem > with the above convention is that it's not obvious what the URL > resolves to. > So we could have something like: > > http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&define=true > > http://www.biodas.org/ns/das/genome/2.00/properties?prop=genefinder-score&schema=true We could do this, though it's a bit complicated with some tools which represent elements via Clark notation - it needs a bit of string munging. I suggest that the reason why "it's not obvious what the URL resolves to" is because there's nothing which will actually use this. It is easier to just have a human-readable link either on the doc_href page or via some special "if you're a developer, look here" reference, and don't worry about automating it further.
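The prefix-irrelevance point above can be seen concretely: Python's xml.etree.ElementTree reports parsed element names in Clark notation, so two prefixes bound to the same namespace URI yield the identical name. A small sketch (not from the original thread), reusing Andrew's prop/wash example:

```python
import xml.etree.ElementTree as ET

# Two prefixes bound to the same namespace URI, as in Andrew's example
doc = """<root xmlns:prop="http://www.dalkescientific.com/"
              xmlns:wash="http://www.dalkescientific.com/">
  <prop:genefinder-score>29</prop:genefinder-score>
  <wash:genefinder-score>29</wash:genefinder-score>
</root>"""

a, b = ET.fromstring(doc)
# Both elements carry the same Clark-notation name; the prefix is gone
# after parsing.
print(a.tag)  # {http://www.dalkescientific.com/}genefinder-score
print(a.tag == b.tag)  # True
```

The prefix is only a local abbreviation for the URI; once the parser expands names, nothing of 'prop:' or 'wash:' survives.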
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Nov 29 00:16:17 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 01:16:17 +0100 Subject: [DAS2] DAS intro In-Reply-To: <438B8013.3060107@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> Message-ID: Ed Erwin: > No. The coordinate transformations are often more complicated than > simple offsets. The coordinate space for features on one contig can > be 'backwards' with respect to a different contig, and the coordinate > space for a gene may skip over one or more gaps with respect to the > genomic sequence. The /region entities in the DAS/2 spec are defined as (zero or more) A top-level region on the genome (similar to the "entry points" of the DAS/1 protocol). id - the URI of the sequence ID length - length of the sequence name (optional) - a human-readable label for use when referring to the region doc_href (optional) - a URL that gives additional information about this region Here is an example. This is a very simple definition. As far as I can tell it does not capture the information for, say, skipping. How would you represent "the coordinate space for a gene [that skips] over one or more gaps with respect to the genomic sequence" using the current DAS/2 object model? Or goes backwards? I don't see anything like that. > Also, the term 'reference frame' bugs me a bit because 'frame' always > makes me think of 'reading frame', which is not what you intend. Oh, I agree. It's a bad term. Very very few genomics people use it, according to Google. There's a theory, popular on usenet and in some wikis, that experts rarely write the details because after all they know the topic.
The best way to get a detailed explanation is to post something in error and wait for the corrections. :) Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Tue Nov 29 03:05:40 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 28 Nov 2005 19:05:40 -0800 Subject: [DAS2] DAS/2 weekly meeting notes for 28 Nov 05 Message-ID: Notes from the weekly DAS/2 teleconference, 28 Nov 2005. $Id: das2-teleconf-2005-11-28.txt,v 1.1 2005/11/29 03:06:04 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt CSHL: Lincoln Stein UC Berkeley: Suzi Lewis Sanger: Thomas Down, Andreas Prlic Sweden: Andrew Dalke Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Today's topic: Spec issues (for DAS/2 retrievals) ------------------------------------------------- We are following the agenda summary in Andrew's email: http://portal.open-bio.org/pipermail/das2/2005-November/000352.html 1) DAS Status Code in headers ----------------------------- Use http error codes and not das-specific ones. das-error to provide more detail. GH: Do we really need a detailed response document? TD: How do you distinguish different parts of the error-causing request? AD: how detailed do we need to be? LS: If you wish to do error recovery, you could have problems with one part and not another. You give up granularity. 
GH: Willing to give up the granularity in favor of simplicity. AD: Possibilities of error LS: How about everything that can be turned into an http error should be. And have a special section to provide das details. E.g.: client is still going to have to understand das error codes GH, AD: client does need to be there. AD: Using only http error codes reduces complexity - you only need to check one place. Another benefit - you can provide a file-based das server (this was not a use case from the RFCs, just AD's pet idea he envisions as potentially useful). GH: Can't think of DAS/1 clients that did anything meaningful with those das error codes. AD: NCBI entrez server - does lots of extra error support. Don't want to go there with das. TD, LS: DAS error codes can be used to tell client which part of the URL is at fault. Now it will be just '404 not found'. AD: REST API says use the http protocol directly. LS: There are some things in the DAS API that don't translate into http error codes. AD: We can support this with error document. [A] Use HTTP error codes and x-das-error document with code and optional description. 2) Content-type --------------- [A] No objections to using: application/x-das+blah+xml 3) Key/value data ----------------- Three possibilities summarized in Andrew's email. 1) (current spec) using namespace in attrib value. 2) (steve, lincoln) all attribute values are URIs 3) (andrew) Relax-NG based, drop in well-structured XML SC: (clarified proposal #2). For more, see today's post at: http://portal.open-bio.org/pipermail/das2/2005-November/000363.html AD: What's wrong with the Relax-NG based approach? LS: I don't understand it yet. SC: Community lacks experience with Relax-NG in general. TD: Does it let you point to schema fragments for data types? AD: There are ways to define it in the schema, haven't looked at it. LS: This looks great.
Would propose having a convention that if it's a simple, single-valued key, value should be encoded in an attribute (value="blah"), not as content of a section (CDATA). Reason: It's more consistent with rest of spec, and it's easier to parse. So in the example, genefinder-score is not correctly encoded. AD: That's not in the das: namespace, hence is not under our control. We can use this convention for things in the das namespace. AD: User can put in any xml as long as it's reasonably well-formed. We can define what well-formed is. This is what atom uses. Allows some simple key val data on client as if it were native data. It permits searches without needing to know about complex data. GH: Likes idea of allowing arbitrary xml. SC: Not completely arbitrary since we limit use of das: namespace, and possibly other aspects. LS: So we're going to say we have properties represented as key/val pairs using this syntax. You'll find 'das:' as well as possibly other namespaces. I think that works. What becomes of /property url (ptype)? Does that go away and get replaced by namespace? AD: Possibly use it for data type (e.g., float). Or we could make it discoverable? LS: Easier to make it part of the spec. TD: If this can work like XML schema, we could have a pointer to an xsi. Is there a way to put a pointer to a schema url? AD: Found this to be useless. Hard coding what is expected is better than having discoverability. TD: With the xsi schema location, you can put multiple schema locations for the das schema, and your extension, separate pointers to both in a single document. AD: Never found dynamically resolved schemas useful for anything LS: In theory they are. Why not? AD: Knowing that something's an int doesn't say what that int is supposed to mean. LS: Right. Let's make sure that the common types of annotation a server would want to return are in the spec from the get-go. Anyone that doesn't care about extensions can ignore additional properties.
No doubt people will make extensions to DAS/2 that are implemented on client and server that are in-house, private extensions that only work in client-server pairs. Should we allow schema fragments to be brought in via xsi? TD: this would be in the top-level element. Or can put it on an enclosing element. AD: Is there a good reason to do it? LS: Let's not seek discoverability. [A] Andrew will flesh out his Relax-NG based property encoding approach. SC: You could put your schema at the url pointed to by 'das:' AD: Don't see a need. I found that many of the DAS/1 schema fragments/documents were invalid. This didn't seem to bother DAS/1 clients and users. LS: In the real world, people don't validate. 5) xlink and ------------------- AD: The official xlink spec is long. Have not fully grokked it. GH: Does anyone else have experience with it? (silence...) Seems like a reason to not go there. AD: Atom uses link to say, "Here's some generic linked out stuff". We could use it to say, "I'm looking for the stylesheet for this thing or the schema for the xml document." GH: We need to draw a line between generic links and specific things. E.g., feature ids, all ids are resolvable links, and so could in principle be specified with link tags. AD: Link from feature to versioned source it's a part of. Client can figure out context from url. Use case: DAS user sends email to colleague, 'look at this url for feature X'. The other user enters URL in his das browser, client can identify the das2-versioned source given the feature URL. LS: They would rely on xml:base. Nothing in the current DAS/2 spec says that the xml base is for the versioned source. LS: But it does give you the versioned source. This is absolutely part of the spec. AD: Nothing in the spec that says that features have to be on the same machine as the rest of the data. LS: Why does user want versioned source on the same machine that the feature came from?
AD: Nothing in the spec says that a feature has to be under 'feature' in the URL. GH: Generalizing the info href element to be more generic, to specify what that link means is fine as long as we don't do this for everything that can be a link. Doc hrefs are fine, not ids. LS: We're not going to demand that people specify links. (Something about giving people enough rope to hang themselves with...) GH: Ids are opaque uris to id the feature. LS: The HTML link tag has been around a long time, and used a total of two times: style sheets, copyright statements. This could have easily been done with a stylesheet tag and copyright tag (without needing a general link tag). [A] Consider the xlink/link tags issue tabled. 6) Source filters ----------------- GH: Use case: DAS/2 client is trying to discover what registry has, query can be the same as for any das server, you can just apply additional filters when dealing with a registry. AP: Client would use tags that a registry server must implement. GH: A non-registry server can implement as well. TD: say filtering is optional in general. AD: I tend to not like optional things. Filtering is required for features. GH: The spec can state the filters that a registry is required to implement on sources query. General DAS/2 servers are not required, but can if they want. What if you send a sources query with filters that it doesn't understand? LS: Return everything GH: Return error AP: Client can filter out what they want GH: It's already important to have search capability in client. Use case: On given genome, show me all gene predictions for this region. You need to go to all servers, which could be many. AD: Can you filter by type of features that can be returned? AP: Can be added. GH: Want to be able to search on ontology term, not just id of the type. AD: Need meta-data server to ask of DAS/2 servers what features do you implement? LS: Does metadata protocol need to be part of das spec, or an additional protocol on top?
There should be an optional section of DAS/2 that is implemented by metadata servers or registries that allows you to discover servers. Shouldn't overload the core server spec. GH: Concerned with the response. It's so close to the same xml, it might as well be the same. Makes it easy for clients to know about both servers and metadata servers. Could call it 'sources' or something else. LS: Filtering by feature type, do we need that info that's returned by sources document? GH: No, it's part of the query. LS: Metadata server would have to do a types request. AD: What if there's a mismatch in SOFA version? LS: We're in trouble. AD: Concerned about change in meaning. SL: Not important. LS: Use case: There's a 'restriction site' node in SOFA 1.4 with five terms underneath it. In version 1.5, now there's six terms. A metadata server running off of the old version is using an incomplete node. Metadata engine should always run off the latest version. AP: Registry at Sanger checks every 2 hrs with server. AD: How is this better than having client do it itself? What features do you know with this type and this range? GH: If lots of DAS servers, this will be time-intensive AD: Can we wait until there are lots of servers? AP: We have 17. LS: Current paradigm - EBI has many servers that just do one type of feature e.g., there's a server that just does repeat elements. So there are servers that will serve up one or a few feat types. AD: Had not considered that. LS: Happy to have optional filter syntax added to sources request supported by metadata servers. Gregg is right about returning error (unimplemented). Will not change protocol in fundamental way. Just an annex, just optional section supported by metadata servers. GH: Based on Andreas' queries in soap, can we squeeze everything into params on url? filterable? AP: yes AD: optional fields will include species, build#, type, etc. [A] Add optional filter syntax to sources request. Allow unimpl error return.
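A rough illustration of the "filters as URL params" idea agreed above. The endpoint and the parameter names (species, build, type) are guesses for illustration only; the actual filter vocabulary was left as an open action item, not fixed in the spec:

```python
from urllib.parse import urlencode

# Hypothetical registry endpoint and filter names -- illustrative only,
# not taken from the DAS/2 spec or the Sanger registry.
base = "http://das.example.org/das2/sources"
filters = {"species": "Homo sapiens", "build": "hg17", "type": "gene"}

# urlencode percent-encodes values, so spaces etc. are safe in the URL.
url = base + "?" + urlencode(filters)
print(url)
# http://das.example.org/das2/sources?species=Homo+sapiens&build=hg17&type=gene
```

A server that does not implement filtering would, per the discussion above, either ignore the unrecognized parameters and return everything, or return an "unimplemented" error.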
7) /regions ----------- LS: In sofa, a feature of type region is root of all other features - everything is a region. Has props - ref sequence it's on, start, strandedness. The reason for region is for retrieving assemblies. SC: Region is also currently the only way to get back a list of available sequence ids without getting all sequence data. The top-level sequence request returns data along with sequence. LS/GH: region could be called 'landmarks' [A] Andrew will work directly with Lincoln on revising region request. 8) Tiled queries ---------------- LS: This doesn't need to be in spec. If client filters features by a range, is there a contract such that server must return the exact range he asked for, contained in, or is it ok for the server to return more? GH: We need to be more strict. LS: Agree. Client should trim it. [A] Tiled queries should not be part of the spec. Other issues ------------ AP: There are still some other issues not addressed in this call. E.g., not possible to handle situation where protein sequence in a structure varies from genome. Can defer to the next spec discussion conf call. From ed_erwin at affymetrix.com Tue Nov 29 19:30:41 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 11:30:41 -0800 Subject: [DAS2] DAS intro In-Reply-To: References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> Message-ID: <438CAC61.1090104@affymetrix.com> Andrew Dalke wrote: > Ed Erwin: > >> No. The coordinate transformations are often more complicated than >> simple offsets. The coordinate space for features on one contig can >> be 'backwards' with respect to a different contig, and the coordinate >> space for a gene may skip over one or more gaps with respect to the >> genomic sequence.
> > > The /region entities in the DAS/2 spec are defined as > > (zero or more) > A top-level region on the genome (similar to the "entry points" of > the DAS/1 protocol). > id - the URI of the sequence ID > length - length of the sequence > name (optional) - a human-readable label for use when referring > to the region > doc_href (optional) - a URL that gives additional information > about this region > > Here is an example > > > I had to go back and look up the context for this discussion. Here it is: >> [Suzi wrote] >> Third, just think of "reference sequence" as a coordinate system. One >> can have the exact same feature and indicate that: on >> coordinate-system-A this feature starts and ends here, and on >> coordinate-system-B it starts and ends there. Thus a feature's >> coordinates may be given both on a chromosome, and on a contig, and on >> any other coordinate-system that can be derived through a transform >> from these. > > [Andrew wrote] > I believe I understand this. There really is only one reference frame > for the entire genome sequence, for a given assembly, and all other > coordinate systems are a fixed and definite offset of that single > reference frame. I understand this as talking about coordinates in general, not the elements or "pos" attributes in the spec. Suzi specifically mentions chromosomes and contigs; one can definitely be backwards with respect to the other. But top-level regions in an assembly would probably all be chromosomes or all be contigs, rather than a mixture. There is not one single "reference frame" for an assembly: rather there is one coordinate axis for *each* top-level region. If those top-level regions are chromosomes, then there is no relationship between the coordinates on different ones. If those top-level regions are contigs or ESTs (which I believe is allowed by the spec), then positions on one of them can be related to positions on others through various transforms. > This is a very simple definition.
As far as I can tell it does not > capture the information for, say, skipping. > > How would you represent "the coordinate space for a gene [that skips] > over one or more gaps with respect to the genomic sequence" using the > current DAS/2 object model? > > Or goes backwards? I don't see anything like that. You represent gaps with parent-child feature relationships, and going backwards by specifying "+1" strand on one contig and "-1" strand on the other. The spec does not require a DAS/2 server to know how to perform transformations from one coordinate system to another, but your statement "there really is only one reference frame for the entire genome sequence" is wrong as I understand it. There is one coordinate axis for *each* top-level region. From ed_erwin at affymetrix.com Tue Nov 29 19:36:13 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 11:36:13 -0800 Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: <438CADAD.8060403@affymetrix.com> Andrew Dalke wrote: > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. > > How about this instead, as an overview/introduction. > > ====== > > DAS/2 describes a data model for genome annotations. In general I like this better than the original introduction. Thanks for writing it. But I agree with Andreas that the first line is better as: > DAS/2 is a protocol to share biological data. I definitely think of DAS as a protocol first, rather than a data model first. 
From ed_erwin at affymetrix.com Tue Nov 29 20:16:11 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 12:16:11 -0800 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: References: Message-ID: <438CB70B.4030005@affymetrix.com> Andrew Dalke wrote: > Here are the spec issues I would like to talk about for today's meeting, > culled from the last few weeks of emails and phone calls > > 1) DAS Status Code in headers > > The current spec says > >> X-DAS-Status: XXX status code >> >> The list of status codes is similar, but not identical, to those used >> by DAS/1: >> >> 200 OK, data follows >> 400 Bad namespace >> 401 Bad data source >> 402 Bad data format >> 403 Unknown object ID >> 404 Invalid object ID >> 405 Region coordinate error >> 406 No lock >> 407 Access denied >> 500 Server error >> 501 Unimplemented feature > > > I argued that these are not needed. Some of them are duplicates with > HTTP error codes and those which are not can be covered by an error > code "300" along with an (optional) XML payload. > > The major problem with doing this seems to be in how MS IE handles > certain error codes. While IE is not a target browser, MS software > may use IE as a component for fetching data. From the link Ed dug > up, it looks like this won't be a problem. > I'm not going to argue anymore against moving the X-DAS-Status code up into the HTTP status code. I'm willing to try it and see if it works. But I want to re-iterate why I'm suspicious of this. I have experience trying this in two separate projects and it failed both times. (Still, I think those problems won't occur this time.) 1. I tried this on a project internally at Affymetrix. It didn't work in this case because the client code was (indirectly) using MS IE code, and IE was throwing away the HTTP content when the header had certain error codes. 
This doesn't bother me much now, though, because I doubt many DAS clients will be written that interface with IE, and because I now know that you can force IE to keep the HTTP content as long as you make sure the content is always at least 512 characters long. So if we ever run into this problem, there is an easy work-around. 2. I tried putting the X-DAS-Status codes into the HTTP status code in our internal DAS/1 server about a year ago. (In DAS/1 they are not supposed to be in the HTTP status codes, but I misunderstood the spec.) I ran into problems when I tried that, and that is the main reason I objected to trying that in DAS/2. Unfortunately, I can't remember what those problems were.... The problem might have been: a) the IGB client didn't understand the status codes because they weren't in the expected place. If this is the case, then the problem was benign, because we are now writing new code to support the new spec, so we can make IGB understand whatever we want. b) I use Apache's ".htaccess" files to do some URL re-direction on our DAS/1 client machine. see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html#RewriteRule It is possible that this was causing the original HTTP status code to be replaced with a different one. I'm currently using the "proxy" form of redirect, which seems to keep the status code intact. Earlier I was using the "redirect" form of redirect, which may change the status code to 302. ----- Based on my experience with apache re-direction, I have a vague fear that we may run into cases where firewalls, or html cachers and optimizers may mangle the HTTP status codes for some users at some point. But since I have no confirmed evidence that that will happen, I have no objection to going ahead and trying to use HTTP status codes. 
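The 512-character workaround Ed describes (some IE-based components discard a short body on error status codes, but keep it once the content reaches 512 characters) can be sketched on the server side. The function name and the comment-padding tactic below are illustrative assumptions, not part of any DAS spec:

```python
# Sketch of the workaround: pad an XML error payload to at least 512
# bytes so IE-based clients do not discard it on non-200 status codes.

def pad_error_body(xml_body, minimum=512):
    """Append a trailing XML comment so the payload reaches `minimum`
    characters. Parsers ignore the comment; IE counts its bytes."""
    deficit = minimum - len(xml_body)
    if deficit > 0:
        # "<!--" + spaces + "-->" adds at least `deficit` characters.
        xml_body += "<!--" + " " * max(deficit - 7, 0) + "-->"
    return xml_body

body = pad_error_body("<error>unknown object ID</error>")
print(len(body) >= 512)  # True
```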
From Steve_Chervitz at affymetrix.com Tue Nov 29 20:33:29 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 29 Nov 2005 12:33:29 -0800 Subject: [DAS2] DAS intro In-Reply-To: <438CADAD.8060403@affymetrix.com> Message-ID: Ed Erwin wrote: > Andrew Dalke wrote: >> The front of the DAS doc starts >> >> DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: >> >> That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. >> >> How about this instead, as an overview/introduction. >> >> ====== >> >> DAS/2 describes a data model for genome annotations. > > In general I like this better than the original introduction. Thanks > for writing it. > > But I agree with Andreas that the first line is better as: > >> DAS/2 is a protocol to share biological data. > > I definitely think of DAS as a protocol first, rather than a data model > first. I concur. The main aim of DAS is to define an API to allow clients to query servers in order to retrieve bioinformatics data objects in defined response formats. Of course, the writeback facility of DAS/2 will make DAS more of a two-way street so we could say 'sharing and editing', but I think retrieval is more fundamental and probably accounts for the majority of uses. How about this for the first line: DAS is a protocol for sharing biological data. No need to limit it to version 2. This applies to all versions. Use 'DAS/2' when talking about new features in this version, such as writeback. Steve From dalke at dalkescientific.com Tue Nov 29 22:17:02 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 23:17:02 +0100 Subject: [DAS2] DAS intro In-Reply-To: References: Message-ID: Steve: > How about this for the first line: > > DAS is a protocol for sharing biological data. > > No need to limit it to version 2. This applies to all versions. Use > 'DAS/2' > when talking about new features in this version, such as writeback. Done. 
Made a few changes to the CVS intro text to reduce the use of "DAS/2". So that email I just sent is out of date. :) Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Nov 30 00:02:07 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 01:02:07 +0100 Subject: What are regions for? (was Re: [DAS2] DAS intro) In-Reply-To: <438CAC61.1090104@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> Message-ID: <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> Ed: > I understand this as talking about coordinates in general, not the > elements or "pos" attributes in the spec. Suzi specifically > mentions chromosomes and contigs; one can definitely be backwards with > respect to the other. But top-level regions in an assembly would > probably all be chromosomes or all be contigs, rather than a mixture. I'm trying to figure out when people use the /region. In my way of understanding things there is the genomic sequence. That consists of a set of chromosomes, each with a list of bases. A chromosome is assembled from parts. One of these parts is called a 'contig'. I thought I knew what it was, but according to http://staden.sourceforge.net/contig.html there are several meanings. What I understand is that a 'contig' is a sequenced chunk of DNA which has overlaps with other contigs and when combined can be used to deduce the entire sequence (excepting regions of repeats and other ambiguities). The best such deduction is the golden path. For DAS/2 we assume sequenced genomes. When will people use top-level regions which are not chromosomes? Chromosome top-level regions are identical to the /sequence, except for the ability to get the assembly and the sequence data directly. 
Is that correct? The spec allows links from a feature into several different regions. This suggests to me that sometimes there will be regions which are a mixture of contigs and chromosomes. Else why support that ability? There is nothing in the spec (that I know of) which allows any hierarchy to the regions - all regions are top-level. Is this correct? > If those top-level regions are chromosomes, then there is no > relationship between the coordinates on different ones. While I understand that, I did get it wrong when I wrote it down. In my head I was thinking "each base has a 1-to-1 mapping to a number, and if two bases are next to each other then the corresponding two numbers are next to each other." This is invalid because the converse is not true - if one number is the end of a chromosome and the other is the start of the next then the two bases are not next to each other. > If those top-level regions are contigs or ESTs (which I believe is > allowed by the spec), then positions on one of them can be related to > positions on others through various transforms. Those are allowed. Will people use them? What advantage is there to having these be a special category instead of a feature? > You represent gaps with parent-child feature relationships, and > going backwards by specifying "+1" strand on one contig and "-1" > strand on the other. Something like this? (Yes, this is hand-wavy) Here's a feature (and note, this is NOT a region) with two subfeatures, one on the forward strand and one on the reverse. This I understand just fine. I don't understand why the positions are given in /region space instead of either: - directly to /sequence space, eg ... -or- - point to a feature of type 'region' which provides the region coordinates ... (Again, hand-wavy. I think best looking at data and code.) 
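The coordinate arithmetic behind that hand-waving can be sketched in code. The function, contig placements, and half-open interval convention below are invented for illustration, not taken from the spec: a parent feature has two parts, one on a contig placed forward ("+1" strand) in the assembly and one on a contig placed reversed ("-1" strand), and each part's span is mapped onto the single chromosome axis.

```python
# Illustrative sketch (not spec XML): map a part's half-open
# [start, end) span from contig coordinates to chromosome coordinates,
# honoring the orientation of the contig in the assembly.

def part_on_chromosome(part_start, part_end, contig_offset, contig_len, strand):
    """contig_offset: where the contig begins on the chromosome.
    strand: +1 if the contig is placed forward, -1 if reversed."""
    if strand == +1:
        return contig_offset + part_start, contig_offset + part_end
    if strand == -1:
        # On a reversed contig the span counts back from the far end.
        return (contig_offset + contig_len - part_end,
                contig_offset + contig_len - part_start)
    raise ValueError("strand must be +1 or -1")

# Hypothetical gene with two parts (the gap is whatever lies between):
#   part A: bases 100-200 on contigX, placed forward at offset 5000
#   part B: bases 100-200 on contigY (length 1000), placed reversed at 9000
print(part_on_chromosome(100, 200, 5000, 1000, +1))  # (5100, 5200)
print(part_on_chromosome(100, 200, 9000, 1000, -1))  # (9800, 9900)
```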
> The spec does not require a DAS/2 server to know how to perform > transformations from one coordinate system to another, but your > statement "there really is only one reference frame for the entire > genome sequence" is wrong as I understand it. There is one coordinate > axis for *each* top-level region. Understood. My questions, to summarize, are: - why do we need a /region space when we can 1. point directly to a sequence (for chromosome regions) and/or 2. point to a "contig" or "assembly" or "region" feature type (for other regions) - When would someone have regions which have more than one of contigs, ESTs and chromosomes? Especially given that this is the genome spec, so chromosome-level info is known, at least enough for a rough assembly. In other words, what are regions for? Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Nov 30 00:26:41 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 01:26:41 +0100 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: <438CB70B.4030005@affymetrix.com> References: <438CB70B.4030005@affymetrix.com> Message-ID: <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> Ed: > I'm not going to argue anymore against moving the X-DAS-Status code up > into the HTTP status code. I'm willing to try it and see if it works. > > But I want to re-iterate why I'm suspicious of this. I have > experience trying this in two separate projects and it failed both > times. (Still, I think those problems won't occur this time.) > > 1. I tried this on a project internally at Affymetrix. It didn't > work in this case because the client code was (indirectly) using MS IE > code, and IE was throwing away the HTTP content when the header had > certain error codes. This was a two-part problem: - identifying in client code that a given error occurred - extracting the payload when the error occurred As far as I can tell, the problem you are concerned about is the second part. 
Personally I don't want an application/x-das-error+xml return document. Several others do. Thing is, when Gregg asked if anyone used the DAS/1 error codes for anything other than "there was an error", no one said anything. I could hear the proverbial crickets chirping (or in my case, snow falling). I am convinced that the actual error content will be server implementation specific and as such non-portable across clients. I will flesh out a document type for this then ask Thomas, Lincoln etc. to provide a list of defined error code extensions that their servers will return. It's likely they'll not be able to agree on it, because their code will do different styles of error checking. I'll also dodge the whole mess by saying that the error document payload is optional, so clients are highly unlikely to read it for anything meaningful. (Except perhaps some text shunted to the user.) That makes more work in the spec implementation for something I can almost guarantee will be ignored by DAS clients. > b) I use Apache's ".htaccess" files to do some URL re-direction on our > DAS/1 client machine. > > see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html#RewriteRule > > It is possible that this was causing the original HTTP status code to > be replaced with a different one. > > I'm currently using the "proxy" form of redirect, which seems to keep > the status code intact. Earlier I was using the "redirect" form of > redirect, which may change the status code to 302. I don't understand how the old one would be a problem in the web clients I'm familiar with. It should be: send request to server get 302 "moved temporarily" response along with new URL repeat until no redirect or reached max redirect limit request new URL get headers/payload back The redirects shouldn't affect the real response code, which would be the last in the chain. If it did, it would also affect 404 and 200 responses. 
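The redirect-following loop outlined above (request, get a 302 plus a new URL, repeat until there is no redirect or a limit is hit, then read the final response) can be sketched as follows. The `get` callable stands in for whatever HTTP library a client uses, and the URLs and response tuples are illustrative, not spec-mandated:

```python
# Sketch of manual redirect following. `get(url)` is a stand-in
# transport returning (status, location_header, body).

def fetch_following_redirects(get, url, max_redirects=5):
    """Follow 3xx redirects up to a limit and return the final
    (status, body) -- the last response in the chain."""
    for _ in range(max_redirects + 1):
        status, location, body = get(url)
        if status in (301, 302, 303, 307) and location:
            url = location  # "moved" -- repeat with the new URL
            continue
        return status, body
    raise RuntimeError("too many redirects")

# Toy transport: one 302 hop, then a 200 with a payload.
responses = {
    "http://old.example/das": (302, "http://new.example/das", ""),
    "http://new.example/das": (200, None, "<SOURCES/>"),
}
print(fetch_following_redirects(responses.__getitem__,
                                "http://old.example/das"))
# (200, '<SOURCES/>')
```

Note the final status in the chain is what the client sees; an intermediate 302 only changes which URL is fetched next.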
> Based on my experience with apache re-direction, I have a vague fear > that we may run into cases where firewalls, or html cachers and > optimizers may mangle the HTTP status codes for some users at some > point. But since I have no confirmed evidence that that will happen, > I have no objection to going ahead and trying to use HTTP status > codes. I know that fear. I've had intermediate web caches misconfigured which cached any HTML page for an hour, making me unable to edit my web site and see the changes. That was with a normal 200 response code, so likely misconfigured caches will affect other response codes. But what's there to do about that? What's the error rate? We're using normal HTTP and if a web cache breaks for us - we aren't doing anything fancy; no content-negotiation, no 'If-Modified-Since', etc - then it will break for anyone doing HTTP. That's anyone exchanging HTML, sending RSS, etc. Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Wed Nov 30 00:34:11 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 16:34:11 -0800 Subject: [DAS2] mtg topics for Nov 28 In-Reply-To: <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> References: <438CB70B.4030005@affymetrix.com> <45f7dbc8e14fa2a68af6c1d03153d715@dalkescientific.com> Message-ID: <438CF383.5050604@affymetrix.com> >> I'm currently using the "proxy" form of redirect, which seems to keep >> the status code intact. Earlier I was using the "redirect" form of >> redirect, which may change the status code to 302. > > > I don't understand how the old one would be a problem in the > web clients I'm familiar with. It should be: > > send request to server > get 302 "moved temporarily" response along with new URL > repeat until no redirect or reached max redirect limit > request new URL > get headers/payload back Unlike modern web browsers, IGB isn't smart enough to do that. Maybe someday it will need to be, but it isn't there yet. 
From dalke at dalkescientific.com Tue Nov 29 22:13:49 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 29 Nov 2005 23:13:49 +0100 Subject: [DAS2] DAS intro In-Reply-To: <438CADAD.8060403@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <438CADAD.8060403@affymetrix.com> Message-ID: <24b1a9183d9f344398f80839f4c71b6e@dalkescientific.com> Ed: > I definitely think of DAS as a protocol first, rather than a data > model first. Mmm. I see you all's point. All protocols express a data model, though neither side necessarily must implement it that way. Here's the updated text. This is what I just committed to CVS. Note that it's missing mention of the '/region' section. ===== DAS/2 is a protocol for sharing biological data. This version of the specification describes features located on the genomic sequence. Future extensions will add support for sharing annotations of expression data, protein sequences, 3D structures, and ontologies. A DAS/2 annotation server provides feature information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Annotations are located on the genomic sequence with a start and end position. The range may be specified multiple times if there are alternate reference frames. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type. Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters. DAS/2 is implemented using a ReST architecture. 
Each entity (also called a document or object) has a name, which is a URL. Fetching the URL gets information about the entity. The DAS-specific entities are all XML documents. Other entities contain data types with an existing and frequently used file format. Where possible, a DAS server returns data using existing formats. In some cases a server may describe how to fetch a given entity in several different formats. ===== Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Wed Nov 30 00:37:07 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Tue, 29 Nov 2005 16:37:07 -0800 Subject: What are regions for? (was Re: [DAS2] DAS intro) In-Reply-To: <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> Message-ID: <438CF433.1020707@affymetrix.com> Andrew Dalke wrote: > My questions, to summarize, are: > - why do we need a /region space when we can > 1. point directly to a sequence (for chromosome regions) and/or > 2. point to a "contig" or "assembly" or "region" feature type > (for other regions) The way I understand it, that is what region is for: to point directly to a location on a sequence and/or contig. > - When would someone have regions which have more than one of > contigs, ESTs and chromosomes? Especially given that this > is the genome spec, so chromosome-level info is known, at > least enough for a rough assembly. I think they do it mainly 1) when the assembly is incomplete or 2) to preserve annotations from the past when the assembly was incomplete. There could be more reasons. 
Here is an example of a DAS/1 server that contains both chromosomes and "other" short sequences as entry points: http://servlet.sanger.ac.uk:8080/das/ensembl_Homo_sapiens_core_28_35a/entry_points See here for some more genomes that are treated similarly: http://servlet.sanger.ac.uk:8080/das > In other words, what are regions for? > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Wed Nov 30 01:26:29 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 30 Nov 2005 02:26:29 +0100 Subject: What is /region for? (was Re: [DAS2] DAS intro) In-Reply-To: <438CF433.1020707@affymetrix.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> <59fa39752e4d792d2142fe2682813937@fruitfly.org> <1ac71c37969c1ef9dcc0d983157746aa@fruitfly.org> <9a9ee9242a38f40049a7c5d973980e7d@dalkescientific.com> <438B8013.3060107@affymetrix.com> <438CAC61.1090104@affymetrix.com> <921477a6bd799b5e19b965b3cd39d239@dalkescientific.com> <438CF433.1020707@affymetrix.com> Message-ID: <6fd85d539c25833e9b6f7f41b3429231@dalkescientific.com> (Changed the Subject line slightly to be a bit clearer. I hope.) On Nov 30, 2005, at 1:37 AM, Ed Erwin wrote: > Andrew Dalke wrote: >> My questions, to summarize, are: >> - why do we need a /region space when we can >> 1. point directly to a sequence (for chromosome regions) and/or >> 2. point to a "contig" or "assembly" or "region" feature type >> (for other regions) > > The way I understand it, that is what region is for: to point directly > to a location on a sequence and/or contig. Am I not asking the question correctly? Am I missing the obvious? Been known to happen before! I know what regions are. I don't know why they are in a distinct /region subtree. I'm happy - enthusiastic - ecstatic - that there are different ways to identify certain regions. 
I fully accept that they are in use every day and widely understood. Why are they special enough to get their own /region subtree? Why can't they be features? Here's my proposal. Leaf node parts of a feature always point to a /sequence and optionally point to one or more /feature elements which are of type "region". (Or some other part of SOFA - perhaps assembly-component?) Want to know where the feature is on a given "region" feature? Then look up the region to find its /sequence location. Use these two /sequence locations to get the location in the region. Both /sequence locations are in the same "coordinate space" of "identifier + start/end offset". BTW, if regions are a type of feature then you can search for them. Eg, search for all top-level regions in the range 100000 to 2000000. Can't do that with the /region container. Can if the region data is in the /feature container. >> - When would someone have regions which have more than one of >> contigs, ESTs and chromosomes? Especially given that this >> is the genome spec, so chromosome-level info is known, at >> least enough for a rough assembly. > > I think they do it mainly 1) when the assembly is incomplete or 2) to > preserve annotations from the past when the assembly was incomplete. > There could be more reasons. > > Here is an example of a DAS/1 server that contains both chromosomes > and "other" short sequences as entry points: Okay, I'm fine with that. Thanks. Is a goal of DAS to support incomplete genomes? Note, btw, that the /sequence subtree does not need to contain only chromosomes. From the spec seqid is the sequence ID, and can correspond to an assembled chromosome, a contig, a clone, or any other accessionable chunk of sequence. Hence for incomplete genomes, put the sequence data as best you can under /sequence and have the /feature subtree point to it. >> In other words, what are regions for? Still don't understand the need for a /region namespace. 
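Andrew's two-lookup proposal (the feature's location on /sequence plus the region's location on /sequence give the feature's location within the region) amounts to a subtraction. A minimal sketch, assuming 0-based half-open intervals on a shared sequence axis (a convention chosen here for illustration, not dictated by the spec):

```python
# Sketch of the two-lookup idea: both the feature and the "region"
# feature are located on the same /sequence axis; subtracting the
# region's origin gives region-relative coordinates.

def locate_in_region(feat_start, feat_end, region_start, region_end):
    """All coordinates are 0-based, half-open, on one sequence axis.
    Returns the feature's span relative to the region's origin."""
    if not (region_start <= feat_start and feat_end <= region_end):
        raise ValueError("feature does not lie within the region")
    return feat_start - region_start, feat_end - region_start

# Feature at 10500-10800 on the chromosome, region at 10000-20000:
print(locate_in_region(10500, 10800, 10000, 20000))  # (500, 800)
```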
Repeat: I understand regions, I just don't see why they go in their own subtree and aren't part of some other data chunk. Please, someone sketch out some example with hand-waving XML that shows how having a /region is the appropriate solution. That's what I'm worried about now - the representation in XML. Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Wed Nov 30 02:08:47 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Tue, 29 Nov 2005 18:08:47 -0800 Subject: [DAS2] mtg topics for Nov 28 Message-ID: Actually I think by default the java networking library that IGB uses follows most redirections automatically without IGB having to worry about it. I'm not familiar with what different forms of redirection might do to the status codes, but I expect that as long as the redirection is successful the code IGB would actually see would be 200 OK. IGB does have a ways to go to properly respond to all possible HTTP status codes though... gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Ed Erwin > Sent: Tuesday, November 29, 2005 4:34 PM > To: Andrew Dalke > Cc: DAS/2 > Subject: Re: [DAS2] mtg topics for Nov 28 > > > >> I'm currently using the "proxy" form of redirect, which seems to keep > >> the status code intact. Earlier I was using the "redirect" form of > >> redirect, which may change the status code to 302. > > > > > > I don't understand how the old one would be a problem in the > > web clients I'm familiar with. It should be: > > > > send request to server > > get 302 "moved temporarily" response along with new URL > > repeat until no redirect or reached max redirect limit > > request new URL > > get headers/payload back > > Unlike modern web browsers, IGB isn't smart enough to do that. Maybe > someday it will need to be, but it isn't there yet. 
> > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From Gregg_Helt at affymetrix.com Wed Nov 30 02:17:24 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Tue, 29 Nov 2005 18:17:24 -0800 Subject: [DAS2] mtg topics for Nov 28 Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Ed Erwin > Sent: Tuesday, November 29, 2005 12:16 PM > To: Andrew Dalke > Cc: DAS/2 > Subject: Re: [DAS2] mtg topics for Nov 28 ... > 2. I tried putting the X-DAS-Status codes into the HTTP status code in > our internal DAS/1 server about a year ago. (In DAS/1 they are not > supposed to be in the HTTP status codes, but I misunderstood the spec.) > I ran into problems when I tried that, and that is the main reason I > objected to trying that in DAS/2. > > Unfortunately, I can't remember what those problems were.... > > The problem might have been: > a) the IGB client didn't understand the status codes because they > weren't in the expected place. > > If this is the case, then the problem was benign, because we are now > writing new code to support the new spec, so we can make IGB understand > whatever we want. I'm pretty sure this was the problem (IGB didn't know where to find the status codes). gregg
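A client that must talk to both conventions could look in both places: trust the HTTP status first, and fall back to a DAS/1-style X-DAS-Status header when the transport says 200 but the DAS layer reports an error. The header name comes from the draft spec quoted earlier in the thread; the fallback logic itself is an assumption sketched here, not behavior either spec mandates:

```python
# Sketch: reconcile an HTTP status code with a legacy X-DAS-Status
# header. Purely illustrative; neither spec defines this precedence.

def effective_das_status(http_status, headers):
    """Prefer the HTTP status; if the transport reports 200 but an
    X-DAS-Status header is present (DAS/1-style servers), use the
    numeric code at the start of that header instead."""
    das_header = headers.get("X-DAS-Status")
    if http_status == 200 and das_header is not None:
        return int(das_header.split()[0])
    return http_status

print(effective_das_status(200, {"X-DAS-Status": "403 Unknown object ID"}))  # 403
print(effective_das_status(404, {}))  # 404
```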