From boconnor at ucla.edu Wed Mar 1 16:34:38 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Wed, 01 Mar 2006 13:34:38 -0800 Subject: [DAS2] Re: Re DAS2 Server In-Reply-To: References: Message-ID: <4406136E.6060703@ucla.edu> Hi Vidya, So I think your best option is to try the RPM. I built a Fedora Core 2 RPM for DAS2 and just released it to http://biopackages.net last night. I could really use someone to test it, so feedback would be great. The RPM approach is nice because yum will take care of installing all the dependencies, including the chado database. If you're not using FC2 then it's a little bit more involved. We don't really have a lot of docs, but I could update the README in cvs (see http://sourceforge.net/projects/gmod ; it's the "das2" module). Until recently there wasn't really an install process; you just did a "perl Makefile.PL; make; make test" to run DAS2. There's now an "install" target so you can do "perl Makefile.PL; make; sudo make install". You need to set some environment variables, install a chado DB, and make sure all the perl module dependencies are installed before you do this, though. See the Makefile.PL for the environment variables you need to set. I'll update the README to include information about the dependencies. Hope this helps! I cc'd Allen Day too; he might have some helpful hints... --Brian Vidya Edupuganti wrote: >Hi Brian, >I am trying to setup DAS/2 server so that it can be used with Affymetrix's >IGB browser. I was trying to find a user manual for setting up DAS/2 server. >I could not find any. Can you please direct me to a place where I can find >it. If there isn't any can you please give me some inputs on how to install >a DAS/2 server and load data. >I really appreciate your help, >Thanks >Vidya > > > > >Vidyadari Edupuganti >Bioinformatician, Bioinformatics Research Unit >The Translational Genomics Research Unit (TGen) >445 N. 
Fifth St >Phoenix, AZ, 85004, USA > > > > > From dalke at dalkescientific.com Fri Mar 3 04:55:02 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 3 Mar 2006 02:55:02 -0700 Subject: [DAS2] working das validator Message-ID: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com> I have a running validator at http://cgi.biodas.org:8080/ I've only tested it with SOURCES document but there's little that would fail with the others. I had planned to get this up a couple days ago but I've been distracted learning more about Javascript and a couple of Javascript libraries. I used Mochikit to make the interactivity you see there, and I have some ideas about how to use Dojo -- but not for a couple of weeks. The code goes through the following validation steps: - TODO - handle if the URL is not fetchable and handle timeouts - check that the content-type agrees with the document type - check that it's well-formed XML; report error where not - check that the root element matches the document type - check that it passed the Relax-NG validation; - report the id and href fields which are empty strings - report if any date fields are not iso dates There are many more checks I could add. They are easy now that the scaffold is there. I'm going to work on the next draft now. After that I'll get back to the validator. I want to add hyperlinks on fields which are links, and I have an idea of how to add a "SEARCH" button next to the query urls which creates a popup where you can fill in the different fields before doing the search. Budget-wise I'm not sure how to charge the last few days of work as it was a "wouldn't it be neat if" project rather than something really needed. It is neat though ... 
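The last check in Andrew's list (reporting date fields that are not ISO dates) is easy to sketch. This is a hypothetical illustration in Python, not the validator's actual code; the function names are made up:

```python
import re
from datetime import datetime

# Hypothetical sketch of the validator's date check: report any
# date field whose value is not an ISO 8601 date-time.
ISO_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")

def is_iso_date(value):
    """Return True if 'value' looks like an ISO 8601 date-time."""
    if not ISO_RE.match(value):
        return False
    try:
        # strptime also rejects impossible dates like month 13.
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
        return True
    except ValueError:
        return False

def report_bad_dates(fields):
    """Given {attribute: value} pairs, list the non-ISO date fields."""
    return [name for name, value in fields.items() if not is_iso_date(value)]
```

For example, report_bad_dates({"created": "2001-12-15T22:43:36", "modified": "Dec 15 2001"}) would flag only "modified".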
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Fri Mar 3 12:34:11 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 3 Mar 2006 09:34:11 -0800 Subject: [DAS2] working das validator In-Reply-To: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com> Message-ID: Andrew, Nice work on the web interface to the validator. Before you dive back into the spec, could you troubleshoot these 500 errors I'm getting on your server? URL: http://das.biopackages.net/das/genome With the "guess" radio button I get: 500 Internal error .... TypeError: GuessFromHeader() takes exactly 2 arguments (1 given) With any other radio button I get: 500 Internal error .... AttributeError: BodyError instance has no attribute 'args' Steve > From: Andrew Dalke > Date: Fri, 3 Mar 2006 02:55:02 -0700 > To: DAS/2 > Subject: [DAS2] working das validator > > I have a running validator at > > http://cgi.biodas.org:8080/ > > > I've only tested it with SOURCES document but there's little > that would fail with the others. > > I had planned to get this up a couple days ago but I've been > distracted learning more about Javascript and a couple of Javascript > libraries. I used Mochikit to make the interactivity you see > there, and I have some ideas about how to use Dojo -- but not > for a couple of weeks. > > The code goes through the following validation steps: > > - TODO - handle if the URL is not fetchable and handle timeouts > - check that the content-type agrees with the document type > - check that it's well-formed XML; report error where not > - check that the root element matches the document type > - check that it passed the Relax-NG validation; > - report the id and href fields which are empty strings > - report if any date fields are not iso dates > > There are many more checks I could add. They are easy now > that the scaffold is there. > > I'm going to work on the next draft now. > > After that I'll get back to the validator. 
I want to add > hyperlinks on fields which are links, and I have an idea of > how to add a "SEARCH" button next to the query urls which > creates a popup where you can fill in the different fields > before doing the search. > > Budget-wise I'm not sure how to charge the last few days > of work as it was a "wouldn't it be neat if" project rather > than something really needed. It is neat though ... > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Mar 3 13:04:12 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 3 Mar 2006 11:04:12 -0700 Subject: [DAS2] working das validator In-Reply-To: References: Message-ID: <5d7729f77f8d4b6dcbd8dacd04701c19@dalkescientific.com> Hi Steve, I saw those errors in the log file but wasn't sure if they were from you or Gregg. > URL: http://das.biopackages.net/das/genome > > With the "guess" radio button I get: > > 500 Internal error > .... > TypeError: GuessFromHeader() takes exactly 2 arguments (1 given) Fixed. > With any other radio button I get: > > 500 Internal error > .... > AttributeError: BodyError instance has no attribute 'args' Fixed. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sat Mar 4 20:59:15 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 4 Mar 2006 18:59:15 -0700 Subject: [DAS2] current text of draft 3 of spec Message-ID: <5e3c38635022ba8ae291cd6c4e036eef@dalkescientific.com> I've been working on the 3rd draft for the spec. Because of the confusion in the previous version I've decided on a different approach where I jump into the middle and describe how the parts fit together before getting into the details of every element type or the theory behind the architecture. I think this flows much better. ==================== DAS is a protocol for sharing biological data. 
This version of the specification, DAS 2.0, describes features located on the genomic sequence. Future versions will add support for sharing annotations of protein sequences, expression data, 3D structures and ontologies. The genomic DAS interface is deliberately designed so there will be a large core shared with the protein sequence DAS. A DAS 2.0 annotation server provides feature information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail, an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Annotations are located on the genomic sequence with a start and end position. The range may be specified multiple times if there are alternate coordinate systems. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type. Annotations can be searched by range, type, and a properties table associated with each annotation. These searches are called feature filters. DAS 2.0 is implemented using a ReST architecture. Each document (also called an entity or object) has a name, which is a URL. Fetching the URL gets information about the document. The DAS-specific documents are all in XML. Other data types have existing widely used formats, and sometimes more than one for the same data. A DAS server may provide a distinct document for each of these formats, along with information about which formats are available. DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including: * Better support for hierarchical structures (e.g. 
transcript + exons) * Ontology-based feature annotations * Allow multiple formats, including formats only appropriate for some feature types * A lock-based editing protocol for curational clients * An extensible namespacing system that allows annotations in non-genomic coordinates (e.g. uniprot protein coordinates or PDB structure coordinates) ===== A DAS server supplies information about genomic sequence data sources. The collection of all sources, each data source, and each version of a data source are accessible through a URL. All three classes of URLs return a document of content-type 'application/x-das-sources+xml' though likely with differing amounts of detail. A 'versioned source' request returns information only about a specific version of a data source. A 'source' request returns the list of all the versioned source data for that source. A 'sources' request returns the list of all the source data, including all the versioned source data. The URLs might not be distinct. For example, a server with only one version of one data source may use the same URL for all three documents, and a server for a single organism may use the same URL for the 'sources' and 'source' documents. Most servers will list only the data sources provided by that server. Some servers combine the sources documents from other servers into a single document. These registry servers act as a centralized index and reduce configuration and network overhead. A registry server uses the same sources format as an annotation server. Here is an example of a simple sources document which makes no distinction between the three sources categories. Request: http://www.example.com/das/genome/yeast.xml Response: Content-Type: application/x-das-sources+xml All identifiers and href attributes in DAS documents follow the XML Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving partial identifiers and href attributes. 
In this case the id "yeast.xml" is fully resolved to "http://www.example.com/das/genome/yeast.xml". Here is an example of a more complicated sources document with multiple organisms, each with multiple versions. Each of the two source documents (one for each organism) has a distinct URL, as does each version for each organism. This is a pure registry server because the actual annotation data comes from other machines. Request: http://www.biodas.org/known_servers Response: Content-Type: application/x-das-sources+xml Each SOURCE id and VERSION id is individually fetchable, so the URL "http://das.ensembl.org/das/SPICEDS/" returns a sources document with the SOURCE record for "das_vega_trans" and both of its VERSION subelements, while "http://das.ensembl.org/das/SPICEDS/128/" returns a sources document with only the second of its VERSION subelements. DAS documents refer to other documents through URLs. There are no restrictions on the internal form of the URLs, other than the query string portion. Server implementers are free to choose URLs which best fit the architecture needs. For example, a simple DAS server may be implemented as a set of XML files hosted by a standard web server, while more complex servers with search support may be implemented as CGI scripts or through embedded web server extensions. The URLs do not need to define a hierarchical structure nor even be on the same machine. Compare this to the DAS1 specification, where some URLs were constructed by direct string modification of other URLs. ===== Each versioned source contains a set of segments. A segment is the largest chunk of contiguous sequence. For fully sequenced organisms a segment may be a chromosome. For partially assembled genomes where the distance between the assembled regions is not known, each region may be its own segment. If a server provides annotations in contig space then each contig is a segment. 
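The XML Base resolution rule illustrated above (a partial id like "yeast.xml" resolved against the document's URL) behaves like ordinary relative-URL resolution. A quick sketch of that behaviour in Python, using the standard library:

```python
from urllib.parse import urljoin

# XML Base resolution works like relative-URL resolution: a partial
# identifier is resolved against the base URL of the document (or the
# nearest enclosing xml:base attribute). URLs here are the example
# ones from the sources document above.
base = "http://www.example.com/das/genome/yeast.xml"

# An id equal to the document's own file name resolves to the document URL...
print(urljoin(base, "yeast.xml"))
# ...and any other relative href resolves against the same base.
print(urljoin(base, "segments.xml"))
```

Running this prints the fully resolved forms, e.g. the first call yields "http://www.example.com/das/genome/yeast.xml".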
Feature locations are specified on ranges of segments, which is why a specific set of segments is called a coordinate system. [coordinate-system] This specification does not describe how to do alignments between different coordinate systems. The sources document format has two ways to describe the coordinate system. The optional COORDINATES element uniquely characterizes the coordinate system. If two data sources have the same authority and source values then they must be annotations on the same coordinate system. The specific coordinate system is also called the "reference sequence". A versioned source may contain CAPABILITY elements which describe different ways to request additional data from a DAS server. Each CAPABILITY has a type that describes how to use the corresponding URL to query a DAS server. A CAPABILITY element of type "segments" has a query URL which returns a document of content-type "application/x-das-segments+xml". A segments document lists information about the segments in the coordinate system. Here is an example of a segments document. Request: http://www.biodas.org/das2/h.sapiens/v3/segments.xml Response: Content-Type: application/x-das-segments+xml ===== The versioned source record for an annotation server must include a CAPABILITY of type "features". A client may use the query URL from the features CAPABILITY to select features which match certain criteria. If no criteria are specified, the server must return all features unless there are too many features to return. In that case it must respond with an error message. Unless an alternate format is specified, the response from the features query is a document of content-type "application/x-das-features+xml" containing all of the matching features. Here is an example features document for a server which contains a gene and an alignment. 
Request: http://das.biopackages.net/das/genome/yeast/S228C/features.pl Response: Content-Type: application/x-das-features+xml Each feature has a unique identifier and an identifier linking it to a type record. Both identifiers are URLs and should be directly fetchable. Simple features can be located on a region of a segment. More complex features like a gapped alignment are represented through a parent/part relationship. A feature may have multiple parents and multiple parts. ===== An annotation server may contain many features while the client may only be interested in a subset, most likely features in a given portion of the reference sequence. To help minimize the bandwidth overhead, the feature query URL should support the DAS feature filter language. The syntax uses the standard HTML form-urlencoded GET query syntax. For example, here is a request for all features on Chr2. Request: http://www.example.org/volvox/1/features.cgi?inside=Chr2 Response: Content-Type: application/x-das-features+xml and here is the rather long one for all EST alignments Request: http://www.example.org/volvox/1/features.cgi? type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment Response: Content-Type: application/x-das-features+xml ===== All features are linked to a type record. DAS types do not describe a formal type system, in that DAS types do not derive from other DAS types. Instead, each type links to an external ontology term and describes how to depict features of that type. A DAS annotation server must contain a CAPABILITY element of type "types". A client may use its query URL to fetch a document of content-type "application/x-das-types+xml". The document lists all of the types available on the server. We expect that servers will have at most a few dozen types, so DAS does not support type filters. The following is a hypothetical example of a DAS annotation server providing GENSCAN gene predictions for zebrafish. 
Each feature is either of type "http://www.example.org/das/zebrafish/build19/high-type" or "http://www.example.org/das/zebrafish/build19/low-type" depending on whether the data provider determined it was a high probability or low probability prediction. Even though there are two different type records, they refer to the same ontology term, in this case the SO term for "gene". The distinction exists so that the high probability features are depicted differently from the low probability features. Request: http://www.example.org/das/zebrafish/build19/types Response: Content-Type: application/x-das-types+xml [coordinate-system] We make a distinction between "coordinate system" and "numbering system". The coordinate system is the set of segments on which features are located. The numbering system describes how to identify the specific residues in the segment. DAS uses a 0-based numbering system where the first residue is numbered "0", the second "1", and so on. Other numbering systems include 1-based coordinates and the PDB numbering system, which preserves the residue number for key residues across a homologous family by allowing discontinuities, insertions and negative values as position numbers. Andrew dalke at dalkescientific.com From nomi at fruitfly.org Mon Mar 6 03:09:22 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 6 Mar 2006 00:09:22 -0800 (PST) Subject: [DAS2] DAS/2 teleconference? Message-ID: <17419.60978.358549.246997@kinked.lbl.gov> Is there a DAS/2 teleconference tomorrow morning? Last week it didn't happen. Nomi From dalke at dalkescientific.com Mon Mar 6 04:14:30 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 6 Mar 2006 02:14:30 -0700 Subject: [DAS2] DAS/2 teleconference? In-Reply-To: <17419.60978.358549.246997@kinked.lbl.gov> References: <17419.60978.358549.246997@kinked.lbl.gov> Message-ID: Nomi: > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > happen. I plan on calling in. 
Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Mar 6 09:03:24 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 06:03:24 -0800 Subject: [DAS2] DAS/2 teleconference? Message-ID: Apologies for the mixup with the teleconference last week! Yes we're definitely on for a teleconference today at the standard time, 9:30 AM Pacific time. Thanks, Gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Nomi Harris > Sent: Monday, March 06, 2006 12:09 AM > To: DAS/2 > Subject: [DAS2] DAS/2 teleconference? > > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > happen. > Nomi > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Mon Mar 6 09:49:18 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 6 Mar 2006 09:49:18 -0500 Subject: [DAS2] DAS/2 teleconference? In-Reply-To: References: Message-ID: <200603060949.19299.lstein@cshl.edu> Hi Gregg, I'll miss the first half hour of the call today because of an overlap with an NCI teleconference. Lincoln On Monday 06 March 2006 09:03, Helt,Gregg wrote: > Apologies for the mixup with the teleconference last week! Yes we're > definitely on for a teleconference today at the standard time, 9:30 AM > Pacific time. > > Thanks, > Gregg > > > -----Original Message----- > > From: das2-bounces at portal.open-bio.org > > [mailto:das2-bounces at portal.open- > > > bio.org] On Behalf Of Nomi Harris > > Sent: Monday, March 06, 2006 12:09 AM > > To: DAS/2 > > Subject: [DAS2] DAS/2 teleconference? > > > > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > > happen. 
> > Nomi > > > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Mar 6 11:44:43 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 08:44:43 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 Message-ID: upcoming Code Sprint, March 13-17 at Affymetrix status reports coordinate system resolution via COORDINATES element features with multiple locations vs. alignments features with multiple parents ??? From lstein at cshl.edu Mon Mar 6 12:37:39 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 6 Mar 2006 12:37:39 -0500 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <200603061237.41288.lstein@cshl.edu> Hi, The teleconference system now asks me for a passcode. Previously I just had to enter the conference ID. What's up? Lincoln On Monday 06 March 2006 11:44, Helt,Gregg wrote: > upcoming Code Sprint, March 13-17 at Affymetrix > status reports > > coordinate system resolution via COORDINATES element > features with multiple locations vs. alignments > features with multiple parents > ??? > > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Mar 6 12:38:37 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 09:38:37 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 Message-ID: Please try again, it shouldn't ask for a passcode, but if it does, it's 1365. There may be some glitch in our teleconferencing... Thanks, Gregg > -----Original Message----- > From: Brian O'Connor [mailto:boconnor at ucla.edu] > Sent: Monday, March 06, 2006 9:36 AM > To: Helt,Gregg > Cc: das2 at portal.open-bio.org > Subject: Re: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 > > Hi Gregg, > > I tried calling in to the DAS conference call but it asked for a > passcode in addition to the conference ID. All I have is the conference > ID... > > --Brian > > Helt,Gregg wrote: > > >upcoming Code Sprint, March 13-17 at Affymetrix > >status reports > > > >coordinate system resolution via COORDINATES element > >features with multiple locations vs. alignments > >features with multiple parents > >??? > > > > > >_______________________________________________ > >DAS2 mailing list > >DAS2 at portal.open-bio.org > >http://portal.open-bio.org/mailman/listinfo/das2 > > > > > > From nomi at fruitfly.org Mon Mar 6 12:40:26 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 6 Mar 2006 09:40:26 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <17420.29706.575212.913804@spongecake.lbl.gov> i am calling in (800-531-3250, id: 2879055) but it is then asking me for a passcode. i tried entering 2879055 again but that didn't work. we didn't used to have a passcode, did we? can someone tell me what it is? if you prefer not to email it, you can phone me at 510 486-5078. 
Nomi From Gregg_Helt at affymetrix.com Mon Mar 6 13:10:23 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 10:10:23 -0800 Subject: [DAS2] Examples of features with multiple locations from biopackages server Message-ID: In the teleconference today, we're talking about features with multiple locations; here's an example from the biopackages server: From boconnor at ucla.edu Mon Mar 6 12:36:28 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Mon, 06 Mar 2006 09:36:28 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <440C731C.5070303@ucla.edu> Hi Gregg, I tried calling in to the DAS conference call but it asked for a passcode in addition to the conference ID. All I have is the conference ID... --Brian Helt,Gregg wrote: >upcoming Code Sprint, March 13-17 at Affymetrix >status reports > >coordinate system resolution via COORDINATES element >features with multiple locations vs. alignments >features with multiple parents >??? > > >_______________________________________________ >DAS2 mailing list >DAS2 at portal.open-bio.org >http://portal.open-bio.org/mailman/listinfo/das2 > > > From dalke at dalkescientific.com Mon Mar 13 09:00:45 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 06:00:45 -0800 Subject: [DAS2] format information for the reference server Message-ID: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com> (NOTE: the open-bio mailing lists were moved from portal.open-bio.org to lists.open-bio.org. My first email on this bounced because I sent to the old email address.) Summary of questions: - what does it mean for the annotation server to list the formats available from the reference server? - can the reference server format information be moved to the segments document? - are there formats which will only work at the segment level and not at the segments level (ie, formats which don't handle multiple records)? 
Something's been bothering me about the segments request. Currently the DAS sources request responds with something like ... This says "go to 'blah' for information about the sequence". But it says more than that. It provides metadata about the reference server. It says that the reference server can respond in 'fasta' and 'agp' formats. Hence the following are allowed from this URL http://blah/seq?format=agp -- return the assembly http://blah/seq?format=fasta -- return all sequences in FASTA format Does this mean that all annotation servers using the given reference server must list all of the available formats? If a client sees multiple CAPABILITY elements for the same query_url, is it okay to merge the list of supported formats? That is, if server X says that annotation server A supports fasta and server Y says that A supports genbank, then a client may assume A supports both fasta and genbank formats? (This makes sense to me.) Second, does it make sense to require the annotation servers to list the formats on the reference server? What about making that information available from the segments document, like this. query: http://www.biodas.org/das/h.sapiens/38/segments.cgi response: A problem with this is the lack of data saying that the segments query URL itself supports multiple formats. For example, http://www.biodas.org/das/h.sapiens/38/segments.cgi?format=fasta might support returning all of the chromosomes in FASTA format. Are there any formats which only work at the segment level and not at the segments level? That is, which only work with a single gene/chromosome/contig/etc. but don't support multiple sequences? The only one I could think of off-hand is "raw", since there's no concept of a "record" given a bunch of letters, unless the usual way is to separate them by an extra newline? 
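The merge rule proposed above (server X says a query URL supports fasta, server Y says it supports genbank, so a client may assume both) amounts to a per-URL set union. A small illustrative sketch in Python; the function and variable names are made up for the example, not part of the spec:

```python
# Sketch of the proposed client-side merge rule: format lists for the
# same query URL, seen in sources documents from different servers,
# are combined by set union. Data shapes here are illustrative only.
def merge_capability_formats(capability_lists):
    """Merge several {query_url: [formats]} mappings into one
    {query_url: set_of_formats} mapping."""
    merged = {}
    for caps in capability_lists:
        for query_url, formats in caps.items():
            merged.setdefault(query_url, set()).update(formats)
    return merged

server_x = {"http://blah/seq": ["fasta", "agp"]}     # from server X's sources doc
server_y = {"http://blah/seq": ["genbank"]}          # from server Y's sources doc
merged = merge_capability_formats([server_x, server_y])
# merged["http://blah/seq"] == {"fasta", "agp", "genbank"}
```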
If all formats are supported for both single and all segments then here is another possible response [possibility #1] I think all formats which work on the "segments" level also work on a single segment level, so another possibility is the following, which lets a given segment say that it supports more formats. [possibility #2] Here's another, using a flag to say if a format is for a single segment, the segments URL, or both (feel free to pick better names!). By default it applies to both. [possibility #3] Yet another option is [possibility #4] .. Of these I support [possibility #1], with the ability to go to [possibility #3] if there's ever a case where a given format cannot be applied to both levels. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Mar 13 09:29:28 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 06:29:28 -0800 Subject: [DAS2] id, url, uri, and iri Message-ID: Something to settle. I've been using 'id' like this > type_id = "type/est-alignment" > created = "2001-12-15T22:43:36" > modified = "2004-09-26T21:10:15" > > > > > > As Dave Howorth pointed out, most people use 'id' as an in-document identifier, and not as an identifier to link to other documents. Eg, there's a "getElementById()" method in the DOM which is meant to find DOM nodes given the id. In looking around I found that it's keyed off of the type (as determined by the schema) and not by the string 'id'. I added 'xml:id' as a possible DAS attribute, which is defined by the XML spec to work as expected for getElementById. In private email Gregg asked about using 'uri' instead of 'id' for this. I'm now leaning that way. Either 'uri' or 'url' or 'iri'. I prefer url because everyone knows what that means. Gregg prefers 'uri', I think because that's what allows fragment identifiers, and because it includes things which are other than URLs, like LSIDs. 
However, the latest thing these days is an "iri" which means "internationalized resource identifier" http://www.ietf.org/rfc/rfc3987.txt I haven't read enough of it to understand it. My first attempt says that it's okay to use "uri" because there are 1-to-1 mappings between uris and iris. Also, I don't want to test bidirectional text and I suspect there isn't yet widely used library support for iris. So I want to change the DAS use of 'id' to 'url' and say "the value of the 'url' attribute is a URI". Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Mar 13 10:38:58 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Mon, 13 Mar 2006 07:38:58 -0800 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 6 Mar 2006 Message-ID: [These are notes from last week's meeting. -Steve] Notes from the weekly DAS/2 teleconference, 6 Mar 2006 $Id: das2-teleconf-2006-03-06.txt,v 1.1 2006/03/13 15:41:03 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt CSHL: Lincoln Stein Sanger: Thomas Down Dalke Scientific: Andrew Dalke UC Berkeley: Nomi Harris UCLA: Brian O'Connor Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Agenda: ------- upcoming Code Sprint, March 13-17 at Affymetrix status reports coordinate system resolution via COORDINATES element features with multiple locations vs. alignments features with multiple parents ??? 
[ Some trouble with passcode for teleconf - hopefully fixed ] TD: The coord syst things are what we were hoping to discuss with Andreas, who won't make it today. GH: We can push this off till next week. Code Sprint ------------- LS: At sanger mon-tues for ensembl sab meeting, able to participate from tues pm to fri eve. AD: Planning to come to Affy BO: Allen and I are planning to come up to Emeryville GH: For payment, submit expenses to affy. Hotels? Marriott or Woodfin. Will send out rec's today. NH: Planning to attend at affy mon-tues, thur. [A] Ed will look into accts for andrew and brian (internet access) GH: Plan on 9-10am phone teleconf daily. Gregg can pick up people from hotel. GH: Goals/deliverables for this code sprint? LS: Write das/2 client for bioperl. Plan to plug into Gbrowse. All I need is a working server AD: Writing writeback and locks, improving validator. NH: Apollo and registry, feature types. Wrote a writer, can test in AD's validator (plan to). GH: Keep working on das/2 client for igb at affy. Hoping by then to have an affy das/2 server up and running. SC: Can help get it up GH: Can we put one in our dmz, so it's publicly accessible at least for the code sprint. [A] Steve will look into setting up publicly accessible affy das/2 test server TD: Working on getting an Ensembl das/2 server up. GH: Java middleware on top of biojava? TD: Yes. Using the biojava to ensembl bridges. EE: Getting IGB to use style sheets. AD: And/or using a proper style sheet system, if you decide what I put in there is not good enough. BO: Looking for something to do. Hoping to start on writeback. Helping separate out igb model layer. Finished rpm packages in last code sprint, this is pretty much done. GH: Guess Allen will be working on the biopackages server. BO: Waiting on spec for writeback. AD: My writeup specifies how they do writeback at Sanger, overlaps well with Lincoln's proposal. See that. GH: Need to tighten up the read-only spec.
A fair number of things to resolve. AD: A partial draft of 3rd version. Planning to update it before next sprint. Examples so people can get a feel for how things go together. GH: My agenda stuff: coord system resolution system to match annotations on same genome coming from diff servers. [A] Gregg will wait for Andreas to join in before discussing coordinate issues. GH: Feats w/ multiple locations (see email Gregg sent to the list today with examples). Current spec says if you use >1 coord system, you can have feats with multiple locations. Is this what we want to say? GH: Allen's server has feats w/ >1 location on same coord system. Do we want to allow or disallow? If disallow, how? AD: Possible use case for alignments. GH: Feat model for bioperl. Locations have multiple parts. Feats with mult locations feels similar to that. Do you have multiple children each with a loc, or do you use the align element? LS: Prefers children. That's what SO ended up doing after much arguing. Makes it easier. GH: Enforce it with the ontology. E.g., an alignment hit has alignment hsps. This forces client to understand the ontology. LS: Consider that an hsp will have scores attached to it, different cigar line. So you end up with mult children anyway. An impoverished type of alignment. Can use cigar line to indicate mismatches. Can have a single HSP and a cigar line to indicate gaps. Only one child. You don't have to have multiple locations. GH: Looking for use case of multiple locations with PCR products... My main concern is how much semantic knowledge the clients need to understand these things. Nothing in the spec that restricts mult locations. AD: Won't client just get the multiple children and not care about types? GH: I guess a simple client could do that. It disturbs me that it's up to the server how to handle multiple locations, children, vs. alignments. Will send an example. LS: Yes, this is a vague area. There should be a best-practices section in the spec.
Single match feature from begin to end. HSP children, each one covers major gaps. Cigar line w/in hsp to cover minor gaps. Can give each hsp an alignment score. GH: Main diff between locn and alignment is cigar string, and cigar string is optional. If we're allowed to use locations to designate alignments... LS: How about if we consolidate location and alignment: location has an optional cigar and then do away with alignment. Generalize location to allow for gaps. TD: Example: Aligning an est to the genome. Falls into several blocks of exact/near exact matching. If location has cigar line, could serve it up as a single feature. GH: You can do this since cigar can represent arbitrary length gaps. TD: Neat and compact way to do it. Does this scare anyone? GH: Sounds reasonable. AD: Let's do it. And will put in examples of best practices. [A] Consolidate location and alignment in spec, loc has optional cigar. GH: Feats with mult parents. Need examples to test. This is a question to people putting up servers. Will anyone have these? TD: Ensembl might do this. Exon shared between several transcripts. A toss-up between multiple parents vs. multiple copies of same exon. Think mult parents is the way to do it. LS: Flybase uses multiple parents for exons in this way. TD: Ensembl db is a many-to-many between transcripts and exons. GH: Spec says: If you have a child in the feat document, you have to include its parent; if you have a parent you must include its children. As long as this policy plays nice with that requirement, I'm ok with it. GH: Anyone else see things that need to be ironed out in spec? AD: Not yet. NH: We should write a paper about das/2. This will help get more people using it, increase the success of the spec. GH: Agreed -- good idea. We have lots of text in grant about the philosophy of das/2. NH: Can pull text from these places. Publish at a conference perhaps? ISMB, CSB2006 GH: PLoS Bioinformatics?
NH: Conference would be nice, to involve people in discussion. AD: Poster session is available for ISMB. NH: Prefers a conference talk. Paper would require something more finished and stable. Poster is too much work for little payoff. AD: Ann L complains that the only paper to cite for das is an old ref. Wants an updatable citable paper. NH: CSB will publish a proceedings. Genome informatics at CSHL (they don't publish though). NH/GH: What's the best conference to get published in these days? LS: ISMB. NH: We missed the deadline for it. LS: Biocurators meeting? NH: Can ask Sima about it. Another one: Computational Genomics (TIGR sponsored). Not published. Submit abstracts, they select talks. Halloween in Baltimore. If conf proceedings are published, you can't also submit the work as a paper; since this one doesn't publish proceedings, we could go that way and get double mileage out of it. GH: Sounds good to get something ready for a paper rather than a conference. Did a presentation at Bosc, Genome informatics last year. [A] Nomi will help get paper ready for PLoS (after code sprint) AD: Can do poster for ismb, bosc in Brazil, if I end up going. NH: ISMB deadline is 10 May, so we should get going on it. GH: Continuation grant submission, in theory has been reviewed, but haven't heard back. Maybe will take another month, to get score back. Final word? LS: Have you checked ERA Commons? They may update it there before you get the note.
(After all, the conf. call is in an hour.) That didn't happen. :( I've attached what I have so far. I'll be working on it more today, and getting things in CVS updated. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: draft3.txt URL: -------------- next part -------------- Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Mon Mar 13 11:47:32 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 13 Mar 2006 16:47:32 +0000 Subject: [DAS2] format information for the reference server In-Reply-To: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com> References: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com> Message-ID: On 13 Mar 2006, at 14:00, Andrew Dalke wrote: > Summary of questions: > - what does it mean for the annotation server to list the formats > available from the reference server? should this happen? I thought that annotation servers are described by their "coordinate system"; the registry then provides a list of available reference servers that provide the sequences for this. > Something's been bothering me about the segments request. > > Currently the DAS sources request responds with something like > > > > > > > > > ... > > > This says "go to 'blah' for information about the sequence". > > But it says more than that. It provides metadata about > the reference server. It says that the reference server can > respond in 'fasta' and 'agp' formats. I think an annotation server should not know/provide this information; this should come from the reference server / registry. > If a client sees multiple CAPABILITY elements for the same > query_url is it okay to merge the list of supported formats? that does not sound clean. > That is, if server X says that annotation server A supports > fasta and server Y says that A supports genbank then a client > may assume A supports both fasta and genbank formats? > (This makes sense to me.)
the client should ask the reference server directly what it speaks / rely on the registration server to have validated that server A indeed speaks what it says it does. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Mar 13 12:13:14 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 13 Mar 2006 09:13:14 -0800 Subject: [DAS2] DAS/2 code sprint conference starting now Message-ID: We just started the daily DAS/2 code sprint teleconference at Affymetrix. US number #: 800-531-3250 International #: 303-928-2693 Conference ID: 2879055 Passcode: 1365 From Gregg_Helt at affymetrix.com Mon Mar 13 15:48:50 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 13 Mar 2006 12:48:50 -0800 Subject: [DAS2] Problem with name feature filter on biopackages server Message-ID: I'm looking into adding the ability in the IGB DAS/2 client to retrieve features by name/id. Trying this out with the biopackages server almost gives me what I want: http://das.biopackages.net/das/genome/yeast/S228C/feature?name=YGL076C except that in the returned XML the parent feature (YGL076C) does not list its children as , though the children list YGL076C as . Any ideas? thanks! gregg From nomi at fruitfly.org Mon Mar 13 17:32:49 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 13 Mar 2006 14:32:49 -0800 (PST) Subject: [DAS2] Where to publish [was Re: Notes from the weekly DAS/2 teleconference, 6 Mar 2006] In-Reply-To: References: Message-ID: <17429.62225.230884.764469@kinked.lbl.gov> On 13 March 2006, Chervitz, Steve wrote: > NH/GH: What's the best conference to get published in these days? > LS: ISMB > NH: We missed deadline for it. > LS: Biocurators meeting? > NH: Can ask Sima about. Sima said: > Next biocurator meeting is probably in early 2007 in the UK.
> No plans at the moment to publish the proceedings, however.
>
> I think publishing soon in PLoS is a good idea.

From dalke at dalkescientific.com Mon Mar 13 18:45:04 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 15:45:04 -0800 Subject: [DAS2] URIs for sequence identifiers Message-ID: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com>

Proposals:
- do not use segment "name" as an identifier
- rename it "title" (human readable only)
- allow a new optional "alias-of" attribute which is the link to the primary identifier for this segment
- change the feature location to use the segment uri
- change the feature filter range searches so there is a new "segment" keyword and so the "includes", "overlaps", etc. only work on the given segment, as
    segment=<segment uri>
    inside=$start:$stop
    overlaps=$start:$stop
    contains=$start:$stop
    identical=$start:$stop
- If 'includes', 'overlaps', etc. are given then the 'segment' must be given (do we need this restriction? It doesn't make sense to me to ask for "annotations on 1000 to 2000 of anything")
- only allow at most one each of includes, overlaps, contains, or identical (do we need this restriction?)
- multiple segments may be given, but then range searches are not supported (do we need this restriction?)

Discussion: The discussion on this side of things was based on today's phone conference. Andreas needs data sources to work on multiple coordinate spaces. To quote from Andreas: > There are several servers that understand more than one coordinate > system and can return the same type of data in different coordinates. > (depending on which type of accession code/range was used for the > request ) E.g. there are a couple of zebrafish servers that speak > both in Chromosome and Scaffold coordinates. (reason perhaps > being that zebrafish is an organism that seems to be very difficult > to assemble ?) The current DAS system does not support this because of how it does segment identifiers.
The current scheme looks like this: .... Problem #1: We need two entry points, one to view the segments in Scaffold space, the other to view them in Chromosome space. Solution #1 (don't like it though). Add a "source=" attribute to the CAPABILITY and allow multiple segments capabilities .... I don't like it because it feels like the COORDINATES and CAPABILITY[type="segments"] field should be merged. Still, I'll go with it for now. Problem #2: feature searches return features from either namespace. Consider a search for name=*ABC* (that is, "ABC" as a substring in the "name" or "alias" fields). Then the result might be one where "A" is a short-hand notation for one of the segments. Which one? The client goes to the segment servers: Query: http://sanger/andreas/scaffolds.xml Response: Query: http://sanger/andreas/chromosomes.xml The segment name "A" matches either ChromosomeA or ScaffoldA, and there's no way to figure out which is correct! This comes about because our own naming scheme is not very good at being globally unique. We could fix it by also stating the namespace in the result, as Gregg asked "why don't we just use the URI"? After a long discussion we decided to propose just that. That is, get rid of the "name" attribute. Instead, use a "title" attribute which is human readable and an optional "alias-of" which is the primary identifier for the given segment. The alias-of value is determined by the person who defined the COORDINATES. It could be a URL. It could be a URI. It does not need to be resolvable (though it should - perhaps to a human readable document? Or to something which lists all known aliases to it?) That is, the segments document will look like this: Query: http://sanger/andreas/scaffolds.xml Response: Query: http://sanger/andreas/chromosomes.xml This has a few implications.
Feature locations must be given with respect to the segment uri. Given this segment uri a client can figure out if it is in Scaffold or Chromosome space because it can check all of the URIs in each space for a match. The other change is in range searches. Consider the current scheme, which looks like
    includes=ChrA
    includes=A/100:300
The query is of the form $ID or $ID/$start:$end. It needs to be changed to support URLs. For example,
    includes={http://www.whatever.com/ChromosomeA}
    includes={http://www.whatever.com/ScaffoldA}/100:300
We couldn't come up with a better syntax. Then Gregg asked "why do we need multiple includes"? That is, the current syntax supports
    includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000
to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000 bases of ChrB, or the 6th 1000 bases of ChrC". Given the query language, we're looking for a way to write that using URLs, as
    includes={http://www.whatever.com/ChromosomeA}/0:1000;includes={http://www.whatever.com/ChromosomeB}/2000:3000;includes={http://www.whatever.com/ChromosomeC}/5000:6000
However, that's a very unlikely query. What if we split the "includes", "overlaps", etc. into "includes_segment" and "includes_range"? In that case:
    old-style: includes=A/500:600
    new-style: includes_segment=http://www.whatever.com/ChromosomeA;
               includes_range=500:600

    old-style: includes=A/500:600,Chr3/700:800
    new-style: includes_segment=http://www.whatever.com/ChromosomeA;
               includes_range=500:600;
               includes_range=700:800

    old-style: includes=A/500:600,D/700:800
    new-style: -- NOT POSSIBLE

    old-style: includes=A/500:600,D/500:600
    new-style: (not likely to be used in real life)
               includes_segment=http://www.whatever.com/ChromosomeA;
               includes_segment=http://www.whatever.com/ChromosomeD;
               includes_range=500:600
This no longer allows searches with subranges from different segments. Then again -- who cares? Those sorts of searches are strange. Talking some more.
Who needs the ability to do more than one includes / overlaps / etc. query at a time? Gregg wants the ability to do a combination of includes and overlaps, but that's all. We can simplify the server code by only supporting one inside search, one contains search, and/or one overlaps search, instead of the current system which allows a more constructive-geometry style of query, and we can move the segment id out into its own parameter. Allen said that that would prevent more complicated types of analysis on the server, but that anyone doing more complicated searches would pull the data down locally. Does anyone want to do more than one overlaps search at a time? More than one contains search at a time? More than one identical search at a time? (For that matter, does anyone actually want to do an "identical" search? Gregg thinks it will be useful to find any other annotations which exactly match the given range. I think that might be better with an "include"/"exclude" combination to have start/end positions within a couple of bases from the specified range.) PROPOSAL: Change the range query language to have
    segment=<segment uri>
    inside=$start:$end
    overlaps=$start:$end
    contains=$start:$end
Example:
    segment=http://whatever.com/ChromosomeD;inside=5000:6000
Also, only allow at most one includes, one overlaps, and one contains (unless people want it). I'm less sure about the need for this restriction. It might be as easy to implement the more complex search as it would be to check for the error cases.
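This proposed query language is easy to exercise from a client. A minimal sketch in Python (the base URL is made up, and the assumption that parameters are joined with standard URL percent-encoding is mine; the parameter names come from the PROPOSAL above, which is itself still a proposal, not spec):

```python
from urllib.parse import urlencode

def feature_query(base_url, segment, **ranges):
    """Build a feature-filter URL in the proposed style: one segment
    parameter plus at most one each of inside/overlaps/contains/identical.
    The joining and escaping conventions here are assumptions, not spec."""
    params = [("segment", segment)]
    for kind in ("inside", "overlaps", "contains", "identical"):
        if kind in ranges:
            start, end = ranges[kind]
            params.append((kind, "%d:%d" % (start, end)))
    return base_url + "?" + urlencode(params)

# The example from the proposal: segment=...ChromosomeD, inside=5000:6000
url = feature_query("http://whatever.com/das/feature",
                    "http://whatever.com/ChromosomeD",
                    inside=(5000, 6000))
print(url)
```

Note that urlencode percent-escapes the segment URI, which sidesteps the "how do we embed a URL inside a query string" problem the braces syntax was wrestling with.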
Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Mon Mar 13 18:56:56 2006 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 13 Mar 2006 15:56:56 -0800 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> Message-ID: <441606C8.3070902@affymetrix.com> Andrew Dalke wrote: >>There are several servers that understand more than one coordinate >>system and can return the same type of data in different coordinates. >>(depending on which type of accession code/range was used for the >>request ) E.g. there are a couple of zebrafish servers that speak >>both in Chromosome and Scaffold coordinates. (reason perhaps >>being that zebrafish is an organism that seems to be very difficult >>to assemble ?) > > > The current DAS system does not support this because of how > it does segment identifiers. > > > Problem #1: We need two entry points, one to view the segments > in Scaffold space, the other to view them in Chromosome space. > > Solution #1 (don't like it though). > Add a "source=" attribute to the CAPABILITY and allow multiple > segments capabilities > Problem #2: feature searches return features from either namespace > A different solution: Scaffold and Chromosome coordinate systems are served by separate DAS/2 servers. Each server returns data from one and only one namespace. Those separate servers can, behind-the-scenes, use the same database. DAS/2 clients, like IGB, would choose to connect to either the Scaffold-based server or the Chromosome-based server, but not usually to both at once. Does this handle all the issues? 
Ed From dalke at dalkescientific.com Mon Mar 13 19:12:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 16:12:52 -0800 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <441606C8.3070902@affymetrix.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> <441606C8.3070902@affymetrix.com> Message-ID: <54829d8554d9b044908965d80b158c60@dalkescientific.com> Ed: >> Problem #2: feature searches return features from either namespace > > A different solution: > > Scaffold and Chromosome coordinate systems are served by separate > DAS/2 servers. Each server returns data from one and only one > namespace. > > Those separate servers can, behind-the-scenes, use the same database. > > DAS/2 clients, like IGB, would choose to connect to either the > Scaffold-based server or the Chromosome-based server, but not usually > to both at once. > > Does this handle all the issues? Here's the email I got from Andreas when I proposed this. >>> There may be more than one COORDINATE element if ... (XXX why?) > > There are several servers that understand more than one coordinate > system and > can return the same type of data in different coordinates. (depending > on which type of accession code/range was used for the request ) > E.g. there are a couple of zebrafish servers that speak both in > Chromosome and Scaffold coordinates. > (reason perhaps being that zebrafish is an organism that seems to be > very difficult to assemble ?) >> Will there be separate CAPABILITY items for each source? > > no. if there are then this should be registered as two independent > servers. (but see clarification later) > Allowing multiple coordinate systems per server is a way to slightly > reduce the already long list of known > servers. Currently there are about 90 in the registry (+10 in the last > few weeks...) and there still are about 20 more > which have not been registered (and are provided by the BioSapiens > project). >> Long for who? 
It isn't that much data. > > It is long for somebody who browses manually through the ensembl DAS > configuration and searches for a DAS source to add. > It's a "long" list for me to read through the DAS server list at > http://das.sanger.ac.uk/registry/listServices.jsp > and although I know this list pretty well, it seems to me a lot of > text/descriptions, etc. >> There is only one reference server for an annotation server. > > I think it should be one reference server per coordinate system. >> But if there are two COORDINATES elements, and you say that >> each has its own reference server, then aren't you saying that >> a single annotation server may have multiple reference servers? > > yes. i believe that this should be possible. >> What's the concern about having >> no more than one coordinate per data source? > > Just last friday somebody asked me how to add a DAS server that has > two coordinate systems to different Ensembl views ( ContigView and > GeneView) > Her initial solution was to provide multiple DAS sources > http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_211 > and > http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_219 > > but I think this could be joined into a single server. In any case, I think the proposal I outlined in the previous email makes things cleaner even without support for multiple coordinate systems on the same server.
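The segment-URI proposal Andrew refers to also makes the namespace question mechanically checkable on the client side: once segments are identified by URI, a client can resolve which coordinate space a feature belongs to by simple membership tests against the segments documents it has fetched. A rough sketch (Python; the space names and URIs are the hypothetical ones from this thread):

```python
def coordinate_space_of(segment_uri, spaces):
    """Return the name of the coordinate space whose segments document
    contains this segment URI, or None if it is unknown.
    'spaces' maps a space name to the set of segment URIs it defines."""
    for space_name, uris in spaces.items():
        if segment_uri in uris:
            return space_name
    return None

# Hypothetical segments lists, one per coordinate space:
spaces = {
    "Chromosome": {"http://www.whatever.com/ChromosomeA"},
    "Scaffold": {"http://www.whatever.com/ScaffoldA"},
}
print(coordinate_space_of("http://www.whatever.com/ScaffoldA", spaces))
```

With bare names ("A") the lookup above would be ambiguous; with URIs it is a plain set-membership test.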
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Mar 13 23:22:36 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 13 Mar 2006 20:22:36 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day one, 13 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day one, 13 Mar 2006 $Id: das2-teleconf-2006-03-13.txt,v 1.1 2006/03/14 04:31:36 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt Sanger: Andreas Prlic Dalke Scientific: Andrew Dalke (at Affy) UC Berkeley: Nomi Harris (at Affy) UCLA: Allen Day, Brian O'Connor (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. General note: Passcode is now required to enter teleconf. This is a change in their system. Issue: Continuation Grant ------------------------- gh: no word yet. Issue: Coordinate System ------------------------ ad: question of what happens when there are multiple coordinate systems for an assembly. auth and source, source: contig space, scaffold space auth: organization (e.g. ncbi, ucsc) gh: not enough to get uniqueness. ncbi, genome, human is not enough, need version to uniquely id the coord system ad: auth, source, species, version identification string gh: use case: need to know whether uris for two versioned sources refer to the same genome. gh: ncbi version numbers are separate from organism info, e.g. v35.
ad: we could have a service for mapping strings gh: idea - every server can say this assembly name is same as that. Clients could chain together statements from multiple servers. For the affy das server used by igb, we now have a synonyms file on our server which igb reads. It's a pain to maintain. ad: type of alignment server? gh: a synonym server. Here's a uri, give me a list of synonyms that refer to the same thing. This is something to talk more about when Andreas is on line. [Andreas joins in.] GH: How would a das server verify the version info in a sources document points to the same genome assembly? AP: You would check auth=ncbi, vers=35, taxid=human AP: In protein structure space, you check version on every object you work with. Protein seq. gh: so we have to map version info on sequences as well as genome assemblies. gh: use case: two segment responses from diff servers, diff uris for the diff sequences, how do you know they are referring to the same seq? name=chromosome21 vs name=chr21? ad: we require the same name for the same segments. gh: going to fall apart fast. no way to enforce it. People use 1, I, chr1, chromI. ee: can put this in the validation suite. aday: yes. gh: but what do you use for name: accession # for entry, string chr1, etc. gh: important since this is the name that goes to user. ad: could have one slot for computer to use, one for human consumption. ad: for segments there seem to be two diff ids: url, ad: the point of having special ids for segments is segment equivalence from different servers. Separate coordinates element that says how to merge things together. Identifiers in here that are just coordinate space ids, not necessarily for human use. Only for identifying coords. gh: but how do we get people to use it? sc: what about the idea of using checksums as identifiers for a seq? ad: problem of duplicate seqs in an assembly. e.g., same seq from chr1 and chr9. gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an exact duplicate of a region on chr9? sc: we could create the checksum on source:sequence gh: useful to have a central place to ask for diff names for the same coord system. ad: uniqueness idea: coords element, has: auth, source, version, species (optional) uniqueness says these are the names you use. gh: this can fail. What do we say happens when it fails? Should there be a way of resolving it. ad: this is where your synonym table comes in. Publish it? gh: maybe as part of the registry, knows ap: there isn't a big variety in naming because there aren't many people providing assemblies. gh: we already have 10 different synonyms for an assembly ee: this has some performance impact on igb. should have to do it. ap: we should say this is how naming works. gh: will fail. ad: is this required for this version of the spec? gh: need something that can be used now. aday: without hardwiring gh: if we don't agree during the code sprint, then it won't happen for everyone else. aday: using roman numerals for yeast since sgd uses it. ee: trouble with chrX ad: andreas: is there a place for naming of segments to use ap: no, something for the reference server, not coords ad: given these coords, here are the names that are used. ap: same as reference server. gh: maybe registry should provide: here's a coord system and here are the names you can use for ap: you would get a long list for proteins aday: a user who wants to gh: question for brian g: LSID, when you come across this for LSIDs, ncbi is auth for human genome assembly yet they have no lsid for their assembly, how do people refer to their lsid when there's no authority to say what it is? bg: you can't, no one is the authority. but you can write a resolver that queries ncbi under the cover, in your resolver you make ncbi the authority of the lsid, add namespace, object id. Then everyone has to know that your resolver is hosted at some site somewhere.
So there is no satisfactory answer. It's a problem if the authority does not host the resolver. bg: I'm at the w3c meeting at mit, providing a webified resolver, they would host a resolver, everyone would know to go to a well-known web address. bg: you start a convention, enforce it, give an error if people don't use it. gh: thinking we need it associated with registry. ap: ref server + coord system, provides ids that can be used, gh: so other ids can be used, but registry server wouldn't support it. ad: site has ftp site for downloading chromosomes, contains names for different segments in the file. How do I go from the ids in this file to the ids that Andreas describes? To make my annotations in the same space. Mapping from file from ncbi. bg: what are your use cases? write back to server? ad: user publishing locally, bg: you make a ref server. gh: experience from das1 is that everyone makes their own reference server and refers to it from their annotation server, using different names. ad: new tag 'coordinates' gh: like enforcing common names at registry server. Can use their own names, they just won't be allowed to post on the registry. ad: need documentation ap: could point to docn on reference server bg: workflow1: fish researcher looking for aberrant regions in chr7, 11 and 3, singled out the ABC transporter gene. How does that work in das/2? type 'abc' in web page for reference server? This is a gene name. ad: your client browser can go to the registry to find servers that host the assemblies for your fish. Go to those reference servers, do searches there. Will go to coord system, get a segments document, get display chromosome by title. gh: get a das features xml document saying the sequence and coordinates. gh: our discussion here is on getting the diff. ad: we don't have anything on coordinates saying which is the latest version. bg: latest build may have changed their gene coordinate. gh: mapping servers is part of our continuation grant.
Can push an annotation on one assembly to another assembly.
bg: a hard thing.
gh: that's why we're enlisting UCSC to do it!
ad: Topic: id, url, uri, iri (see email)
gh: likes uri, not url. Some things aren't really urls (resolvable). Iri might work.
ad: multiple coord elements for same ref server.
ap: originally there was one, but some use two; zebrafish guy has chrom and scaffold coordinates. or chromosomes vs. gene ids. same types, different accession codes and features.
ad: if you have a graphical browser, do you get scaffolds or chromosomes?
ap: depends on your view.
gh: if you do a segments query, do you get segments and contigs?
ap: depends on the coordinate system of the request.
ad: one capability for scaffolds and one for chromosomes?
gh: maybe
Deliverables:
[A] gregg: by end of week, load stuff from multiple servers, compare in the same view.
[A] steve will work on getting gregg's das/2 server up and running.
gh: trouble with biopackages.net server
aday: possible power outage interference.
gh: target filters have been dropped.
aday: yay!
From dalke at dalkescientific.com Tue Mar 14 10:14:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 07:14:44 -0800 Subject: [DAS2] use cases Message-ID: <8bc46502eb164882394a3f4acbe08987@dalkescientific.com> I think these cover the basic use cases. Let me know if there are other reasonable ones I should add. Use Case #1 Biologist viewing genomic region wants to add information from server www.biodas.org/das2/ . Example of use: - Go to "open DAS server" option. Type/paste URL for DAS server. + DAS viewer connects to server, verifies that it annotates the same sequence source and has under (say) 10 types, so it makes a new track for each type and does a request for all the features in the current display. Use Case #2 Biologist wants all lac repressors on build 12 of mouse. Example of use: - Start DAS viewer. Go to "find server" option. Select "mouse" from the list of "model organisms".
Select "build 12" from a pull-down menu of build descriptions. Select all the listed servers. - Go to "find annotations" option Now what? Is "lac repressor" a name? Is it a combination of a name and ontology term? Is it a pure ontology term? Use Case #3 Biologist wants to find all the annotation servers for the most recent build of H. sapiens. Example of use: - Start DAS viewer. Go to "find server" option. Type "human" (or "H. sapiens" or "Homo sapiens"). Search. + DAS viewer consults internal NCBI taxonomy table to get taxid. DAS viewer displays all matches. - Sort by build date, select all matching servers by hand Problem: DAS has no field to search by build date Use Case #4 Bioinformaticist wants to make annotations available for build v32 of human. Example of use: - Go to registry server to get a human-readable description of the COORDINATES fields for build v32. - decide to point people to a reference server instead of providing local sequence data - create the sources, types and features document - put them on a web server - go to registry and submit site for future inclusion Use Case #5 IT wants people to use local mirrors of reference server when possible. Example of use: - set up a local registry server + server connects to Andreas' registry server and downloads all the data + server rewrites "segments" sections to use local server - configure all DAS viewers to consult local registry server Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 10:13:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 07:13:44 -0800 Subject: [DAS2] using 'uri' instead of 'id' Message-ID: <9779f55861a4e800d0d21ec8d96deb8c@dalkescientific.com> Okay, I'm convinced. Where things in the spec use 'id' they will now use 'uri'. There are going to be a few wide-spread but shallow changes because of this. 
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 11:09:12 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 08:09:12 -0800 Subject: [DAS2] segments and coordinates Message-ID: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> Summary: I want to
- move the COORDINATE element inside of the CAPABILITY[type="segments"] element
- add a 'created' timestamp to the COORDINATE (for sorting by time)
- add a unique 'uri' identifier attribute to the COORDINATE (two coordinates are equal if and only if they have the same id)
- have that identifier be resolvable, to get information about the coordinate system (but perhaps leave the contents for a future spec)
In writing the documentation I've been struggling with COORDINATES. No surprise there. The current spec has COORDINATES and the "segments" capability as different elements, like (Note the 'created' timestamp to sort a list of coordinates by the time it was established.) With the current discussion on multiple coordinates, it looks like there is a 1-to-1 relationship between a COORDINATES record and a CAPABILITY record. As that's the case I want to merge them together, as in (note change from "_id" to "_uri") In talking with Andreas I think he agrees that this makes sense. Second, there's a question of identity. When are two coordinates the same? Is it when they have the same
    (authority, source, version)
or the same
    (authority, source, version, taxid)?
Since taxid is optional, what if one server leaves it out; are the two still the same? I decided to solve it with a unique identifier. Two COORDINATES are the same if and only if they have the same identifier. That identifier just happens to be a URI. It does not need to be resolvable (but should be, with the results viewable at least for humans).
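The identity rule above (two COORDINATES are the same if and only if their identifiers match) can be sketched in a few lines of Python. The class and attribute names here are only illustrative, not part of the spec:

```python
# Illustrative sketch: identity of a coordinate system is carried entirely
# by its 'uri'; the optional properties (authority, taxid, ...) do not
# participate in the comparison.
class Coordinates:
    def __init__(self, uri, **props):
        self.uri = uri      # unique identifier, which happens to be a URI
        self.props = props  # authority, source, version, taxid, created, ...

    def __eq__(self, other):
        return isinstance(other, Coordinates) and self.uri == other.uri

    def __hash__(self):
        return hash(self.uri)

a = Coordinates("http://das.sanger.ac.uk/registry/coordinates/ABC123",
                authority="NCBI", version="v22", taxid="9606")
b = Coordinates("http://das.sanger.ac.uk/registry/coordinates/ABC123",
                source="Chromosome")  # taxid omitted; still the same system
assert a == b
```

So a server that omits taxid still compares equal to one that includes it, as long as both point at the same coordinates URI.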
Let's say that

    http://das.sanger.ac.uk/registry/coordinates/ABC123

is the identifier for:

    authority=NCBI
    version=v22
    taxid=9606
    source=Chromosome
    created=2006-03-14T07:27:49

Then the following are equivalent. The only difference is the number of properties defined in the COORDINATES tag. In theory these extra values don't need to be in the COORDINATES tag. They are knowable given the uri. But that requires a discovery mechanism for the properties (eg, the COORDINATES identifier might need to be retrievable, with some format or other). There is the possibility of value mismatch, but as Andreas pointed out the registry server can do that validation pretty easily. I mentioned property discovery earlier. Given a coordinates URI there are three things you might want to know:
- what is the full list of coordinate system properties?
- what is the authoritative reference server for the coordinates?
- are there alternate reference servers?
What if that was resolvable (doesn't need to be defined for DAS, so this is hypothetical) into something like (Hmmm, those are some ugly names. I usually shy away from '-'s in element and attribute names.) OR, what if the authoritative URL also implemented the segments interface, and we added a COORDINATES element to it? Errr, I don't like that. We will be in charge of the coordinate system URIs but we won't be in charge of the primary reference server. Use Case #6. NCBI releases a new human build. Ensembl releases annotations for it and wants to put the information with Andreas' registry. Example of use: - Set up an Ensembl reference server and annotation server for the new build; test it out - Create a new coordinate system record on the registry - fill in the species, source, doc_href, etc.
fields - when finished the result is a URL, tied to coordinate info - Stick the COORDINATES information in the versioned source record - Tell the registry server to register the given versioned source URL Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 11:21:54 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 08:21:54 -0800 Subject: [DAS2] today's sprint meeting Message-ID: Gregg can't make it this morning and asked that I lead today's meeting. Here are the things I would like to talk about:

== segment identifier

Quoting from my email yesterday:
- do not use segment "name" as an identifier
- rename it "title" (human readable only)
- allow a new optional "alias-of" attribute which is the link to the primary identifier for this segment
- change the feature location to use the segment uri
- change the feature filter range searches so there is a new "segment" keyword and so the "includes", "overlaps", etc. only work on the given segment, as
      segment=
      inside=$start:$stop
      overlaps=$start:$stop
      contains=$start:$stop
      identical=$start:$stop
  http://biodas.org/feature.cgi?segment=http://whatever.com/ChromosomeD;inside=5000:6000
  (with URL escaping rules for the query string that's
  ...feature.cgi?segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000)
- If 'includes', 'overlaps', etc. are given then the 'segment' must be given (do we need this restriction? It doesn't make sense to me to ask for "annotations on 1000 to 2000 of anything")
- only allow at most one each of includes, overlaps, contains, or identical (do we need this restriction? Then again, Gregg only needs a single includes and a single overlaps; perhaps make this even more restrictive?)
- multiple segments may be given, but then range searches are not supported (do we need this restriction?)

Consensus on this side seems to be fine. The biggest worry is the increasing use of URIs in URL query strings.
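The escaped form in the example above falls out of ordinary form-encoding of the query string. A quick Python sketch (illustrative only, reusing the hypothetical ChromosomeD segment from this email):

```python
from urllib.parse import urlencode

# Build a feature range query whose 'segment' value is itself a URI.
# urlencode percent-escapes the ':' and '/' characters in both values.
params = {
    "segment": "http://whatever.com/ChromosomeD",
    "inside": "5000:6000",
}
query = urlencode(params)
url = "http://biodas.org/feature.cgi?" + query
# query == "segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000"
```

The server side reverses this with its usual query-string parser, so the URI-valued parameter round-trips intact.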
== coordinate systems

Quoting from an email I wrote recently:
- move the COORDINATE element inside of the CAPABILITY[type="segments"] element
- add a 'created' timestamp to the COORDINATE (for sorting by time)
- add a unique 'uri' identifier attribute to the COORDINATE (two coordinates are equal if and only if they have the same id)
Result looks like
- have that identifier be resolvable, to get information about the coordinate system (but perhaps leave the contents for a future spec)

== use 'uri' instead of 'id' in the spec

I've decided to go with 'uri' instead of 'id' (or 'url' or 'iri') in its various uses in the spec.

== churn

My feeling is this is the last major churn. I'm not able to keep up with the documentation writing, which makes it hard for people to get things done. Should I work with people today on getting data sources working and developing example data files for people to review? That is, examples which show and explain the various elements in the spec? I figure more people work from example than from spec description. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Tue Mar 14 11:35:07 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 14 Mar 2006 16:35:07 +0000 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <441606C8.3070902@affymetrix.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> <441606C8.3070902@affymetrix.com> Message-ID: <0cd005042c73d6080c568576a08bb987@sanger.ac.uk> > > A different solution: > > Scaffold and Chromosome coordinate systems are served by separate DAS/2 > servers. Each server returns data from one and only one namespace. > > Those separate servers can, behind-the-scenes, use the same database. > > DAS/2 clients, like IGB, would choose to connect to either the > Scaffold-based server or the Chromosome-based server, but not usually > to > both at once. > > Does this handle all the issues? Hm I see this as a possibility but what about the following: ? ? ? ?
This would be how to write one server which has two coordinate systems, according to the "one coord sys/server" rule. I think it would be shorter to provide two coordinates sections for that and only one source description... --- fyi, a yeast-by-Gene_ID server is e.g. http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_169 Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From ap3 at sanger.ac.uk Tue Mar 14 11:48:09 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 14 Mar 2006 16:48:09 +0000 Subject: [DAS2] segments and coordinates In-Reply-To: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> References: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> Message-ID: On 14 Mar 2006, at 16:09, Andrew Dalke wrote: > Summary: I want to > - move the COORDINATE element inside of the > CAPABILITY[type="segments"] element Is this really needed? > The current spec has COORDINATES and the "segments" capability > as different elements, like > > taxid="9606" created="2006-03-14T07:27:49" /> > query_id="http://localhost/das2/h.sapiens/v22/segments" /> > With the current discussion on multiple coordinates, it > looks like there is a 1-to-1 relationship between a COORDINATES > record and a CAPABILITY record. As that's the case I want > to merge them together, as in (note change from "_id" to "_uri") I think that this is a many to many relationship. Do you still want to provide the link to the reference server from an annotation server? This is not needed because the coordinates describe the reference server sufficiently. Annotation servers do not need the segments capability - only the features capability. > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > taxid="9606" created="2006-03-14T07:27:49" /> > > > In talking with Andreas I think he agrees that this makes sense.
If you really *want* to have the link back from the annotation server to the reference then I would propose to put capability under coordinates - i.e. the other way round. > Second, there's a question of identity. When are two coordinates > the same? Is it when they have the same > (authority, source, version) > the same > (authority, source, version, taxid) yes > > Since taxid is optional, what if one server leaves it out; > are the two still the same? no - because if a taxid is specified that is a restriction for one organism. no taxid means that this refers to multiple organisms. > I decided to solve it with a unique identifier. that might be good. this identifier could also be used to restrict searches on servers with many coordinate systems. > > Let's say that > http://das.sanger.ac.uk/registry/coordinates/ABC123 > is the identifier for: > authority=NCBI > version=v22 > taxid=9606 > source=Chromosome > created=2006-03-14T07:27:49 fine > Then the following are equivalent. The only difference is the > number of properties defined in the COORDINATES tag. > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" /> > > > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" > source="Chromosome"/> > > > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" > source="Chromosome" authority="NCBI" version="v22" taxid="9606" > created="2006-03-14T07:27:49" /> > o.k.
This is a lot of change to the spec given that we're already on the second code sprint, but I think it makes things clearer. Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Tue Mar 14 15:46:27 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 12:46:27 -0800 Subject: [DAS2] description and title Message-ID: <84c508c1625b5507dd511c8d1ef0f682@dalkescientific.com> Andreas' DAS registry has a description for each versioned source. See http://das.sanger.ac.uk/registry/listServices.jsp . Here's an example of what's in it:

    Machine learning approach used SWISSPROT variants annotated as
    disease/neutral as training dataset. Predictions made on all ENSEMBL
    nscSNPs as to their disease status

I've added an optional 'description' field to the versioned source record for servers that wish to provide that information. Allen's types response had 'name' and 'description' attributes. These were not in the types record. I've added 'description' and added 'title'. I've been using 'title' for short descriptions; a few words long. I've been using 'description' for plain text up to a paragraph. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 19:34:55 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 16:34:55 -0800 Subject: [DAS2] updated examples Message-ID: Checked into das CVS. das/das2/draft3/ The current (incomplete) spec is 'spec.txt'. It is already out of date. The .rnc files are up-to-date. The subdirectory "ucla" contains data from Allen's server, with the format hand-updated. A couple of things to note. I used three different ways of specifying the same namespace: This is to check that you all are doing correct namespace processing.
:) Also, I've gone ahead and added the 'SUPPORTS' element, like this This says that the server only supports 'basic' searches, which means you can only ask it for all the features. There is no feature query language. There is also 'das2queries' which says that the server supports the das2 query language. The following says that you can ask for everything or you can ask for things in the DAS2 query language. If not given, the client should assume it supports 'das2queries'. Note that 'basic' is a subset of 'das2queries'. Andrew dalke at dalkescientific.com From lstein at cshl.edu Wed Mar 15 05:46:41 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 10:46:41 +0000 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: References: Message-ID: <200603151046.43196.lstein@cshl.edu> Hi Folks, I just ran through the source request on biopackages.net and it is returning something that is very different from the current spec (CVS updated as of this morning UK time). I understand why there is a discrepancy, but for the purposes of the code sprint, should I code to what the spec says or to what biopackages.net returns? It is much more fun for me to code to a working server because I have the opportunity to watch my code run. Best, Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From lstein at cshl.edu Wed Mar 15 05:39:35 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 10:39:35 +0000 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: References: Message-ID: <200603151039.36405.lstein@cshl.edu> Hi Folks, Shouldn't the prefix to das2 requests be http://server/blahblah/das2 ?
It would make it easier for clients to load the correct parsing code and would avoid the client having to make a round-trip to the server just to determine whether it is dealing with a das/1 or das/2 server. My apologies if this has already been discussed. Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Wed Mar 15 09:32:26 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 06:32:26 -0800 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: <200603151046.43196.lstein@cshl.edu> References: <200603151046.43196.lstein@cshl.edu> Message-ID: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com> Lincoln: > I just ran through the source request on biopackages.net and it is > returning > something that is very different from the current spec (CVS updated as > of > this morning UK time). The server isn't synched with any specific version of the spec. For example, if I make a features request from http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000 I get As from the discussion a few weeks ago we shouldn't be using the standalone="no" since that says the document cannot be understood without consulting the DTD, which doesn't exist. And I don't want a DTD. Also, the namespace needs to be "http://www.biodas.org/ns/das/genome/2.00" (It's missing the 'genome') and the 'FEATURELIST' was replaced with 'FEATURES' a year ago. In the types request the commented out namespace declaration needs to be there, and the type id 'SO:ARS' needs to be escaped as it's treated as an identifier resolved with the "SO" protocol. Plus, until yesterday I didn't know about the 'name' or 'definition' attributes. These are now in the schema as 'title' and 'description'.
There are a few other differences, like problems in the taxid and empty strings for timestamps. I hand-updated examples from Allen's server yesterday, in cvs under das/das2/draft3/ucla . I found some of these during the update, though others I pointed out about a year ago. Allen doesn't want to update the server until the spec is stable, for two reasons. First, he doesn't like the churn of doing work only to have to make more changes. Second, you're not the only one who says > It is much more fun for me to code to a working > server because I have the opportunity to watch my code run. and Allen's setup doesn't have the ability to implement two versions at the same time. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 09:46:39 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 06:46:39 -0800 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: <200603151039.36405.lstein@cshl.edu> References: <200603151039.36405.lstein@cshl.edu> Message-ID: > Shouldn't the prefix to das2 requests be http://server/blahblah/das2 > ? > > It would make it easier for clients to load the correct parsing code > and would > avoid the client having to make a round-trip to the server just to > determine > whether it is dealing with a das/1 or das/2 server. It doesn't need the round-trip. It can look at the Content-Type to figure that out. Plus, few of the DAS1 servers follow the DAS1 naming scheme. Here's a list from Andreas' registry server. genome.cbs.dtu.dk:9000/das/tmhmm/ genome.cbs.dtu.dk:9000/das/netoglyc/ das.ensembl.org/das/ens_sc1_ygpm/ atgc.lirmm.fr/cgi-bin/das/MethDB/ smart.embl.de/smart/das/smart/ supfam.org/SUPERFAMILY/cgi-bin/das/up/ mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ All of them do have the substring '/das/' somewhere, but not at the start/end of the string. 
Now, the content-type might be "application/xml" and not sufficient to disambiguate between the two documents, but in that case you can dispatch based on the root element type. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 10:05:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:05:52 -0800 Subject: [DAS2] XML namespaces Message-ID: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com> I mentioned this yesterday but am doing it again as its own email. This is a quick tutorial on XML namespaces. The DAS spec uses XML namespaces. XML didn't start with namespaces. They were added later. Older parsers, like SAX 1.0, did not understand namespaces. Newer ones, like SAX 2.0, do. By default a document does not have a namespace. For example,

    <person/>

has no namespace. To declare a default namespace use the 'xmlns' attribute. All attributes which start 'xml' or are in the 'xml:' namespace are reserved.

    <person xmlns="http://www.biodas.org/"/>

This is the name 'person' in the namespace 'http://www.biodas.org/'. The namespace is an opaque identifier. It leverages URIs in part because it's much easier to guarantee uniqueness. The combination of (namespace, tag name) is unique. The tag name is also called the "local name". That's to distinguish it from a "qualified name", also called a "qname". These look like

    <abc:person xmlns:abc="http://www.biodas.org/"/>

This element has identical meaning to the previous element using the default namespace. Its qname is 'abc:person' but the full name is the tuple of ("http://www.biodas.org/", "person"). For notational convenience this is sometimes written in Clark notation, as {http://www.biodas.org}person

    Element                                        Clark notation
    <person/>                                      person
    <person xmlns=""/>                             {}person   ("empty namespace" is different than "no namespace")
    <person xmlns="http://biodas.org/"/>           {http://biodas.org/}person
    <abc:person xmlns:abc="http://biodas.org/"/>   {http://biodas.org/}person
    <xyz:person xmlns:xyz="http://biodas.org/"/>   {http://biodas.org/}person

The prefix used doesn't matter. Only the combination of (namespace, local name) is important. The Clark notation string captures that as a single string, which is much easier when doing comparisons.
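Namespace-aware parsers hand you this Clark-style full name directly. As a small Python aside (illustrative, not from the thread), xml.etree.ElementTree reports every parsed tag as '{namespace}localname', so two spellings with different prefixes compare equal as plain strings:

```python
import xml.etree.ElementTree as ET

# Same full name, written with a default namespace and with a prefix.
default_ns = ET.fromstring('<person xmlns="http://biodas.org/"/>')
prefixed = ET.fromstring('<abc:person xmlns:abc="http://biodas.org/"/>')
assert default_ns.tag == prefixed.tag == "{http://biodas.org/}person"

# No namespace declared at all: just the local name.
plain = ET.fromstring('<person/>')
assert plain.tag == "person"
```

This is exactly why the single-string form is convenient for dispatching on root element type.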
For example, if you try the dasypus verifier at http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features one of the output messages is Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES' but got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line 3, column 2 This shows the Clark name for the elements, indicating that the root element has a different namespace and local name from what Dasypus expects. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 10:15:40 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:15:40 -0800 Subject: [DAS2] xml namespaces Message-ID: related to the previous email. The spec uses the namespace http://www.biodas.org/ns/das/genome/2.00 I propose using a smaller and simpler URL. The content does not matter to XML processors. The practice though is to use a URI which is resolvable for more information about the element. For example, xmlns:xlink="http://www.w3.org/1999/xlink" Go to that and the response is > This is an XML namespace defined in the XML Linking Language (XLink) > specification. > > For more information about XML, please refer to The Extensible Markup > Language (XML) 1.0 specification. For more information about XML > namespaces, please refer to the Namespaces in XML specification. Similarly the XHTML namespace URI is http://www.w3.org/1999/xhtml XSLT is http://www.w3.org/1999/XSL/Transform FOAF is http://xmlns.com/foaf/0.1/ which points to the actual documentation. I like the last approach and propose that DAS2 use the namespace http://biodas.org/documents/das2/ Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 10:22:14 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:22:14 -0800 Subject: [DAS2] xml namespaces In-Reply-To: References: Message-ID: Me: > I propose using a smaller and simpler URL. ...
> I like the last approach and propose that DAS2 use the namespace > > http://biodas.org/documents/das2/ But it's such a minor point that not changing it is fine with me. On the other hand, Allen's server doesn't give the right namespace and Gregg's client currently ignores the namespace, so there isn't any extra work. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 10:29:56 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:29:56 -0800 Subject: [DAS2] search by segment id Message-ID: <712b5b29c53161455f3d9d1b34768937@dalkescientific.com> One thing I came up with yesterday when moving from local identifiers to URIs for the segment names. There are two possible identifiers for a given segment. The local name is "http://localhost/das2/segment/chr1" while the well-known global name (of which the local name is an alias) is "http://dalkescientific.com/human35v1/chr1" The global name can be anything. It can be "urn:lsid:chr1" or anything else. It only needs to be unique across all identifiers. Now, are range queries done with the local name or the global one? That is, features?segment=http://localhost/das2/segment/chr1&range=100:200 or features?segment=http://dalkescientific.com/human35v1/chr1&range=100:200 (or features?segment=urn:lsid:chr1&range=100:200 if that was the uri) If it's the local name then the client must first query all servers to get the mapping from global name to local name, and perform the translation itself. I propose that the client can query using the global name, and not need to do the mapping to the local name. In addition, a server may support both names in the query, since by using URIs we guarantee there are no accidental id collisions. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Wed Mar 15 10:34:06 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 15 Mar 2006 15:34:06 +0000 Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: References: <200603151039.36405.lstein@cshl.edu> Message-ID: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk> > > genome.cbs.dtu.dk:9000/das/tmhmm/ > genome.cbs.dtu.dk:9000/das/netoglyc/ > das.ensembl.org/das/ens_sc1_ygpm/ > atgc.lirmm.fr/cgi-bin/das/MethDB/ > smart.embl.de/smart/das/smart/ > supfam.org/SUPERFAMILY/cgi-bin/das/up/ > mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ all these servers match the DAS 1 spec, which says that the second-to-last bit is "das" and the last bit is the "data source name". The registry contains a check for that. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From td2 at sanger.ac.uk Wed Mar 15 10:16:25 2006 From: td2 at sanger.ac.uk (Thomas Down) Date: Wed, 15 Mar 2006 15:16:25 +0000 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: References: <200603151039.36405.lstein@cshl.edu> Message-ID: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> On 15 Mar 2006, at 14:46, Andrew Dalke wrote: > Plus, few of the DAS1 servers follow the DAS1 naming scheme. Here's > a list from Andreas' registry server. > > genome.cbs.dtu.dk:9000/das/tmhmm/ > genome.cbs.dtu.dk:9000/das/netoglyc/ > das.ensembl.org/das/ens_sc1_ygpm/ > atgc.lirmm.fr/cgi-bin/das/MethDB/ > smart.embl.de/smart/das/smart/ > supfam.org/SUPERFAMILY/cgi-bin/das/up/ > mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ These all look fine to me -- but they're URLs for individual data sources, rather than complete server installations. Remove the last element and you'll get a server URL (e.g. genome.cbs.dtu.dk:9000/das/) which ends /das/ in all cases. The registry records datasources, not server installations.
In general, I'm not sure a server installation is a terribly "interesting" object, since it's quite possible that one server installation will host many datasources with little or no semantic connection between them -- the only thing they have in common is that they're hosted at the same site. Thomas. From lstein at cshl.edu Wed Mar 15 10:41:46 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 15:41:46 +0000 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com> References: <200603151046.43196.lstein@cshl.edu> <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com> Message-ID: <200603151541.47538.lstein@cshl.edu> I'll use your hand-edited examples for testing. Lincoln On Wednesday 15 March 2006 14:32, Andrew Dalke wrote: > Lincoln: > > I just ran through the source request on biopackages.net and it is > > returning > > something that is very different from the current spec (CVS updated as > > of > > this morning UK time). > > The server isn't synched with any specific version of the spec. For > example, if I make a features request from > > http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/ > 0:1000") > > I get > > > "http://www.biodas.org/dtd/das2feature.dtd"> > xmlns="http://www.biodas.org/ns/das/2.00" > xmlns:xlink="http://www.w3.org/1999/xlink" > xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature"> > > > As from the discussion a few weeks ago we shouldn't be using the > standalone="no" > since that says the document cannot be understood without consulting > the DTD, which doesn't exist. And I don't want a DTD. > > Also, the namespace needs to be > "http://www.biodas.org/ns/das/genome/2.00" > (It's missing the 'genome') and the 'FEATURELIST' was replaced with > 'FEATURES' a year ago. 
> > In the types request > > > > > > xmlns:xlink="http://www.w3.org/1999/xlink" > xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/"> > name="ARS" definition="A sequence that can autonomously replicate, as a > plasmid, when transformed into a bacterial host."> > > > the commented out namespace declaration needs to be there, and the type > id 'SO:ARS' needs to be escaped as it's treated as an identifier > resolved > with the "SO" protocol. Plus, until yesterday I didn't know about the > 'name' or 'definition' attributes. These are now in the schema as > 'title' and 'description'. > > There are a few other differences, like problems in the taxid and > empty strings for timestamps. I hand-updated examples from Allen's > server yesterday, in cvs under das/das2/draft3/ucla . I found some > of these during the update, though others I pointed out about a > year ago. > > Allen doesn't want to update the server until the spec is stable, > for two reasons. First, he doesn't like the churn of doing work only > to have to make more changes. Second, you're not the only one who says > > > It is much more fun for me to code to a working > > server because I have the opportunity to watch my code run. > > and Allen's setup doesn't have the ability to implement two versions > at the same time. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D.
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From lstein at cshl.edu Wed Mar 15 10:49:40 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 15:49:40 +0000 Subject: [DAS2] XML namespaces In-Reply-To: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com> References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com> Message-ID: <200603151549.41773.lstein@cshl.edu> I have just finished adding XML namespace support to the early-version Perl DAS2 client. BTW, if a namespace tag is reused in an inner scope with a different Andrew K. Dalke I put middle into namespace http://addresses.com/address/2.0 and put first and last into namespace http://foo.bar.das. This is the correct scoping behavior, right? Lincoln On Wednesday 15 March 2006 15:05, Andrew Dalke wrote: > I mentioned this yesterday but am doing it again as its own email. > This is a quick tutorial on XML namespaces. > > The DAS spec uses XML namespaces. XML didn't start with namespaces. > They were added later. Older parsers, like SAX 1.0, did not understand > namespaces. Newer ones, like SAX 2.0, do. > > By default a document does not have a namespace. For example, > > > > has no namespace. > > To declare a default namespace use the 'xmlns' attribute. All > attributes which start 'xml' or are in the 'xml:' namespace are > reserved. > > > > This is the name 'person' in the namespace 'http://www.biodas.org/'. > The namespace is an opaque identifer. It leverages URIs in part > because it's much easier to guarantee uniqueness. > > The combination of (namespace, tag name) is unique. The tag > name is also called the "local name". > > That's to distinguish it from a "qualified name", also called > a "qname". These look like > > > > This element has identical meaning to the previous element > using the default namespace. 
It's qname is 'abc:person' but > the full name is the tuple of > > ("http://www.biodas.org/", "person") > > For notational convenience this is sometimes written in Clark > notation, as > {http://www.biodas.org}person > > Element Clark notation > person > {}person > ("empty namespace" is different than "no > namespace") > > > {http://biodas.org/}person > > {http://biodas.org/}person > > {http://biodas.org/}person > > The prefix used doesn't matter. Only the combination of > (namespace, local name) > is important. The Clark notation string captures that as a single > string, > which is much easier when doing comparisons. > > For example, if you try the dasypus verifier at > > http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/ > genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features > > one of the output messages is > > Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES' > but > got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line > 3, column 2 > > This shows the Clark name for the elements, indicating that the root > element has a different namespace and local name from what Dasypus > expects. > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Wed Mar 15 10:53:11 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:53:11 -0800 Subject: [DAS2] Shouldn't prefix be /das2? 
In-Reply-To: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk> References: <200603151039.36405.lstein@cshl.edu> <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk> Message-ID: <0e5d03e0bc2f9ab791a891f058ca664b@dalkescientific.com> Andreas (and Thomas) >> genome.cbs.dtu.dk:9000/das/tmhmm/ >> genome.cbs.dtu.dk:9000/das/netoglyc/ > all these servers match to the DAS 1 spec which says that the second > to last bit > is "das" and the last bit is the "data source name". > The registry contains a check for that. Ahh, right. I misremembered and thought that "/das" had to be immediately after the hostname. Looking now there can be an arbitrary prefix. What I remembered was the servers at http://das.bcgsc.ca:8080/das which don't have regular names. Then again, they have nearly bit-rotted away. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 11:04:38 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 08:04:38 -0800 Subject: [DAS2] XML namespaces In-Reply-To: <200603151549.41773.lstein@cshl.edu> References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com> <200603151549.41773.lstein@cshl.edu> Message-ID: <2de39a4a831f6a06c408bdf31ef2a41f@dalkescientific.com> Lincoln: > BTW, if a namespace tag is reused in an inner scope with a > different > > > Andrew > xmlns:das="http://addresses.com/address/2.0">K. > Dalke > > > I put middle into namespace http://addresses.com/address/2.0 and put > first and > last into namespace http://foo.bar.das. > > This is the correct scoping behavior, right? Yes. I tested it with an XML processor and it says the following is equivalent (after fixing a typo). Andrew K. Dalke BTW, it should be "P." :) Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 10:58:15 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:58:15 -0800 Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> References: <200603151039.36405.lstein@cshl.edu> <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> Message-ID: Thomas: > The registry records datasources, not server installations. In > general, I'm not sure a server installation is a terribly > "interesting" object, since it's quite possible that one server > installation will host many datasources with little or no semantic > connection between them -- the only thing they have in common is that > they're hosted at the same site. I agree. The only thing that's interesting about the server installation is knowing who is in charge when it goes down. :) That's found from the MAINTAINER element at the level of the sources document. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Wed Mar 15 11:37:51 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 15 Mar 2006 08:37:51 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day two, 14 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day two, 14 Mar 2006 $Id: das2-teleconf-2006-03-14.txt,v 1.1 2006/03/15 16:47:50 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E. Sanger: Andreas Prlic, Thomas Down Dalke Scientific: Andrew Dalke (at Affy) UC Berkeley: Nomi Harris (at Affy) UCLA: Allen Day (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. 
Agenda: ---------- See Andrew's email. Here's a summary. * segment ids * coord systems and how to handle [Gregg is out, Andrew is leading the teleconf.] ap: ad proposed changes re: coords and capabilities i think is not really needed. the question is do annotation servers need to provide to link to reference servers back. If the link is apparent from, c ad: summary: moving coord element inside capabilities element (one part of 4 things mentioned). the reason: coords and capabilities are tied together. They refer to the same thing. E.g., you need know which of the segments are tied to which coords. ap: annotation server does need to, it can find the reference server by the coordinates. ad: if you have local coords, and you want to point to a local server, how do you specify that this segment corresponds to these coords. ap: you should have a reference server that speaks the coords you want to annotate. td: if you have your own assembly you have your own coord system, ad: yes, and i set up my own ref server for it. ad: if I have mult coords, won't I have multiple segments? isn't there a 1:1 relationship between coords and segments? ap: I think many:many.... wait td: each segment is a member of one coord system, a coord system contains many segments. ad: andreas has features, some annotated on scaffold, some annotated on chromosome. So, you need the ability to have two segments provided by server. ap: coords should contain segment capabilities, i.e., the other way around. ad: proposing to have a uri to id the coords, capapbility should have a field to say the coord uri is 'this' mailed out the idea to have a unique identifier for coords. keep them separate now, have the ability sc: optional? ad: yes only needed if you have mult coord systems. ad: like features and feature type. segment is saying it's of that type ad: will add optional id to the capability, so that you can figure out what the segments are. 
in proposal this am, 1) timestamp to coord info (optional) -- use case: sort by most recent coord system for a given build. 2) unique id for the coord ( ap: this will be useful for searches as well. can request only results from a particular coord system. (see email discussion this am) td: server alignment btwn human and mouse, you can say whether you are referencing human or mouse just by specifying coord system. ad: also two different human assemblies. ap: I have to leave now. Topic: Segment identifiers email td: segment had a name and url form id so that feature server doesn't have to give a concrete url for the seq of chrm22, nice for lightweight server sans sequence. getting rid of ability to reference sequence by name instead of url breaks this. You need a concrete url if you just want to serve features on a sequence. You end up having to rewrite urls rather than saying this feature is attached to chr22 in xxx coord system. ad: one thing gregg and I discussed, the fact that url is by itself an opaque id, you have to resolve it someway, http, or something else too. You can use any mechanism you want to turn the name you want. ad: in segments list, if you have your own local copy. Your segments section says my local copy is td: you need a segments capability. I can't have a server that uses only features capabilities. ad: if you have your own segments. if all your features are described using standard names/ids, no you don't need a segments capability. td: ok, my assembly is human build 35, and feature lives on chr22. ad: yes. every place you see optional alias attribute link back to primary id of segment, that id can be anything. td: arbitrary string scoped by the coord system, which now has a uri id string. ad: yes. and it's also globally unique, not scoped just by coord system . td: I don't see what's wrong with .... ad: we were discussing yesterday having diff names for the same chromosome. chrI vs chr1. 
td: that can be addressed using aliases ad: alias of field provides a synonym table for what you map locally to a global id. td: you're saying the global ids have to be universally unique even when taken out of the coord system ad: yes. feat server providing feats from two diff coord systems, you need a way to distinguish one segment from another segment, in a global sense. td: I don't totally understand cases involving mult coord systems. How do I find out which of three possible coord systems a given segment came from? ad: td: all clones in embl system. could be a lot. ad: your client will have to know how to look up the right one. if you have one coord system that has all your clones, you have to do the look up anyway to know where to display the features from the various clones. td: suppose looking for gene names: you get back a feature on clone AL19823. I want to start from that feature and build a meaningful display. So I need to work out what coord system this feature lives on. If my server speaks multiple coord systems, one for all embl accessions and gi ids, I have to test for membership in the set. My server could put the coord system id on each feature. Would be optional for servers only attached to one coord system. ad: right. Andreas also wants coord uri part of feature filter. Could add it to the feature filter. td: yes. give me all genes called xyz. Do you always want to limit to one coord system? ad: I see your point. Having to search ad: New thing called title for humans to read. Also proposed inside, overlaps, contains so they don't td: to avoid a nastiness in query lang, I like that. Removes an issue that scares me about having urls in the query. pathological case: client has a good reason to retrieve features on part of a two sequences that have lots of features on. e.g., all cutting sites for all restriction enzymes. Very high density. If the genome is made of 10kb clones, the user may want to get features that span clone boundaries. 
server may do lots of extra fetching that's not really necessary. ad: it's the number of requests that's the issue, same amout of info. so it's an issue of network overhead. advantage: makes servers easier to implement since it eliminates searching partial regions. Some use cases exists, but can be done on the client side. td: seems a shame to lose the capability, but not a huge loss. the alternative would be to say that you parse the query string left to right. overlaps=5000-10000; ... puts limits on how server parses. ad: or we propose a new query interface ad: this sounds like I should go ahead with segment ids. ad: using uri vs id (internal link id vs link to something else) td: seems to be enough impl-breaking changes, not a big argument either way. ad: enough changes going on now, but probably won't change much more. td: if you want to make a small change that's quick to implement, no objections. Also fine with using id, since all dom stuff about id refers to things marked id in the scheme, not attrib names. Changing to uri, won't cause much effect. nh: like a gobal replace. ad: in general there's been lots of changes, want people to get clients/servers going. ad: spec writing is going slow, would like to show examples that people can use. nh: feature parsing can use canned examples. aday: would prefer to have spec written, trouble with ambiguity ad: you need to impl before you can figure out how to write it. nh: server people need full spec, client can use examples ad: previous slow going since lincoln had little time to work on it. aday: would like a snapshot, version number. impl after last code sprint. nh: don't have time to work on das after this. will just break when/if allen's server changes. This just happens when working on developing spec. ad: the idea is to get code and examples up today. td: waiting for spec to stabilize a bit. ad: changes made this week won't have major impact on people's work in UK? td: no. 
nh: can you provide a changes document? ad: those would be my emails. a pain. nh: registry, I was surprised to find a versioned sources in it. won't there be an explosion of org x versions x server. It provides convenience td: as long as it's not thousands and thousands of data sources, it won't be a problem. ad: 2k per server x 1000 servers, = 2M td: if it gets to point where retrieving whole registry is a problem, we could add capability to restrict what you get. nh: need human-friendly title for each data source. would be nice if that explained more to the person who was choosing that data source (e.g., date). ad: Andreas' system (web-based) has a description. Status reports -------------- sc: adding more data to affy das server, working on building das2_server code recently checked into genoviz code base by gregg. Then will work on setting it up on a publicly accessible server at affy. ee: will be working on style sheets in igb. aday: spent time on setting up dev environment since laptop died yesterday. bo: got food poisoning -- bad pizza?, was up till 4am. td: not much das-related stuff yet. From Steve_Chervitz at affymetrix.com Wed Mar 15 16:24:59 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 15 Mar 2006 13:24:59 -0800 Subject: [DAS2] New affymetrix das/2 development server Message-ID: Gregg's latest spec-compliant, but still development-grade, das/2 server is now publicly available via http://205.217.46.81:9091 It's currently serving annotations from the following assemblies: - human hg16 - human hg17 - drosophila dm2 Send me requests for any other data sources that would help your development efforts. Example query to get back a das-source xml document: http://205.217.46.81:9091/das2/genome/sequence Its compliance with the spec is steadily improving, on a daily if not hourly basis during the code sprint. Within IGB you can access this server from the DAS/2 servers tab under 'Affy-temp'.
You'll need the latest version of IGB from the CVS repository at http://sf.net/projects/genoviz Steve From dalke at dalkescientific.com Wed Mar 15 16:25:53 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 13:25:53 -0800 Subject: [DAS2] on local and global ids Message-ID: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> The discussion today was on local segment identifiers vs. global segment identifiers. I'm going to characterize them as "abstract" vs. "concrete" identifiers. An abstract id has no default resolution to a resource. A concrete one does. The identifier "http://www.biodas.org/" is a concrete identifier because it has a default resolver. "lsid:ncbi:human:35" is an abstract identifier because it has no default resolver (though there are resolvers for lsid, they are not default resolvers.) The global segment identifier may be a concrete identifier. It may implement the segments interface. But who is in charge of that? Who defines and maintains the service? If it goes down (power outage, network cable cut), then what does the rest of the world do? For the purposes of DAS it is better (IMO) that the global identifiers be abstract, though they should be http URLs which are resolvable to something human readable. (This is what the XML namespace elements do.) Reference servers are concrete identifiers. They exist. They can change (eg, change technologies and change the URLs, say from cgi-bin/*.pl to an in-process servlet.) Now, they should be long-lived, but that's not how life works. Suppose someone wants to set up an annotation server, without setting up a reference server. One solution is to point to an existing reference server. In this case all the features are returned with segments labeled as in the reference server. There's no problem.
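Andrew's abstract-vs-concrete distinction comes down to whether an identifier's scheme carries a default resolution mechanism. A minimal sketch of that test, assuming Python; the helper name and the particular scheme list are my own illustration, not anything from the spec:

```python
# Schemes with a default resolver (plain fetch works with no extra
# configuration). Illustrative assumption, not a spec-defined list.
RESOLVABLE_SCHEMES = {"http", "https", "ftp"}

def identifier_kind(uri):
    # "concrete" = has a default resolution to a resource; "abstract" = does not
    scheme = uri.split(":", 1)[0].lower()
    return "concrete" if scheme in RESOLVABLE_SCHEMES else "abstract"

assert identifier_kind("http://www.biodas.org/") == "concrete"
assert identifier_kind("lsid:ncbi:human:35") == "abstract"
```

The point of the sketch: an lsid can be resolved, but only through some out-of-band resolver, which is exactly why Andrew wants DAS global identifiers to be abstract yet still spelled as http URLs that dereference to something human readable.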
Second, Andreas wants an abstract "COORDINATE" space id This requires a more complicated client because it must have other information to figure out how to convert from the coordinate identifier into the corresponding types. The answer that Andreas and others give is "consult the registry". That is, look for other segments CAPABILITY elements with the same coordinates id. For that to happen there needs to be a way to associate a segments doc with a coordinate system. For example, this is what the current spec allows (almost - there's no example of it and I'm still trying to get the schema working for it) This makes a resolution scheme from an abstract coordinate identifier into a concrete segments document identifier. Why are there so many fields on the coordinates? It could be normalized, so you fetch the coordinate id to get the information. It's there to support searches. A goal has been that the top-level sources document gives you everything you need to know about the system. (Doesn't mean it's elegant. I won't talk about alternatives. It's not important. There's at most an extra 150 or so bytes per versioned source.) The problem comes when a site wants a local reference server. These segments have concrete local names. DAS1 experience suggests that people almost always set up local servers. They do not refer to a well-known server. There are good reasons for doing this. If the local annotation server works then the local reference server is almost certain to work. The well-known server might not work. Also, the configuration data is in the sources document. There's no need to set up a registry server to resolve coordinates. There's no configuration needed in the client to point to the appropriate concrete identifier given an abstract URL. My own experience has been that people do not read specifications. I am an odd-ball. According to http://diveintomark.org/archives/2004/08/16/specs I am an asshole. That's okay -- most people are morons.
> Morons, on the other hand, don't read specs until someone yells at > them. Instead, they take a few examples that they find "in the wild" > and write code that seems to work based on their limited sample. Soon > after they ship, they inevitably get yelled at because their product > is nowhere near conforming to the part of the spec that someone else > happens to be using. Someone points them to the sentence in the spec > that clearly spells out how horribly broken their software is, and > they fix it. Someone who wants to implement a DAS reference server will take the data from somewhere and make up a local naming scheme. That's what happened with DAS1. That's why Gregg was saying he maintains a synonym table saying

  human
    1 = chr1 = Chromo1 = ChrI
    2 = chr2 = Chromo2 = ChrII

This will not change. People will write a server for local data and point a DAS client at it. The client had better just work for the simple case of viewing the data even though there is no coordinate system -- it needs to, because people will work on systems with no coordinate system. Sites will even write multiple in-house DAS servers providing data, which work because everything refers to the same in-house reference server. It's only the first time that someone wants to merge in-house data with external data that there's a problem. This might be several months after setting up the server. At that point they do NOT want to rewrite all the in-house servers to switch to a new naming scheme. That's why the primary key for a paired annotation server and feature must be a local name. That's what morons will use. Few will consult some global registry to make things interoperable at the start. > For example, some people posit the existence of what I will call the > "angel" developer. "Angels" read specs closely, write code, and then > thoroughly test it against the accompanying test suite before shipping > their product.
Angels do not actually exist, but they are a useful > fiction to make spec writers to feel better about themselves. Lincoln could come up with universal names for every coordinate system that ever existed or will exist. But people will not consult it. However, they will when there is a need to do that. The need comes in when they want to import external data. At that point they need a way to join between two different data sources. They consult the spec and see that there's a "synonym" (or "reference", or "global", or "master" or *whatever* name -- I went with synonym because it doesn't imply that it's the better name.) The local name + "segment/ChrI" is also known as http://dalkescientific.com/yeast1/ChrI . Simple, and requires very little change in the server code. The only other change is to support the synonym name when doing segment requests, as segment=http://dalkescientific.com/yeast1/ChrI This is important because then clients can make range requests from servers without having to download the segment document first. It's also easy to implement, because it's a lookup table in the web server interface, and not something which needs to be in the database proper. Most people are morons. The spec as-is is written for that. It's not written for angels. It allows post-facto patch-ups once people realize they need a globally recognized name. It does require smarter clients. They need to map from local name to global name, through a translation table provided by the server. This is fast and easy to implement. It's easier to implement than consulting multiple registry servers and trying to figure out which is appropriate. And the XML returned will be smaller. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 17:39:36 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 14:39:36 -0800 Subject: [DAS2] xml namespace uri Message-ID: Please use "http://biodas.org/documents/das2" for the XML element namespace. 
The two current servers (Allen's and Steve's) use "http://www.biodas.org/ns/das/2.00" which is wrong according to the spec; for the last 2 years it's been "http://www.biodas.org/ns/das/genome/2.00" Since the servers need to change anyway, might as well make it something a bit more readable, and shorter. :) I've checked all the current dasypus (validator) software into CVS, btw, and updated all of the example xml (draft3/ucla/) to use the new namespace. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Mar 16 00:17:24 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 21:17:24 -0800 Subject: [DAS2] query language description Message-ID: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> The query fields are

  name      | takes  | matches features ...
  ==========================================
  xid       | URI    | which have the given xid
  type      | URI    | with the given type or subtype (XX keep this one???)
  exacttype | URI    | with exactly the given type
  segment   | URI    | on the given segment
  overlaps  | region | which overlap the given region
  inside    | region | which are contained inside the given region (XX needed??)
  contains  | region | which contain the given region (XX needed?? )
  name      | string | with a name or alias which matches the given string
  prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests. For example, if the features query URL is 'http://biodas.org/features' and there is a segment named 'http://ncbi.org/human/Chr1' then the following is a request for all the features on the first 10,000 bases of that segment The query is for

  segment = 'http://ncbi.org/human/Chr1'
  overlaps = 0:10000

which is form-urlencoded as

  http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000

Multiple search terms with the same key are OR'ed together.
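The worked example above can be reproduced mechanically. A minimal sketch, using ';' as the pair separator as in the email and percent-encoding only the values; the variable names are illustrative:

```python
from urllib.parse import quote

# Build the feature-filter URL from the example in the text.
base = "http://biodas.org/features"
terms = [
    ("segment", "http://ncbi.org/human/Chr1"),
    ("overlaps", "0:10000"),
]
# Percent-encode each value; a repeated key (OR semantics) would
# simply contribute another key=value pair.
query = ";".join("%s=%s" % (key, quote(value, safe="")) for key, value in terms)
url = base + "?" + query
# url == "http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000"
```

Note that `quote(..., safe="")` is what turns the ':' and '/' characters of the segment URI into %3A and %2F, so the segment URI can travel safely as a query value.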
The following searches for features containing the name or alias of either BC048328 or BC015400 http://biodas.org/features?name=BC048328;name=BC015400 Multiple search terms with different keys are AND'ed together, but only after doing the OR search for each set of search terms with identical keys. The following searches for features which have a name or alias of BC048328 or BC015400 and which are on the segment http://ncbi.org/human/Chr1 http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400 The order of the search terms in the query string does not affect the results. If any part of a complex feature (that is, one with parents or parts) matches a search term then all of the parents and parts are returned. (XXX Gregg -- is this correct? XXX) The fields which take URLs require exact matches. I think we decided that there is no type inferencing done in the server; it's a client side thing. In that case the 'type' field goes away. We can still keep 'exacttype'. The URI used for the matching is the type uri, and NOT the ontology URI. (We don't have an ontology URI yet, and when we do we can add an 'ontology' query.) The segment URI must accept the local identifier. For interoperability with other servers they must also accept the equivalent global identifier, if there is one. If range searches are given then one and only one segment is allowed. Multiple segments may be given, but then ranges are not allowed. The string searches support a simple search language.

  ABC   -- contains a word which exactly matches "ABC" (identity, not substring)
  *ABC  -- words ending in "ABC"
  ABC*  -- words starting with "ABC"
  *ABC* -- words containing the substring "ABC"

If you want a field which exactly contains a '*' you're kinda out of luck. The interpretation of whitespace in the query or in the search string is implementation dependent. For that matter, the meaning of "word" is implementation dependent. (Is *O'Malley* one word?
*Lethbridge-Stewart*?) When we looked into this last month at Sanger we verified that all the databases could handle %substring% searches, which was all that people there wanted. The Affy people want searches for exact word, prefix and suffix matches, as supported by the back-end databases. XXX CORRECT ME XXX The 'name' search searches.... It used to search the 'name' attribute and the 'alias' fields. There is no 'name' now. I moved it to 'title'. I think I did the wrong thing; it should be 'name', but it's a name meant for people, not computers. Some features (sub-parts) don't have human-readable names so this field must be optional. The "prop-*" is a search of the elements. Features may have properties, like To do a string search for all 'membrane' cellular components, construct the query key by taking the string "prop-" and appending the property key text ("cellular_component"). The query value is the text to search for. prop-cellular_component=membrane To search for any cellular_component containing the substring "mem" prop-cellular_component=*mem* The rules for multiple searches with the same key also apply to the prop-* searches. To search for all 'membrane' or 'nuclear' cellular components, use two 'prop-cellular_component' terms, as http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear The range searches are defined with explicit start and end coordinates. The range syntax is in the form "start:end", for example, "1:9". Let 'min' be the smallest coordinate for a feature on a given segment and 'max' be one larger than the largest coordinate. These are the lower and upper bounds for the feature. An 'overlaps' search matches if and only if min < end AND max > start XXX For GREG XXX What do 'inside' and 'contains' do? Can't we just get away with 'excludes', which is the complement of 'overlaps'?
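Under the half-open convention above (min is the smallest coordinate, max is one past the largest), the range predicates look like this. 'overlaps' is the rule spelled out in the email; the 'inside' and 'contains' readings are my assumptions, since the text explicitly leaves them open:

```python
def overlaps(fmin, fmax, start, end):
    # matches iff min < end AND max > start, per the definition above
    return fmin < end and fmax > start

def inside(fmin, fmax, start, end):
    # assumed meaning: the feature lies entirely within the query region
    return fmin >= start and fmax <= end

def contains(fmin, fmax, start, end):
    # assumed meaning: the feature spans the entire query region
    return fmin <= start and fmax >= end

assert overlaps(5000, 15000, 0, 10000)
assert not overlaps(10000, 11000, 0, 10000)  # touching at the boundary is not an overlap
assert inside(100, 200, 0, 10000) and not inside(9999, 10001, 0, 10000)
assert contains(0, 20000, 0, 10000)
```

With half-open intervals the strict inequalities in 'overlaps' fall out naturally: a feature whose max equals the query's start merely abuts it and does not match.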
Searches are done as: Step 0) specify the segment Step 1) do all the includes (if none, match all features on segment) Step 2) do all the excludes, inverted (like an includes search) Step 3) only return features which are in Step 1 but not in Step 2) Step 4) ... Step 5) Profit! I think this will support your smart code, and it's easy enough to implement. Everyone but you was planning to use 'overlaps'. Only you wanted to use 'inside'. Anyone want to use 'contains'? Andrew dalke at dalkescientific.com From td2 at sanger.ac.uk Thu Mar 16 04:24:03 2006 From: td2 at sanger.ac.uk (Thomas Down) Date: Thu, 16 Mar 2006 09:24:03 +0000 Subject: [DAS2] on local and global ids In-Reply-To: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> Message-ID: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> On 15 Mar 2006, at 21:25, Andrew Dalke wrote: > > The problem comes when a site wants a local reference server. > These segments have concrete local names. > > DAS1 experience suggests that people almost always set up local > servers. They do not refer to a well-known server. I'm not sure that DAS1 experience is a good model for this. It's true that people didn't always point to well-known reference servers, but I think this has more to do with the fact that people didn't know which server to point to. Some people did set up their own reference servers. 
That's what the coordinate system stuff in DAS/2 is for. If this is documented properly I don't think we'll see many "end-user" sites setting up their own reference servers unless a) they want an internal mirror of a well-known server purely for performance/bandwidth reasons or b) they want to annotate an unpublished/new/whatever genome assembly. (Actually, some of the "annotation providers set up their own reference servers" stuff might be my fault -- early versions of Dazzle were pretty strict about requiring a valid [and functional!] MAPMASTER for every datasource, so this pushed people towards setting up reference servers.) Thomas. From lstein at cshl.edu Thu Mar 16 06:03:49 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 16 Mar 2006 11:03:49 +0000 Subject: [DAS2] on local and global ids In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> Message-ID: <200603161103.50323.lstein@cshl.edu> I think it will help considerably to have a document that lists the valid sequence IDs for popular annotation targets. I've spoken with Ewan on this, and Ensembl will generate a list of IDs for all vertebrate builds. I'll take responsibility for creating IDs for budding yeast, two nematodes and 12 flies. Lincoln On Thursday 16 March 2006 09:24, Thomas Down wrote: > On 15 Mar 2006, at 21:25, Andrew Dalke wrote: > > The problem comes when a site wants a local reference server. > > These segments have concrete local names. > > > > DAS1 experience suggests that people almost always set up local > > servers. They do not refer to a well-known server. > > I'm not sure that DAS1 experience is a good model for this. It's > true that people didn't always point to well-known reference servers, > but I think this has more to do with the fact that people didn't know > which server to point to. Some people did set up their own reference > servers. 
Many didn't, and many of those didn't give a valid > MAPMASTER URL at all. This situation didn't actually cause too much > trouble since a lot of these users just wanted to add a track to > Ensembl -- which doesn't care about MAPMASTER URLs and just trusts > the user to add tracks that live in an appropriate coordinate system. > > I'd still argue that the majority -- probably the vast majority -- of > people setting up DAS servers really just want to make an assertion > like "I'm annotating build NCBI35 of the human genome" and be done > with it. That's what the coordinate system stuff in DAS/2 is for. > If this is documented properly I don't think we'll see many "end- > user" sites setting up their own reference servers unless a) they > want an internal mirror of a well-known server purely for performance/ > bandwidth reasons or b) they want to annotate an unpublished/new/ > whatever genome assembly. > > (Actually, some of the "annotation providers set up their own > reference servers" stuff might be my fault -- early versions of > Dazzle were pretty strict about requiring a valid [and functional!] > MAPMASTER for every datasource, so this pushed people towards setting > up reference servers.) > > Thomas. > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From lstein at cshl.edu Thu Mar 16 06:06:38 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 16 Mar 2006 11:06:38 +0000 Subject: [DAS2] Spec freeze In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> Message-ID: <200603161106.39074.lstein@cshl.edu> Hi, I just spoke with Thomas and Andreas on this, and all three of us are experiencing difficulty coding to a changing spec. In my opinion the spec is really good right now and issues such as whether to use "uri" or "id" as attribute names are not germane. Can I propose that we declare a three-month spec freeze starting at midnight tonight (GMT)? Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Thu Mar 16 10:38:00 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 07:38:00 -0800 Subject: [DAS2] on local and global ids In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> Message-ID: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com> Thomas: > I'm not sure that DAS1 experience is a good model for this. It's true > that people didn't always point to well-known reference servers, but I > think this has more to do with the fact that people didn't know which > server to point to. I think I said there are two cases; there's actually several 1. the sources document states a well-known COORDINATES and makes no links to segments 2. 
the sources document refers to a well-known segments server ("the" reference server) and no COORDINATES 3. the source document has a segments document, and each segment listed uses URIs from "the" reference server 4. the server implements its own coordinates server, with new segment ids 5. When uploading a track to Ensembl there's no need to have either COORDINATE or segments -- the upload server can verify for itself that the upload uses the right ids. The *only* concern is with #4. Everything else uses the well-known global identifier for segments. > I'd still argue that the majority -- probably the vast majority -- of > people setting up DAS servers really just want to make an assertion > like "I'm annotating build NCBI35 of the human genome" and be done > with it. I'm fine with that. There are two ways to do it. #1 and #2 above. In theory only one of those is needed. The document can point to "the" reference server for NCBI 35. In practice that's not sufficient because there is no authoritative NCBI 35 server. Hence COORDINATES provides an abstract global identifier describing the reference server. > That's what the coordinate system stuff in DAS/2 is for. If this is > documented properly I don't think we'll see many "end-user" sites > setting up their own reference servers unless a) they want an internal > mirror of a well-known server purely for performance/bandwidth reasons > or b) they want to annotate an unpublished/new/whatever genome > assembly. A philosophical comment. I'm a distributed, self-organizing kinda guy. I don't think single root centralized systems work well when there are many different groups involved. I think many people will use the registry server, but not all. I think there will be public DAS servers which aren't in the registry. I know there will be in-house DAS servers which aren't. I'm just about certain that some sites will have local copies of the primary data. They do for GenBank, for PDB, for SWISS-PROT, for EnsEMBL. 
Why not for DAS? That said, here's a couple of questions for you to answer: a) When connecting to a new versioned source containing only COORDINATES data, what should the client do to get the list of segments, sizes, and primary sequence? I can think of several answers. My answer is that the versioned source should state the preferred reference server and unless otherwise configured a client should use that reference server and only that reference server. Yes, all the reference servers for that coordinate system are supposed to return the same results. But that's only if they are available. There are performance issues too, like low bandwidth or hosting the server on a slow machine. The DAS client shouldn't round-robin through the list until it finds one which works because that could take several minutes to timeout on a single server, with another 10 to try. Yes, a client can be configured and told "for coordinate system A use reference server Z". But that's a user configuration. b) If there is a local mirror of some reference server, how should the local DAS clients be made aware of it? (And should this be a supportable configuration? I think so.) I'm pretty sure that most DAS clients won't be configurable to look for local servers instead of global ones. Even if they are, I'm pretty sure each will have a different way to do so. Apollo and Bioperl will use different mechanisms. I have no good answer for this. It sounds like your answer is "people won't have local copies." I think they will. Ideas: - have a rewriting registry server which does a rewrite of the information from the other servers. But this doesn't work because the feature result from the remote server (in my scheme) is given using its local segment names. There's no way to go from that local name to the appropriate mirror reference server. This suggests that the results really do need to be given through global ids, with no support for local ones. 
The segments result optionally provides a way to resolve a global name through a local resource. - set up an HTTP proxy service for DAS requests which transparently detects, translates and redirects to the appropriate local resource. Cute, but not likely to be done in real life. c) A group has been working on a new genome/assembly. The data is annotated on local machines using DAS and DAS writeback. Finally it's published. Do they need to rewrite all their segment identifiers to use the newly defined global ones? As there are only a few places where the segment identifier is used, and it's an interface layer, I think the conversion is easy. But it is a flag day event which means people don't want to do it. Instead, it's more likely that local people will set up a synonym table to help with the conversion. There are perhaps a dozen groups which might do this and they all have competent people. This should not be a problem. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Mar 16 11:06:26 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 08:06:26 -0800 Subject: [DAS2] on local and global ids In-Reply-To: <200603161103.50323.lstein@cshl.edu> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <200603161103.50323.lstein@cshl.edu> Message-ID: Lincoln: > I think it will help considerably to have a document that lists the > valid > sequence IDs for popular annotation targets. I've spoken with Ewan on > this, > and Ensembl will generate a list of IDs for all vertebrate builds. > I'll take > responsibility for creating IDs for budding yeast, two nematodes and 12 > flies. What should people use if these aren't defined? Like now? If everyone must use the same well-defined global id for the features response then doesn't that mean we can't have any DAS servers until this document is made? 
Is the general requirement that the first person to make a server for a given build/genome/etc. is the one who gets to define the global ids? Or is it Andreas at Sanger who defines the names? Suppose one group in California starts defining names for, say, the barley genome. Another group in, say, Germany, is also working on the barley genome. They hate each other's guts and don't work together, so they make their own names. The names refer to the same thing because it was a group in Japan which produced the genome. Do we wait for an alignment service? An identity service? before people can merge data from these two groups? Maybe we can solve all this by having an identity mapper format. And defer defining that format until there is a problem. There is no perfect solution. This is a sociological problem. Gregg's current client, I think, used hard-coded knowledge about the mapping between the two current servers. Then again, his code already supports a synonym table. Andrew dalke at dalkescientific.com From gilmanb at pantherinformatics.com Thu Mar 16 10:52:51 2006 From: gilmanb at pantherinformatics.com (Brian Gilman) Date: Thu, 16 Mar 2006 10:52:51 -0500 Subject: [DAS2] on local and global ids In-Reply-To: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com> Message-ID: <441989D3.90202@pantherinformatics.com> Hey Guys, Where's the latest spec and use case document? Sorry if this is a super dumb question. I couldn't find it on the website. Best, -B Andrew Dalke wrote: >Thomas: > > >>I'm not sure that DAS1 experience is a good model for this. It's true >>that people didn't always point to well-known reference servers, but I >>think this has more to do with the fact that people didn't know which >>server to point to. >> >> > >I think I said there are two cases; there's actually several > > 1. 
the sources document states a well-known COORDINATES > and makes no links to segments > 2. the sources document refers to a well-known segments server > ("the" reference server) and no COORDINATES > 3. the source document has a segments document, and each segment > listed uses URIs from "the" reference server > 4. the server implements its own coordinates server, with > new segment ids > 5. When uploading a track to Ensembl there's no need to have > either COORDINATE or segments -- the upload server can > verify for itself that the upload uses the right ids. > > >The *only* concern is with #4. Everything else uses the well-known >global identifier for segments. > > > >>I'd still argue that the majority -- probably the vast majority -- of >>people setting up DAS servers really just want to make an assertion >>like "I'm annotating build NCBI35 of the human genome" and be done >>with it. >> >> > >I'm fine with that. There are two ways to do it. #1 and #2 above. >In theory only one of those is needed. The document can point to >"the" reference server for NCBI 35. > >In practice that's not sufficient because there is no authoritative >NCBI 35 server. > >Hence COORDINATES provides an abstract global identifier describing >the reference server. > > > >> That's what the coordinate system stuff in DAS/2 is for. If this is >>documented properly I don't think we'll see many "end-user" sites >>setting up their own reference servers unless a) they want an internal >>mirror of a well-known server purely for performance/bandwidth reasons >>or b) they want to annotate an unpublished/new/whatever genome >>assembly. >> >> > >A philosophical comment. I'm a distributed, self-organizing kinda >guy. I don't think single root centralized systems work well when >there are many different groups involved. > >I think many people will use the registry server, but not all. >I think there will be public DAS servers which aren't in the registry. 
>I know there will be in-house DAS servers which aren't. > >I'm just about certain that some sites will have local copies of >the primary data. They do for GenBank, for PDB, for SWISS-PROT, >for EnsEMBL. Why not for DAS? > >That said, here's a couple of questions for you to answer: > > a) When connecting to a new versioned source containing only >COORDINATES data, what should the client do to get the list >of segments, sizes, and primary sequence? > >I can think of several answers. My answer is that the versioned >source should state the preferred reference server and unless >otherwise configured a client should use that reference server >and only that reference server. > >Yes, all the reference servers for that coordinate system >are supposed to return the same results. But that's only if >they are available. There are performance issues too, like >low bandwidth or hosting the server on a slow machine. The >DAS client shouldn't round-robin through the list until it >finds one which works because that could take several minutes >to timeout on a single server, with another 10 to try. > >Yes, a client can be configured and told "for coordinate >system A use reference server Z". But that's a user >configuration. > > b) If there is a local mirror of some reference server, how >should the local DAS clients be made aware of it? (And >should this be a supportable configuration? I think so.) > >I'm pretty sure that most DAS clients won't be configurable >to look for local servers instead of global ones. Even if >they are, I'm pretty sure each will have a different way >to do so. Apollo and Bioperl will use different mechanisms. > >I have no good answer for this. It sounds like your answer >is "people won't have local copies." I think they will. > >Ideas: > - have a rewriting registry server which does a rewrite of >the information from the other servers. 
But this doesn't >work because the feature result from the remote server (in >my scheme) is given using its local segment names. There's >no way to go from that local name to the appropriate mirror >reference server. This suggests that the results really do >need to be given through global ids, with no support for >local ones. The segments result optionally provides a way >to resolve a global name through a local resource. > > - set up an HTTP proxy service for DAS requests which >transparently detects, translates and redirects to the >appropriate local resource. Cute, but not likely to be >done in real life. > > c) A group has been working on a new genome/assembly. The >data is annotated on local machines using DAS and DAS writeback >Finally it's published. Do they need to rewrite all their >segment identifiers to use the newly defined global ones? > >As there are only a few places where the segment identifier is >used, and it's an interface layer, I think the conversion is >easy. But it is a flag day event which means people don't >want to do it. Instead, it's more likely that local people >will set up a synonym table to help with the conversion. > >There are perhaps a dozen groups which might do this and they >all have competent people. This should not be a problem. 
> > Andrew > dalke at dalkescientific.com > >_______________________________________________ >DAS2 mailing list >DAS2 at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/das2 > > > > From dalke at dalkescientific.com Thu Mar 16 11:33:58 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 08:33:58 -0800 Subject: [DAS2] on local and global ids In-Reply-To: <441989D3.90202@pantherinformatics.com> References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com> <441989D3.90202@pantherinformatics.com> Message-ID: <24b985c0229970562a9e2612f00f2da5@dalkescientific.com> Brian: > Where's the latest spec and use case document? Sorry if this is a > super dumb question. I couldn't find it on the website. CVS for the spec. The history is: draft 1 - written by Lincoln, freeze for summer last year. This is the one with HTML, etc. and is on the web site. draft 2 - written by me in January. In CVS under das/das2/new_spec.txt with examples under das/das2/scratch . This was the version for the sprint last month draft 3 - under development I rewrote the beginning of it because no one liked the pedantic pedagogical style it used. This draft starts with examples. The incomplete version, as of Monday morning, is das/das2/draft3/spec.txt However, I am slow at writing spec text, especially new text. Instead of working on it more I put example output files in das/das2/draft3/ucla/ starting with 'sources.xml' in that directory. As for use cases, the email you saw from me a couple of days ago is the only thing even close to formal. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Thu Mar 16 12:05:10 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Thu, 16 Mar 2006 17:05:10 +0000 Subject: [DAS2] sources responses Message-ID: <355af8b441fefe8690a9e78de55fc2f9@sanger.ac.uk> Hi! 
the (toy) sources responses at http://www.spice-3d.org/dasregistry/das1/sources/ http://www.spice-3d.org/dasregistry/das2/sources/ now are updated to the latest spec and validate with Andrew's validator at http://cgi.biodas.org:8080/ Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Steve_Chervitz at affymetrix.com Thu Mar 16 15:37:16 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 16 Mar 2006 12:37:16 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day three, 15 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day three, 15 Mar 2006 $Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt Sanger: Thomas Down, Andreas Prlic CSHL: Lincoln Stein Dalke Scientific: Andrew Dalke (at Affy) UCLA: Allen Day, Brian O'Connor (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. [Notetaker: joining 10 min into the discussion] ls: how does synonym business work? ad: if server has access to data... ls: we ask server for the global id, uses same global id for segments, and uses same global id for the sequence. gh: to do this in the capabilities for annot server, the global id for segments query points to reference server. 
ls: if the local machine current server, has sequence capabilities, then it passes global id for segments to current server and it gets the sequence. if it doesn't have that capability, then we need to figure out a way for it to get the sequence. the easiest way to do that would be to resolve that url and fetch it. I'm open to any suggestion. I don't see how this uri/synonym is getting us any closer to being able to find the server where sequence can be fetched. The synonym isn't always a fetchable thing. ad: syn is a global id ad: look at the uri for the segment and fetch it from there ls: could be a remote url. gh: segments query is only thing that gives segment url segments capabilities for the annot server should point ls: break apart segments into: id=a string, then have an attribute seq_url, when fetched returns the seq. returns the bases. ad: is that what's there already? ls: no, uri is an id ad: every url is an id, but it's up to whim of the server ls: i don't want people to think it's for an id. want an agreed upon uri identifier, then optionally have a url. turn synonym into uri, turn uri into resolver make uri required, bases not required. ad: additional constraint is 'agreed upon'. what about when a group starts a new sequencing project. There is no globally known uri for it yet. ls: they just create their own ids td: the natural authority is the creator of the assembly. gh: ncbi won't do it. they don't have a das server, unlikely to. ls: can point to genome assembly. can create a url that will return bases from ncbi in a supported format. this approach will disentangle issue of resolvable vs non-resolvable, local vs non-local segment ids and how to get segment dna. gh: I think this will work. ad: 'this' changing key names? ls: key semantics uri is required, global identifier sequence is an optional pointer gh: you say that for feat xml, the id for seq will be the globally agreed on id. 
ls: yes ad: if you don't have a local copy, if you have ability to map global identifiers, then you know what it is from the coordinates. there are two ways to specify coordinates: coordinates and segments ad: if you just need the segments and some identifier. only when you need to do an overlay with someone else that you need the coords. gh: no, coords don't say anything about ids of coord (?) gh: if we do it the way lincoln proposed, then the logical way to relate those is that the segments capabilities points to ref server. ad: when feat returns a location is it in global or local space? gh: lincoln - global space ls: every annot server will know length of its landmarks (chrms). some people will not want to be served dna, they will point somewhere else where to get the dna. There will be many places to get dna for a given global id, they choose one they like. ls: feature locations are given in global id ad: this changes the way it's been working. xml:base issues ls: I know. gh: if base of sequence and base of features are different, the xml will get bigger. ls: so an argument for having local ids is so you can make location string shorter. gh: yes. ls: probably not worth it ad: also makes it easier to set up a basic server. if you want to overlay them, yes you do. ls: you can always set up a local server if you gh: segments response local and global id as we talked about yesterday (which one feature locatn is relative to) gh: if the only way to overlay for a client to know things are in the same coord system is segid=xxxx and globalid=yyyy, how much harder is it for server to use global ids. ls: server can have configuration file to know where its global ids are coming from aday: would need to think about it more. ad: who will set up these identifiers (yeast, human) ls: I'll do it for model org databases, I will specify segments, and their dna fetchers and will look up their lengths. gh: versions? ls: most recent. community can then keep it up to date. 
I bet ensembl will be happy to generate this file automatically with every build (for vertebrates) ad: local id uri, and a bunch of synonyms. People will set up own server not referencing a global system. ls: then client would do a closure over all systems. imagine three servers: server-a says here is my segment server-b says it can be b or c server-c says it can be c or a so you have to do a join over all servers gh: not encourage people to do that with local seq ids, encourage people to use. need a global referencing system to say this uri is same as that uri. ad: bad logic for the web. If one is wrong, could be a problem td: (proposal - based on genomic coord alignments) ad: that says only alignable things are the same. ad: don't think it will work, they will already have local servers gh: what about 'the stick': people who want to register their server with central registry can only do so if they use global ids for their segments. ls, td: fine ad: if they've been working for a while in house, they would have a big effort to retrofit their system to comply. just won't do. ls: in draft 3, where's assembly info? ad: same as before. ask segments for agp format. draft not complete. gh: the thing that ids which assembly you're on is the coordinates element (authority, taxonomy, ...) ls: authority is a recognized, globally unique organization. Should it be a uri? ad: authority and version is human visible so people can search by it. ls: fine. gh: can invoke the 'stick' idea here: if you're trying to register something on same genome assembly, then registry can check your segments to verify they are agreed upon. ls: taxon, source, authority, version all must match ad: also an id ap: we discussed in email ad: the only stuff that is complete is in the ucla subdir. ls: the examples are definitive ad: yes, unless we change things today. ls: what if taxon, source, version match but uri doesn't? registry gets submission. 
makes a segments request on submitter, if it gets a list of same segment identifiers, it accepts it. what if it gets a subset? gh: ok ls: superset is not ok. aday: why? gh: if you allow subset and superset, you can have everything. aday: use case: bacteria with extra plasmid identifier. nh: signing off. will be at affy tomorrow. ls: you would have to create your own coord system. gh: could argue with maintainer to add it. ls: can you have multiple coordinates in a given assembly? aday: proposal: make coords an attribute of the segment. could keep your segment references local. ls: we shouldn't give people ways to create new names. human chr1 ncbi build 35 should be something that everybody can agree on. gh: then we wouldn't allow allen's use case where someone wants a superset of what's in reference? ls: add new coord tag to source version entry, says I'm creating a superset consisting of coords from ref 1, 2, 3, any of these can be a new namespace that I set up. gh: how do you know which ones come from where? right now there's no way to get coord for a segment. ad: can as of yesterday afternoon. ls: to indicate which segments come from which auth. put coord id into segments tag. aday: thank you! ad: alternative proposal - multiple segments use case: when you have scaffolds or chromosomes, or mouse and yeast ls: say you want human mouse scaffolds + chrms, and human chrms three diff coords tags in the sources document each one gives auth, taxon, etc. when client goes to get segments, it will get human chromosomes, mouse chrms, and mouse scaffolds, in one big list, each will point back to coord it got in features requests. gh: knowing what coordinates doesn't tell you global id for segment aday: ok. gh: multiple segments elements vs mult coords in a segment work for me. ad: what does a client do gh: ... ls: three types of entry points, hu chrms, mo chrms, mo scaffolds, now tell me what you want to start browsing. human readable. 
scaffold on mouse with name xxx from two
ad: displaying all together vs one or the other or the other.
ee: affymetrix use case in igb. [probe
gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...
ls: let's rename das to distributed annotation research network. then we can say "darn1, darn2"!
ad: gregg's request for search to find everything identical (start and end are same)
td: if you have contained and inside, you can do identical with an and operation.
ls: doesn't make server any more complicated, for completeness you may want to do that.
ad: how about includes 1-5000 and excludes ... some of this is aesthetic.
ls: overlaps, contains, contained-in have good use cases. exact match - maybe searching for curated exons that exactly match predicted.
[Lincoln has to leave.]
gh: drawing options for segments and coordinate systems. [whether you put a coords tag per segment, or some capabilities, one for each coord system]
    allen's approach - one query with filter, or multiple fetches
aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in last two years
aday: uri
ad: xml namespace issue - what do we call it (see email)
gh: you pick it
ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was only one who felt strongly about it being required, and he's not here.
gh: feature xml, every feature can have multiple locations. features can represent alignments (collapsed alignment tag into feature tag)
td: like it
gh: naive user - given a feat with multiple locations on genome, represent as multiple locations, or parent child relations?
td: don't see as a problem.
using parent-child you have things to say about child features specific to them
gh: genscan prediction, a problem: one server can serve them up as parent child or as multiple locations on parent. four child exons in one case, four diff locations in other case. problem is with feat filters. if you do an overlaps query and any children meet the condition, you have to return the parent as well and its parents on up. agreed?
ad: yes
gh: works fine for parent child, but for multiple location situation, if inside query fully contains only two exons, do you return parent?
td: I'd assume inside query would return both. as long as one exon is inside the region, the parent is returned. define inside as applying to any level.
gh: so even though the transcript is not inside, you still return it?
td: using the get parent-if-get-children rule
gh: rule must apply to all of them, so you don't get transcript since it doesn't meet the inside condition.
aday: multiple locations makes sense - just aligned mult times. human alu feature, 100,000s, do you want to create a single feature, or just a single identifier and put it in many different locations.
ee: that is for alignments, not parent-child relationship
aday: you consider location as an attribute of the object..
ee: I agree. alu is only one object, but the exon-transcript are different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path, so we need to figure out how to deal with it. feat filters is clear for parent child relationship.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within bounds of parent. with insides, we have a difference of opinion. one exon is within, do you return it?
ad: most clients will be doing overlaps, you are the only one doing insides. what do you want?
gh: the multiple locations muddies the issue. if parent child rule is you only return it if parent is inside (and recursive parent), I've already optimized for that. For multiple locations, I can catch that and handle it. the way I want, the behaviour of mult location will be diff than parent child.
td: for me, the overlaps is the most important thing. Andreas just gets everything.
ad: can we delegate to gregg here for what to do in case of inside.
[A] gregg will write up description for inside query and multiple locations

Status reports
-----------------
gh: updating server. overlaps, insides, types, and each. good news: latest genome assembly on human on affy server overlayed with allen's server. using hardcoded knowledge in igb for assembly id, not coordinates yet. with andrew: making sure clients can understand any variants of namespace usage in the xml. get client to use more capabilities like links
ad: example data set together, updated schema to latest spec, but forgot cigar thing. update validator to use most recent version of rnc schemas.
gh: even if your server isn't public you can cut and paste into the validator at http://cgi.biodas.org:8080
aday: biopackages up to date with version 200 of spec file. issues for nomi, and gregg. off by one error.
bo: small code refactor in the das server. testing that today.
ee: nothing das related yet, but will. implementing style sheets to get colors for features.
ap: registry ui for upload of a das/2 source. coding for that
gh: what about registry rejecting segment ids if they don't match standard ids for that coord system. sound good to you?
ap: basically yes.
td: not done a great deal
gh: Nomi has been here working on apollo client. we'll hear from her tomorrow.

-----------------------
post teleconf discussion re: using global identifiers for uri
[Notetaker: just a few morsels were captured here.]
ad: most folks i work with get something going locally, then after it's going, hook it up with the rest of the world, integrate with other people. they don't want to revamp their work in order to do that.
gh: slightly in favor with andrew
ad: get what we have now. they are still uri's so it's just an interpretation. will change attributes to be 'uri' and 'reference_uri'
gh: how does it get length of segments?
ad: good idea to have coordinates and segments in the document. add your own track to ensembl, you don't need to give it a segments, just specify coordinates.
gh: seems like it will encourage servers that can only work with particular clients.
ad: what about getting rid of coordinates, just needed by Andreas for registry.

From Steve_Chervitz at affymetrix.com Thu Mar 16 15:38:13 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:38:13 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day four, 16 Mar 2006
Message-ID:

Notes from DAS/2 code sprint #2, day four, 16 Mar 2006

$Id: das2-teleconf-2006-03-16.txt,v 1.1 2006/03/16 20:45:48 sac Exp $

Note taker: Steve Chervitz

Attendees:
    Affy: Steve Chervitz, Gregg Helt
    CSHL: Lincoln Stein
    Dalke Scientific: Andrew Dalke (at Affy)
    Sanger: Andreas Prlic
    UC Berkeley: Nomi Harris (at Affy)
    UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.
Status reports
---------------
nh: apollo work, reading the registry, saving capabilities. modifications to code that was based on prototype das adaptor. Generally lots of under the hood work to bring it up to spec.
bo: diff functionality between allen's biopackages.net server and andrew's sample xml. Updated templates in allen's das server to match andrew's sample xml.
ad: worked on validation server, all stuff is in cvs. the http://cgi.openbio.org:8080 server is built off cvs, just check out and rebuild.
gh: worked on affy das2 server and client up to current spec based on whatever the rnc documents say (schema doc) as for xml. no chance to read andrew's email on query syntax, will incorporate that today.
sc: got latest version of gregg's das/2 server up at affy. serving hg17, hg16, dm2. Updated code that the das1 server is using based on latest genoviz jars. Getting some errors when loading data for new affy arrays. Investigating.
aday: minor bug fixes for spec v200. exporting assay data as different views. ucsc browser can viz expression data out of das server in bed format. das viewer can view as egr format. working on single chip at a time.
ls: here's a great use case for you: there's a cshl fellow creating dna spectrographs of oligo frequencies presented as audiographs. can really tell diffs from coding vs non-coding, CpG triplets, microsatellite harmonics. big matrices of floating point data tied to genome. consider this a challenge to das to serve this up. my postdoc sheldon mckay is serving this up: give you heatmap back given a genomic region. new glyph for spectrographic data
aday: netCDF format is good for this, but clients out there don't visualize it.
gh: would like to support netCDF in igb. not sure if this is default way to represent quantitative data for das.
[A] allen will send lincoln pointer to netCDF.
aday: netCDF is great for cross-lang, cross platform support.
gh: people are pushing wiggle format to ucsc, so we don't want to restrict to just netCDF.
aday: my refactor yesterday allows treatment of these as templates.
gh: how to do this via region query in das?
ls: feature query, tag says here comes binary data, each column corresponds to a base (or maybe a scaling factor to indicate # of bp per column). tag says here comes binary quantitative data, scale is 1:1.
gh: better way is to use alternative content format stuff (already in spec for types)
ls: if you do feat request and don't filter by type, you'll get a mix of binary and non binary.
aday: not in genome domain, genome/sequence then fetch to assay service to get quant data. then do intersection to find overlap. performance goes out window if you make the query too complex. fine to do just two fetches.
ls: how to indicate scale for numerical scale?
aday: good question. units are not encoded now.
ls: spectrographic data: one value per window where window is 100 bp
aday: so two diff units: window size, amplitude value and frequency, and that's in four channels for the bases. we're representing as 4 matrices.
aday: one matrix per channel. many formats don't support n-dimensional data. only 2d at most.
ls: in das1 did base64 encoded string in the notes. It worked.
gh: we can't require all clients to know how to interpret it. This is why we have the alt content functionality...
[A] das should support dense numeric data across regions, format specified by the existing alternative format mechanism

Topic: Spec Freeze
-------------------
ls: can we talk about freezing spec?
ad: what good will it do?
ls: allow us to code to a fixed spec. you freeze spec, people write code for a defined period of time, during that time we compare notes, then make changes, freeze, and repeat.
ad: concerned there hasn't been enough work since the changes in jan/feb.
ls: now that i'm 'on the other side of the fence' of spec writing, i'd like to see it not change, and have time to make an informed view of what its strengths and weaknesses are.
ad: haven't gotten feedback about my questions, until the codesprints. two months ago, only now being addressed.
ls: these issues don't become pressing until we start implementing. this is why we do code sprints.
ad: worry because there's been no extensive data modeling for features.
ls: can do a 1 month freeze
gh: comfortable with 1 mon freeze of schemas as they are in the rnc's now. issues will come up.
ls: announce on biodas.org - march 18th das/2 is frozen for 1 month.
gh: we'll have to live with ambiguity in how server does certain things.
ls: hence the time limited 'trial' freeze.
ad: would have liked people to write code from last feb so I could get feedback.
ls: you very much improved the spec. grateful for what you've done. I wasn't getting feedback when I was writing either.
gh: validation website is great for implementers, rather than having to read a spec document everyday.
ad: schemas aren't going to change after today (pm). would like to clear some things up about filter language, today?
ls: most urgent freeze
[A] spec will freeze as of end of today (3/16/06, PST) for one month.

Topic: Feature filters
----------------------
ad: feature filters is most important, and how do we define global names? schema is a simple change - which is req'd and which is optional, but for impls makes a big diff.
ls: global is req'd and local is optional.
ad: who comes up with global names
ls: first person to do it has naming rights. people have been able to do it for the ensembl service.
ad: I need documented names
gh: it means you don't know whether two names are the same thing until this document comes out.
ls: filter language?
ad: gregg needs inside and contains. - type and exact type: das type or ontology type?
ls: das type
gh: uri attribute of the type
ad: that type or its subtype makes no sense for das types
ls: it's just an exact match. client can use ontology to get a series of types
ls: should be an exact match, does not traverse ontology. client should ask user: do you want all exons or a specific type of exon?
ls: client goes through ontology as necessary
[A] drop exacttype, type now has exacttype semantics

Topic: XID, feature ids
------------------------
ad: xid in features. no one used yet. gives a ref to some other db. all it is is a url/uri. feels like there should be more info (type?)
ad: primary name field for feature, feels like should be name
ls: name is human readable. title would be ok
ad: but feature filter is called name, searches name and id fields
ls: this is correct behavior, you can do a fetch on the url/uri. this is ok.
ad: the name feature searches title and alias.
gh: if feature id is resolvable and you resolve it, there's no guarantee it gives back a das2xml document. if the feature uri is resolvable, and you fetch it, you will get back a das2xml document, right? can you put uri in the feature query?
aday: feels that having auto-generated names
ad: do all features have a human readable name?
gh/ls: optional
ad: why would you want to put a url in a name field?
gh: rdf
ad: should be a resolvable resource, das2xml for that feature.
ad: features with aliases, do aliases need type pk or accession? prosite has false match to ...
ls: this is a property or xid, not alias
ad: suggests that xid needs extra stuff to it.
gh: fine with an optional type attribute on xid
ad: let's wait until someone has a need.

Topic: Feature filters (continued)
----------------------------------
gh: feature filters, inside, contains, identical. Which do we need, which can we drop?
[A] overlaps - keep (all agree)
    inside - gregg needs
    contains - dropping, maybe
    identical - dropping
ad: what about excludes - the complement of overlap?
gh: haven't had time to investigate whether I can use excludes rather than the inside + overlaps (contains?) combination I need now.
ls: use case: pointing to children and they haven't arrived yet.
gh: my client keeps stuff around, when you get parent/child, if you have parent + all children you can construct feature.
ls: the spec requires single parent, right?
gh: no, you can have multiple.
ls: gff3 spec also allows mult parents and children
[A] Lincoln will provide use cases/examples of these feature scenarios:
    - three or greater hierarchy features
    - multiple parents
    - alignments

Topic: Registry
----------------
ap: still here.
gh: looking at registry, having trouble retrieving in a normal browser. when looking at it in client, I only see biopackages server registered as a server. Lincoln said there was more?
ap: this is related to mime types, changed from text plain to x-das-sources
gh: I get an error: source file could not be read. lincoln said you added other test das2 servers to it.
ap: working on interface so users can upload servers. half way through it now. upload a link to sources. will send email once it's there.
[A] Steve will add gregg's new affy das/2 server to registry when Andreas' web interface is ready
gh: same time tomorrow.

From cjm at fruitfly.org Thu Mar 16 15:50:37 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 12:50:37 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID:

Hi Andrew

I presume one constraint is that you want to preserve standard CGI URL syntax? I think this is the best that can be done using that constraint, which is to say, fairly limited. This lacks one of the most important features of a real query language, composability. These ad-hoc constraint syntaxes have their uses but you'll eventually want to go beyond the limits and end up adding awkward extensions.
Why not just forego the URL constraint and go with a composable extendable query language in the first place and save a lot of bother downstream?

On Mar 15, 2006, at 9:17 PM, Andrew Dalke wrote:

> The query fields are
>
>   name      | takes  | matches features ...
>   ===========================================================
>   xid       | URI    | which have the given xid
>   type      | URI    | with the given type or subtype (XX keep this one???)
>   exacttype | URI    | with exactly the given type
>   segment   | URI    | on the given segment
>   overlaps  | region | which overlap the given region
>   inside    | region | which are contained inside the given region (XX needed??)
>   contains  | region | which contain the given region (XX needed??)
>   name      | string | with a name or alias which matches the given string
>   prop-*    | string | with the property "*" matching the given string
>
> Queries are form-urlencoded requests. For example, if the features
> query URL is 'http://biodas.org/features' and there is a segment named
> 'http://ncbi.org/human/Chr1' then the following is a request for all the
> features on the first 10,000 bases of that segment
>
> The query is for
>   segment = 'http://ncbi.org/human/Chr1'
>   overlaps = 0:10000
>
> which is form-urlencoded as
>
>   http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
>
> Multiple search terms with the same key are OR'ed together. The following
> searches for features containing the name or alias of either
> BC048328 or BC015400
>
>   http://biodas.org/features?name=BC048328;name=BC015400
>
> Multiple search terms with different keys are AND'ed together,
> but only after doing the OR search for each set of search terms with
> identical keys.
> The following searches for features which have
> a name or alias of BC048328 or BC015400 and which are on the segment
> http://ncbi.org/human/Chr1
>
>   http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
>
> The order of the search terms in the query string does not affect
> the results.
>
> If any part of a complex feature (that is, one with parents
> or parts) matches a search term then all of the parents and
> parts are returned. (XXX Gregg -- is this correct? XXX)
>
> The fields which take URLs require exact matches.
>
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing. In that case the 'type'
> field goes away. We can still keep 'exacttype'. The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The segment URI must accept the local identifier. For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed. Multiple segments may be given, but then ranges are not
> allowed.
>
> The string searches support a simple search language.
>   ABC   -- contains a word which exactly matches "ABC" (identity, not substring)
>   *ABC  -- words ending in "ABC"
>   ABC*  -- words starting with "ABC"
>   *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck. The interpretation of whitespace in the query or
> in the search string is implementation dependent. For that
> matter, the meaning of "word" is implementation dependent. (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.
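[Editor's note: a minimal sketch of building one of these form-urlencoded queries in Python. The base URL and term values are the hypothetical ones from the quoted example; ';' is used as the pair separator to match the sample URLs. Repeated keys are OR'ed with each other by the server, then AND'ed with the other keys.]

```python
from urllib.parse import quote

# Hypothetical values taken from the quoted example.
base = "http://biodas.org/features"
terms = [
    ("segment", "http://ncbi.org/human/Chr1"),
    ("overlaps", "0:10000"),
    ("name", "BC048328"),  # repeated keys: OR'ed with each other,
    ("name", "BC015400"),  # then AND'ed with the other keys
]

# Percent-encode each key and value, joining pairs with ';'
# as in the spec's sample URLs.
url = base + "?" + ";".join(
    "%s=%s" % (quote(k, safe=""), quote(v, safe="")) for k, v in terms
)
print(url)
```

Term order does not matter to the server, so a client can emit the pairs in any order it likes.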
> The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the
> back-end databases.
>
> XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields. There is no 'name' now. I
> moved it to 'title'. I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
>
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
>
> The "prop-*" is a search of the elements. Features may
> have properties, like
>
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking the string "prop-" and
> appending the property key text ("cellular_component"). The
> query value is the text to search for.
>
>   prop-cellular_component=membrane
>
> To search for any cellular_component containing the substring "mem"
>
>   prop-cellular_component=*mem*
>
> The rules for multiple searches with the same key also apply to the
> prop-* searches. To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
>
>   http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear
>
> The range searches are defined with explicit start and end
> coordinates. The range syntax is in the form "start:end", for
> example, "1:9".
>
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper bounds for the feature.
>
> An 'overlaps' search matches if and only if
>   min < end AND max > start
>
> XXX For GREG XXX
>
> What do 'inside' and 'contains' do? Can't we just get
> away with 'excludes', which is the complement of 'overlaps'?
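[Editor's note: for reference, the quoted 'overlaps' rule, plus one plausible reading of the 'inside' and 'contains' filters it asks about, sketched as predicates. Only the overlaps definition comes from the message; the other two are my interpretation, not spec text. fmin/fmax are the feature bounds, with fmax one past the largest coordinate.]

```python
def overlaps(fmin, fmax, start, end):
    # Quoted definition: matches iff min < end AND max > start.
    return fmin < end and fmax > start

def inside(fmin, fmax, start, end):
    # My reading: the feature lies wholly inside the query region.
    return start <= fmin and fmax <= end

def contains(fmin, fmax, start, end):
    # My reading: the feature wholly contains the query region.
    return fmin <= start and end <= fmax
```

Note that with half-open bounds, two merely adjacent ranges (e.g. 0:10 and 10:20) do not overlap under this definition.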
> Searches are done as:
>   Step 0) specify the segment
>   Step 1) do all the includes (if none, match all features on segment)
>   Step 2) do all the excludes, inverted (like an includes search)
>   Step 3) only return features which are in Step 1 but not in Step 2
>   Step 4) ...
>   Step 5) Profit!
>
> I think this will support your smart code, and it's easy
> enough to implement.
>
> Every one but you was planning to use 'overlaps'. Only you
> wanted to use 'inside'. Anyone want to use 'contains'?
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Thu Mar 16 18:24:25 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:38:13 is incorrect; sent Thu, 16 Mar 2006 15:24:25 -0800
Subject: [DAS2] 'source' attribute in the types document
Message-ID:

Types have a 'source' field. The first draft shows examples like
  source='curated'
  source='genescan'
  source='tRNAscan-SE-1.11'

My interpretation is that this is a human readable field, with no machine interpretation other than as a string. It does not come from a controlled vocabulary. It may contain spaces.

This field is not currently searchable because we expect the number of types to be small enough that a client will download everything and do the search locally.

Let me know if I'm wrong.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 17:46:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 14:46:14 -0800
Subject: [DAS2] query language description
In-Reply-To:
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID:

Hi Chris,

> I presume one constraint is that you want to preserve standard CGI URL
> syntax?

Yes.

> I think this is the best that can be done using that constraint,
> which is to say, fairly limited.

Then again, the functionality we need is also fairly limited.
> This lacks one of the most important features of a real query
> language, composability. These ad-hoc constraint syntaxes have their
> uses but you'll eventually want to go beyond the limits and end up
> adding awkward extensions. Why not just forego the URL constraint and
> go with a composable extendable query language in the first place and
> save a lot of bother downstream?

Because no one can decide on a generic language which is more powerful than this.

Anything more powerful would need to support .. boolean algebra? numeric searches? regexps? What about quoting rules for "multiple word phrases"? Is it SQL-like? XPath/XQuery-like? Is it a context-free grammar? How easy is it to implement and work cross-platform?

For what people need now, this search solution seems good. For the future we can have and clients which understand that interface will know that it's there.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 18:38:07 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:38:07 -0800
Subject: [DAS2] new search terms
Message-ID: <5a29cf88a8fc1e8e8448c6e1dd248dbb@dalkescientific.com>

"note=" is a string search of the note fields

Example:
  note=And*

finds all features which have a note containing a word starting with 'And'

"coordinates=" filters for features on that coordinate system. (We talked about this one yesterday.)

I'll republish the search terms before the end of the day.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 18:54:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:54:12 -0800
Subject: [DAS2] comments in schema
Message-ID:

I've updated the schema docs (das/das2/draft3/*.rnc) to include more detailed comments. Also, updated the ucla examples to change 'synonym' to 'reference'. Everything should be up to date.
Andrew
dalke at dalkescientific.com

From cjm at fruitfly.org Thu Mar 16 19:04:03 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 16:04:03 -0800
Subject: [DAS2] query language description
In-Reply-To:
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>

On Mar 16, 2006, at 2:46 PM, Andrew Dalke wrote:

> Hi Chris,
>
>> I presume one constraint is that you want to preserve standard CGI URL
>> syntax?
>
> Yes.

I'm guessing you've been through this debate before, so no comment..

>> I think this is the best that can be done using that constraint,
>> which is to say, fairly limited.
>
> Then again, the functionality we need is also fairly limited.

ignorant question.. (I have only been tangentially aware of the outer edges of the whole das2 process).. how are you determining the functionality required? surely someone somewhere will want to write a das2 client that implements boolean queries

I speak from experience - I designed the GO Database API to have a very similar constraint language (it's expressed using perl hash keys rather than CGI parameters but the same basic idea). For years people have been clamouring for the ability to do more complex queries - right now they are forced to bypass the constraint language and go direct to SQL.

>> This lacks one of the most important features of a real query
>> language, composability. These ad-hoc constraint syntaxes have their
>> uses but you'll eventually want to go beyond the limits and end up
>> adding awkward extensions. Why not just forego the URL constraint and
>> go with a composable extendable query language in the first place and
>> save a lot of bother downstream?
>
> Because no one can decide on a generic language which is more
> powerful than this.
>
> Anything more powerful would need to support .. boolean algebra?
> numeric searches? regexps? What about quoting rules for "multiple
> word phrases"?
>
> Is it SQL-like?
> XPath/XQuery-like? Is it a context-free grammar?
> How easy is it to implement and work cross-platform?

None of these really fit into the DAS paradigm. I'm guessing you want something simple that can be used as easily as an API with get-by-X methods but will seamlessly blend into something more powerful. I think what you have is on the right lines. I'm just arguing to make this language composable from the outset, so that it can be extended to whatever expressivity is required in the future, without bolting on a new query system that's incompatible with the existing one.

The generic language could just be some kind of simple extensible function syntax for search terms, boolean operators, and some kind of (optional) nesting syntax. If you have boolean operators and it's composable, then yep it does have to be as expressive as boolean algebra. I'd argue that implementing a composable query language is easier than an ad-hoc one

> For what people need now, this search solution seems good.
>
> For the future we can have
>
> and clients which understand that interface will know that it's
> there.

hmm, not sure how useful this would be - surely you'd want something more dasmodel-aware? if you're going to just pass-through to xpath or sql then why have a das protocol at all?

> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From Gregg_Helt at affymetrix.com Thu Mar 16 19:22:54 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Thu, 16 Mar 2006 16:22:54 -0800
Subject: [DAS2] query language description
Message-ID:

For the type query filter, I'd suggest keeping the exacttype semantics you discuss below, but using "type" for the field name rather than "exacttype".
If we're getting rid of one of them, and a non-exact type is a meaningless concept, it seems like keeping that "exact" part is unnecessary and potentially confusing.

    gregg

> I think we decided that there is no type inferencing done in
> the server; it's a client side thing. In that case the 'type'
> field goes away. We can still keep 'exacttype'. The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The segment URI must accept the local identifier. For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed. Multiple segments may be given, but then ranges are not
> allowed.
>
> The string searches support a simple search language.
>   ABC   -- contains a word which exactly matches "ABC" (identity, not substring)
>   *ABC  -- words ending in "ABC"
>   ABC*  -- words starting with "ABC"
>   *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck. The interpretation of whitespace in the query or
> in the search string is implementation dependent. For that
> matter, the meaning of "word" is implementation dependent. (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted. The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the
> back-end databases.
>
> XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields. There is no 'name' now. I
> moved it to 'title'. I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
> > Some features (sub-parts) don't have human-readable names so > this field must be optional. > > > The "prop-*" is a search of the elements. Features may > have properties, like > > > > To do a string search for all 'membrane' cellular components, > construct the query key by taking the string "prop-" and > appending the property key text ("cellular_component"). The > query value is the text to search for. > > prop-cellular_component=membrane > > To search for any cellular_component containing the substring "membrane" > > prop-cellular_component=*membrane* > > The rules for multiple searches with the same key also apply to the > prop-* searches. To search for all 'membrane' or 'nuclear' > cellular components, use two 'prop-cellular_component' terms, as > > > http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear > > > The range searches are defined with explicit start and end > coordinates. The range syntax is in the form "start:end", for > example, "1:9". > > Let 'min' be the smallest coordinate for a feature on a given > segment and 'max' be one larger than the largest coordinate. > These are the lower and upper bounds for the feature. > > An 'overlaps' search matches if and only if > min < end AND max > start > > XXX For GREG XXX > > What do 'inside' and 'contains' do? Can't we just get > away with 'excludes', which is the complement of 'overlaps'? > Searches are done as: > Step 0) specify the segment > Step 1) do all the includes (if none, match all features on segment) > Step 2) do all the excludes, inverted (like an includes search) > Step 3) only return features which are in Step 1 but not > in Step 2) > Step 4) ... > Step 5) Profit! > > I think this will support your smart code, and it's easy > enough to implement. > > Everyone but you was planning to use 'overlaps'. Only you > wanted to use 'inside'. Anyone want to use 'contains'?
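The overlap rule quoted above (min < end AND max > start, with 'max' one past the feature's largest coordinate) is the standard half-open interval test. A quick sketch in Python (the function names are mine, not from the spec):

```python
def overlaps(feature_min, feature_max, start, end):
    # 'max' is one larger than the feature's largest coordinate, so the
    # intervals are half-open: two ranges overlap iff min < end AND max > start.
    return feature_min < end and feature_max > start

def excludes(feature_min, feature_max, start, end):
    # The proposed 'excludes' is just the complement of 'overlaps'
    # for features on the same query segment.
    return not overlaps(feature_min, feature_max, start, end)
```

A nice property of half-open coordinates is that two features which merely touch (one ends exactly where the other starts) do not count as overlapping.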
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Thu Mar 16 21:05:06 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 18:05:06 -0800 Subject: [DAS2] query language description In-Reply-To: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> Message-ID: Chris: > ignorant question.. (I have only been tangentially aware of the outer > edges of the whole das2 process).. > > how are you determining the functionality required? surely someone > somewhere will want to write a das2 client that implements boolean > queries It was informal, based on feedback from client developers and maintainers. Lincoln, Thomas, Andreas, Gregg and others provided that feedback. It was not by talking with users. I know there's a wide range of users and use cases. The point of this query language is to have basic functionality that all servers can implement. > right now they are forced to bypass the constraint language and go direct > to SQL. In addition, we provide defined ways for a server to indicate that there are additional ways to query the server. > None of these really fit into the DAS paradigm. I'm guessing you want > something simple that can be used as easily as an API with get-by-X > methods but will seamlessly blend into something more powerful. I > think what you have is on the right lines. I'm just arguing to make > this language composable from the outset, so that it can be extended > to whatever expressivity is required in the future, without bolting on > a new query system that's incompatible with the existing one. We have two ways to compose the system.
If the simple query language is extended, for example, to support word searches of the text field instead of substring searches, then a server can say This is backwards compatible, so the normal DAS queries work. But a client can recognize the new feature and support whatever new filters that 'word-search' indicates, eg http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre* (finds features with notes containing words starting with 'Andre' ) These are composable. For example, suppose Sanger allows modification date searches of curation events. Then it might say and I can search for notes containing words starting with "Andre" which were modified by "dalke" between 2002 and 2005 by doing http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre*&modified-by=dalke&modified-before=2005&modified-after=2002 An advantage to the simple boolean logic of the current system is that the GUI interface is easy, and in line with existing simple search systems. If someone wants to implement a new search system which is not backwards compatible then the server can indicate that alternative with a new CAPABILITY. Suppose Thomas at Sanger comes up with a new search mechanism based on an object query language he invented, The Sanger and EBI clients might understand that and support a more complex GUI, eg, with a text box interface. Everyone else must ignore unknown capability types. Then that would be POSTED (or whatever the protocol defines) to the given URL, which returns back whatever results are desired. Or the server can point to a public MySQL port, like That's what you are doing to bypass the syntax, except that here it isn't a bypass; you can define the new interface in the DAS sources document. > The generic language could just be some kind of simple > extensible function syntax for search terms, boolean operators, > and some kind of (optional) nesting syntax. Which syntax? Is it supposed to be easy for people to write? Text oriented?
Or tree structured, like XML, or SQL-like? And which clients and servers will implement that search language? If there was a generic language it would allow OR("on segment Chr1 between 1000 and 2000", "on segment ChrX between 99 and 777") which is something we are expressly not allowing in DAS2 queries. It doesn't make sense for the target applications and excluding it simplifies the server development, which means less chance for bugs. Also, I personally haven't figured out a decent way to do a GUI composition of a complex boolean query which is as easy as learning the query language in the first place. A more generic language implementation is a lot of overhead if most (80%? 90%?) need basic searches, and many of the rest can fake it by breaking a request into parts and doing the boolean logic on the client side. Feedback I've heard so far is that DAS1 queries were acceptable, with only a few new search fields needed. > hmm, not sure how useful this would be - surely you'd want something > more dasmodel-aware? The example I gave was a bad one. What I meant was to show how there's an extension point so someone can develop a new search interface and clients can know that the new functionality exists, without having to change the DAS spec.
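The "break a request into parts and do the boolean logic on the client side" approach above can be sketched in a few lines of Python. Here `run_query` stands in for a real DAS/2 fetch and is assumed to return feature URIs:

```python
def client_side_or(run_query, *queries):
    # Emulate an OR that the DAS/2 filters won't express server-side
    # (e.g. ranges on two different segments) by running each simple
    # query separately and unioning the feature URIs on the client.
    results = set()
    for query in queries:
        results |= set(run_query(query))
    return results

# Example with a canned stand-in for a real server:
canned = {"segment=Chr1;overlaps=1000:2000": ["feat1", "feat2"],
          "segment=ChrX;overlaps=99:777": ["feat2", "feat3"]}
merged = client_side_or(canned.get, *canned)
# merged == {"feat1", "feat2", "feat3"}
```

The cost, as noted, is extra round trips and possibly much larger transfers than a server-side OR would have needed.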
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Thu Mar 16 23:47:58 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 20:47:58 -0800 Subject: [DAS2] query language description In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> Message-ID: Updated:

- added 'note' as a query field
- changed string searches to substring (not word) searches and made them case insensitive
    "AB"   matches only the strings "AB", "Ab", "aB" and "ab"
    "*AB"  matches only fields which exactly end with "AB", "ab", "aB", and "Ab"
    "AB*"  matches only fields which start with "AB", up to case
    "*AB*" matches only fields which contain the substring, up to case
- added 'coordinates' search
- removed 'type' and renamed 'exacttype' to 'type'
- removed 'contains' search, which no one said they wanted. Instead, supporting (EXPERIMENTAL) an 'excludes' search.

==================================

The query fields are

name        | takes  | matches features ...
==========================================
xid         | URI    | which have the given xid
type        | URI    | with exactly the given type
segment     | URI    | on the given segment
coordinates | URI    | which are part of the given coordinate system
overlaps    | region | which overlap the given region
excludes    | region | which have no overlap to the given region
inside      | region | which are contained inside the given region
name        | string | with a title or alias which matches the given string
note        | string | with a note which matches the given string
prop-*      | string | with the property "*" matching the given string

Queries are form-urlencoded requests.
For example, if the features query URL is 'http://biodas.org/features' and there is a segment named 'http://ncbi.org/human/Chr1' then the following is a request for all the features on the first 10,000 bases of that segment The query is for segment = 'http://ncbi.org/human/Chr1' overlaps = 0:10000 which is form-urlencoded as http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000 Multiple search terms with the same key are OR'ed together. The following searches for features containing the name or alias of either BC048328 or BC015400 http://biodas.org/features?name=BC048328;name=BC015400 The 'excludes' search is an exception. See below. Multiple search terms with different keys are AND'ed together, but only after doing the OR search for each set of search terms with identical keys. The following searches for features which have a name or alias of BC048328 or BC015400 and which are on the segment http://ncbi.org/human/Chr1 http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400 The order of the search terms in the query string does not affect the results. If any part of a complex feature (that is, one with parents or parts) matches a search term then all of the parents and parts are returned. (XXX Gregg -- is this correct? XXX) The fields which take URLs require exact matches, that is, a character by character match. (For details on the nuances of comparing URIs see http://www.textuality.com/tag/uri-comp-3.html ) (We don't have an ontology URI yet, and when we do we can add an 'ontology' query.) The segment query filter takes a URI. This must accept the segment URI and, if known to the server, the equivalent reference identifier for the segment. If range searches are given then one and only one segment must be given. If there are multiple segment queries then ranges are not allowed. The string searches may be exact matches, substring, prefix or suffix searches.
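A sketch of how a client might assemble such a form-urlencoded request with Python's standard library (the URL and segment are the examples above; the `build_query` helper name is mine):

```python
from urllib.parse import quote

def build_query(base_url, filters):
    # filters is a list of (key, value) pairs; repeated keys are OR'ed
    # and distinct keys AND'ed by the server, per the rules above.
    # The examples above use ';' rather than '&' as the pair separator.
    terms = ["%s=%s" % (key, quote(value, safe="")) for key, value in filters]
    return base_url + "?" + ";".join(terms)

url = build_query("http://biodas.org/features",
                  [("segment", "http://ncbi.org/human/Chr1"),
                   ("overlaps", "0:10000")])
# url == "http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000"
```

Passing `safe=""` to `quote` makes it percent-encode the ':' and '/' inside the segment URI, which would otherwise be left alone.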
The query type depends on if the search value starts and/or ends with a '*'.

    ABC   -- field exactly matches "ABC"
    *ABC  -- field ends with "ABC"
    ABC*  -- field starts with "ABC"
    *ABC* -- field contains the substring "ABC"

The "*" has no special meaning except at the start or end of the query value. The search term "***" will match a field which contains the character "*" anywhere. There is no way to match fields which exactly match '*' or which only start or end with that character. Text searches are case-insensitive. The string "ABC" matches "abc", "aBc", "ABC", etc. A server may choose to collapse multiple whitespace characters into a single space character for search purposes. For example, the query "*a newline*" should match "This is a line of text which contains a newline" The 'name' search does a text search of the 'title' and 'alias' fields. The "prop-*" is shorthand for a class of text searches of elements. Features may have properties, like To do a string search for all 'membrane' cellular components, construct the query key by taking the string "prop-" and appending the property key text ("cellular_component"). The query value is the text to search for, in this case: prop-cellular_component=membrane To search for any cellular_component containing the substring "membrane" prop-cellular_component=*membrane* The rules for multiple searches with the same key also apply to the prop-* searches. To search for all 'membrane' or 'nuclear' cellular components, use two 'prop-cellular_component' terms, as http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear The range searches are defined with explicit start and end coordinates. The range syntax is in the form "start:end", for example, "1:9". There is no way to restrict the search to a specific strand. A feature may have several locations. An annotation may have several features in a parent/part relationship. The relationship may have several levels.
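The string-filter rules above can be sketched as a server-side matcher (a hypothetical helper, not from any DAS/2 codebase):

```python
def match_filter(pattern, field):
    # Case-insensitive DAS/2 string filter.  '*' is special only at the
    # start and/or end of the pattern.
    p, f = pattern.lower(), field.lower()
    if p.startswith("*") and p.endswith("*") and len(p) >= 2:
        return p[1:-1] in f            # *ABC* : substring match
    if p.startswith("*"):
        return f.endswith(p[1:])       # *ABC  : suffix match
    if p.endswith("*"):
        return f.startswith(p[:-1])    # ABC*  : prefix match
    return f == p                      # ABC   : exact match
```

Note the "***" behaviour falls out for free: the leading and trailing stars are stripped and the remaining "*" is searched for as an ordinary substring, matching the rule above.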
If a range search matches any feature in the annotation then the search returns all of the features in the annotation. An 'overlaps' search matches if and only if any feature location of any of the parent or part overlaps the query range and segment. An 'inside' search matches if and only if at least one feature in the annotation has a location on the query segment and all features which have a location on the query segment have at least one location which starts and ends in the query range. EXPERIMENTAL: An 'excludes' matches if and only if at least one feature of the annotation is on the query segment and no features are in the query range. This is the complement of the 'overlaps' search, for annotations on the same query segment. Unlike the other search keys, if there are multiple 'excludes' searches then the results are AND'ed together. That is, if the query has two excludes ranges segment=ChrX excludes=RANGE1 excludes=RANGE2 then the results are those features on ChrX which are not in RANGE1 and are not in RANGE2. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 02:05:54 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 23:05:54 -0800 Subject: [DAS2] alternate formats Message-ID: <3f895441c38b74460da9f8e4582b7a74@dalkescientific.com> If you've read the updated schema definitions you saw I've added the following comment in the CAPABILITY # Format names which can be passed to the query_uri. # The names are type dependent. At present the # only reserved names are for the 'features' capability. # These are: das2xml, count, uris format*, We talked about this in the UK I think, and I mentioned it to people here. The 'count' format returns the count of features which would be returned for a given query. This is a single line containing the integer followed by a newline. The content-type of the document is text/plain .
For example, to get the number of all the features on the server

Request: http://www.example.com/das2/mus/v22/features?format=count
Response:
  Content-Type: text/plain

  129254

I will add this format description to the spec. When does the server need to declare that it implements a given document type? My thought is that if the format list is not specified then the server must implement 'das2xml' and 'count' formats. If it doesn't implement the 'count' format then it needs to declare the complete list of what it does support. In addition I'll describe here the 'uris' format. It is a document of content-type text/plain containing the matching feature URIs, one per line. For example,

file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.21346.0.A1_3p_a_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.21346.0.A1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.21346.1.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.21346.2.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.21346.3.S1_3p_x_at
file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.271468.0.S1_3p_at

(I feel like it should implement an xml:base scheme to reduce the amount of traffic.) The idea is that a client can request the URIs only, eg, to do more complex boolean-esque searches by doing simpler ones on the server and combining the results in client space. For another example, if the client already knows the feature data for a URI then it doesn't need to download the data again. So it gets a list of URIs then only fetches the ones it does not know about. This requires HTTP/1.1 pipelining for good performance. Because there are no clients which want it, because I'm not certain on the format, and because of the lack of pipelining in the existing servers, I will not document this format. I'll just leave it as a reserved format name.
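Client-side handling of both plain-text formats is trivial; a sketch (the helper names are mine):

```python
def parse_count(body):
    # 'count' format: a text/plain body holding one integer plus a newline.
    return int(body.strip())

def parse_uris(body):
    # 'uris' format: one matching feature URI per line.
    return [line.strip() for line in body.splitlines() if line.strip()]

def uris_to_fetch(body, cache):
    # The use case described above: list the matches first, then download
    # only the features the client has not already cached.
    return [uri for uri in parse_uris(body) if uri not in cache]
```

Usage: `parse_count("129254\n")` gives the integer 129254 from the example response above, and `uris_to_fetch` is where the pipelining concern bites, since each URI not in the cache becomes a separate fetch.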
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 02:33:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 23:33:44 -0800 Subject: [DAS2] debugging validation proxy Message-ID: After a conversation with Gregg this afternoon, this evening I implemented a debugging validation proxy for DAS. The code is about 100 lines long and combines Python's "twisted" network library and the dasypus validator. To make it work, configure your DAS client to use a proxy, which is this validation proxy. Then do things like normal. The requests go through the proxy. It dumps the request info to stdout and forwards the request to the real server. It receives the response headers and body. When finished it passes the data to dasypus. I stuck some DAS-ish XML on my company web server and did the connection like this % curl -x localhost:8080 http://www.dalkescientific.com/sources.xml The output from the debug window is Making request for 'http://www.dalkescientific.com/sources.xml' Warning: Unknown Content-Type 'application/xml'. Info: Assuming doctype of 'sources' based on root element at byte 40, line 2, column 2 Finished processing Andrew dalke at dalkescientific.com From allenday at ucla.edu Thu Mar 16 13:27:56 2006 From: allenday at ucla.edu (Allen Day) Date: Thu, 16 Mar 2006 10:27:56 -0800 (PST) Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: <200603151046.43196.lstein@cshl.edu> References: <200603151046.43196.lstein@cshl.edu> Message-ID: Hi Lincoln, Please just code to what is there, and expect your code to break when I update the biopackages server to v300 (probably next week). -Allen On Wed, 15 Mar 2006, Lincoln Stein wrote: > Hi Folks, > > I just ran through the source request on biopackages.net and it is returning > something that is very different from the current spec (CVS updated as of > this morning UK time).
I understand why there is a discrepancy, but for the > purposes of the code sprint, should I code to what the spec says or to what > biopackages.net returns? It is much more fun for me to code to a working > server because I have the opportunity to watch my code run. > > Best, > > Lincoln > > From Gregg_Helt at affymetrix.com Fri Mar 17 03:22:12 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Fri, 17 Mar 2006 00:22:12 -0800 Subject: [DAS2] New affymetrix das/2 development server Message-ID: I checked in a new version of the Affymetrix DAS/2 server this evening that supports XML responses based on the latest DAS/2 spec, version 300. For sample sources, segments, types, and features responses it passes the Dasypus validator tests. The validator was _very_ useful for bringing the server up to the current spec! Steve rolled the new version out on our public test server, the root sources query URL is http://205.217.46.81:9091/das2/genome/sequence. In the latest version of IGB checked into CVS, this server can be accessed as "Affy-temp" in the list of DAS/2 servers. Although the server's XML responses conform to spec v.300, the query strings it recognizes still only conform to a subset of spec v.200. I expect to have the queries upgraded to v.300 tonight. But it will probably still only support a subset of the query filters: one type (required), one overlaps (required), one inside (optional). This server also supports bed, psl, and some binary formats as alternative content formats, depending on the type of the annotations. 
gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Steve Chervitz > Sent: Wednesday, March 15, 2006 1:25 PM > To: DAS/2 > Subject: [DAS2] New affymetrix das/2 development server > > > Gregg's latest spec-compliant, but still development-grade, das/2 server > is > now publically available via http://205.217.46.81:9091 > > It's currently serving annotations from the following assemblies: > - human hg16 > - human hg17 > - drosophila dm2 > > Send me requests for any other data sources that would help your > development > efforts. > > Example query to get back a das-source xml document: > http://205.217.46.81:9091/das2/genome/sequence > > Its compliance with the spec is steadily improving, on a daily if not > hourly basis during the code sprint. > > Within IGB you can access this server from the DAS/2 servers tab > under 'Affy-temp'. > > You'll need the latest version of IGB from the CVS repository at > http://sf.net/projects/genoviz > > Steve > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Mar 17 11:09:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 08:09:44 -0800 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: References: <200603151046.43196.lstein@cshl.edu> Message-ID: Allen: > Please just code to what is there, and expect your code to break when I > update the biopackages server to v300 (probably next week). So you all know, "300" is what we've been calling the current version of the spec, based on the code freeze that started 8 hours ago. It's the one currently only described in the schema definitions and in the example files under das/das2/draft3.
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 11:40:20 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 08:40:20 -0800 Subject: [DAS2] proxies, caching and network configuration Message-ID: <58f16cd7fac095a708fd81a5cc5e40df@dalkescientific.com> I'm writing to encourage DAS client authors to include support for proxies when fetching DAS URLs. Nomi pointed out that Apollo supports proxies, because users asked for it. I think it's because some sites don't have direct access to the internet. I know a few of my clients have internal networks set up that way. Yesterday we talked a bit about how to point to local mirrors. It would be hard to have a standard configuration so that all DAS client code can know about local mirrors. I mentioned setting up proxies, but dismissed the idea. Now I'm thinking that that might be the solution. If there are local ways to get, say, sequence data then that could be done at the proxy level. Someone can easily (with less than 100 lines of code) write a new proxy server which points to a local resource if it knows that a URI is resolvable that way. Having proxy support also helps with debugging, like in the debugging proxy server I wrote yesterday. A nice thing is that some people want proxy support anyway, so if client code supports proxies then these other things (redirection to local mirrors, debugging) can be set up later, and with no extra work in the client. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Fri Mar 17 13:47:51 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 17 Mar 2006 10:47:51 -0800 Subject: [DAS2] New affymetrix das/2 development server In-Reply-To: Message-ID: The affy das/2 development server at http://205.217.46.81:9091 has been updated to better support DAS/2 spec version 300. Gregg says: > Changed genometry DAS/2 server so that it responds to feature queries that use > DAS/2 v.300 feature filters. 
Currently implements a subset of > the v.300 feature query spec: > requires one and only one segment filter > requires one and only one type filter > accepts zero or one inside filter > Also attempts to support DAS/2 v.200 feature filters, but success is not > guaranteed. Steve > From: Steve Chervitz > Date: Wed, 15 Mar 2006 13:24:59 -0800 > To: DAS/2 > Conversation: New affymetrix das/2 development server > Subject: New affymetrix das/2 development server > > > Gregg's latest spec-compliant, but still development-grade, das/2 server is > now publically available via http://205.217.46.81:9091 > > It's currently serving annotations from the following assemblies: > - human hg16 > - human hg17 > - drosophila dm2 > > Send me requests for any other data sources that would help your development > efforts. > > Example query to get back a das-source xml document: > http://205.217.46.81:9091/das2/genome/sequence > > Its compliance with the spec is steadily improving, on a daily if not hourly > basis during the code sprint. > > Within IGB you can access this server from the DAS/2 servers tab > under 'Affy-temp'. > > You'll need the latest version of IGB from the CVS repository at > http://sf.net/projects/genoviz > > Steve From dalke at dalkescientific.com Fri Mar 17 15:09:42 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 12:09:42 -0800 Subject: [DAS2] defined minimum limits Message-ID: We should define minimum sizes for fields in the server database. For example, "the server must support feature titles of at least 40 characters", "must handle at least two 'excludes' feature filters". And define what to do when the server decides that writeback of a 30MB feature is just a bit too large.
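A sketch of what such limits might look like server-side. The numbers come from the examples in the message but are otherwise hypothetical, and 413 ('Request Entity Too Large') is just one plausible response for the oversized-writeback case:

```python
# Hypothetical limits based on the examples above -- the spec
# does not define these values yet.
MIN_TITLE_CHARS = 40                 # titles this long must be supported
MAX_WRITEBACK_BYTES = 30 * 1024 * 1024

def limits_conform(server_max_title_chars):
    # Conformance check: a server's own title limit must be at least
    # the spec-defined minimum.
    return server_max_title_chars >= MIN_TITLE_CHARS

def writeback_status(payload_size):
    # The HTTP status a server might send for a writeback request of
    # this many bytes: 413 'Request Entity Too Large' when over limit.
    return 413 if payload_size > MAX_WRITEBACK_BYTES else 200
```

The point of spec-level minimums is that a client can rely on them without probing each server; anything beyond the minimum stays implementation-defined.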
Andrew dalke at dalkescientific.com From boconnor at ucla.edu Fri Mar 17 18:23:09 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Fri, 17 Mar 2006 15:23:09 -0800 Subject: [DAS2] das.biopackages.net Updated to Spec 300 Message-ID: <441B44DD.5010505@ucla.edu> Hi, So I checked in my changes to the DAS/2 server which should bring it up to the 300 spec. Allen updated the das.biopackages.net server and I tested the following URLs in Andrew's validation app. They all appear to be OK: * http://das.biopackages.net/das/genome * http://das.biopackages.net/das/genome/yeast * http://das.biopackages.net/das/genome/human * http://das.biopackages.net/das/genome/yeast/S228C * http://das.biopackages.net/das/genome/human/17 * http://das.biopackages.net/das/genome/yeast/S228C/segment * http://das.biopackages.net/das/genome/human/17/segment * http://das.biopackages.net/das/genome/yeast/S228C/type * http://das.biopackages.net/das/genome/human/17/type * http://das.biopackages.net/das/genome/yeast/S228C/feature?overlaps=chrI/1:1000 * http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr1/1000:2000 Let Allen or me know if you run into problems. --Brian From cjm at fruitfly.org Fri Mar 17 19:20:14 2006 From: cjm at fruitfly.org (chris mungall) Date: Fri, 17 Mar 2006 16:20:14 -0800 Subject: [DAS2] query language description In-Reply-To: References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> Message-ID: On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote: >> right now they are forced to bypass the constraint language and go direct >> to SQL. > > In addition, we provide defined ways for a server to indicate > that there are additional ways to query the server. I was positing this as a bad feature, not a good one. or at least a symptom of an incorrectly designed system (at least in the case of the GO DB API - it may not carry forward to DAS - though if you're going to allow querying by terms...)
> >> None of these really lit into the DAS paradigm. I'm guessing you want >> something simple that can be used as easily as an API with get-by-X >> methods but will seamlessly blend into something more powerful. I >> think what you have is on the right lines. I'm just arguing to make >> this language composable from the outset, so that it can be extended >> to whatever expressivity is required in the future, without bolting on >> a new query system that's incompatible with the existing one. > > We have two ways to compose the system. If the simple query language > is extended, for example, to support word searches of the text field > instead of substring searches, then a server can say > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > This is backwards compatible, so the normal DAS queries work. But > a client can recognize the new feature and support whatever new filters > that 'word-search' indicates, eg > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre* > > (finds features with notes containing words starting with 'Andre' ) > > These are composable. For example, suppose Sanger allows modification > date searches of curation events. Then it might say > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > so this is limited to single-argument search functions? > > and I can search for notes containing words starting with "Andre" > which were modified by "dalke" between 2002 and 2005 by doing > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*& > modified-by=dalke&modified-before=2005&modified-after=2002 but the compositionality is always associative since the CGI parameter constraint forbids nesting > An advantage to the simple boolean logic of the current system > is that the GUI interface is easy, and in line with existing > simple search systems. 
there's nothing preventing you from implementing a simple GUI on top of an expressive system - there is nothing forcing you to use the expressivity > If someone wants to implement a new search system which is > not backwards compatible then the server can indicate that > alternative with a new CAPABILITY. Suppose Thomas at Sanger > comes up with a new search mechanism based on an object query > language he invented, > > query_uri="http://sanger.ac.uk/oql-search" /> > > The Sanger and EBI clients might understand that and support > a more complex GUI, eg, with a text box interface. Everyone > else must ignore unknown capability types. but this doesn't integrate with the existing query system > > Then that would be POSTED (or whatever the protocol defines) > to the given URL, which returns back whatever results are > desired. > > Or the server can point to a public MySQL port, like > > query_uri="mysql://username:password at hostname:port/databasename" > /> > > That's what you are doing to bypass the syntax, except that > here it isn't a bypass; you can define the new interface in > the DAS sources document. > >> The generic language could just be some kind of simple >> extensible function syntax for search terms, boolean operators, >> and some kind of (optional) nesting syntax. > > Which syntax? Is it supposed to be easy for people to write? > Text oriented? Or tree structured, like XML, or SQL-like? I'd favour some concrete abstract syntax that looks much like the existing DAS QL > And which clients and servers will implement that search > language? all servers. clients optional > > If there was a generic language it would allow > OR("on segment Chr1 between 1000 and 2000", > "on segment ChrX between 99 and 777") > which is something we are expressly not allowing in DAS2 > queries. It doesn't make sense for the target applications > and excluding it simplifies the server development, > which means less chance for bugs.
this example is pointless but it's easy to imagine plenty of ontology term queries or other queries in which this would be useful I guess I depart from the normal DAS philosophy - I don't see this being a high barrier for entry for servers, if they're not up to this it'll probably be a buggy hacky server anyway > Also, I personally haven't figured out a decent way to > do a GUI composition of a complex boolean query which is > as easy as learning the query language in the first place. doesn't mean it doesn't exist. i'm not sure what's hard about having say, a clipboard of favourite queries, then allowing some kind of drag-and-drop composition > A more generic language implementation is a lot of overhead > if most (80%? 90%) need basic searches, and many of the > rest can fake it by breaking a request into parts and > doing the boolean logic on the client side. this is always an option - if the user doesn't mind the additional possibly very high overhead. it's just a little bit of a depressing approach, as if Codd's seminal paper from 1970 or whenever it was never happened. > Feedback I've heard so far is that DAS1 queries were > acceptable, with only a few new search fields needed. > >> hmm, not sure how useful this would be - surely you'd want something >> more dasmodel-aware? > > The example I gave was a bad one. What I meant was to show > how there's an extension point so someone can develop a new > search interface and clients can know that the new functionality > exists, without having to change the DAS spec. ok that's probably all I've got to say on the matter, sorry for being irksome. I guess I'm fundamentally missing something, that is, why wrap simple and expressive declarative query languages with limited ad-hoc constraint systems with consciously limited expressivity and limited means of extensibility.. 
cheers chris > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From Steve_Chervitz at affymetrix.com Sun Mar 19 23:54:36 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Sun, 19 Mar 2006 20:54:36 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day five, 17 Mar 2006 $Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt Dalke Scientific: Andrew Dalke (at Affy) UCLA: Allen Day, Brian O'Connor (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Agenda: * Status reports * Writeback progress Status reports: --------------- gh: This is the last mtg of code sprint. For the status reports, focus on where you are at and what you are hoping to accomplish post-sprint. gh: working on version of affy server that impls das/2 v300 spec for all xml responses. sample responses passed andrew's validation. steve rolled it out to public server. updated igb client to handle v300 xml. worked more on server to impl v300 query syntax using full uri for type segment, segment separate from overlaps and inside. only impls a subset of the feature query. 
requires one and only one segment, type, insides.

hoping to do for rest of sprint and after:
1. supporting name feat filters in igb client
2. remove restrictions from the server
3. making sure new version of server gets rolled out
4. roll out jar for this version of igb. maybe put on genoviz sf site
for testing purposes.

bo: looked at xml docs that andrew checked in, updating ucla templates
on server, not rolled out to biopackages.net, waiting to make rpm,
hoping to do code cleanup in igb. getting andrew's help running
validator on local copy of server.

gh: igb would like to support v300, but one server is at v200+ (ucla),
one at v300 (affy), which complicates things. so getting your server
good to go would be my priority.

bo: code cleanup involves assay and ontology interface.

gh: we're planning an igb release at end of march. as long as the code
is clean by then it's ok.

aday: code cleanup, things removed from protocol. exporting data
matrices from assay part of server. validate sources document w/r/t
v300 validator. work with brian to make sure everything is updated to
v300. probably working on filter query, since we now treat things as
names not full uri's.

ad: what extra config info do you need in server for that? can you get
it from the http headers?
gh: mine is being promiscuous, just name of type will work. might give
the wrong thing back, but for data we're serving back now, it can't be
wrong.

ad: how much trouble does the uri handling cause for you?
gh: has to be full uri of the type, doing otherwise is not an option
(in the spec).
ad: you could just use name internally, then put together full uri
when you go to the outside world.

ad: I updated comments in schema definitions, updated query lang
spec. string searches are substring searches, not word-substring
searches.
  abc  = whole field must be equal
  *abc = suffix match
  abc* = prefix match
previously said it was word match, but that's too complicated on the
server. worked with gregg to pin down what inside search means.
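The match rules Andrew lists can be read as a small predicate. A sketch of that reading (treating a pattern wrapped in stars as a match-anywhere substring is an assumption; only the other three forms are spelled out in the notes):

```python
def das_match(pattern, field):
    """Field matching as described in the sprint notes:
    'abc'  -> whole field must equal 'abc'
    '*abc' -> field ends with 'abc' (suffix match)
    'abc*' -> field starts with 'abc' (prefix match)
    '*abc*' -> field contains 'abc' anywhere (assumed substring form)
    """
    if pattern.startswith("*") and pattern.endswith("*") and len(pattern) > 1:
        return pattern[1:-1] in field
    if pattern.startswith("*"):
        return field.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field.startswith(pattern[:-1])
    return field == pattern
```

This is a reading of the rules as stated in the notes, not the normative spec text; the earlier word-match behaviour was dropped precisely because it was harder than this to implement server-side.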
I'm thinking about the possibility of a validating proxy server:
configure the das client to go through the proxy before the outside
world, and the server would sniff everything going by. Support for
proxies can enable lots of things w/o needing additional config for
each client.

gh: how do you do proxy in java? i.e., redirect all network calls to a
proxy.
bo: there's a way to set proxy options via the system object in the
java vm. can show you some examples of this.

aday: performance.
gh: current webstart-based igb works with the existing public das/2
server, [comment pertaining to: the new version of igb and a new
version of the affy das/2 server].

ad: when will we get reference names from lincoln?
gh: should happen yesterday. poke him about this. would be really nice
to be able to overlay annotations!

The current version of igb can turn off v300 options, and then it can
load stuff from the ucla server. The version of igb in cvs now can hit
both biopackages.net and the affy server in the dmz, and there's
hardwiring to get things to overlay. temporary patch.

ee: two things:
1. style sheets. info from andrew yesterday. looking over that. will
discuss questions w/ andrew.
2. making sure that when we do a new release of igb in a couple of
weeks (when I'm not here) that it will go smoothly. go over w/ gregg,
steve. lots of testing.
made changes in parser code, should still work.

sc: I updated jars for das/1, not das/2, on netaffxdas.affymetrix.com.
ee: it's the das/1 I'm most concerned about.

sc: installed and updated gregg's new das/2 server on a publicly
accessible machine (separate box from the production das/1 and das/2
servers on netaffxdas.affymetrix.com). Also spent time loading data
for new affy arrays (mouse and rat exons). this required lots of
memory; had to disable support for some other arrays. [gregg's das
servers load all annotations into memory at start up, hence the big
memory requirements for arrays with lots of probe sets.]
[A] gregg optimize affy das server memory reqts for exon arrays.

gh: we've gotten a lot done this week. I think we have a stable spec.

gh: serving alignments, no cigars, but blat alignment to genome as
coords on mrna and coords on the genome. igb doesn't use it yet, but
it's there.
ad: xid in region elements.
gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
xid in das/2?
ad: yes, i believe.
gh: if you have links in das/1. without links it can build links from
feature id using a template. This is used for building links from
within IGB back to netaffx, for example.

Topic: Writebacks
-----------------

gh: writebacks haven't been mentioned at all this week.
ad: we need people committed to writing a server to implement it.
gh: we decided that since ed griffith would be working on it at
Sanger, we wouldn't worry about it for the ucla server.
bo: we started prototyping. locking mechanism. persisting part of a
mage document. the spec changed after that. andrew's delta model. a
little different from what we were prototyping. actual persistence
will be done in the assay portion of our server.
gh: grant focuses on writeback for the genome portion, and this was a
big chunk of the grant. ends at the end of may or june.

ad: delta model was: here's a list of add, delete, modify in one
document. An issue was if you change an existing record, do you give
it a new identifier?
gh: you never modify something with an existing id, you just make a
new one, with a new id, with a pointer back to the old one. Ed
Griffith said this a month ago. I like this idea. but told we cannot
make this requirement on the database. but very few dbs will be
writeback, so it's not affecting all servers.

ad: making new uris, the client has to know the new uri for the old
one. needs to return a mapping document. if the network crashes
partway through, the client won't know what the mapping is and it will
be lost.
gh: server doesn't know if client got it. it could act(?) back.
gh: if a response from the http server dies, the server has no way to
know.
ad: There could be a proxy in the middle, or an isp's proxy
server. The server sent it successfully to the proxy, but it never
made it to the client.

gh: how is this dealt with for commits into relational dbs? same thing
applies.
ad: don't know
ee: could ask for everything in this region.
ad: have a new element that says 'i used to be this'.
bo: you do an insert in a db to get the last pk that was issued. the
client talks back to the server: give me the last feature uri that was
provisioned on my connection. so the client is in control.

sc: it's up to the client to get confirmation from the server. If it
failed to get the response after sending in the modification request,
it could request that the server send it again.

ad: (drawing on whiteboard) two-stage strategy, get a transaction
state.

    post "get transaction url"
        <---------------
    post (put?) to transaction URL
        ------------->
    can do multiple (if identical)
        ---------->
        ---------->
    Get was successful and here's transformation info
        <---------------

ad: server can hold transformation info for some timespan in case the
client needs to re-fetch.

gh: I'm more interested in getting a server up than a client regarding
writeback. complex parts of the client are already implemented
(apollo).

gh: locks are region based, not feature based.
ad: can't lock...
gh: we can talk about how to trigger the ucla locking mechanism.
bo: did flock transactional locking as suggested in the Perl
Cookbook. mage document has content. server locks an id using flock
(for assay das).
gh: to lock a region on the genome, lock on all ids for features in
this region.
bo: make a file containing all the ids that are locked. flock this
file.

ad: file locking is fraught with problems. why not keep it in the
database and let the db lock it for you. don't let perl + the file
system do it for you. there could be fs problems. nfs isn't good at
that. a database is much more reliable.
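The two-stage exchange on the whiteboard can be mocked to show why identical re-posts are safe. Everything here - the method names, the shape of the delta document, the id scheme - is illustrative only; the writeback spec was still in flux at this point.

```python
# Mock of the two-stage writeback strategy from the whiteboard:
# 1. client posts to get a transaction URL (here, a transaction id);
# 2. client posts the delta document to that URL; re-posting the
#    identical delta is safe because the server keys the result on
#    the transaction and computes it only once;
# 3. server replies with the old-id -> new-id mapping and keeps it
#    around for a while in case the client must re-fetch.

import uuid

class MockServer:
    def __init__(self):
        self.transactions = {}  # txn id -> mapping result (kept for re-fetch)

    def open_transaction(self):
        txn = str(uuid.uuid4())
        self.transactions[txn] = None
        return txn

    def post_delta(self, txn, delta):
        if self.transactions[txn] is None:
            # Give each modified feature a brand-new id pointing back at
            # the old one, per the "never modify in place" model.
            self.transactions[txn] = {
                old: "feat-" + str(uuid.uuid4()) for old in delta["modify"]
            }
        return self.transactions[txn]  # identical re-posts get the same answer

server = MockServer()
txn = server.open_transaction()
delta = {"modify": ["old-feat-1", "old-feat-2"]}
first = server.post_delta(txn, delta)
retry = server.post_delta(txn, delta)  # e.g. after a dropped response
assert first == retry
```

Because the mapping is computed once per transaction, a client that never saw the response can simply re-post the same delta and receive the same old-to-new mapping, which addresses the dropped-response and proxy-in-the-middle scenarios discussed here.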
bo: I went with the perl flock mechanism since you could have other
non-database sources (though so far it's all db).

[A] steve, allen send brian code tips regarding locking.

gh: putting aside pushing large data chunks into the server, for
curation it's ok if the protocol is a little error prone, since
curator-caused errors will be much more likely/common.

ad: UK folks haven't done any writeback work as far as I know.
gh: they haven't billed us in 2 years. Tony Cox is the contact, ed
griffith is the main developer.
ad: andreas and thomas are not funded by this grant or the next one.
gh: they are already funded by other means.

ad: if someone wants to change an annotation, should they need to get
a lock first or can it work like cvs? do it if it can; get lock,
release lock in one transaction.
ee: that's my preference.

ad: if every feature has its own id, you know if it's...
ee: some servers might not have any writeback facility at
all. conflicts will be rare.

[A] ask ed/tony whether they plan to have any writeback facility

gh: ed g wanted to work on a client to do writeback; don't know who
would work on a server there.
ad: someone else, can't remember - roy?
gh: unless we hear back from sanger, the highest priority for ucla
folks after updating the server for v300 is working on server-side
writeback.

gh: spec freeze is for the read portion. the writeback portion will
have to change as needed.
ad: and arithmetic? ;-)

From lstein at cshl.edu Mon Mar 20 12:27:59 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:27:59 -0500
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
In-Reply-To: 
References: 
Message-ID: <200603201227.59816.lstein@cshl.edu>

Hi Folks,

I will join the DAS2 call a little late today (no more than 10
min). I'm assuming that we're on?
Lincoln

On Sunday 19 March 2006 23:54, Steve Chervitz wrote:
> [the full sprint notes, quoted verbatim in the original message, are
> snipped here; see the message above]

-- Lincoln D.
Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA
MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From lstein at cshl.edu Mon Mar 20 12:32:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:32:40 -0500
Subject: [DAS2] query language description
In-Reply-To: 
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <200603201232.41522.lstein@cshl.edu>

The current filter query language, which provides one level of ANDs
and a nested level of ORs, satisfies our use cases. It is not clear to
me what additional benefit we'll get from a composable query
language. Note that none of the popular and functional genome
information sources -- NCBI, UCSC, Ensembl or BioMart -- offer a
composable query language, and there does not seem to be rioting on
the streets!

Lincoln

On Friday 17 March 2006 19:20, chris mungall wrote:
> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
> >> right now they are forced to bypass the constraint language and go
> >> direct to SQL.
> >
> > In addition, we provide defined ways for a server to indicate
> > that there are additional ways to query the server.
>
> I was positing this as a bad feature, not a good one. or at least a
> symptom of an incorrectly designed system (at least in the case of the
> GO DB API - it may not carry forward to DAS - though if you're going to
> allow querying by terms...)
>
> >> None of these really fit into the DAS paradigm. I'm guessing you want
> >> something simple that can be used as easily as an API with get-by-X
> >> methods but will seamlessly blend into something more powerful. I
> >> think what you have is on the right lines. I'm just arguing to make
> >> this language composable from the outset, so that it can be extended
> >> to whatever expressivity is required in the future, without bolting on
> >> a new query system that's incompatible with the existing one.
> > > > We have two ways to compose the system. If the simple query language > > is extended, for example, to support word searches of the text field > > instead of substring searches, then a server can say > > > > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > > > > > This is backwards compatible, so the normal DAS queries work. But > > a client can recognize the new feature and support whatever new filters > > that 'word-search' indicates, eg > > > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre* > > > > (finds features with notes containing words starting with 'Andre' ) > > > > These are composable. For example, suppose Sanger allows modification > > date searches of curation events. Then it might say > > > > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > > > > > so this is limited to single-argument search functions? > > > and I can search for notes containing words starting with "Andre" > > which were modified by "dalke" between 2002 and 2005 by doing > > > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*& > > modified-by=dalke&modified-before=2005&modified-after=2002 > > but the compositionality is always associative since the CGI parameter > constraint forbids nesting > > > An advantage to the simple boolean logic of the current system > > is that the GUI interface is easy, and in line with existing > > simple search systems. > > there's nothing preventing you from implementing a simple GUI on top of > an expressive system - there is nothing forcing you to use the > expressivity > > > If someone wants to implement a new search system which is > > not backwards compatible then the server can indicate that > > alternative with a new CAPABILITY. 
> [the remainder of the quoted exchange between Andrew Dalke and chris
> mungall is snipped here; it repeats, verbatim, the message that
> appears in full earlier in this digest]

_______________________________________________
DAS2 mailing list
DAS2 at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/das2

-- Lincoln D. Stein
Cold Spring Harbor Laboratory

From Gregg_Helt at affymetrix.com Mon Mar 20 12:40:19 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 20 Mar 2006 09:40:19 -0800
Subject: [DAS2] call today?
Message-ID: 

Apologies, I forgot to post that today's DAS/2 teleconference was
cancelled. The feeling on Friday was that after the code sprint last
week we needed a break. The teleconference will resume next week on
the regular schedule (Mondays at 9:30 AM Pacific time).

Thanks,
Gregg

> -----Original Message-----
> From: Andreas Prlic [mailto:ap3 at sanger.ac.uk]
> Sent: Monday, March 20, 2006 9:02 AM
> To: Andrew Dalke; Helt,Gregg
> Cc: Thomas Down
> Subject: call today?
>
> Hi Dasians,
>
> do we have a conference call today?
>
> Cheers,
> Andreas
>
> -----------------------------------------------------------------------
>
> Andreas Prlic Wellcome Trust Sanger Institute
> Hinxton, Cambridge CB10 1SA, UK
> +44 (0) 1223 49 6891

From cjm at fruitfly.org Mon Mar 20 18:45:46 2006
From: cjm at fruitfly.org (chris mungall)
Date: Mon, 20 Mar 2006 15:45:46 -0800
Subject: [DAS2] query language description
In-Reply-To: <200603201232.41522.lstein@cshl.edu>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
<200603201232.41522.lstein@cshl.edu>
Message-ID: <7900d1398d5045a268a5f6fe51af529d@fruitfly.org>

I guess things need to be left open for a DAS/3...
On Mar 20, 2006, at 9:32 AM, Lincoln Stein wrote: > The current filter query language, which provides one level of ANDs > and a > nested level of ORs, satisfies our use cases. It is not clear to me > what > additional benefit we'll get from a composable query language. Note > that none > of the popular and functional genome information sources -- NCBI, UCSC, > Ensembl or BioMart -- offer a composable query language, and there > does not > seem to be rioting on the streets! > > Lincoln > > > On Friday 17 March 2006 19:20, chris mungall wrote: >> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote: >>>> right now they are forced to bypass the constraint language and go >>>> direct >>>> to SQL. >>> >>> In addition, we provide defined ways for a server to indicate >>> that there are additional ways to query the server. >> >> I was positing this as a bad feature, not a good one. or at least a >> symptom of an incorrectly designed system (at least in the case of the >> GO DB API - it may not carry forward to DAS - though if you're going >> to >> allow querying by terms...) >> >>>> None of these really fit into the DAS paradigm. I'm guessing you >>>> want >>>> something simple that can be used as easily as an API with get-by-X >>>> methods but will seamlessly blend into something more powerful. I >>>> think what you have is on the right lines. I'm just arguing to make >>>> this language composable from the outset, so that it can be extended >>>> to whatever expressivity is required in the future, without bolting >>>> on >>>> a new query system that's incompatible with the existing one. >>> >>> We have two ways to compose the system. If the simple query language >>> is extended, for example, to support word searches of the text field >>> instead of substring searches, then a server can say >>> >>> >> query_uri="http://somewhere.over.rainbow/server.cgi"> >>> >>> >>> >>> This is backwards compatible, so the normal DAS queries work. 
But >>> a client can recognize the new feature and support whatever new >>> filters >>> that 'word-search' indicates, eg >>> >>> http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre* >>> >>> (finds features with notes containing words starting with 'Andre' ) >>> >>> These are composable. For example, suppose Sanger allows >>> modification >>> date searches of curation events. Then it might say >>> >>> >> query_uri="http://somewhere.over.rainbow/server.cgi"> >>> >>> >>> >> >> so this is limited to single-argument search functions? >> >>> and I can search for notes containing words starting with "Andre" >>> which were modified by "dalke" between 2002 and 2005 by doing >>> >>> http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*& >>> modified-by=dalke&modified-before=2005&modified-after=2002 >> >> but the compositionality is always associative since the CGI parameter >> constraint forbids nesting >> >>> An advantage to the simple boolean logic of the current system >>> is that the GUI interface is easy, and in line with existing >>> simple search systems. >> >> there's nothing preventing you from implementing a simple GUI on top >> of >> an expressive system - there is nothing forcing you to use the >> expressivity >> >>> If someone wants to implement a new search system which is >>> not backwards compatible then the server can indicate that >>> alternative with a new CAPABILITY. Suppose Thomas at Sanger >>> comes up with a new search mechanism based on an object query >>> language he invented, >>> >>> >> query_uri="http://sanger.ac.uk/oql-search" /> >>> >>> The Sanger and EBI clients might understand that and support >>> a more complex GUI, eg, with a text box interface. Everyone >>> else must ignore unknown capability types. >> >> but this doesn't integrate with the existing query system >> >>> Then that would be POSTED (or whatever the protocol defines) >>> to the given URL, which returns back whatever results are >>> desired. 
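[The flat AND-of-ORs filter URLs in the example above can be composed mechanically. A minimal Python sketch, assuming the hypothetical endpoint and filter names from this thread (note-wordsearch, modified-by, etc.) rather than anything defined in the DAS/2 spec:

```python
from urllib.parse import urlencode

def build_query(base_url, filters):
    """Compose a flat DAS-style filter query.

    filters maps a filter name to a list of values.  Distinct names
    are ANDed by the server; a repeated name (multiple values) is
    ORed within that field, matching the one-level-of-ANDs,
    nested-ORs model discussed in this thread.
    """
    pairs = [(name, value)
             for name, values in filters.items()
             for value in values]
    return base_url + "?" + urlencode(pairs)

# The composed search from the mail: notes with words starting
# "Andre", modified by dalke between 2002 and 2005.
url = build_query("http://somewhere.over.rainbow/server.cgi", {
    "note-wordsearch": ["Andre*"],
    "modified-by": ["dalke"],
    "modified-after": ["2002"],
    "modified-before": ["2005"],
})

# Repeating a parameter expresses OR within one field.
or_url = build_query("http://somewhere.over.rainbow/server.cgi",
                     {"type": ["exon", "CDS"]})
```

Note that urlencode percent-escapes the `*`, so the wildcard convention is purely a server-side interpretation.]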
>>> >>> Or the server can point to a public MySQL port, like >>> >>> >> query_uri="mysql://username:password at hostname:port/databasename" >>> /> >>> >>> That's what you are doing to bypass the syntax, except that >>> here it isn't a bypass; you can define the new interface in >>> the DAS sources document. >>> >>>> The generic language could just be some kind of simple >>>> extensible function syntax for search terms, boolean operators, >>>> and some kind of (optional) nesting syntax. >>> >>> Which syntax? Is it supposed to be easy for people to write? >>> Text oriented? Or tree structured, like XML, or SQL-like? >> >> I'd favour some concrete abstract syntax that looks much like the >> existing DAS QL >> >>> And which clients and servers will implement that search >>> language? >> >> all servers. clients optional >> >>> If there was a generic language it would allow >>> OR("on segment Chr1 between 1000 and 2000", >>> "on segment ChrX between 99 and 777") >>> which is something we are expressly not allowing in DAS2 >>> queries. It doesn't make sense for the target applications >>> and by excluding it it simplifies the server development, >>> which means less chance for bugs. >> >> this example is pointless but it's easy to imagine plenty of ontology >> term queries or other queries in which this would be useful >> >> I guess I depart from the normal DAS philosophy - I don't see this >> being a high barrier for entry for servers, if they're not up to this >> it'll probably be a buggy hacky server anyway >> >>> Also, I personally haven't figured out a decent way to >>> do a GUI composition of a complex boolean query which is >>> as easy as learning the query language in the first place. >> >> doesn't mean it doesn't exist. >> >> i'm not sure what's hard about having say, a clipboard of favourite >> queries, then allowing some kind of drag-and-drop composition >> >>> A more generic language implementation is a lot of overhead >>> if most (80%? 
90%) need basic searches, and many of the >>> rest can fake it by breaking a request into parts and >>> doing the boolean logic on the client side. >> >> this is always an option - if the user doesn't mind the additional >> possibly very high overhead. it's just a little bit of a depressing >> approach, as if Codd's seminal paper from 1970 or whenever it was >> never >> happened. >> >>> Feedback I've heard so far is that DAS1 queries were >>> acceptable, with only a few new search fields needed. >>> >>>> hmm, not sure how useful this would be - surely you'd want something >>>> more dasmodel-aware? >>> >>> The example I gave was a bad one. What I meant was to show >>> how there's an extension point so someone can develop a new >>> search interface and clients can know that the new functionality >>> exists, without having to change the DAS spec. >> >> ok >> >> that's probably all I've got to say on the matter, sorry for being >> irksome. I guess I'm fundamentally missing something, that is, why >> wrap >> simple and expressive declarative query languages with limited ad-hoc >> constraint systems with consciously limited expressivity and limited >> means of extensibility.. >> >> cheers >> chris >> >>> Andrew >>> dalke at dalkescientific.com >>> >>> _______________________________________________ >>> DAS2 mailing list >>> DAS2 at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/das2 >> >> _______________________________________________ >> DAS2 mailing list >> DAS2 at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/das2 > > -- > Lincoln D. 
Stein > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 > FOR URGENT MESSAGES & SCHEDULING, > PLEASE CONTACT MY ASSISTANT, > SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Tue Mar 21 18:21:11 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 21 Mar 2006 15:21:11 -0800 Subject: [DAS2] complex features Message-ID: I've been working on the data model some, trying to get a feel for complex features. I've also been evaluating how GFF3 handles them. Both use a parent/child link, though GFF3 only has the reference to the parent while DAS has both. That means DAS clients can determine when all of the complex features have been downloaded. GFF3 potentially requires waiting until the end of the library, though there is a way to hint that all the results have been returned. Both allow complex graphs. That is, both allow cycles. I assume we are restricting complex features to DAGs, but even then the following is possible

    [root1]  [root2]  [root3]
      |  \      |      /
      |   \     |     /
      |    ------------------
      |    |     node 4     |
      |    ------------------
      |   /
      |  /
      |/
    [node 5]

Node 4 has three parents (root1, root2 and root3) and node 5 has two parents (root1 and node4). This may or may not make biological sense. I don't know. I only point out that it's there. I feel that complex annotations must only have a single root element, even if it's a synthetic one with no location. Next, consider writeback, with the following two complex features

    [root1]              [root2]
      |     \               |
      |      \              |
      |       \             |
    [node1.1]  [node1.2]  [node2.1]

Suppose someone adds a new "connector" node

     >-->---.
     |      V
    [root1] |          [root2]
      |  \  |             |
      |   \ |             |
      |    \ ^            |
    [node1.1] [node1.2] | [node2.1]
                 |        |
                 V        |
            [connector]-->--->--^

Should that sort of thing be allowed? What's the model for the behavior? It seems to me there's a missing concept in DAS relating to complex features. My model is that the "complex feature" is its own concept, which I've been calling an "annotation". 
All simple features are annotations. The connected nodes of a complex feature are also annotations. As such, two annotations cannot be combined like this. Writeback only occurs at the annotation level, in that new feature elements cannot be used to connect two existing annotations. We might also consider having a new interface for annotations (complex features), so they can be referred to by URI. I don't think that's needed right now. Andrew dalke at dalkescientific.com From cjm at fruitfly.org Tue Mar 21 19:43:49 2006 From: cjm at fruitfly.org (chris mungall) Date: Tue, 21 Mar 2006 16:43:49 -0800 Subject: [DAS2] complex features In-Reply-To: References: Message-ID: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> The GFF3 spec says that Parent can only be used to indicate part_of relations. If we go by the definition of part_of in the OBO relations ontology, or any other definition of part_of (there are many), then cycles are explicitly verboten, although the GFF3 docs do not state this. There's no reason in general why part_of graphs should have a single root, although it's certainly desirable from a software perspective. Dicistronic genes throw a bit of a spanner in the works. There's nothing to stop you adding a fake root, or referring to the maximally connected graph as an entity in its own right however. I don't know enough about DAS/2 to be helpful with the writeback example. It looks like your example below is a gene merge. On Mar 21, 2006, at 3:21 PM, Andrew Dalke wrote: > I've been working on the data model some, trying to get a feel > for complex features. I've also been evaluating how GFF3 handles > them. > > Both use a parent/child link, though GFF3 only has the reference > to the parent while DAS has both. That means DAS clients can > determine when all of the complex feature have been downloaded. > GFF3 potentially requires waiting until the end of the library, > though there is a way to hint that all the results have been > returned. 
> > Both allow complex graphs. That is, both allow cycles. I > assume we are restricting complex features to DAGs, but even > then the following is possible > > [root1] [root2] [root3] > | \ | / > | \ | / > | ------------------ > | | node 4 | > | ------------------ > | / > | / > |/ > [node 5] > > Node 4 has three parents (root1, root2 and root3) and > node 5 has two parents (root1 and node4) > > This may or may not make biological sense. I don't know. I > only point out that it's there. > > I feel that complex annotations must only have a single root > element, even if it's a synthetic one with no location. > > Next, consider writeback, with the following two complex features > > [root1] [root2] > | \ | > | \ | > | \ | > [node1.1] [node1.2] [node2.1] > > > Suppose someone adds a new "connector" node > >> -->---. > | V > [root1] | [root2] > | \ | | > | \ | | > | \ ^ | > [node1.1] [node1.2] | [node2.1] > | | > V | > [connector]-->--->--^ > > Should that sort of thing be allowed? What's the model > for the behavior? > > It seems to me there's a missing concept in DAS relating to > complex features. My model is that the "complex feature" is > its own concept, which I've been calling an "annotation". > All simple features are annotations. The connected nodes of > a complex feature are also annotations. > > As such, two annotations cannot be combined like this. > Writeback only occurs at the annotation level, in that > new feature elements cannot be used to connect two existing > annotations. > > We might also consider having a new interface for annotations > (complex features), so they can be referred to by URI. I > don't think that's needed right now. 
> > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From boconnor at ucla.edu Tue Mar 21 19:47:51 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Tue, 21 Mar 2006 16:47:51 -0800 Subject: [DAS2] das.biopackages.net Message-ID: <44209EB7.9070008@ucla.edu> The DAS/2 server located at das.biopackages.net may be unavailable on and off for the next hour or so. Just wanted to let everyone know in case someone is using it. --Brian From dalke at dalkescientific.com Thu Mar 23 16:44:00 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 23 Mar 2006 13:44:00 -0800 Subject: [DAS2] complex features In-Reply-To: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> References: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> Message-ID: <53840452abca7236130efd4e57f42aef@dalkescientific.com> chris: > The GFF3 spec says that Parent can only be used to indicate part_of > relations. If we go by the definition of part_of in the OBO relations > ontology, or any other definition of part_of (there are many), then > cycles are explicitly verboten, although the GFF3 docs do not state > this. It looks like the most recent spec at http://song.sourceforge.net/gff3.shtml does state this, although the earlier one did not: "A Parent relationship between two features that is not one of the Part-Of relationships listed in SO should trigger a parse exception. Similarly, a set of Parent relationships that would cause a cycle should also trigger an exception." > There's no reason in general why part_of graphs should have a single > root, although it's certainly desirable from a software perspective. > Dicistronic genes throw a bit of a spanner in the works. There's nothing > to stop you adding a fake root, or referring to the maximally connected > graph as an entity in its own right however. 
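[The constraints under discussion — part_of cycles forbidden, graphs that may have more than one root — are easy to check mechanically. A minimal Python sketch over a hypothetical feature-to-parents mapping; the data structure and function names are illustrative, not from any DAS/2 or GFF3 implementation:

```python
def _all_ids(parents):
    """Collect every feature mentioned as a child or a parent."""
    ids = set(parents)
    for ps in parents.values():
        ids.update(ps)
    return ids

def find_roots(parents):
    """Return the features that have no parent (the roots)."""
    return sorted(f for f in _all_ids(parents) if not parents.get(f))

def has_cycle(parents):
    """Detect a cycle by walking parent links depth-first.

    Standard three-color DFS: a GRAY node seen again on the current
    path is a back edge, i.e. a forbidden part_of cycle.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        if color.get(node, WHITE) == GRAY:
            return True            # back edge: cycle found
        if color.get(node, WHITE) == BLACK:
            return False           # already fully explored
        color[node] = GRAY
        if any(visit(p) for p in parents.get(node, [])):
            return True
        color[node] = BLACK
        return False

    return any(visit(n) for n in _all_ids(parents))

# The multi-root example from the mail: node 4 has three parents
# (root1, root2, root3) and node 5 has two (root1 and node4).
g = {"node4": ["root1", "root2", "root3"],
     "node5": ["root1", "node4"]}
```

Running `find_roots(g)` shows all three roots, which is exactly the case that makes "when is this complex feature fully downloaded?" awkward for a client.]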
I've been working with GFF3 data for a few days now, trying to catch the different cases. It isn't hard, but it had been a long time since I worried about cycle detection. The biggest problem has been keeping all the "could be a parent" elements around until the entire data set is finished. Except for features with no ID and no Parents, parsers need to go to the end of the file (or no-forward-references line) before being able to do anything with the data. In DAS it's easier because each feature lists all parents and children, so it's possible to detect when a complex feature is ready. Even then it requires a bit of thinking to handle cases with multiple roots. It would be much easier if either all complex features were in an element or if there was a unique name to tie them together. Another solution is to make the problem simpler. I see, for example, that biopython doesn't have any gff code and the biojava one only works at the single feature level. Only bioperl implements a gff3 parser with support for complex features, but it assumes all complex features are single rooted and that the features are topologically sorted, so that parents come before children. It also looks like a diamond structure (single root, two children, both with the same child) is supported on input but the output assumes features are trees. For example, I tried it just now on dmel-4-r4.3.gff from wormbase, which I'm finding to be a bad example of what a GFF file should look like. It contains one duplicate ID, which bioperl catches and dies on. I fixed it. It then complains with a lot of MSG: Bio::SeqFeature::Annotated=HASH(0xba4a93c) is not contained within parent feature, and expansion is not valid, ignoring. because the features are not topologically sorted, as in this (trimmed) example. The order is the same as in the file. 4 sim4:na_dbEST.same.dmel match_part 5175 5627 ... Parent=88682278868229;Name=GH01459.5prime 4 sim4:na_dbEST.same.dmel match 5175 5627 ... 
ID=88682278868229;Name=GH The simpler the data model we use (eg, single rooted, output must be topologically sorted with parents first) then the more likely it is for client and server code to be correct and the more likely there will be more DAS code. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Fri Mar 24 13:19:41 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Fri, 24 Mar 2006 18:19:41 +0000 Subject: [DAS2] 100th das1 source in registry Message-ID: <23fe2aa8d3c4a9afc28782b3d3e58032@sanger.ac.uk> Hi! Today the 100th DAS1 source was registered in the DAS registration server at http://das.sanger.ac.uk/registry/ It currently counts 101 DAS sources from 23 institutions in 9 countries. The purpose of the DAS registration service is to keep track which DAS services are available and to help with automated discovery of new DAS servers on the client side. Regards, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Fri Mar 24 13:37:21 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Fri, 24 Mar 2006 10:37:21 -0800 Subject: [DAS2] 100th das1 source in registry Message-ID: Congratulations! On a related note, is there a way to automatically register DAS/2 servers yet? If not, can I send you info to add the Affymetrix test DAS/2 server to the registry? Thanks, Gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andreas Prlic > Sent: Friday, March 24, 2006 10:20 AM > To: DAS/2 > Subject: [DAS2] 100th das1 source in registry > > Hi! > > Today the 100th DAS1 source was registered in the DAS registration > server at > > http://das.sanger.ac.uk/registry/ > > It currently counts 101 DAS sources from 23 institutions in 9 countries. 
> > The purpose of the DAS registration service is to keep track which DAS > services are available > and to help with automated discovery of new DAS servers on the client > side. > > Regards, > Andreas > > > ----------------------------------------------------------------------- > > Andreas Prlic Wellcome Trust Sanger Institute > Hinxton, Cambridge CB10 1SA, UK > +44 (0) 1223 49 6891 > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From ap3 at sanger.ac.uk Sat Mar 25 06:13:06 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Sat, 25 Mar 2006 11:13:06 +0000 Subject: [DAS2] 100th das1 source in registry In-Reply-To: References: Message-ID: > On a related note, is there a way to automatically register DAS/2 > servers yet? A beta version can be tried at the toy-registry at http://www.spice-3d.org/dasregistry/registerDas2Source.jsp and the results will be visible at http://www.spice-3d.org/dasregistry/das2/sources - so far this provides a simple upload mechanism that is based on the sources description. what is still missing is a validation of the user provided data ("does this request really give a features response?") plus other things like an HTML representation of the das2 servers. I think it would be great if Andrew's Dasypus server could provide an interface to the validation mechanism that could be used by programs. If validation fails the response could contain a link, to point the user to the nice error report web page. will be abroad next week so can't join for the call... 
Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Mar 27 11:24:53 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 27 Mar 2006 08:24:53 -0800 Subject: [DAS2] Agenda for today's teleconference Message-ID: We're back on the standard DAS/2 teleconference schedule, every Monday at 9:30 AM Pacific time. Suggestions for today's agenda: Code sprint summary DAS/2 grant status Writeback spec & implementation ??? Teleconference # US: 800-531-3250 International: 303-928-2693 Conference ID: 2879055 Passcode: 1365 From Steve_Chervitz at affymetrix.com Mon Mar 27 14:05:28 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 27 Mar 2006 11:05:28 -0800 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 27 Mar 2006 Message-ID: Notes from the weekly DAS/2 teleconference, 27 Mar 2006 $Id: das2-teleconf-2006-03-27.txt,v 1.1 2006/03/27 19:03:30 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Gregg Helt CSHL: Lincoln Stein Dalke Scientific: Andrew Dalke UC Berkeley: Nomi Harris UCLA: Allen Day Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. 
Proposed agenda: * Code sprint summary * DAS/2 grant status * Writeback spec & implementation [Notetaker: missed the first 40min - apologies] Topic: Code sprint summary -------------------------- gh: pleased with our progress during the last code sprint (13-17 Mar) [Notetaker: detailed summaries of what folks did during this code sprint are described here: http://lists.open-bio.org/pipermail/das2/2006-March/000668.html ] Topic: Writeback ---------------- [Discussion in progress] ls: in my model, every feature has a unique id, when you update it, it's going to make the change to the object and not create a new one. the object is associated with url in some way, when you update the position of this exon, it's going to change some attributes of it. gh: thomas proposed the alternative: every time you change a feature you create a new one with a pointer back to the old one. ad: can't speak for what db implementers will do for versioning of features. only talking about merging from different complex features. So only when you merge from complex ones. ls: this is the history tracking business. writeback will explicitly support merges and splits. ad: how detailed does the spec need to be? ls: driven by requirements. ad: what are the reqts? I can't go further without more details. roy said every modification gets new version, so you could do time travel, if your db supported that. ls: does igb or apollo explicitly support merges and splits among transcripts? gh: yes. curation in igb is experimental (now turned off). but it does support these. as does apollo. so these are essential. ls: writeback should have instructions for how feature will adopt children of a subfeature. one feature adopts children of the other and previous feature is now deprecated. there's a specific set of operations for creating new features, renaming, splitting, and merging. perhaps Nomi should write down what operations apollo supports. 
nh: yes, all those are supported as well as things like adjusting endpoints of start of translation. apollo can merge transcripts within a gene and between genes (which offers to merge the associated genes). curators can do 'splurge' - a split, merge combo. ls: that sounds like suzi's nomenclature. gh: the db that apollo writes back to, do changes create new versions of feature or change the feature itself? nh: not sure. mark did the work with chado. I know they were doing something to rewrite the entire feature if anything changed. [A] nomi will ask Mark to join in discussion next week (3 April). aday: what fraction of the operations are doing simple vs complex things? e.g., revising the gene model. nh: revision happens a lot. mostly adjusting endpoints. splits and merges are infrequent. adding annotation. But it doesn't matter how infrequent the operations are, we either support them or we don't. ad: when there are changes in the model, how does the client get notified that the change occurred? nh: that's tricky. gh: this is outside the scope of the das/2 spec itself. as long as we have locks to prevent simultaneous modification, that is sufficient. ad: there's no mechanism for polling server. gh: yes, just requery server. gh: but your client doesn't do it. gh: I'm thinking of adding polling to get the last modified stuff. For now, one can simply re-start your session to see what has changed. aday: is the portion of writeback spec for modifying endpoints, simple add/delete of annotations stable? ad: the general idea is unchanged. gh: priority here is before next meeting: brian and allen read over writeback spec and identify any issues as implementers. aday: looking for an 80% solution. not dealing with inheritance, which is difficult. nh: splits and merges can be done with combos of simpler ops. aday: performance operations will be affected. graph flattening and partial indexes. 
splits and merges will affect this table, so will have to trigger update of that table any time there's a split/merge. this will have big impact on query performance: could be 1-2 sec for yeast, 30-60 min for human. gh: what about if you do that update 1x/day? Then users would be working off a snapshot that was current as of the end of previous day. aday: caching on server responses will also be affected, unless we turn caching off. maybe I can tell apache to remove a subset of cached pages and leave others intact. aday: for tiling requests - server could find affected blocks and purge those, instead of purging the entire cache. gh: you can't rely on any client to use your tiling strategy. but could be helpful for those clients that use it. aday: basically we'll have to turn caching off when we start doing writeback. gh: is there a way for server to detect what has changed? gh: if database detects change it can flush cache for that sequence. aday: maybe. possibly the easiest way to do this is via tiling. gh: say you have two servers: 1) everything that can be edited 2) everything that has been edited (slower) aday: main server has all features and second server handles writeback, just writes to gff file, then cron runs once a night to merge the gff into the db. gh: separate dbs: 1) curation 2) everything that has been edited. aday: yes. persistent flat file adapter can be used for one of them. gh: this is the sort of detail I'm looking for w/r/t development of the writeback spec. [A] allen and brian look over writeback spec to discuss on 3 April. From nomi at fruitfly.org Mon Mar 27 14:42:59 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 27 Mar 2006 11:42:59 -0800 Subject: [DAS2] Mark Gibson on Apollo writeback to Chado Message-ID: mark gibson said that he plans to attend next monday's DAS/2 teleconference. 
he also gave me permission to forward this message that he wrote recently in response to a group that is adapting apollo and wondered what he thought about direct-to-chado writeback vs. the use of chadoxml as an intermediate storage format. FlyBase Harvard prefers to use the latter approach because (we gather) they worry about possibly corrupting the database by having clients write directly to it. if anyone from harvard is reading this and feels that mark has misrepresented their approach, please set us straight! Nomi On 10 March 2006, Mark Gibson wrote: > Im rather biased as I wrote the chado jdbc adapter [for Apollo], but let me put forth my > view of chado jdbc vs chado xml. > > The chado Jdbc adapter is transactional, the chado xml adapter is not. What this > means is jdbc only makes changes in the database that reflect what has actually > been changed in the apollo session, like updating a row in a table; with chado > xml you just get the whole dump. So if a synonym has been added jdbc will add a > row to the synonym table. For xml you will get the whole dump of the region you > were editing (probably a gene) no matter how small the edit. > > What I believe Harvard/Flybase then does (with chado xml) is wipe out the gene > from the database and reinsert the gene from the chado xml. The problem with > this approach is if you have data in the db thats not associated with apollo > (for flybase this would be phenotype data) then that will get wiped out as well, > and there has to be some way of reinstating non-apollo data. If you dont have > non-apollo data and dont intend on having it in the future this isnt a huge > issue I suppose. I think Harvard is integrating non-apollo data into their chado > database. 
> > I think what they are going to do is actually figure out all of the transactions > by comparing the chado xml with the chado database, which is what apollo already > does, but I'm not sure as Im not so in touch with them these days (as Im not > working with apollo these days - waiting for new grant to kick in). > > Since the paradigm with chado xml is wipe out & reload, then apollo has to make > sure it preserves every bit of the chado xml that came in. Theres a bunch of > stuff thats in chado/chado xml that the apollo datamodel is unconcerned with, > and has no need to be concerned with as its stuff that it doesnt visualize. In > other words apollos data model is solely for apollos task of visualizing data, > not for roundtripping what we call non-apollo data. In writing the chado xml > adapter for FlyBase, Nomi Harris had a heck of a time with these issues, and she > can elaborate on this I suppose. > > I'm personally not fond of chado xml because its basically a relational database > dump, so its extremely verbose. It redundantly has information for lots of joins > to data in other tables - like a cvterm entry can take 10 or 20 lines of chado > xml, and a given cvterm may be used a zillion times in a given chado xml file > (as every feature has a cvterm). So these files can get rather large. > > The solution for this verbose output is to use what I call macros in chado xml. > Macros are supported by xort. They take the 15 line cvterm entry and reduce it > to a line or 2 making the file size much more reasonable. The apollo chado xml > adapter does not support macros, so you have to use unmacro'd chado xml for > apollo purposes. Nomi Harris had a hard enough time getting the chado xml > adapter working for flybase(and did a great job with a harrowing task), that she > did not have time to take on the macro issue. 
If you wanted macros (and smaller > file sizes) you would have to add this functionality to the chado xml adapter > (are there java programmers in your group?). > > One of the arguments against the jdbc adapter is that its dangerous because it > goes straight into the database so if there are any bugs in the data adapter > then the database could get corrupted - some groups find this a bit precarious. > This is a valid argument. I think theres 2 solutions here. One is to thoroughly > test the adapter out against a test database until you are confident that bugs > are hammered out. > > Another solution is to not go straight from apollo to the database. You can use > an interim format and actually use apollo to get that interim format into the > database. Of course one choice for interim format is chado xml and then you are > at the the chado xml solution. The other choice for file format is GAME xml. You > can then use apollo to load game into the chado database, and this can be done > at the command line (with batching) so you dont have to bring up the gui to do > it. Also chado xml can be loaded into chado via apollo as well (of course xort > does this as well but not with transactions) > > So then the question is if Im not going to go straight into the database, why > would I choose game over chado xml? Or if Im using chado xml should I use > apollo or xort to load into chado. I think if you are using chado xml it makes > sense to use xort as it is the tried & true technology for chado xml. The > advantage of going through apollo is that it also uses the transactions from > apollo (theres a transaction xml file) and thus writes back the edits in a > transactional way as mentioned above rather than in a wipe out & reload fashion. > > Also Game is a tried & true technology that has been used with apollo in > production at flybase (before chado came along) for many years now. 
One > criticism of it has been that its DTD/XSD/schema has been a moving target and has > not been well described. That is not as true anymore. Nomi Harris has made an xsd for > it as well as an rng. But I must confess that I have recently added the ability > to have one-level annotations in game (previously 1-level annotations had to be hacked as 3 > levels). Also game is a lot less verbose than un-macro'd chado xml, as it more > or less fits with the apollo datamodel. One advantage of chado xml over game xml > is that it is more flexible in terms of taking on features of arbitrary depth. > > The chado xml adapter was developed for FlyBase and as far as I know has not > been taken on by any other groups yet. Nomi can elaborate on this, but I think > what this might mean is that there are places where things are FlyBase specific. > If you went with chado xml the adapter would have to be generalized. It's a good > exercise for the adapter to go through, but it will take a bit of work. Nomi can > probably comment on how hard generalizing might be. I could be wrong about this > but I think the current status with the chado xml adapter is that Harvard has > done a bunch of testing on it but they haven't put it into production yet. > > The jdbc adapter is being used by several groups so has been forced to be > generalized. One thing I have found is that chado databases vary all too much > from mod to mod (ontologies change). There is a configuration file for the jdbc > adapter that has settings for the differences that I encountered. I initially > wrote it for cold spring harbor's rice database that will be used in classrooms. > It's working for rice in theory, but they haven't actually used it much in the > classroom yet. For rice the model is to save to game and use the apollo command line > to save game & transactions back to chado. > > Cyril Pommier, at the INRA - URGI - Bioinformatique, has taken on the jdbc > adapter for his group. 
I have cc'd him on this email as I think he will have a > lot to say about the jdbc adapter. Cyril has uncovered many bugs and has fixed a > lot of them (thank you cyril) as he's a very savvy java programmer. And he has > also forced the adapter to generalize and brought about the evolution of the > config file to adapt to chado differences. But as Cyril can attest (Cyril feel > free to elaborate) it has been a lot of work to get jdbc working for him. There > were a lot of bugs to fix that we both went after. Hopefully now it's a bit more > stable and the next db/mod won't have as many problems. I think Cyril is still at > the test phase and hasn't gone into production (Cyril?) > > Berkeley is using the jdbc adapter for an in-house project. They are using the > jdbc reader to load up game files (as the straight jdbc reader is slow, since the > chado db is rather slow) which are then loaded by a curator. They are saving > game, and then I think chris mungall is xslting game to chado xml which is then > saved with xort - or he is somehow writing game in another way - not actually > sure. The Berkeley group drove the need for 1-level annotations (in jdbc, game, & > apollo datamodel). > > Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they use it there. I > believe they are intending to use the write adapter but don't yet do so (Jonathan?). > > I should mention that reading jdbc straight from chado tends to be slow, as I > find that chado is a slow database, at least for Berkeley. It really depends on > the db vendor and the amount of data. TIGR's reading is actually really zippy. > The workaround for slow chados is to dump game files that read in pretty fast. > > In all fairness, you should probably email FlyBase (& Chris Mungall) and > get the pros of using chado xml & xort, which they can give a far better answer > on than I. 
> > Hope this helps, > Mark From dalke at dalkescientific.com Mon Mar 27 15:59:28 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 27 Mar 2006 13:59:28 -0700 Subject: [DAS2] cell phone battery dead Message-ID: <3d9298aced5c4efb7d9c34574fcf7618@dalkescientific.com> Sorry about the drop out towards the end of today's conversation. The battery on my phone died. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Fri Mar 3 17:34:11 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Fri, 3 Mar 2006 09:34:11 -0800 Subject: [DAS2] working das validator In-Reply-To: <44479892cb0e465913b82e02a5c2525c@dalkescientific.com> Message-ID: Andrew, Nice work on the web interface to the validator. Before you dive back into the spec, could you troubleshoot these 500 errors I'm getting on your server? URL: http://das.biopackages.net/das/genome With the "guess" radio button I get: 500 Internal error .... TypeError: GuessFromHeader() takes exactly 2 arguments (1 given) With any other radio button I get: 500 Internal error .... AttributeError: BodyError instance has no attribute 'args' Steve > From: Andrew Dalke > Date: Fri, 3 Mar 2006 02:55:02 -0700 > To: DAS/2 > Subject: [DAS2] working das validator > > I have a running validator at > > http://cgi.biodas.org:8080/ > > > I've only tested it with SOURCES document but there's little > that would fail with the others. > > I had planned to get this up a couple days ago but I've been > distracted learning more about Javascript and a couple of Javascript > libraries. I used Mochikit to make the interactivity you see > there, and I have some ideas about how to use Dojo -- but not > for a couple of weeks. 
> > The code goes through the following validation steps: > > - TODO - handle if the URL is not fetchable and handle timeouts > - check that the content-type agrees with the document type > - check that it's well-formed XML; report error where not > - check that the root element matches the document type > - check that it passed the Relax-NG validation; > - report the id and href fields which are empty strings > - report if any date fields are not iso dates > > There are many more checks I could add. They are easy now > that the scaffold is there. > > I'm going to work on the next draft now. > > After that I'll get back to the validator. I want to add > hyperlinks on fields which are links, and I have an idea of > how to add a "SEARCH" button next to the query urls which > creates a popup where you can fill in the different fields > before doing the search. > > Budget-wise I'm not sure how to charge the last few days > of work as it was a "wouldn't it be neat if" project rather > than something really needed. It is neat though ... > > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Mar 3 18:04:12 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 3 Mar 2006 11:04:12 -0700 Subject: [DAS2] working das validator In-Reply-To: References: Message-ID: <5d7729f77f8d4b6dcbd8dacd04701c19@dalkescientific.com> Hi Steve, I saw those errors in the log file but wasn't sure if they were from you or Gregg. > URL: http://das.biopackages.net/das/genome > > With the "guess" radio button I get: > > 500 Internal error > .... > TypeError: GuessFromHeader() takes exactly 2 arguments (1 given) Fixed. > With any other radio button I get: > > 500 Internal error > .... > AttributeError: BodyError instance has no attribute 'args' Fixed. 
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sun Mar 5 01:59:15 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sat, 4 Mar 2006 18:59:15 -0700 Subject: [DAS2] current text of draft 3 of spec Message-ID: <5e3c38635022ba8ae291cd6c4e036eef@dalkescientific.com> I've been working on the 3rd draft for the spec. Because of the confusion in the previous version I've decided on a different approach where I jump into the middle and describe how the parts fit together before getting into the details of every element type or the theory behind the architecture. I think this flows much better. ==================== DAS is a protocol for sharing biological data. This version of the specification, DAS 2.0, describes features located on the genomic sequence. Future versions will add support for sharing annotations of protein sequences, expression data, 3D structures and ontologies. The genomic DAS interface is deliberately designed so there will be a large core shared with the protein sequence DAS. A DAS 2.0 annotation server provides feature information about one or more genome sources. Each source may have one or more versions. Different versions are usually based on different assemblies. As an implementation detail an assembly and corresponding sequence data may be distributed via a different machine, which is called the reference server. Annotations are located on the genomic sequence with a start and end position. The range may be specified multiple times if there are alternate coordinate systems. An annotation may contain multiple non-contiguous parts, making it the parent of those parts. Some parts may have more than one parent. Annotations have a type based on terms in SOFA (Sequence Ontology for Feature Annotation). Stylesheets contain a set of properties used to depict a given type. Annotations can be searched by range, type, and a properties table associated with each annotation. These are called feature filters. 
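The parent/part relationship described above (an annotation with multiple non-contiguous parts, where a part may have more than one parent) can be sketched as a tiny in-memory model. This is an illustrative sketch only — the class and attribute names are my own, not terms from the draft spec:

```python
# Minimal sketch of the DAS 2.0 annotation model described above.
# Names (Feature, uid, so_type, add_part) are illustrative assumptions.
class Feature:
    def __init__(self, uid, so_type):
        self.uid = uid          # unique identifier (a URL in DAS 2.0)
        self.so_type = so_type  # SOFA/SO term, e.g. "gene", "exon"
        self.parents = []       # a part may have more than one parent
        self.parts = []         # a parent may have many parts

    def add_part(self, part):
        # Link both directions so the hierarchy can be walked either way.
        self.parts.append(part)
        part.parents.append(self)

gene_a = Feature("feature/gene-a", "gene")
gene_b = Feature("feature/gene-b", "gene")
shared_exon = Feature("feature/exon-1", "exon")
gene_a.add_part(shared_exon)
gene_b.add_part(shared_exon)  # one part, two parents
```

A client would build such a graph while parsing a features document, then walk `parts` to render, say, a transcript with its exons.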
DAS 2.0 is implemented using a ReST architecture. Each document (also called an entity or object) has a name, which is a URL. Fetching the URL gets information about the document. The DAS-specific documents are all in XML. Other data types have existing widely used formats, and sometimes more than one for the same data. A DAS server may provide a distinct document for each of these formats, along with information about which formats are available. DAS 2.0 addresses some shortcomings of the DAS 1.x protocol, including: * Better support for hierarchical structures (e.g. transcript + exons) * Ontology-based feature annotations * Allow multiple formats, including formats only appropriate for some feature types * A lock-based editing protocol for curational clients * An extensible namespacing system that allows annotations in non-genomic coordinates (e.g. uniprot protein coordinates or PDB structure coordinates) ===== A DAS server supplies information about genomic sequence data sources. The collection of all sources, each data source, and each version of a data source are accessible through a URL. All three classes of URLs return a document of content-type 'application/x-das-sources+xml' though likely with differing amounts of detail. A 'versioned source' request returns information only about a specific version of a data source. A 'source' request returns the list of all the versioned source data for that source. A 'sources' request returns the list of all the source data, including all the versioned source data. The URLs might not be distinct. For example, a server with only one version of one data source may use the same URL for all three documents, and a server for a single organism may use the same URL for the 'sources' and 'source' documents. Most servers will list only the data sources provided by that server. Some servers combine the sources documents from other servers into a single document. 
These registry servers act as a centralized index and reduce configuration and network overhead. A registry server uses the same sources format as an annotation server. Here is an example of a simple sources document which makes no distinction between the three sources categories. Request: http://www.example.com/das/genome/yeast.xml Response: Content-Type: application/x-das-sources+xml All identifiers and href attributes in DAS documents follow the XML Base specification (see http://www.w3.org/TR/xmlbase/ ) in resolving partial identifiers and href attributes. In this case the id "yeast.xml" is fully resolved to "http://www.example.com/das/genome/yeast.xml". Here is an example of a more complicated sources document with multiple organisms each with multiple versions. Each of the two source documents (one for each organism) has a distinct URL as does each of the versions for each organism. This is a pure registry server because the actual annotation data comes from other machines. Request: http://www.biodas.org/known_servers Response: Content-Type: application/x-das-sources+xml Each SOURCE id and VERSION id is individually fetchable so the URL "http://das.ensembl.org/das/SPICEDS/" returns a sources document with the SOURCE record for "das_vega_trans" and both of its VERSION subelements while "http://das.ensembl.org/das/SPICEDS/128/" returns a sources document with only the second of its VERSION subelements. DAS documents refer to other documents through URLs. There are no restrictions on the internal form of the URLs, other than the query string portion. Server implementers are free to choose URLs which best fit the architecture needs. For example, a simple DAS server may be implemented as a set of XML files hosted by a standard web server while more complex servers with search support may be implemented as CGI scripts or through embedded web server extensions. The URLs do not need to define a hierarchical structure nor even be on the same machine. 
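The XML Base resolution of partial identifiers described above follows ordinary relative-URL resolution, so a client can sketch it with Python's `urllib.parse.urljoin`; the URLs below are the ones from the examples in the text:

```python
from urllib.parse import urljoin

# The base URI is the URL the sources document was fetched from.
base = "http://www.example.com/das/genome/yeast.xml"

# The partial id "yeast.xml" resolves to the document's own URL.
assert urljoin(base, "yeast.xml") == "http://www.example.com/das/genome/yeast.xml"

# A VERSION id like "128/" resolves against its SOURCE document's URL.
assert (urljoin("http://das.ensembl.org/das/SPICEDS/", "128/")
        == "http://das.ensembl.org/das/SPICEDS/128/")
```

Resolving identifiers this way, rather than building URLs by string concatenation, is what allows the spec to leave the internal form of server URLs unconstrained.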
Compare this to the DAS1 specification where some URLs were constructed by direct string modification of other URLs. ===== Each versioned source contains a set of segments. A segment is the largest chunk of contiguous sequence. For fully sequenced organisms a segment may be a chromosome. For partially assembled genomes where the distance between the assembled regions is not known then each region may be its own segment. If a server provides annotations in contig space then each contig is a segment. Feature locations are specified on ranges of segments which is why a specific set of segments is called a coordinate system. [coordinate-system] This specification does not describe how to do alignments between different coordinate systems. The sources document format has two ways to describe the coordinate system. The optional COORDINATES element uniquely characterizes the coordinate system. If two data sources have the same authority and source values then they must be annotations on the same coordinate system. The specific coordinate system is also called the "reference sequence". A versioned source may contain CAPABILITY elements which describe different ways to request additional data from a DAS server. Each CAPABILITY has a type that describes how to use the corresponding URL to query a DAS server. A CAPABILITY element of type "segments" has a query URL which returns a document of content-type "application/x-das-segments+xml". A segments document lists information about the segments in the coordinate system. Here is an example of a segments document. Request: http://www.biodas.org/das2/h.sapiens/v3/segments.xml Response: Content-Type: application/x-das-segments+xml ===== The versioned source record for an annotation server must include a CAPABILITY of type "features". A client may use the query URL from the features CAPABILITY to select features which match certain criteria. 
If no criteria are specified the server must return all features unless there are too many features to return. In that case it must respond with an error message. Unless an alternate format is specified, the response from the features query is a document of content-type "application/x-das-features+xml" containing all of the matching features. Here is an example features document for a server which contains a gene and an alignment. Request: http://das.biopackages.net/das/genome/yeast/S228C/features.pl Response: Content-Type: application/x-das-features+xml Each feature has a unique identifier and an identifier linking it to a type record. Both identifiers are URLs and should be directly fetchable. Simple features can be located on a region of a segment. More complex features like a gapped alignment are represented through a parent/part relationship. A feature may have multiple parents and multiple parts. ===== An annotation server may contain many features while the client may only be interested in a subset; most likely features in a given portion of the reference sequence. To help minimize the bandwidth overhead the feature query URL should support the DAS feature filter language. The syntax uses the standard HTML form-urlencoded GET query syntax. For example, here is a request for all features on Chr2. Request: http://www.example.org/volvox/1/features.cgi?inside=Chr2 Response: Content-Type: application/x-das-features+xml and here is the rather long one for all EST alignments Request: http://www.example.org/volvox/1/features.cgi? type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment Response: Content-Type: application/x-das-features+xml ===== All features are linked to a type record. DAS types do not describe a formal type system in that DAS types do not derive from other DAS types. Instead each type links to an external ontology term and describes how to depict features of that type. A DAS annotation server must contain a CAPABILITY element of type "types". 
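Since feature filters use standard form-urlencoded GET syntax, and the type filter value is itself a URL, a client has to percent-encode it. A small sketch with Python's standard library reproduces the "rather long" EST-alignment request shown above (the filter names come from the text; the code is illustrative, not a reference client):

```python
from urllib.parse import urlencode

base = "http://www.example.org/volvox/1/features.cgi"

# A type filter: the value is a type record URL, so ':' and '/' must be
# percent-encoded when it is used as a query parameter.
query = urlencode({"type": "http://www.example.org/volvox/1/type/est-alignment"})
url = base + "?" + query
# url == base + "?type=http%3A%2F%2Fwww.example.org%2Fvolvox%2F1%2Ftype%2Fest-alignment"

# A range filter is much shorter, e.g. all features on Chr2:
chr2_url = base + "?" + urlencode({"inside": "Chr2"})
# chr2_url == base + "?inside=Chr2"
```

Using `urlencode` (rather than pasting strings together) keeps the encoding correct no matter what characters appear in the filter values.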
A client may use its query URL to fetch a document of content-type "application/x-das-types+xml". The document lists all of the types available on the server. We expect that servers will have at most a few dozen types so DAS does not support type filters. The following is a hypothetical example of a DAS annotation server providing GENSCAN gene predictions for zebrafish. Each feature is either of type "http://www.example.org/das/zebrafish/build19/high-type" or "http://www.example.org/das/zebrafish/build19/low-type" depending on whether the data provider determined it was a high probability or low probability prediction. Even though there are two different type records, they refer to the same ontology term, in this case the SO term for "gene". The distinction exists so that the high probability features are depicted differently from the low probability features. Request: http://www.example.org/das/zebrafish/build19/types Response: Content-Type: application/x-das-types+xml [coordinate-system] We make a distinction between "coordinate system" and "numbering system". The coordinate system is the set of segments on which features are located. The numbering system describes how to identify the specific residues in the segment. DAS uses a 0-based coordinate system where the first residue is numbered "0", the second "1", and so on. Other numbering systems include 1-based coordinates and the PDB numbering system which preserves the residue number for key residues across a homologous family by allowing discontinuities, insertions and negative values as position numbers. Andrew dalke at dalkescientific.com From nomi at fruitfly.org Mon Mar 6 08:09:22 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 6 Mar 2006 00:09:22 -0800 (PST) Subject: [DAS2] DAS/2 teleconference? Message-ID: <17419.60978.358549.246997@kinked.lbl.gov> Is there a DAS/2 teleconference tomorrow morning? Last week it didn't happen. 
Nomi From dalke at dalkescientific.com Mon Mar 6 09:14:30 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 6 Mar 2006 02:14:30 -0700 Subject: [DAS2] DAS/2 teleconference? In-Reply-To: <17419.60978.358549.246997@kinked.lbl.gov> References: <17419.60978.358549.246997@kinked.lbl.gov> Message-ID: Nomi: > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > happen. I plan on calling in. Andrew dalke at dalkescientific.com From Gregg_Helt at affymetrix.com Mon Mar 6 14:03:24 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 06:03:24 -0800 Subject: [DAS2] DAS/2 teleconference? Message-ID: Apologies for the mixup with the teleconference last week! Yes we're definitely on for a teleconference today at the standard time, 9:30 AM Pacific time. Thanks, Gregg > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Nomi Harris > Sent: Monday, March 06, 2006 12:09 AM > To: DAS/2 > Subject: [DAS2] DAS/2 teleconference? > > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > happen. > Nomi > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From lstein at cshl.edu Mon Mar 6 14:49:18 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 6 Mar 2006 09:49:18 -0500 Subject: [DAS2] DAS/2 teleconference? In-Reply-To: References: Message-ID: <200603060949.19299.lstein@cshl.edu> Hi Gregg, I'll miss the first half hour of the call today because of an overlap with an NCI teleconference. Lincoln On Monday 06 March 2006 09:03, Helt,Gregg wrote: > Apologies for the mixup with the teleconference last week! Yes we're > definitely on for a teleconference today at the standard time, 9:30 AM > Pacific time. 
> > Thanks, > Gregg > > > -----Original Message----- > > From: das2-bounces at portal.open-bio.org > > [mailto:das2-bounces at portal.open- > > > bio.org] On Behalf Of Nomi Harris > > Sent: Monday, March 06, 2006 12:09 AM > > To: DAS/2 > > Subject: [DAS2] DAS/2 teleconference? > > > > Is there a DAS/2 teleconference tomorrow morning? Last week it didn't > > happen. > > Nomi > > > > _______________________________________________ > > DAS2 mailing list > > DAS2 at portal.open-bio.org > > http://portal.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Mar 6 16:44:43 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 08:44:43 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 Message-ID: upcoming Code Sprint, March 13-17 at Affymetrix status reports coordinate system resolution via COORDINATES element features with multiple locations vs. alignments features with multiple parents ??? From lstein at cshl.edu Mon Mar 6 17:37:39 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 6 Mar 2006 12:37:39 -0500 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <200603061237.41288.lstein@cshl.edu> Hi, The teleconference system now asks me for a passcode. Previously I just had to enter the conference ID. What's up? Lincoln On Monday 06 March 2006 11:44, Helt,Gregg wrote: > upcoming Code Sprint, March 13-17 at Affymetrix > status reports > > coordinate system resolution via COORDINATES element > features with multiple locations vs. alignments > features with multiple parents > ??? 
> > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From Gregg_Helt at affymetrix.com Mon Mar 6 17:38:37 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 09:38:37 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 Message-ID: Please try again, it shouldn't ask for a passcode, but if it does, it's 1365. There may be some glitch in our teleconferencing... Thanks, Gregg > -----Original Message----- > From: Brian O'Connor [mailto:boconnor at ucla.edu] > Sent: Monday, March 06, 2006 9:36 AM > To: Helt,Gregg > Cc: das2 at portal.open-bio.org > Subject: Re: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 > > Hi Gregg, > > I tried calling in to the DAS conference call but it asked for a > passcode in addition to the conference ID. All I have is the conference > ID... > > --Brian > > Helt,Gregg wrote: > > >upcoming Code Sprint, March 13-17 at Affymetrix > >status reports > > > >coordinate system resolution via COORDINATES element > >features with multiple locations vs. alignments > >features with multiple parents > >??? > > > > > >_______________________________________________ > >DAS2 mailing list > >DAS2 at portal.open-bio.org > >http://portal.open-bio.org/mailman/listinfo/das2 > > > > > > From nomi at fruitfly.org Mon Mar 6 17:40:26 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 6 Mar 2006 09:40:26 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <17420.29706.575212.913804@spongecake.lbl.gov> i am calling in (800-531-3250, id: 2879055) but it is then asking me for a passcode. i tried entering 2879055 again but that didn't work. 
we didn't used to have a passcode, did we? can someone tell me what it is? if you prefer not to email it, you can phone me at 510 486-5078. Nomi From Gregg_Helt at affymetrix.com Mon Mar 6 18:10:23 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 6 Mar 2006 10:10:23 -0800 Subject: [DAS2] Examples of features with multiple locations from biopackages server Message-ID: In the teleconference today, we're talking about features with multiple locations, here's an example from biopackages server: From boconnor at ucla.edu Mon Mar 6 17:36:28 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Mon, 06 Mar 2006 09:36:28 -0800 Subject: [DAS2] Proposed agenda for DAS/2 teleconference, March 6 In-Reply-To: References: Message-ID: <440C731C.5070303@ucla.edu> Hi Gregg, I tried calling in to the DAS conference call but it asked for a passcode in addition to the conference ID. All I have is the conference ID... --Brian Helt,Gregg wrote: >upcoming Code Sprint, March 13-17 at Affymetrix >status reports > >coordinate system resolution via COORDINATES element >features with multiple locations vs. alignments >features with multiple parents >??? > > >_______________________________________________ >DAS2 mailing list >DAS2 at portal.open-bio.org >http://portal.open-bio.org/mailman/listinfo/das2 > > > From dalke at dalkescientific.com Mon Mar 13 14:00:45 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 06:00:45 -0800 Subject: [DAS2] format information for the reference server Message-ID: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com> (NOTE: the open-bio mailing lists were moved from portal.open-bio.org to lists.open-bio.org. My first email on this bounced because I sent to the old email address.) Summary of questions: - what does it mean for the annotation server to list the formats available from the reference server? - can the reference server format information be moved to the segments document? 
- are there formats which will only work at the segment level and not at the segments level (ie, formats which don't handle multiple records)? Something's been bothering me about the segments request. Currently the DAS sources request responds with something like ... This says "go to 'blah' for information about the sequence". But it says more than that. It provides metadata about the reference server. It says that the reference server can respond in 'fasta' and 'agp' formats. Hence the following are allowed from this URL http://blah/seq?format=agp -- return the assembly http://blah/seq?format=fasta -- return all sequences in FASTA format Does this mean that all annotation servers using the given reference server must list all of the available formats? If a client sees multiple CAPABILITY elements for the same query_url is it okay to merge the list of supported formats? That is, if server X says that annotation server A supports fasta and server Y says that A supports genbank then a client may assume A supports both fasta and genbank formats? (This makes sense to me.) Second, does it make sense to require the annotation servers to list the formats on the reference server? What about making that information available from the segments document, like this. query: http://www.biodas.org/das/h.sapiens/38/segments.cgi response: A problem with this is the lack of data saying that the segments query URL itself supports multiple formats. For example, http://www.biodas.org/das/h.sapiens/38/segments.cgi?format=fasta might support returning all of the chromosomes in FASTA format. Are there any formats which only work at the segment level and not at the segments level? That is, which only work with single gene/chromosome/contig/etc. but don't support multiple sequences? The only one I could think of off-hand is "raw", since there's no concept of a "record" given a bunch of letters, unless the usual way is to separate them by an extra newline? 
If all formats are supported for both single and all segments then here is another possible response [possibility #1] I think all formats which work on the "segments" level also work on a single segment level, so another possibility is the following, which lets a given segment say that it supports more formats. [possibility #2] Here's another, using a flag to say if a format is for a single segment, the segments URL, or both (feel free to pick better names!). By default it applies to both. [possibility #3] Yet another option is [possibility #4] Of these I support [possibility #1], with the ability to go to [possibility #3] if there's ever a case where a given format cannot be applied to both levels. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Mon Mar 13 14:29:28 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 06:29:28 -0800 Subject: [DAS2] id, url, uri, and iri Message-ID: Something to settle. I've been using 'id' like this > type_id = "type/est-alignment" > created = "2001-12-15T22:43:36" > modified = "2004-09-26T21:10:15" > > > > > > As Dave Howorth pointed out, most people use 'id' as an in-document identifier, and not as an identifier to link to other documents. E.g., there's a "getElementById()" method in the DOM which is meant to find DOM nodes given the id. In looking around I found that it's keyed off of the type (as determined by the schema) and not by the string 'id'. I added 'xml:id' as a possible DAS attribute, which is defined by the XML spec to work as expected for getElementById. In private email Gregg asked about using 'uri' instead of 'id' for this. I'm now leaning that way. Either 'uri' or 'url' or 'iri'. I prefer url because everyone knows what that means. Gregg prefers 'uri' I think because that's what allows fragment identifiers, and because it includes things which are other than URLs, like LSIDs.
However, the latest thing these days is an "iri" which means "internationalized resource identifier" http://www.ietf.org/rfc/rfc3987.txt I haven't read enough of it to understand it. My first attempt says that it's okay to use "uri" because there are 1-to-1 mappings between uris and iris. Also, I don't want to test bidirectional text and I suspect there isn't yet widely used library support for iris. So I want to change the DAS use of 'id' to 'url' and say "the value of the 'url' attribute is a URI". Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Mar 13 15:38:58 2006 From: Steve_Chervitz at affymetrix.com (Chervitz, Steve) Date: Mon, 13 Mar 2006 07:38:58 -0800 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 6 Mar 2006 Message-ID: [These are notes from last week's meeting. -Steve] Notes from the weekly DAS/2 teleconference, 6 Mar 2006 $Id: das2-teleconf-2006-03-06.txt,v 1.1 2006/03/13 15:41:03 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt CSHL: Lincoln Stein Sanger: Thomas Down Dalke Scientific: Andrew Dalke UC Berkeley: Nomi Harris UCLA: Brian O'Connor Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. Agenda: ------- upcoming Code Sprint, March 13-17 at Affymetrix status reports coordinate system resolution via COORDINATES element features with multiple locations vs. alignments features with multiple parents ??? 
[ Some trouble with passcode for teleconf - hopefully fixed ] TD: The coord system things we were hoping to discuss with Andreas, who won't make it today. GH: We can push this off till next week. Code Sprint ------------- LS: At sanger mon-tues for ensembl sab meeting, able to participate from tues pm to fri eve. AD: Planning to come to Affy BO: Allen and I are planning to come up to Emeryville GH: For payment, submit expenses to affy. Hotels? Marriott or Woodfin. Will send out rec's today. NH: Planning to attend at affy mon-tues, thur. [A] Ed will look into accts for andrew and brian (internet access) GH: Plan on 9-10am phone teleconf daily. Gregg can pick up people from hotel. GH: Goals/deliverables for this code sprint? LS: Write das/2 client for bioperl. Plan to plug into Gbrowse. All I need is a working server AD: Writing writeback and locks, improving validator. NH: Apollo and registry, feature types. Wrote a writer, can test in AD's validator (plan to). GH: Keep working on das/2 client for igb at affy. Hoping by then to have an affy das/2 server up and running. SC: Can help get it up GH: Can we put one in our dmz, so it's publicly accessible at least for the code sprint. [A] Steve will look into setting up publicly accessible affy das/2 test server TD: Working on getting an Ensembl das/2 server up. GH: Java middleware on top of biojava? TD: Yes. Using the biojava to ensembl bridges. EE: Getting IGB to use style sheets. AD: And/or using a proper style sheet system, if you decide what I put in there is not good enough. BO: Looking for something to do. Hoping to start on writeback. Helping separate out igb model layer. Finished rpm packages in last code sprint, this is pretty much done. GH: Guess Allen will be working on the biopackages server. BO: Waiting on spec for writeback. AD: My writeup specifies how they do writeback at Sanger, overlaps well with Lincoln's proposal. See that. GH: Need to tighten up the read-only spec.
A fair number of things to resolve. AD: A partial draft of 3rd version. Planning to update it before next sprint. Examples so people can get a feel for how things go together. GH: My agenda stuff: coord system resolution system to match annotations on same genome coming from diff servers. [A] Gregg will wait for Andreas to join in before discussing coordinate issues. GH: Feats w/ multiple locations (see email Gregg sent to the list today with examples). Current spec says if you use >1 coord system, you can have feats with multiple locations. Is this what we want to say? GH: Allen's server has feats w/ >1 location on same coord system. Do we want to allow or disallow? If disallow, how? AD: Possible use case for alignments. GH: Feat model for bioperl. Locations have multiple parts. Feats with mult locations feels similar to that. Do you have multiple children each with a loc, or do you use the align element? LS: Prefers children. That's what SO ended up doing after much arguing. Makes it easier. GH: Enforce it with the ontology. E.g., an alignment hit has alignment hsps. This forces client to understand the ontology. LS: Consider that an hsp will have scores attached to it, different cigar line. So you end up with mult children anyway. An impoverished type of alignment. Can use cigar line to indicate mismatches. Can have a single HSP and a cigar line to indicate gaps. Only one child. You don't have to have multiple locations. GH: Looking for use case of multiple locations with PCR products... My main concern is how much semantic knowledge the clients need to understand these things. Nothing in the spec that restricts mult locations. AD: Won't client just get the multiple children and not care about types? GH: I guess a simple client could do that. It disturbs me that it's up to the server how to handle multiple locations, children, vs. alignments. Will send an example. LS: Yes, this is a vague area. There should be a best-practices section in the spec.
Single match feature from begin to end. HSP children, each one covers major gaps. Cigar line w/in hsp to cover minor gaps. Can give each hsp an alignment score. GH: Main diff between locn and alignment is cigar string, and cigar string is optional. If we're allowed to use locations to designate alignments... LS: How about if we consolidate location and alignment: location has an optional cigar and then do away with alignment. Generalize location to allow for gaps. TD: Example: Aligning an est to the genome. Falls into several blocks of exact/near exact matching. If location has cigar line, could serve it up as a single feature. GH: You can do this since cigar can represent arbitrary length gaps. TD: Neat and compact way to do it. Does this scare anyone? GH: Sounds reasonable. AD: Let's do it. And will put in examples of best practices. [A] Consolidate location and alignment in spec, loc has optional cigar GH: Feats with mult parents. Need examples to test. This is a question to people putting up servers. Will anyone have these? TD: Ensembl might do this. Exon shared between several transcripts. A toss-up between multiple parents vs. multiple copies of same exon. Think mult parents is the way to do it. LS: Flybase uses multiple parents for exons in this way. TD: Ensembl db is a many-to-many between transcripts and exons. GH: Spec says: If you have a child in the feat document, you have to include its parent; if you have a parent you must include its children. As long as this policy plays nice with that requirement, I'm ok with it. GH: Anyone else see things that need to be ironed out in spec? AD: Not yet NH: We should write a paper about das/2. This will help get more people using it, increase the success of the spec. GH: Agreed -- good idea. We have lots of text in grant about the philosophy of das/2. NH: Can pull text from these places. Publish at a conference perhaps? ISMB, CSB2006 GH: PLoS Bioinformatics?
NH: Conference would be nice, to involve people in discussion. AD: Poster session is available for ISMB. NH: Prefers a conference talk. A paper will require something more finished and stable. Poster is too much work for little payoff. AD: Ann L complains that the only paper to cite for das is an old ref. Wants an updatable citable paper. NH: CSB will publish a proceedings. Genome informatics at CSHL (they don't publish though). NH/GH: What's the best conference to get published in these days? LS: ISMB NH: We missed deadline for it. LS: Biocurators meeting? NH: Can ask Sima about. Another one: Computational Genomics (TIGR sponsored). Not published. Submit abstracts, they select talks. Halloween in Baltimore. If conf proceedings are published, you can't also submit a journal paper; since this one doesn't publish, we could go that way, get double mileage out of it. GH: Sounds good to get something ready for a paper rather than a conference. Did a presentation at Bosc, Genome informatics last year. [A] Nomi will help get paper ready for PLoS (after code sprint) AD: Can do poster for ismb, bosc in Brazil, if I end up going. NH: ISMB deadline is 10 May, so we should get going on it GH: Continuation grant submission, in theory has been reviewed, but haven't heard back. Maybe will take another month, to get score back. Final word? LS: Have you checked ERA Commons? They may update it there before you get the note.
(After all, the conf. call is in an hour.) That didn't happen. :( I've attached what I have so far. I'll be working on it more today, and getting things in CVS updated. [Attachment scrubbed by the list archiver: draft3.txt] Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Mon Mar 13 16:47:32 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Mon, 13 Mar 2006 16:47:32 +0000 Subject: [DAS2] format information for the reference server In-Reply-To: <23b58bf3b2a561913b82e02a5c2525c@dalkescientific.com> References: <23b58bf3b2a561142bfd5f6fafb3523a@dalkescientific.com> Message-ID: On 13 Mar 2006, at 14:00, Andrew Dalke wrote: > Summary of questions: > - what does it mean for the annotation server to list the formats > available from the reference server? should this happen? I thought that annotation servers are described by their "coordinate system"; then the registry provides a list of available reference servers that provide the sequences for this. > Something's been bothering me about the segments request. > > Currently the DAS sources request responds with something like > > > > > > > > > ... > > > This says "go to 'blah' for information about the sequence". > > But it says more than that. It provides metadata about > the reference server. It says that the reference server can > respond in 'fasta' and 'agp' formats. I think an annotation server should not know/provide this information; this should come from the reference server / registry > If a client sees multiple CAPABILITY elements for the same > query_url is it okay to merge the list of supported formats? that does not sound clean. > That is, if server X says that annotation server A supports > fasta and server Y says that A supports genbank then a client > may assume A supports both fasta and genbank formats? > (This makes sense to me.)
the client should ask the reference server directly what it speaks / rely on the registration server to have validated that server A speaks indeed what it says it does. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Mar 13 17:13:14 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 13 Mar 2006 09:13:14 -0800 Subject: [DAS2] DAS/2 code sprint conference starting now Message-ID: We just started the daily DAS/2 code sprint teleconference at Affymetrix. US number #: 800-531-3250 International #: 303-928-2693 Conference ID: 2879055 Passcode: 1365 From Gregg_Helt at affymetrix.com Mon Mar 13 20:48:50 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 13 Mar 2006 12:48:50 -0800 Subject: [DAS2] Problem with name feature filter on biopackages server Message-ID: I'm looking into adding the ability in the IGB DAS/2 client to retrieve features by name/id. Trying this out with the biopackages server almost gives me what I want: http://das.biopackages.net/das/genome/yeast/S228C/feature?name=YGL076C except that in the returned XML the parent feature (YGL076C) does not list its children as , though the children list YGL076C as . Any ideas? thanks! gregg From nomi at fruitfly.org Mon Mar 13 22:32:49 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 13 Mar 2006 14:32:49 -0800 (PST) Subject: [DAS2] Where to publish [was Re: Notes from the weekly DAS/2 teleconference, 6 Mar 2006] In-Reply-To: References: Message-ID: <17429.62225.230884.764469@kinked.lbl.gov> On 13 March 2006, Chervitz, Steve wrote: > NH/GH: What's the best conference to get published in these days? > LS: ISMB > NH: We missed deadline for it. > LS: Biocurators meeting? > NH: Can ask Sima about. Sima said: > Next biocurator meeting is probably in early 2007 in the UK.
No plans at > the moment to publish the proceedings, however. > > I think publishing soon in PLoS is a good idea. From dalke at dalkescientific.com Mon Mar 13 23:45:04 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 15:45:04 -0800 Subject: [DAS2] URIs for sequence identifiers Message-ID: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> Proposals: - do not use segment "name" as an identifier - rename it "title" (human readable only) - allow a new optional "alias-of" attribute which is the link to the primary identifier for this segment - change the feature location to use the segment uri - change the feature filter range searches so there is a new "segment" keyword and so the "includes", "overlaps", etc. only work on the given segment, as

  segment=
  inside=$start:$stop
  overlaps=$start:$stop
  contains=$start:$stop
  identical=$start:$stop

- If 'includes', 'overlaps', etc. are given then the 'segment' must be given (do we need this restriction? It doesn't make sense to me to ask for "annotations on 1000 to 2000 of anything") - only allow at most one each of includes, overlaps, contains, or identical (do we need this restriction?) - multiple segments may be given, but then range searches are not supported (do we need this restriction?) Discussion: The discussion on this side of things was based on today's phone conference. Andreas needs data sources to work on multiple coordinate spaces. To quote from Andreas: > There are several servers that understand more than one coordinate > system and can return the same type of data in different coordinates. > (depending on which type of accession code/range was used for the > request ) E.g. there are a couple of zebrafish servers that speak > both in Chromosome and Scaffold coordinates. (reason perhaps > being that zebrafish is an organism that seems to be very difficult > to assemble ?) The current DAS system does not support this because of how it does segment identifiers.
The current scheme looks like this: .... Problem #1: We need two entry points, one to view the segments in Scaffold space, the other to view them in Chromosome space. Solution #1 (don't like it though). Add a "source=" attribute to the CAPABILITY and allow multiple segments capabilities .... I don't like it because it feels like the COORDINATES and CAPABILITY[type="segments"] field should be merged. Still, I'll go with it for now. Problem #2: feature searches return features from either namespace Consider search for name=*ABC* (that is, "ABC" as a substring in the "name" or "alias" fields). Then the result might be Where "A" is a short-hand notation for one of the segments? Which one? The client goes to the segment servers: Query: http://sanger/andreas/scaffolds.xml Response: Query: http://sanger/andreas/chromosomes.xml The segment name "A" matches either ChromosomeA or ScaffoldA, and there's no way to figure out which is correct! This comes because our own naming scheme is not very good at being globally unique. We could fix it by also stating the namespace in the result. Gregg asked "why don't we just use the URI"? After a long discussion we decided to propose just that. That is, get rid of the "name" attribute. Instead, use a "title" attribute which is human readable and an optional "alias-of" which contains the primary identifier for the given segment. The alias-of value is determined by the person who defined the COORDINATES. It could be a URL. It could be a URI. It does not need to be resolvable (though it should - perhaps to a human readable document? Or to something which lists all known aliases to it?) That is, the segments document will look like this Query: http://sanger/andreas/scaffolds.xml Response: Query: http://sanger/andreas/chromosomes.xml This has a few implications.
Feature locations must be given with respect to the segment uri, as Given this segment_uri a client can figure out if it is in Scaffold or Chromosome space because it can check all of the URIs in each space for a match. The other change is in range searches. Consider the current scheme, which looks like includes=ChrA includes=A/100:300 The query is of the form $ID or $ID/$start:$end. It needs to be changed to support URLs. For example, includes={http://www.whatever.com/ChromosomeA} includes={http://www.whatever.com/ScaffoldA}/100:300 We couldn't come up with a better syntax. Then Gregg asked "why do we need multiple includes"? That is, the current syntax supports includes=ChrA/0:1000;includes=ChrB/2000:3000;includes=ChrC/5000:6000 to mean "anywhere on the first 1000 bases of ChrA, the 3rd 1000 bases of ChrB, or the 6th 1000 bases of ChrC". Given the query language, we're looking for a way to write that using URLs, as includes={http://www.whatever.com/ChromosomeA}/0:1000;includes={http://www.whatever.com/ChromosomeB}/2000:3000;includes={http://www.whatever.com/ChromosomeC}/5000:6000; However, that's a very unlikely query. What if we split the "includes", "overlaps", etc. into "includes_segment" and "includes_range"? In that case:

old-style: includes=A/500:600
new-style: includes_segment=http://www.whatever.com/ChromosomeA; includes_range=500:600

old-style: includes=A/500:600,Chr3/700:800
new-style: includes_segment=http://www.whatever.com/ChromosomeA; includes_range=500:600; includes_range=700:800

old-style: includes=A/500:600,D/700:800
new-style: -- NOT POSSIBLE

old-style: includes=A/500:600,D/500:600
new-style: (not likely to be used in real life) includes_segment=http://www.whatever.com/ChromosomeA; includes_segment=http://www.whatever.com/ChromosomeD; includes_range=500:600;

This no longer allows searches with subranges from different segments. Then again -- who cares? Those sorts of searches are strange. Talking some more.
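The URI-match idea above -- a client deciding whether a feature's segment URI belongs to Scaffold or Chromosome space by checking it against the URIs each segments document listed -- can be sketched in a few lines. This is a hedged sketch only; the space names and segment URIs below are invented examples, and `find_space` is not part of any spec.

```python
# Sketch: resolve a feature's segment_uri to its coordinate space by
# checking it against the segment URIs each space's segments document
# listed. All URIs and space names here are invented for illustration.

def find_space(segment_uri, spaces):
    """spaces: {space_name: set of segment URIs}.
    Return the name of the space containing segment_uri, or None."""
    for name, uris in spaces.items():
        if segment_uri in uris:
            return name
    return None

spaces = {
    "Scaffold": {"http://sanger/andreas/ScaffoldA"},
    "Chromosome": {"http://sanger/andreas/ChromosomeA"},
}
print(find_space("http://sanger/andreas/ChromosomeA", spaces))
# Chromosome
```

The design point is that this lookup only works because segment URIs are globally unique, which is exactly what the bare name "A" failed to provide.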
Who needs the ability to do more than one includes / overlaps / etc. query at a time? Gregg wants the ability to do a combination of includes and overlaps, but that's all. We can simplify the server code by only supporting one inside search, one contains search, and/or one overlaps search, instead of the current system which allows a more constructive geometry, and we can move the segment id out into its own parameter. Allen said that that would prevent more complicated types of analysis on the server, but that anyone doing more complicated searches would pull the data down locally. Does anyone want to do more than one overlaps search at a time? More than one contains search at a time? More than one identical search at a time? (For that matter, does anyone actually want to do an "identical" search? Gregg thinks it will be useful to find any other annotations which are exactly matching the given range. I think that might be better with an "include"/"exclude" combination to have start/end positions within a couple of bases from the specified range.) PROPOSAL: Change the range query language to have

  segment=
  inside=$start:$end
  overlaps=$start:$end
  contains=$start:$end

Example: segment=http://whatever.com/ChromosomeD;inside=5000:6000 Also, only allow at most one includes, one overlaps, and one contains (unless people want it). I'm less sure about the need for this restriction. It might be as easy to implement the more complex search as it would be to check for the error cases.
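The proposal above amounts to building a query string where the segment URI is its own parameter and each range filter appears at most once. A minimal client-side sketch, assuming the proposed parameter names (segment, inside, overlaps, contains); the base URL, segment URI, and helper name are invented for illustration, and real DAS/2 servers may encode things differently.

```python
from urllib.parse import urlencode

# Sketch: build a feature query under the proposed scheme. The segment
# URI is a separate parameter; at most one range filter of each kind.
# Parameter names follow the proposal; everything else is invented.

def build_query(base, segment, inside=None, overlaps=None, contains=None):
    params = [("segment", segment)]
    for key, rng in (("inside", inside),
                     ("overlaps", overlaps),
                     ("contains", contains)):
        if rng is not None:
            start, end = rng
            params.append((key, "%d:%d" % (start, end)))
    return base + "?" + urlencode(params)

url = build_query("http://whatever.com/feature",
                  "http://whatever.com/ChromosomeD",
                  inside=(5000, 6000))
print(url)
# http://whatever.com/feature?segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000
```

Note that percent-encoding the segment URI sidesteps the awkward `{...}` brace syntax discussed earlier, since the encoded URI can no longer be confused with the `start:end` range.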
Andrew dalke at dalkescientific.com From ed_erwin at affymetrix.com Mon Mar 13 23:56:56 2006 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 13 Mar 2006 15:56:56 -0800 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> Message-ID: <441606C8.3070902@affymetrix.com> Andrew Dalke wrote: >>There are several servers that understand more than one coordinate >>system and can return the same type of data in different coordinates. >>(depending on which type of accession code/range was used for the >>request ) E.g. there are a couple of zebrafish servers that speak >>both in Chromosome and Scaffold coordinates. (reason perhaps >>being that zebrafish is an organism that seems to be very difficult >>to assemble ?) > > > The current DAS system does not support this because of how > it does segment identifiers. > > > Problem #1: We need two entry points, one to view the segments > in Scaffold space, the other to view them in Chromosome space. > > Solution #1 (don't like it though). > Add a "source=" attribute to the CAPABILITY and allow multiple > segments capabilities > Problem #2: feature searches return features from either namespace > A different solution: Scaffold and Chromosome coordinate systems are served by separate DAS/2 servers. Each server returns data from one and only one namespace. Those separate servers can, behind-the-scenes, use the same database. DAS/2 clients, like IGB, would choose to connect to either the Scaffold-based server or the Chromosome-based server, but not usually to both at once. Does this handle all the issues? 
Ed From dalke at dalkescientific.com Tue Mar 14 00:12:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 13 Mar 2006 16:12:52 -0800 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <441606C8.3070902@affymetrix.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> <441606C8.3070902@affymetrix.com> Message-ID: <54829d8554d9b044908965d80b158c60@dalkescientific.com> Ed: >> Problem #2: feature searches return features from either namespace > > A different solution: > > Scaffold and Chromosome coordinate systems are served by separate > DAS/2 servers. Each server returns data from one and only one > namespace. > > Those separate servers can, behind-the-scenes, use the same database. > > DAS/2 clients, like IGB, would choose to connect to either the > Scaffold-based server or the Chromosome-based server, but not usually > to both at once. > > Does this handle all the issues? Here's the email I got from Andreas when I proposed this. >>> There may be more than one COORDINATE element if ... (XXX why?) > > There are several servers that understand more than one coordinate > system and > can return the same type of data in different coordinates. (depending > on which type of accession code/range was used for the request ) > E.g. there are a couple of zebrafish servers that speak both in > Chromosome and Scaffold coordinates. > (reason perhaps being that zebrafish is an organism that seems to be > very difficult to assemble ?) >> Will there be separate CAPABILITY items for each source? > > no. if there are then this should be registered as two independent > servers. (but see clarification later) > Allowing multiple coordinate systems per server is a way to slightly > reduce the already long list of known > servers. Currently there are about 90 in the registry (+10 in the last > few weeks...) and there still are about 20 more > which have not been registered (and are provided by the BioSapiens > project). >> Long for who? 
It isn't that much data. > > It is long for somebody who browses manually through the ensembl DAS > configuration and searches for a DAS source to add to. > It's a "long" list for me to read through the DAS server list at > http://das.sanger.ac.uk/registry/listServices.jsp > and although I know this list pretty well, it seems to me a lot of > text/descriptions, etc. >> There is only one reference server for an annotation server. > > I think it should be one reference server per coordinate system. >> But if there are two COORDINATES elements, and you say that >> each has its own reference server, then aren't you saying that >> a single annotation server may have multiple reference servers? > > yes. i believe that this should be possible. >> What's the concern about having >> no more than one coordinate per data source? > > Just last friday somebody asked me how to add a DAS server that has > two coordinate systems to different Ensembl views ( ContigView and > GeneView) > Her initial solution was to provide multiple DAS sources > http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_211 > and > http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_219 > > but I think this could be joined into a single server. In any case, I think the proposal I outlined in the previous email makes things cleaner even without support for multiple coordinate systems on the same server.
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Tue Mar 14 04:22:36 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 13 Mar 2006 20:22:36 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day one, 13 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day one, 13 Mar 2006 $Id: das2-teleconf-2006-03-13.txt,v 1.1 2006/03/14 04:31:36 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt Sanger: Andreas Prlic Dalke Scientific: Andrew Dalke (at Affy) UC Berkeley: Nomi Harris (at Affy) UCLA: Allen Day, Brian O'Connor (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. General note: Passcode is now required to enter teleconf. This is a change in their system. Issue: Continuation Grant ------------------------- gh: no word yet. Issue: Coordinate System ------------------------ ad: question of what happens when there are multiple coordinate systems for an assembly. auth and source, source: contig space, scaffold space auth: organization (e.g. ncbi, ucsc) gh: not enough to get uniqueness. ncbi, genome, human is not enough, need version to uniquely id the coord system ad: auth, source, species, version identification string gh: use case: need to know whether uris for two versioned source refer to the same genome. gh: ncbi version numbers are separate from organism info, eg. v35. 
ad: we could have a service for mapping strings gh: idea - every server can say this assembly name is same as that. Clients could chain together statements from multiple servers. For the affy das server used by igb, we now have a synonyms file on our server which igb reads. It's a pain to maintain. ad: type of alignment server? gh: a synonym server. Here's a uri, give me a list of synonyms that refer to the same thing. This is something to talk more about when Andreas is on line. [Andreas joins in.] GH: How would a das server verify the version info in a sources document points to the same genome assembly? AP: You would check auth=ncbi, vers=35, taxid=human AP: In protein structure space, you check version on every object you work with. Protein seq. gh: so we have to map version info on sequences as well as genome assemblies. gh: use case: two segment responses from diff servers, diff uris for the diff sequences, how do you know they are referring to the same seq. name=chromosome21 vs name=chr21? ad: we require the same name for the same segments. gh: going to fall apart fast. no way to enforce it. People use 1, I, chr1, chromI. ee: can put this in the validation suite. aday: yes. gh: but what do you use for name: accession # for entry, string chr1, etc. gh: important since this is the name that goes to user. ad: could have one slot for computer to use, one for human consumption. ad: for segments there seem to be two diff ids: url, ad: the point of having special ids for segments is segment equivalence from different servers. Separate coordinates element that says how to merge things together. Identifiers in here that are just coordinate space ids, not necessarily for human use. Only for identifying coords. gh: but how do we get people to use it? sc: what about the idea of using checksums as identifiers for a seq? ad: problem of duplicate seqs in an assembly. e.g., same seq from chr1 and chr9. gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an exact duplicate of a region on chr9? sc: we could create the checksum on source:sequence gh: useful to have a central place to ask for diff names for the same coord system. ad: uniqueness idea: coords element, has: auth, source, version, species (optional) uniqueness says these are the names you use. gh: this can fail. What do we say happens when it fails? Should there be a way of resolving it? ad: this is where your synonym table comes in. Publish it? gh: maybe as part of the registry, knows ap: there isn't a big variety in naming because there aren't many people providing assemblies. gh: we already have 10 different synonyms for an assembly ee: this has some performance impact on igb. should have to do it. ap: we should say this is how naming works. gh: will fail. ad: is this required for this version of the spec? gh: need something that can be used now. aday: without hardwiring gh: if we don't agree during the code sprint, then it won't happen for everyone else. aday: using roman numerals for yeast since sgd uses it. ee: trouble with chrX ad: andreas: is there a place for naming of segments to use ap: no, something for the reference server, not coords ad: given these coords, here are the names that are used. ap: same as reference server. gh: maybe registry should provide: here's a coord system and here are the names you can use for ap: you would get a long list for proteins aday: a user who wants to gh: question for brian g: LSID, when you come across this for LSIDs, ncbi is auth for human genome assembly yet they have no lsid for their assembly, how do people refer to their lsid when there's no authority to say what it is? bg: you can't, no one is the authority. but you can write a resolver that queries ncbi under the cover, in your resolver you make ncbi the authority of the lsid, add namespace, object id. Then everyone has to know that your resolver is hosted at some site somewhere.
So there is no satisfactory answer. It's a problem if the authority does not host the resolver. bg: I'm at the w3c meeting at mit, providing a webified resolver, they would host a resolver, everyone would know to go to a well-known web address. bg: you start a convention, enforce it, give error if people don't use it. gh: thinking we need it associated with registry. ap: ref server + coord system, provides ids that can be used, gh: so other ids can be used, but registry server wouldn't support it. ad: site has ftp site for downloading chromosomes, contains names for different segments in the file. How do I go from the ids in this file to the ids that Andreas describes? To make my annotations in the same space. Mapping from file from ncbi. bg: what are your use cases? write back to server? ad: user publishing locally, bg: you make a ref server. gh: experience from das1 is that everyone makes their own reference server and refers to it from their annotation server, using different names. ad: new tag 'coordinates' gh: like enforcing common names at registry server. Can use their own names, they just won't be allowed to post on the registry. ad: need documentation ap: could point to documentation on reference server bg: workflow1: fish researcher looking for aberrant regions in chr7, 11 and 3, singled out the ABC transporter gene. How does that work in das/2? type 'abc' in web page for reference server? This is a gene name. ad: your client browser can go to the registry to find servers that host the assemblies for your fish. Go to those reference servers, do searches there. Will go to coord system, get a segments document, get display chromosome by title. gh: get a das features xml document saying the sequence and coordinates. gh: our discussion here is on getting the diff. ad: we don't have anything on coordinates saying which is the latest version. bg: latest build may have changed their gene coordinate. gh: mapping servers is part of our continuation grant.
Can push an annotation on one assembly to another assembly. bg: a hard thing. gh: that's why we're enlisting UCSC to do it! ad: Topic: id, url, uri, iri (see email) gh: likes uri, not url. Some things aren't really urls (resolvable). Iri might work. ad: multiple coord elements for same ref server. ap: originally there was one, but some use two, zebrafish guy chrom and scaffold coordinates. or chromosomes vs. gene ids. same types, different accession codes and features. ad: if you have graphical browser, do you get scaffolds or chromosomes. ap: depends on your view. gh: if you do a segments query, do you get segments and contigs? ap: depending on the coordinate system of the request. ad: one capability for scaffolds and one for chromosomes? gh: maybe Deliverables: [A] gregg: by end of week, load stuff from multiple servers, compare in the same view. [A] steve will work on getting gregg's das/2 server up and running. gh: trouble with biopackages.net server aday: possible power outage interference. gh: target filters have been dropped. aday: yay! From dalke at dalkescientific.com Tue Mar 14 15:14:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 07:14:44 -0800 Subject: [DAS2] use cases Message-ID: <8bc46502eb164882394a3f4acbe08987@dalkescientific.com> I think these cover the basic use cases. Let me know if there are other reasonable ones I should add. Use Case #1 Biologist viewing genomic region wants to add information from server www.biodas.org/das2/ . Example of use: - Go to "open DAS server" option. Type/paste URL for DAS server. + DAS viewer connects to server, verifies that it annotates the same sequence source and has under (say) 10 types so it makes a new track for each type and does a request for all the features in the current display. Use Case #2 Biologist wants all lac repressors on build 12 of mouse. Example of use: - Start DAS viewer. Go to "find server" option. Select "mouse" from the list of "model organisms".
Select "build 12" from a pull-down menu of build descriptions. Select all the listed servers. - Go to "find annotations" option Now what? Is "lac repressor" a name? Is it a combination of a name and ontology term? Is it a pure ontology term? Use Case #3 Biologist wants to find all the annotation servers for the most recent build of H. sapiens. Example of use: - Start DAS viewer. Go to "find server" option. Type "human" (or "H. sapiens" or "Homo sapiens"). Search. + DAS viewer consults internal NCBI taxonomy table to get taxid. DAS viewer displays all matches. - Sort by build date, select all matching servers by hand Problem: DAS has no field to search by build date Use Case #4 Bioinformaticist wants to make annotations available for build v32 of human. Example of use: - Go to registry server to get a human-readable description of the COORDINATES fields for build v32. - decide to point people to a reference server instead of providing local sequence data - create the sources, types and features document - put them on a web server - go to registry and submit site for future inclusion Use Case #5 IT wants people to use local mirrors of reference server when possible. Example of use: - set up a local registry server + server connects to Andreas' registry server and downloads all the data + server rewrites "segments" sections to use local server - configure all DAS viewers to consult local registry server Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 15:13:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 07:13:44 -0800 Subject: [DAS2] using 'uri' instead of 'id' Message-ID: <9779f55861a4e800d0d21ec8d96deb8c@dalkescientific.com> Okay, I'm convinced. Where things in the spec use 'id' they will now use 'uri'. There are going to be a few wide-spread but shallow changes because of this. 
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 16:09:12 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 08:09:12 -0800 Subject: [DAS2] segments and coordinates Message-ID: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> Summary: I want to - move the COORDINATE element inside of the CAPABILITY[type="segments"] element - add a 'created' timestamp to the COORDINATE (for sorting by time) - add a unique 'uri' identifier attribute to the COORDINATE (two coordinates are equal if and only if they have the same id) - have that identifier be resolvable, to get information about the coordinate system (but perhaps leave the contents for a future spec) In writing the documentation I've been struggling with COORDINATES. No surprise there. The current spec has COORDINATES and the "segments" capability as different elements, like (Note the 'created' timestamp to sort a list of coordinates by the time it was established.) With the current discussion on multiple coordinates, it looks like there is a 1-to-1 relationship between a COORDINATES record and a CAPABILITY record. As that's the case I want to merge them together, as in (note change from "_id" to "_uri") In talking with Andreas I think he agrees that this makes sense. Second, there's a question of identity. When are two coordinates the same? Is it when they have the same (authority, source, version) the same (authority, source, version, taxid) Since taxid is optional, what if one server leaves it out; are the two still the same? I decided to solve it with a unique identifier. Two COORDINATES are the same if and only if they have the same identifier. That identifier just happens to be a URI. It does not need to be resolvable (but should be, with the results viewable at least for humans).
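The inline XML examples in this message were stripped by the list archiver. A hedged reconstruction of the merged form being proposed — element and attribute names inferred from the fragments quoted in the replies on this thread, not authoritative:

```xml
<CAPABILITY type="segments"
            query_uri="http://localhost/das2/h.sapiens/v22/segments">
  <!-- COORDINATES now nested inside the segments capability;
       note the 'uri' identifier (was '_id') and 'created' timestamp -->
  <COORDINATES uri="http://das.sanger.ac.uk/registry/coordinates/ABC123"
               authority="NCBI" version="v22" taxid="9606"
               source="Chromosome" created="2006-03-14T07:27:49" />
</CAPABILITY>
```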
Let's say that http://das.sanger.ac.uk/registry/coordinates/ABC123 is the identifier for: authority=NCBI version=v22 taxid=9606 source=Chromosome created=2006-03-14T07:27:49 Then the following are equivalent. The only difference is the number of properties defined in the COORDINATES tag. In theory these extra values don't need to be in the COORDINATES tag. They are knowable given the uri. But that requires a discovery mechanism for the properties (eg, the COORDINATES identifier might need to be retrievable, with some format or other). There is the possibility of value mismatch, but as Andreas pointed out the registry server can do that validation pretty easily. I mentioned property discovery earlier. Given a coordinates URI there are three things you might want to know: - what is the full list of coordinate system properties? - what is the authoritative reference server for the coordinates? - are there alternate reference servers? What if that was resolvable (doesn't need to be defined for DAS, so this is hypothetical) into something like (Hmmm, those are some ugly names. I usually shy away from '-'s in element and attribute names.) OR, what if the authoritative URL also implemented the segments interface, and we added a COORDINATES element to it? Errr, I don't like that. We will be in charge of the coordinate system URIs but we won't be in charge of the primary reference server. Use Case #6. NCBI releases a new human build. Ensembl releases annotations for it and wants to put the information with Andreas' registry. Example of use: - Set up an Ensembl reference server and annotation server for the new build; test it out - Create a new coordinate system record on the registry - fill in the species, source, doc_href, etc. 
fields - when finished the result is a URL, tied to coordinate info - Stick the COORDINATES information in the versioned source record - Tell the registry server to register the given versioned source URL Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Tue Mar 14 16:21:54 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 08:21:54 -0800 Subject: [DAS2] today's sprint meeting Message-ID: Gregg can't make it this morning and asked that I lead today's meeting. Here are the things I would like to talk about: == segment identifier. Quoting from my email yesterday - do not use segment "name" as an identifier - rename it "title" (human readable only) - allow a new optional "alias-of" attribute which is the link to the primary identifier for this segment - change the feature location to use the segment uri - change the feature filter range searches so there is a new "segment" keyword and so the "includes", "overlaps", etc. only work on the given segment, as segment= inside=$start:$stop overlaps=$start:$stop contains=$start:$stop identical=$start:$stop http://biodas.org/feature.cgi?segment=http://whatever.com/ChromosomeD;inside=5000:6000 (with URL escaping rules for the query string that's ...feature.cgi?segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000) - If 'includes', 'overlaps', etc. are given then the 'segment' must be given (do we need this restriction? It doesn't make sense to me to ask for "annotations on 1000 to 2000 of anything") - only allow at most one each of includes, overlaps, contains, or identical (do we need this restriction? Then again, Gregg only needs a single includes and a single overlaps; perhaps make this even more restrictive?) - multiple segments may be given, but then range searches are not supported (do we need this restriction?) Consensus on this side seems to be fine. The biggest worry is the increasing use of URIs in URL query strings.
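The escaping step described above can be done with a standard URL-encoding routine rather than by hand; a small sketch using Python's standard library (the endpoint and segment URI are the hypothetical ones from the example, not real services):

```python
from urllib.parse import urlencode

# Build a feature-filter query where the segment value is itself a URI.
# urlencode percent-escapes the reserved characters (':', '/') so the
# segment URI survives intact inside the query string.
params = {
    "segment": "http://whatever.com/ChromosomeD",
    "inside": "5000:6000",
}
url = "http://biodas.org/feature.cgi?" + urlencode(params)
print(url)
# http://biodas.org/feature.cgi?segment=http%3A%2F%2Fwhatever.com%2FChromosomeD&inside=5000%3A6000
```

A client that decodes the query string with the matching routine (`urllib.parse.parse_qs`) recovers the original segment URI unchanged.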
== coordinate systems Quoting from an email I wrote recently - move the COORDINATE element inside of the CAPABILITY[type="segments"] element - add a 'created' timestamp to the COORDINATE (for sorting by time) - add a unique 'uri' identifier attribute to the COORDINATE (two coordinates are equal if and only if they have the same id) Result looks like - have that identifier be resolvable, to get information about the coordinate system (but perhaps leave the contents for a future spec) == use 'uri' instead of 'id' in the spec I've decided to go with 'uri' instead of 'id' (or 'url' or 'iri') in its various uses in the spec. == churn My feeling is this is the last major churn. I'm not able to keep up with the documentation writing, which makes it hard for people to get things done. Should I work with people today on getting data sources working and developing example data files for people to review? That is, examples which show and explain the various elements in the spec? I figure more people work from example than from spec description. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Tue Mar 14 16:35:07 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 14 Mar 2006 16:35:07 +0000 Subject: [DAS2] URIs for sequence identifiers In-Reply-To: <441606C8.3070902@affymetrix.com> References: <32b20a44f60c916d9b3649fbcdacd31f@dalkescientific.com> <441606C8.3070902@affymetrix.com> Message-ID: <0cd005042c73d6080c568576a08bb987@sanger.ac.uk> > > A different solution: > > Scaffold and Chromosome coordinate systems are served by separate DAS/2 > servers. Each server returns data from one and only one namespace. > > Those separate servers can, behind-the-scenes, use the same database. > > DAS/2 clients, like IGB, would choose to connect to either the > Scaffold-based server or the Chromosome-based server, but not usually > to > both at once. > > Does this handle all the issues? Hm I see this as a possibility but what about the following:
This would be how to write one server which has two coordinate systems, according to the "one coord sys/server" rule. I think it would be shorter to provide two coordinates sections for that and only one source description... --- fyi, a yeast by Gene_ID server is e.g. http://das.sanger.ac.uk/registry/showdetails.jsp?auto_id=DS_169 Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From ap3 at sanger.ac.uk Tue Mar 14 16:48:09 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 14 Mar 2006 16:48:09 +0000 Subject: [DAS2] segments and coordinates In-Reply-To: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> References: <24b8f786997fdabd72d3cc9c2a370352@dalkescientific.com> Message-ID: On 14 Mar 2006, at 16:09, Andrew Dalke wrote: > Summary: I want to > - move the COORDINATE element inside of the > CAPABILITY[type="segments"] element Is this really needed? > The current spec has COORDINATES and the "segments" capability > as different elements, like > > taxid="9606" created="2006-03-14T07:27:49" /> > query_id="http://localhost/das2/h.sapiens/v22/segments" /> > With the current discussion on multiple coordinates, it > looks like there is a 1-to-1 relationship between a COORDINATES > record and a CAPABILITY record. As that's the case I want > to merge them together, as in (note change from "_id" to "_uri") I think that this is a many-to-many relationship. Do you still want to provide the link to the reference server from an annotation server? This is not needed because the coordinates describe the reference server sufficiently. Annotation servers do not need the segments capability - only the features capability. > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > taxid="9606" created="2006-03-14T07:27:49" /> > > > In talking with Andreas I think he agrees that this makes sense.
If you really *want* to have the link back from the annotation server to the reference then I would propose to put capability under coordinates - i.e. the other way round. > Second, there's a question of identity. When are two coordinates > the same? Is it when they have the same > (authority, source, version) > the same > (authority, source, version, taxid) yes > > Since taxid is optional, what if one server leaves it out; > are the two still the same? no - because if a taxid is specified that is a restriction for one organism. no taxid means that this refers to multiple organisms. > I decided to solve it with a unique identifier. that might be good. this identifier could also be used to restrict searches on servers with many coordinate systems. > > Let's say that > http://das.sanger.ac.uk/registry/coordinates/ABC123 > is the identifier for: > authority=NCBI > version=v22 > taxid=9606 > source=Chromosome > created=2006-03-14T07:27:49 fine > Then the following are equivalent. The only difference is the > number of properties defined in the COORDINATES tag. > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" /> > > > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" > source="Chromosome"/> > > > > query_uri="http://localhost/das2/h.sapiens/v22/segments"> > uri="http://das.sanger.ac.uk/registry/coordinates/ABC123" > source="Chromosome" authority="NCBI" version="v22" taxid="9606" > created="2006-03-14T07:27:49" /> > o.k.
This is a lot of change to the spec given that we are already on the second code sprint, but I think it makes things clearer. Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From dalke at dalkescientific.com Tue Mar 14 20:46:27 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 12:46:27 -0800 Subject: [DAS2] description and title Message-ID: <84c508c1625b5507dd511c8d1ef0f682@dalkescientific.com> Andreas' DAS registry has a description for each versioned source. See http://das.sanger.ac.uk/registry/listServices.jsp . Here's an example of what's in it: Machine learning approach used SWISSPROT variants annotated as disease/neutral as training dataset. Predictions made on all ENSEMBL nscSNPs as to their disease status. I've added an optional 'description' field to the versioned source record for servers that wish to provide that information. Allen's types response had 'name' and 'description' attributes. These were not in the types record. I've added 'description' and added 'title'. I've been using 'title' for short descriptions; a few words long. I've been using 'description' for plain text up to a paragraph. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 00:34:55 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 14 Mar 2006 16:34:55 -0800 Subject: [DAS2] updated examples Message-ID: Checked into das CVS. das/das2/draft3/ The current (incomplete) spec is 'spec.txt'. It is already out of date. The .rnc files are up-to-date. The subdirectory "ucla" contains data from Allen's server, with the format hand-updated. A couple of things to note. I used three different ways of specifying the same namespace: This is to check that you all are doing correct namespace processing.
:) Also, I've gone ahead and added the 'SUPPORTS' element, like this This says that the server only supports 'basic' searches, which means you can only ask it for all the features. There is no feature query language. There is also 'das2queries' which says that the server supports the das2 query language. The following says that you can ask for everything or you can ask for things in the DAS2 query language. If not given the client should assume it supports 'das2queries'. Note that 'basic' is a subset of 'das2queries'. Andrew dalke at dalkescientific.com From lstein at cshl.edu Wed Mar 15 10:46:41 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 10:46:41 +0000 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: References: Message-ID: <200603151046.43196.lstein@cshl.edu> Hi Folks, I just ran through the source request on biopackages.net and it is returning something that is very different from the current spec (CVS updated as of this morning UK time). I understand why there is a discrepancy, but for the purposes of the code sprint, should I code to what the spec says or to what biopackages.net returns? It is much more fun for me to code to a working server because I have the opportunity to watch my code run. Best, Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From lstein at cshl.edu Wed Mar 15 10:39:35 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Wed, 15 Mar 2006 10:39:35 +0000 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: References: Message-ID: <200603151039.36405.lstein@cshl.edu> Hi Folks, Shouldn't the prefix to das2 requests be http://server/blahblah/das2 ?
It would make it easier for clients to load the correct parsing code and would avoid the client having to make a round-trip to the server just to determine whether it is dealing with a das/1 or das/2 server. My apologies if this has already been discussed. Lincoln -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Wed Mar 15 14:32:26 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 06:32:26 -0800 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: <200603151046.43196.lstein@cshl.edu> References: <200603151046.43196.lstein@cshl.edu> Message-ID: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com> Lincoln: > I just ran through the source request on biopackages.net and it is > returning > something that is very different from the current spec (CVS updated as > of > this morning UK time). The server isn't synched with any specific version of the spec. For example, if I make a features request from http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000 I get As from the discussion a few weeks ago we shouldn't be using the standalone="no" since that says the document cannot be understood without consulting the DTD, which doesn't exist. And I don't want a DTD. Also, the namespace needs to be "http://www.biodas.org/ns/das/genome/2.00" (It's missing the 'genome') and the 'FEATURELIST' was replaced with 'FEATURES' a year ago. In the types request the commented-out namespace declaration needs to be there, and the type id 'SO:ARS' needs to be escaped as it's treated as an identifier resolved with the "SO" protocol. Plus, until yesterday I didn't know about the 'name' or 'definition' attributes. These are now in the schema as 'title' and 'description'.
There are a few other differences, like problems in the taxid and empty strings for timestamps. I hand-updated examples from Allen's server yesterday, in cvs under das/das2/draft3/ucla . I found some of these during the update, though others I pointed out about a year ago. Allen doesn't want to update the server until the spec is stable, for two reasons. First, he doesn't like the churn of doing work only to have to make more changes. Second, you're not the only one who says > It is much more fun for me to code to a working > server because I have the opportunity to watch my code run. and Allen's setup doesn't have the ability to implement two versions at the same time. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 14:46:39 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 06:46:39 -0800 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: <200603151039.36405.lstein@cshl.edu> References: <200603151039.36405.lstein@cshl.edu> Message-ID: > Shouldn't the prefix to das2 requests be http://server/blahblah/das2 > ? > > It would make it easier for clients to load the correct parsing code > and would > avoid the client having to make a round-trip to the server just to > determine > whether it is dealing with a das/1 or das/2 server. It doesn't need the round-trip. It can look at the Content-Type to figure that out. Plus, few of the DAS1 servers follow the DAS1 naming scheme. Here's a list from Andreas' registry server. genome.cbs.dtu.dk:9000/das/tmhmm/ genome.cbs.dtu.dk:9000/das/netoglyc/ das.ensembl.org/das/ens_sc1_ygpm/ atgc.lirmm.fr/cgi-bin/das/MethDB/ smart.embl.de/smart/das/smart/ supfam.org/SUPERFAMILY/cgi-bin/das/up/ mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ All of them do have the substring '/das/' somewhere, but not at the start/end of the string. 
Now, the content-type might be "application/xml" and not sufficient to disambiguate between the two documents, but in that case you can dispatch based on the root element type. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 15:05:52 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:05:52 -0800 Subject: [DAS2] XML namespaces Message-ID: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com> I mentioned this yesterday but am doing it again as its own email. This is a quick tutorial on XML namespaces. The DAS spec uses XML namespaces. XML didn't start with namespaces. They were added later. Older parsers, like SAX 1.0, did not understand namespaces. Newer ones, like SAX 2.0, do. By default a document does not have a namespace. For example, <person/> has no namespace. To declare a default namespace use the 'xmlns' attribute. All attributes which start 'xml' or are in the 'xml:' namespace are reserved. <person xmlns="http://www.biodas.org/"/> This is the name 'person' in the namespace 'http://www.biodas.org/'. The namespace is an opaque identifier. It leverages URIs in part because it's much easier to guarantee uniqueness. The combination of (namespace, tag name) is unique. The tag name is also called the "local name". That's to distinguish it from a "qualified name", also called a "qname". These look like <abc:person xmlns:abc="http://www.biodas.org/"/> This element has identical meaning to the previous element using the default namespace. Its qname is 'abc:person' but the full name is the tuple of ("http://www.biodas.org/", "person") For notational convenience this is sometimes written in Clark notation, as {http://www.biodas.org}person The same 'person' element gives the Clark names: person when no namespace is in scope; {}person when the default namespace is set to the empty string ("empty namespace" is different than "no namespace"); and {http://biodas.org/}person whether that namespace is declared as the default or bound to a prefix. The prefix used doesn't matter. Only the combination of (namespace, local name) is important. The Clark notation string captures that as a single string, which is much easier when doing comparisons.
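Namespace-aware parsers do this comparison for you; for instance, Python's ElementTree reports tags directly in Clark notation (a small illustrative sketch, not part of the spec):

```python
import xml.etree.ElementTree as ET

# Three spellings of the same element: a default namespace declaration,
# and two different prefixes bound to the same namespace URI.
docs = [
    '<person xmlns="http://biodas.org/"/>',
    '<abc:person xmlns:abc="http://biodas.org/"/>',
    '<xyz:person xmlns:xyz="http://biodas.org/"/>',
]

# ElementTree expands each tag to Clark notation: {namespace}localname.
tags = [ET.fromstring(doc).tag for doc in docs]

# All three compare equal as strings -- the prefix doesn't matter:
assert tags == ["{http://biodas.org/}person"] * 3
```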
For example, if you try the dasypus verifier at http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features one of the output messages is Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES' but got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113, line 3, column 2 This shows the Clark name for the elements, indicating that the root element has a different namespace and local name from what Dasypus expects. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 15:15:40 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:15:40 -0800 Subject: [DAS2] xml namespaces Message-ID: related to the previous email. The spec uses the namespace http://www.biodas.org/ns/das/genome/2.00 I propose using a smaller and simpler URL. The content does not matter to XML processors. The practice though is to use a URI which is resolvable for more information about the element. For example, xmlns:xlink="http://www.w3.org/1999/xlink" Go to that and the response is > This is an XML namespace defined in the XML Linking Language (XLink) > specification. > > For more information about XML, please refer to The Extensible Markup > Language (XML) 1.0 specification. For more information about XML > namespaces, please refer to the Namespaces in XML specification. Similarly the XHTML namespace URI is http://www.w3.org/1999/xhtml XSLT is http://www.w3.org/1999/XSL/Transform FOAF is http://xmlns.com/foaf/0.1/ which points to the actual documentation. I like the last approach and propose that DAS2 use the namespace http://biodas.org/documents/das2/ Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 15:22:14 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:22:14 -0800 Subject: [DAS2] xml namespaces In-Reply-To: References: Message-ID: Me: > I propose using a smaller and simpler URL. ...
> I like the last approach and propose that DAS2 use the namespace > > http://biodas.org/documents/das2/ But it's such a minor point that not changing it is fine with me. On the other hand, Allen's server doesn't given the right namespace and Gregg's client currently ignores the namespace, so there isn't any extra work. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Wed Mar 15 15:29:56 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 15 Mar 2006 07:29:56 -0800 Subject: [DAS2] search by segment id Message-ID: <712b5b29c53161455f3d9d1b34768937@dalkescientific.com> One thing I came up with yesterday when moving from local identifiers to URIs for the segment names. There are two possible identifiers for a given segment The local name is "http://localhost/das2/segment/chr1" while the well-known global name (of which the local name is an alias) is "http://dalkescientific.com/human35v1/chr1" The global name can be anything. It can be "urn:lsid:chr1" or anything else. It only needs to be unique across all identifiers. Now, are range queries done with the local name or the global one? That is, features?segment=http://localhost/das2/segment/chr1&range=100:200 or features?segment=http://dalkescientific.com/human35v1/chr1&range=100: 200 ( or features?segment=urn:lsid:chr1&range=100:200 if that was the uri) If it's the local name then the client must first query all servers to get the mapping from global name to local name, and perform the translation itself. I propose that the client can query using the global name, and not need to do the mapping to the local name. In addition, a server may support both names in the query, since by using URIs we guarantee there are no accidental id collisions. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Wed Mar 15 15:34:06 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Wed, 15 Mar 2006 15:34:06 +0000 Subject: [DAS2] Shouldn't prefix be /das2? 
In-Reply-To: References: <200603151039.36405.lstein@cshl.edu> Message-ID: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk> > > genome.cbs.dtu.dk:9000/das/tmhmm/ > genome.cbs.dtu.dk:9000/das/netoglyc/ > das.ensembl.org/das/ens_sc1_ygpm/ > atgc.lirmm.fr/cgi-bin/das/MethDB/ > smart.embl.de/smart/das/smart/ > supfam.org/SUPERFAMILY/cgi-bin/das/up/ > mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ all these servers conform to the DAS 1 spec, which says that the second-to-last bit is "das" and the last bit is the "data source name". The registry contains a check for that. Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From td2 at sanger.ac.uk Wed Mar 15 15:16:25 2006 From: td2 at sanger.ac.uk (Thomas Down) Date: Wed, 15 Mar 2006 15:16:25 +0000 Subject: [DAS2] Shouldn't prefix be /das2? In-Reply-To: References: <200603151039.36405.lstein@cshl.edu> Message-ID: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> On 15 Mar 2006, at 14:46, Andrew Dalke wrote: > Plus, few of the DAS1 servers follow the DAS1 naming scheme. Here's > a list from Andreas' registry server. > > genome.cbs.dtu.dk:9000/das/tmhmm/ > genome.cbs.dtu.dk:9000/das/netoglyc/ > das.ensembl.org/das/ens_sc1_ygpm/ > atgc.lirmm.fr/cgi-bin/das/MethDB/ > smart.embl.de/smart/das/smart/ > supfam.org/SUPERFAMILY/cgi-bin/das/up/ > mips.gsf.de/cgi-bin/proj/biosapiens/das/saccharomyces_cerevisiae/ These all look fine to me -- but they're URLs for individual data sources, rather than complete server installations. Remove the last element and you'll get a server URL (e.g. genome.cbs.dtu.dk:9000/das/) which ends /das/ in all cases. The registry records datasources, not server installations.
In general, I'm not sure a server installation is a terribly
"interesting" object, since it's quite possible that one server
installation will host many datasources with little or no semantic
connection between them -- the only thing they have in common is that
they're hosted at the same site.

Thomas.

From lstein at cshl.edu Wed Mar 15 15:41:46 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:41:46 +0000
Subject: [DAS2] biopackages.net out of synch with spec?
In-Reply-To: <4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
References: <200603151046.43196.lstein@cshl.edu>
	<4d86b8f899632c8cd506297938fffd8a@dalkescientific.com>
Message-ID: <200603151541.47538.lstein@cshl.edu>

I'll use your hand-edited examples for testing.

Lincoln

On Wednesday 15 March 2006 14:32, Andrew Dalke wrote:
> Lincoln:
> > I just ran through the source request on biopackages.net and it is
> > returning something that is very different from the current spec
> > (CVS updated as of this morning UK time).
>
> The server isn't synched with any specific version of the spec. For
> example, if I make a features request from
>
>    http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000
>
> I get
>
>    <?xml version="1.0" standalone="no"?>
>    <!DOCTYPE FEATURELIST SYSTEM "http://www.biodas.org/dtd/das2feature.dtd">
>    <FEATURELIST
>       xmlns="http://www.biodas.org/ns/das/2.00"
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xml:base="http://das.biopackages.net/das/genome/yeast/S228C/feature">
>    ...
>
> As discussed a few weeks ago, we shouldn't be using standalone="no"
> since that says the document cannot be understood without consulting
> the DTD, which doesn't exist. And I don't want a DTD.
>
> Also, the namespace needs to be
> "http://www.biodas.org/ns/das/genome/2.00"
> (It's missing the 'genome') and the 'FEATURELIST' was replaced with
> 'FEATURES' a year ago.
> In the types request
>
>    <...
>       xmlns:xlink="http://www.w3.org/1999/xlink"
>       xml:base="http://das.biopackages.net/das/genome/yeast/S228C/type/">
>    <... name="ARS" definition="A sequence that can autonomously replicate, as a
>       plasmid, when transformed into a bacterial host.">
>    ...
>
> the commented out namespace declaration needs to be there, and the type
> id 'SO:ARS' needs to be escaped as it's treated as an identifier
> resolved with the "SO" protocol. Plus, until yesterday I didn't know
> about the 'name' or 'definition' attributes. These are now in the
> schema as 'title' and 'description'.
>
> There are a few other differences, like problems in the taxid and
> empty strings for timestamps. I hand-updated examples from Allen's
> server yesterday, in cvs under das/das2/draft3/ucla . I found some
> of these during the update, though others I pointed out about a
> year ago.
>
> Allen doesn't want to update the server until the spec is stable,
> for two reasons. First, he doesn't like the churn of doing work only
> to have to make more changes. Second, you're not the only one who says
>
> > It is much more fun for me to code to a working
> > server because I have the opportunity to watch my code run.
>
> and Allen's setup doesn't have the ability to implement two versions
> at the same time.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

-- Lincoln D.
Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From lstein at cshl.edu Wed Mar 15 15:49:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Wed, 15 Mar 2006 15:49:40 +0000
Subject: [DAS2] XML namespaces
In-Reply-To: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
Message-ID: <200603151549.41773.lstein@cshl.edu>

I have just finished adding XML namespace support to the early-version
Perl DAS2 client. BTW, if a namespace tag is reused in an inner scope
with a different namespace, as in

   <das:name xmlns:das="http://foo.bar.das">
     <das:first>Andrew</das:first>
     <das:middle xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
     <das:last>Dalke</das:last>
   </das:name>

I put middle into namespace http://addresses.com/address/2.0 and put
first and last into namespace http://foo.bar.das.

This is the correct scoping behavior, right?

Lincoln

On Wednesday 15 March 2006 15:05, Andrew Dalke wrote:
> I mentioned this yesterday but am doing it again as its own email.
> This is a quick tutorial on XML namespaces.
>
> The DAS spec uses XML namespaces. XML didn't start with namespaces.
> They were added later. Older parsers, like SAX 1.0, did not understand
> namespaces. Newer ones, like SAX 2.0, do.
>
> By default a document does not have a namespace. For example,
>
>    <person/>
>
> has no namespace.
>
> To declare a default namespace use the 'xmlns' attribute. All
> attributes which start 'xml' or are in the 'xml:' namespace are
> reserved.
>
>    <person xmlns="http://www.biodas.org/"/>
>
> This is the name 'person' in the namespace 'http://www.biodas.org/'.
> The namespace is an opaque identifier. It leverages URIs in part
> because it's much easier to guarantee uniqueness.
>
> The combination of (namespace, tag name) is unique. The tag
> name is also called the "local name".
>
> That's to distinguish it from a "qualified name", also called
> a "qname". These look like
>
>    <abc:person xmlns:abc="http://www.biodas.org/"/>
>
> This element has identical meaning to the previous element
> using the default namespace.
> Its qname is 'abc:person' but the full name is the tuple of
>
>    ("http://www.biodas.org/", "person")
>
> For notational convenience this is sometimes written in Clark
> notation, as
>
>    {http://www.biodas.org/}person
>
>    Element                                       Clark notation
>    <person>                                      person
>    <person xmlns="">                             {}person
>        ("empty namespace" is different than "no namespace")
>    <person xmlns="http://biodas.org/">           {http://biodas.org/}person
>    <abc:person xmlns:abc="http://biodas.org/">   {http://biodas.org/}person
>    <xyz:person xmlns:xyz="http://biodas.org/">   {http://biodas.org/}person
>
> The prefix used doesn't matter. Only the combination of
> (namespace, local name) is important. The Clark notation string
> captures that as a single string, which is much easier when doing
> comparisons.
>
> For example, if you try the dasypus verifier at
>
>    http://cgi.biodas.org:8080/verify?url=http://das.biopackages.net/das/genome/yeast/S228C/feature?inside=chr1/0:1000&doctype=features
>
> one of the output messages is
>
>    Expected element '{http://www.biodas.org/ns/das/genome/2.00}FEATURES'
>    but got '{http://www.biodas.org/ns/das/2.00}FEATURELIST' at byte 113,
>    line 3, column 2
>
> This shows the Clark name for the elements, indicating that the root
> element has a different namespace and local name from what Dasypus
> expects.
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From dalke at dalkescientific.com Wed Mar 15 15:53:11 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:53:11 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
References: <200603151039.36405.lstein@cshl.edu>
	<9370c22dda73ba356c665eca3838e6e6@sanger.ac.uk>
Message-ID: <0e5d03e0bc2f9ab791a891f058ca664b@dalkescientific.com>

Andreas (and Thomas)

>> genome.cbs.dtu.dk:9000/das/tmhmm/
>> genome.cbs.dtu.dk:9000/das/netoglyc/

> all these servers match the DAS 1 spec, which says that the second
> to last bit is "das" and the last bit is the "data source name".
> The registry contains a check for that.

Ahh, right. I misremembered and thought that "/das" had to be
immediately after the hostname. Looking now, there can be an arbitrary
prefix.

What I remembered was the servers at http://das.bcgsc.ca:8080/das
which don't have regular names. Then again, they have nearly
bit-rotted away.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Wed Mar 15 16:04:38 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 08:04:38 -0800
Subject: [DAS2] XML namespaces
In-Reply-To: <200603151549.41773.lstein@cshl.edu>
References: <9fc7158a198c7d5d62c5c9be2624f5f9@dalkescientific.com>
	<200603151549.41773.lstein@cshl.edu>
Message-ID: <2de39a4a831f6a06c408bdf31ef2a41f@dalkescientific.com>

Lincoln:
> BTW, if a namespace tag is reused in an inner scope with a different
> namespace, as in
>
>    <das:name xmlns:das="http://foo.bar.das">
>      <das:first>Andrew</das:first>
>      <das:middle xmlns:das="http://addresses.com/address/2.0">K.</das:middle>
>      <das:last>Dalke</das:last>
>    </das:name>
>
> I put middle into namespace http://addresses.com/address/2.0 and put
> first and last into namespace http://foo.bar.das.
>
> This is the correct scoping behavior, right?

Yes. I tested it with an XML processor and it says the following is
equivalent (after fixing a typo).

   <name xmlns="http://foo.bar.das">
     <first>Andrew</first>
     <middle xmlns="http://addresses.com/address/2.0">K.</middle>
     <last>Dalke</last>
   </name>

BTW, it should be "P." :)

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Wed Mar 15 15:58:15 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 07:58:15 -0800
Subject: [DAS2] Shouldn't prefix be /das2?
In-Reply-To: <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> References: <200603151039.36405.lstein@cshl.edu> <58C7DFD3-9B5A-4BC5-B863-49B2366D06A3@sanger.ac.uk> Message-ID: Thomas: > The registry records datasources, not server installations. In > general, I'm not sure a server installation is a terribly > "interesting" object, since it's quite possible that one server > installation will host many datasources with little or no semantic > connection between them -- the only thing they have in common is that > they're hosted at the same site. I agree. The only thing that's interesting about the server installation is knowing who is in charge when it goes down. :) That's found from the MAINTAINER element at the level of the sources document. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Wed Mar 15 16:37:51 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 15 Mar 2006 08:37:51 -0800 Subject: [DAS2] Notes from DAS/2 code sprint #2, day two, 14 Mar 2006 Message-ID: Notes from DAS/2 code sprint #2, day two, 14 Mar 2006 $Id: das2-teleconf-2006-03-14.txt,v 1.1 2006/03/15 16:47:50 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E. Sanger: Andreas Prlic, Thomas Down Dalke Scientific: Andrew Dalke (at Affy) UC Berkeley: Nomi Harris (at Affy) UCLA: Allen Day (at Affy) Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. 
Agenda:
----------
See Andrew's email. Here's a summary.
* segment ids
* coord systems and how to handle

[Gregg is out, Andrew is leading the teleconf.]

ap: ad proposed changes re: coords and capabilities i think is not
    really needed. the question is do annotation servers need to
    provide a link back to reference servers. If the link is apparent
    from ...
ad: summary: moving coord element inside capabilities element (one
    part of 4 things mentioned). the reason: coords and capabilities
    are tied together. They refer to the same thing. E.g., you need to
    know which of the segments are tied to which coords.
ap: annotation server doesn't need to; it can find the reference
    server by the coordinates.
ad: if you have local coords, and you want to point to a local server,
    how do you specify that this segment corresponds to these coords.
ap: you should have a reference server that speaks the coords you want
    to annotate.
td: if you have your own assembly you have your own coord system,
ad: yes, and i set up my own ref server for it.
ad: if I have mult coords, won't I have multiple segments? isn't there
    a 1:1 relationship between coords and segments?
ap: I think many:many.... wait
td: each segment is a member of one coord system, a coord system
    contains many segments.
ad: andreas has features, some annotated on scaffold, some annotated
    on chromosome. So, you need the ability to have two segments
    provided by server.
ap: coords should contain segment capabilities, i.e., the other way
    around.
ad: proposing to have a uri to id the coords, capability should have a
    field to say the coord uri is 'this'. mailed out the idea to have
    a unique identifier for coords. keep them separate now, have the
    ability ...
sc: optional?
ad: yes, only needed if you have mult coord systems.
ad: like features and feature type. segment is saying it's of that type
ad: will add optional id to the capability, so that you can figure out
    what the segments are. in proposal this am,
    1) timestamp to coord info (optional) -- use case: sort by most
       recent coord system for a given build.
    2) unique id for the coord
ap: this will be useful for searches as well. can request only results
    from a particular coord system. (see email discussion this am)
td: server alignment btwn human and mouse, you can say whether you are
    referencing human or mouse just by specifying coord system.
ad: also two different human assemblies.
ap: I have to leave now.

Topic: Segment identifiers email

td: segment had a name and url form id so that feature server doesn't
    have to give a concrete url for the seq of chr22, nice for
    lightweight server sans sequence. getting rid of ability to
    reference sequence by name instead of url breaks this. You need a
    concrete url if you just want to serve features on a sequence. You
    end up having to rewrite urls rather than saying this feature is
    attached to chr22 in xxx coord system.
ad: one thing gregg and I discussed, the fact that url is by itself an
    opaque id, you have to resolve it someway, http, or something else
    too. You can use any mechanism you want to resolve the name.
ad: in segments list, if you have your own local copy. Your segments
    section says my local copy is ...
td: you need a segments capability. I can't have a server that uses
    only features capabilities.
ad: if you have your own segments. if all your features are described
    using standard names/ids, no you don't need a segments capability.
td: ok, my assembly is human build 35, and feature lives on chr22.
ad: yes. every place you see optional alias attribute link back to
    primary id of segment, that id can be anything.
td: arbitrary string scoped by the coord system, which now has a uri
    id string.
ad: yes. and it's also globally unique, not scoped just by coord system.
td: I don't see what's wrong with ....
ad: we were discussing yesterday having diff names for the same
    chromosome. chrI vs chr1.
td: that can be addressed using aliases
ad: alias field provides a synonym table for what you map locally to a
    global id.
td: you're saying the global ids have to be universally unique even
    when taken out of the coord system
ad: yes. feat server providing feats from two diff coord systems, you
    need a way to distinguish one segment from another segment, in a
    global sense.
td: I don't totally understand cases involving mult coord systems. How
    do I find out which of three possible coord systems a given
    segment came from?
ad: ...
td: all clones in embl system. could be a lot.
ad: your client will have to know how to look up the right one. if you
    have one coord system that has all your clones, you have to do the
    look up anyway to know where to display the features from the
    various clones.
td: suppose looking for gene names: you get back a feature on clone
    AL19823. I want to start from that feature and build a meaningful
    display. So I need to work out what coord system this feature
    lives on. If my server speaks multiple coord systems, one for all
    embl accessions and gi ids, I have to test for membership in the
    set. My server could put the coord system id on each feature.
    Would be optional for servers only attached to one coord system.
ad: right. Andreas also wants coord uri part of feature filter. Could
    add it to the feature filter.
td: yes. give me all genes called xyz. Do you always want to limit to
    one coord system?
ad: I see your point. Having to search ...
ad: New thing called title for humans to read. Also proposed inside,
    overlaps, contains so they don't ...
td: to avoid a nastiness in query lang, I like that. Removes an issue
    that scares me about having urls in the query. pathological case:
    client has a good reason to retrieve features on parts of two
    sequences that have lots of features. e.g., all cutting sites for
    all restriction enzymes. Very high density. If the genome is made
    of 10kb clones, the user may want to get features that span clone
    boundaries. server may do lots of extra fetching that's not really
    necessary.
ad: it's the number of requests that's the issue, same amount of info.
    so it's an issue of network overhead. advantage: makes servers
    easier to implement since it eliminates searching partial regions.
    Some use cases exist, but can be done on the client side.
td: seems a shame to lose the capability, but not a huge loss. the
    alternative would be to say that you parse the query string left
    to right. overlaps=5000-10000; ... puts limits on how server parses.
ad: or we propose a new query interface
ad: this sounds like I should go ahead with segment ids.
ad: using uri vs id (internal link id vs link to something else)
td: seems to be enough impl-breaking changes, not a big argument
    either way.
ad: enough changes going on now, but probably won't change much more.
td: if you want to make a small change that's quick to implement, no
    objections. Also fine with using id, since all dom stuff about id
    refers to things marked id in the scheme, not attrib names.
    Changing to uri won't cause much effect.
nh: like a global replace.
ad: in general there's been lots of changes, want people to get
    clients/servers going.
ad: spec writing is going slow, would like to show examples that
    people can use.
nh: feature parsing can use canned examples.
aday: would prefer to have spec written, trouble with ambiguity
ad: you need to impl before you can figure out how to write it.
nh: server people need full spec, client can use examples
ad: previous slow going since lincoln had little time to work on it.
aday: would like a snapshot, version number. impl after last code
    sprint.
nh: don't have time to work on das after this. will just break when/if
    allen's server changes. This just happens when working on
    developing spec.
ad: the idea is to get code and examples up today.
td: waiting for spec to stabilize a bit.
ad: changes made this week won't have major impact on people's work in
    UK?
td: no.
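The local-to-global synonym mapping discussed in this call can be sketched as a simple lookup table. The chr1 URIs below are the example identifiers from the "search by segment id" message; the chr2 entry and the helper function names are hypothetical, added only for illustration:

```python
# Sketch of a server-side segment synonym table: each local segment
# name is an alias for a well-known global name. The chr2 entry and
# helper names are hypothetical examples, not from the spec.
SYNONYMS = {
    "http://localhost/das2/segment/chr1":
        "http://dalkescientific.com/human35v1/chr1",
    "http://localhost/das2/segment/chr2":
        "http://dalkescientific.com/human35v1/chr2",
}

def to_global(segment_uri):
    """Resolve a segment URI to its global form, if an alias is known."""
    return SYNONYMS.get(segment_uri, segment_uri)

def same_segment(a, b):
    """Two URIs name the same segment if their global forms match."""
    return to_global(a) == to_global(b)
```

With such a table a server can accept either name in a range query, since the URIs guarantee there are no accidental id collisions.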
nh: can you provide a changes document?
ad: those would be my emails. a pain.
nh: registry, I was surprised to find versioned sources in it. won't
    there be an explosion of org x versions x server. It provides
    convenience
td: as long as it's not thousands and thousands of data sources, it
    won't be a problem.
ad: 2k per server x 1000 servers = 2M
td: if it gets to the point where retrieving the whole registry is a
    problem, we could add a capability to restrict what you get.
nh: need human-friendly title for each data source. would be nice if
    that explained more to the person who was choosing that data
    source (e.g., date).
ad: Andreas' system (web-based) has a description.

Status reports
--------------
sc: adding more data to affy das server, working on building
    das2_server code recently checked into genoviz code base by gregg.
    Then will work on setting it up on a publicly accessible server at
    affy.
ee: will be working on style sheets in igb.
aday: spent time on setting up dev environment since laptop died
    yesterday.
bo: got food poisoning -- bad pizza?, was up till 4am.
td: not much das-related stuff yet.

From Steve_Chervitz at affymetrix.com Wed Mar 15 21:24:59 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 15 Mar 2006 13:24:59 -0800
Subject: [DAS2] New affymetrix das/2 development server
Message-ID:

Gregg's latest spec-compliant, but still development-grade, das/2
server is now publicly available via

   http://205.217.46.81:9091

It's currently serving annotations from the following assemblies:

- human hg16
- human hg17
- drosophila dm2

Send me requests for any other data sources that would help your
development efforts.

Example query to get back a das-source xml document:

   http://205.217.46.81:9091/das2/genome/sequence

Its compliance with the spec is steadily improving, on a daily if not
hourly basis during the code sprint.

Within IGB you can access this server from the DAS/2 servers tab under
'Affy-temp'.
You'll need the latest version of IGB from the CVS repository at
http://sf.net/projects/genoviz

Steve

From dalke at dalkescientific.com Wed Mar 15 21:25:53 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 13:25:53 -0800
Subject: [DAS2] on local and global ids
Message-ID: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>

The discussion today was on local segment identifiers vs. global
segment identifiers. I'm going to characterize them as "abstract"
vs. "concrete" identifiers. An abstract id has no default resolution
to a resource. A concrete one does.

The identifier "http://www.biodas.org/" is a concrete identifier
because it has a default resolver. "lsid:ncbi:human:35" is an abstract
identifier because it has no default resolver (though there are
resolvers for lsid, they are not default resolvers).

The global segment identifier may be a concrete identifier. It may
implement the segments interface. But who is in charge of that? Who
defines and maintains the service? If it goes down (power outage,
network cable cut) then what does the rest of the world do?

For the purposes of DAS it is better (IMO) that the global identifiers
be abstract, though they should be http URLs which are resolvable to
something human readable. (This is what the XML namespace elements do.)

Reference servers are concrete identifiers. They exist. They can
change (e.g., change technologies and change the URLs, say from
cgi-bin/*.pl to an in-process servlet). Now, they should be
long-lived, but that's not how life works.

Suppose someone wants to set up an annotation server, without setting
up a reference server. One solution is to point to an existing
reference server. In this case all the features are returned with
segments labeled as in the reference server. There's no problem.
Second, Andreas wants an abstract "COORDINATE" space id. This requires
a more complicated client because it must have other information to
figure out how to convert from the coordinate identifier into the
corresponding types. The answer that Andreas and others give is
"consult the registry". That is, look for other segments CAPABILITY
elements with the same coordinates id.

For that to happen there needs to be a way to associate a segments doc
with a coordinate system. For example, this is what the current spec
allows (almost - there's no example of it and I'm still trying to get
the schema working for it):

   ...

This makes a resolution scheme from an abstract coordinate identifier
into a concrete segments document identifier.

Why are there so many fields on the coordinates? It could be
normalized, so you fetch the coordinate id to get the information.
It's there to support searches. A goal has been that the top-level
sources document gives you everything you need to know about the
system. (Doesn't mean it's elegant. I won't talk about alternatives.
It's not important. There's at most an extra 150 or so bytes per
versioned source.)

The problem comes when a site wants a local reference server. These
segments have concrete local names.

DAS1 experience suggests that people almost always set up local
servers. They do not refer to a well-known server. There are good
reasons for doing this. If the local annotation server works then the
local reference server is almost certain to work. The well-known
server might not work.

Also, the configuration data is in the sources document. There's no
need to set up a registry server to resolve coordinates. There's no
configuration needed in the client to point to the appropriate
concrete identifier given an abstract URL.

My own experience has been that people do not read specifications. I
am an odd-ball. According to

   http://diveintomark.org/archives/2004/08/16/specs

I am an asshole. That's okay -- most people are morons.
> Morons, on the other hand, don't read specs until someone yells at
> them. Instead, they take a few examples that they find "in the wild"
> and write code that seems to work based on their limited sample. Soon
> after they ship, they inevitably get yelled at because their product
> is nowhere near conforming to the part of the spec that someone else
> happens to be using. Someone points them to the sentence in the spec
> that clearly spells out how horribly broken their software is, and
> they fix it.

Someone who wants to implement a DAS reference server will take the
data from somewhere and make up a local naming scheme. That's what
happened with DAS1. That's why Gregg was saying he maintains a synonym
table saying

   human
   1 = chr1 = Chromo1 = ChrI
   2 = chr2 = Chromo2 = ChrII

This will not change. People will write a server for local data and
point a DAS client at it. The client had better just work for the
simple case of viewing the data even though there is no coordinate
system -- it needs to, because people will work on systems with no
coordinate system. Sites will even write multiple in-house DAS servers
providing data, which work because everything refers to the same
in-house reference server.

It's only the first time that someone wants to merge in-house data
with external data that there's a problem. This might be several
months after setting up the server. At that point they do NOT want to
rewrite all the in-house servers to switch to a new naming scheme.

That's why the primary key for a paired annotation server and feature
must be a local name. That's what morons will use. Few will consult
some global registry to make things interoperable at the start.

> For example, some people posit the existence of what I will call the
> "angel" developer. "Angels" read specs closely, write code, and then
> thoroughly test it against the accompanying test suite before shipping
> their product.
Angels do not actually exist, but they are a useful
> fiction to make spec writers feel better about themselves.

Lincoln could come up with universal names for every coordinate system
that ever existed or will exist. But people will not consult it.
However, they will when there is a need to do that.

The need comes in when they want to import external data. At that
point they need a way to join between two different data sources.
They consult the spec and see that there's a "synonym" (or
"reference", or "global", or "master" or *whatever* name -- I went
with synonym because it doesn't imply that it's the better name.) The
local name + "segment/ChrI" is also known as
http://dalkescientific.com/yeast1/ChrI . Simple, and requires very
little change in the server code.

The only other change is to support the synonym name when doing
segment requests, as

   segment=http://dalkescientific.com/yeast1/ChrI

This is important because then clients can make range requests from
servers without having to download the segment document first. It's
also easy to implement, because it's a lookup table in the web server
interface, and not something which needs to be in the database proper.

Most people are morons. The spec as-is is written for that. It's not
written for angels. It allows post-facto patch-ups once people realize
they need a globally recognized name.

It does require smarter clients. They need to map from local name to
global name, through a translation table provided by the server. This
is fast and easy to implement. It's easier to implement than
consulting multiple registry servers and trying to figure out which is
appropriate. And the XML returned will be smaller.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Wed Mar 15 22:39:36 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 14:39:36 -0800
Subject: [DAS2] xml namespace uri
Message-ID:

Please use "http://biodas.org/documents/das2" for the XML element
namespace.
The two current servers (Allen's and Steve's) use
"http://www.biodas.org/ns/das/2.00" which is wrong according to the
spec; for the last 2 years it's been
"http://www.biodas.org/ns/das/genome/2.00". Since the servers need to
change anyway, might as well make it something a bit more readable,
and shorter. :)

I've checked all the current dasypus (validator) software into CVS,
btw, and updated all of the example xml (draft3/ucla/) to use the new
namespace.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 05:17:24 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Wed, 15 Mar 2006 21:17:24 -0800
Subject: [DAS2] query language description
Message-ID: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>

The query fields are

   name      | takes  | matches features ...
   ==========================================
   xid       | URI    | which have the given xid
   type      | URI    | with the given type or subtype (XX keep this one???)
   exacttype | URI    | with exactly the given type
   segment   | URI    | on the given segment
   overlaps  | region | which overlap the given region
   inside    | region | which are contained inside the given region (XX needed??)
   contains  | region | which contain the given region (XX needed??)
   name      | string | with a name or alias which matches the given string
   prop-*    | string | with the property "*" matching the given string

Queries are form-urlencoded requests. For example, if the features
query URL is 'http://biodas.org/features' and there is a segment named
'http://ncbi.org/human/Chr1' then the following is a request for all
the features on the first 10,000 bases of that segment. The query is for

   segment = 'http://ncbi.org/human/Chr1'
   overlaps = 0:10000

which is form-urlencoded as

   http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000

Multiple search terms with the same key are OR'ed together.
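As a sketch, the form-urlencoding above can be reproduced with the Python standard library; the base URL, segment URI, and range are the example values from this message. Note the terms are joined with ';' rather than the '&' that `urlencode` would use by default:

```python
# Build the example features query. Percent-encoding the values keeps
# the segment URI's own ':' and '/' out of the query syntax.
from urllib.parse import quote

base = "http://biodas.org/features"
terms = [
    ("segment", "http://ncbi.org/human/Chr1"),
    ("overlaps", "0:10000"),
]
# safe="" forces '/' and ':' to be encoded as %2F and %3A.
query = ";".join(key + "=" + quote(value, safe="") for key, value in terms)
url = base + "?" + query
```

The resulting `url` is exactly the encoded request shown above.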
The following searches for features containing the name or alias of
either BC048328 or BC015400

   http://biodas.org/features?name=BC048328;name=BC015400

Multiple search terms with different keys are AND'ed together, but
only after doing the OR search for each set of search terms with
identical keys. The following searches for features which have a name
or alias of BC048328 or BC015400 and which are on the segment
http://ncbi.org/human/Chr1

   http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect the
results.

If any part of a complex feature (that is, one with parents or parts)
matches a search term then all of the parents and parts are returned.
(XXX Gregg -- is this correct? XXX)

The fields which take URLs require exact matches. I think we decided
that there is no type inferencing done in the server; it's a client
side thing. In that case the 'type' field goes away. We can still keep
'exacttype'. The URI used for the matching is the type uri, and NOT
the ontology URI. (We don't have an ontology URI yet, and when we do
we can add an 'ontology' query.)

The segment URI must accept the local identifier. For interoperability
with other servers they must also accept the equivalent global
identifier, if there is one.

If range searches are given then one and only one segment is allowed.
Multiple segments may be given, but then ranges are not allowed.

The string searches support a simple search language.

   ABC    -- contains a word which exactly matches "ABC" (identity, not substring)
   *ABC   -- words ending in "ABC"
   ABC*   -- words starting with "ABC"
   *ABC*  -- words containing the substring "ABC"

If you want a field which exactly contains a '*' you're kinda out of
luck. The interpretation of whitespace in the query or in the search
string is implementation dependent. For that matter, the meaning of
"word" is implementation dependent. (Is *O'Malley* one word?
*Lethbridge-Stewart*?) When we looked into this last month at Sanger we verified that all the databases could handle %substring% searches, which was all that people there wanted. The Affy people want searches for exact word, prefix and suffix matches, as supported by the back-end databases.

XXX CORRECT ME XXX The 'name' search searches.... It used to search the 'name' attribute and the 'alias' fields. There is no 'name' now. I moved it to 'title'. I think I did the wrong thing; it should be 'name', but it's a name meant for people, not computers. Some features (sub-parts) don't have human-readable names so this field must be optional.

The "prop-*" search is a search of the feature's property elements. To do a string search for all 'membrane' cellular components, construct the query key by taking the string "prop-" and appending the property key text ("cellular_component"). The query value is the text to search for:

  prop-cellular_component=membrane

To search for any cellular_component containing the substring "mem":

  prop-cellular_component=*mem*

The rules for multiple searches with the same key also apply to the prop-* searches. To search for all 'membrane' or 'nuclear' cellular components, use two 'prop-cellular_component' terms, as

  http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear

The range searches are defined with explicit start and end coordinates. The range syntax is in the form "start:end", for example, "1:9". Let 'min' be the smallest coordinate for a feature on a given segment and 'max' be one larger than the largest coordinate. These are the lower and upper bounds for the feature. An 'overlaps' search matches if and only if

  min < end AND max > start

XXX For GREG XXX What do 'inside' and 'contains' do? Can't we just get away with 'excludes', which is the complement of 'overlaps'?
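The overlap rule just stated amounts to a one-line predicate. A minimal sketch (the function and parameter names are mine, not from the spec; 'max' is one past the feature's largest coordinate, so the intervals are half-open):

```python
def overlaps(feature_min, feature_max, start, end):
    # With half-open intervals [min, max) and [start, end), two ranges
    # share at least one base iff min < end AND max > start.
    return feature_min < end and feature_max > start

overlaps(0, 100, 50, 150)   # True: bases 50..99 are shared
overlaps(0, 100, 100, 200)  # False: merely adjacent, no shared base
```

The half-open convention is what makes adjacency (max == start) count as a non-overlap with a strict comparison and no off-by-one adjustments.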
Searches are done as:

  Step 0) specify the segment
  Step 1) do all the includes (if none, match all features on segment)
  Step 2) do all the excludes, inverted (like an includes search)
  Step 3) only return features which are in Step 1 but not in Step 2
  Step 4) ...
  Step 5) Profit!

I think this will support your smart code, and it's easy enough to implement. Everyone but you was planning to use 'overlaps'. Only you wanted to use 'inside'. Anyone want to use 'contains'?

Andrew
dalke at dalkescientific.com

From td2 at sanger.ac.uk Thu Mar 16 09:24:03 2006
From: td2 at sanger.ac.uk (Thomas Down)
Date: Thu, 16 Mar 2006 09:24:03 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com>
Message-ID: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>

On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
>
> The problem comes when a site wants a local reference server.
> These segments have concrete local names.
>
> DAS1 experience suggests that people almost always set up local
> servers. They do not refer to a well-known server.

I'm not sure that DAS1 experience is a good model for this. It's true that people didn't always point to well-known reference servers, but I think this has more to do with the fact that people didn't know which server to point to. Some people did set up their own reference servers. Many didn't, and many of those didn't give a valid MAPMASTER URL at all. This situation didn't actually cause too much trouble since a lot of these users just wanted to add a track to Ensembl -- which doesn't care about MAPMASTER URLs and just trusts the user to add tracks that live in an appropriate coordinate system.

I'd still argue that the majority -- probably the vast majority -- of people setting up DAS servers really just want to make an assertion like "I'm annotating build NCBI35 of the human genome" and be done with it.
That's what the coordinate system stuff in DAS/2 is for. If this is documented properly I don't think we'll see many "end-user" sites setting up their own reference servers unless a) they want an internal mirror of a well-known server purely for performance/bandwidth reasons or b) they want to annotate an unpublished/new/whatever genome assembly.

(Actually, some of the "annotation providers set up their own reference servers" stuff might be my fault -- early versions of Dazzle were pretty strict about requiring a valid [and functional!] MAPMASTER for every datasource, so this pushed people towards setting up reference servers.)

Thomas.

From lstein at cshl.edu Thu Mar 16 11:03:49 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:03:49 +0000
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161103.50323.lstein@cshl.edu>

I think it will help considerably to have a document that lists the valid sequence IDs for popular annotation targets. I've spoken with Ewan on this, and Ensembl will generate a list of IDs for all vertebrate builds. I'll take responsibility for creating IDs for budding yeast, two nematodes and 12 flies.

Lincoln

On Thursday 16 March 2006 09:24, Thomas Down wrote:
> On 15 Mar 2006, at 21:25, Andrew Dalke wrote:
> > The problem comes when a site wants a local reference server.
> > These segments have concrete local names.
> >
> > DAS1 experience suggests that people almost always set up local
> > servers. They do not refer to a well-known server.
>
> I'm not sure that DAS1 experience is a good model for this. It's
> true that people didn't always point to well-known reference servers,
> but I think this has more to do with the fact that people didn't know
> which server to point to. Some people did set up their own reference
> servers.
> [... rest of Thomas's message quoted in full; snipped ...]

-- Lincoln D.
Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From lstein at cshl.edu Thu Mar 16 11:06:38 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Thu, 16 Mar 2006 11:06:38 +0000
Subject: [DAS2] Spec freeze
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <200603161106.39074.lstein@cshl.edu>

Hi,

I just spoke with Thomas and Andreas on this, and all three of us are experiencing difficulty coding to a changing spec. In my opinion the spec is really good right now, and issues such as whether to use "uri" or "id" as attribute names are not germane. Can I propose that we declare a three-month spec freeze starting at midnight tonight (GMT)?

Lincoln

-- Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From dalke at dalkescientific.com Thu Mar 16 15:38:00 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 07:38:00 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk>
Message-ID: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>

Thomas:
> I'm not sure that DAS1 experience is a good model for this. It's true
> that people didn't always point to well-known reference servers, but I
> think this has more to do with the fact that people didn't know which
> server to point to.

I think I said there are two cases; there's actually several:

 1. the sources document states a well-known COORDINATES
    and makes no links to segments
 2.
the sources document refers to a well-known segments server
    ("the" reference server) and no COORDINATES
 3. the source document has a segments document, and each segment
    listed uses URIs from "the" reference server
 4. the server implements its own coordinates server, with
    new segment ids
 5. When uploading a track to Ensembl there's no need to have
    either COORDINATES or segments -- the upload server can
    verify for itself that the upload uses the right ids.

The *only* concern is with #4. Everything else uses the well-known global identifier for segments.

> I'd still argue that the majority -- probably the vast majority -- of
> people setting up DAS servers really just want to make an assertion
> like "I'm annotating build NCBI35 of the human genome" and be done
> with it.

I'm fine with that. There are two ways to do it: #1 and #2 above. In theory only one of those is needed. The document can point to "the" reference server for NCBI 35. In practice that's not sufficient because there is no authoritative NCBI 35 server. Hence COORDINATES provides an abstract global identifier describing the reference server.

> That's what the coordinate system stuff in DAS/2 is for. If this is
> documented properly I don't think we'll see many "end-user" sites
> setting up their own reference servers unless a) they want an internal
> mirror of a well-known server purely for performance/bandwidth reasons
> or b) they want to annotate an unpublished/new/whatever genome
> assembly.

A philosophical comment. I'm a distributed, self-organizing kinda guy. I don't think single-root centralized systems work well when there are many different groups involved.

I think many people will use the registry server, but not all. I think there will be public DAS servers which aren't in the registry. I know there will be in-house DAS servers which aren't. I'm just about certain that some sites will have local copies of the primary data. They do for GenBank, for PDB, for SWISS-PROT, for EnsEMBL.
Why not for DAS?

That said, here's a couple of questions for you to answer:

 a) When connecting to a new versioned source containing only COORDINATES data, what should the client do to get the list of segments, sizes, and primary sequence?

I can think of several answers. My answer is that the versioned source should state the preferred reference server, and unless otherwise configured a client should use that reference server and only that reference server.

Yes, all the reference servers for that coordinate system are supposed to return the same results. But that's only if they are available. There are performance issues too, like low bandwidth or hosting the server on a slow machine. The DAS client shouldn't round-robin through the list until it finds one which works, because that could take several minutes to time out on a single server, with another 10 to try.

Yes, a client can be configured and told "for coordinate system A use reference server Z". But that's a user configuration.

 b) If there is a local mirror of some reference server, how should the local DAS clients be made aware of it? (And should this be a supportable configuration? I think so.)

I'm pretty sure that most DAS clients won't be configurable to look for local servers instead of global ones. Even if they are, I'm pretty sure each will have a different way to do so. Apollo and Bioperl will use different mechanisms.

I have no good answer for this. It sounds like your answer is "people won't have local copies." I think they will.

Ideas:

 - have a rewriting registry server which does a rewrite of the information from the other servers. But this doesn't work because the feature result from the remote server (in my scheme) is given using its local segment names. There's no way to go from that local name to the appropriate mirror reference server. This suggests that the results really do need to be given through global ids, with no support for local ones.
The segments result optionally provides a way to resolve a global name through a local resource.

 - set up an HTTP proxy service for DAS requests which transparently detects, translates and redirects to the appropriate local resource. Cute, but not likely to be done in real life.

 c) A group has been working on a new genome/assembly. The data is annotated on local machines using DAS and DAS writeback. Finally it's published. Do they need to rewrite all their segment identifiers to use the newly defined global ones?

As there are only a few places where the segment identifier is used, and it's an interface layer, I think the conversion is easy. But it is a flag-day event, which means people don't want to do it. Instead, it's more likely that local people will set up a synonym table to help with the conversion.

There are perhaps a dozen groups which might do this and they all have competent people. This should not be a problem.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 16:06:26 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:06:26 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <200603161103.50323.lstein@cshl.edu>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <200603161103.50323.lstein@cshl.edu>
Message-ID:

Lincoln:
> I think it will help considerably to have a document that lists the
> valid sequence IDs for popular annotation targets. I've spoken with
> Ewan on this, and Ensembl will generate a list of IDs for all
> vertebrate builds. I'll take responsibility for creating IDs for
> budding yeast, two nematodes and 12 flies.

What should people use if these aren't defined? Like now? If everyone must use the same well-defined global id for the features response, then doesn't that mean we can't have any DAS servers until this document is made?
Is the general requirement that the first person to make a server for a given build/genome/etc. is the one who gets to define the global ids? Or is it Andreas at Sanger who defines the names?

Suppose one group in California starts defining names for, say, the barley genome. Another group in, say, Germany is also working on the barley genome. They hate each other's guts and don't work together, so they make their own names. The names refer to the same thing because it was a group in Japan which produced the genome. Do we wait for an alignment service, or an identity service, before people can merge data from these two groups?

Maybe we can solve all this by having an identity mapper format. And defer defining that format until there is a problem.

There is no perfect solution. This is a sociological problem. Gregg's current client, I think, used hard-coded knowledge about the mapping between the two current servers. Then again, his code already supports a synonym table.

Andrew
dalke at dalkescientific.com

From gilmanb at pantherinformatics.com Thu Mar 16 15:52:51 2006
From: gilmanb at pantherinformatics.com (Brian Gilman)
Date: Thu, 16 Mar 2006 10:52:51 -0500
Subject: [DAS2] on local and global ids
In-Reply-To: <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com>
Message-ID: <441989D3.90202@pantherinformatics.com>

Hey Guys,

Where's the latest spec and use case document? Sorry if this is a super dumb question. I couldn't find it on the website.

Best,
-B

Andrew Dalke wrote:
>Thomas:
>>I'm not sure that DAS1 experience is a good model for this. It's true
>>that people didn't always point to well-known reference servers, but I
>>think this has more to do with the fact that people didn't know which
>>server to point to.
>
>I think I said there are two cases; there's actually several
>
> 1.
> [... rest of Andrew's message quoted in full; snipped ...]
>
> Andrew
> dalke at dalkescientific.com
>
>_______________________________________________
>DAS2 mailing list
>DAS2 at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Thu Mar 16 16:33:58 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 08:33:58 -0800
Subject: [DAS2] on local and global ids
In-Reply-To: <441989D3.90202@pantherinformatics.com>
References: <4319dfa55d78ca849bcc8a231daf679c@dalkescientific.com> <9F3DD0BD-25B5-49BC-839A-B6134B17E9B0@sanger.ac.uk> <41d2d7197710e14d4ba898ae758bf280@dalkescientific.com> <441989D3.90202@pantherinformatics.com>
Message-ID: <24b985c0229970562a9e2612f00f2da5@dalkescientific.com>

Brian:
> Where's the latest spec and use case document? Sorry if this is a
> super dumb question. I couldn't find it on the website.

CVS for the spec. The history is:

draft 1 - written by Lincoln, freeze for summer last year. This is the one with HTML, etc. and is on the web site.

draft 2 - written by me in January. In CVS under das/das2/new_spec.txt with examples under das/das2/scratch. This was the version for the sprint last month.

draft 3 - under development. I rewrote the beginning of it because no one liked the pedantic pedagogical style it used. This draft starts with examples. The incomplete version, as of Monday morning, is das/das2/draft3/spec.txt

However, I am slow at writing spec text, especially new text. Instead of working on it more I put example output files in das/das2/draft3/ucla/ starting with 'sources.xml' in that directory.

As for use cases, the email you saw from me a couple of days ago is the only thing even close to formal.

Andrew
dalke at dalkescientific.com

From ap3 at sanger.ac.uk Thu Mar 16 17:05:10 2006
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Thu, 16 Mar 2006 17:05:10 +0000
Subject: [DAS2] sources responses
Message-ID: <355af8b441fefe8690a9e78de55fc2f9@sanger.ac.uk>

Hi!
the (toy) sources responses at

  http://www.spice-3d.org/dasregistry/das1/sources/
  http://www.spice-3d.org/dasregistry/das2/sources/

are now updated to the latest spec and validate with Andrew's validator at http://cgi.biodas.org:8080/

Cheers,
Andreas

-----------------------------------------------------------------------
Andreas Prlic
Wellcome Trust Sanger Institute
Hinxton, Cambridge CB10 1SA, UK
+44 (0) 1223 49 6891

From Steve_Chervitz at affymetrix.com Thu Mar 16 20:37:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:37:16 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day three, 15 Mar 2006
Message-ID:

Notes from DAS/2 code sprint #2, day three, 15 Mar 2006
$Id: das2-teleconf-2006-03-15.txt,v 1.1 2006/03/16 20:45:35 sac Exp $

Note taker: Steve Chervitz
Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Thomas Down, Andreas Prlic
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.

[Notetaker: joining 10 min into the discussion]

ls: how does synonym business work?
ad: if server has access to data...
ls: we ask server for the global id, uses same global id for segments, and uses same global id for the sequence.
gh: to do this in the capabilities for annot server, the global id for segments query points to reference server.
ls: if the local machine current server has sequence capabilities, then it passes global id for segments to current server and it gets the sequence. if it doesn't have that capability, then we need to figure out a way for it to get the sequence. the easiest way to do that would be to resolve that url and fetch it. I'm open to any suggestion. I don't see how this uri/synonym is getting us any closer to being able to find the server where sequence can be fetched. The synonym isn't always a fetchable thing.
ad: syn is a global id
ad: look at the uri for the segment and fetch it from there
ls: could be a remote url.
gh: segments query is only thing that gives segment url. segments capabilities for the annot server should point
ls: break apart segments into: id=a string, then have an attribute seq_url, when fetched returns the seq. returns the bases.
ad: is that what's there already?
ls: no, uri is an id
ad: every url is an id, but it's up to whim of the server
ls: i don't want people to think it's for an id. want an agreed-upon uri identifier, then optionally have a url. turn synonym into uri, turn uri into resolver. make uri required, bases not required.
ad: additional constraint is 'agreed upon'. what about a group starts a new sequencing project. There is no globally known uri for it yet.
ls: they just create their own ids
td: the natural authority is the creator of the assembly.
gh: ncbi won't do it. they don't have a das server, unlikely to.
ls: can point to genome assembly. can create a url that will return bases from ncbi in a supported format. this approach will disentangle issue of resolvable vs non-resolvable, local vs non-local segment ids and how to get segment dna.
gh: I think this will work.
ad: 'this' changing key names?
ls: key semantics: uri is required, global identifier; sequence is an optional pointer
gh: you say that for feat xml, the id for seq will be the globally agreed on id.
ls: yes
ad: if you don't have a local copy, if you have ability to map global identifiers, then you know what it is from the coordinates. there are two ways to specify coordinates: coordinates and segments
ad: if you just need the segments and some identifier. only when you need to do an overlay with someone else that you need the coords.
gh: no, coords don't say anything about ids of coord (?)
gh: if we do it the way lincoln proposed, then the logical way to relate those is that the segments capabilities points to ref server.
ad: when feat returns a location is it in global or local space?
gh: lincoln - global space
ls: every annot server will know length of its landmarks (chrms). some people will not want to be served dna, they will point somewhere else where to get the dna. There will be many places to get dna for a given global id, they choose one they like.
ls: feature locations are given in global id
ad: this changes the way it's been working. xml:base issues
ls: I know.
gh: if base of sequence and base of features are different, the xml will get bigger.
ls: so an argument for having local ids is so you can make location string shorter.
gh: yes.
ls: probably not worth it
ad: also makes it easier to set up a basic server. if you want to overlay them, yes you do.
ls: you can always set up a local server if you
gh: segments response local and global id as we talked about yesterday (which one feature locatn is relative to)
gh: if the only way to overlay for a client to know things are in the same coord system is segid=xxxx and globalid=yyyy, how much harder is it for server to use global ids.
ls: server can have configuration file to know where its global ids are coming from
aday: would need to think about it more.
ad: who will set up these identifiers (yeast, human)
ls: I'll do it for model org databases, I will specify segments, and their dna fetchers and will look up their lengths.
gh: versions?
ls: most recent. community can then keep it up to date.
I bet ensembl will be happy to generate this file automatically with every build (for vertebrates)
ad: local id uri, and a bunch of synonyms. People will set up own server not referencing a global system.
ls: then client would do a closure over all systems. imagine three servers:
  server-a says here is my segment
  server-b says it can be b or c
  server-c says it can be c or a
so you have to do a join over all servers
gh: not encourage people to do that with local seq ids, encourage people to use. need a global referencing system to say this uri is same as that uri.
ad: bad logic for the web. If one is wrong, could be a problem
td: (proposal - based on genomic coord alignments)
ad: that says only alignable things are the same.
ad: don't think it will work, they will already have local servers
gh: what about 'the stick': people who want to register their server with central registry can only do so if they use global ids for their segments.
ls, td: fine
ad: if they've been working for a while in house, they would have a big effort to retrofit their system to comply. just won't do.
ls: in draft 3, where's assembly info?
ad: same as before. ask segments for agp format. draft not complete.
gh: the thing that ids which assembly you're on is the coordinates element (authority, taxonomy, ...)
ls: authority is a recognized, globally unique organization. Should it be a uri?
ad: authority and version is human visible so people can search by it.
ls: fine.
gh: can invoke the 'stick' idea here: if you're trying to register something on same genome assembly, then registry can check your segments to verify they are agreed upon.
ls: taxon, source, authority, version all must match
ad: also an id
ap: we discussed in email
ad: the only stuff that is complete is in the ucla subdir.
ls: the examples are definitive
ad: yes, unless we change things today.
ls: what if taxon, source, version match but uri doesn't? registry gets submission.
makes a segments request on the submitter; if it gets a list of the same segment identifiers, it accepts it. what if it gets a subset?
gh: ok
ls: superset is not ok.
aday: why?
gh: if you allow subset and superset, you can have everything.
aday: use case: bacteria with an extra plasmid identifier.
nh: signing off. will be at affy tomorrow.
ls: you would have to create your own coord system.
gh: could argue with the maintainer to add it.
ls: can you have multiple coordinates in a given assembly?
aday: proposal: make coords an attribute of the segment. could keep your segment references local.
ls: we shouldn't give people ways to create new names. human chr1 ncbi build 35 should be something that everybody can agree on.
gh: then we wouldn't allow allen's use case where someone wants a superset of what's in the reference?
ls: add a new coord tag to the source version entry that says I'm creating a superset consisting of coords from ref 1, 2, 3; any of these can be a new namespace that I set up.
gh: how do you know which ones come from where? right now there's no way to get the coord for a segment.
ad: can as of yesterday afternoon.
ls: to indicate which segments come from which auth, put the coord id into the segments tag.
aday: thank you!
ad: alternative proposal - multiple segments
use case: when you have scaffolds or chromosomes, or mouse and yeast
ls: say you want mouse scaffolds + chrms, and human chrms
three diff coords tags in the sources document, each one gives auth, taxon, etc. when the client goes to get segments, it will get human chromosomes, mouse chrms, and mouse scaffolds in one big list; each will point back to the coord it got in features requests.
gh: knowing the coordinates doesn't tell you the global id for a segment
aday: ok.
gh: multiple segments elements vs mult coords in a segment both work for me.
ad: what does a client do
gh: ...
ls: three types of entry points, hu chrms, mo chrms, mo scaffolds; now tell me what you want to start browsing. human readable.
scaffold on mouse with name xxx from two
ad: displaying all together vs one or the other or the other.
ee: affymetrix use case in igb. [probe
gh: doesn't seem to matter
aday: the tag values are easier to implement
td: not a big difference to me
gh: drawing on whiteboard...
ls: let's rename das to distributed annotation research network. then we can say "darn1, darn2"!
ad: gregg's request for a search to find everything identical (start and end are the same)
td: if you have contained and inside, you can do identical with an and operation.
ls: doesn't make the server any more complicated; for completeness you may want to do that.
ad: how about includes 1-5000 and excludes ... some of this is aesthetic.
ls: overlaps, contains, contained-in have good use cases. exact match - maybe searching for curated exons that exactly match predicted ones.
[Lincoln has to leave.]
gh: drawing options for segments and coordinate systems. [whether you put a coords tag per segment, or one capabilities element for each coord system]
allen's approach - one query with filter, or multiple fetches
aday: uniprot example
gh: separate segments query.
ap: can we leave it out and add later if necessary?
ad: these are things that haven't been discussed in the last two years
aday: uri
ad: xml namespace issue - what do we call it (see email)
gh: you pick it
ad: required syntax for entry points /das/source
gh: recommended, but not required
ad: lincoln was the only one who felt strongly about it being required, and he's not here.
gh: feature xml: every feature can have multiple locations
features can represent alignments (collapsed alignment tag into feature tag)
td: like it
gh: naive user - given a feat with multiple locations on the genome, represent as multiple locations, or parent child relations?
td: don't see it as a problem.
using parent-child you have things to say about child features specific to them
gh: genscan prediction, a problem: one server can serve them up as parent child or as multiple locations on the parent
four child exons in one case, four diff locations in the other case
problem is with feat filters. if you do an overlaps query and any children meet the condition, you have to return the parent as well and its parent on up. agreed?
ad: yes
gh: works fine for parent child, but for the multiple location situation, if an inside query fully contains only two exons, do you return the parent?
td: I'd assume the inside query would return both. as long as one exon is inside the region, the parent is returned. define inside as applying to any level.
gh: so even though the transcript is not inside, you still return it?
td: using the get-parent-if-get-children rule
gh: the rule must apply to all of them, so you don't get the transcript since it doesn't meet the inside condition.
aday: multiple locations makes sense - just aligned multiple times. human alu feature, 100,000s of them; do you want to create a single feature, or just a single identifier and put it in many different locations?
ee: that is for alignments, not parent-child relationships
aday: you consider location as an attribute of the object..
ee: I agree. alu is only one object, but the exon-transcript are different
ad: would someone want to annotate the separate exons differently?
aday: you would split it off
ad: eg blast alignment, hsp is part of the conceptual alignment.
gh: in bioperl, some people will go one path, some go the other path, so we need to figure out how to deal with it. feat filters is clear for parent child relationships.
aday: inside and overlaps
gh: if your overlap query only grazes one child, you return the parent. this is the only one I'm certain about.
gh: we haven't specified that the child is within the bounds of the parent. with insides, we have a difference of opinion. one exon is within, do you return it?
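The return-the-parent rule Gregg describes (a matching child pulls in its parent, and its parent on up) can be sketched as a closure over ancestors. This is a hypothetical illustration; the `parent_of` mapping and the feature ids are mine, not from the spec:

```python
def expand_hits(hits, parent_of):
    # For every feature whose location matched the filter, also pull
    # in its parent, grandparent, and so on up the hierarchy.
    # parent_of maps feature id -> parent id (None at the top).
    result = set()
    for fid in hits:
        while fid is not None and fid not in result:
            result.add(fid)
            fid = parent_of.get(fid)
    return result

# Illustrative hierarchy: exon -> transcript -> gene.
parent_of = {"exon1": "tx1", "tx1": "gene1", "gene1": None}
print(sorted(expand_hits({"exon1"}, parent_of)))  # ['exon1', 'gene1', 'tx1']
```

Whether the same closure should also run downward (returning all children once a parent is in the result set) is exactly the open question in the discussion above.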
ad: most clients will be doing overlaps; you are the only one doing insides. what do you want?
gh: the multiple locations muddies the issue. if the parent child rule is you only return it if the parent is inside (and recursive parents), I've already optimized for that. For multiple locations, I can catch that and handle it the way I want; the behaviour of multiple locations will be diff from parent child.
td: for me, the overlaps is the most important thing. Andreas just gets everything.
ad: can we delegate to gregg here for what to do in the case of inside?
[A] gregg will write up a description for the inside query and multiple locations

Status reports
-----------------
gh: updating server. overlaps, insides, types, and each
good news: latest genome assembly on human on the affy server overlayed with allen's server. using hardcoded knowledge in igb for the assembly id, not coordinates yet.
with andrew: making sure clients can understand any variants of namespace usage in the xml. get the client to use more capabilities like links
ad: example data set together, updated schema to latest spec, but forgot the cigar thing. update validator to use the most recent version of the rnc schemas.
gh: even if your server isn't public you can cut and paste into the validator at http://cgi.biodas.org:8080
aday: biopackages up to date with version 200 of the spec file. issues for nomi and gregg. off by one error.
bo: small code refactor in the das server. testing that today.
ee: nothing das related yet, but will. implementing style sheets to get colors for features.
ap: registry ui for upload of a das/2 source. coding for that
gh: what about the registry rejecting segment ids if they don't match the standard ids for that coord system. sound good to you?
ap: basically yes.
td: not done a great deal
gh: Nomi has been here working on the apollo client. we'll hear from her tomorrow.
-----------------------
post teleconf discussion re: using global identifiers for uri
[Notetaker: just a few morsels were captured here.]
ad: most folks i work with get something going locally, then after it's going, hook it up with the rest of the world, integrate with other people. they don't want to revamp their work in order to do that.
gh: slightly in favor, with andrew
ad: get what we have now. they are still uri's so it's just an interpretation. will change attributes to be 'uri' and 'reference_uri'
gh: how does it get the length of segments?
ad: good idea to have coordinates and segments in the document. add your own track to ensembl, you don't need to give it a segments, just specify coordinates.
gh: seems like it will encourage servers that can only work with particular clients.
ad: what about getting rid of coordinates? just needed by Andreas for the registry.

From Steve_Chervitz at affymetrix.com Thu Mar 16 20:38:13 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Thu, 16 Mar 2006 12:38:13 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day four, 16 Mar 2006
Message-ID:

Notes from DAS/2 code sprint #2, day four, 16 Mar 2006
$Id: das2-teleconf-2006-03-16.txt,v 1.1 2006/03/16 20:45:48 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke (at Affy)
  Sanger: Andreas Prlic
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.
Status reports
---------------
nh: apollo work, reading the registry, saving capabilities. modifications to code that was based on the prototype das adaptor. Generally lots of under the hood work to bring it up to spec.
bo: diffed functionality between allen's biopackages.net server and andrew's sample xml. Updated templates in allen's das server to match andrew's sample xml.
ad: worked on the validation server, all stuff is in cvs. the http://cgi.openbio.org:8080 server is built off cvs, just check out and rebuild.
gh: worked on bringing the affy das2 server and client up to the current spec based on whatever the rnc documents (schema docs) say about the xml. no chance to read andrew's email on query syntax, will incorporate that today.
sc: got the latest version of gregg's das/2 server up at affy, serving hg17, hg16, dm2. Updated the code that the das1 server is using based on the latest genoviz jars. Getting some errors when loading data for new affy arrays. Investigating.
aday: minor bug fixes for spec v200. exporting assay data as different views. ucsc browser can viz expression data out of the das server in bed format. das viewer can view as egr format. working on a single chip at a time.
ls: here's a great use case for you: there's a cshl fellow creating dna spectrographs of oligo frequencies presented as audiographs. can really tell diffs between coding vs non-coding, CpG triplets, microsatellite harmonics. big matrices of floating point data tied to the genome. consider this a challenge to das to serve this up. my postdoc sheldon mckay is serving this up: gives you a heatmap back given a genomic region. new glyph for spectrographic data
aday: the netCDF format is good for this, but clients out there don't visualize it.
gh: would like to support netCDF in igb. not sure if this is the default way to represent quantitative data for das.
[A] allen will send lincoln a pointer to netCDF.
aday: netCDF is great for cross-lang, cross-platform support.
gh: people are pushing wiggle format to ucsc, so we don't want to restrict to just netCDF.
aday: my refactor yesterday allows treatment of these as templates.
gh: how to do this via a region query in das?
ls: feature query, tag says here comes binary data, each column corresponds to a base (or maybe a scaling factor to indicate # of bp per column). tag says here comes binary quantitative data, scale is 1:1.
gh: a better way is to use the alternative content format stuff (already in the spec for types)
ls: if you do a feat request and don't filter by type, you'll get a mix of binary and non binary.
aday: not in the genome domain; genome/sequence then fetch to the assay service to get quant data. then do an intersection to find the overlap. performance goes out the window if you make the query too complex. fine to do just two fetches.
ls: how to indicate the scale for numerical data?
aday: good question. units are not encoded now.
ls: spectrographic data: one value per window, where a window is 100 bp
aday: so two diff units, window size, amplitude value and frequency, and that's in four channels for the bases. we're representing as 4 matrices.
aday: one matrix per channel. many formats don't support n-dimensional data, only 2d at most.
ls: in das1 did a base64 encoded string in the notes. It worked.
gh: we can't require all clients to know how to interpret it. This is why we have the alt content functionality...
[A] das should support dense numeric data across regions, format specified by the existing alternative format mechanism

Topic: Spec Freeze
-------------------
ls: can we talk about freezing the spec?
ad: what good will it do?
ls: allow us to code to a fixed spec. you freeze the spec, people write code for a defined period of time, during that time we compare notes, then make changes, freeze, and repeat.
ad: concerned there hasn't been enough work since the changes in jan/feb.
ls: now that i'm 'on the other side of the fence' of spec writing, i'd like to see it not change, and have time to make an informed view of what its strengths and weaknesses are.
ad: haven't gotten feedback about my questions until the code sprints. two months ago, only now being addressed.
ls: these issues don't become pressing until we start implementing. this is why we do code sprints.
ad: worry because there's been no extensive data modeling for features.
ls: can do a 1 month freeze
gh: comfortable with a 1 month freeze of the schemas as they are in the rnc's now. issues will come up.
ls: announce on biodas.org - march 18th das/2 is frozen for 1 month.
gh: we'll have to live with ambiguity in how a server does certain things.
ls: hence the time limited 'trial' freeze.
ad: would have liked people to write code from last feb so I could get feedback.
ls: you very much improved the spec. grateful for what you've done. I wasn't getting feedback when I was writing either.
gh: the validation website is great for implementers, rather than having to read a spec document every day.
ad: schemas aren't going to change after today (pm). would like to clear some things up about the filter language, today?
ls: most urgent freeze
[A] spec will freeze as of end of today (3/16/06, PST) for one month.

Topic: Feature filters
----------------------
ad: feature filters is most important, and how do we define global names? schema is a simple change - which is req'd and which is optional, but for impls it makes a big diff.
ls: global is req'd and local is optional.
ad: who comes up with global names
ls: first person to do it has naming rights. people have been able to do it for the ensembl service.
ad: I need documented names
gh: it means you don't know whether two names are the same thing until this document comes out.
ls: filter language?
ad: gregg needs inside and contains - type and exacttype: das type or ontology type?
ls: das type
gh: uri attribute of the type
ad: that type or its subtype makes no sense for das types
ls: it's just an exact match. client can use the ontology to get a series of types
ls: should be an exact match, does not traverse the ontology. client should ask the user: do you want all exons or a specific type of exon?
ls: client goes through the ontology as necessary
[A] drop exacttype, type now has exacttype semantics

Topic: XID, feature ids
------------------------
ad: xid in features. no one has used it yet. gives a ref to some other db. all it is is a url/uri. feels like there should be more info (type?)
ad: primary name field for feature, feels like it should be name
ls: name is human readable. title would be ok
ad: but the feature filter called name searches the name and id fields
ls: this is correct behavior, you can do a fetch on the url/uri. this is ok.
ad: the name filter searches title and alias.
gh: if the feature id is resolvable and you resolve it, there's no guarantee it gives back a das2xml document. if the feature uri is resolvable, and you fetch it, you will get back a das2xml document, right? can you put a uri in the feature query?
aday: feels that having auto-generated names
ad: do all features have a human readable name?
gh/ls: optional
ad: why would you want to put a url in a name field?
gh: rdf
ad: should be a resolvable resource, das2xml for that feature.
ad: features with aliases, do aliases need type pk or accession? prosite has false match to ...
ls: this is a property or xid, not an alias
ad: suggests that xid needs extra stuff added to it.
gh: fine with an optional type attribute on xid
ad: let's wait until someone has a need.

Topic: Feature filters (continued)
----------------------------------
gh: feature filters, inside, contains, identical. Which do we need, which can we drop?
[A]
  overlaps - keep (all agree)
  inside - gregg needs
  contains - dropping, maybe
  identical - dropping
ad: what about excludes - the complement of overlaps?
gh: haven't had time to investigate whether I can use excludes rather than the inside + overlaps (contains?) combination I need now.
ls: use case: pointing to children and they haven't arrived yet.
gh: my client keeps stuff around; when you get parent/child, if you have the parent + all children you can construct the feature.
ls: the spec requires a single parent, right?
gh: no, you can have multiple.
ls: gff3 spec also allows mult parents and children
[A] Lincoln will provide use cases/examples of these feature scenarios:
  - three or greater hierarchy features
  - multiple parents
  - alignments

Topic: Registry
----------------
ap: still here.
gh: looking at the registry, having trouble retrieving in a normal browser. when looking at it in the client, I only see the biopackages server registered as a server. Lincoln said there was more?
ap: this is related to mime types, changed from text plain to x-das-sources
gh: I get an error: source file could not be read. lincoln said you added other test das2 servers to it.
ap: working on an interface so users can upload servers. half way through it now. upload a link to sources. will send email once it's there.
[A] Steve will add gregg's new affy das/2 server to the registry when Andreas' web interface is ready
gh: same time tomorrow.

From cjm at fruitfly.org Thu Mar 16 20:50:37 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 12:50:37 -0800
Subject: [DAS2] query language description
In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID:

Hi Andrew

I presume one constraint is that you want to preserve standard CGI URL syntax? I think this is the best that can be done using that constraint, which is to say, fairly limited. This lacks one of the most important features of a real query language, composability. These ad-hoc constraint syntaxes have their uses but you'll eventually want to go beyond the limits and end up adding awkward extensions.
Why not just forego the URL constraint and go with a composable extendable query language in the first place and save a lot of bother downstream?

On Mar 15, 2006, at 9:17 PM, Andrew Dalke wrote:

> The query fields are
>
>   name      | takes  | matches features ...
>   ==========================================
>   xid       | URI    | which have the given xid
>   type      | URI    | with the given type or subtype (XX keep this one???)
>   exacttype | URI    | with exactly the given type
>   segment   | URI    | on the given segment
>   overlaps  | region | which overlap the given region
>   inside    | region | which are contained inside the given region (XX needed??)
>   contains  | region | which contain the given region (XX needed??)
>   name      | string | with a name or alias which matches the given string
>   prop-*    | string | with the property "*" matching the given string
>
> Queries are form-urlencoded requests. For example, if the features
> query URL is 'http://biodas.org/features' and there is a segment named
> 'http://ncbi.org/human/Chr1' then the following is a request for all
> the features on the first 10,000 bases of that segment
>
> The query is for
>   segment = 'http://ncbi.org/human/Chr1'
>   overlaps = 0:10000
>
> which is form-urlencoded as
>
> http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
>
> Multiple search terms with the same key are OR'ed together. The
> following searches for features containing the name or alias of either
> BC048328 or BC015400
>
> http://biodas.org/features?name=BC048328;name=BC015400
>
> Multiple search terms with different keys are AND'ed together,
> but only after doing the OR search for each set of search terms with
> identical keys.
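The form-urlencoding rules quoted above can be sketched in a few lines of Python. This is a minimal sketch, not part of the spec; the helper name `build_features_query` is mine, and the ';' pair separator follows the spec's examples:

```python
from urllib.parse import quote

def build_features_query(base_url, terms):
    # Build a DAS/2-style features query from (key, value) pairs.
    # Values are percent-encoded. Terms sharing a key are OR'ed by
    # the server; terms with distinct keys are AND'ed.
    encoded = ";".join(f"{key}={quote(value, safe='')}" for key, value in terms)
    return f"{base_url}?{encoded}"

# Reproduces the example above: features on the first 10,000 bases.
url = build_features_query(
    "http://biodas.org/features",
    [("segment", "http://ncbi.org/human/Chr1"), ("overlaps", "0:10000")],
)
print(url)
# http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000
```

Note that `urllib.parse.urlencode` is not used directly here because it joins pairs with '&' rather than the ';' shown in the spec's examples.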
> The following searches for features which have
> a name or alias of BC048328 or BC015400 and which are on the segment
> http://ncbi.org/human/Chr1
>
> http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400
>
> The order of the search terms in the query string does not affect
> the results.
>
> If any part of a complex feature (that is, one with parents
> or parts) matches a search term then all of the parents and
> parts are returned. (XXX Gregg -- is this correct? XXX)
>
> The fields which take URLs require exact matches.
>
> I think we decided that there is no type inferencing done in
> the server; it's a client side thing. In that case the 'type'
> field goes away. We can still keep 'exacttype'. The URI
> used for the matching is the type uri, and NOT the ontology URI.
>
> (We don't have an ontology URI yet, and when we do we can add
> an 'ontology' query.)
>
> The segment URI must accept the local identifier. For
> interoperability with other servers they must also accept the
> equivalent global identifier, if there is one.
>
> If range searches are given then one and only one segment is
> allowed. Multiple segments may be given, but then ranges are not
> allowed.
>
> The string searches support a simple search language.
>   ABC   -- contains a word which exactly matches "ABC" (identity, not substring)
>   *ABC  -- words ending in "ABC"
>   ABC*  -- words starting with "ABC"
>   *ABC* -- words containing the substring "ABC"
>
> If you want a field which exactly contains a '*' you're kinda
> out of luck. The interpretation of whitespace in the query or
> in the search string is implementation dependent. For that
> matter, the meaning of "word" is implementation dependent. (Is
> *O'Malley* one word? *Lethbridge-Stewart*?)
>
> When we looked into this last month at Sanger we verified that
> all the databases could handle %substring% searches, which was
> all that people there wanted.
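The four wildcard forms in the search language above can be captured compactly. A sketch of one possible server-side matcher; word tokenization is deliberately left out, since the spec says the meaning of "word" and of whitespace is implementation dependent:

```python
def match_word(pattern, word):
    # ABC exact, *ABC suffix, ABC* prefix, *ABC* substring.
    if len(pattern) > 1 and pattern.startswith("*") and pattern.endswith("*"):
        return pattern[1:-1] in word
    if pattern.startswith("*"):
        return word.endswith(pattern[1:])
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

print(match_word("BC048328", "BC048328"))  # True: exact word match
print(match_word("*ABC", "xyzABC"))        # True: suffix match
print(match_word("ABC*", "xABC"))          # False: not a prefix match
```

As the spec notes, a literal '*' in the search text cannot be expressed in this scheme.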
> The Affy people want searches for
> exact word, prefix and suffix matches, as supported by the
> back-end databases.
>
> XXX CORRECT ME XXX
>
> The 'name' search searches.... It used to search the 'name'
> attribute and the 'alias' fields. There is no 'name' now. I
> moved it to 'title'. I think I did the wrong thing; it should
> be 'name', but it's a name meant for people, not computers.
>
> Some features (sub-parts) don't have human-readable names so
> this field must be optional.
>
> The "prop-*" is a search of the property elements. Features may
> have properties, like
>
>
> To do a string search for all 'membrane' cellular components,
> construct the query key by taking the string "prop-" and
> appending the property key text ("cellular_component"). The
> query value is the text to search for.
>
>   prop-cellular_component=membrane
>
> To search for any cellular_component containing the substring "mem"
>
>   prop-cellular_component=*mem*
>
> The rules for multiple searches with the same key also apply to the
> prop-* searches. To search for all 'membrane' or 'nuclear'
> cellular components, use two 'prop-cellular_component' terms, as
>
> http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear
>
> The range searches are defined with explicit start and end
> coordinates. The range syntax is in the form "start:end", for
> example, "1:9".
>
> Let 'min' be the smallest coordinate for a feature on a given
> segment and 'max' be one larger than the largest coordinate.
> These are the lower and upper bounds for the feature.
>
> An 'overlaps' search matches if and only if
>   min < end AND max > start
>
> XXX For GREG XXX
>
> What do 'inside' and 'contains' do? Can't we just get
> away with 'excludes', which is the complement of 'overlaps'?
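The range predicates read as follows in Python. The 'overlaps' test is exactly the min/max condition stated above; 'inside', 'contains', and 'excludes' are my assumed readings, since the message explicitly leaves their semantics open for discussion:

```python
def overlaps(fmin, fmax, start, end):
    # min < end AND max > start, with fmax one past the feature's
    # largest coordinate, per the definition quoted above.
    return fmin < end and fmax > start

def excludes(fmin, fmax, start, end):
    # Proposed complement of overlaps.
    return not overlaps(fmin, fmax, start, end)

def inside(fmin, fmax, start, end):
    # Assumed: the feature lies entirely within [start, end).
    return start <= fmin and fmax <= end

def contains(fmin, fmax, start, end):
    # Assumed: the feature spans the whole queried region.
    return fmin <= start and end <= fmax
```

Under these readings, `inside` and `contains` each imply `overlaps` for non-empty ranges, which is why the discussion keeps circling back to whether overlaps plus excludes is enough.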
> Searches are done as:
>   Step 0) specify the segment
>   Step 1) do all the includes (if none, match all features on segment)
>   Step 2) do all the excludes, inverted (like an includes search)
>   Step 3) only return features which are in Step 1 but not in Step 2
>   Step 4) ...
>   Step 5) Profit!
>
> I think this will support your smart code, and it's easy
> enough to implement.
>
> Everyone but you was planning to use 'overlaps'. Only you
> wanted to use 'inside'. Anyone want to use 'contains'?
>
> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Thu Mar 16 23:24:25 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:24:25 -0800
Subject: [DAS2] 'source' attribute in the types document
Message-ID:

Types have a 'source' field. The first draft shows examples like
  source='curated'
  source='genescan'
  source='tRNAscan-SE-1.11'

My interpretation is that this is a human readable field, with no machine interpretation other than as a string. It does not come from a controlled vocabulary. It may contain spaces.

This field is not currently searchable because we expect the number of types to be small enough that a client will download everything and do the search locally.

Let me know if I'm wrong.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 22:46:14 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 14:46:14 -0800
Subject: [DAS2] query language description
In-Reply-To:
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID:

Hi Chris,

> I presume one constraint is that you want to preserve standard CGI URL
> syntax?

Yes.

> I think this is the best that can be done using that constraint,
> which is to say, fairly limited.

Then again, the functionality we need is also fairly limited.
> This lacks one of the most important features of a real query
> language, composability. These ad-hoc constraint syntaxes have their
> uses but you'll eventually want to go beyond the limits and end up
> adding awkward extensions. Why not just forego the URL constraint and
> go with a composable extendable query language in the first place and
> save a lot of bother downstream?

Because no one can decide on a generic language which is more powerful than this.

Anything more powerful would need to support .. boolean algebra? numeric searches? regexps? What about quoting rules for "multiple word phrases"?

Is it SQL-like? XPath/XQuery-like? Is it a context-free grammar? How easy is it to implement and work cross-platform?

For what people need now, this search solution seems good.

For the future we can have

and clients which understand that interface will know that it's there.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 23:38:07 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:38:07 -0800
Subject: [DAS2] new search terms
Message-ID: <5a29cf88a8fc1e8e8448c6e1dd248dbb@dalkescientific.com>

"note=" is a string search of the note fields

Example:
  note=And*
finds all features which have a note containing a word starting with 'And'

"coordinates=" filters for features on that coordinate system. (We talked about this one yesterday.)

I'll republish the search terms before the end of the day.

Andrew
dalke at dalkescientific.com

From dalke at dalkescientific.com Thu Mar 16 23:54:12 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Thu, 16 Mar 2006 15:54:12 -0800
Subject: [DAS2] comments in schema
Message-ID:

I've updated the schema docs (das/das2/draft3/*.rnc) to include more detailed comments. Also, updated the ucla examples to change 'synonym' to 'reference'. Everything should be up to date.
Andrew
dalke at dalkescientific.com

From cjm at fruitfly.org Fri Mar 17 00:04:03 2006
From: cjm at fruitfly.org (chris mungall)
Date: Thu, 16 Mar 2006 16:04:03 -0800
Subject: [DAS2] query language description
In-Reply-To:
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org>

On Mar 16, 2006, at 2:46 PM, Andrew Dalke wrote:

> Hi Chris,
>
>> I presume one constraint is that you want to preserve standard CGI URL
>> syntax?
>
> Yes.

I'm guessing you've been through this debate before, so no comment..

>> I think this is the best that can be done using that constraint,
>> which is to say, fairly limited.
>
> Then again, the functionality we need is also fairly limited.

ignorant question.. (I have only been tangentially aware of the outer edges of the whole das2 process).. how are you determining the functionality required? surely someone somewhere will want to write a das2 client that implements boolean queries.

I speak from experience - I designed the GO Database API to have a very similar constraint language (it's expressed using perl hash keys rather than CGI parameters but the same basic idea). For years people have been clamouring for the ability to do more complex queries - right now they are forced to bypass the constraint language and go direct to SQL.

>> This lacks one of the most important features of a real query
>> language, composability. These ad-hoc constraint syntaxes have their
>> uses but you'll eventually want to go beyond the limits and end up
>> adding awkward extensions. Why not just forego the URL constraint and
>> go with a composable extendable query language in the first place and
>> save a lot of bother downstream?
>
> Because no one can decide on a generic language which is more
> powerful than this.
>
> Anything more powerful would need to support .. boolean algebra?
> numeric searches? regexps? What about quoting rules for "multiple
> word phrases"?
>
> Is it SQL-like?
> XPath/XQuery-like? Is it a context-free grammar?
> How easy is it to implement and work cross-platform?

None of these really fit into the DAS paradigm. I'm guessing you want something simple that can be used as easily as an API with get-by-X methods but will seamlessly blend into something more powerful. I think what you have is on the right lines. I'm just arguing to make this language composable from the outset, so that it can be extended to whatever expressivity is required in the future, without bolting on a new query system that's incompatible with the existing one.

The generic language could just be some kind of simple extensible function syntax for search terms, boolean operators, and some kind of (optional) nesting syntax. If you have boolean operators and it's composable, then yep it does have to be as expressive as boolean algebra.

I'd argue that implementing a composable query language is easier than an ad-hoc one

> For what people need now, this search solution seems good.
>
> For the future we can have
>
> and clients which understand that interface will know that it's
> there.

hmm, not sure how useful this would be - surely you'd want something more dasmodel-aware? if you're going to just pass-through to xpath or sql then why have a das protocol at all?

> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From Gregg_Helt at affymetrix.com Fri Mar 17 00:22:54 2006
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Thu, 16 Mar 2006 16:22:54 -0800
Subject: [DAS2] query language description
Message-ID:

For the type query filter, I'd suggest keeping the exacttype semantics you discuss below, but using "type" for the field name rather than "exacttype".
If we're getting rid of one of them, and a non-exact type is a meaningless concept, it seems like keeping that "exact" part is unnecessary and potentially confusing. gregg > > I think we decided that there is no type inferencing done in > the server; it's a client side thing. In that case the 'type' > field goes away. We can still keep 'exacttype'. The URI > used for the matching is the type uri, and NOT the ontology URI. > > (We don't have an ontology URI yet, and when we do we can add > an 'ontology' query.) > > The segment URI must accept the local identifier. For > interoperability with other servers they must also accept the > equivalent global identifier, if there is one. > > If range searches are given then one and only one segment is > allowed. Multiple segments may be given, but then ranges are not > allowed. > > The string searches support a simple search language. > ABC -- contains a word which exactly matches "ABC" (identity, not > substring) > *ABC -- words ending in "ABC" > ABC* -- words starting with "ABC" > *ABC* -- words containing the substring "ABC" > > If you want a field which exactly contains a '*' you're kinda > out of luck. The interpretation of whitespace in the query or > in the search string is implementation dependent. For that > matter, the meaning of "word" is implementation dependent. (Is > *O'Malley* one word? *Lethbridge-Stewart*?) > > When we looked into this last month at Sanger we verified that > all the databases could handle %substring% searches, which was > all that people there wanted. The Affy people want searches for > exact word, prefix and suffix matches, as supported by the > back-end databases. > > > XXX CORRECT ME XXX > > The 'name' search searches.... It used to search the 'name' > attribute and the 'alias' fields. There is no 'name' now. I > moved it to 'title'. I think I did the wrong thing; it should > be 'name', but it's a name meant for people, not computers.
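The simple search language quoted above (exact word, prefix, suffix, substring, chosen by leading/trailing '*') can be sketched in a few lines. This is only one possible reading, not code from the spec or the mailing list; as the post itself notes, what counts as a "word" is implementation dependent, and here we simply split on whitespace. The function name is hypothetical.

```python
def matches(pattern: str, field: str) -> bool:
    """Illustrative interpretation of the proposed search language:
    ABC  (exact word)   *ABC (suffix)   ABC* (prefix)   *ABC* (substring).
    "Word" is implementation dependent; this sketch splits on whitespace.
    """
    words = field.split()
    if len(pattern) > 1 and pattern.startswith("*") and pattern.endswith("*"):
        core = pattern[1:-1]
        return any(core in word for word in words)
    if pattern.startswith("*"):
        return any(word.endswith(pattern[1:]) for word in words)
    if pattern.endswith("*"):
        return any(word.startswith(pattern[:-1]) for word in words)
    return pattern in words  # identity, not substring

print(matches("ABC", "the ABC gene"))   # exact word match
print(matches("ABC", "ABCD gene"))      # identity, not substring: no match
```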
> > Some features (sub-parts) don't have human-readable names so > this field must be optional. > > > The "prop-*" is a search of the elements. Features may > have properties, like > > > > To do a string search for all 'membrane' cellular components, > construct the query key by taking the string "prop-" and > appending the property key text ("cellular_component"). The > query value is the text to search for. > > prop-cellular_component=membrane > > To search for any cellular_component containing the substring "membrane" > > prop-cellular_component=*membrane* > > The rules for multiple searches with the same key also apply to the > prop-* searches. To search for all 'membrane' or 'nuclear' > cellular components, use two 'prop-cellular_component' terms, as > > > http://biodas.org/features?prop-cellular_component=membrane;prop- > cellular_component=nuclear > > > The range searches are defined with explicit start and end > coordinates. The range syntax is in the form "start:end", for > example, "1:9". > > Let 'min' be the smallest coordinate for a feature on a given > segment and 'max' be one larger than the largest coordinate. > These are the lower and upper bounds for the feature. > > An 'overlaps' search matches if and only if > min < end AND max > start > > XXX For GREG XXX > > What do 'inside' and 'contains' do? Can't we just get > away with 'excludes', which is the complement of 'overlaps'? > Searches are done as: > Step 0) specify the segment > Step 1) do all the includes (if none, match all features on segment) > Step 2) do all the excludes, inverted (like an includes search) > Step 3) only return features which are in Step 1 but not > in Step 2) > Step 4) ... > Step 5) Profit! > > I think this will support your smart code, and it's easy > enough to implement. > > Everyone but you was planning to use 'overlaps'. Only you > wanted to use 'inside'. Anyone want to use 'contains'?
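The overlap rule quoted above (min < end AND max > start) is the standard half-open interval test. A tiny sketch, using the post's definitions ('max' is one larger than the largest coordinate) and treating the proposed 'excludes' as the complement of 'overlaps' on the same segment; the function names are mine, not from the spec:

```python
def overlaps(fmin: int, fmax: int, start: int, end: int) -> bool:
    # fmin/fmax: feature bounds; fmax is one past the largest coordinate.
    # start/end: the query range, same half-open convention.
    return fmin < end and fmax > start

def excludes(fmin: int, fmax: int, start: int, end: int) -> bool:
    # Complement of 'overlaps' for a feature on the same query segment.
    return not overlaps(fmin, fmax, start, end)

print(overlaps(10, 20, 0, 15))   # feature [10,20) vs query 0:15 -> True
print(overlaps(10, 20, 20, 30))  # merely adjacent ranges do not overlap -> False
```

With half-open coordinates, adjacency is not overlap, which is why the test uses strict inequalities on both sides.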
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Mar 17 02:05:06 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 18:05:06 -0800 Subject: [DAS2] query language description In-Reply-To: <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> Message-ID: Chris: > ignorant question.. (I have only been tangentially aware of the outer > edges of the whole das2 process).. > > how are you determining the functionality required? surely someone > somewhere will want to write a das2 client that implements boolean > queries It was informal, based on feedback from client developers and maintainers. Lincoln, Thomas, Andreas, Gregg and others provided that feedback. It was not by talking with users. I know there's a wide range of users and use cases. The point of this query language is to have basic functionality that all servers can implement. > right now they are forced bypass the constraint language and go direct > to SQL. In addition, we provide defined ways for a server to indicate that there are additional ways to query the server. > None of these really lit into the DAS paradigm. I'm guessing you want > something simple that can be used as easily as an API with get-by-X > methods but will seamlessly blend into something more powerful. I > think what you have is on the right lines. I'm just arguing to make > this language composable from the outset, so that it can be extended > to whatever expressivity is required in the future, without bolting on > a new query system that's incompatible with the existing one. We have two ways to compose the system. 
If the simple query language is extended, for example, to support word searches of the text field instead of substring searches, then a server can say This is backwards compatible, so the normal DAS queries work. But a client can recognize the new feature and support whatever new filters that 'word-search' indicates, eg http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre* (finds features with notes containing words starting with 'Andre' ) These are composable. For example, suppose Sanger allows modification date searches of curation events. Then it might say and I can search for notes containing words starting with "Andre" which were modified by "dalke" between 2002 and 2005 by doing http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre*& modified-by=dalke&modified-before=2005&modified-after=2002 An advantage to the simple boolean logic of the current system is that the GUI interface is easy, and in line with existing simple search systems. If someone wants to implement a new search system which is not backwards compatible then the server can indicate that alternative with a new CAPABILITY. Suppose Thomas at Sanger comes up with a new search mechanism based on an object query language he invented, The Sanger and EBI clients might understand that and support a more complex GUI, eg, with a text box interface. Everyone else must ignore unknown capability types. Then that would be POSTED (or whatever the protocol defines) to the given URL, which returns back whatever results are desired. Or the server can point to a public MySQL port, like That's what you are doing to bypass the syntax, except that here it isn't a bypass; you can define the new interface in the DAS sources document. > The generic language could just be some kind of simple > extensible function syntax for search terms, boolean operators, > and some kind of (optional) nesting syntax. Which syntax? Is it supposed to be easy for people to write? Text oriented?
Or tree structured, like XML, or SQL-like? And which clients and servers will implement that search language? If there was a generic language it would allow OR("on segment Chr1 between 1000 and 2000", "on segment ChrX between 99 and 777") which is something we are expressly not allowing in DAS2 queries. It doesn't make sense for the target applications, and excluding it simplifies the server development, which means less chance for bugs. Also, I personally haven't figured out a decent way to do a GUI composition of a complex boolean query which is as easy as learning the query language in the first place. A more generic language implementation is a lot of overhead if most (80%? 90%?) need basic searches, and many of the rest can fake it by breaking a request into parts and doing the boolean logic on the client side. Feedback I've heard so far is that DAS1 queries were acceptable, with only a few new search fields needed. > hmm, not sure how useful this would be - surely you'd want something more dasmodel-aware? The example I gave was a bad one. What I meant was to show how there's an extension point so someone can develop a new search interface and clients can know that the new functionality exists, without having to change the DAS spec.
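The CGI-style composition discussed in this thread (multiple key=value filters joined into one URL) is easy to sketch on the client side. This is a hypothetical helper, not from the spec: it assumes the ';' pair separator used in the examples in these mails and leaves '*' literal so wildcard searches pass through unescaped.

```python
from urllib.parse import quote

def build_query(base_url, filters):
    """Join (key, value) filter pairs into a DAS-style query URL.
    Repeated keys are simply repeated in the query string.
    Hypothetical sketch: ';' as separator, '*' left literal.
    """
    pairs = ("%s=%s" % (quote(k, safe=""), quote(v, safe="*"))
             for k, v in filters)
    return base_url + "?" + ";".join(pairs)

url = build_query("http://biodas.org/features",
                  [("name", "BC048328"), ("name", "BC015400")])
print(url)  # http://biodas.org/features?name=BC048328;name=BC015400
```

URI-valued filters get percent-encoded the same way, e.g. a segment value of 'http://ncbi.org/human/Chr1' becomes segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1.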
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 04:47:58 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 20:47:58 -0800 Subject: [DAS2] query language description In-Reply-To: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> Message-ID: Updated:

- added 'note' as a query field
- changed string searches to substring (not word) searches and made them case insensitive
    "AB"   matches only the strings "AB", "Ab", "aB" and "ab"
    "*AB"  matches only fields which exactly end with "AB", "ab", "aB", and "Ab"
    "AB*"  matches only fields which start with "AB", up to case
    "*AB*" matches only fields which contain the substring, up to case
- added 'coordinates' search
- removed 'type' and renamed 'exacttype' to 'type'
- removed 'contains' search, which no one said they wanted. Instead, supporting (EXPERIMENTAL) an 'excludes' search.

==================================

The query fields are

name        | takes  | matches features ...
===========================================
xid         | URI    | which have the given xid
type        | URI    | with exactly the given type
segment     | URI    | on the given segment
coordinates | URI    | which are part of the given coordinate system
overlaps    | region | which overlap the given region
excludes    | region | which have no overlap to the given region
inside      | region | which are contained inside the given region
name        | string | with a title or alias which matches the given string
note        | string | with a note which matches the given string
prop-*      | string | with the property "*" matching the given string

Queries are form-urlencoded requests.
For example, if the features query URL is 'http://biodas.org/features' and there is a segment named 'http://ncbi.org/human/Chr1' then the following is a request for all the features on the first 10,000 bases of that segment. The query is for

    segment = 'http://ncbi.org/human/Chr1'
    overlaps = 0:10000

which is form-urlencoded as

    http://biodas.org/features?segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;overlaps=0%3A10000

Multiple search terms with the same key are OR'ed together. The following searches for features containing the name or alias of either BC048328 or BC015400

    http://biodas.org/features?name=BC048328;name=BC015400

The 'excludes' search is an exception. See below. Multiple search terms with different keys are AND'ed together, but only after doing the OR search for each set of search terms with identical keys. The following searches for features which have a name or alias of BC048328 or BC015400 and which are on the segment http://ncbi.org/human/Chr1

    http://biodas.org/features?name=BC048328;segment=http%3A%2F%2Fncbi.org%2Fhuman%2FChr1;name=BC015400

The order of the search terms in the query string does not affect the results. If any part of a complex feature (that is, one with parents or parts) matches a search term then all of the parents and parts are returned. (XXX Gregg -- is this correct? XXX) The fields which take URLs require exact matches, that is, a character-by-character match. (For details on the nuances of comparing URIs see http://www.textuality.com/tag/uri-comp-3.html ) (We don't have an ontology URI yet, and when we do we can add an 'ontology' query.) The segment query filter takes a URI. This must accept the segment URI and, if known to the server, the equivalent reference identifier for the segment. If range searches are given then one and only one segment must be given. If there are multiple segment queries then ranges are not allowed. The string searches may be exact matches, substring, prefix or suffix searches.
The query type depends on whether the search value starts and/or ends with a '*'.

    ABC   -- field exactly matches "ABC"
    *ABC  -- field ends with "ABC"
    ABC*  -- field starts with "ABC"
    *ABC* -- field contains the substring "ABC"

The "*" has no special meaning except at the start or end of the query value. The search term "***" will match a field which contains the character "*" anywhere. There is no way to match fields which exactly match '*' or which only start or end with that character. Text searches are case-insensitive. The string "ABC" matches "abc", "aBc", "ABC", etc. A server may choose to collapse multiple whitespace characters into a single space character for search purposes. For example, the query "*a newline*" should match "This is a line of text which contains a newline". The 'name' search does a text search of the 'title' and 'alias' fields. The "prop-*" is shorthand for a class of text searches of elements. Features may have properties, like To do a string search for all 'membrane' cellular components, construct the query key by taking the string "prop-" and appending the property key text ("cellular_component"). The query value is the text to search for, in this case:

    prop-cellular_component=membrane

To search for any cellular_component containing the substring "membrane"

    prop-cellular_component=*membrane*

The rules for multiple searches with the same key also apply to the prop-* searches. To search for all 'membrane' or 'nuclear' cellular components, use two 'prop-cellular_component' terms, as

    http://biodas.org/features?prop-cellular_component=membrane;prop-cellular_component=nuclear

The range searches are defined with explicit start and end coordinates. The range syntax is in the form "start:end", for example, "1:9". There is no way to restrict the search to a specific strand. A feature may have several locations. An annotation may have several features in a parent/part relationship. The relationship may have several levels.
If a range search matches any feature in the annotation then the search returns all of the features in the annotation. An 'overlaps' search matches if and only if any feature location of any parent or part overlaps the query range and segment. An 'inside' search matches if and only if at least one feature in the annotation has a location on the query segment and all features which have a location on the query segment have at least one location which starts and ends in the query range. EXPERIMENTAL: An 'excludes' search matches if and only if at least one feature of the annotation is on the query segment and no features are in the query range. This is the complement of the 'overlaps' search, for annotations on the same query segment. Unlike the other search keys, if there are multiple 'excludes' searches then the results are AND'ed together. That is, if the query has two excludes ranges

    segment=ChrX
    excludes=RANGE1
    excludes=RANGE2

then the results are those features on ChrX which are not in RANGE1 and are not in RANGE2. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 07:05:54 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 23:05:54 -0800 Subject: [DAS2] alternate formats Message-ID: <3f895441c38b74460da9f8e4582b7a74@dalkescientific.com> If you've read the updated schema definitions you've seen that I've added the following comment in the CAPABILITY

    # Format names which can be passed to the query_uri.
    # The names are type dependent. At present the
    # only reserved names are for the 'features' capability.
    # These are: das2xml, count, uris
    format*,

We talked about this in the UK I think, and I mentioned it to people here. The 'count' format returns the count of features which would be returned for a given query. This is a single line containing the integer followed by a newline. The content-type of the document is text/plain.
For example, to get the number of all the features on the server Request: http://www.example.com/das2/mus/v22/features?format=count Response: Content-Type: text/plain 129254 I will add this format description to the spec. When does the server need to declare that it implements a given document type? My thought is that if the format list is not specified then the server must implement 'das2xml' and 'count' formats. If it doesn't implement the 'count' format then it needs to declare the complete list of what it does support. In addition I'll describe here the 'uris' format. It is a document of content-type text/plain containing the matching feature URIs, one per line. For example, file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: Hs.21346.0.A1_3p_a_at file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: Hs.21346.0.A1_3p_x_at file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: Hs.21346.1.S1_3p_x_at file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: Hs.21346.2.S1_3p_x_at file://Users/dalke/ucla/feature/Affymetrix_U133_X3P: Hs.21346.3.S1_3p_x_at file://Users/dalke/ucla/feature/Affymetrix_U133_X3P:Hs.271468.0.S1_3p_at (I feel like it should implement an xml:base scheme to reduce the amount of traffic.) The idea is that a client can request the URIs only, eg, to do more complex boolean-esque searches by doing simpler ones on the server and combining the results in client space. For another example, if the client already knows the feature data for a URI then it doesn't need to download the data again. So it gets a list of URIs then only fetches the ones it does not know about. This requires HTTP/1.1 pipelining for good performance. Because there are no clients which want it, because I'm not certain on the format, and because of the lack of pipelining in the existing servers, I will not document this format. I'll just leave it as a reserved format name. 
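Both plain-text formats described here are trivial to consume on the client side. A sketch, assuming response bodies shaped like the examples above; the function names are mine, and the 'uris' intersection illustrates the "combine results in client space" idea from the post:

```python
def parse_count(body: str) -> int:
    # 'count' format: a single text/plain line holding the integer.
    return int(body.strip())

def combine_uri_results(body_a: str, body_b: str) -> list:
    # 'uris' format: one matching feature URI per line. A client can
    # emulate an AND of two server-side searches by intersecting the
    # returned URI sets locally.
    uris_a = {line.strip() for line in body_a.splitlines() if line.strip()}
    uris_b = {line.strip() for line in body_b.splitlines() if line.strip()}
    return sorted(uris_a & uris_b)

print(parse_count("129254\n"))  # 129254
```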
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 07:33:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 16 Mar 2006 23:33:44 -0800 Subject: [DAS2] debugging validation proxy Message-ID: After a conversation with Gregg this afternoon, this evening I implemented a debugging validation proxy for DAS. The code is about 100 lines long and combines Python's "twisted" network library and the dasypus validator. To make it work, configure your DAS client to use a proxy, which is this validation proxy. Then do things as normal. The requests go through the proxy. It dumps the request info to stdout and forwards the request to the real server. It captures the response headers and body. When finished it passes the data to dasypus. I stuck some DAS-ish XML on my company web server and did the connection like this

    % curl -x localhost:8080 http://www.dalkescientific.com/sources.xml

The output from the debug window is

    Making request for 'http://www.dalkescientific.com/sources.xml'
    Warning: Unknown Content-Type 'application/xml'.
    Info: Assuming doctype of 'sources' based on root element at byte 40, line 2, column 2
    Finished processing

Andrew dalke at dalkescientific.com From allenday at ucla.edu Thu Mar 16 18:27:56 2006 From: allenday at ucla.edu (Allen Day) Date: Thu, 16 Mar 2006 10:27:56 -0800 (PST) Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: <200603151046.43196.lstein@cshl.edu> References: <200603151046.43196.lstein@cshl.edu> Message-ID: Hi Lincoln, Please just code to what is there, and expect your code to break when I update the biopackages server to v300 (probably next week). -Allen On Wed, 15 Mar 2006, Lincoln Stein wrote: > Hi Folks, > > I just ran through the source request on biopackages.net and it is returning > something that is very different from the current spec (CVS updated as of > this morning UK time).
I understand why there is a discrepancy, but for the > purposes of the code sprint, should I code to what the spec says or to what > biopackages.net returns? It is much more fun for me to code to a working > server because I have the opportunity to watch my code run. > > Best, > > Lincoln > > From Gregg_Helt at affymetrix.com Fri Mar 17 08:22:12 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Fri, 17 Mar 2006 00:22:12 -0800 Subject: [DAS2] New affymetrix das/2 development server Message-ID: I checked in a new version of the Affymetrix DAS/2 server this evening that supports XML responses based on the latest DAS/2 spec, version 300. For sample sources, segments, types, and features responses it passes the Dasypus validator tests. The validator was _very_ useful for bringing the server up to the current spec! Steve rolled the new version out on our public test server, the root sources query URL is http://205.217.46.81:9091/das2/genome/sequence. In the latest version of IGB checked into CVS, this server can be accessed as "Affy-temp" in the list of DAS/2 servers. Although the server's XML responses conform to spec v.300, the query strings it recognizes still only conform to a subset of spec v.200. I expect to have the queries upgraded to v.300 tonight. But it will probably still only support a subset of the query filters: one type (required), one overlaps (required), one inside (optional). This server also supports bed, psl, and some binary formats as alternative content formats, depending on the type of the annotations. 
gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Steve Chervitz > Sent: Wednesday, March 15, 2006 1:25 PM > To: DAS/2 > Subject: [DAS2] New affymetrix das/2 development server > > > Gregg's latest spec-compliant, but still development-grade, das/2 server > is > now publically available via http://205.217.46.81:9091 > > It's currently serving annotations from the following assemblies: > - human hg16 > - human hg17 > - drosophila dm2 > > Send me requests for any other data sources that would help your > development > efforts. > > Example query to get back a das-source xml document: > http://205.217.46.81:9091/das2/genome/sequence > > Its compliance with the spec is steadily improving, on a daily if not > hourly basis during the code sprint. > > Within IGB you can access this server from the DAS/2 servers tab > under 'Affy-temp'. > > You'll need the latest version of IGB from the CVS repository at > http://sf.net/projects/genoviz > > Steve > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Fri Mar 17 16:09:44 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 08:09:44 -0800 Subject: [DAS2] biopackages.net out of synch with spec? In-Reply-To: References: <200603151046.43196.lstein@cshl.edu> Message-ID: Allen: > Please just code to what is there, and expect your code to break when I > update the biopackages server to v300 (probably next week). So you all know, "300" is what we've been calling the current version of the spec, based on the code freeze that started 8 hours ago. It's the one currently only described in the schema definitions and in the example files under das/das2/draft3.
Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Fri Mar 17 16:40:20 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 08:40:20 -0800 Subject: [DAS2] proxies, caching and network configuration Message-ID: <58f16cd7fac095a708fd81a5cc5e40df@dalkescientific.com> I'm writing to encourage DAS client authors to include support for proxies when fetching DAS URLs. Nomi pointed out that Apollo supports proxies, because users asked for it. I think it's because some sites don't have direct access to the internet. I know a few of my clients have internal networks set up that way. Yesterday we talked a bit about how to point to local mirrors. It would be hard to have a standard configuration so that all DAS client code can know about local mirrors. I mentioned setting up proxies, but dismissed the idea. Now I'm thinking that that might be the solution. If there are local ways to get, say, sequence data then that could be done at the proxy level. Someone can easily (with less than 100 lines of code) write a new proxy server which points to a local resource if it knows that a URI is resolvable that way. Having proxy support also helps with debugging, like in the debugging proxy server I wrote yesterday. A nice thing is that some people want proxy support anyway, so if client code supports proxies then these other things (redirection to local mirrors, debugging) can be set up later, and with no extra work in the client. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Fri Mar 17 18:47:51 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Fri, 17 Mar 2006 10:47:51 -0800 Subject: [DAS2] New affymetrix das/2 development server In-Reply-To: Message-ID: The affy das/2 development server at http://205.217.46.81:9091 has been updated to better support DAS/2 spec version 300. Gregg says: > Changed genometry DAS/2 server so that it responds to feature queries that use > DAS/2 v.300 feature filters. 
Currently implements a subset of > the v.300 feature query spec: > requires one and only one segment filter > requires one and only one type filter > accepts zero or one inside filter > Also attempts to support DAS/2 v.200 feature filters, but success is not > guaranteed. Steve > From: Steve Chervitz > Date: Wed, 15 Mar 2006 13:24:59 -0800 > To: DAS/2 > Conversation: New affymetrix das/2 development server > Subject: New affymetrix das/2 development server > > > Gregg's latest spec-compliant, but still development-grade, das/2 server is > now publically available via http://205.217.46.81:9091 > > It's currently serving annotations from the following assemblies: > - human hg16 > - human hg17 > - drosophila dm2 > > Send me requests for any other data sources that would help your development > efforts. > > Example query to get back a das-source xml document: > http://205.217.46.81:9091/das2/genome/sequence > > Its compliance with the spec is steadily improving, on a daily if not hourly > basis during the code sprint. > > Within IGB you can access this server from the DAS/2 servers tab > under 'Affy-temp'. > > You'll need the latest version of IGB from the CVS repository at > http://sf.net/projects/genoviz > > Steve From dalke at dalkescientific.com Fri Mar 17 20:09:42 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Fri, 17 Mar 2006 12:09:42 -0800 Subject: [DAS2] defined minimum limits Message-ID: We should define minimum sizes for fields in the server database. For example, "the server must support feature titles of at least 40 characters", "must handle at least two 'excludes' feature filters". And define what to do when the server decides that writeback of a 30MB feature is just a bit too large.
Andrew dalke at dalkescientific.com From boconnor at ucla.edu Fri Mar 17 23:23:09 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Fri, 17 Mar 2006 15:23:09 -0800 Subject: [DAS2] das.biopackages.net Updated to Spec 300 Message-ID: <441B44DD.5010505@ucla.edu> Hi, So I checked in my changes to the DAS/2 server which should bring it up to the 300 spec. Allen updated the das.biopackages.net server and I tested the following URLs in Andrew's validation app. They all appear to be OK:

* http://das.biopackages.net/das/genome
* http://das.biopackages.net/das/genome/yeast
* http://das.biopackages.net/das/genome/human
* http://das.biopackages.net/das/genome/yeast/S228C
* http://das.biopackages.net/das/genome/human/17
* http://das.biopackages.net/das/genome/yeast/S228C/segment
* http://das.biopackages.net/das/genome/human/17/segment
* http://das.biopackages.net/das/genome/yeast/S228C/type
* http://das.biopackages.net/das/genome/human/17/type
* http://das.biopackages.net/das/genome/yeast/S228C/feature?overlaps=chrI/1:1000
* http://das.biopackages.net/das/genome/human/17/feature?overlaps=chr1/1000:2000

Let Allen or me know if you run into problems. --Brian From cjm at fruitfly.org Sat Mar 18 00:20:14 2006 From: cjm at fruitfly.org (chris mungall) Date: Fri, 17 Mar 2006 16:20:14 -0800 Subject: [DAS2] query language description In-Reply-To: References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> <8b7582943da22dfed23ba7b5386402fb@fruitfly.org> Message-ID: On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote: >> right now they are forced to bypass the constraint language and go direct >> to SQL. > > In addition, we provide defined ways for a server to indicate > that there are additional ways to query the server. I was positing this as a bad feature, not a good one. Or at least a symptom of an incorrectly designed system (at least in the case of the GO DB API - it may not carry forward to DAS - though if you're going to allow querying by terms...)
> >> None of these really fit into the DAS paradigm. I'm guessing you want >> something simple that can be used as easily as an API with get-by-X >> methods but will seamlessly blend into something more powerful. I >> think what you have is on the right lines. I'm just arguing to make >> this language composable from the outset, so that it can be extended >> to whatever expressivity is required in the future, without bolting on >> a new query system that's incompatible with the existing one. > > We have two ways to compose the system. If the simple query language > is extended, for example, to support word searches of the text field > instead of substring searches, then a server can say > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > This is backwards compatible, so the normal DAS queries work. But > a client can recognize the new feature and support whatever new filters > that 'word-search' indicates, eg > > http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre* > > (finds features with notes containing words starting with 'Andre' ) > > These are composable. For example, suppose Sanger allows modification > date searches of curation events. Then it might say > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > so this is limited to single-argument search functions? > > and I can search for notes containing words starting with "Andre" > which were modified by "dalke" between 2002 and 2005 by doing > > http://somewhere.over.rainbow/server.cgi?note-wordsearch=Andre*& > modified-by=dalke&modified-before=2005&modified-after=2002 but the compositionality is always associative since the CGI parameter constraint forbids nesting > An advantage to the simple boolean logic of the current system > is that the GUI interface is easy, and in line with existing > simple search systems.
there's nothing preventing you from implementing a simple GUI on top of an expressive system - there is nothing forcing you to use the expressivity > If someone wants to implement a new search system which is > not backwards compatible then the server can indicate that > alternative with a new CAPABILITY. Suppose Thomas at Sanger > comes up with a new search mechanism based on an object query > language he invented, > > query_uri="http://sanger.ac.uk/oql-search" /> > > The Sanger and EBI clients might understand that and support > a more complex GUI, eg, with a text box interface. Everyone > else must ignore unknown capability types. but this doesn't integrate with the existing query system > > Then that would be POSTED (or whatever the protocol defines) > to the given URL, which returns back whatever results are > desired. > > Or the server can point to a public MySQL port, like > > query_uri="mysql://username:password at hostname:port/databasename" > /> > > That's what you are doing to bypass the syntax, except that > here it isn't a bypass; you can define the new interface in > the DAS sources document. > >> The generic language could just be some kind of simple >> extensible function syntax for search terms, boolean operators, >> and some kind of (optional) nesting syntax. > > Which syntax? Is it supposed to be easy for people to write? > Text oriented? Or tree structured, like XML, or SQL-like? I'd favour some concrete abstract syntax that looks much like the existing DAS QL > And which clients and servers will implement that search > language? all servers. clients optional > > If there was a generic language it would allow > OR("on segment Chr1 between 1000 and 2000", > "on segment ChrX between 99 and 777") > which is something we are expressly not allowing in DAS2 > queries. It doesn't make sense for the target applications > and excluding it simplifies the server development, > which means less chance for bugs.
this example is pointless but it's easy to imagine plenty of ontology
term queries or other queries in which this would be useful

I guess I depart from the normal DAS philosophy - I don't see this
being a high barrier for entry for servers, if they're not up to this
it'll probably be a buggy hacky server anyway

> Also, I personally haven't figured out a decent way to
> do a GUI composition of a complex boolean query which is
> as easy as learning the query language in the first place.

doesn't mean it doesn't exist.

i'm not sure what's hard about having say, a clipboard of favourite
queries, then allowing some kind of drag-and-drop composition

> A more generic language implementation is a lot of overhead
> if most (80%? 90%) need basic searches, and many of the
> rest can fake it by breaking a request into parts and
> doing the boolean logic on the client side.

this is always an option - if the user doesn't mind the additional
possibly very high overhead. it's just a little bit of a depressing
approach, as if Codd's seminal paper from 1970 or whenever it was never
happened.

> Feedback I've heard so far is that DAS1 queries were
> acceptable, with only a few new search fields needed.
>
>> hmm, not sure how useful this would be - surely you'd want something
>> more dasmodel-aware?
>
> The example I gave was a bad one. What I meant was to show
> how there's an extension point so someone can develop a new
> search interface and clients can know that the new functionality
> exists, without having to change the DAS spec.

ok

that's probably all I've got to say on the matter, sorry for being
irksome. I guess I'm fundamentally missing something, that is, why wrap
simple and expressive declarative query languages with limited ad-hoc
constraint systems with consciously limited expressivity and limited
means of extensibility..
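[Editor's sketch: the client-side fallback Andrew mentions above - breaking an OR into separate requests and doing the boolean logic on the client - shown in Python. The `fetch` callable stands in for a real DAS feature request; the data and names are invented for illustration.]

```python
def client_side_or(fetch, filter_sets):
    """Emulate OR on the client: run one feature query per set of
    filters and union the results by feature id, so a feature that
    satisfies several disjuncts appears only once. `fetch` stands in
    for a real DAS feature request and returns {feature_id: feature}."""
    merged = {}
    for filters in filter_sets:
        merged.update(fetch(filters))
    return merged

# Toy stand-in for a server: features indexed by segment.
data = {
    "Chr1": {"f1": "gene A", "f2": "gene B"},
    "ChrX": {"f2": "gene B", "f3": "gene C"},  # f2 satisfies both queries
}
features = client_side_or(lambda f: dict(data[f["segment"]]),
                          [{"segment": "Chr1"}, {"segment": "ChrX"}])
```

The cost Chris points at is real: each disjunct is a full round trip and a full result set before the union, which is where the "possibly very high overhead" comes from.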
cheers
chris

> Andrew
> dalke at dalkescientific.com
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

From Steve_Chervitz at affymetrix.com  Mon Mar 20 04:54:36 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Sun, 19 Mar 2006 20:54:36 -0800
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
Message-ID: 

Notes from DAS/2 code sprint #2, day five, 17 Mar 2006

$Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $

Note taker: Steve Chervitz

Attendees:
Affy: Steve Chervitz, Ed E., Gregg Helt
Dalke Scientific: Andrew Dalke (at Affy)
UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

Agenda:

* Status reports
* Writeback progress

Status reports:
---------------

gh: This is the last mtg of code sprint. For the status reports, focus
on where you are at and what you are hoping to accomplish post-sprint.

gh: working on version of affy server that impls das/2 v300 spec for
all xml responses. sample responses passed andrew's validation.
steve rolled it out to public server.

updated igb client to handle v300 xml.
worked more on server to impl v300 query syntax using full uri for
type segment, segment separate from overlaps and inside.
only impls a subset of the feature query.
requires one and only one segment, type, insides.

hoping to do for rest of sprint and after:
1. supporting name feat filters in igb client
2. remove restrictions from the server
3. making sure new version of server gets rolled out,
4. roll out jar for this version of igb. maybe put on genoviz sf site
for testing purposes.

bo: looked at xml docs that andrew checked in, updating ucla templates
on server, not rolled out to biopackages.net, waiting to make rpm,
hoping to do code cleanup in igb.
getting andrew's help running validator on local copy of server.

gh: igb would like to support v300, but one server is v200+ (ucla),
one at v300 (affy) complicates things. so getting your server good to
go would be my priority.

bo: code clean up involves assay and ontology interface.

gh: we're planning an igb release at end of march. as long as the code
is clean by then it's ok.

aday: code cleanup, things removed from protocol. exporting data
matrices from assay part of server.
validate sources document w/r/t v300 validator. work with brian to
make sure everything is updated to v300. probably working on filter
query, since we now treat things as names not full uri's.

ad: what extra config info do you need in server for that? can you get
it from the http headers?
gh: mine is being promiscuous, just name of type will work. might give
the wrong thing back, but for data we're serving back now, it can't be
wrong.

ad: how much trouble does the uri handling cause for you?

gh: has to be full uri of the type, doing otherwise is not an option
(in the spec).
ad: you could just use name internally, then put together full uri
when you go to the outside world.

ad: I updated comments in schema definitions, updated query lang
spec. string searches are substring searches not word-substring
searches.
abc = whole field must be equal
*abc = suffix match
abc* = prefix match

previously said it was word match, but that's too complicated on
server.
worked with gregg to pin down what inside search means.
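[Editor's sketch: Andrew's three matching rules above, as a small Python predicate. It implements only the stated forms and deliberately doesn't guess at forms the notes don't cover, such as `*abc*`.]

```python
def field_matches(pattern, value):
    """Match a string-search pattern per the rules above:
    'abc'  -> whole field must equal 'abc'
    '*abc' -> suffix match
    'abc*' -> prefix match
    """
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern
```

So `abc*` accepts "abcdef" while a bare `abc` accepts only the exact string "abc".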
I'm thinking about the possibility of a validating proxy server,
configure das client to go through proxy before outside world, the
server would sniff everything going by.
Support for proxies can enable lots of sorts of things w/o needing
additional config for each client.

gh: how do you do proxy in java? i.e., redirect all network calls to a
proxy.
bo: there's a way to set proxy options via the system object in the
java vm. can show you some examples of this.

aday: performance.
gh: current webstart based igb works with the existing public das/2
server, [comment pertaining to: the new version of igb and a new
version of the affy das/2 server].

ad: when will we get reference names from lincoln?
gh: should happen yesterday. poke him about this.
would be really nice to be able to overlay annotations!

The current version of igb can turn off v300 options, and then it can
load stuff from the ucla server. The version of igb in cvs now can hit
both biopackages.net and affy server in the dmz. and there's
hardwiring to get things to overlay. temporary patch.

ee: two things:
1. style sheets. info from andrew yesterday. looking over that. will
discuss questions w/ andrew.
2. making sure that when we do a new release of igb in a couple of
weeks (when I'm not here) that it will go smoothly. go over w/
gregg, steve. lots of testing.
made changes in parser code, should still work.

sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com.
ee: it's the das/1 I'm most concerned about.

sc: installed and updated gregg's new das/2 server on a publicly
accessible machine (separate box from the production das/1 and das/2
servers on netaffxdas.affymetrix.com).
Also spent time loading data for new affy arrays (mouse rat
exons). this required lots of memory, had to disable support for some
other arrays. [gregg's das servers load all annotations into memory at
start up, hence the big memory requirements for arrays with lots of
probe sets.]
[A] gregg optimize affy das server memory reqts for exon arrays.

gh: we've gotten a lot done this week. I think we have a stable spec.

gh: serving alignments, no cigars, but blat alignment to genome as
coords on mrna and coords on the genome. igb doesn't use it yet, but
it's there.
ad: xid in region elements.
gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to
xid in das/2?
ad: yes. i believe
gh: if you have links in das/1. without links it can build links from
feature id using a template. This is used for building links from
within IGB back to netaffx, for example.

Topic: Writebacks
-----------------

gh: writebacks haven't been mentioned at all this week.
ad: we need people committed to writing a server to implement it.
gh: we decided that since ed griffith would be working on it at
Sanger, we wouldn't worry about it for ucla server.
bo: we started prototyping. locking mechanism. persisting part of a
mage document. the spec changed after that. andrew's delta model. a
little different from what we were prototyping.
actual persistence will be done in the assay portion of our server.
gh: grant focuses on write back for genome portion, and this was a big
chunk of the grant. ends at the end of may or june.

ad: delta model was: here's a list of add, delete, modify in one
document. An issue was if you change an existing record, do you give
it a new identifier?
gh: you never modify something with an existing id, just make a new
one, new id, with a pointer back to old one. Ed Griffith said this a
month ago. I like this idea. but told we cannot make this requirement
on the database. but very few dbs will be writeback, so it's not
affecting all servers

ad: making new uris, client has to know the new uri for the old
one. needs to return a mapping document.
if network crashes partway through, client won't know what the mapping
is and it will be lost.
gh: server doesn't know if client got it. it could act(?) back.
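[Editor's sketch: the delta model as discussed above - adds, deletes, and modifies in one document; a modify never touches a record in place but creates a successor with a new id pointing back at the old one; the server returns an old-to-new id mapping. The function, field names, and id scheme are all invented for illustration, not the actual DAS/2 protocol.]

```python
import itertools

def apply_delta(store, delta, new_ids):
    """Apply a writeback delta document to a feature store and return
    the old->new id mapping the client needs to learn the new ids."""
    mapping = {}
    # Adds arrive under client-chosen provisional ids; the server
    # assigns real ids and reports them in the mapping.
    for prov_id, feat in delta.get("add", {}).items():
        real_id = next(new_ids)
        store[real_id] = dict(feat)
        mapping[prov_id] = real_id
    for fid in delta.get("delete", []):
        del store[fid]
    # Modifies create a successor record; the old record is kept.
    for fid, changes in delta.get("modify", {}).items():
        successor = {**store[fid], **changes}
        successor["replaces"] = fid          # pointer back to the old id
        new_id = next(new_ids)
        store[new_id] = successor
        mapping[fid] = new_id
    return mapping

store = {"f1": {"name": "geneA"}}
ids = ("f%d" % n for n in itertools.count(2))
mapping = apply_delta(
    store,
    {"add": {"tmp1": {"name": "geneB"}},
     "modify": {"f1": {"name": "geneA, curated"}}},
    ids,
)
```

If the response carrying `mapping` is lost in transit, the client has no way to relate its old ids to the new ones, which is exactly the failure mode discussed next.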
gh: if a response from http server dies, server has no way to know.
ad: There could be a proxy in the middle, or isp's proxy server. The
server sent it successfully to the proxy, but never made it to the
client.

gh: how is this dealt with for commits into relational dbs? same thing
applies
ad: don't know
ee: could ask for everything in this region.
ad: have a new element that says 'i used to be this'.
bo: you do an insert in a db, to get last pk that was issued. client
talks back to server, give me last feature uri that was provisioned on
my connection. so the client is in control.

sc: it's up to client to get confirmation from server. If it failed to
get the response after sending in the modification request, it could
request that the server send it again.

ad: (drawing on whiteboard) two stage strategy, get a transaction state.

post "get transaction url"
    <---------------
post (put?) to transaction URL
    ------------->
can do multiple (if identical)
    ---------->
    ---------->
Get was successful and here's transformation info
    <---------------

ad: server can hold transformation info for some timespan in case
client needs to re-fetch.

gh: I'm more interested in getting a server up than a client
regarding writeback. complex parts of the client are already
implemented (apollo).

gh: locks are region based not feature based.
ad: can't lock...

gh: we can talk about how to trigger ucla locking mechanism.
bo: did flock transactional locking as suggested in the perl
cookbook. mage document has content. server locks an id using flock,
(for assay das).
gh: to lock a region on the genome, lock on all ids for features in
this region.
bo: make a file containing all the ids that are locked. flock this
file.

ad: file locking is fraught with problems. why not keep it in the
database and let the db lock it for you. don't let perl + file system
do it for you. there could be fs problems. nfs isn't good at that. a
database is much more reliable.
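[Editor's sketch: the shape of Brian's flock approach - one file holding every locked feature id, guarded by an exclusive flock - shown in Python rather than Perl. The function and file layout are invented for illustration, and Andrew's caveat stands: flock is fragile on NFS, and a database lock is the more robust choice.]

```python
import fcntl
import os
import tempfile

def lock_ids(lockfile_path, feature_ids):
    """Try to lock a set of feature ids. Returns the ids that were
    already locked (the conflicts); when there are no conflicts, the
    requested ids are recorded in the lock file as now locked."""
    with open(lockfile_path, "a+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # block until we own the file
        fh.seek(0)
        already_locked = set(fh.read().split())
        conflicts = already_locked & set(feature_ids)
        if not conflicts:
            for fid in feature_ids:
                fh.write(fid + "\n")     # 'a+' mode appends the writes
        fcntl.flock(fh, fcntl.LOCK_UN)
    return conflicts

lockfile = os.path.join(tempfile.mkdtemp(), "locked-ids")
first = lock_ids(lockfile, ["feat1", "feat2"])   # nothing locked yet
second = lock_ids(lockfile, ["feat2", "feat3"])  # feat2 already held
```

Locking a genomic region then reduces to calling this with the ids of all features in that region, per Gregg's suggestion.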
bo: I went with perl flock mechanism since you could have other
non-database sources (though so far it's all db).

[A] steve, allen send brian code tips regarding locking.

gh: putting aside pushing large data chunks into the server, for
curation it's ok if protocol is a little error prone, since the
curator-caused errors will be much more likely/common.

ad: UK folks haven't done any writeback work as far as I know.
gh: they haven't billed us in 2 years. Tony cox is contact, ed
griffith is main developer.
ad: andreas and thomas are not funded by this grant or the next one.
gh: they are already funded by other means.

ad: if someone wants to change an annotation should they need to get
a lock first or can it work like cvs? do it if it can, get lock,
release lock in one transaction.
ee: that's my preference.

ad: if every feature has its own id, you know if it's...

ee: some servers might not have any writeback facility at
all. conflicts will be rare.

[A] ask ed/tony on whether they plan to have any writeback facility

gh: ed g wanted to work on client to do writeback, don't know who
would work on a server there.
ad: someone else, can't remember - roy?
gh: unless we hear back from sanger, the highest priority for ucla
folks after updating server for v300, is working server-side
writeback.

gh: spec freeze is for the read portion. the writeback portion will
have to change as needed.
ad: and arithmetic? ;-)

From lstein at cshl.edu  Mon Mar 20 17:27:59 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:27:59 -0500
Subject: [DAS2] Notes from DAS/2 code sprint #2, day five, 17 Mar 2006
In-Reply-To: 
References: 
Message-ID: <200603201227.59816.lstein@cshl.edu>

Hi Folks,

I will join the DAS2 call a little late today (no more than 10 min). I'm
assuming that we're on?
Lincoln On Sunday 19 March 2006 23:54, Steve Chervitz wrote: > Notes from DAS/2 code sprint #2, day five, 17 Mar 2006 > > $Id: das2-teleconf-2006-03-17.txt,v 1.2 2006/03/20 05:05:22 sac Exp $ > > Note taker: Steve Chervitz > > Attendees: > Affy: Steve Chervitz, Ed E., Gregg Helt > Dalke Scientific: Andrew Dalke (at Affy) > UCLA: Allen Day, Brian O'Connor (at Affy) > > Action items are flagged with '[A]'. > > These notes are checked into the biodas.org CVS repository at > das/das2/notes/2006. Instructions on how to access this > repository are at http://biodas.org > > DISCLAIMER: > The note taker aims for completeness and accuracy, but these goals are > not always achievable, given the desire to get the notes out with a > rapid turnaround. So don't consider these notes as complete minutes > from the meeting, but rather abbreviated, summarized versions of what > was discussed. There may be errors of commission and omission. > Participants are welcome to post comments and/or corrections to these > as they see fit. > > Agenda: > > * Status reports > * Writeback progress > > > Status reports: > --------------- > > gh: This is the last mtg of code sprint. For the status reports, focus > on where you are at and what you are hoping to accomplish post-sprint. > > gh: working on version of affy server that impls das/2 v300 spec for > all xml responses. sample responses passed andrew's validation. > steve rolled it out to public server. > > updated igb client to handle v300 xml. > worked more on server to impl v300 query syntax using full uri for > type segment, segment separate from overlaps and inside. > only impls a subset of the feature query. requires one and only one > segment, type, insides. > > hoping todo for rest of sprint and after: > 1. supporting name feat filters in igb client > 2. remove restrictions from the server > 3. making sure new version of server gets rolled out, > 4. roll out jar for this version of igb. 
maybe put on genoviz sf site for > testing purposes. > > bo: looked at xml docs that andrew checked in, updating ucla templates > on server, not rolled out to biopackages.net, waiting to make rpm, > hoping to do code cleanup in igb. > getting andrew's help running validator on local copy of server. > > gh: igb would like to support v300, but one server is v200+ (ucla), > one at v300 (affy) complicates things. so getting your server good to > go would be my priority. > > bo: code clean up involves assay and ontology interface. > > gh: we're planning an igb release at end of march. as long as the code > is clean by then it's ok. > > aday: code cleanup, things removed from protocol. exporting data > matrices from assay part of server. > validate sources document w/r/t v300 validator. work with brian to > make sure everything is update to v300. probably working on fiter > query, since we now treat things as names not full uri's. > > ad: what extra config info do you need in server for that? can you get > it from the http headers? > gh: mine is being promiscuous, just name of type will work. might give > the wrong thing back, but for data we're serving back now, it can't be > wrong. > > ad: how much trouble does the uri handling cause for you? > > gh: has to be full uri of the type, doing otherwise is not an option > (in the spec). > ad: you could just use name internally, then put together full uri > when you go to the outside world. > > ad: I updated comments in schema definitions, updated query lang > spec. string searches are substring searches not word-substring > searches. > abc = whole field must be equal > *abc = suffix match > abc* = prefix match > > previously said it was word match, but that's too complicated on > server. > worked with gregg to pin down what inside search means. > > I'm thinking about the possibility of a validating proxy server, > configure das client to go through proxy before outside world, the > server would sniff everything going by. 
> Support for proxys can enable lots of sorts of things w/o needing > additional config for each client. > > gh: how do you do proxy in java? i.e., redirect all network calls to a > proxy. > bo: there's a way to set proxy options via the system object in the > java vm. can show you some examples of this. > > aday: performance. > gh: current webstart based ibg works with the existing public das/2 > server, [comment pertaining to: the new version of igb and a new > version of the affy das/2 server]. > > ad: when will we get reference names from lincoln? > gh: should happen yesterday. poke him about this. > would be really nice to be able to overlay anotations! > > The current version of igb can turn off v300 options, and then ti can > load stuff from the ucla server. The version of igb in cvs now can hit > both biopackages.net and affy server in the dmz. and there's > hardwiring to get things to overlay. temporary patch. > > ee: two things: > 1. style sheets. info from andrew yesterday. looking over that. will > discuss questions w/ andrew. > 2. making sure that when we do a new release of igb in a couple of > weeks (when I'm not here) that it will go smoothly . go over w/ > gregg, steve. lots of testing. > made changes in parser code, should still work. > > sc: I updated jars for das/1 not das/2 on netaffxdas.affymetrix.com. > ee: it's the das/1 I'm most concerned about. > > sc: installed and updated gregg's new das/2 server on a publically > accessible machine (separate box from the production das/1 and das/2 > servers on netaffxdas.affymetrix.com). > Also spent a time loading data for new affy arrays (mouse rat > exons). this required lots of memory, had to disable support for some > other arrays. [gregg's das servers load all annotations into memory at > start up, hance the big memory requirements for arrays with lots of > probe sets.] > > [A] gregg optimize affy das server memory reqts for exon arrays. > > gh: we' gotten a lot done this week. 
I think we have a stable spec. > > gh: serving alignments, no cigars, but blat alignment to genome as > coords on mrna and coords on the genome. igb doesn't use it yet, but > it's there. > ad: xid in region elements. > gh: we haven't exercised the xids. so 'link' in das/1 is equivalent to > xid in das/2? > ad: yes. i believe > gh: if you have links in das/1. without links it can build links from > feature id using a template. This is used for building links from > within IGB back to netaffx, for example. > > Topic: Writebacks > ----------------- > > gh: writebacks haven't been mentioned at all this week. > ad: we need people committed to writing a server to implement it. > gh: we decided that since ed griffith would be working on it at > Sanger, we wouldn't worry about it for ucla server. > bo: we started prototyping. locking mechanism. persisting part of a > mage document. the spec changed after that. andrew's delta model. a > little different from what we were prototyping. > actual persistence will be done in the assay portion of our server. > gh: grant focuses on write back for genome portion, and this was a big > chunk of the grant. ends in end of may or june. > > ad: delta model was: here's a list of add, delete, modify in one > document. An issue was if you change an existing record, do you give > it a new identifier? > gh: you never modify something with an existing id, just make a new > one, new id, with a pointer back to old one. Ed Griffith said this a > month ago. I like this idea. but told we cannot make this requirement > on the database. but very few dbs will be writeback, so it's not > affecting all servers > > ad: making new uris, client has to know the new uri for the old > one. needs to return a mapping document. > if network crashes partway through, client won't know mapping is and > will be lost. > gh: server doesn't know if client got it. it could act(?) back. > > gh: if a response from http server dies, server has no way to know. 
> ad: There could be a proxy in the middle, or isp's proxy server. The > server sent it successfully to the proxy, but never made it to the > client. > > gh: how is this dealt with for commits into relational dbs? same thing > applies > ad: don't know > ee: could ask for everything in this region. > ad: have a new element that says 'i used to be this'. > bo: you do an insert in a db, to get last pk that was issued. client > talks back to server, give me last feature uri that was provisioned on > my connection. so the client is in control. > > sc: it's up to client to get confirmation from server. If it failed to > get the response after sending in the modification request, it could > request that the server send it again. > > ad: (drawing on whiteboard) two stage strategy, get a transaction state. > > post "get transaction url" > <--------------- > post (put?) to transaction URL > -------------> > can do multiple (if identical) > ----------> > ----------> > Get was successful and here's transformation info > <--------------- > > ad: server can hold transformation info for some timespan in case > client needs to re-fetch. > > gh: I'm more insterested in getting a server up than a client > regarding writeback. complex parts of the client are already > implemented (apollo). > > gh: locks are region based not feature based. > ad: can't lock... > > gh: we can talk about how to trigger ucla locking mechanism. > bo: did flock transactional locking the suggested in perl > cookbook. mage document has content. server locks an id using flock, > (for assay das). > gh: to lock a region on the genome, lock on all ids for features in > this region. > bo: make a file containing all the ids that are locked. flock this > file. > > ad: file locking is frought with problems. why not keep it in the > database and let the db lock it for you. don't let perl + file system > do it for you. there could be fs problems. nfs isn't good at that. a > database is much more reliable. 
>
> bo: I went with perl flock mechanism since you could have other
> non-database sources (though so far it's all db).
>
> [A] steve, allen send brian code tips regarding locking.
>
> gh: putting aside pushing large data chunks into the server, for
> curation it's ok if protocol is a little error prone, since the
> curator-caused errors will be much more likely/common.
>
> ad: UK folks haven't done any writeback work as far as I know.
> gh: they haven't billed us in 2 years. Tony cox is contact, ed
> griffith is main developer.
> ad: andreas and thomas are not funded by this grant or the next one.
> gh: they are already funded by other means.
>
> ad: if someone want's to change an annotation should they need to get
> a lock first or can it work like cvs? do it if it can, get lock,
> release lock in one transaction.
> ee: that's my preference.
>
> ad: if every feature has it's own id, you know if it's...
>
> ee: some servers might not have any writeback facility at
> all. conflicts will be rare.
>
> [A] ask ed/tony on whether they plan to have any writeback facility
>
> gh: ed g wanted to work on client to do writeback, don't know who
> would work on a server there.
> ad: someone else, can't remember - roy?
> gh: unless we hear back from sanger, the highest priority for ucla
> folks after updating server for v300, is working server-side
> writeback.
>
> gh: spec freeze is for the read portion. the writeback portion will
> have to change as needed.
> ad: and arithmetic? ;-)
>
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das2

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008)

From lstein at cshl.edu  Mon Mar 20 17:32:40 2006
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 20 Mar 2006 12:32:40 -0500
Subject: [DAS2] query language description
In-Reply-To: 
References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com>
Message-ID: <200603201232.41522.lstein@cshl.edu>

The current filter query language, which provides one level of ANDs
and a nested level of ORs, satisfies our use cases. It is not clear to
me what additional benefit we'll get from a composable query language.
Note that none of the popular and functional genome information
sources -- NCBI, UCSC, Ensembl or BioMart -- offer a composable query
language, and there does not seem to be rioting on the streets!

Lincoln

On Friday 17 March 2006 19:20, chris mungall wrote:
> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote:
> >> right now they are forced bypass the constraint language and go direct
> >> to SQL.
> >
> > In addition, we provide defined ways for a server to indicate
> > that there are additional ways to query the server.
>
> I was positing this as a bad feature, not a good one. or at least a
> symptom of an incorrectly designed system (at least in the case of the
> GO DB API - it may not carry forward to DAS - though if you're going to
> allow querying by terms...)
>
> >> None of these really lit into the DAS paradigm. I'm guessing you want
> >> something simple that can be used as easily as an API with get-by-X
> >> methods but will seamlessly blend into something more powerful. I
> >> think what you have is on the right lines. I'm just arguing to make
> >> this language composable from the outset, so that it can be extended
> >> to whatever expressivity is required in the future, without bolting on
> >> a new query system that's incompatible with the existing one.
> > > > We have two ways to compose the system. If the simple query language > > is extended, for example, to support word searches of the text field > > instead of substring searches, then a server can say > > > > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > > > > > This is backwards compatible, so the normal DAS queries work. But > > a client can recognize the new feature and support whatever new filters > > that 'word-search' indicates, eg > > > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre* > > > > (finds features with notes containing words starting with 'Andre' ) > > > > These are composable. For example, suppose Sanger allows modification > > date searches of curation events. Then it might say > > > > > query_uri="http://somewhere.over.rainbow/server.cgi"> > > > > > > > > so this is limited to single-argument search functions? > > > and I can search for notes containing words starting with "Andre" > > which were modified by "dalke" between 2002 and 2005 by doing > > > > http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*& > > modified-by=dalke&modified-before=2005&modified-after=2002 > > but the compositionality is always associative since the CGI parameter > constraint forbids nesting > > > An advantage to the simple boolean logic of the current system > > is that the GUI interface is easy, and in line with existing > > simple search systems. > > there's nothing preventing you from implementing a simple GUI on top of > an expressive system - there is nothing forcing you to use the > expressivity > > > If someone wants to implement a new search system which is > > not backwards compatible then the server can indicate that > > alternative with a new CAPABILITY. 
Suppose Thomas at Sanger > > comes up with a new search mechanism based on an object query > > language he invented, > > > > > query_uri="http://sanger.ac.uk/oql-search" /> > > > > The Sanger and EBI clients might understand that and support > > a more complex GUI, eg, with a text box interface. Everyone > > else must ignore unknown capability types. > > but this doesn't integrate with the existing query system > > > Then that would be POSTED (or whatever the protocol defines) > > to the given URL, which returns back whatever results are > > desired. > > > > Or the server can point to a public MySQL port, like > > > > > query_uri="mysql://username:password at hostname:port/databasename" > > /> > > > > That's what you are doing to bypass the syntax, except that > > here it isn't a bypass; you can define the new interface in > > the DAS sources document. > > > >> The generic language could just be some kind of simple > >> extensible function syntax for search terms, boolean operators, > >> and some kind of (optional) nesting syntax. > > > > Which syntax? Is it supposed to be easy for people to write? > > Text oriented? Or tree structured, like XML, or SQL-like? > > I'd favour some concrete asbtract syntax that looks much like the > existing DAS QL > > > And which clients and servers will implement that search > > language? > > all servers. clients optional > > > If there was a generic language it would allow > > OR("on segment Chr1 between 1000 and 2000", > > "on segment ChrX between 99 and 777") > > which is something we are expressly not allowing in DAS2 > > queries. It doesn't make sense for the target applications > > and by excluding it it simplifies the server development, > > which means less chance for bugs. 
> > this example is pointless but it's easy to imagine plenty of ontology > term queries or other queries in which this would be useful > > I guess I depart from the normal DAS philosophy - I don't see this > being a high barrier for entry for servers, if they're not up to this > it'll probably be a buggy hacky server anyway > > > Also, I personally haven't figured out a decent way to > > do a GUI composition of a complex boolean query which is > > as easy as learning the query language in the first place. > > doesn't mean it doesn't exist. > > i'm not sure what's hard about having say, a clipboard of favourite > queries, then allowing some kind of drag-and-drop composition > > > A more generic language implementation is a lot of overhead > > if most (80%? 90%) need basic searches, and many of the > > rest can fake it by breaking a request into parts and > > doing the boolean logic on the client side. > > this is always an option - if the user doesn't mind the additional > possibly very high overhead. it's just a little bit of a depressing > approach, as if Codd's seminal paper from 1970 or whenever it was never > happened. > > > Feedback I've heard so far is that DAS1 queries were > > acceptable, with only a few new search fields needed. > > > >> hmm, not sure how useful this would be - surely you'd want something > >> more dasmodel-aware? > > > > The example I gave was a bad one. What I meant was to show > > how there's an extension point so someone can develop a new > > search interface and clients can know that the new functionality > > exists, without having to change the DAS spec. > > ok > > that's probably all I've got to say on the matter, sorry for being > irksome. I guess I'm fundamentally missing something, that is, why wrap > simple and expressive declarative query languages with limited ad-hoc > constraint systems with consciously limited expressivity and limited > means of extensibility.. 
> > cheers > chris > > > Andrew > > dalke at dalkescientific.com > > > > _______________________________________________ > > DAS2 mailing list > > DAS2 at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/das2 > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From Gregg_Helt at affymetrix.com Mon Mar 20 17:40:19 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 20 Mar 2006 09:40:19 -0800 Subject: [DAS2] call today? Message-ID: Apologies, I forgot to post that today's DAS/2 teleconference was cancelled. The feeling on Friday was that after the code sprint last week we needed a break. The teleconference will resume next week on the regular schedule (Mondays at 9:30 AM Pacific time). Thanks, Gregg > -----Original Message----- > From: Andreas Prlic [mailto:ap3 at sanger.ac.uk] > Sent: Monday, March 20, 2006 9:02 AM > To: Andrew Dalke; Helt,Gregg > Cc: Thomas Down > Subject: call today? > > Hi Dasians, > > do we have a conference call today? > > Cheers, > Andreas > > ----------------------------------------------------------------------- > > Andreas Prlic Wellcome Trust Sanger Institute > Hinxton, Cambridge CB10 1SA, UK > +44 (0) 1223 49 6891 From cjm at fruitfly.org Mon Mar 20 23:45:46 2006 From: cjm at fruitfly.org (chris mungall) Date: Mon, 20 Mar 2006 15:45:46 -0800 Subject: [DAS2] query language description In-Reply-To: <200603201232.41522.lstein@cshl.edu> References: <7ed68f2baa961f932e369cb449371439@dalkescientific.com> <200603201232.41522.lstein@cshl.edu> Message-ID: <7900d1398d5045a268a5f6fe51af529d@fruitfly.org> I guess things need to be left open for a DAS/3... 
On Mar 20, 2006, at 9:32 AM, Lincoln Stein wrote: > The current filter query language, which provides one level of ANDs > and a > nested level of ORs, satisfies our use cases. It is not clear to me > what > additional benefit we'll get from a composable query language. Note > that none > of the popular and functional genome information sources -- NCBI, UCSC, > Ensembl or BioMart -- offer a composable query language, and there > does not > seem to be rioting on the streets! > > Lincoln > > > On Friday 17 March 2006 19:20, chris mungall wrote: >> On Mar 16, 2006, at 6:05 PM, Andrew Dalke wrote: >>>> right now they are forced to bypass the constraint language and go >>>> direct >>>> to SQL. >>> >>> In addition, we provide defined ways for a server to indicate >>> that there are additional ways to query the server. >> >> I was positing this as a bad feature, not a good one. or at least a >> symptom of an incorrectly designed system (at least in the case of the >> GO DB API - it may not carry forward to DAS - though if you're going >> to >> allow querying by terms...) >> >>>> None of these really fit into the DAS paradigm. I'm guessing you >>>> want >>>> something simple that can be used as easily as an API with get-by-X >>>> methods but will seamlessly blend into something more powerful. I >>>> think what you have is on the right lines. I'm just arguing to make >>>> this language composable from the outset, so that it can be extended >>>> to whatever expressivity is required in the future, without bolting >>>> on >>>> a new query system that's incompatible with the existing one. >>> >>> We have two ways to compose the system. If the simple query language >>> is extended, for example, to support word searches of the text field >>> instead of substring searches, then a server can say >>> >>> >> query_uri="http://somewhere.over.rainbow/server.cgi"> >>> >>> >>> >>> This is backwards compatible, so the normal DAS queries work. 
But >>> a client can recognize the new feature and support whatever new >>> filters >>> that 'word-search' indicates, eg >>> >>> http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre* >>> >>> (finds features with notes containing words starting with 'Andre' ) >>> >>> These are composable. For example, suppose Sanger allows >>> modification >>> date searches of curation events. Then it might say >>> >>> >> query_uri="http://somewhere.over.rainbow/server.cgi"> >>> >>> >>> >> >> so this is limited to single-argument search functions? >> >>> and I can search for notes containing words starting with "Andre" >>> which were modified by "dalke" between 2002 and 2005 by doing >>> >>> http://somewhere.over.rainbox/server.cgi?note-wordsearch=Andre*& >>> modified-by=dalke&modified-before=2005&modified-after=2002 >> >> but the compositionality is always associative since the CGI parameter >> constraint forbids nesting >> >>> An advantage to the simple boolean logic of the current system >>> is that the GUI interface is easy, and in line with existing >>> simple search systems. >> >> there's nothing preventing you from implementing a simple GUI on top >> of >> an expressive system - there is nothing forcing you to use the >> expressivity >> >>> If someone wants to implement a new search system which is >>> not backwards compatible then the server can indicate that >>> alternative with a new CAPABILITY. Suppose Thomas at Sanger >>> comes up with a new search mechanism based on an object query >>> language he invented, >>> >>> >> query_uri="http://sanger.ac.uk/oql-search" /> >>> >>> The Sanger and EBI clients might understand that and support >>> a more complex GUI, eg, with a text box interface. Everyone >>> else must ignore unknown capability types. >> >> but this doesn't integrate with the existing query system >> >>> Then that would be POSTED (or whatever the protocol defines) >>> to the given URL, which returns back whatever results are >>> desired. 
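The composition Andrew describes, where each extension contributes independent CGI parameters that are implicitly ANDed onto one request, might look like this in client code. The helper name is made up for illustration; the parameter names are copied from the example in the exchange above:

```python
from urllib.parse import urlencode

def compose_filters(query_uri, filters):
    """AND together any number of simple filters by appending them
    all as CGI parameters on a single request."""
    return query_uri + "?" + urlencode(filters)

# Notes containing words starting with "Andre", modified by dalke
# between 2002 and 2005, as in Andrew's example:
url = compose_filters("http://somewhere.over.rainbow/server.cgi", {
    "note-wordsearch": "Andre*",
    "modified-by": "dalke",
    "modified-before": "2005",
    "modified-after": "2002",
})
```

As chris observes in reply, this kind of composition is flat: the CGI-parameter constraint gives AND over filters with no nesting.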
>>> >>> Or the server can point to a public MySQL port, like >>> >>> >> query_uri="mysql://username:password at hostname:port/databasename" >>> /> >>> >>> That's what you are doing to bypass the syntax, except that >>> here it isn't a bypass; you can define the new interface in >>> the DAS sources document. >>> >>>> The generic language could just be some kind of simple >>>> extensible function syntax for search terms, boolean operators, >>>> and some kind of (optional) nesting syntax. >>> >>> Which syntax? Is it supposed to be easy for people to write? >>> Text oriented? Or tree structured, like XML, or SQL-like? >> >> I'd favour some concrete abstract syntax that looks much like the >> existing DAS QL >> >>> And which clients and servers will implement that search >>> language? >> >> all servers. clients optional >> >>> If there was a generic language it would allow >>> OR("on segment Chr1 between 1000 and 2000", >>> "on segment ChrX between 99 and 777") >>> which is something we are expressly not allowing in DAS2 >>> queries. It doesn't make sense for the target applications >>> and by excluding it it simplifies the server development, >>> which means less chance for bugs. >> >> this example is pointless but it's easy to imagine plenty of ontology >> term queries or other queries in which this would be useful >> >> I guess I depart from the normal DAS philosophy - I don't see this >> being a high barrier for entry for servers, if they're not up to this >> it'll probably be a buggy hacky server anyway >> >>> Also, I personally haven't figured out a decent way to >>> do a GUI composition of a complex boolean query which is >>> as easy as learning the query language in the first place. >> >> doesn't mean it doesn't exist. >> >> i'm not sure what's hard about having say, a clipboard of favourite >> queries, then allowing some kind of drag-and-drop composition >> >>> A more generic language implementation is a lot of overhead >>> if most (80%? 
90%) need basic searches, and many of the >>> rest can fake it by breaking a request into parts and >>> doing the boolean logic on the client side. >> >> this is always an option - if the user doesn't mind the additional >> possibly very high overhead. it's just a little bit of a depressing >> approach, as if Codd's seminal paper from 1970 or whenever it was >> never >> happened. >> >>> Feedback I've heard so far is that DAS1 queries were >>> acceptable, with only a few new search fields needed. >>> >>>> hmm, not sure how useful this would be - surely you'd want something >>>> more dasmodel-aware? >>> >>> The example I gave was a bad one. What I meant was to show >>> how there's an extension point so someone can develop a new >>> search interface and clients can know that the new functionality >>> exists, without having to change the DAS spec. >> >> ok >> >> that's probably all I've got to say on the matter, sorry for being >> irksome. I guess I'm fundamentally missing something, that is, why >> wrap >> simple and expressive declarative query languages with limited ad-hoc >> constraint systems with consciously limited expressivity and limited >> means of extensibility.. >> >> cheers >> chris >> >>> Andrew >>> dalke at dalkescientific.com >>> >>> _______________________________________________ >>> DAS2 mailing list >>> DAS2 at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/das2 >> >> _______________________________________________ >> DAS2 mailing list >> DAS2 at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/das2 > > -- > Lincoln D. 
Stein > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 > FOR URGENT MESSAGES & SCHEDULING, > PLEASE CONTACT MY ASSISTANT, > SANDRA MICHELSEN, AT michelse at cshl.edu (516 367-5008) From dalke at dalkescientific.com Tue Mar 21 23:21:11 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Tue, 21 Mar 2006 15:21:11 -0800 Subject: [DAS2] complex features Message-ID: I've been working on the data model some, trying to get a feel for complex features. I've also been evaluating how GFF3 handles them. Both use a parent/child link, though GFF3 only has the reference to the parent while DAS has both. That means DAS clients can determine when all of the complex features have been downloaded. GFF3 potentially requires waiting until the end of the library, though there is a way to hint that all the results have been returned. Both allow complex graphs. That is, both allow cycles. I assume we are restricting complex features to DAGs, but even then the following is possible

[root1]  [root2]  [root3]
   | \      |      /
   | \      |     /
   | ------------------
   | |     node 4     |
   | ------------------
   | /
   | /
   |/
[node 5]

Node 4 has three parents (root1, root2 and root3) and node 5 has two parents (root1 and node4) This may or may not make biological sense. I don't know. I only point out that it's there. I feel that complex annotations must only have a single root element, even if it's a synthetic one with no location. Next, consider writeback, with the following two complex features

[root1]             [root2]
   |  \                |
   |   \               |
   |    \              |
[node1.1]  [node1.2]  [node2.1]

Suppose someone adds a new "connector" node

>-->---.
|      V
[root1]   |   [root2]
  | \     |      |
  |  \    |      |
  |   \   ^      |
[node1.1] [node1.2] | [node2.1]
      |             |
      V             |
  [connector]-->--->--^

Should that sort of thing be allowed? What's the model for the behavior? It seems to me there's a missing concept in DAS relating to complex features. My model is that the "complex feature" is its own concept, which I've been calling an "annotation". 
All simple features are annotations. The connected nodes of a complex feature are also annotations. As such, two annotations cannot be combined like this. Writeback only occurs at the annotation level, in that new feature elements cannot be used to connect two existing annotations. We might also consider having a new interface for annotations (complex features), so they can be referred to by URI. I don't think that's needed right now. Andrew dalke at dalkescientific.com From cjm at fruitfly.org Wed Mar 22 00:43:49 2006 From: cjm at fruitfly.org (chris mungall) Date: Tue, 21 Mar 2006 16:43:49 -0800 Subject: [DAS2] complex features In-Reply-To: References: Message-ID: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> The GFF3 spec says that Parent can only be used to indicate part_of relations. If we go by the definition of part_of in the OBO relations ontology, or any other definition of part_of (there are many), then cycles are explicitly verboten, although the GFF3 docs do not state this. There's no reason in general why part_of graphs should have a single root, although it's certainly desirable from a software perspective. Dicistronic genes throw a bit of a spanner in the works. There's nothing to stop you adding a fake root, or referring to the maximally connected graph as an entity in its own right however. I don't know enough about DAS/2 to be helpful with the writeback example. It looks like your example below is a gene merge. On Mar 21, 2006, at 3:21 PM, Andrew Dalke wrote: > I've been working on the data model some, trying to get a feel > for complex features. I've also been evaluating how GFF3 handles > them. > > Both use a parent/child link, though GFF3 only has the reference > to the parent while DAS has both. That means DAS clients can > determine when all of the complex features have been downloaded. > GFF3 potentially requires waiting until the end of the library, > though there is a way to hint that all the results have been > returned. 
> > Both allow complex graphs. That is, both allow cycles. I > assume we are restricting complex features to DAGs, but even > then the following is possible
>
> [root1]  [root2]  [root3]
>    | \      |      /
>    | \      |     /
>    | ------------------
>    | |     node 4     |
>    | ------------------
>    | /
>    | /
>    |/
> [node 5]
>
> Node 4 has three parents (root1, root2 and root3) and > node 5 has two parents (root1 and node4) > > This may or may not make biological sense. I don't know. I > only point out that it's there. > > I feel that complex annotations must only have a single root > element, even if it's a synthetic one with no location. > > Next, consider writeback, with the following two complex features
>
> [root1]             [root2]
>    |  \                |
>    |   \               |
>    |    \              |
> [node1.1]  [node1.2]  [node2.1]
>
> Suppose someone adds a new "connector" node
>
> >-->---.
> |      V
> [root1]   |   [root2]
>   | \     |      |
>   |  \    |      |
>   |   \   ^      |
> [node1.1] [node1.2] | [node2.1]
>       |             |
>       V             |
>   [connector]-->--->--^
>
> Should that sort of thing be allowed? What's the model > for the behavior? > > It seems to me there's a missing concept in DAS relating to > complex features. My model is that the "complex feature" is > its own concept, which I've been calling an "annotation". > All simple features are annotations. The connected nodes of > a complex feature are also annotations. > > As such, two annotations cannot be combined like this. > Writeback only occurs at the annotation level, in that > new feature elements cannot be used to connect two existing > annotations. > > We might also consider having a new interface for annotations > (complex features), so they can be referred to by URI. I > don't think that's needed right now. 
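Andrew's single-root rule, together with the cycle ban, is easy for a client to check once every feature lists its parents. A rough sketch using a made-up dict representation (feature id mapped to its parent ids), not the actual DAS/2 feature format:

```python
def check_complex_feature(features):
    """Verify a complex feature group is a single-rooted DAG.
    `features` maps feature id -> list of parent ids."""
    roots = [fid for fid, parents in features.items() if not parents]
    if len(roots) != 1:
        raise ValueError(f"expected one root, found {len(roots)}")
    # Depth-first walk of child->parent links, colouring nodes to
    # reject cycles (grey = on the current path).
    WHITE, GREY, BLACK = 0, 1, 2
    state = dict.fromkeys(features, WHITE)
    def visit(fid):
        if state[fid] == GREY:
            raise ValueError(f"cycle through {fid}")
        if state[fid] == WHITE:
            state[fid] = GREY
            for parent in features[fid]:
                visit(parent)
            state[fid] = BLACK
    for fid in features:
        visit(fid)
    return roots[0]

# A plain gene -> transcript -> exon tree passes; Andrew's diagram
# with node 4 under three roots fails the single-root test.
tree = {"gene": [], "mRNA": ["gene"], "exon1": ["mRNA"], "exon2": ["mRNA"]}
```

A synthetic root, as Andrew suggests, would let a multi-rooted group pass the first check without changing the cycle test.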
> > Andrew > dalke at dalkescientific.com > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From boconnor at ucla.edu Wed Mar 22 00:47:51 2006 From: boconnor at ucla.edu (Brian O'Connor) Date: Tue, 21 Mar 2006 16:47:51 -0800 Subject: [DAS2] das.biopackages.net Message-ID: <44209EB7.9070008@ucla.edu> The DAS/2 server located at das.biopackages.net may be unavailable on and off for the next hour or so. Just wanted to let everyone know in case someone is using it. --Brian From dalke at dalkescientific.com Thu Mar 23 21:44:00 2006 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 23 Mar 2006 13:44:00 -0800 Subject: [DAS2] complex features In-Reply-To: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> References: <3879834dc8786f628c68e47a076c1e90@fruitfly.org> Message-ID: <53840452abca7236130efd4e57f42aef@dalkescientific.com> chris: > The GFF3 spec says that Parent can only be used to indicate part_of > relations. If we go by the definition of part_of in the OBO relations > ontology, or any other definition of part_of (there are many), then > cycles are explicitly verboten, although the GFF3 docs do not state > this. It looks like the most recent spec at http://song.sourceforge.net/gff3.shtml does state this, although the earlier one did not: "A Parent relationship between two features that is not one of the Part-Of relationships listed in SO should trigger a parse exception. Similarly, a set of Parent relationships that would cause a cycle should also trigger an exception." > There's no reason in general why part_of graphs should have a single > root, although it's certainly desirable from a software perspective. > Dicistronic genes throw a bit of a spanner in the works. There's nothing > to stop you adding a fake root, or referring to the maximally connected > graph as an entity in its own right however. 
I've been working with GFF3 data for a few days now, trying to catch the different cases. It isn't hard, but it had been a long time since I worried about cycle detection. The biggest problem has been keeping all the "could be a parent" elements around until the entire data set is finished. Except for features with no ID and no Parents, parsers need to go to the end of the file (or no-forward-references line) before being able to do anything with the data. In DAS it's easier because each feature lists all parents and children, so it's possible to detect when a complex feature is ready. Even then it requires a bit of thinking to handle cases with multiple roots. It would be much easier if either all complex features were in an element or if there was a unique name to tie them together. Another solution is to make the problem simpler. I see, for example, that the biopython doesn't have any gff code and the biojava one only works at the single feature level. Only bioperl implements a gff3 parser with support for complex features, but it assumes all complex features are single rooted and that the features are topologically sorted, so that parents come before children. It also looks like a diamond structure (single root, two children, both with the same child) is supported on input but the output assumes features are trees. For example, I tried it just now on dmel-4-r4.3.gff from wormbase, which I'm finding to be a bad example of what a GFF file should look like. It contains one duplicate ID, which bioperl catches and dies on. I fixed it. It then complains with a lot of MSG: Bio::SeqFeature::Annotated=HASH(0xba4a93c) is not contained within parent feature, and expansion is not valid, ignoring. because the features are not topologically sorted, as in this (trimmed) example. The order is the same as in the file.

4 sim4:na_dbEST.same.dmel match_part 5175 5627 ... Parent=88682278868229;Name=GH01459.5prime
4 sim4:na_dbEST.same.dmel match 5175 5627 ... 
ID=88682278868229;Name=GH The simpler the data model we use (eg, single rooted, output must be topologically sorted with parents first) then the more likely it is for client and server code to be correct and the more likely there will be more DAS code. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Fri Mar 24 18:19:41 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Fri, 24 Mar 2006 18:19:41 +0000 Subject: [DAS2] 100th das1 source in registry Message-ID: <23fe2aa8d3c4a9afc28782b3d3e58032@sanger.ac.uk> Hi! Today the 100th DAS1 source was registered in the DAS registration server at http://das.sanger.ac.uk/registry/ It currently counts 101 DAS sources from 23 institutions in 9 countries. The purpose of the DAS registration service is to keep track which DAS services are available and to help with automated discovery of new DAS servers on the client side. Regards, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Fri Mar 24 18:37:21 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Fri, 24 Mar 2006 10:37:21 -0800 Subject: [DAS2] 100th das1 source in registry Message-ID: Congratulations! On a related note, is there a way to automatically register DAS/2 servers yet? If not, can I send you info to add the Affymetrix test DAS/2 server to the registry? Thanks, Gregg > -----Original Message----- > From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open- > bio.org] On Behalf Of Andreas Prlic > Sent: Friday, March 24, 2006 10:20 AM > To: DAS/2 > Subject: [DAS2] 100th das1 source in registry > > Hi! > > Today the 100th DAS1 source was registered in the DAS registration > server at > > http://das.sanger.ac.uk/registry/ > > It currently counts 101 DAS sources from 23 institutions in 9 countries. 
> > The purpose of the DAS registration service is to keep track which DAS > services are available > and to help with automated discovery of new DAS servers on the client > side. > > Regards, > Andreas > > > ----------------------------------------------------------------------- > > Andreas Prlic Wellcome Trust Sanger Institute > Hinxton, Cambridge CB10 1SA, UK > +44 (0) 1223 49 6891 > > _______________________________________________ > DAS2 mailing list > DAS2 at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/das2 From ap3 at sanger.ac.uk Sat Mar 25 11:13:06 2006 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Sat, 25 Mar 2006 11:13:06 +0000 Subject: [DAS2] 100th das1 source in registry In-Reply-To: References: Message-ID: > On a related note, is there a way to automatically register DAS/2 > servers yet? A beta version can be tried at the toy-registry at http://www.spice-3d.org/dasregistry/registerDas2Source.jsp and the results will be visible at http://www.spice-3d.org/dasregistry/das2/sources - so far this provides a simple upload mechanism that is based on the sources description. what is still missing is a validation of the user provided data ("does this request really give a features response?") plus other things like a html representation of the das2 servers. I think it would be great if Andrew's Dasypus server could provide an interface to the validation mechanism that could be used by programs. If validation fails the response could contain a link, to point the user to the nice error report web page. will be abroad next week so can't join for the call... 
Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Gregg_Helt at affymetrix.com Mon Mar 27 16:24:53 2006 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 27 Mar 2006 08:24:53 -0800 Subject: [DAS2] Agenda for today's teleconference Message-ID: We're back on the standard DAS/2 teleconference schedule, every Monday at 9:30 AM Pacific time. Suggestions for today's agenda: Code sprint summary DAS/2 grant status Writeback spec & implementation ??? Teleconference # US: 800-531-3250 International: 303-928-2693 Conference ID: 2879055 Passcode: 1365 From Steve_Chervitz at affymetrix.com Mon Mar 27 19:05:28 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 27 Mar 2006 11:05:28 -0800 Subject: [DAS2] Notes from the weekly DAS/2 teleconference, 27 Mar 2006 Message-ID: Notes from the weekly DAS/2 teleconference, 27 Mar 2006 $Id: das2-teleconf-2006-03-27.txt,v 1.1 2006/03/27 19:03:30 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Gregg Helt CSHL: Lincoln Stein Dalke Scientific: Andrew Dalke UC Berkeley: Nomi Harris UCLA: Allen Day Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2006. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit. 
Proposed agenda: * Code sprint summary * DAS/2 grant status * Writeback spec & implementation [Notetaker: missed the first 40min - apologies] Topic: Code sprint summary -------------------------- gh: pleased with our progress during the last code sprint (13-17 Mar) [Notetaker: detailed summaries of what folks did during this code sprint are described here: http://lists.open-bio.org/pipermail/das2/2006-March/000668.html ] Topic: Writeback ---------------- [Discussion in progress] ls: in my model, every feature has a unique id, when you update it, it's going to make the change to the object and not create a new one. the object is associated with url in some way, when you update the position of this exon, it's going to change some attributes of it. gh: thomas proposed the alternative: every time you change a feature you create a new one with a pointer back to the old one. ad: can't speak for what db implementers will do for versioning of features. only talking about merging from different complex features. So only when you merge from complex ones. ls: this is the history tracking business. writeback will explicitly support merges and splits. ad: how detailed does the spec need to be? ls: driven by requirements. ad: what are the reqts? I can't go further without more details. roy said every modification gets new version, so you could do time travel, if your db supported that. ls: does igb or apollo explicitly support merges and splits among transcripts? gh: yes. curation in igb is experimental (now turned off). but it does support these. as does apollo. so these are essential. ls: writeback should have instructions for how feature will adopt children of a subfeature. one feature adopts children of the other and previous feature is now deprecated. there's a specific set of operations for creating new features, renaming, splitting, and merging. perhaps Nomi should write down what operations apollo supports. 
nh: yes, all those are supported as well as things like adjusting endpoints of start of translation. apollo can merge transcripts within a gene and between genes (which offers to merge the associated genes). curators can do 'splurge' - a split, merge combo. ls: that sounds like suzi's nomenclature. gh: the db that apollo writes back to, do changes create new versions of feature or change the feature itself? nh: not sure. mark did the work with chado. I know they were doing something to rewrite the entire feature if anything changed. [A] nomi will ask Mark to join in discussion next week (3 April). aday: what fraction of the operations are doing simple vs complex things? e.g., revising the gene model. nh: revision happens a lot. mostly adjusting endpoints. splits and merges are infrequent. adding annotation. But it doesn't matter how infrequent the operations are, we either support them or we don't. ad: when there are changes in the model, how does the client get notified that the change occurred? nh: that's tricky. gh: this is outside the scope of the das/2 spec itself. as long as we have locks to prevent simultaneous modification, that is sufficient. ad: there's no mechanism for polling server. gh: yes, just requery server. gh: but your client doesn't do it. gh: I'm thinking of adding polling to get the last modified stuff. For now, one can simply re-start your session to see what has changed. aday: is the portion of writeback spec for modifying endpoints, simple add/delete of annotations stable? ad: the general idea is unchanged. gh: priority here is before next meeting: brian and allen read over writeback spec and identify any issues as implementers. aday: looking for an 80% solution. not dealing with inheritance, which is difficult. nh: splits and merges can be done with combos of simpler ops. aday: performance of operations will be affected. graph flattening and partial indexes. 
splits and merges will affect this table, so will have to trigger update of that table any time there's a split/merge. this will have big impact on query performance: could be 1-2 sec for yeast, 30-60 min for human. gh: what about if you do that update 1x/day? Then users would be working off a snapshot that was current as of the end of previous day. aday: caching on server responses will also be affected, unless we turn caching off. maybe I can tell apache to remove a subset of cached pages and leave others intact. aday: for tiling requests - server could find affected blocks and purge those, instead of purging the entire cache. gh: you can't rely on any client to use your tiling strategy. but could be helpful for those clients that use it. aday: basically we'll have to turn caching off when we start doing writeback. gh: is there a way for server to detect what has changed? gh: if database detects change it can flush cache for that sequence. aday: maybe. possibly the easiest way to do this is via tiling. gh: say you have two servers: 1) everything that can be edited 2) everything that has been edited (slower) aday: main server has all features and second server handles writeback, just writes to gff file, then cron runs once a night to merge the gff into the db. gh: separate dbs: 1) curation 2) everything that has been edited. aday: yes. persistent flat file adapter can be used for one of them. gh: this is the sort of detail I'm looking for w/r/t development of the writeback spec. [A] allen and brian look over writeback spec to discuss on 3 April. From nomi at fruitfly.org Mon Mar 27 19:42:59 2006 From: nomi at fruitfly.org (Nomi Harris) Date: Mon, 27 Mar 2006 11:42:59 -0800 Subject: [DAS2] Mark Gibson on Apollo writeback to Chado Message-ID: mark gibson said that he plans to attend next monday's DAS/2 teleconference. 
he also gave me permission to forward this message that he wrote recently in response to a group that is adapting apollo and wondered what he thought about direct-to-chado writeback vs. the use of chadoxml as an intermediate storage format. FlyBase Harvard prefers to use the latter approach because (we gather) they worry about possibly corrupting the database by having clients write directly to it. if anyone from harvard is reading this and feels that mark has misrepresented their approach, please set us straight! Nomi On 10 March 2006, Mark Gibson wrote: > Im rather biased as a I wrote the chado jdbc adapter [for Apollo], but let me put forth my > view of chado jdbc vs chado xml. > > The chado Jdbc adapter is transactional, the chado xml adapter is not. What this > means is jdbc only makes changes in the database that reflect what has actually > been changed in the apollo session, like updating a row in a table; with chado > xml you just get the whole dump. So if a synonym has been added jdbc will add a > row to the synonym table. For xml you will get the whole dump of the region you > were editing (probably a gene) no matter how small the edit. > > What I believe Harvard/Flybase then does (with chado xml) is wipe out the gene > from the database and reinsert the gene from the chado xml. The problem with > this approach is if you have data in the db thats not associated with apollo > (for flybase this would be phenotype data) then that will get wiped out as well, > and there has to be some way of reinstating non-apollo data. If you dont have > non-apollo data and dont intend on having it in the future this isnt a huge > issue I suppose. I think Harvard is integrating non-apollo data into their chado > database. 
> > I think what they are going to do is actually figure out all of the transactions > by comparing the chado xml with the chado database, which is what apollo already > does, but I'm not sure as Im not so in touch with them these days (as Im not > working with apollo these days - waiting for new grant to kick in). > > Since the paradigm with chado xml is wipe out & reload, then apollo has to make > sure it preserves every bit of the chado xml that came in. Theres a bunch of > stuff thats in chado/chado xml that the apollo datamodel is unconcerned with, > and has no need to be concerned with as its stuff that it doesnt visualize. In > other words apollos data model is solely for apollos task of visualizing data, > not for roundtripping what we call non-apollo data. In writing the chado xml > adapter for FlyBase, Nomi Harris had a heck of a time with these issues, and she > can elaborate on this I suppose. > > I'm personally not fond of chado xml because its basically a relational database > dump, so its extremely verbose. It redundantly has information for lots of joins > to data in other tables - like a cvterm entry can take 10 or 20 lines of chado > xml, and a given cvterm may be used a zillion times in a given chado xml file > (as every feature has a cvterm). So these files can get rather large. > > The solution for this verbose output is to use what I call macros in chado xml. > Macros are supported by xort. They take the 15 line cvterm entry and reduce it > to a line or 2 making the file size much more reasonable. The apollo chado xml > adapter does not support macros, so you have to use unmacro'd chado xml for > apollo purposes. Nomi Harris had a hard enough time getting the chado xml > adapter working for flybase(and did a great job with a harrowing task), that she > did not have time to take on the macro issue. 
> If you wanted macros (and smaller file sizes) you would have to add this
> functionality to the chado xml adapter (are there java programmers in your
> group?).
>
> One of the arguments against the jdbc adapter is that it's dangerous: it
> goes straight into the database, so if there are any bugs in the data
> adapter the database could get corrupted - some groups find this a bit
> precarious. This is a valid argument. I think there are two solutions here.
> One is to thoroughly test the adapter against a test database until you are
> confident that the bugs are hammered out.
>
> The other solution is not to go straight from apollo to the database. You
> can use an interim format and actually use apollo to get that interim format
> into the database. Of course, one choice for the interim format is chado
> xml, and then you are at the chado xml solution. The other choice is GAME
> xml. You can then use apollo to load game into the chado database, and this
> can be done at the command line (with batching) so you don't have to bring
> up the gui to do it. Chado xml can be loaded into chado via apollo as well
> (of course xort does this too, but not with transactions).
>
> So then the question is: if I'm not going to go straight into the database,
> why would I choose game over chado xml? Or, if I'm using chado xml, should I
> use apollo or xort to load into chado? I think if you are using chado xml it
> makes sense to use xort, as it is the tried & true technology for chado xml.
> The advantage of going through apollo is that it also uses the transactions
> from apollo (there's a transaction xml file) and thus writes back the edits
> in a transactional way, as mentioned above, rather than in a wipe out &
> reload fashion.
>
> GAME is also a tried & true technology that has been used with apollo in
> production at FlyBase (before chado came along) for many years now.
> One criticism of it has been that its DTD/XSD/schema has been a moving
> target and was never well described. That is not as true anymore: Nomi
> Harris has made an xsd for it as well as an rng. But I must confess that I
> have recently added the ability to have one-level annotations in game
> (previously one-level annotations had to be hacked in as three levels).
> Also, game is a lot less verbose than un-macro'd chado xml, as it more or
> less fits the apollo datamodel. One advantage of chado xml over game xml is
> that it is more flexible in taking on features of arbitrary depth.
>
> The chado xml adapter was developed for FlyBase and, as far as I know, has
> not been taken on by any other groups yet. Nomi can elaborate on this, but I
> think what this might mean is that there are places where things are
> FlyBase-specific. If you went with chado xml, the adapter would have to be
> generalized. It's a good exercise for the adapter to go through, but it will
> take a bit of work. Nomi can probably comment on how hard generalizing might
> be. I could be wrong about this, but I think the current status of the chado
> xml adapter is that Harvard has done a bunch of testing on it but hasn't put
> it into production yet.
>
> The jdbc adapter is being used by several groups, so it has been forced to
> be generalized. One thing I have found is that chado databases vary all too
> much from mod to mod (ontologies change). There is a configuration file for
> the jdbc adapter with settings for the differences that I encountered. I
> initially wrote it for Cold Spring Harbor's rice database, which will be
> used in classrooms. It's working for rice in theory, but they haven't
> actually used it much in the classroom yet. For rice the model is to save to
> game and use the apollo command line to save the game & transactions back to
> chado.
>
> Cyril Pommier, at INRA - URGI - Bioinformatique, has taken on the jdbc
> adapter for his group.
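The per-mod configuration file Mark mentions might look something like the following. This is a purely hypothetical sketch: the element names and settings here are invented for illustration and will not match the adapter's actual config format.

```xml
<!-- Hypothetical shape of a per-mod jdbc adapter config; every element
     name below is illustrative, not the adapter's real vocabulary -->
<chado-adapter>
  <database name="rice">
    <!-- mods disagree on ontology term names, so map them per database -->
    <term apollo="gene" chado="gene"/>
    <term apollo="transcript" chado="mRNA"/>
    <!-- toggle features like the one-level annotations Mark added -->
    <one-level-annotations>true</one-level-annotations>
  </database>
</chado-adapter>
```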
> I have cc'd him on this email, as I think he will have a lot to say about
> the jdbc adapter. Cyril has uncovered many bugs and has fixed a lot of them
> (thank you, Cyril), as he's a very savvy java programmer. He has also forced
> the adapter to generalize and brought about the evolution of the config file
> to adapt to chado differences. But as Cyril can attest (Cyril, feel free to
> elaborate), it has been a lot of work to get jdbc working for him. There
> were a lot of bugs to fix that we both went after. Hopefully it's now a bit
> more stable and the next db/mod won't have as many problems. I think Cyril
> is still at the test phase and hasn't gone into production (Cyril?).
>
> Berkeley is using the jdbc adapter for an in-house project. They are using
> the jdbc reader to dump game files (since straight jdbc reading is slow, as
> the chado db is rather slow), which are then loaded by a curator. They are
> saving game, and then I think Chris Mungall is xslting the game to chado
> xml, which is then saved with xort - or he is somehow writing the game in
> another way; I'm not actually sure. The Berkeley group drove the need for
> one-level annotations (in jdbc, game, & the apollo datamodel).
>
> Jonathan Crabtree at TIGR wrote the jdbc read adapter, and they use it
> there. I believe they intend to use the write adapter but don't do so yet
> (Jonathan?).
>
> I should mention that reading straight from chado over jdbc tends to be
> slow, as I find that chado is a slow database, at least for Berkeley. It
> really depends on the db vendor and the amount of data. TIGR's reading is
> actually really zippy. The workaround for slow chados is to dump game files,
> which read in pretty fast.
>
> In all fairness, you should probably email FlyBase (& Chris Mungall) and get
> the pros of using chado xml & xort, which they can give a far better answer
> on than I.
>
> Hope this helps,
> Mark

From dalke at dalkescientific.com Mon Mar 27 20:59:28 2006
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 27 Mar 2006 13:59:28 -0700
Subject: [DAS2] cell phone battery dead
Message-ID: <3d9298aced5c4efb7d9c34574fcf7618@dalkescientific.com>

Sorry about the drop out towards the end of today's conversation.
The battery on my phone died.

Andrew
dalke at dalkescientific.com