From edgrif at sanger.ac.uk Thu Dec 1 04:21:40 2005 From: edgrif at sanger.ac.uk (Ed Griffiths) Date: Thu, 1 Dec 2005 09:21:40 +0000 (GMT) Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Andrew, > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. Good to make this change but I also think that there should be a short section which compares/contrasts DAS 1.0 and DAS 2.0. It should be written to show that DAS 2.0 addresses the shortcomings of DAS 1.0 (e.g. updating protocol). Otherwise there is nothing major I would change about the intro., a good change to make. Ed -- ** PLEASE NOTE NEW ADDRESS/PHONE NUMBER ** ------------------------------------------------------------------------ | Ed Griffiths, Acedb development, Informatics Group, | | The Morgan Building, Sanger Institute, Wellcome Trust Genome Campus | | Hinxton, Cambridge CB10 1HH | | | | email: edgrif at sanger.ac.uk Tel: +44-1223-496844 Fax: +44-1223-494919 | ------------------------------------------------------------------------ From dalke at dalkescientific.com Sun Dec 4 18:35:49 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 5 Dec 2005 00:35:49 +0100 Subject: [DAS2] the /region subtree Message-ID: Looks like no one knows why it's there? Regions are important. But regions can (as far as I can tell) be described in a feature by pointing directly to the /sequence subtree and not through an intermediate /region object. Identifiable regions (contigs, ESTs) are important, but they can be stored as a feature, and take advantage of the other capabilities of features, like searching and returning alternative formats. I sent mail to Lincoln asking to talk about this but haven't heard back from him. Or anyone else want to explain it to me? 
Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Dec 5 11:50:56 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 05 Dec 2005 08:50:56 -0800 Subject: [DAS2] DAS/2 teleconference today Message-ID: Today's agenda: implementation status reports. Dialin (US): 800-531-3250 Dialin (Intl): 303-928-2693 Conference ID: 2879055 Steve From Steve_Chervitz at affymetrix.com Mon Dec 5 11:52:49 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 05 Dec 2005 08:52:49 -0800 Subject: [DAS2] Re: DAS/2 teleconference today In-Reply-To: Message-ID: Forgot to note the time: 9:30am PDT, 12:30pm EDT, 5:30pm GMT Steve > From: Steve Chervitz > Date: Mon, 05 Dec 2005 08:50:56 -0800 > To: DAS/2 > Conversation: DAS/2 teleconference today > Subject: DAS/2 teleconference today > > Today's agenda: implementation status reports. > > Dialin (US): 800-531-3250 > Dialin (Intl): 303-928-2693 > Conference ID: 2879055 > > Steve From Gregg_Helt at affymetrix.com Mon Dec 5 12:13:16 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 5 Dec 2005 09:13:16 -0800 Subject: What are regions for? (was Re: [DAS2] DAS intro) Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Tuesday, November 29, 2005 4:02 PM > To: DAS/2 > Subject: What are regions for? (was Re: [DAS2] DAS intro) > > Ed: > > I understand this as talking about coordinates in general, not the > > elements or "pos" attributes in the spec. Suzi specifically > > mentions chromosomes and contigs; one can definitely be backwards with > > respect to the other. But top-level regions in an assembly would > > probably all be chromosomes or all be contigs, rather than a mixture. > > I'm trying to figure out when people use the /region. Okay, for now ignore the whole issue of assembly. 
The need for something like /region doesn't depend on different levels of assembly. I do think handling assembly information is necessary, but that's for a different post. In the current spec the ".../region" query is the only way to _efficiently_ discover the set of sequences that can be used for region/sequence-based filters in feature queries. Pretty much any client that wants to restrict feature queries by sequence needs to use it. Now you _can_ determine this same info via an unqualified ".../sequence" query but then you're retrieving all the residues for each sequence -- this is about as inefficient as you can get. Another alternative to the current approach would be to combine /region and /sequence into one type of query, but to add modifiers (format param?) that specify what to return: .../sequence?format=x-das-regions (or something similar) .../sequence?format=fasta We would need to specify at least these two different formats to allow for both efficient retrieval of minimal information about the set of seqs and retrieval of sequence residues. ... > My questions, to summarize, are: > - why do we need a /region space when we can > 1. point directly to a sequence (for chromosome regions) and/or > 2. point to a "contig" or "assembly" or "region" feature type > (for other regions) > > - When would someone have regions which have more than one of > contigs, ESTs and chromosomes? Especially given that this > is the genome spec, so chromosome-level info is known, at > least enough for a rough assembly. > > In other words, what are regions for? > I'm really only addressing question 1.1, as I said before I think assembly is a separate issue. gregg From suzi at fruitfly.org Mon Dec 5 11:53:02 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Mon, 5 Dec 2005 08:53:02 -0800 Subject: [DAS2] DAS/2 teleconference today In-Reply-To: References: Message-ID: in just a few minutes? or is it on the half hour? 
On Dec 5, 2005, at 8:50 AM, Steve Chervitz wrote:

> Today's agenda: implementation status reports.
>
> Dialin (US): 800-531-3250
> Dialin (Intl): 303-928-2693
> Conference ID: 2879055
>
> Steve

From dalke at dalkescientific.com Mon Dec 5 12:43:04 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 5 Dec 2005 18:43:04 +0100
Subject: What are regions for? (was Re: [DAS2] DAS intro)
In-Reply-To:
References:
Message-ID: <76364b7909b0d4e7df3fc4bc649d10de@dalkescientific.com>

Gregg:
> Okay, for now ignore the whole issue of assembly. The need for
> something like /region doesn't depend on different levels of assembly.
> I do think handling assembly information is necessary, but that's for a
> different post.

Okay.

> In the current spec the ".../region" query is the only way to
> _efficiently_ discover the set of sequences that can be used for
> region/sequence-based filters in feature queries.

What's wrong with

  $VERSIONED_SOURCE/feature?type=region

That is, regions are a specific type of feature.

Andrew
dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Mon Dec 5 15:42:44 2005
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 05 Dec 2005 12:42:44 -0800
Subject: [DAS2] DAS/2 weekly meeting notes from 5 Dec 2005
Message-ID:

Notes from the weekly DAS/2 teleconference, 5 Dec 2005.

$Id: das2-teleconf-2005-12-05.txt,v 1.2 2005/12/05 19:55:54 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  UC Berkeley: Suzi Lewis
  Sanger: Andreas Prlic
  U Alabama: Ann Loraine

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2005.
Instructions on how to access this repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.

Today's topic: Implementation status reports
---------------------------------------------

Gregg
------
* Planning new IGB release in 2-3 weeks
* Revised IGB docs
* Updating the IGB DAS client
  - Want to make the DAS/1 iface more like DAS/2. Currently, they are isolated in the code and the interface. Maybe done w/in next 2 weeks.

Question: Has anyone used the agile 2D java stuff? ( http://www.cs.umd.edu/hcil/agile2d/ )

AL: Hasn't used it but used jazz (piccolo). Couldn't figure out how to get it to zoom in only 1D; they do 2D.

GH: Looking to speed up rendering in IGB. The Affy transcriptarium loses hardware acceleration when using normal java 2d/swing stuff (one cpu controlling 8 monitors).

agile 2d background: Billed as a drop-in replacement for java 2d that runs on top of open GL. Significant speed ups on solaris, other platforms. But Gregg finds it is slower on windows. Requires open GL for java that does lots of native stuff underneath. May work on mac OS X. The focus is on windows b/c of the need to use spotfire and other windows-only apps.

AP: Does it use double buffering?

GH: Yes. Swing does this. Builds volatile image. Hooks into hardware acceleration graphics card. If you can construct a volatile img, rendering goes through graphics hardware, which is a big speed up. But swing can't deal w/ multiple graphics cards (Direct X limitation?). Open gl may deal with things better in this regard.

AL: Maybe get a consultant. This is very specialized stuff.
GH: Getting some time from sun java graphics engineers could help.

Ed E
----
* Debugging for DAS/1 in IGB client
  - Fixed bug in the Affy DAS/1 server.
  - Also fixed a few bugs in parser (concerned note, length tags -- not heavily used).

Steve
------
* Working on hooking up tomcat to apache to enable apache to handle requests that are fed to an underlying servlet engine.
  - Have made progress but need to do more testing before installing on netaffxdas server machine.

GH: Ultimate goal is to release the Affy DAS/2 server as a servlet using standard configuration. Isn't this servlet-under-apache configuration fairly standard?

AP: We provide a war file. Users can plug this into their app server.

SC: The apache-tomcat config is quite common for ISPs or other situations where you need to redirect apache to different servlets depending on different companies, products, etc.

SC: Assuming the apache configuration is ready, is the DAS/2 server code ready to go?

GH: Depends on how much of the spec we want to support. Estimates 6mos to 1yr before releasing our server. Also probably just release DAS/2. DAS/1 is also just partial and does some custom things. Better to focus on fleshing out DAS/2 server.

SC: Possible complexities for apache/tomcat due to our port forwarding schemes (DAS/1 and DAS/2 servers running on different ports).

GH: There's no reason we can't run these servers on different machines.

GH: Steve's time for DAS work?

SC: Have to transition back to NetAffx now, but can still proceed with DAS work, just fewer cpu cycles. Can still take and post notes.

Ann
----
* Working on new arabidopsis 2010 proposal from last year.
  - Main aim of grant: setting up data archive for arab, and for people to add to it. Currently scattered. People put data on their website and move on. No easy way to collect it in one place.
  - Setting up a web service at UAB, give it an ID, info about id (e.g., whose affy, tair, agi). The server will return synonyms for that ID.
    Goal is to be a backend for IGB to do searches for an ID. Will talk to the UAB server via some data format.
  - Grant will include funding for uab and Affy (usability enhancements for IGB as front end for data repository, figuring out protocol for talking to id server)

GH: Will you run DAS/2 servers at UAB?

AL: Yes, we'll set up the servers, but not do server development itself. Can test and offer feedback from installation issues, etc.

GH: Will data repository be in Alabama?

AL: Yes. We have good sysadmin support. Originally TAIR would do it, but we've built things up since then. Grant will be smaller scale than last year. No comparative genomics viewers. Sticking to core expertise. Focus on getting one genome right. Also funding for genoviz sdk. Library upkeep. Tutorials.

GH: Reimplement IGB on top of piccolo!

AL: Looking to Affy on fancy graphics stuff since they know the libraries better than anyone else.

GH: Regarding your synonym resolver - have you talked to suzi about what they're doing?

AL: No. Looked at the flynome link steve mentioned (http://flynome.com). If there's a standard out there we'd love to use it.

LS: There is Gene Seer developed here - a synonym database for gene IDs. Supports many-to-many associations between synonyms and the canonical forms. No interface besides a web-form. May make sense to provide access to it via DAS/2.

GH: Layering a DAS/2 interface on top of the synonym service might be nice.

[A] Ann will help develop gene synonym server on top of DAS/2

EE: In IGB we have a need for synonym service for genome builds (e.g. hg16 = build34). Currently a hack in IGB, would be nice to do via DAS.

AL: Plan to start with arab first.

GH: Using Gene Seer scheme for your storage you will get a lot of other genomes for free.

AL: Is it open source?

LS: Yes, but it's not my code (ravi at cshl.edu). Heavily used in cshl. Loaded with all model orgs. This is now public: http://geneseer.cshl.edu
    - published in bioinformatics.
    - no arabidopsis (rat, mouse, human, celegans, yeast).

[A] Ann will contact Ravi re: Gene Seer & get letter to submit w/ her proposal.

SC: Does Gene Seer contain LSIDs?

LS: No, but could be added as a synonym. Every name has a species, and unique type (embl, ncbi). E.g., you can search for rad5 - in all species, or in yeast, or as a locus name in yeast. Used for the RNAi libraries here.

AL: Is anyone else building such a resource?

LS: Google maybe....

Andreas
-------
* Not much coding for DAS/2. Waiting for issues with spec to be discussed (next week). DAS/2 interface should be easy to add.

GH: Spec is changing, so impl is hard to do.

AP: Database and java server-side code has been running a while. A new interface should be easy to do.

Andrew (via Gregg)
------------------
* Planning to get the registry-related stuff formalized in the spec soon.
* Working on setting up web service for validation on open-bio server with Chris D.
* Updating spec, but it is slow going. Will join the call next week.

LS: Who's in charge of updating the spec now -- me, Steve, Andrew?

[A] Andrew is now responsible for updating the spec.

Lincoln
--------
* Update on the NCI DAS/2 project:
  - Awarded grant (funded by NCI) after 1 year of negotiating.
  - caBIG grid depends on caBIO java lib/API from NCI. They had dropped all DAS support when going from caCORE v2 -> v3.
  - Brian Gilman proposes to add DAS/2 support to caCORE v3.
    o First, create plugin API for caCORE. Will allow you to add in class libs without recompiling java code. Permits caCORE to speak arbitrary protocols (~3 mos work).
    o Then implement DAS/2 plugin (~3 mos work).
    o Starting on 12/17.
    o Will use Alan Day's biopackages DAS/2 server (gmod), serving hap map through that. Client will be caBIO.

----
[A] Discuss spec issues for next week's teleconf: regions, seq, registry, etc.
From dalke at dalkescientific.com Mon Dec 5 17:22:14 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 5 Dec 2005 23:22:14 +0100
Subject: [DAS2] transitioning from DAS/1 to DAS/2
Message-ID: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>

What should we do to make DAS/1 -> DAS/2 transitions easier?

For at least a year I expect there will be both DAS/1 and DAS/2 servers. We have or will have clients which can handle both interfaces.

I expect the metadata server should support both server types. This means it must describe that data source "X" uses interface "A".

I also expect that "A" can be:
  - DAS/1 genomic annotations
  - DAS/1.5 structure annotations
  - DAS/2 genomic annotations
  - DAS/2 protein annotations
  - DAS/2 structure

We've already talked about the need for identifying the different DAS/2 data sources, based on Andreas Prlic's comments.

But we didn't talk about how to handle DAS/1 data sets in the metadata server.

At present I don't have even a sketch of an answer.

In theory this registry could be extended to a registry of many different data sources and APIs (eg, 'go to this Oracle database, which uses the biosql schema'). That's well outside the scope of this question.

Andrew
dalke at dalkescientific.com

From ed_erwin at affymetrix.com Mon Dec 5 17:33:08 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 05 Dec 2005 14:33:08 -0800
Subject: [DAS2] transitioning from DAS/1 to DAS/2
In-Reply-To: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>
References: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>
Message-ID: <4394C024.6010103@affymetrix.com>

Andrew Dalke wrote:
> What should we do to make DAS/1 -> DAS/2 transitions easier?
>
> For at least a year I expect there will be both DAS/1 and DAS/2
> servers. We have or will have clients which can handle both
> interfaces.

You are talking about the registry and discovery service, right?

How about only registering DAS/2 servers.
That should help speed users to transition to DAS/2. From ap3 at sanger.ac.uk Tue Dec 6 05:14:11 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 6 Dec 2005 10:14:11 +0000 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com> References: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com> Message-ID: <04f347a5b6873fbda4978d866204048d@sanger.ac.uk> Hi Andrew! > For at least a year I expect there will be both DAS/1 and DAS/2 > servers. We have or will have clients which can handle both > interfaces. > I expect the metadata server should support both server types I think there should be two different points where to get a list of DAS1 and DAS2 servers. after all it is different protocols. e.g. something like das.sanger.ac.uk/registry/ ... the existing das1 registry das.sanger.ac.uk/registry/das2/ ... the upcoming das2 registry. > This means it must describe that data source "X" uses interface "A". > > I also expect that "A" can be: > - DAS/1 genomic annotations > - DAS/1.5 structure annotations In terms of meta description, There is not much difference between genome, sequence and structure capabilities with DAS1. "structure" and "alignment" are just 2 additional commands ("capabilities") that a das server speaks (like "sequence" or "feature). Still it is important to distinguish between the types of data. Therefore the das1- registry style coordinate systems contain the information if the type is chromosomal, protein sequence, or protein structure. > - DAS/2 genomic annotations > - DAS/2 protein annotations > - DAS/2 structure this are the DAS2 - "domains" This leads to a discussion I would like to have next monday: DAS1 is rather powerful, because it is possible to use the sequence and features commands in a way that works for both genomic and protein sequences. There is no need to distinguish between these in the DAS1 - world. 
I understand that one of the reason for creating the DAS2-"domains" is to have several "modules" which can be extended/developed independently. Still the genome -domain is very similar to what is needed for any kind of sequence. I would therefore like to discuss to rename the "genome" domain to "sequence". The information of which type of sequence, "genomic" or "protein sequence" should be provided via the source description. > But we didn't talk about how to handle DAS/1 data sets in > the metadata server. if the source description is done well, it can also be used for das1 servers. ( see my recent mail about the "meta" description, which could be used like that for das1) Something else I also would like to try is to provide a DAS2- proxy for DAS1 sources via the registry... I.e. you can make a DAS2 request to a URL at the registry, which is translated to a DAS1 request at the real server and then translated back again... Cheers, Andreas > ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Steve_Chervitz at affymetrix.com Tue Dec 6 14:37:20 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 06 Dec 2005 11:37:20 -0800 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: <04f347a5b6873fbda4978d866204048d@sanger.ac.uk> Message-ID: Andreas Prlic wrote: > Hi Andrew! > ... >> - DAS/2 genomic annotations >> - DAS/2 protein annotations >> - DAS/2 structure > > this are the DAS2 - "domains" This leads to a discussion I would like > to have next monday: > > DAS1 is rather powerful, because it is possible to use the sequence and > features commands > in a way that works for both genomic and protein sequences. There is > no need to distinguish between these > in the DAS1 - world. 
> > I understand that one of the reason for creating the DAS2-"domains" is > to have several "modules" which can be > extended/developed independently. Still the genome -domain is very > similar to what is needed for any kind of > sequence. > > I would therefore like to discuss to rename the "genome" domain to > "sequence". > > The information of which type of sequence, "genomic" or "protein > sequence" should be provided via the > source description. I like this idea. The only nucleotide specific stuff in the DAS/2 retrieval spec is the region request. Strand designation in a location specifier is already optional. We'd may then want to change the 'sequence' request to something else, perhaps 'residues'? Steve From dalke at dalkescientific.com Tue Dec 6 18:43:10 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 7 Dec 2005 00:43:10 +0100 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: References: Message-ID: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> > Andreas Prlic wrote: >> I would therefore like to discuss to rename the "genome" domain to >> "sequence". >> >> The information of which type of sequence, "genomic" or "protein >> sequence" should be provided via the >> source description. Steve: > I like this idea. The only nucleotide specific stuff in the DAS/2 > retrieval > spec is the region request. Strand designation in a location specifier > is > already optional. > > We'd may then want to change the 'sequence' request to something else, > perhaps 'residues'? I've been thinking the same thing. Going one step further - what about dropping the name entirely? Consider this, with some of the xml: attributes removed for clarity. I've added a 'source_type' field in the element. This is what you get from a SOURCES request HTTP GET http://www.example.com/das2/ (note lack of 'genome' in that URL) That is, the SOURCES request returns information about genomic, protein sequence and structure databases. 
If this occurs then there will need to be a few changes to the spec. For example, 'taxon' is probably only properly part of the genomic sources and not in the others, so perhaps move the taxon information into a subelement of those SOURCE elements with 'source_type' == 'genome'.

  http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118 ...

Andrew
dalke at dalkescientific.com

From Steve_Chervitz at affymetrix.com Wed Dec 7 13:56:02 2005
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Wed, 07 Dec 2005 10:56:02 -0800
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To:
Message-ID:

Here's a proposal regarding sequence retrievals that apparently never made it to the list. (I was compiling a list of agenda items for next week's spec discussion when I noticed I sent this message only to myself...)

Steve

------ Forwarded Message
From: Steve Chervitz
Date: Mon, 14 Nov 2005 13:38:44 -0800
To: Steve Chervitz
Subject: Re: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05

From the notes of today's meeting (14 Nov 05):

> LS: When you request versioned source from a server, it should say what
> assembly coords it's working on and give a uri for that. In this case
> there's no guarantee you can do a 'get' on that URI.
> We want to say:
> 1- what is unique uri for assembly (everyone agrees to share this)
> 2- das URL for how to fetch it (some server's region url - trusted,
> faithful copy with what is at ncbi). Diff servers could assert that
> you can fetch it from various places.

This raises another issue we didn't discuss: How about allowing some way to verify that the sequence data received from a given reference server are in fact faithful copies?

Use case 1: Validate a given reference server as providing correct sequence data for a specific assembly (either the entire assembly or a specific chromosome).

Use case 2: Verify that the sequence or subsequence I received from a specific sequence request is correct and complete.
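To make the two use cases concrete, here is a minimal Python sketch of comparing received residues against a checksum published by a trusted source. Everything in it is an assumption for illustration: the thread does not settle on a checksum algorithm (MD5 is used only because it is widely available), and the function names and the whitespace/case normalization are hypothetical, not part of any DAS/2 draft. It only covers whole-sequence comparison; checking a subsequence would additionally need the reference residues for that range.

```python
import hashlib


def sequence_checksum(residues):
    """Return a hex digest for a sequence, ignoring whitespace and case.

    Assumption: MD5 over the upper-cased residues. The algorithm is a
    placeholder; the proposal above does not name one.
    """
    normalized = "".join(residues.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def matches_published(residues, published_checksum):
    """True if the received residues hash to the published checksum."""
    return sequence_checksum(residues) == published_checksum.lower()


if __name__ == "__main__":
    # Hypothetical data: a checksum as a reference server might publish it,
    # and the same residues as received with different wrapping and case.
    published = sequence_checksum("AAGCCCACTATATTGCATTAAATTATGCG")
    received = "aagcccactatattgcattaaa\nttatgcg\n"
    print(matches_published(received, published))
```

Normalizing before hashing is a design choice here, so that line wrapping or case differences between servers do not cause false mismatches.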
Case #1 requires that the official source of the assembly (or some other trusted reference server) publish checksums on each complete sequence it provides (e.g., each full-length chromosome of each assembly). Case #2 requires the ability to encode a checksum in a sequence response. But there are two issues here: validating the data transfer for the request and validating the correctness of the sequence or subsequence itself with respect to the original assembled sequence. The first issue of case #2 is already supported in the current spec, if the request specifies a format that incorporates a checksum (e.g., sequence/chr21?format=GCG). However, not all servers may support that format, yet they could support checksums. The second issue of case #2 is covered only for responses from trusted reference servers. To consider: 1. What do folks think about adding to the DAS/2 retrieval spec facilities supporting sequence data validation? (i.e., Add an optional checksum attribute in the REGION response.) 2. What do folks think about specifying a DAS2XML format for sequence requests (text/x-das-sequence+xml)? In addition to permitting an optional checksum attribute to address the above use case, it would add some consistency and flexibility to the spec, since at present, the default sequence response format is the only one that is not under our control (currently it's text/x-fasta). Steve ------ End of Forwarded Message From dalke at dalkescientific.com Wed Dec 7 18:22:56 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 8 Dec 2005 00:22:56 +0100 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: Message-ID: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Steve: > This raises another issue we didn't discuss: How about allowing some > way to verify that the sequence data received from a given reference > server are in fact faithful copies? > Use case 1: Validate a given reference server as providing correct > 1. 
What do folks think about adding to the DAS/2 retrieval spec
> facilities supporting sequence data validation? (i.e., Add an
> optional checksum attribute in the REGION response.)

How many people actually write client code which verifies the checksum of those formats which have a checksum? I know I never have. Bioperl's genbank.pm doesn't check the atcg counts, nor does swiss.pm check the crc. (Both generate the checks; they just don't verify them.)

For those who have implemented checksum verification, how many times has that checksum detected an error in the data transmission?

There are already several layers of checksums in the network connection. One in ethernet, another in IP, a third in TCP. Is another one useful?

As an example, HTTP and (I think) ftp don't use checksums. I've transferred many very large files via both and not had a problem. Rather, the only check I needed was to verify that I got all of the data, and HTTP provides that information in the header.

Now, I know that there are problems when you scale to large data transfers. I even remember talking with Gregg and Lincoln about this years ago. A friend of mine went to a presentation at Stanford that Bram Cohen gave about bittorrent, and he was commenting that the four byte checksumming in TCP/IP isn't enough for his needs when you're trying to transfer a 4 gig file to 10,000 users.

We aren't in the terabyte data transfer range.

... doing research ...

But if it does become a concern, one solution is RFC 1864
  http://www.faqs.org/rfcs/rfc1864.html
which adds a "Content-MD5" header to the HTTP response, and describes how to use it. Another is RFC 3230
  http://www.scit.wlv.ac.uk/rfc/rfc32xx/RFC3230.html

As far as I can tell, very few people, if any, actually use those fields for anything. That serves as a sort of confirmation that data rarely gets corrupted at the TCP/IP level.

> 2.
What do folks think about specifying a DAS2XML format for sequence > requests (text/x-das-sequence+xml)? In addition to permitting an > optional checksum attribute to address the above use case, it would > add some consistency and flexibility to the spec, since at present, > the default sequence response format is the only one that is not > under > our control (currently it's text/x-fasta). As a consumer of this sort of data, I don't want to write another parser. It isn't just the parsing part - it's the effort of mapping to my program's data model. There's already a huge number of existing sequence file formats. What would another provide? Are some of them already extensible? Several of those formats are designed and developed by people involved with DAS. If it's important, extend GAME or GFF. As a spec writer, I don't really want to write that part of the spec. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Thu Dec 8 04:48:58 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Thu, 8 Dec 2005 09:48:58 +0000 Subject: [DAS2] DAS2 source description In-Reply-To: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> References: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> Message-ID: The way Andrew suggests the source description looks already quite good to me. Could we add a couple things? * we have some people doing annotations on clones and scaffolds, which -regarding DAS- is essentially the same as annotating in chromosomal coordinates, but for the description a few other types of coordinate systems are needed. * there are a couple of sources that can speak multiple "coordinate systems", so the description should be able to deal with that. * It would be good to have something like an "authority" field in the coordinate systems. i.e. the institution who defines a set of reference objects. 
with this in mind one could do something like: taxon="http://www.ncbi.nlm.nih.gov/taxon-browser?id=9606" source_type="chromosome" authority_name="NCBI" > This would be the part that is needed for describing the actual data and then it would be good to have some other meta info for the sources as well: * which DAS commands does a source understand * a testcode (per namespace) that can be used to validate responses * some historical data like "has been available since" "was successfully validated the last time at" * a link back to the homepage of the group that provides the source for more detailed docu about the data * an email address to contact if there is a problem/question with the source * a "nickname" for a source that should be used in a DAS client to label tracks coming from that source. * some optional properties that can be added like "funded by ..." "GO evidence code: " > That is, the SOURCES request returns information about genomic, > protein sequence and structure databases. good. - plus a couple of others. this should be a restricted list. > If this occurs then there will need to be a few changes to the spec. > For example, 'taxon' is probably only properly part of the genomic > sources some people annotate protein sequences from a particular organism. e.g there is a DAS1 source that only annotates Fugu protein sequences Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From td2 at sanger.ac.uk Thu Dec 8 05:33:16 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Thu, 8 Dec 2005 10:33:16 +0000 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Message-ID: On 7 Dec 2005, at 23:22, Andrew Dalke wrote: > >> 2. 
What do folks think about specifying a DAS2XML format for sequence >> requests (text/x-das-sequence+xml)? In addition to permitting an >> optional checksum attribute to address the above use case, it >> would >> add some consistency and flexibility to the spec, since at >> present, >> the default sequence response format is the only one that is >> not under >> our control (currently it's text/x-fasta). > > As a consumer of this sort of data, I don't want to write another > parser. It isn't just the parsing part - it's the effort of mapping > to my program's data model. > > There's already a huge number of existing sequence file formats. > What would another provide? Are some of them already extensible? > > Several of those formats are designed and developed by people involved > with DAS. If it's important, extend GAME or GFF. Do GAME or GFF have a sequence representation? I thought they were both primarily feature-table formats (right now I'm having trouble finding the GAME documentation though...). The problem I have with Fasta format (other than the tendency of many data-providers to over-load the header line) is that there's no explicit marker for the alphabet and encoding of sequence data. This is pretty nasty for codebases like BioJava which want to present a richer view of sequence data than just a String. I'd certainly be in favour of a nice XML format that made alphabet information explicit. The DAS 1.5 DASSEQUENCE document has a moltype attribute which supports this (at least the three most important cases, DNA/RNA/ Protein -- there's not a standards-compliant way to add other alphabets though). I guess an alternative, more classically RESTful, way of doing things might be with MIME types: Content-Type: application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC I admit I'd prefer the XML though... Thomas. 
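As a rough illustration of the MIME-parameter idea above, a client could read the alphabet and encoding from the Content-Type header using Python's standard library. This is only a sketch: the `application/fasta` type and the `sequence-alphabet` / `sequence-encoding` parameters are the ones floated in the message, not registered or part of the DAS/2 spec, and `parse_sequence_content_type` is a hypothetical helper name.

```python
from email.message import Message


def parse_sequence_content_type(header_value):
    """Split a Content-Type value into (mime_type, alphabet, encoding).

    The parameter names are the ones proposed in the thread; they are
    assumptions, not part of any registered MIME type. Parameters that
    are absent come back as None.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return (
        msg.get_content_type(),
        msg.get_param("sequence-alphabet"),
        msg.get_param("sequence-encoding"),
    )


if __name__ == "__main__":
    value = "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"
    print(parse_sequence_content_type(value))
```

A plain `text/x-fasta` response would simply yield no alphabet or encoding, so a client written this way could fall back to guessing, which is what it has to do today.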
From nomi at fruitfly.org Thu Dec 8 14:13:05 2005 From: nomi at fruitfly.org (Nomi Harris) Date: Thu, 8 Dec 2005 11:13:05 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Message-ID: <17304.34241.521199.4698@spongecake.lbl.gov> On 8 December 2005, Thomas Down wrote: > > There's already a huge number of existing sequence file formats. > > What would another provide? Are some of them already extensible? > > > > Several of those formats are designed and developed by people involved > > with DAS. If it's important, extend GAME or GFF. > > Do GAME or GFF have a sequence representation? I thought they were > both primarily feature-table formats GAME certainly has a sequence representation, and i think GFF3 must, though old GFF doesn't. 3R:1178000-1230000 Drosophila melanogaster AAGCCCACTATATTGCATTAAATTATGCGATAATTGATCAATTTTAAAGG ... > (right now I'm having trouble > finding the GAME documentation though...). http://www.fruitfly.org/annot/apollo/game.rng.txt Nomi From Steve_Chervitz at affymetrix.com Thu Dec 8 16:04:56 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 08 Dec 2005 13:04:56 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: Message-ID: On Thu, 8 Dec 2005, Thomas Down wrote: > > On 7 Dec 2005, at 23:22, Andrew Dalke wrote: >> >> Steve Chervitz wrote: >>> >>> 2. What do folks think about specifying a DAS2XML format for sequence >>> requests (text/x-das-sequence+xml)? In addition to permitting an optional >>> checksum attribute to address the above use case, it would add some >>> consistency and flexibility to the spec, since at present, the default >>> sequence response format is the only one that is not under our control >>> (currently it's text/x-fasta). >> >> As a consumer of this sort of data, I don't want to write another >> parser. It isn't just the parsing part - it's the effort of mapping >> to my program's data model. 
>>
>> There's already a huge number of existing sequence file formats.
>> What would another provide? Are some of them already extensible?

I am also somewhat loath to add yet another sequence file format to the
world. It seems reasonable to state that a DAS/2 server can supply sequence
in an alternative format via requests such as:

http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME

There would have to be a way for a server to indicate what alternative
formats it supports. We could use the same strategy as we do in the
versioned source request, supplying a FORMAT element listing alternative
formats. But where to put it? Perhaps in the regions request:

For interoperability purposes, we should provide a controlled vocabulary
of alternative formats and their types, at least for the commonly used
ones.

>> Several of those formats are designed and developed by people involved
>> with DAS. If it's important, extend GAME or GFF.
>
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).

Here's a brief tour of some possibly extensible candidates:

GFF - only represents features:
  http://song.sourceforge.net/gff3.shtml

GAME - does encode sequence data as a simple string. Flybase/BDGP use GAME
  XML and appear to be the main users/maintainers. Suzi and Chris can
  elaborate more here, but I found a link to an RNG schema in the Apollo FAQ:
  http://www.fruitfly.org/annot/apollo/game.rng.txt

  GAME notes:
  - The http://bioxml.org links are now obsolete.
    Here's an old description containing such links:
    http://xml.coverpages.org/game.html
  - GAME variants have arisen that have created incompatibilities in the
    bio* world:
    http://open-bio.org/pipermail/bioperl-l/2003-April/011988.html
  - When I checked a flybase data file, it didn't point to a DTD:
    ftp://flybase.net/genomes/Drosophila_melanogaster/current/xml-game/

Otter - a sort of simplified GAME that also represents sequence:
  http://www.sanger.ac.uk/Users/jgrg/otter_xml.html

XFF - models sequences and has alphabet support (Thomas: is this in use?):
  http://www.biojava.org/thomasd/XFF/

INSDseq and EMBLxml - An XML format for GenBank/EMBL/DDBJ sequence data:
  http://www.ebi.ac.uk/xembl/

BSML - Somewhat antiquated but is supported by the XEMBL service
  http://www.bsml.org/
  and in use by LabBook:
  http://www.labbook.com/default.aspx

AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:
  http://www.agavexml.org/

BIOML - Details are sketchy, appears to be used internally by Genomic
  Solutions which acquired Proteometrics, the originators of BIOML.
  Here are the most recent references I could find:
  http://www.biomedcentral.com/1471-2105/5/25
  http://www.genomicsolutions.com/showPage.php?title=Data%20Integration

> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data. This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String. I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).

Various data providers take all sorts of liberties with fasta sequence,
e.g., sequences with no IDs, whitespace-containing IDs, space between the
'>' and the ID, etc. We might consider prescribing some conventions for
what DAS considers proper fasta format. I put in a little bit of
description of a DAS-acceptable fasta format here in the retrieval spec:
http://biodas.org/documents/das2/das2_get.html#sequence

Do we want to add more to this? Perhaps something about an optional
description being separated from the ID by whitespace and consisting of
any amount of free-form text.

Steve

> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>
> Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
>
> I admit I'd prefer the XML though...
>
>
> Thomas.
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Sun Dec 11 13:40:46 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sun, 11 Dec 2005 19:40:46 +0100
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To:
References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com>
Message-ID: <50dd4457c1fb4038dda3f4563c92947e@dalkescientific.com>

Thomas:
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).

Others followed up on this. For me, I was confused. Even though Steve
said "sequence retrieval" -- in the subject even -- I was thinking of
feature formats. I think that came to mind because I expect there to be
more feature data transferred than sequence data, so if data corruption
is a concern then the annotations are more likely to have problems. Or I
may have been thinking about some of the formats (Genbank, swissprot)
which combine the two, and have a checksum.
I still don't think checksum-identifiable data corruption is something we need to worry about. > The problem I have with Fasta format (other than the tendency of many > data-providers to over-load the header line) is that there's no > explicit marker for the alphabet and encoding of sequence data. *sigh* It seems like this never goes away. Biopython also has a "rich" alphabet property, designed to handle alternate alphabets, like 3-letter codes and secondary structure alphabets. Bioperl's seems more appropriate in practice - dna, protein, rna, and perhaps 'unknown'. In the context of DAS, this is not a problem. DAS 2.0 uses only genomic data, so all FASTA records will be of type 'dna'. It might be different with structure data where a single record may have all three alphabet types. (Though I only know of structures with 2 of the 3.) > I guess an alternative, more classically RESTful, way of doing things > might be with MIME types: > > Content-Type: application/fasta; sequence-alphabet=DNA; > sequence-encoding=IUPAC > > I admit I'd prefer the XML though... As I mentioned, for purposes of DAS 2.0 this isn't needed so I don't think we need to solve this problem. If we do, I think it's a nearly intractable problem. How does one register all the different possible alphabets? IUPAC dna/rna/protein covers most of it. Getting the other few percent is hard. Then making all the software to preserve or interconvert the different formats adds another layer of hard. There's a lot of social issues as well. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sun Dec 11 14:26:21 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 11 Dec 2005 20:26:21 +0100 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: Message-ID: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com> Steve: > I am also somewhat loath to add yet another sequence file format to the > world. 
Seems reasonable to state that a DAS/2 server can supply
> sequence in
> an alternative format via requests such as:
>
> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME

That makes good sense to me.

> Here's a brief tour of some possibly extensible candidates:

Do you want to say this as:
  "The server must implement these sequence formats"
or
  "If the server implements one or more of these sequence formats then
  it must use the corresponding id and content-type."
?

Or say nothing and wait until several different servers implement
this then standardize on what they do?

I don't think anyone here seriously wants the first. :)

The last is my favorite, then the middle one.

My stronger preference is to get a complete 2.0 spec out. Do
you or other users need checksum validation of the sequence and/or
alternate sequence formats in 2.0? What prevents you from extending
existing HTTP headers or experimenting with extensions then
submitting your experience for inclusion in future versions of
the spec?

My sense is that this can wait.

> We might consider prescribing some conventions for what DAS considers
> proper
> fasta format. I put in a little bit of description of a DAS-acceptable
> fasta
> format here in the retrieval spec:
> http://biodas.org/documents/das2/das2_get.html#sequence

Do current DAS clients even use the header?

Will future ones use it? If so, why? Shouldn't all the information
in a header be available as an annotation?

The Wikipedia entry for FASTA is pretty good.
http://en.wikipedia.org/wiki/Fasta_format

I had my students a few months ago find different FASTA definitions.
Some disagreed with others. Wikipedia was the most complete.

Andrew
dalke at dalkescientific.com

From Gregg_Helt at affymetrix.com Mon Dec 12 12:28:46 2005
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 12 Dec 2005 09:28:46 -0800
Subject: [DAS2] DAS/2 meeting agenda, Dec 12 2005
Message-ID:

Today we're focusing on spec issues.
Here are a few topics raised in the last two weeks:

  The .../region subtree
  Retrieval of sequence residues
  Support for different versions of DAS in the registry
  Revising top level of DAS tree (putting more info in source response)

From ed_erwin at affymetrix.com Mon Dec 12 15:03:33 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 12 Dec 2005 12:03:33 -0800
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com>
References: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com>
Message-ID: <439DD795.1070406@affymetrix.com>

Andrew Dalke wrote:
>
> The Wikipedia entry for FASTA is pretty good.
> http://en.wikipedia.org/wiki/Fasta_format

Aha! Now I know where those ^A characters in the NetAffx database came
from!

From ed_erwin at affymetrix.com Mon Dec 12 16:05:47 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 12 Dec 2005 13:05:47 -0800
Subject: [DAS2] transitioning from DAS/1 to DAS/2
In-Reply-To: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com>
References: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com>
Message-ID: <439DE62B.6010002@affymetrix.com>

Andrew Dalke wrote:
>
> That is, the SOURCES request returns information about genomic,
> protein sequence and structure databases.
>
> If this occurs then there will need to be a few changes to the spec.
> For example, 'taxon' is probably only properly part of the genomic
> sources and not in the others, so perhaps move the taxon information
> into a subelement of those SOURCE elements with 'source_type' == 'genome'.
>
>
> source_type="genome">
> http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118
>
>
>
>
>

This is a bit off-topic, but I noticed that links to the taxonomy browser
need to be formatted like this:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3066

And note this disclaimer on that site: "The NCBI taxonomy database is not
an authoritative source for nomenclature or classification"

From Steve_Chervitz at affymetrix.com Mon Dec 12 16:33:00 2005
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 12 Dec 2005 13:33:00 -0800
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com>
Message-ID:

On Sun, 11 Dec 2005 Andrew Dalke wrote:
> Steve:
>> I am also somewhat loath to add yet another sequence file format to the
>> world. Seems reasonable to state that a DAS/2 server can supply
>> sequence in
>> an alternative format via requests such as:
>>
>> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME
>
> That makes good sense to me.
>
>> Here's a brief tour of some possibly extensible candidates:
>
> Do you want to say this as:
> "The server must implement these sequence formats"
> or
> "If the server implements one or more of these sequence formats then
> it must use the corresponding id and content-type."
> ?
>
> Or say nothing and wait until several different servers implement
> this then standardize on what they do?
>
> I don't think anyone here seriously wants the first. :)
>
> The last is my favorite, then the middle one.

The last is fine with me. This is the approach we use for type-specific
alternative feature formats:
http://biodas.org/documents/das2/das2_get.html#types

> My stronger preference is to get a complete 2.0 spec out. Do
> you or other users need checksum validation of the sequence and/or
> alternate sequence formats in 2.0?
What prevents you from extending
> existing HTTP headers or experimenting with extensions then
> submitting your experience for inclusion in future versions of
> the spec?
>
> My sense is that this can wait.

Yep. Especially in light of this morning's teleconf (notes for which are
on the way). This seems like a good place to invoke YAGNI
( http://keithdevens.com/quotes/YAGNI ).

>> We might consider prescribing some conventions for what DAS considers proper
>> fasta format. I put in a little bit of description of a DAS-acceptable fasta
>> format here in the retrieval spec:
>> http://biodas.org/documents/das2/das2_get.html#sequence
>
> Do current DAS clients even use the header?
>
> Will future ones use it? If so, why? Shouldn't all the information
> in a header be available as an annotation?

Don't know. Seems like it should be left to the client implementation to
decide what to do with the header. The aim of the sequence request (soon
to be 'residues') is to get sequence data, not annotations. If we're not
saying what DAS/2 clients are supposed to do with the header info, and
there are so many variations out there, we might consider stating that
clients are free to ignore the header. Then if we do this, why use fasta
format instead of raw sequence?

Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1
sequence element has these attributes: id, start, stop, moltype, version.
Does anyone know how DAS/1 clients make use of these from the seq
response?

> The Wikipedia entry for FASTA is pretty good.
> http://en.wikipedia.org/wiki/Fasta_format

Interesting. That more-than-one-header business seems evil. They give a
good link for alternative sequence formats.

Steve

From Steve_Chervitz at affymetrix.com Mon Dec 12 20:38:14 2005
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Mon, 12 Dec 2005 17:38:14 -0800
Subject: [DAS2] DAS/2 weekly meeting notes from 12 Dec 2005
Message-ID:

Notes from the weekly DAS/2 teleconference, 12 Dec 2005.

$Id: das2-teleconf-2005-12-12.txt,v 1.1 2005/12/13 01:03:01 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Andreas Prlic, Thomas Down
  Sweden: Andrew Dalke

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2005. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER:
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a rapid
turnaround. So don't consider these notes as complete minutes from the
meeting, but rather abbreviated, summarized versions of what was
discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit.

Today's topic: Spec Issues
--------------------------

* Regions
-----------
Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000388.html

AD: Can the region request be removed? It's just a type of feature.

LS: There are situations where we need to say "something lives in this
    region and you can't get base pairs for it." Example: gaps, or
    chromosomal sequence based on mapping data only.

AP: How would start/stop be specified?

LS: Endpoints of gap are specified in base pair coordinates. This is
    standard in AGP files. Can indicate approximate length.
    It's not a case of a feature with an ambiguous location, just not a
    precisely defined location.

GH: Does the spec allow decimal places in location, e.g., for
    recombination frequency?

LS: No. Still genome/base pair oriented.
    If we want to require the retrieval of bases, we could possibly have
    a convention where Ns could be returned.

AD: Didn't know DAS needed to handle this info. For features
    type=region, names could be returned, not necessarily sequence data.

LS: The operations we need to support:
    1) Return entry points for an interactive search - e.g., chromosome
       length fragments
    2) Assembly info (AGP)
    3) Bases (residues) given location on the sequence

TD: Why not return assembly as a set of features as DAS/1 does? Why do we
    need a special assembly communication format?

GH: You can't get the whole picture all at once. You have to get the
    top-level contigs, each of which has its own assembly,
    recursively. Lots of queries may be required.

TD, LS: Hierarchical features are supported. You could have one feature
    per chromosome of type=assembly. You then do a non-recursive request
    to get the top-level features, then do a recursive request to get the
    feature with all children.

GH: This is the chado approach where every sequence is a feature. I have
    trouble with this.

TD: The feature indicates an alignment. The region for the feature
    aligns to a piece of chromosome.

GH: How do you find out what the chromosomes are?

LS: Assembly fragment type could be used for children.

Currently in the DAS/2 spec:
 - a region request returns contigs a la the entry points list from
   DAS/1.
       volvox/Contig1
 - or in a finished assembly, this would be chromosome length things
   that IGB would present to the user to select for browsing. These are
   not necessarily chromosomes, just recommended entry points for
   browsing.

Feature-based approach:
 - do a feature query using filter type=assembly

AD: Why do we need region request?

GH: To get top-level entry points for browsing.

LS: Sequence ontology has these types that could appear as entry points:
    - assembly
    - assembly component
    - contig
    - supercontig
    - chromosome
    - chromosome arm

Problem: A naive browser comes into genome, doesn't know what the entry
points are.

AD: Saying type=top-level is wrong. It should be a property.

LS: 'Entry point' or 'landmark' attribute.

GH: How do you get the entry points?

LS: Feature request with a filter for attribute='entry point', and
    type='assembly component'

AD: Possible trouble with people defining features at different servers
    from the one providing regions:
    - server 1 provides regions
    - server 2 provides other feature types
    So you need to go to multiple servers.

TD: This is not a big change from DAS/1

LS: gmod has chromosomes as features, this has never been a problem.
    Advantages to Andrew's suggestion (regions as features):
    - simplifies the protocol
    - can't return AGP format files, must parse DAS2XML
      (or can only get AGP for a subcomponent of assembly).

GH: Can use the same alternative format approach we use for types
    request (optional FORMAT subelements). But then no server would be
    required to return it.

TD: Not a big deal because every client will be required to parse
    feature XML. Also, the top-level assembly won't be very large.

GH: AGP support is not that important.

LS: Was a request from UCSC (Jim Kent). OK to get rid of region and use
    feature.

GH: Still have trouble using feature to get region data because of the
    circular nature of referring to yourself as your coordinate system.

AD: You can still point to a sequence as your coordinate system.

GH: How do you know the size of the sequence without requesting the whole
    sequence? There's also the possibility of 0 vs 1-based coordinate
    confusion. Someone could provide an assembly top-level feature and
    declare it starts at 1, getting around our 0-based requirement for
    genomic features.

LS: They could, but will suffer the consequences of pervasive
    off-by-one errors.

Proposal: Abolish the region namespace (request/response):
  - Add special feature type 'assembly component'
  - assembly component has optional attribute 'entry point'
  - Response to this query must be fast
  - Servers must be able to handle attribute filters

GH: Not comfortable with this, and how gmod treats chromosomes being the
    same as features. Why?
    Data modeling, e.g., the sequence symmetry concept of genometry used
    in IGB. An annotation/feature is always described as a relation
    between one or more sequences. The annotation only points to the
    sequence.

LS: In Bioperl GFF database and chado schema, entry-level sequences are
    features that use their own coord system. Top-level sequence is a
    feature with type=chromosome or contig. Limitation is that you need
    to know what to use for type. Advantage is in relative addressing
    (e.g., get all genes within 1000 bases of other genes). Works when
    feature is in its own coordinate system.

AD: There's a danger of becoming too generic, example from WebDAV. When
    everything is a property, there's no structure.

LS: There is the risk of having too many magic attributes.

AD: We could keep the top-level or landmark request as a special alias
    that retrieves a subset of the data -- just top-level entry points,
    instead of having a special feature request. Would be the same as a
    region without the extra stuff.

GH: Bad to have two ways to get same data.

LS: Regions as features is OK, but no top-level attribute.

Proposal: DAS-defined special feature type 'top level' or 'entry point'
that maps to SO assembly component. Hard-coded, special type that
returns entry type features.

AD: Is there multiple inheritance support? Are there features that
    inherit from both SO and our special type? E.g., of type entry point
    and contig?

LS: No. A data source must support type='das:entry point'. To get
    top-level features, you ask to get features of this type. They can
    have children to describe the assembly.
    Trouble with this: Duplication, you now have features that appear as
    type=entry point and as type=supercontig or chromosome. One is a
    physical object, one is a navigation object.
    This trades using a magic type instead of a magic attribute.

AD: So we have a choice:
    - magic attribute
    - magic type
    - magic URL

LS: Likes special attribute the best.
    Advantage is that you can tag whatever feature type you want to
    appear as an entry point.
    Disadvantage is potential abuse and implementation could be harder.
    Attribute filtering must be fast.
    Use case: At an intermediate stage of a big assembly you can choose
    what you want to be top-level, rather than creating a new database
    object, or figuring out another way to make it appear in response to
    a region request.

Vote:
  - GH: special URL (region)
  - LS, AD, TD: special attribute

AD: As benevolent dictator, decides that DAS/2 will employ a special
    attribute to handle regions as features.

Question: What to do with the location attribute (now is a feature URL).
Or do we get rid of the position attribute?

LS: LOC points to feature that establishes coord system and has
    subranges of that feature. So the URL gets longer. Attributes specify
    position of the feature.
    LOC is for feature space. It specifies the reference system of the
    feature and where it starts relative to the feature. Position
    attribute points to the sequence. Clients know to parse the URL to
    get the start/end.

TD: In XML, LOC with attributes start, end, strand, seq.

GH: We now permit matching feature filters to allow combining these. So
    we should keep the filter syntax. Feature loc syntax can be
    different.

[A]: Andrew: provide details for retrieving regions via feature request
     - need to get the feature the coordinates are relative to (contig)
     - need to get the bases, which may not be on the same server

SC: Has some philosophical issues with collapsing regions into features,
    but willing to explore doing so for simplicity. Trouble is putting
    objects with some physical correlate (sequence) at same level as
    objects lacking such solid substrate (features).

GH: This discussion has created a lot of churn in the spec fairly late
    in the game. We should be more settled by now.

ALL: General agreement.

[A]: Everyone make a push to stabilize the retrieval spec.
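
To make the outcome of the vote above concrete, here is a minimal in-memory sketch of what "entry point as a special attribute" could look like from a client's side. Everything in it is an assumption for illustration: the `Feature` class, the `entry_point` attribute key and its `"true"` value, and the feature ids are all hypothetical, since the notes do not fix any syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Minimal stand-in for a DAS/2 feature; all field names are hypothetical."""
    feature_id: str
    feature_type: str
    attributes: dict = field(default_factory=dict)


def entry_points(features):
    """Return the features tagged as browsing entry points.

    The tag is an attribute, so it is independent of the feature type:
    a chromosome and a supercontig can both be entry points.
    """
    return [f for f in features if f.attributes.get("entry_point") == "true"]


if __name__ == "__main__":
    features = [
        Feature("chrI", "chromosome", {"entry_point": "true"}),
        Feature("supercontig_12", "supercontig", {"entry_point": "true"}),
        Feature("Contig1", "contig"),
        Feature("gene_0042", "gene"),
    ]
    for f in entry_points(features):
        print(f.feature_id, f.feature_type)
```

The point of the attribute approach, as LS describes it, is visible here: nothing about the type has to change for a feature to become (or stop being) an entry point, at the cost of the server needing fast attribute filtering.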

* New topic: Rename DAS 'genome' domain to 'sequence'
-----------------------------------------------------
Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000394.html

AD: Why not remove the top-level domain completely? (das/genome becomes
    just das).

GH: Need to know what data a given server has.

AP: As long as the source description provides info about what it's
    about, should be sufficient.

GH: This pushes the URL data into the source type tag
    (this is the same magic URL vs magic type vs magic attribute issue
    all over again...)

AD: If we get rid of it, a given server can provide different data
    without special URLs.

GH: What you put on the URL determines the return type. Why don't you
    like it?

AD: 1) 'genome' in the URL is extra fluff.
    2) saying you might need it in the future is a weak argument
       (ain't gonna need it).

GH: Most servers will provide one type of data.

AD: People who provide meta data might want to combine it into one
    document for all data.

LS: Saying you're in 'genome space' is a contract for what coordinate
    system is (positive integers, start, end, strand) and what type of
    requests/responses are expected. If we jumble things up, it makes it
    difficult for dealing with other systems (3D coords).
    The 'genome' space is intended to cover both protein and DNA.

AD: The top-level DAS response would point to the versioned source, and
    indicate that it has a sequence, and a top-level URL.

LS: Seems like an arbitrary decision.

AP: What about the original proposal, to simply change 'genome' to
    'sequence'?

GH: OK with this.

[A]: Andrew (spec czar) change 'das/genome' domain to 'das/sequence'.
[A]: Andrew (spec czar) change 'sequence' request to 'residues'.

Other Issues:
-------------
LS: Concerned about big changes being made to spec at this date.

ALL: Agreed. Should have happened earlier, but the discussion is
     important.

[A]: All - focus on spec issues again next week. No meeting in two weeks.
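
For anyone with old-style URLs lying around, the two renaming action items above amount to a small path rewrite. The sketch below assumes the rest of the URL layout stays as in the example URL quoted earlier in the thread, which the notes do not actually confirm; `rename_das2_url` is a hypothetical helper, not part of any DAS library.

```python
from urllib.parse import urlsplit, urlunsplit


def rename_das2_url(old_url):
    """Apply the two renames from the action items to a request URL.

    'das/genome' becomes 'das/sequence', and a trailing 'sequence'
    request becomes 'residues'. The query string is left untouched.
    """
    parts = urlsplit(old_url)
    segments = parts.path.split("/")

    # Locate the domain segment, i.e. 'genome' directly after 'das'.
    domain_index = None
    if "das" in segments:
        candidate = segments.index("das") + 1
        if candidate < len(segments) and segments[candidate] == "genome":
            domain_index = candidate

    # Rename the request first, so the new 'sequence' domain name is
    # never mistaken for an old 'sequence' request.
    last = len(segments) - 1
    if segments[last] == "sequence" and last != domain_index:
        segments[last] = "residues"
    if domain_index is not None:
        segments[domain_index] = "sequence"

    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    print(rename_das2_url(
        "http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME"))
```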

From nomi at fruitfly.org Tue Dec 13 13:27:01 2005
From: nomi at fruitfly.org (Nomi Harris)
Date: Tue, 13 Dec 2005 10:27:01 -0800
Subject: [DAS2] Link to GAME XML schema
Message-ID:

http://biodas.org/documents/das2/das2_get.html has a link to
http://flybase.net/annot/gamexml.dtd.txt for GAME XML documentation.
This link should be changed to
http://flybase.bio.indiana.edu/annot/game.rng.txt

Thanks,
Nomi

From dalke at dalkescientific.com Fri Dec 16 07:28:29 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 16 Dec 2005 13:28:29 +0100
Subject: [DAS2] DAS code sprint proposal
Message-ID: <76643bf6d537c2dade84350eb5d77b1d@dalkescientific.com>

Hi all,

I've been thinking more about the comments from Gregg and others
during the previous phone conference. He was concerned that there are
some major spec changes this late in the DAS/2 grant period.

It does seem rather late, but from what I've seen in spec development,
especially ones like this which combine people from, what, 6 or more
sites and will be used by many more people, it's not uncommon. It's just
not possible to emerge the spec full-blown like Athena from the head of
Zeus, any more than it's possible to write 10,000 lines of code and have
it compile the first time, much less run. Even with peer review.

I think we have a solid idea of what the spec should look like - though
I'll have a busy few weeks assembling them into words. I don't think the
object model has had that much in the way of changes; mostly it's been a
question of pinning down a few details.

I think it's time to schedule a DAS code sprint, where the different
server and client people get together to implement the spec and use that
to provide feedback for the spec development. My feeling is that people
are ready for this too. I know I would rather code specs than doc them!

Christmas is coming up and there's usually the week or so at the start
of the year where people are getting back up to speed.
That sounds like the end of January / early February is a good time.
For me the week of 6 February is the best. That's probably far enough
in the future too for people to clear out time.

My thought is to have two groups of people, one in Cambridge and one in
Emeryville, but I've not heard of a geographically split sprint like
that.

Andrew
dalke at dalkescientific.com

From ap3 at sanger.ac.uk Fri Dec 16 08:11:10 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Dec 2005 13:11:10 +0000
Subject: [DAS2] DAS code sprint proposal
Message-ID:

Hi!

We (thomas and me) think that this is a very good idea and we are happy
to organize our part here. The proposed week (february 6th) is also
fine.

Cheers,
Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

From rds at sanger.ac.uk Fri Dec 16 09:33:38 2005
From: rds at sanger.ac.uk (Roy Storey)
Date: Fri, 16 Dec 2005 14:33:38 +0000 (GMT)
Subject: [DAS2] DAS code sprint proposal
Message-ID:

Andreas,

You can count Ed and myself in as well. We'd very much like to get zmap
to be a DAS2 client and anything we can do to help hack....

Roy

On Fri, 16 Dec 2005, Andreas Prlic wrote:

> Hi!
>
> We (thomas and me) think that this is a very good idea and we are happy to
> organize
> our part here. The proposed week (february 6th) is also fine.
> > Cheers, > Andreas > > > ----------------------------------------------------------------------- > > Andreas Prlic Wellcome Trust Sanger Institute > Hinxton, Cambridge CB10 1SA, UK > +44 (0) 1223 49 6891 > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 > > > From lstein at cshl.edu Mon Dec 19 12:10:30 2005 From: lstein at cshl.edu (Lincoln Stein) Date: Mon, 19 Dec 2005 12:10:30 -0500 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: Message-ID: <200512191210.31108.lstein@cshl.edu> Hi, Just CVS updated the spec today in anticipation of the telecon, and I don't see any changes to the feature or region requests. Didn't we agree to drop region? Lincoln On Monday 12 December 2005 04:33 pm, Steve Chervitz wrote: > On Sun, 11 Dec 2005 Andrew Dalke wrote: > > Steve: > >> I am also somewhat loath to add yet another sequence file format to the > >> world. Seems reasonable to state that a DAS/2 server can supply > >> sequence in > >> an alternative format via requests such as: > >> > >> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME > > > > That makes good sense to me. > > > >> Here's a brief tour of some possibly extensible candidates: > > > > Do you want to say this as: > > "The server must implement these sequence formats" > > or > > "If the server implements one or more of these sequence formats then > > it must use the corresponding id and content-type." > > ? > > > > Or say nothing and wait until several different servers implement > > this then standardize on what they do? > > > > I don't think anyone here seriously wants the first. :) > > > > The last is my favorite, then the middle one. > > The last is fine with me. This is the approach we use for type-specific > alternative feature formats: > http://biodas.org/documents/das2/das2_get.html#types > > > My stronger preference is to get a complete 2.0 spec out. 
Do > > you or other users need checksum validation of the sequence and/or > > alternate sequence formats in 2.0? What prevents you from extending > > existing HTTP headers or experimenting with extensions then > > submitting your experience for inclusion in future versions of > > the spec? > > > > My sense is that this can wait. > > Yep. Especially in light of this morning's teleconf (notes for which are on > the way). This seems like a good place to invoke YAGNI ( > http://keithdevens.com/quotes/YAGNI ). > > >> We might consider proscribing some conventions for what DAS considers > >> proper fasta format. I put in a little bit of description of a > >> DAS-acceptable fasta format here in the retrieval spec: > >> http://biodas.org/documents/das2/das2_get.html#sequence > > > > Do current DAS clients even use the header? > > > > Will future ones use it? If so, why? Shouldn't all the information > > in a header be available as an annotation? > > Don't know. Seems like it should be left to the client implementation to > decide what to do with the header. The aim of the sequence request (soon to > be 'residues') is to get sequence data, not annotations. > > If we're not saying what DAS/2 clients are supposed to do with the header > info, and there are so many variations out there, we might consider stating > that clients are free to ignore the header. Then if we do this, why use > fasta format instead of raw sequence? > > Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1 > sequence element has these attributes: id, start, stop, moltype, version. > Does anyone know how DAS/1 clients make use of these from the seq response? > > > The wikiepedia entry for FASTA is pretty good. > > http://en.wikipedia.org/wiki/Fasta_format > > Interesting. That more-than-one-header business seems evil. They give a > good link for alternative sequence formats. 
> > Steve > > > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From suzi at fruitfly.org Mon Dec 19 13:18:13 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Mon, 19 Dec 2005 10:18:13 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: <200512191210.31108.lstein@cshl.edu> References: <200512191210.31108.lstein@cshl.edu> Message-ID: yikes, i lost track of the time. i had planned on being on the call. sorry about that. On Dec 19, 2005, at 9:10 AM, Lincoln Stein wrote: > Hi, > > Just CVS updated the spec today in anticipation of the telecon, and I > don't > see any changes to the feature or region requests. Didn't we agree to > drop > region? > > Lincoln > > On Monday 12 December 2005 04:33 pm, Steve Chervitz wrote: >> On Sun, 11 Dec 2005 Andrew Dalke wrote: >>> Steve: >>>> I am also somewhat loath to add yet another sequence file format to >>>> the >>>> world. Seems reasonable to state that a DAS/2 server can supply >>>> sequence in >>>> an alternative format via requests such as: >>>> >>>> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME >>> >>> That makes good sense to me. >>> >>>> Here's a brief tour of some possibly extensible candidates: >>> >>> Do you want to say this as: >>> "The server must implement these sequence formats" >>> or >>> "If the server implements one or more of these sequence formats >>> then >>> it must use the corresponding id and content-type." >>> ? >>> >>> Or say nothing and wait until several different servers implement >>> this then standardize on what they do? >>> >>> I don't think anyone here seriously wants the first. :) >>> >>> The last is my favorite, then the middle one. >> >> The last is fine with me. 
This is the approach we use for >> type-specific >> alternative feature formats: >> http://biodas.org/documents/das2/das2_get.html#types >> >>> My stronger preference is to get a complete 2.0 spec out. Do >>> you or other users need checksum validation of the sequence and/or >>> alternate sequence formats in 2.0? What prevents you from extending >>> existing HTTP headers or experimenting with extensions then >>> submitting your experience for inclusion in future versions of >>> the spec? >>> >>> My sense is that this can wait. >> >> Yep. Especially in light of this morning's teleconf (notes for which >> are on >> the way). This seems like a good place to invoke YAGNI ( >> http://keithdevens.com/quotes/YAGNI ). >> >>>> We might consider proscribing some conventions for what DAS >>>> considers >>>> proper fasta format. I put in a little bit of description of a >>>> DAS-acceptable fasta format here in the retrieval spec: >>>> http://biodas.org/documents/das2/das2_get.html#sequence >>> >>> Do current DAS clients even use the header? >>> >>> Will future ones use it? If so, why? Shouldn't all the information >>> in a header be available as an annotation? >> >> Don't know. Seems like it should be left to the client implementation >> to >> decide what to do with the header. The aim of the sequence request >> (soon to >> be 'residues') is to get sequence data, not annotations. >> >> If we're not saying what DAS/2 clients are supposed to do with the >> header >> info, and there are so many variations out there, we might consider >> stating >> that clients are free to ignore the header. Then if we do this, why >> use >> fasta format instead of raw sequence? >> >> Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1 >> sequence element has these attributes: id, start, stop, moltype, >> version. >> Does anyone know how DAS/1 clients make use of these from the seq >> response? >> >>> The wikiepedia entry for FASTA is pretty good. 
>>> http://en.wikipedia.org/wiki/Fasta_format >> >> Interesting. That more-than-one-header business seems evil. They give >> a >> good link for alternative sequence formats. >> >> Steve >> >> >> >> _______________________________________________ >> DAS2 mailing list >> DAS2 at portal.open-bio.org >> http://portal.open-bio.org/mailman/listinfo/das2 > > -- > Lincoln D. Stein > Cold Spring Harbor Laboratory > 1 Bungtown Road > Cold Spring Harbor, NY 11724 > FOR URGENT MESSAGES & SCHEDULING, > PLEASE CONTACT MY ASSISTANT, > SANDRA MICHELSEN, AT michelse at cshl.edu > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From edgrif at sanger.ac.uk Thu Dec 1 09:21:40 2005 From: edgrif at sanger.ac.uk (Ed Griffiths) Date: Thu, 1 Dec 2005 09:21:40 +0000 (GMT) Subject: [DAS2] DAS intro In-Reply-To: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> References: <5b3c55a976a0effc3725923122c66d4f@dalkescientific.com> Message-ID: Andrew, > The front of the DAS doc starts > > DAS 2.0 is designed to address the shortcomings of DAS 1.0, including: > > That kinda assumes people know what DAS 1.0 is to understand DAS 2.0. Good to make this change but I also think that there should be a short section which compares/contrasts DAS 1.0 and DAS 2.0. It should be written to show that DAS 2.0 addresses the shortcomings of DAS 1.0 (e.g. updating protocol). Otherwise there is nothing major I would change about the intro., a good change to make. 
Ed -- ** PLEASE NOTE NEW ADDRESS/PHONE NUMBER ** ------------------------------------------------------------------------ | Ed Griffiths, Acedb development, Informatics Group, | | The Morgan Building, Sanger Institute, Wellcome Trust Genome Campus | | Hinxton, Cambridge CB10 1HH | | | | email: edgrif at sanger.ac.uk Tel: +44-1223-496844 Fax: +44-1223-494919 | ------------------------------------------------------------------------ From dalke at dalkescientific.com Sun Dec 4 23:35:49 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 5 Dec 2005 00:35:49 +0100 Subject: [DAS2] the /region subtree Message-ID: Looks like no one knows why it's there? Regions are important. But regions can (as far as I can tell) be described in a feature by pointing directly to the /sequence subtree and not through an intermediate /region object. Identifiable regions (contigs, ESTs) are important, but they can be stored as a feature, and take advantage of the other capabilities of features, like searching and returning alternative formats. I sent mail to Lincoln asking to talk about this but haven't heard back from him. Or anyone else want to explain it to me? Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Dec 5 16:50:56 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 05 Dec 2005 08:50:56 -0800 Subject: [DAS2] DAS/2 teleconference today Message-ID: Today's agenda: implementation status reports. 
Dialin (US): 800-531-3250 Dialin (Intl): 303-928-2693 Conference ID: 2879055 Steve From Steve_Chervitz at affymetrix.com Mon Dec 5 16:52:49 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 05 Dec 2005 08:52:49 -0800 Subject: [DAS2] Re: DAS/2 teleconference today In-Reply-To: Message-ID: Forgot to note the time: 9:30am PDT, 12:30pm EDT, 5:30pm GMT Steve > From: Steve Chervitz > Date: Mon, 05 Dec 2005 08:50:56 -0800 > To: DAS/2 > Conversation: DAS/2 teleconference today > Subject: DAS/2 teleconference today > > Today's agenda: implementation status reports. > > Dialin (US): 800-531-3250 > Dialin (Intl): 303-928-2693 > Conference ID: 2879055 > > Steve From Gregg_Helt at affymetrix.com Mon Dec 5 17:13:16 2005 From: Gregg_Helt at affymetrix.com (Helt,Gregg) Date: Mon, 5 Dec 2005 09:13:16 -0800 Subject: What are regions for? (was Re: [DAS2] DAS intro) Message-ID: > -----Original Message----- > From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open- > bio.org] On Behalf Of Andrew Dalke > Sent: Tuesday, November 29, 2005 4:02 PM > To: DAS/2 > Subject: What are regions for? (was Re: [DAS2] DAS intro) > > Ed: > > I understand this as talking about coordinates in general, not the > > elements or "pos" attributes in the spec. Suzi specifically > > mentions chromosomes and contigs; one can definitely be backwards with > > respect to the other. But top-level regions in an assembly would > > probably all be chromosomes or all be contigs, rather than a mixture. > > I'm trying to figure out when people use the /region. Okay, for now ignore the whole issue of assembly. The need for something like /region doesn't depend on different levels of assembly. I do think handling assembly information is necessary, but that's for a different post. In the current spec the ".../region" query is the only way to _efficiently_ discover the set of sequences that can be used for region/sequence-based filters in feature queries. 
Pretty much any client that wants to restrict feature queries by sequence needs to use it. Now you _can_ determine this same info via an unqualified ".../sequence" query but then you're retrieving all the residues for each sequence -- this is about as inefficient as you can get. Another alternative to the current approach would be to combine /region and /sequence into one type of query, but to add modifiers (format param?) that specify what to return: .../sequence?format=x-das-regions (or something similar) .../sequence?format=fasta We would need to specify at least these two different formats to allow for both efficient retrieval of minimal information about the set of seqs and retrieval of sequence residues. ... > My questions, to summarize, are: > - why do we need a /region space when we can > 1. point directly to a sequence (for chromosome regions) and/or > 2. point to a "contig" or "assembly" or "region" feature type > (for other regions) > > - When would someone have regions which have more than one of > contigs, ESTs and chromosomes? Especially given that this > is the genome spec, so chromosome-level info is known, at > least enough for a rough assembly. > > In other words, what are regions for? > I'm really only addressing question 1.1, as I said before I think assembly is a separate issue. gregg From suzi at fruitfly.org Mon Dec 5 16:53:02 2005 From: suzi at fruitfly.org (Suzanna Lewis) Date: Mon, 5 Dec 2005 08:53:02 -0800 Subject: [DAS2] DAS/2 teleconference today In-Reply-To: References: Message-ID: in just a few minutes? or is it on the half hour? On Dec 5, 2005, at 8:50 AM, Steve Chervitz wrote: > Today's agenda: implementation status reports. 
> > Dialin (US): 800-531-3250 > Dialin (Intl): 303-928-2693 > Conference ID: 2879055 > > Steve > > _______________________________________________ > DAS2 mailing list > DAS2 at portal.open-bio.org > http://portal.open-bio.org/mailman/listinfo/das2 From dalke at dalkescientific.com Mon Dec 5 17:43:04 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Mon, 5 Dec 2005 18:43:04 +0100 Subject: What are regions for? (was Re: [DAS2] DAS intro) In-Reply-To: References: Message-ID: <76364b7909b0d4e7df3fc4bc649d10de@dalkescientific.com> Gregg: > Okay, for now ignore the whole issue of assembly. The need for > something like /region doesn't depend on different levels of assembly. > I do think handling assembly information is necessary, but that's for a > different post. Okay. > In the current spec the ".../region" query is the only way to > _efficiently_ discover the set of sequences that can be used for > region/sequence-based filters in feature queries. What's wrong with $VERSIONED_SOURCE/feature?type=region That is, regions are as specific type of feature. Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Mon Dec 5 20:42:44 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 05 Dec 2005 12:42:44 -0800 Subject: [DAS2] DAS/2 weekly meeting notes from 5 Dec 2005 Message-ID: Notes from the weekly DAS/2 teleconference, 5 Dec 2005. $Id: das2-teleconf-2005-12-05.txt,v 1.2 2005/12/05 19:55:54 sac Exp $ Note taker: Steve Chervitz Attendees: Affy: Steve Chervitz, Ed E., Gregg Helt CSHL: Lincoln Stein UC Berkeley: Suzi Lewis Sanger: Andreas Prlic U Alabama: Ann Loraine Action items are flagged with '[A]'. These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. 
So don't consider these notes as complete minutes from the meeting,
but rather abbreviated, summarized versions of what was discussed.
There may be errors of commission and omission. Participants are
welcome to post comments and/or corrections to these as they see fit.

Today's topic: Implementation status reports
---------------------------------------------

Gregg
------
* Planning new IGB release in 2-3 weeks
* Revised IGB docs
* Updating the IGB DAS client
  - Want to make the DAS/1 iface more like DAS/2. Currently, they are
    isolated in the code and the interface. Maybe done w/in next 2
    weeks.

Question: Has anyone used the agile 2D java stuff?
( http://www.cs.umd.edu/hcil/agile2d/ )

AL: Hasn't used it but used jazz (piccolo). Couldn't figure how to get
    it to zoom in only 1D. They do 2D.
GH: Looking to speed up rendering in IGB. The Affy transcriptarium
    loses hardware acceleration when using normal java 2d/swing stuff
    (one cpu controlling 8 monitors).

agile 2d background: Billed as a drop in replacement for java 2d that
runs on top of open GL. Significant speed ups on solaris, other
platforms. But Gregg finds it is slower on windows. Requires open GL
for java that does lots of native stuff underneath. May work on mac OS
X. The focus is on windows in order to use spotfire and other
windows-only apps.

AP: Does it use double buffering?
GH: Yes. Swing does this. Builds volatile image. Hooks into hardware
    acceleration graphics card. If you can construct a volatile img,
    rendering goes through graphics hardware, which is a big speed up.
    But swing can't deal w/ multiple graphics cards (Direct X
    limitation?). Open gl may deal with things better in this regard.
AL: Maybe get a consultant. This is very specialized stuff.
GH: Getting some time from sun java graphics engineers could help.

Ed E
----
* Debugging for DAS/1 in IGB client
  - Fixed bug in the Affy DAS/1 server.
  - Also fixed a few bugs in parser (concerned note, length tags -- not
    heavily used).

Steve
------
* Working on hooking up tomcat to apache to enable apache to handle
  requests that are fed to an underlying servlet engine.
  - Have made progress but need to do more testing before installing on
    netaffxdas server machine.

GH: Ultimate goal is to release the Affy DAS/2 server as a servlet
    using standard configuration. Isn't this servlet-under-apache
    configuration fairly standard?
AP: We provide a war file. Users can plug this into their app server.
SC: The apache-tomcat config is quite common for ISPs or other
    situations where you need to redirect apache to different servlets
    depending on different companies, products, etc.
SC: Assuming the apache configuration is ready, is the DAS/2 server
    code ready to go?
GH: Depends on how much of the spec we want to support. Estimates 6mos
    to 1yr before releasing our server. Also probably just release
    DAS/2. DAS/1 is also just partial and does some custom things.
    Better to focus on fleshing out DAS/2 server.
SC: Possible complexities for apache/tomcat due to our port forwarding
    schemes (DAS/1 and DAS/2 servers running on different ports).
GH: There's no reason we can't run these servers on different machines.
GH: Steve's time for DAS work?
SC: Have to transition back to NetAffx now, but can still proceed with
    DAS work, just less cpu cycles. Can still take and post notes.

Ann
----
* Working on new arabidopsis 2010 proposal from last year.
  - Main aim of grant: setting up data archive for arab, and for people
    to add to it. Currently scattered. People put data on their website
    and move on. No easy way to collect it in one place.
  - Setting up a web service at UAB, give it an ID, info about id
    (e.g., whose affy, tair, agi). The server will return synonyms for
    that ID. Goal is to be a backend for IGB to do searches for an ID.
    Will talk to the UAB server via some data format.
  - Grant will include funding for uab and Affy (usability enhancements
    for IGB as front end for data repository, figuring out protocol for
    talking to id server)

GH: Will you run DAS/2 servers at UAB?
AL: Yes, we'll set up the servers, but not do server development
    itself. Can test and offer feedback from installation issues, etc.
GH: Will data repository be in Alabama?
AL: Yes. We have good sysadmin support. Originally TAIR would do it,
    but we've built things up since then. Grant will be smaller scale
    than last year. No comparative genomics viewers. Sticking to core
    expertise. Focus on getting one genome right. Also funding for
    genoviz sdk. Library upkeep. Tutorials.
GH: Reimplement IGB on top of piccolo!
AL: Looking to Affy on fancy graphics stuff since they know the
    libraries better than anyone else.
GH: Regarding your synonym resolver - have you talked to suzi about
    what they're doing?
AL: No. Looked at the flynome link steve mentioned
    (http://flynome.com). If there's a standard out there we'd love to
    use it.
LS: There is Gene Seer developed here - a synonym database for gene
    IDs. Supports many-to-many associations between synonyms and the
    canonical forms. No interface besides a web-form. May make sense to
    provide access to it via DAS/2.
GH: Layering a DAS/2 interface on top of the synonym service might be
    nice.

[A] Ann will help develop gene synonym server on top of DAS/2

EE: In IGB we have a need for synonym service for genome builds (e.g.
    hg16 = build34). Currently a hack in IGB, would be nice to do via
    DAS.
AL: Plan to start with arab first.
GH: Using Gene Seer scheme for your storage you will get a lot of other
    genomes for free.
AL: Is it open source?
LS: Yes, but it's not my code (ravi at cshl.edu). Heavily used in cshl.
    Loaded with all model orgs. This is now public:
    http://geneseer.cshl.edu
    - published in bioinformatics.
    - no arabidopsis (rat, mouse, human, celegans, yeast).

[A] Ann will contact Ravi re: Gene Seer & get letter to submit w/ her
    proposal.

SC: Does Gene Seer contain LSIDs?
LS: No, but could be added as a synonym. Every name has a species, and
    unique type (embl, ncbi). E.g., you can search for rad5 - in all
    species, or in yeast, or as a locus name in yeast. Used for the
    RNAi libraries here.
AL: Is anyone else building such a resource?
LS: Google maybe....

Andreas
-------
* Not much coding for DAS/2. Waiting for issues with spec to be
  discussed (next week). DAS/2 interface should be easy to add.

GH: Spec is changing, so impl is hard to do.
AP: Database and java server-side code has been running a while. A new
    interface should be easy to do.

Andrew (via Gregg)
------------------
* Planning to get the registry-related stuff formalized in the spec
  soon.
* Working on setting up web service for validation on open-bio server
  with Chris D.
* Updating spec but is slow going. Will join the call next week.

LS: Who's in charge of updating the spec now -- me, Steve, Andrew?

[A] Andrew is now responsible for updating the spec.

Lincoln
--------
* Update on the NCI DAS/2 project:
  - Awarded grant for (funded by NCI) after 1 year negotiating.
  - caBIG grid depends on caBIO java lib/API from NCI. They had dropped
    all DAS support when going from caCORE v2 -> v3.
  - Brian Gilman proposes to add DAS/2 support to caCORE v3.
    o First, create plugin API for caCORE. Will allow you to add in
      class libs without recompiling java code. Permits caCORE to speak
      arbitrary protocols (~3 mos work).
    o Then implement DAS/2 plugin (~3 mos work).
    o Starting on 12/17.
    o Will use Alan Day's biopackages DAS/2 server (gmod), serving hap
      map through that. Client will be caBIO.

----
[A] Discuss spec issues for next week's teleconf: regions, seq,
    registry, etc.
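The synonym lookup discussed in these notes (many-to-many associations between synonyms and canonical forms, every name carrying a species and a type, searchable across all species or narrowed down) can be illustrated with a small in-memory index. This is only a sketch of the idea in Python: it is not Gene Seer's actual schema or interface, and the class name, method names, and canonical IDs are all made up for the example.

```python
from collections import defaultdict


class SynonymIndex:
    """Toy many-to-many synonym index; NOT the Gene Seer data model."""

    def __init__(self):
        # lower-cased name -> set of (species, name_type, canonical_id)
        self._by_name = defaultdict(set)

    def add(self, name, species, name_type, canonical_id):
        """Register one synonym for a canonical ID."""
        self._by_name[name.lower()].add((species, name_type, canonical_id))

    def lookup(self, name, species=None, name_type=None):
        """Return sorted canonical IDs, optionally narrowed by species/type."""
        hits = self._by_name.get(name.lower(), set())
        return sorted({
            canonical_id
            for hit_species, hit_type, canonical_id in hits
            if (species is None or hit_species == species)
            and (name_type is None or hit_type == name_type)
        })


if __name__ == "__main__":
    index = SynonymIndex()
    # Invented records: the same name can point at different canonical IDs.
    index.add("rad5", "yeast", "locus", "CANON:0001")
    index.add("rad5", "mouse", "ncbi", "CANON:0002")
    index.add("RAD5", "yeast", "embl", "CANON:0001")
    print(index.lookup("rad5"))                    # all species
    print(index.lookup("rad5", species="yeast"))   # yeast only
    print(index.lookup("rad5", "yeast", "locus"))  # locus name in yeast
```

The same shape would also cover the genome-build case mentioned by EE (hg16 = build34) by treating build names as one more name type.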
From dalke at dalkescientific.com Mon Dec 5 22:22:14 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Mon, 5 Dec 2005 23:22:14 +0100
Subject: [DAS2] transitioning from DAS/1 to DAS/2
Message-ID: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>

What should we do to make DAS/1 -> DAS/2 transitions easier?

For at least a year I expect there will be both DAS/1 and DAS/2
servers. We have or will have clients which can handle both
interfaces.

I expect the metadata server should support both server types.

This means it must describe that data source "X" uses interface "A".

I also expect that "A" can be:
  - DAS/1 genomic annotations
  - DAS/1.5 structure annotations
  - DAS/2 genomic annotations
  - DAS/2 protein annotations
  - DAS/2 structure

We've already talked about the need for identifying the different
DAS/2 data sources, based on Andreas Prlic's comments.

But we didn't talk about how to handle DAS/1 data sets in the metadata
server.

At present I don't have even a sketch of an answer.

In theory this registry could be extended to a registry of many
different data sources and API (eg, 'go to this Oracle database, which
uses the biosql schema'). That's well outside the scope of this
question.

Andrew
dalke at dalkescientific.com

From ed_erwin at affymetrix.com Mon Dec 5 22:33:08 2005
From: ed_erwin at affymetrix.com (Ed Erwin)
Date: Mon, 05 Dec 2005 14:33:08 -0800
Subject: [DAS2] transitioning from DAS/1 to DAS/2
In-Reply-To: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>
References: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com>
Message-ID: <4394C024.6010103@affymetrix.com>

Andrew Dalke wrote:
> What should we do to make DAS/1 -> DAS/2 transitions easier?
>
> For at least a year I expect there will be both DAS/1 and DAS/2
> servers. We have or will have clients which can handle both
> interfaces.

You are talking about the registry and discovery service, right?

How about only registering DAS/2 servers.
That should help speed users to transition to DAS/2. From ap3 at sanger.ac.uk Tue Dec 6 10:14:11 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Tue, 6 Dec 2005 10:14:11 +0000 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com> References: <913823a0cc11e248822242b73f9ecd13@dalkescientific.com> Message-ID: <04f347a5b6873fbda4978d866204048d@sanger.ac.uk> Hi Andrew! > For at least a year I expect there will be both DAS/1 and DAS/2 > servers. We have or will have clients which can handle both > interfaces. > I expect the metadata server should support both server types I think there should be two different points where to get a list of DAS1 and DAS2 servers. after all it is different protocols. e.g. something like das.sanger.ac.uk/registry/ ... the existing das1 registry das.sanger.ac.uk/registry/das2/ ... the upcoming das2 registry. > This means it must describe that data source "X" uses interface "A". > > I also expect that "A" can be: > - DAS/1 genomic annotations > - DAS/1.5 structure annotations In terms of meta description, There is not much difference between genome, sequence and structure capabilities with DAS1. "structure" and "alignment" are just 2 additional commands ("capabilities") that a das server speaks (like "sequence" or "feature). Still it is important to distinguish between the types of data. Therefore the das1- registry style coordinate systems contain the information if the type is chromosomal, protein sequence, or protein structure. > - DAS/2 genomic annotations > - DAS/2 protein annotations > - DAS/2 structure this are the DAS2 - "domains" This leads to a discussion I would like to have next monday: DAS1 is rather powerful, because it is possible to use the sequence and features commands in a way that works for both genomic and protein sequences. There is no need to distinguish between these in the DAS1 - world. 
I understand that one of the reason for creating the DAS2-"domains" is to have several "modules" which can be extended/developed independently. Still the genome -domain is very similar to what is needed for any kind of sequence. I would therefore like to discuss to rename the "genome" domain to "sequence". The information of which type of sequence, "genomic" or "protein sequence" should be provided via the source description. > But we didn't talk about how to handle DAS/1 data sets in > the metadata server. if the source description is done well, it can also be used for das1 servers. ( see my recent mail about the "meta" description, which could be used like that for das1) Something else I also would like to try is to provide a DAS2- proxy for DAS1 sources via the registry... I.e. you can make a DAS2 request to a URL at the registry, which is translated to a DAS1 request at the real server and then translated back again... Cheers, Andreas > ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From Steve_Chervitz at affymetrix.com Tue Dec 6 19:37:20 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 06 Dec 2005 11:37:20 -0800 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: <04f347a5b6873fbda4978d866204048d@sanger.ac.uk> Message-ID: Andreas Prlic wrote: > Hi Andrew! > ... >> - DAS/2 genomic annotations >> - DAS/2 protein annotations >> - DAS/2 structure > > this are the DAS2 - "domains" This leads to a discussion I would like > to have next monday: > > DAS1 is rather powerful, because it is possible to use the sequence and > features commands > in a way that works for both genomic and protein sequences. There is > no need to distinguish between these > in the DAS1 - world. 
> > I understand that one of the reason for creating the DAS2-"domains" is > to have several "modules" which can be > extended/developed independently. Still the genome -domain is very > similar to what is needed for any kind of > sequence. > > I would therefore like to discuss to rename the "genome" domain to > "sequence". > > The information of which type of sequence, "genomic" or "protein > sequence" should be provided via the > source description. I like this idea. The only nucleotide specific stuff in the DAS/2 retrieval spec is the region request. Strand designation in a location specifier is already optional. We'd may then want to change the 'sequence' request to something else, perhaps 'residues'? Steve From dalke at dalkescientific.com Tue Dec 6 23:43:10 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Wed, 7 Dec 2005 00:43:10 +0100 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: References: Message-ID: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> > Andreas Prlic wrote: >> I would therefore like to discuss to rename the "genome" domain to >> "sequence". >> >> The information of which type of sequence, "genomic" or "protein >> sequence" should be provided via the >> source description. Steve: > I like this idea. The only nucleotide specific stuff in the DAS/2 > retrieval > spec is the region request. Strand designation in a location specifier > is > already optional. > > We'd may then want to change the 'sequence' request to something else, > perhaps 'residues'? I've been thinking the same thing. Going one step further - what about dropping the name entirely? Consider this, with some of the xml: attributes removed for clarity. I've added a 'source_type' field in the element. This is what you get from a SOURCES request HTTP GET http://www.example.com/das2/ (note lack of 'genome' in that URL) That is, the SOURCES request returns information about genomic, protein sequence and structure databases. 
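The XML example that accompanied this message did not survive in the archive. As a rough illustration only, and not the markup from the original message, the following Python sketch shows how a client might group the entries of such a SOURCES response by 'source_type'. Only the SOURCE element name and the 'source_type' attribute come from the thread; the SOURCES root element, the 'id' attribute, the sample values other than 'genome', and the function name are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical response body. Only SOURCE and source_type are taken from
# the discussion; everything else here is invented for the example.
SAMPLE = """\
<SOURCES>
  <SOURCE id="volvox" source_type="genome"/>
  <SOURCE id="example_proteins" source_type="protein"/>
  <SOURCE id="example_structures" source_type="structure"/>
</SOURCES>
"""


def sources_by_type(xml_text):
    """Group SOURCE ids by their source_type attribute."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for source in root.iter("SOURCE"):
        # Sources without the attribute are collected under 'unknown'.
        grouped[source.get("source_type", "unknown")].append(source.get("id"))
    return dict(grouped)


if __name__ == "__main__":
    for source_type, ids in sources_by_type(SAMPLE).items():
        print(source_type, ids)
```

A client interested only in genomic data would then read the 'genome' entry and ignore the rest, without needing a domain name in the URL.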
If this occurs then there will need to be a few changes to the spec. For example, 'taxon' is probably only properly part of the genomic sources and not in the others, so perhaps move the taxon information into a subelement of those SOURCE elements with 'source_type' == 'genome'. http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118 ... Andrew dalke at dalkescientific.com From Steve_Chervitz at affymetrix.com Wed Dec 7 18:56:02 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Wed, 07 Dec 2005 10:56:02 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: Message-ID: Here's a proposal regarding sequence retrievals that apparently never made it to the list. (I was compiling a list of agenda items for next week's spec discussion when I noticed I sent this message only to myself...) Steve ------ Forwarded Message From: Steve Chervitz Date: Mon, 14 Nov 2005 13:38:44 -0800 To: Steve Chervitz Subject: Re: [DAS2] DAS/2 weekly meeting notes for 14 Nov 05 >From the notes of today's meeting (14 Nov 05): > LS: When you request versioned source from a server, it should say what > assembly coords it's working on and give a uri for that. In this case > there's no guarantee you can do a 'get' on that URI. > We want to say: > 1- what is unique uri for assembly (everyone agrees to share this) > 2- das URL for how to fetch it (some server's region url - trusted, > faithful copy with what is at ncbi). Diff servers could assert that > you can fetch it from various places. This raises another issue we didn't discuss: How about allowing some way to verify that the sequence data received from a given reference server are in fact faithful copies? Use case 1: Validate a given reference server as providing correct sequence data for a specific assembly (either the entire assembly or a specific chromosome). Use case 2: Verify that the sequence or subsequence I received from a specific sequence request is correct and complete. 
Case #1 requires that the official source of the assembly (or some other trusted reference server) publish checksums on each complete sequence it provides (e.g., each full-length chromosome of each assembly). Case #2 requires the ability to encode a checksum in a sequence response. But there are two issues here: validating the data transfer for the request and validating the correctness of the sequence or subsequence itself with respect to the original assembled sequence. The first issue of case #2 is already supported in the current spec, if the request specifies a format that incorporates a checksum (e.g., sequence/chr21?format=GCG). However, not all servers may support that format, yet they could support checksums. The second issue of case #2 is covered only for responses from trusted reference servers. To consider: 1. What do folks think about adding to the DAS/2 retrieval spec facilities supporting sequence data validation? (i.e., Add an optional checksum attribute in the REGION response.) 2. What do folks think about specifying a DAS2XML format for sequence requests (text/x-das-sequence+xml)? In addition to permitting an optional checksum attribute to address the above use case, it would add some consistency and flexibility to the spec, since at present, the default sequence response format is the only one that is not under our control (currently it's text/x-fasta). Steve ------ End of Forwarded Message From dalke at dalkescientific.com Wed Dec 7 23:22:56 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 8 Dec 2005 00:22:56 +0100 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: Message-ID: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Steve: > This raises another issue we didn't discuss: How about allowing some > way to verify that the sequence data received from a given reference > server are in fact faithful copies? > Use case 1: Validate a given reference server as providing correct > 1. 
What do folks think about adding to the DAS/2 retrieval spec
> facilities supporting sequence data validation? (i.e., Add an
> optional checksum attribute in the REGION response.)

How many people actually write client code which verifies the checksum of those formats which have a checksum? I know I never have. Bioperl's genbank.pm doesn't check the atcg counts, nor does swiss.pm check the CRC. (Both generate the checks; they just don't verify them.)

For those who have implemented checksum verification, how many times has that checksum detected an error in the data transmission?

There are already several layers of checksums in the network connection. One in Ethernet, another in IP, a third in TCP. Is another one useful?

As an example, HTTP and (I think) FTP don't use checksums. I've transferred many very large files via both and not had a problem. Rather, the only check I needed was to verify that I got all of the data, and HTTP provides that information in the header.

Now, I know that there are problems when you scale to large data transfers. I even remember talking with Gregg and Lincoln about this years ago. A friend of mine went to a presentation at Stanford that Bram Cohen gave about BitTorrent, and he commented that the four byte checksumming in TCP/IP isn't enough for his needs when you're trying to transfer a 4 gig file to 10,000 users.

We aren't in the terabyte data transfer range.

... doing research ...

But if it does become a concern, one solution is RFC 1864
  http://www.faqs.org/rfcs/rfc1864.html
which adds a "Content-MD5" header to the HTTP response, and describes how to use it. Another is RFC 3230
  http://www.scit.wlv.ac.uk/rfc/rfc32xx/RFC3230.html

As far as I can tell, very few people, if any, actually use those fields for anything. That serves as a sort of confirmation that data rarely gets corrupted at the TCP/IP level.

> 2.
What do folks think about specifying a DAS2XML format for sequence > requests (text/x-das-sequence+xml)? In addition to permitting an > optional checksum attribute to address the above use case, it would > add some consistency and flexibility to the spec, since at present, > the default sequence response format is the only one that is not > under > our control (currently it's text/x-fasta). As a consumer of this sort of data, I don't want to write another parser. It isn't just the parsing part - it's the effort of mapping to my program's data model. There's already a huge number of existing sequence file formats. What would another provide? Are some of them already extensible? Several of those formats are designed and developed by people involved with DAS. If it's important, extend GAME or GFF. As a spec writer, I don't really want to write that part of the spec. Andrew dalke at dalkescientific.com From ap3 at sanger.ac.uk Thu Dec 8 09:48:58 2005 From: ap3 at sanger.ac.uk (Andreas Prlic) Date: Thu, 8 Dec 2005 09:48:58 +0000 Subject: [DAS2] DAS2 source description In-Reply-To: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> References: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> Message-ID: The way Andrew suggests the source description looks already quite good to me. Could we add a couple things? * we have some people doing annotations on clones and scaffolds, which -regarding DAS- is essentially the same as annotating in chromosomal coordinates, but for the description a few other types of coordinate systems are needed. * there are a couple of sources that can speak multiple "coordinate systems", so the description should be able to deal with that. * It would be good to have something like an "authority" field in the coordinate systems. i.e. the institution who defines a set of reference objects. 
with this in mind one could do something like: taxon="http://www.ncbi.nlm.nih.gov/taxon-browser?id=9606" source_type="chromosome" authority_name="NCBI" > This would be the part that is needed for describing the actual data and then it would be good to have some other meta info for the sources as well: * which DAS commands does a source understand * a testcode (per namespace) that can be used to validate responses * some historical data like "has been available since" "was successfully validated the last time at" * a link back to the homepage of the group that provides the source for more detailed docu about the data * an email address to contact if there is a problem/question with the source * a "nickname" for a source that should be used in a DAS client to label tracks coming from that source. * some optional properties that can be added like "funded by ..." "GO evidence code: " > That is, the SOURCES request returns information about genomic, > protein sequence and structure databases. good. - plus a couple of others. this should be a restricted list. > If this occurs then there will need to be a few changes to the spec. > For example, 'taxon' is probably only properly part of the genomic > sources some people annotate protein sequences from a particular organism. e.g there is a DAS1 source that only annotates Fugu protein sequences Cheers, Andreas ----------------------------------------------------------------------- Andreas Prlic Wellcome Trust Sanger Institute Hinxton, Cambridge CB10 1SA, UK +44 (0) 1223 49 6891 From td2 at sanger.ac.uk Thu Dec 8 10:33:16 2005 From: td2 at sanger.ac.uk (Thomas Down) Date: Thu, 8 Dec 2005 10:33:16 +0000 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Message-ID: On 7 Dec 2005, at 23:22, Andrew Dalke wrote: > >> 2. 
What do folks think about specifying a DAS2XML format for sequence >> requests (text/x-das-sequence+xml)? In addition to permitting an >> optional checksum attribute to address the above use case, it >> would >> add some consistency and flexibility to the spec, since at >> present, >> the default sequence response format is the only one that is >> not under >> our control (currently it's text/x-fasta). > > As a consumer of this sort of data, I don't want to write another > parser. It isn't just the parsing part - it's the effort of mapping > to my program's data model. > > There's already a huge number of existing sequence file formats. > What would another provide? Are some of them already extensible? > > Several of those formats are designed and developed by people involved > with DAS. If it's important, extend GAME or GFF. Do GAME or GFF have a sequence representation? I thought they were both primarily feature-table formats (right now I'm having trouble finding the GAME documentation though...). The problem I have with Fasta format (other than the tendency of many data-providers to over-load the header line) is that there's no explicit marker for the alphabet and encoding of sequence data. This is pretty nasty for codebases like BioJava which want to present a richer view of sequence data than just a String. I'd certainly be in favour of a nice XML format that made alphabet information explicit. The DAS 1.5 DASSEQUENCE document has a moltype attribute which supports this (at least the three most important cases, DNA/RNA/ Protein -- there's not a standards-compliant way to add other alphabets though). I guess an alternative, more classically RESTful, way of doing things might be with MIME types: Content-Type: application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC I admit I'd prefer the XML though... Thomas. 
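A minimal sketch of how a client might read the alphabet information out of a Content-Type header like the one in Thomas's message. It assumes the parameter names `sequence-alphabet` and `sequence-encoding` from his example (they are not part of any DAS spec), and the helper name `parse_sequence_content_type` is made up for illustration. It relies only on the MIME parameter parsing in Python's standard library.

```python
from email.message import Message


def parse_sequence_content_type(header_value):
    """Split a Content-Type value into media type and sequence parameters.

    The parameter names are assumptions taken from the example header;
    a missing parameter comes back as None.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return {
        "media_type": msg.get_content_type(),
        "alphabet": msg.get_param("sequence-alphabet"),
        "encoding": msg.get_param("sequence-encoding"),
    }


info = parse_sequence_content_type(
    "application/fasta; sequence-alphabet=DNA; sequence-encoding=IUPAC"
)
print(info)
```

A client written this way could fall back to guessing the alphabet from the residues whenever `alphabet` is None, which is what plain FASTA forces today.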
From nomi at fruitfly.org Thu Dec 8 19:13:05 2005 From: nomi at fruitfly.org (Nomi Harris) Date: Thu, 8 Dec 2005 11:13:05 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com> Message-ID: <17304.34241.521199.4698@spongecake.lbl.gov> On 8 December 2005, Thomas Down wrote: > > There's already a huge number of existing sequence file formats. > > What would another provide? Are some of them already extensible? > > > > Several of those formats are designed and developed by people involved > > with DAS. If it's important, extend GAME or GFF. > > Do GAME or GFF have a sequence representation? I thought they were > both primarily feature-table formats GAME certainly has a sequence representation, and i think GFF3 must, though old GFF doesn't. 3R:1178000-1230000 Drosophila melanogaster AAGCCCACTATATTGCATTAAATTATGCGATAATTGATCAATTTTAAAGG ... > (right now I'm having trouble > finding the GAME documentation though...). http://www.fruitfly.org/annot/apollo/game.rng.txt Nomi From Steve_Chervitz at affymetrix.com Thu Dec 8 21:04:56 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Thu, 08 Dec 2005 13:04:56 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: Message-ID: On Thu, 8 Dec 2005, Thomas Down wrote: > > On 7 Dec 2005, at 23:22, Andrew Dalke wrote: >> >> Steve Chervitz wrote: >>> >>> 2. What do folks think about specifying a DAS2XML format for sequence >>> requests (text/x-das-sequence+xml)? In addition to permitting an optional >>> checksum attribute to address the above use case, it would add some >>> consistency and flexibility to the spec, since at present, the default >>> sequence response format is the only one that is not under our control >>> (currently it's text/x-fasta). >> >> As a consumer of this sort of data, I don't want to write another >> parser. It isn't just the parsing part - it's the effort of mapping >> to my program's data model. 
>> >> There's already a huge number of existing sequence file formats. >> What would another provide? Are some of them already extensible? I am also somewhat loath to add yet another sequence file format to the world. Seems reasonable to state that a DAS/2 server can supply sequence in an alternative format via requests such as: http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME There would have to be a way for a server to indicate what alternative formats is supports. We could use the same strategy as we do in the versioned source request, supplying a FORMAT element listing alternative formats. But where to put it? Perhaps in the regions request: For interoperability purposes, we'd should provide a controlled vocabulary of alternative formats and their types, at least for the commonly used ones. >> Several of those formats are designed and developed by people involved >> with DAS. If it's important, extend GAME or GFF. > > Do GAME or GFF have a sequence representation? I thought they were > both primarily feature-table formats (right now I'm having trouble > finding the GAME documentation though...). Here's a brief tour of some possibly extensible candidates: GFF - only represents features: http://song.sourceforge.net/gff3.shtml GAME - does encode sequence data as a simple string. Flybase/BDGP use GAME XML and appear to be the main users/maintainers. Suzi and Chris can elaborate more here, but I found link to an RNG schema in the Apollo FAQ: http://www.fruitfly.org/annot/apollo/game.rng.txt GAME notes: - The http://bioxml.org links are now obsolete. 
  Here's an old description containing such links:
  http://xml.coverpages.org/game.html
- GAME variants have arisen that have created incompatibilities in the bio* world:
  http://open-bio.org/pipermail/bioperl-l/2003-April/011988.html
- When I checked a FlyBase data file, it didn't point to a DTD:
  ftp://flybase.net/genomes/Drosophila_melanogaster/current/xml-game/

Otter - a sort of simplified GAME that also represents sequence:
http://www.sanger.ac.uk/Users/jgrg/otter_xml.html

XFF - models sequences and has alphabet support (Thomas: is this in use?):
http://www.biojava.org/thomasd/XFF/

INSDseq and EMBLxml - An XML format for GenBank/EMBL/DDBJ sequence data:
http://www.ebi.ac.uk/xembl/

BSML - Somewhat antiquated but is supported by the XEMBL service
http://www.bsml.org/
and in use by LabBook:
http://www.labbook.com/default.aspx

AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:
http://www.agavexml.org/

BIOML - Details are sketchy, appears to be used internally by Genomic Solutions which acquired Proteometrics, the originators of BIOML. Here are the most recent references I could find:
http://www.biomedcentral.com/1471-2105/5/25
http://www.genomicsolutions.com/showPage.php?title=Data%20Integration

> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data. This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String. I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).

Various data providers take all sorts of liberties with fasta sequence, e.g., sequences with no IDs, whitespace-containing IDs, space between the '>' and the ID, etc.

We might consider proscribing some conventions for what DAS considers proper fasta format. I put in a little bit of description of a DAS-acceptable fasta format here in the retrieval spec:
http://biodas.org/documents/das2/das2_get.html#sequence

Do we want to add more to this? Perhaps something about an optional description being separated from the ID by whitespace and consisting of any amount of free-form text.

Steve

> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>
> Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
>
> I admit I'd prefer the XML though...
>
>
> Thomas.
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2

From dalke at dalkescientific.com Sun Dec 11 18:40:46 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Sun, 11 Dec 2005 19:40:46 +0100
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To:
References: <5946bc5a646ab2e55da6455aac9edcc5@dalkescientific.com>
Message-ID: <50dd4457c1fb4038dda3f4563c92947e@dalkescientific.com>

Thomas:
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).

Others followed up on this. For me, I was confused. Even though Steve said "sequence retrieval" -- in the subject even -- I was thinking of feature formats.

I think that came to mind because I expect there to be more feature data transferred than sequence data, so if data corruption is a concern then the annotations are more likely to have problems. Or I may have been thinking about some of the formats (GenBank, Swiss-Prot) which combine the two, and have a checksum.
I still don't think checksum-identifiable data corruption is something we need to worry about. > The problem I have with Fasta format (other than the tendency of many > data-providers to over-load the header line) is that there's no > explicit marker for the alphabet and encoding of sequence data. *sigh* It seems like this never goes away. Biopython also has a "rich" alphabet property, designed to handle alternate alphabets, like 3-letter codes and secondary structure alphabets. Bioperl's seems more appropriate in practice - dna, protein, rna, and perhaps 'unknown'. In the context of DAS, this is not a problem. DAS 2.0 uses only genomic data, so all FASTA records will be of type 'dna'. It might be different with structure data where a single record may have all three alphabet types. (Though I only know of structures with 2 of the 3.) > I guess an alternative, more classically RESTful, way of doing things > might be with MIME types: > > Content-Type: application/fasta; sequence-alphabet=DNA; > sequence-encoding=IUPAC > > I admit I'd prefer the XML though... As I mentioned, for purposes of DAS 2.0 this isn't needed so I don't think we need to solve this problem. If we do, I think it's a nearly intractable problem. How does one register all the different possible alphabets? IUPAC dna/rna/protein covers most of it. Getting the other few percent is hard. Then making all the software to preserve or interconvert the different formats adds another layer of hard. There's a lot of social issues as well. Andrew dalke at dalkescientific.com From dalke at dalkescientific.com Sun Dec 11 19:26:21 2005 From: dalke at dalkescientific.com (Andrew Dalke) Date: Sun, 11 Dec 2005 20:26:21 +0100 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: References: Message-ID: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com> Steve: > I am also somewhat loath to add yet another sequence file format to the > world. 
> Seems reasonable to state that a DAS/2 server can supply sequence in
> an alternative format via requests such as:
>
> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME

That makes good sense to me.

> Here's a brief tour of some possibly extensible candidates:

Do you want to say this as:
  "The server must implement these sequence formats"
or
  "If the server implements one or more of these sequence formats then it must use the corresponding id and content-type."
?

Or say nothing and wait until several different servers implement this then standardize on what they do?

I don't think anyone here seriously wants the first. :)

The last is my favorite, then the middle one.

My stronger preference is to get a complete 2.0 spec out. Do you or other users need checksum validation of the sequence and/or alternate sequence formats in 2.0? What prevents you from extending existing HTTP headers or experimenting with extensions then submitting your experience for inclusion in future versions of the spec?

My sense is that this can wait.

> We might consider proscribing some conventions for what DAS considers
> proper fasta format. I put in a little bit of description of a
> DAS-acceptable fasta format here in the retrieval spec:
> http://biodas.org/documents/das2/das2_get.html#sequence

Do current DAS clients even use the header?

Will future ones use it? If so, why? Shouldn't all the information in a header be available as an annotation?

The Wikipedia entry for FASTA is pretty good.
http://en.wikipedia.org/wiki/Fasta_format

I had my students a few months ago find different FASTA definitions. Some disagreed with others. Wikipedia was the most complete.

Andrew
dalke at dalkescientific.com

From Gregg_Helt at affymetrix.com Mon Dec 12 17:28:46 2005
From: Gregg_Helt at affymetrix.com (Helt,Gregg)
Date: Mon, 12 Dec 2005 09:28:46 -0800
Subject: [DAS2] DAS/2 meeting agenda, Dec 12 2005
Message-ID:

Today we're focusing on spec issues.
Here's a few topics raised in the last two weeks: The .../region subtree Retrieval of sequence residues Support for different versions of DAS in the registry Revising top level of DAS tree (putting more info in source response) From ed_erwin at affymetrix.com Mon Dec 12 20:03:33 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 12 Dec 2005 12:03:33 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com> References: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com> Message-ID: <439DD795.1070406@affymetrix.com> Andrew Dalke wrote: > > The wikiepedia entry for FASTA is pretty good. > http://en.wikipedia.org/wiki/Fasta_format Aha! Now I know where those ^A characters in the NetAffx database came from! From ed_erwin at affymetrix.com Mon Dec 12 21:05:47 2005 From: ed_erwin at affymetrix.com (Ed Erwin) Date: Mon, 12 Dec 2005 13:05:47 -0800 Subject: [DAS2] transitioning from DAS/1 to DAS/2 In-Reply-To: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> References: <6314df07bc52c9828a4165e7b1060aee@dalkescientific.com> Message-ID: <439DE62B.6010002@affymetrix.com> Andrew Dalke wrote: > > That is, the SOURCES request returns information about genomic, > protein sequence and structure databases. > > If this occurs then there will need to be a few changes to the spec. > For example, 'taxon' is probably only properly part of the genomic > sources and not in the others, so perhaps move the taxon information > into a subelement of those SOURCE elements with 'source_type' == 'genome'. 
> > > source_type="genome"> > http://www.ncbi.nlm.nih.gov/taxon-browser?id=29118 > > > > > This a bit off-topic, but I noticed that links to the taxonomy browser need to be formatted like this: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3066 And note this disclaimer on that site: "The NCBI taxonomy database is not an authoritative source for nomenclature or classification" From Steve_Chervitz at affymetrix.com Mon Dec 12 21:33:00 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 12 Dec 2005 13:33:00 -0800 Subject: [DAS2] Sequence retrieval proposal In-Reply-To: <7f717b3ebc3306bfca3f004dcade093b@dalkescientific.com> Message-ID: On Sun, 11 Dec 2005 Andrew Dalke wrote: > Steve: >> I am also somewhat loath to add yet another sequence file format to the >> world. Seems reasonable to state that a DAS/2 server can supply >> sequence in >> an alternative format via requests such as: >> >> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME > > That makes good sense to me. > >> Here's a brief tour of some possibly extensible candidates: > > Do you want to say this as: > "The server must implement these sequence formats" > or > "If the server implements one or more of these sequence formats then > it must use the corresponding id and content-type." > ? > > Or say nothing and wait until several different servers implement > this then standardize on what they do? > > I don't think anyone here seriously wants the first. :) > > The last is my favorite, then the middle one. The last is fine with me. This is the approach we use for type-specific alternative feature formats: http://biodas.org/documents/das2/das2_get.html#types > My stronger preference is to get a complete 2.0 spec out. Do > you or other users need checksum validation of the sequence and/or > alternate sequence formats in 2.0? 
What prevents you from extending > existing HTTP headers or experimenting with extensions then > submitting your experience for inclusion in future versions of > the spec? > > My sense is that this can wait. Yep. Especially in light of this morning's teleconf (notes for which are on the way). This seems like a good place to invoke YAGNI ( http://keithdevens.com/quotes/YAGNI ). >> We might consider proscribing some conventions for what DAS considers proper >> fasta format. I put in a little bit of description of a DAS-acceptable fasta >> format here in the retrieval spec: >> http://biodas.org/documents/das2/das2_get.html#sequence > > Do current DAS clients even use the header? > > Will future ones use it? If so, why? Shouldn't all the information > in a header be available as an annotation? Don't know. Seems like it should be left to the client implementation to decide what to do with the header. The aim of the sequence request (soon to be 'residues') is to get sequence data, not annotations. If we're not saying what DAS/2 clients are supposed to do with the header info, and there are so many variations out there, we might consider stating that clients are free to ignore the header. Then if we do this, why use fasta format instead of raw sequence? Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1 sequence element has these attributes: id, start, stop, moltype, version. Does anyone know how DAS/1 clients make use of these from the seq response? > The wikiepedia entry for FASTA is pretty good. > http://en.wikipedia.org/wiki/Fasta_format Interesting. That more-than-one-header business seems evil. They give a good link for alternative sequence formats. Steve From Steve_Chervitz at affymetrix.com Tue Dec 13 01:38:14 2005 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Mon, 12 Dec 2005 17:38:14 -0800 Subject: [DAS2] DAS/2 weekly meeting notes from 12 Dec 2005 Message-ID: Notes from the weekly DAS/2 teleconference, 12 Dec 2005. 
$Id: das2-teleconf-2005-12-12.txt,v 1.1 2005/12/13 01:03:01 sac Exp $

Note taker: Steve Chervitz

Attendees:
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Andreas Prlic, Thomas Down
  Sweden: Andrew Dalke

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at das/das2/notes/2005. Instructions on how to access this repository are at http://biodas.org

DISCLAIMER: The note taker aims for completeness and accuracy, but these goals are not always achievable, given the desire to get the notes out with a rapid turnaround. So don't consider these notes as complete minutes from the meeting, but rather abbreviated, summarized versions of what was discussed. There may be errors of commission and omission. Participants are welcome to post comments and/or corrections to these as they see fit.

Today's topic: Spec Issues
--------------------------

* Regions
-----------
Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000388.html

AD: Can the region request be removed? It's just a type of feature.

LS: There are situations where we need to say "something lives in this region and you can't get base pairs for it." Examples: gaps, or chromosomal sequence based on mapping data only.

AP: How would start/stop be specified?

LS: Endpoints of a gap are specified in base pair coordinates. This is standard in AGP files. Can indicate approximate length. It's not a case of a feature with an ambiguous location, just a location that is not precisely defined.

GH: Does the spec allow decimal places in location, e.g., for recombination frequency?

LS: No. Still genome/base pair oriented. If we want to require the retrieval of bases, we could possibly have a convention where Ns could be returned.

AD: Didn't know DAS needed to handle this info. For features type=region, names could be returned, not necessarily sequence data.

LS: The operations we need to support:
  1) Return entry points for an interactive search - e.g., chromosome length fragments
  2) Assembly info (AGP)
  3) Bases (residues) given location on the sequence

TD: Why not return assembly as a set of features as DAS/1 does? Why do we need a special assembly communication format?

GH: You can't get the whole picture all at once. You have to get the top-level contigs, each of which has its own assembly, recursively. Lots of queries may be required.

TD, LS: Hierarchical features are supported. You could have one feature per chromosome of type=assembly. You then do a non-recursive request to get the top-level features, then do a recursive request to get the feature with all children.

GH: This is the chado approach where every sequence is a feature. I have trouble with this.

TD: The feature indicates an alignment. The region for the feature aligns to a piece of chromosome.

GH: How do you find out what the chromosomes are?

LS: Assembly fragment type could be used for children.

Currently in the DAS/2 spec:
  - a region request returns contigs a la the entry points list from DAS/1.
      volvox/Contig1
  - or in a finished assembly, this would be chromosome length things that IGB would present to the user to select for browsing. These are not necessarily chromosomes, just recommended entry points for browsing.

Feature-based approach:
  - do a feature query using filter type=assembly

AD: Why do we need region request?

GH: To get top-level entry points for browsing.

LS: Sequence ontology has these types that could appear as entry points:
  - assembly
  - assembly component
  - contig
  - supercontig
  - chromosome
  - chromosome arm

Problem: A naive browser comes into genome, doesn't know what the entry points are.

AD: Saying type=top-level is wrong. It should be a property.

LS: 'Entry point' or 'landmark' attribute.

GH: How do you get the entry points?

LS: Feature request with a filter for attribute='entry point', and type='assembly component'

AD: Possible trouble with people defining features at different servers from the one providing regions:
  - server 1 provides regions
  - server 2 provides other feature types
So you need to go to multiple servers.

TD: This is not a big change from DAS/1.

LS: gmod has chromosomes as features, this has never been a problem.

Advantages and drawbacks of Andrew's suggestion (regions as features):
  - simplifies the protocol
  - can't return AGP format files, must parse DAS2XML (or can only get AGP for a subcomponent of assembly).

GH: Can use the same alternative format approach we use for types request (optional FORMAT subelements). But then no server would be required to return it.

TD: Not a big deal because every client will be required to parse feature XML. Also, the top-level assembly won't be very large.

GH: AGP support is not that important.

LS: Was a request from UCSC (Jim Kent). OK to get rid of region and use feature.

GH: Still have trouble using feature to get region data because of the circular nature of referring to yourself as your coordinate system.

AD: You can still point to a sequence as your coordinate system.

GH: How do you know the size of the sequence without requesting the whole sequence? There's also the possibility of 0 vs 1-based coordinate confusion. Someone could provide an assembly top-level feature and declare it starts at 1, getting around our 0-based requirement for genomic features.

LS: They could, but will suffer the consequences of pervasive off-by-one errors.

Proposal: Abolish the region namespace (request/response):
  - Add special feature type 'assembly component'
  - assembly component has optional attribute 'entry point'
  - Response to this query must be fast
  - Servers must be able to handle attribute filters

GH: Not comfortable with this, and how gmod treats chromosomes being the same as features. Why?
Data modeling, e.g., the sequence symmetry concept of genometry used in
IGB. An annotation/feature is always described as a relation between one
or more sequences. The annotation only points to the sequence.

LS: In Bioperl GFF database and chado schema, entry-level sequences are
features that use their own coord system. Top-level sequence is a
feature with type=chromosome or contig. Limitation is that you need to
know what to use for type. Advantage is in relative addressing (e.g.,
get all genes within 1000 bases of other genes). Works when feature is
in its own coordinate system.

AD: There's a danger of becoming too generic, example from WebDAV. When
everything is a property, there's no structure.

LS: There is the risk of having too many magic attributes.

AD: We could keep the top-level or landmark request as a special alias
that retrieves a subset of the data -- just top-level entry points,
instead of having a special feature request. Would be the same as a
region without the extra stuff.

GH: Bad to have two ways to get same data.

LS: Regions as features is OK, but no top-level attribute.

Proposal: DAS-defined special feature type 'top level' or 'entry point'
that maps to SO assembly component. Hard-coded, special type that
returns entry type features.

AD: Is there multiple inheritance support? Are there features that
inherit from both SO and our special type? E.g., of type entry point and
contig?

LS: No. A data source must support type='das:entry point'. To get
top-level features, you ask to get features of this type. They can have
children to describe the assembly.

Trouble with this: Duplication, you now have features that appear as
type=entry point and as type=supercontig or chromosome. One is a
physical object, one is a navigation object. This trades a magic
attribute for a magic type.

AD: So we have a choice:
  - magic attribute
  - magic type
  - magic URL

LS: Likes special attribute the best.
Advantage is that you can tag whatever feature type you want to appear
as an entry point. Disadvantage is potential abuse and implementation
could be harder. Attribute filtering must be fast.

Use case: At an intermediate stage of a big assembly you can choose what
you want to be top-level, rather than creating a new database object, or
figuring out another way to make it appear in response to a region
request.

Vote:
  - GH: special URL (region)
  - LS, AD, TD: special attribute

AD: As benevolent dictator, decides that DAS/2 will employ a special
attribute to handle regions as features.

Question: What to do with the location attribute (now is a feature URL)?
Or do we get rid of the position attribute?

LS: LOC points to feature that establishes coord system and has
subranges of that feature. So the URL gets longer. Attributes specify
position of the feature. LOC is for feature space. It specifies the
reference system of the feature and where it starts relative to the
feature. Position attribute points to the sequence. Clients know to
parse the URL to get the start/end.

TD: In XML, LOC with attributes start, end, strand, seq.

GH: We now permit matching feature filters to allow combining these. So
we should keep the filter syntax. Feature loc syntax can be different.

[A]: Andrew: provide details for retrieving regions via feature request
  - need to get the feature the coordinates are relative to (contig)
  - need to get the bases, which may not be on the same server

SC: Has some philosophical issues with collapsing regions into features,
but willing to explore doing so for simplicity. Trouble is putting
objects with some physical correlate (sequence) at same level as objects
lacking such solid substrate (features).

GH: This discussion has created a lot of churn in the spec fairly late
in the game. We should be more settled by now.

ALL: General agreement.

[A]: Everyone make a push to stabilize the retrieval spec.
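
A minimal sketch of the "special attribute" decision above, assuming a
simple in-memory feature record. The Feature class, the attribute key
'entry point' stored as a boolean, and the sample features other than
volvox/Contig1 are hypothetical illustrations, not the DAS/2 XML schema
or filter syntax, which the notes leave to be specified.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical in-memory feature record (not the DAS/2 schema)."""
    id: str
    type: str  # e.g. a Sequence Ontology term such as "contig"
    attributes: dict = field(default_factory=dict)


def entry_points(features):
    """Return the features tagged with the assumed 'entry point' attribute.

    Any feature type can be tagged, which is the advantage LS describes:
    the tag is independent of the feature's type.
    """
    return [f for f in features if f.attributes.get("entry point")]


# Illustrative data only; volvox/Contig2 and gene1 are made up.
features = [
    Feature("volvox/Contig1", "contig", {"entry point": True}),
    Feature("gene1", "gene"),
    Feature("volvox/Contig2", "supercontig", {"entry point": True}),
]

for f in entry_points(features):
    print(f.id, f.type)
```

A browser that knows nothing about the genome would only need to ask
for the tagged features, whatever their types happen to be.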
* New topic: Rename DAS 'genome' domain to 'sequence'
-----------------------------------------------------
Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000394.html

AD: Why not remove the top-level domain completely? (das/genome becomes
just das).

GH: Need to know what data a given server has.

AP: As long as the source description provides info about what it's
about, should be sufficient.

GH: This pushes the URL data into the source type tag (this is the same
magic URL vs magic type vs magic attribute issue all over again...)

AD: If we get rid of it, a given server can provide different data
without special URLs.

GH: What you put on the URL determines the return type. Why don't you
like it?

AD: 1) 'genome' in the URL is extra fluff.
    2) saying you might need it in the future is a weak argument (ain't
       gonna need it).

GH: Most servers will provide one type of data.

AD: People who provide meta data might want to combine it into one
document for all data.

LS: Saying you're in 'genome space' is a contract for what coordinate
system is (positive integers, start, end, strand) and what type of
requests/responses are expected. If we jumble things up, it makes it
difficult for dealing with other systems (3D coords). The 'genome' space
is intended to cover both protein and DNA.

AD: The top-level DAS response would point to the versioned source, and
indicate that it has a sequence, and a top-level URL.

LS: Seems like an arbitrary decision.

AP: What about the original proposal, to simply change 'genome' to
'sequence'?

GH: OK with this.

[A]: Andrew (spec czar) change 'das/genome' domain to 'das/sequence'.
[A]: Andrew (spec czar) change 'sequence' request to 'residues'.

Other Issues:
-------------
LS: Concerned about big changes being made to spec at this date.

ALL: Agreed. Should have happened earlier, but the discussion is
important.

[A]: All - focus on spec issues again next week. No meeting in two weeks.
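
To make the two renaming action items concrete, here is a rough sketch
of how an old-style URL might map to the new names. It assumes the path
layout /das/genome/<source>/<version>/sequence seen in the example URL
discussed on this list; the function name rename_das2_url is
hypothetical and the final spec may lay out paths differently.

```python
from urllib.parse import urlsplit, urlunsplit


def rename_das2_url(url):
    """Apply 'das/genome' -> 'das/sequence' and 'sequence' -> 'residues'.

    Assumes the path contains the segments 'das', 'genome' in that order;
    URLs without them are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")

    # Find and rename the domain segment that follows 'das'.
    domain_idx = None
    for i in range(len(segments) - 1):
        if segments[i] == "das" and segments[i + 1] == "genome":
            domain_idx = i + 1
            segments[domain_idx] = "sequence"
            break
    if domain_idx is None:
        return url

    # Rename a trailing 'sequence' request, but never the domain itself.
    last = len(segments) - 1
    if last != domain_idx and segments[last] == "sequence":
        segments[last] = "residues"

    return urlunsplit(parts._replace(path="/".join(segments)))


print(rename_das2_url(
    "http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME"))
```

The query string is left untouched, so a format parameter survives the
rename.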
From nomi at fruitfly.org Tue Dec 13 18:27:01 2005
From: nomi at fruitfly.org (Nomi Harris)
Date: Tue, 13 Dec 2005 10:27:01 -0800
Subject: [DAS2] Link to GAME XML schema
Message-ID:

http://biodas.org/documents/das2/das2_get.html has a link to
http://flybase.net/annot/gamexml.dtd.txt for GAME XML documentation.
This link should be changed to
http://flybase.bio.indiana.edu/annot/game.rng.txt

Thanks,
Nomi

From dalke at dalkescientific.com Fri Dec 16 12:28:29 2005
From: dalke at dalkescientific.com (Andrew Dalke)
Date: Fri, 16 Dec 2005 13:28:29 +0100
Subject: [DAS2] DAS code sprint proposal
Message-ID: <76643bf6d537c2dade84350eb5d77b1d@dalkescientific.com>

Hi all,

I've been thinking more about the comments from Gregg and others
during the previous phone conference. He was concerned that there
are some major spec changes this late in the DAS/2 grant period.

It does seem rather late, but from what I've seen in spec development,
especially ones like this which combine people from, what, 6 or more
sites and will be used by many more people, it's not uncommon. It's
just not possible to emerge the spec full-blown like Athena from the
head of Zeus, any more than it's possible to write 10,000 lines of
code and have it compile the first time, much less run. Even with
peer review.

I think we have a solid idea of what the spec should look like - though
I'll have a busy few weeks assembling them into words. I don't think
the object model has had that much in the way of changes; mostly it's
been a question of pinning down a few details.

I think it's time to schedule a DAS code sprint, where the different
server and client people get together to implement the spec and use
that to provide feedback for the spec development. My feeling is that
people are ready for this too. I know I would rather code specs than
doc them!

Christmas is coming up and there's usually the week or so at the start
of the year where people are getting back up to speed.
That sounds like the end of January / early February is a good time.
For me the week of 6 February is the best. That's probably far enough
in the future too for people to clear out time.

My thought is to have two groups of people, one in Cambridge and one
in Emeryville, but I've not heard of a geographically split sprint
like that.

Andrew
dalke at dalkescientific.com

From ap3 at sanger.ac.uk Fri Dec 16 13:11:10 2005
From: ap3 at sanger.ac.uk (Andreas Prlic)
Date: Fri, 16 Dec 2005 13:11:10 +0000
Subject: [DAS2] DAS code sprint proposal
Message-ID:

Hi!

We (Thomas and me) think that this is a very good idea and we are happy
to organize our part here. The proposed week (February 6th) is also
fine.

Cheers,
Andreas

-----------------------------------------------------------------------

Andreas Prlic      Wellcome Trust Sanger Institute
                   Hinxton, Cambridge CB10 1SA, UK
                   +44 (0) 1223 49 6891

From rds at sanger.ac.uk Fri Dec 16 14:33:38 2005
From: rds at sanger.ac.uk (Roy Storey)
Date: Fri, 16 Dec 2005 14:33:38 +0000 (GMT)
Subject: [DAS2] DAS code sprint proposal
Message-ID:

Andreas,

You can count Ed and myself in as well. We'd very much like to get zmap
to be a DAS2 client and anything we can do to help hack....

Roy

On Fri, 16 Dec 2005, Andreas Prlic wrote:

> Hi!
>
> We (Thomas and me) think that this is a very good idea and we are
> happy to organize our part here. The proposed week (February 6th) is
> also fine.
>
> Cheers,
> Andreas
>
> -----------------------------------------------------------------------
>
> Andreas Prlic      Wellcome Trust Sanger Institute
>                    Hinxton, Cambridge CB10 1SA, UK
>                    +44 (0) 1223 49 6891
>
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2
>

From lstein at cshl.edu Mon Dec 19 17:10:30 2005
From: lstein at cshl.edu (Lincoln Stein)
Date: Mon, 19 Dec 2005 12:10:30 -0500
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To:
References:
Message-ID: <200512191210.31108.lstein@cshl.edu>

Hi,

Just CVS updated the spec today in anticipation of the telecon, and I
don't see any changes to the feature or region requests. Didn't we agree
to drop region?

Lincoln

On Monday 12 December 2005 04:33 pm, Steve Chervitz wrote:
> On Sun, 11 Dec 2005 Andrew Dalke wrote:
> > Steve:
> >> I am also somewhat loath to add yet another sequence file format to
> >> the world. Seems reasonable to state that a DAS/2 server can supply
> >> sequence in an alternative format via requests such as:
> >>
> >> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME
> >
> > That makes good sense to me.
> >
> >> Here's a brief tour of some possibly extensible candidates:
> >
> > Do you want to say this as:
> >   "The server must implement these sequence formats"
> > or
> >   "If the server implements one or more of these sequence formats
> >    then it must use the corresponding id and content-type."
> > ?
> >
> > Or say nothing and wait until several different servers implement
> > this then standardize on what they do?
> >
> > I don't think anyone here seriously wants the first. :)
> >
> > The last is my favorite, then the middle one.
>
> The last is fine with me. This is the approach we use for type-specific
> alternative feature formats:
> http://biodas.org/documents/das2/das2_get.html#types
>
> > My stronger preference is to get a complete 2.0 spec out.
> > Do you or other users need checksum validation of the sequence
> > and/or alternate sequence formats in 2.0? What prevents you from
> > extending existing HTTP headers or experimenting with extensions then
> > submitting your experience for inclusion in future versions of
> > the spec?
> >
> > My sense is that this can wait.
>
> Yep. Especially in light of this morning's teleconf (notes for which are
> on the way). This seems like a good place to invoke YAGNI
> ( http://keithdevens.com/quotes/YAGNI ).
>
> >> We might consider prescribing some conventions for what DAS considers
> >> proper fasta format. I put in a little bit of description of a
> >> DAS-acceptable fasta format here in the retrieval spec:
> >> http://biodas.org/documents/das2/das2_get.html#sequence
> >
> > Do current DAS clients even use the header?
> >
> > Will future ones use it? If so, why? Shouldn't all the information
> > in a header be available as an annotation?
>
> Don't know. Seems like it should be left to the client implementation to
> decide what to do with the header. The aim of the sequence request (soon
> to be 'residues') is to get sequence data, not annotations.
>
> If we're not saying what DAS/2 clients are supposed to do with the
> header info, and there are so many variations out there, we might
> consider stating that clients are free to ignore the header. Then if we
> do this, why use fasta format instead of raw sequence?
>
> Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1
> sequence element has these attributes: id, start, stop, moltype,
> version. Does anyone know how DAS/1 clients make use of these from the
> seq response?
>
> > The Wikipedia entry for FASTA is pretty good.
> > http://en.wikipedia.org/wiki/Fasta_format
>
> Interesting. That more-than-one-header business seems evil. They give a
> good link for alternative sequence formats.
>
> Steve

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu

From suzi at fruitfly.org Mon Dec 19 18:18:13 2005
From: suzi at fruitfly.org (Suzanna Lewis)
Date: Mon, 19 Dec 2005 10:18:13 -0800
Subject: [DAS2] Sequence retrieval proposal
In-Reply-To: <200512191210.31108.lstein@cshl.edu>
References: <200512191210.31108.lstein@cshl.edu>
Message-ID:

yikes, i lost track of the time. i had planned on being on the call.
sorry about that.

On Dec 19, 2005, at 9:10 AM, Lincoln Stein wrote:

> Hi,
>
> Just CVS updated the spec today in anticipation of the telecon, and I
> don't see any changes to the feature or region requests. Didn't we
> agree to drop region?
>
> Lincoln
>
> On Monday 12 December 2005 04:33 pm, Steve Chervitz wrote:
>> On Sun, 11 Dec 2005 Andrew Dalke wrote:
>>> Steve:
>>>> I am also somewhat loath to add yet another sequence file format to
>>>> the world. Seems reasonable to state that a DAS/2 server can supply
>>>> sequence in an alternative format via requests such as:
>>>>
>>>> http://www.wormbase.org/das/genome/volvox/1/sequence?format=GAME
>>>
>>> That makes good sense to me.
>>>
>>>> Here's a brief tour of some possibly extensible candidates:
>>>
>>> Do you want to say this as:
>>>   "The server must implement these sequence formats"
>>> or
>>>   "If the server implements one or more of these sequence formats
>>>    then it must use the corresponding id and content-type."
>>> ?
>>>
>>> Or say nothing and wait until several different servers implement
>>> this then standardize on what they do?
>>>
>>> I don't think anyone here seriously wants the first. :)
>>>
>>> The last is my favorite, then the middle one.
>>
>> The last is fine with me.
>> This is the approach we use for type-specific
>> alternative feature formats:
>> http://biodas.org/documents/das2/das2_get.html#types
>>
>>> My stronger preference is to get a complete 2.0 spec out. Do
>>> you or other users need checksum validation of the sequence and/or
>>> alternate sequence formats in 2.0? What prevents you from extending
>>> existing HTTP headers or experimenting with extensions then
>>> submitting your experience for inclusion in future versions of
>>> the spec?
>>>
>>> My sense is that this can wait.
>>
>> Yep. Especially in light of this morning's teleconf (notes for which
>> are on the way). This seems like a good place to invoke YAGNI
>> ( http://keithdevens.com/quotes/YAGNI ).
>>
>>>> We might consider prescribing some conventions for what DAS
>>>> considers proper fasta format. I put in a little bit of description
>>>> of a DAS-acceptable fasta format here in the retrieval spec:
>>>> http://biodas.org/documents/das2/das2_get.html#sequence
>>>
>>> Do current DAS clients even use the header?
>>>
>>> Will future ones use it? If so, why? Shouldn't all the information
>>> in a header be available as an annotation?
>>
>> Don't know. Seems like it should be left to the client implementation
>> to decide what to do with the header. The aim of the sequence request
>> (soon to be 'residues') is to get sequence data, not annotations.
>>
>> If we're not saying what DAS/2 clients are supposed to do with the
>> header info, and there are so many variations out there, we might
>> consider stating that clients are free to ignore the header. Then if
>> we do this, why use fasta format instead of raw sequence?
>>
>> Btw, DAS/1 used an XML formatted response for sequence data. The DAS/1
>> sequence element has these attributes: id, start, stop, moltype,
>> version. Does anyone know how DAS/1 clients make use of these from the
>> seq response?
>>
>>> The Wikipedia entry for FASTA is pretty good.
>>> http://en.wikipedia.org/wiki/Fasta_format
>>
>> Interesting. That more-than-one-header business seems evil. They give
>> a good link for alternative sequence formats.
>>
>> Steve
>
> --
> Lincoln D. Stein
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> FOR URGENT MESSAGES & SCHEDULING,
> PLEASE CONTACT MY ASSISTANT,
> SANDRA MICHELSEN, AT michelse at cshl.edu
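
The suggestion quoted above, that clients are free to ignore the FASTA
header, can be sketched roughly as follows. This is only an
illustration under assumptions: the function name residues_from_fasta
and the sample header text are hypothetical, and the thread does not
settle what a DAS-acceptable FASTA response looks like.

```python
def residues_from_fasta(text):
    """Return the residues of the first record, ignoring header lines.

    Accepts zero, one, or several '>' header lines before the sequence,
    so raw sequence text and FASTA text give the same result.
    """
    residues = []
    seen_sequence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_sequence:
                break      # a second record begins; keep only the first
            continue       # skip header(s) that precede the sequence
        seen_sequence = True
        residues.append(line)
    return "".join(residues)


fasta = ">volvox/Contig1 hypothetical header text\nACGT\nTTGA\n"
raw = "ACGT\nTTGA\n"
print(residues_from_fasta(fasta) == residues_from_fasta(raw))
```

A client written this way does not care whether the server sends a
header at all, which is the point of Steve's question about using raw
sequence instead.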