[DAS2] Sequence retrieval proposal
Steve_Chervitz at affymetrix.com
Thu Dec 8 21:04:56 UTC 2005
On Thu, 8 Dec 2005, Thomas Down wrote:
> On 7 Dec 2005, at 23:22, Andrew Dalke wrote:
>> Steve Chervitz wrote:
>>> 2. What do folks think about specifying a DAS2XML format for sequence
>>> requests (text/x-das-sequence+xml)? In addition to permitting an optional
>>> checksum attribute to address the above use case, it would add some
>>> consistency and flexibility to the spec, since at present, the default
>>> sequence response format is the only one that is not under our control
>>> (currently it's text/x-fasta).
>> As a consumer of this sort of data, I don't want to write another
>> parser. It isn't just the parsing part - it's the effort of mapping
>> to my program's data model.
>> There's already a huge number of existing sequence file formats.
>> What would another provide? Are some of them already extensible?
I am also somewhat loath to add yet another sequence file format to the
world. Seems reasonable to state that a DAS/2 server can supply sequence in
an alternative format via requests such as:
There would have to be a way for a server to indicate which alternative
formats it supports. We could use the same strategy as we do in the
versioned source request, supplying a FORMAT element listing alternative
formats. But where to put it? Perhaps in the regions request:
<REGION id="sequence/ctg1" ...>
  <FORMAT id="game" type="application/x-game+xml" />
  <FORMAT id="otter" type="application/x-otter+xml" />
</REGION>
For interoperability purposes, we should provide a controlled vocabulary
of alternative formats and their types, at least for the commonly used ones.
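
To make the idea concrete, here is a rough sketch of how a client might read
such a FORMAT listing and choose a format. It assumes the REGION/FORMAT layout
suggested above (the elided REGION attributes are left out), and the function
names supported_formats and pick_format are made up for illustration. Falling
back to text/x-fasta reflects the current default, not anything decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical regions response fragment, following the layout proposed above.
REGION_XML = """
<REGION id="sequence/ctg1">
  <FORMAT id="game" type="application/x-game+xml" />
  <FORMAT id="otter" type="application/x-otter+xml" />
</REGION>
"""


def supported_formats(region_xml):
    """Return a dict mapping each FORMAT id to its MIME type."""
    region = ET.fromstring(region_xml)
    return {fmt.get("id"): fmt.get("type") for fmt in region.findall("FORMAT")}


def pick_format(formats, preferred, default="text/x-fasta"):
    """Return the MIME type of the first preferred format the server lists.

    Falls back to `default` (assumed here to be fasta) if none is offered.
    """
    for name in preferred:
        if name in formats:
            return formats[name]
    return default


formats = supported_formats(REGION_XML)
print(pick_format(formats, ["bsml", "otter"]))
```

A controlled vocabulary would matter here because the client matches on the
FORMAT id; without agreed ids, two servers could advertise the same format
under different names.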
>> Several of those formats are designed and developed by people involved
>> with DAS. If it's important, extend GAME or GFF.
> Do GAME or GFF have a sequence representation? I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).
Here's a brief tour of some possibly extensible candidates:
GFF - only represents features: http://song.sourceforge.net/gff3.shtml
GAME - does encode sequence data as a simple string.
Flybase/BDGP use GAME XML and appear to be the main users/maintainers. Suzi
and Chris can elaborate more here, but I found a link to an RNG schema in the
- The http://bioxml.org links are now obsolete. Here's an old description
containing such links: http://xml.coverpages.org/game.html
- GAME variants have arisen that have created incompatibilities in the bio*
- When I checked a flybase data file, it didn't point to a DTD:
Otter - a sort of simplified GAME that also represents sequence:
XFF - models sequences and has alphabet support (Thomas: is this in use?):
INSDseq and EMBLxml - XML formats for GenBank/EMBL/DDBJ sequence data:
BSML - Somewhat antiquated but is supported by the XEMBL service
http://www.bsml.org/ and in use by LabBook:
AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:
BIOML - Details are sketchy; it appears to be used internally by Genomic
Solutions, which acquired Proteometrics, the originators of BIOML. Here are the
most recent references I could find:
> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data. This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String. I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).
Various data providers take all sorts of liberties with fasta sequence,
e.g., sequences with no IDs, whitespace-containing IDs, space between the
'>' and the ID, etc.
We might consider prescribing some conventions for what DAS considers proper
fasta format. I put a brief description of a DAS-acceptable fasta
format in the retrieval spec:
Do we want to add more to this? Perhaps something about an optional
description being separated from the ID by whitespace and consisting of any
amount of free-form text.
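
As an illustration of what such a convention could look like in practice, here
is a minimal sketch of a header-line check. The rule it encodes is only an
assumption based on the points above: '>' immediately followed by a
whitespace-free ID, then an optional description separated from the ID by
whitespace. The name parse_fasta_header is hypothetical and not part of any
spec.

```python
import re

# Assumed convention: '>' + non-whitespace ID, then optional
# whitespace-separated free-form description.
HEADER_RE = re.compile(r"^>(\S+)(?:\s+(.*))?$")


def parse_fasta_header(line):
    """Return (seq_id, description) or raise ValueError for a bad header."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a DAS-acceptable fasta header: %r" % line)
    seq_id, description = match.groups()
    return seq_id, (description or "").strip()


print(parse_fasta_header(">ctg1 example contig, free-form text"))
```

Under this rule a header with no ID (">") or with a space between the '>' and
the ID ("> ctg1") is rejected, and an ID can never contain whitespace because
everything after the first whitespace is treated as description.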
> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
> Content-Type: application/fasta; sequence-alphabet=DNA;
> I admit I'd prefer the XML though...
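
For comparison, here is a small sketch of how a client might pull the alphabet
out of a Content-Type header like the one Thomas shows. The sequence-alphabet
parameter is his suggestion, not an agreed part of the spec, and
parse_content_type is a made-up helper. It is deliberately simplified and does
not handle quoted values that contain semicolons.

```python
def parse_content_type(header_value):
    """Split a Content-Type value into (media_type, params dict)."""
    parts = [part.strip() for part in header_value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';' as in the example header
        name, _, value = part.partition("=")
        params[name.strip().lower()] = value.strip().strip('"')
    return media_type, params


media_type, params = parse_content_type(
    "application/fasta; sequence-alphabet=DNA;"
)
# Returns None if the server did not state an alphabet.
alphabet = params.get("sequence-alphabet")
print(media_type, alphabet)
```

The client still has to decide what to do when the parameter is absent, which
is the same ambiguity plain fasta has today.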