[DAS2] Sequence retrieval proposal

Steve Chervitz Steve_Chervitz at affymetrix.com
Thu Dec 8 21:04:56 UTC 2005

On Thu, 8 Dec 2005, Thomas Down wrote:
> On 7 Dec 2005, at 23:22, Andrew Dalke wrote:
>> Steve Chervitz wrote:
>>> 2. What do folks think about specifying a DAS2XML format for sequence
>>> requests (text/x-das-sequence+xml)? In addition to permitting an optional
>>> checksum attribute to address the above use case, it  would add some
>>> consistency and flexibility to the spec, since at  present, the default
>>> sequence response format is the only one that is  not under our control
>>> (currently it's text/x-fasta).
>> As a consumer of this sort of data, I don't want to write another
>> parser.  It isn't just the parsing part - it's the effort of mapping
>> to my program's data model.
>> There's already a huge number of existing sequence file formats.
>> What would another provide?  Are some of them already extensible?

I am also somewhat loath to add yet another sequence file format to the
world. Seems reasonable to state that a DAS/2 server can supply sequence in
an alternative format via requests such as:


There would have to be a way for a server to indicate what alternative
formats it supports. We could use the same strategy as we do in the
versioned source request, supplying a FORMAT element listing alternative
formats. But where to put it? Perhaps in the regions request:

   <REGION id="sequence/ctg1" ...>
     <FORMAT id="game"    type="application/x-game+xml" />
     <FORMAT id="otter"    type="application/x-otter+xml" />

For interoperability purposes, we should provide a controlled vocabulary
of alternative formats and their types, at least for the commonly used ones.
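
To make the idea concrete, here is a rough Python sketch of how a client
might read such FORMAT elements and check them against a controlled
vocabulary. The closed REGION element, the KNOWN_FORMATS table, and both
function names are assumptions for illustration only; nothing here is in
the spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical, completed version of the fragment above. The closing tag
# and the lack of other REGION attributes are assumptions.
REGION_XML = """
<REGION id="sequence/ctg1">
  <FORMAT id="game" type="application/x-game+xml" />
  <FORMAT id="otter" type="application/x-otter+xml" />
</REGION>
"""

# Illustrative controlled vocabulary: format id -> expected MIME type.
KNOWN_FORMATS = {
    "fasta": "text/x-fasta",
    "game": "application/x-game+xml",
    "otter": "application/x-otter+xml",
}


def supported_formats(region_xml):
    """Return {format id: MIME type} for the FORMAT children of a REGION."""
    region = ET.fromstring(region_xml.strip())
    return {f.get("id"): f.get("type") for f in region.findall("FORMAT")}


def unknown_formats(formats):
    """Return ids that are absent from, or disagree with, the vocabulary."""
    return sorted(
        fid for fid, mime in formats.items() if KNOWN_FORMATS.get(fid) != mime
    )
```

A client could call unknown_formats(supported_formats(xml)) and ignore
whatever comes back, falling back to the default fasta response.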

>> Several of those formats are designed and developed by people involved
>> with DAS.  If it's important, extend GAME or GFF.
> Do GAME or GFF have a sequence representation?  I thought they were
> both primarily feature-table formats (right now I'm having trouble
> finding the GAME documentation though...).

Here's a brief tour of some possibly extensible candidates:

GFF - only represents features: http://song.sourceforge.net/gff3.shtml

GAME - does encode sequence data as a simple string.
Flybase/BDGP use GAME XML and appear to be the main users/maintainers. Suzi
and Chris can elaborate more here, but I found a link to an RNG schema in the
Apollo FAQ: 

GAME notes:
- The http://bioxml.org links are now obsolete. Here's an old description
containing such links: http://xml.coverpages.org/game.html
- GAME variants have arisen that have created incompatibilities in the bio*
world: http://open-bio.org/pipermail/bioperl-l/2003-April/011988.html
- When I checked a flybase data file, it didn't point to a DTD:

Otter - a sort of simplified GAME that also represents sequence:

XFF - models sequences and has alphabet support (Thomas: is this in use?):

INSDseq and EMBLxml - XML formats for GenBank/EMBL/DDBJ sequence data:

BSML - Somewhat antiquated but is supported by the XEMBL service
http://www.bsml.org/ and in use by LabBook:

AGAVE - From DoubleTwist - now defunct, but also supported by XEMBL:

BIOML - Details are sketchy, appears to be used internally by Genomic
Solutions, which acquired Proteometrics, the originators of BIOML. Here are
the most recent references I could find:

> The problem I have with Fasta format (other than the tendency of many
> data-providers to over-load the header line) is that there's no
> explicit marker for the alphabet and encoding of sequence data.  This
> is pretty nasty for codebases like BioJava which want to present a
> richer view of sequence data than just a String.  I'd certainly be in
> favour of a nice XML format that made alphabet information explicit.
> The DAS 1.5 DASSEQUENCE document has a moltype attribute which
> supports this (at least the three most important cases, DNA/RNA/
> Protein -- there's not a standards-compliant way to add other
> alphabets though).

Various data providers take all sorts of liberties with fasta sequence,
e.g., sequences with no IDs, whitespace-containing IDs, space between the
'>' and the ID, etc.

We might consider prescribing some conventions for what DAS considers proper
fasta format. I put in a little bit of description of a DAS-acceptable fasta
format here in the retrieval spec:

Do we want to add more to this? Perhaps something about an optional
description being separated from the ID by whitespace and consisting of any
amount of free-form text.
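
As a strawman, the header convention described above could be checked with
something like the following Python sketch. The exact rule ('>' immediately
followed by a non-whitespace ID, then optional whitespace and a free-form
description) is my assumption about what we might settle on, and the names
HEADER_RE and parse_header are made up for this example.

```python
import re

# Assumed convention: '>' immediately followed by a non-whitespace ID,
# optionally followed by whitespace and a free-form description.
HEADER_RE = re.compile(r"^>(\S+)(?:[ \t]+(.*))?$")


def parse_header(line):
    """Return (id, description); raise ValueError for a nonconforming header."""
    m = HEADER_RE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError("not a DAS-acceptable fasta header: %r" % line)
    seq_id, desc = m.group(1), m.group(2)
    # A missing description is reported as an empty string.
    return seq_id, (desc or "").strip()
```

This rejects headers with no ID and headers with space between the '>' and
the ID. It cannot catch whitespace-containing IDs, since under this
convention everything after the first run of whitespace is treated as
description.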


> I guess an alternative, more classically RESTful, way of doing things
> might be with MIME types:
>         Content-Type: application/fasta; sequence-alphabet=DNA;
> sequence-encoding=IUPAC
> I admit I'd prefer the XML though...
>              Thomas.
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2
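
For the MIME-type alternative quoted above, a client would not need any new
parsing machinery. Here is a small Python sketch using the standard library
email package to pull the parameters out of such a Content-Type value. The
parameter names sequence-alphabet and sequence-encoding are taken from
Thomas's example and are not registered anywhere; the function name is
hypothetical.

```python
from email.message import Message


def sequence_content_type(value):
    """Return (mime type, alphabet, encoding) from a Content-Type value.

    Missing parameters are returned as None.
    """
    msg = Message()
    msg["Content-Type"] = value
    return (
        msg.get_content_type(),
        msg.get_param("sequence-alphabet"),
        msg.get_param("sequence-encoding"),
    )
```

A response without the parameters still parses; the caller just gets None
for the alphabet and encoding and has to guess, as with plain fasta today.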
