[DAS2] format information for the reference server

Andrew Dalke dalke at dalkescientific.com
Mon Mar 13 14:00:45 UTC 2006


(NOTE: the open-bio mailing lists were moved from portal.open-bio.org
to lists.open-bio.org.  My first email on this bounced because I
sent it to the old address.)

Summary of questions:
   - what does it mean for the annotation server to list the formats
       available from the reference server?
   - can the reference server format information be moved to the
       segments document?
   - are there formats which will only work at the segment level and
       not at the segments level (ie, formats which don't handle multiple
       records)?

Something's been bothering me about the segments request.

Currently the DAS sources request responds with something like

<SOURCES>
   <SOURCE>
    <VERSION>
      <CAPABILITY type="segments" query_url="http://blah/seq">
         <FORMAT name="fasta" />
         <FORMAT name="agp" />
      </CAPABILITY>
   ...
</SOURCES>

This says "go to 'blah' for information about the sequence".

But it says more than that.  It provides metadata about
the reference server.  It says that the reference server can
respond in 'fasta' and 'agp' formats.

Hence the following are allowed from this URL

   http://blah/seq?format=agp  -- return the assembly
   http://blah/seq?format=fasta -- return all sequences in FASTA format

Does this mean that all annotation servers using the given
reference server must list all of the available formats?

If a client sees multiple CAPABILITY elements for the same
query_url is it okay to merge the list of supported formats?
That is, if server X says that reference server A supports
fasta and server Y says that A supports genbank then a client
may assume A supports both fasta and genbank formats?
(This makes sense to me.)
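
For what it's worth, here is a minimal Python sketch of the merge I
have in mind.  It assumes each sources document is well-formed XML
shaped like the example above.  The function name merge_formats and
the two sample documents (standing in for servers X and Y) are made
up for illustration and are not part of any spec.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def merge_formats(sources_docs):
    """Map each segments query_url to the union of FORMAT names seen for it."""
    merged = defaultdict(set)
    for doc in sources_docs:
        root = ET.fromstring(doc)
        # Visit every CAPABILITY element, wherever it sits in the tree.
        for cap in root.iter("CAPABILITY"):
            if cap.get("type") != "segments":
                continue
            url = cap.get("query_url")
            for fmt in cap.findall("FORMAT"):
                merged[url].add(fmt.get("name"))
    return dict(merged)


# Hypothetical sources responses from two different servers.
DOC_X = """<SOURCES><SOURCE><VERSION>
  <CAPABILITY type="segments" query_url="http://blah/seq">
    <FORMAT name="fasta" />
  </CAPABILITY>
</VERSION></SOURCE></SOURCES>"""

DOC_Y = """<SOURCES><SOURCE><VERSION>
  <CAPABILITY type="segments" query_url="http://blah/seq">
    <FORMAT name="genbank" />
  </CAPABILITY>
</VERSION></SOURCE></SOURCES>"""

formats = merge_formats([DOC_X, DOC_Y])
print(sorted(formats["http://blah/seq"]))  # prints ['fasta', 'genbank']
```

The union is keyed on the exact query_url string, so two URLs that
differ only in spelling (trailing slash, case of the host name) would
not be merged by this sketch.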

Second, does it make sense to require the annotation servers
to list the formats on the reference server?  What about
making that information available from the segments document,
like this.

query:

   http://www.biodas.org/das/h.sapiens/38/segments.cgi

response:

<SEGMENTS>
   <SEGMENT id="abc">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
   <SEGMENT id="def">
     <FORMAT name="fasta" />
     <FORMAT name="agp" />
   </SEGMENT>
</SEGMENTS>

A problem with this is the lack of data saying that the
segments query URL itself supports multiple formats.  For
example,

   http://www.biodas.org/das/h.sapiens/38/segments.cgi?format=fasta

might support returning all of the chromosomes in FASTA format.

Are there any formats which only work at the segment level
and not at the segments level?  That is, which only work with
single gene/chromosome/contig/etc. but don't support multiple
sequences?  The only one I could think of off-hand is "raw",
since there's no concept of a "record" given a bunch of letters,
unless the usual way is to separate them by an extra newline?

If all formats are supported for both single and all segments
then here is another possible response

[possibility #1]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc" />
   <SEGMENT id="def" />
</SEGMENTS>

I think all formats which work on the "segments" level also
work on a single segment level, so another possibility is
the following, which lets a given segment say that it supports
more formats.

[possibility #2]
<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc">
     <FORMAT name="raw" />
   </SEGMENT>
   <SEGMENT id="def">
     <FORMAT name="raw" />
   </SEGMENT>
</SEGMENTS>
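
To make the intent of [possibility #2] concrete, here is a rough
Python sketch of how a client might read it: the top-level FORMAT
elements apply to the segments URL and to every segment, and a
SEGMENT may add more of its own.  The function name
formats_by_segment is hypothetical, and the sample document is a
well-formed copy of the example.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the [possibility #2] response.
DOC = """<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="agp" />
   <SEGMENT id="abc">
     <FORMAT name="raw" />
   </SEGMENT>
   <SEGMENT id="def">
     <FORMAT name="raw" />
   </SEGMENT>
</SEGMENTS>"""


def formats_by_segment(doc):
    """Return (formats for the segments URL, {segment id: formats})."""
    root = ET.fromstring(doc)
    # Direct FORMAT children of SEGMENTS are shared by everything.
    shared = [f.get("name") for f in root.findall("FORMAT")]
    per_segment = {}
    for seg in root.findall("SEGMENT"):
        own = [f.get("name") for f in seg.findall("FORMAT")]
        # Shared formats first, then any extras the segment declares.
        per_segment[seg.get("id")] = shared + [n for n in own if n not in shared]
    return shared, per_segment


shared, per_segment = formats_by_segment(DOC)
print(shared)              # prints ['fasta', 'agp']
print(per_segment["abc"])  # prints ['fasta', 'agp', 'raw']
```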


Here's another, using a flag to say if a format is for a
single segment, the segments URL, or both (feel free to
pick better names!). By default it applies to both.

[possibility #3]

<SEGMENTS>
   <!-- both support FASTA retrieval -->
   <FORMAT name="fasta" />

   <!-- both support GenBank retrieval -->
   <FORMAT name="genbank" applies-to="both" />

   <!-- can only get the assembly of everything -->
   <FORMAT name="agp" applies-to="segments" />

   <!-- can only get the raw sequence for a segment -->
   <FORMAT name="raw" applies-to="segment" />
</SEGMENTS>
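
Here is a small Python sketch of how a client could interpret the
flag, assuming the attribute is spelled "applies-to", takes the values
"both", "segments" or "segment", and defaults to "both" when absent.
The function name split_formats is made up, and the attribute name is
only my placeholder.

```python
import xml.etree.ElementTree as ET

# The [possibility #3] response, comments omitted.
DOC = """<SEGMENTS>
   <FORMAT name="fasta" />
   <FORMAT name="genbank" applies-to="both" />
   <FORMAT name="agp" applies-to="segments" />
   <FORMAT name="raw" applies-to="segment" />
</SEGMENTS>"""


def split_formats(doc):
    """Return (formats for the segments URL, formats for a single segment)."""
    root = ET.fromstring(doc)
    for_segments, for_segment = [], []
    for fmt in root.findall("FORMAT"):
        name = fmt.get("name")
        # A missing attribute means the format applies to both levels.
        scope = fmt.get("applies-to", "both")
        if scope not in ("both", "segments", "segment"):
            raise ValueError("unknown applies-to value: %r" % scope)
        if scope in ("both", "segments"):
            for_segments.append(name)
        if scope in ("both", "segment"):
            for_segment.append(name)
    return for_segments, for_segment


for_segments, for_segment = split_formats(DOC)
print(for_segments)  # prints ['fasta', 'genbank', 'agp']
print(for_segment)   # prints ['fasta', 'genbank', 'raw']
```

The two lists that come out carry the same information as the two
explicit lists in [possibility #4], just encoded with less repetition.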

Yet another option is

[possibility #4]
<SEGMENTS>
   <FORMATS-FOR-SEGMENTS>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="agp" />
   </FORMATS-FOR-SEGMENTS>
   <FORMATS-FOR-SINGLE-SEGMENT>
     <FORMAT name="fasta" />
     <FORMAT name="genbank" />
     <FORMAT name="raw" />
   </FORMATS-FOR-SINGLE-SEGMENT>
   ..

Of these I support [possibility #1], with the ability to go
to [possibility #3] if there's ever a case where a given format
cannot be applied to both levels.

					Andrew
					dalke at dalkescientific.com



