[DAS2] segments and types

Mon Nov 6 16:53:22 UTC 2006

> -----Original Message-----
> From: das2-bounces at lists.open-bio.org [mailto:das2-bounces at lists.open-bio.org]
> On Behalf Of Andrew Dalke
> Sent: Friday, October 27, 2006 12:56 PM
> To: DAS/2
> Subject: [DAS2] segments and types
> 
> A couple of observations about what I've seen in existing
> DAS1 servers.  Nothing here concerns format changes.
> 
> There are four different ways to handle segments:
>    1) Don't provide segment information
>         "Our clients know the segment because of the id
>          so they don't need a segments document"
>    2) use "size" (pre-DAS 1.0 spec)
>    3) use "start"/"stop" (DAS 1.0 spec)
>        - with variations, like "0", "0" meaning the length is
>            undefined (and even "1", "0", with a size="2", for one
>            server!)
>    4) use a "version" field
> 
> The last is mostly used for protein sequences, from what I've seen.
> It's an aspect of #1 ("9pti" means "bovine pancreatic trypsin
> inhibitor structure from PDB") as an abstract identifier, with
> the version used to make it concrete ("with the update because
> the first release had a typo").  I think it can be encapsulated
> in the uri scheme we now use because each version gets its own
> identifier, and since the client knows all versions there's no
> problem.
> 
> 
> The folks at EBI/Sanger (what's the correct collective term;
> Hinxton? Genome Campus?) know which servers provide which
> coordinate systems, so many servers don't provide coordinates.
> 
> In some cases, like rabbit, the server will generate about
> 120,000 segments, one for each scaffold.  It takes quite some time
> (a minute or more) to generate the output.  In theory this is
> static and can be precomputed by the server.
> 
> For my own knowledge, when do people want the complete list
> of segments?  When do they want the length?  You, yes, you
> there, in front of the computer.  When do you want to
> use it?

For (nearly) completely sequenced genomes, it is important to provide a
complete list of genome segment ids/names.  This allows a visualization
client to provide this list for a user to select from if they are
interested in particular genome locations or simply browsing, rather
than having the id/name of a particular feature in mind.  Now you could
just have the user type in the id of a segment, but unless they are
familiar with the vagaries of that particular server, do they request
"chr1", or "1", "I", "chrI", "chrom1", etc?

Length information for a segment is needed to place an upper bound on
range queries to the server.  In a GUI client it is often more
convenient for the user to indicate visually what range on the segment
they want to retrieve data from, but this doesn't make sense without the
client app knowing the length of the segment.  Furthermore, once the
client is displaying located annotations on a segment, it can be
important to know where the end of the segment is relative to the
locations of annotations.
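
To make the length point concrete, here is a rough sketch of how a
client might derive a segment length from the attribute variants Andrew
lists and then use it to bound a range query.  The function names and
the plain-dict input are made up for illustration and are not part of
any DAS library.  It also assumes 1-based inclusive coordinates and
treats a non-positive or inverted "start"/"stop" pair as "length
undefined", which is only one possible reading of what those servers
intend.

```python
from typing import Mapping, Optional, Tuple


def segment_length(attrs: Mapping[str, str]) -> Optional[int]:
    """Return the segment length, or None if the server leaves it undefined.

    `attrs` is a hypothetical dict of segment attributes as strings.
    """
    if "start" in attrs and "stop" in attrs:
        start, stop = int(attrs["start"]), int(attrs["stop"])
        # Assumption: "0","0" or stop < start means "no usable length here".
        if stop > 0 and stop >= start:
            return stop - start + 1
    if "size" in attrs:
        size = int(attrs["size"])
        if size > 0:
            return size
    return None


def clamp_range(start: int, stop: int, length: Optional[int]) -> Tuple[int, int]:
    """Clip a 1-based inclusive request to the segment bounds.

    With an unknown length there is no upper bound to apply, so `stop`
    is passed through unchanged.
    """
    start = max(1, start)
    if length is not None:
        stop = min(stop, length)
    if stop < start:
        raise ValueError("requested range lies outside the segment")
    return start, stop
```

With a known length of 1000, clamp_range(-5, 2000, 1000) gives (1, 1000).
When segment_length() returns None the client has nothing to clip
against, which is exactly the case where a visual range selector stops
making sense.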

For less complete genomes (like rabbit), it's not so clear what
advantage there is to having a list of 120,000 scaffolds to choose
from.  The same applies to a list of proteins or mRNAs.

> 
> Let me stress -- this is not a request to change anything.  I
> would like to know for my own sake, for writing the documentation,
> and for how much emphasis to put on this for the validation.
> 
> As another observation, the Sanger/EBI servers also don't
> do much with the types document. Some don't even handle the
> request.  Eugene said that no one had asked him to add it.
> It's there now (thanks Eugene).
> 
> I think this is because most of their servers only had a single
> type and the solution was "display everything."  They are
> running into difficulties with this for a few new servers and
> will need type support, and type filter support, soonish.
> 
> 					Andrew
> 					dalke at dalkescientific.com