[DAS2] tiled queries for performance

Helt,Gregg Gregg_Helt at affymetrix.com
Mon Nov 28 09:44:18 UTC 2005


> -----Original Message-----
> From: das2-bounces at portal.open-bio.org [mailto:das2-bounces at portal.open-bio.org] On Behalf Of Andrew Dalke
> Sent: Thursday, November 24, 2005 5:47 AM
> To: Brian Gilman
> Cc: DAS/2
> Subject: Re: [DAS2] tiled queries for performance
> 
> Hi Brian,
> 
> > 	We're looking into this kind of implementation issue ourselves and
> > thought that a BitTorrent-like cache makes the most sense, i.e. all
> > servers in the "fabric" are issued the query in a certain "hop
> > adjacency". These servers then send their data to the client whose
> > job it is to assemble the data.
> 
> I go back and forth between the "large data set" model and the "large
> number
> of entities" model.
> 
> In the first:
>    - client requests a large data file
>    - server returns it
> 
> This can be sped up by distributing the file among many sites and
> using something like BitTorrent to put it together, or something like
> Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches.
> 
> But making the code for this is complicated.  It's possible to build
> on BitTorrent and similar systems, but I have no feel for the actual
> implementation cost, which makes me wary.  I've looked into a couple
> of the P2P toolkits and not gotten the feel that it's any easier than
> writing HTTP requests directly.  Plus, who will set up the alternate
> servers?

My hope would be that any system like this could be hidden behind a
single HTTP GET request and hence require no changes to the DAS/2
protocol.  Standard web caches already work this way.  I'm less familiar
with the BitTorrent approach, but I'm guessing that the client-side code
that stitches together the pieces from multiple servers could be
encapsulated in a client-side daemon that responds to localhost HTTP
calls.
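
To make that concrete, here is a rough Python sketch of the kind of
localhost daemon I have in mind.  The port, the mirror list, and
fetch_from_mirrors() are all invented for illustration, and the naive
try-each-mirror loop is just a stand-in for a real BitTorrent-style
fetcher; the point is only that the DAS/2 client still issues a single
ordinary GET.

# Hypothetical sketch: a localhost daemon that hides the "stitch the
# pieces together" work behind one ordinary HTTP GET.  Mirror list,
# port, and fetch strategy are invented for illustration.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MIRRORS = [
    "http://mirror1.example.org/das2",   # assumed mirror servers
    "http://mirror2.example.org/das2",
]

def fetch_from_mirrors(path):
    """Fetch the requested document, trying mirrors in order.
    A smarter daemon could pull byte ranges from several mirrors
    in parallel and reassemble them here."""
    for base in MIRRORS:
        try:
            with urllib.request.urlopen(base + path, timeout=30) as resp:
                return resp.read()
        except OSError:
            continue
    raise IOError("no mirror could satisfy %s" % path)

class Proxy(BaseHTTPRequestHandler):
    def do_GET(self):
        body = fetch_from_mirrors(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The DAS/2 client simply GETs http://localhost:8080/... as it
    # would against any DAS/2 server; no protocol change is needed.
    HTTPServer(("localhost", 8080), Proxy).serve_forever()

With something like this running, any caching or reassembly strategy
stays invisible to the DAS/2 protocol itself.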
 
> In the second:
>    - make query to server
>    - server returns list of N identifiers
>    - make N-n requests (where 'n' is the number of identifiers already
> resolved)
> 
> The id resolution can be done in a distributed fashion and is easily
> supported via web caches, either with well-configured proxies or
> (again) through Coral.
> 
> I like the latter model in part because it's more fine-grained.  E.g.,
> a progress bar can say "downloading feature 4 of 10000", and if a
> given feature is already present there's no need to refetch it.
> 
> The downside of the 2nd is the need for HTTP 1.1 pipelining to make it
> be efficient.  I don't know if we want to have that requirement.  

I'm wary of this "large number of entities" approach, for several
reasons.  Due to the overhead for TCP/IP, HTTP headers, and extra XML
stuff like doctype and namespace declarations, making an HTTP GET
request per feature will increase the total number of bytes that need to
be transmitted.  It will also increase the parsing overhead on the
client side.  And if the features contain little information (for
example just type, parts/parents, and location) that overhead could
easily exceed the time taken to process the "useful" data.  As you
indicated, some performance problems could be alleviated by HTTP 1.1
pipelining, but that adds additional requirements to both client and
server.  Also, for persistent caching on the local machine, once you
start splitting the data into hundreds of thousands of files I suspect
the additional disk seek time will far exceed the disk read time and
become a serious performance impediment.
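
To put some (made-up) numbers on that overhead argument, here is a toy
calculation; the byte counts are assumptions for illustration, not
measurements from any DAS/2 server.

# Back-of-envelope illustration of the per-feature overhead argument.
# All byte counts below are rough assumptions, not measurements.
n_features      = 100000
http_headers    = 300   # assumed request + response header bytes per GET
xml_boilerplate = 250   # assumed <?xml?>, doctype, namespace decls per doc
payload         = 150   # assumed "useful" bytes: type, location, parts/parents

per_feature_total = http_headers + xml_boilerplate + payload
overhead_fraction = (http_headers + xml_boilerplate) / per_feature_total

print("bytes per feature: %d (%.0f%% overhead)"
      % (per_feature_total, 100 * overhead_fraction))
print("total transferred: %.1f MB vs %.1f MB of useful data"
      % (n_features * per_feature_total / 1e6,
         n_features * payload / 1e6))

With those assumed sizes, nearly 80% of the transferred bytes are
per-request overhead rather than feature data, which is the kind of
ratio that worries me.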

Having said that, in theory this approach is (almost) testable using the
current DAS/2 spec.  Create one DAS/2 server that in response to feature
queries returns only the minimum required information for "N" features:
id and type.  And have the returned feature ids be URLs on another DAS/2
server that _does_ return full feature information (location, alignment,
etc.).  Then make "N-n" single-feature queries against those URLs to get
full information.  Due to the current DAS/2 requirement that any parts /
parents referenced also be included in the same XML doc, this would only
be a reasonable test for features with no hierarchical structure, such
as SNPs.
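
A rough sketch of that test, with the query URL and the XML tag and
attribute names as placeholders rather than exact DAS/2 syntax:

# Sketch of the two-server test described above.  The query URL and the
# XML element/attribute names are assumptions, not the DAS/2 spec.
import urllib.request
import xml.etree.ElementTree as ET

MINIMAL_SERVER = "http://minimal.example.org/das2/features?overlaps=chr1:1:1000000"

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Step 1: one range query against the "minimal" server, which returns
# only id + type for N features, each id being a URL on the server
# that carries full feature information.
doc = ET.fromstring(fetch(MINIMAL_SERVER))
feature_urls = [f.get("id") for f in doc.iter("FEATURE")]   # assumed tag/attr

# Step 2: N-n single-feature queries, skipping ids already resolved.
cache = {}          # stand-in for a persistent local cache
for url in feature_urls:
    if url not in cache:
        cache[url] = ET.fromstring(fetch(url))

print("resolved %d features individually" % len(cache))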

> Gregg
> came up with the range restrictions because most of the massive
> results will be from range searches.  By being a bit more clever about
> tracking what's known and not known, a client can get a much smaller
> results page.
>
>
> These are complementary.  Using Gregg's restricted range queries can
> reduce the number of identifiers returned in a search, making the
> network overhead even smaller.
> 
> 					Andrew
> 					dalke at dalkescientific.com
> 
> _______________________________________________
> DAS2 mailing list
> DAS2 at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/das2



