[DAS2] tiled queries for performance

Thu Nov 24 13:47:26 UTC 2005

Hi Brian,

> 	We're looking into this kind of implementation issue ourselves and 
> thought that a BitTorrent-like cache makes the most sense, i.e. all 
> servers in the "fabric" are issued the query in a certain "hop 
> adjacency". These servers then send their data to the client, whose job 
> it is to assemble the data.

I go back and forth between the "large data set" model and the
"large number of entities" model.

In the first:
   - client requests a large data file
   - server returns it

This can be sped up by distributing the file among many sites and
using something like BitTorrent to put it together, or something like
Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches.

But writing the code for this is complicated.  It's possible to build
on BitTorrent and similar systems, but I have no feel for the actual
implementation cost, which makes me wary.  I've looked into a couple
of the P2P toolkits and haven't gotten the sense that any of them is
easier than writing HTTP requests directly.  Plus, who will set up the
alternate servers?

In the second:
   - make query to server
   - server returns list of N identifiers
   - make N-n requests (where 'n' is the number of identifiers
     already resolved)

The id resolution can be done in a distributed fashion and is easily
supported via web caches, either with well-configured proxies or (again)
through Coral.

I like the latter model in part because it's more fine-grained.  E.g.,
a progress bar can say "downloading feature 4 of 10000", and if a given
feature is already present there's no need to refetch it.
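To make the second model concrete, here is a minimal client-side sketch
in Python.  Everything in it is an assumption for illustration: the
function name, the dict used as a cache, and the fetch_one and progress
callbacks are hypothetical, and nothing here is part of the DAS2 spec.
It only shows the "skip what's already resolved" logic and where a
progress report would hook in.

```python
def fetch_missing(identifiers, cache, fetch_one, progress=None):
    """Resolve each identifier, skipping those already in `cache`.

    identifiers -- list of ids returned by the initial query
    cache       -- dict mapping id -> already-fetched feature
    fetch_one   -- hypothetical callable doing one request for one id
    progress    -- optional callable(position, total), e.g. a progress bar

    Returns the number of requests actually made (the "N-n").
    """
    total = len(identifiers)
    requests_made = 0
    for position, ident in enumerate(identifiers, start=1):
        if ident not in cache:
            # Only unresolved identifiers cost a network round trip.
            cache[ident] = fetch_one(ident)
            requests_made += 1
        if progress is not None:
            progress(position, total)
    return requests_made
```

A caller would pass something like a function that does an HTTP GET on
the identifier's URL as fetch_one; if that goes through a caching proxy,
the repeated lookups are cheap without any extra client code.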

The downside of the second is that it needs HTTP 1.1 pipelining to be
efficient.  I don't know if we want to have that requirement.  Gregg
came up with the range restrictions because most of the massive results
will be from range searches.  By being a bit more clever about tracking
what's known and not known, a client can get a much smaller results
page.

The two approaches are complementary.  Using Gregg's restricted range
queries can reduce the number of identifiers returned in a search,
making the network overhead even smaller.
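As a rough illustration of "tracking what's known and not known", a
client could remember which ranges it has already queried and only ask
for the gaps.  This sketch assumes half-open (start, end) integer
ranges on a single sequence; the function name and that representation
are my own for the example, not the actual form of Gregg's range
restrictions.

```python
def missing_ranges(known, start, end):
    """Return the parts of [start, end) not covered by `known`.

    known -- list of (start, end) half-open ranges already fetched,
             in any order; they may overlap each other.
    """
    gaps = []
    pos = start  # everything before `pos` is already accounted for
    for k_start, k_end in sorted(known):
        if k_end <= pos:
            continue            # entirely before the region of interest
        if k_start >= end:
            break               # entirely after it; later ones are too
        if k_start > pos:
            gaps.append((pos, k_start))
        pos = max(pos, k_end)
        if pos >= end:
            break
    if pos < end:
        gaps.append((pos, end))
    return gaps
```

The client would then issue one restricted query per gap rather than
one query for the whole range.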

					Andrew
					dalke at dalkescientific.com