[DAS2] tiled queries for performance
Andrew Dalke
dalke at dalkescientific.com
Thu Nov 24 13:47:26 UTC 2005
Hi Brian,
> We're looking into this kind of implementation issue ourselves and
> thought that a BitTorrent-like cache makes the most sense, i.e. all
> servers in the "fabric" are issued the query in a certain "hop
> adjacency". These servers then send their data to the client, whose job
> it is to assemble the data.
I go back and forth between the "large data set" model and the "large
number of entities" model.
In the first:
- client requests a large data file
- server returns it
This can be sped up by distributing the file among many sites and
using something like BitTorrent to put it together, or something like
Coral ( http://www.coralcdn.org/ ) to redirect to nearby caches.
But writing the code for this is complicated. It's possible to build
on BitTorrent and similar systems, but I have no feel for the actual
implementation cost, which makes me wary. I've looked into a couple
of the P2P toolkits and haven't gotten the sense that they're any easier
than writing HTTP requests directly. Plus, who will set up the alternate
servers?
In the second:
- make query to server
- server returns list of N identifiers
- make N-n requests (where 'n' is the number of identifiers already
resolved)
The id resolution can be done in a distributed fashion and is easily
supported via web caches, either with well-configured proxies or (again)
through Coral.
I like the latter model in part because it's more fine-grained. E.g.,
a progress bar can say "downloading feature 4 of 10000", and if a given
feature is already present there's no need to refetch it.
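
As a rough sketch of that second model (the function and parameter names
here are made up for illustration, and `fetch` stands in for whatever HTTP
request resolves a single identifier), the client loop might look like:

```python
from typing import Callable, Dict, Iterable, Optional


def resolve_features(
    ids: Iterable[str],
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    progress: Optional[Callable[[int, int], None]] = None,
) -> Dict[str, str]:
    """Resolve every id, fetching only those not already in `cache`.

    `fetch` is a hypothetical callable that retrieves one feature by id;
    `progress` is an optional callback receiving (current, total).
    """
    ids = list(ids)
    total = len(ids)
    for i, fid in enumerate(ids, start=1):
        if fid not in cache:
            # The "N-n" step: only unresolved identifiers cost a request.
            cache[fid] = fetch(fid)
        if progress is not None:
            # Lets a UI report e.g. "downloading feature 4 of 10000".
            progress(i, total)
    return {fid: cache[fid] for fid in ids}
```

Passing `fetch` in as an argument keeps the sketch independent of any
particular HTTP library or caching proxy.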
The downside of the second is the need for HTTP 1.1 pipelining to make it
efficient. I don't know if we want to have that requirement. Gregg
came up with the range restrictions because most of the massive results
will be from range searches. By being a bit more clever about tracking
what's known and not known, a client can get a much smaller results
page.
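
To illustrate the bookkeeping only (this is not Gregg's actual proposal,
whose details aren't given here), a client could work out which parts of a
requested range it hasn't seen yet. The half-open integer ranges and the
function name are assumptions made for this sketch:

```python
from typing import List, Tuple

Range = Tuple[int, int]  # assumed half-open interval: [start, end)


def missing_ranges(query: Range, known: List[Range]) -> List[Range]:
    """Return the sub-ranges of `query` not covered by any range in `known`."""
    start, end = query
    gaps: List[Range] = []
    cursor = start  # everything in [start, cursor) is already accounted for
    for k_start, k_end in sorted(known):
        if k_end <= cursor:
            continue  # entirely behind what we've already covered
        if k_start >= end:
            break  # entirely past the query; later ranges are too
        if k_start > cursor:
            gaps.append((cursor, k_start))  # uncovered stretch before this range
        cursor = max(cursor, k_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))  # uncovered tail of the query
    return gaps
```

The client would then query only for the returned gaps rather than the
whole range.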
These are complementary. Using Gregg's restricted range queries can
reduce the number of identifiers returned in a search, making the
network overhead even smaller.
Andrew
dalke at dalkescientific.com