[DAS2] tiled queries for performance

Thu Nov 24 13:28:00 UTC 2005

Allen:
>  I'd like to be able to consistently get network-bottlenecked response 
> from the server.  The largest (250 megabase) SQL range queries 
> typically take ~30 seconds to complete, returning ~500K features.  I'm 
> currently working on getting the templating system (Template Toolkit 
> aka TT2) we use to flush to the client periodically, rather than 
> building the entire response first.  This is the current bottleneck; 
> TT2 generation of a 500K record XML document takes many minutes.  
> Regardless of how much more optimization work we put into the server, 
> it's never going to be as fast as serving up pre-queried, pre-rendered 
> content.

Interesting.  So I was right, in that the range search is fast, but
wrong in not considering the template generation problem.

Could someone use that for a DoS attack, by asking for several large
ranges at once?  You're building up multi-megabyte strings in memory.
(If 1 feature is 1K then 500K features is 500MB per response.)
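
A back-of-the-envelope version of that estimate, as a sketch.  The 1K
per feature is only my guess from above, and the figure of four
simultaneous requests is made up for illustration.

```python
def response_bytes(n_features, bytes_per_feature=1000):
    """Rough size of a fully buffered response, ignoring header and footer."""
    return n_features * bytes_per_feature

# One large range query returning ~500K features at a guessed ~1K each.
one_query = response_bytes(500_000)

# Hypothetical: four such requests buffered in memory at the same time.
concurrent = 4 * one_query
```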

Ideologically the clean solution might be to have the search return only
a list of identifiers and have the client fetch each feature one-by-one.
This is a tile size of 1.

Implementation-wise this will cause problems unless using HTTP 1.1
pipelining since the act of opening 500K connections takes non-trivial
time.  Adding a "return XML for these ids" service doesn't help either -
it brings us back to the same problem.

But another solution is to cache all the features as XML, leaving out
only the header and footer.  Skip the templating system (rather, it's
upstream of the caching).  Do the search, get the ids, and stream the
contents directly from the cache.
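
Roughly what I have in mind, as a minimal sketch and not working
server code.  The function name, the dictionary standing in for the
cache, and the XML element names are all placeholders of mine, not the
real DAS2 schema.

```python
def stream_features(feature_ids, cache, header, footer):
    """Yield a response piece by piece instead of building one big string."""
    yield header
    for feature_id in feature_ids:
        # Each cache entry is an already-rendered XML fragment for one feature.
        yield cache[feature_id]
    yield footer

# Hypothetical cache, filled once by the templating step upstream.
cache = {
    "f1": '<FEATURE id="f1"/>\n',
    "f2": '<FEATURE id="f2"/>\n',
}

# Stand-in for the ids a range search would return.
ids = ["f2", "f1"]

chunks = list(stream_features(ids, cache, "<FEATURES>\n", "</FEATURES>\n"))
```

Because the generator hands back one fragment at a time, the server
never holds more than a single feature's XML in memory per request,
whatever the size of the result set.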

This would be used in feature lookup and for search results.

>  In the DAS protocol, the distribution of the application logic is 
> distributed between the client and server, sometimes to ill effect.  
> Requiring both (a) the server to respond to arbitrary range queries, 
> and (b) the client to display arbitrary ranges unnecessarily creates a 
> bifurcation of the View component of the application.  Brian was 
> hinting at this when he mentioned the idea of bittorrent blocks 
> earlier in the thread.

What application logic?  There should be many ways to build different
applications on top of DAS.

DAS is a data model.  The client provides the view (or many views).

There are two reasons for query support on the server.

  1. slow bandwidth and limited client resources - otherwise clients
       could download and search the data locally
  2. easier support for (certain classes of) application developers

To make the Google comparison, there's no reason Google searches
couldn't take place on your personal machine except that you can't
download the Internet and search it in usable time.  With Google
providing the service, others can do things like provide
domain-specific web searches via Google, include Google links in a web
browser, or make something like Googlefight.

> We also require code redundancy between client and server to be able 
> to fully use the type and exacttype filters.  In this case the Model 
> component has been bifurcated -- the client needs to build a model of 
> the ontology (from who knows where... presumably processing OBO-Edit 
> files) so the user can issue queries, and the server needs to also 
> have some representation of the ontology to generate a response.
>
>  Hopefully the ontology DAS extension will help the latter situation 
> outlined above by getting both client and server to be synchronized on 
> the same data model.  As far as the tiling optimization goes, it's 
> likely that I'll implement a preprocessor for the HTTP query so I can 
> break it into tiles -- conceptually very similar to the log10 binning 
> that Lincoln does in the GFF database.

I didn't follow this.  Code redundancy means what?  There's an
exchange of data models - in this case the model for a query.  But any
client/server needs to do this.

Take Entrez, for example.  It supports many types of search fields,
including MeSH (which I think counts as an ontology).  A sophisticated
client may have a GUI to help people identify MeSH terms.  This
obviously duplicates some of the server's work.

Is that what you mean?  If so, why does it matter?

Note also that while Google Maps serves static images only, there's
shared logic between the application (in the browser) and the tools
that generated those maps.  E.g., both have the same code for
understanding geography and latitude/longitude.

					Andrew
					dalke at dalkescientific.com