[BioPython] versioning, distribution and webdav
Andrew Dalke
dalke@acm.org
Tue, 15 May 2001 03:06:47 -0600
I had a worrisome conversation with Roger Sayle today. He's
been downloading the latest EMBL release, which is some 7 GB if
I counted ftp://ftp.ebi.ac.uk/pub/databases/embl/release/
correctly. They have a T1, but it's downloading at only about
1.5 MBytes per minute. A T1 is nominally 1.5 MBits per second,
which works out to over 11 MBytes per minute, so he's seeing
roughly an eighth of the theoretical bandwidth.
That means a full download takes 78 hours, or about 3.25 days.
He also mentioned they have 80 GB of raw (unprocessed)
bioinformatics data. GenBank is probably most of this, and it
doesn't take the hit of crossing the Atlantic, so call it about
5 days at full T1 bandwidth.
That's 8 days to download the world's public bioinformatics
data over a T1. YMMV.
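For the curious, here's the back-of-the-envelope arithmetic
behind those numbers. The sizes and rates are the rough
estimates above, nothing measured beyond Roger's transfer rate:

# Back-of-the-envelope download times.  Sizes and rates are the
# rough estimates quoted above.
T1_MBYTES_PER_MIN = 1.544e6 / 8 / 1e6 * 60   # ~11.6 MB/min nominal
OBSERVED_MBYTES_PER_MIN = 1.5                # what Roger actually sees

def days_to_download(size_gb, mbytes_per_min):
    minutes = size_gb * 1000.0 / mbytes_per_min
    return minutes / 60.0 / 24.0

print("EMBL, 7 GB at the observed rate: %.1f days"
      % days_to_download(7, OBSERVED_MBYTES_PER_MIN))
print("everything, 80 GB at full T1:    %.1f days"
      % days_to_download(80, T1_MBYTES_PER_MIN))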
An 80 GB EIDE drive costs about $225, and you'll need at least
another one to store all the indices. A 50 GB SCSI drive is a
bit under $500. A 120 GB RAID is about $3,000. All prices from
pricewatch.com.
So it takes somewhere over $1,000 worth of disk space to store
the data.
Bioinformatics data seems to be doubling every year, which is
roughly the same rate at which storage per dollar is increasing.
This means there's a roughly fixed cost of about $1,500 every
year for a group to be able to store the public data. This
isn't a serious cost.
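(If you want to see why the cost stays flat, here is a toy
projection. The starting numbers are ballpark guesses, not
quotes: roughly 160 GB of data plus indices at SCSI-ish prices
of about $10/GB.)

# Toy projection: data doubling yearly while $/GB halves yearly.
# Starting figures are ballpark assumptions, not quotes.
size_gb = 160.0        # data plus indices, rough guess
dollars_per_gb = 10.0  # roughly the 50 GB SCSI price point above

for year in range(2001, 2006):
    print("%d: %6.0f GB at $%5.2f/GB = $%5.0f"
          % (year, size_gb, dollars_per_gb, size_gb * dollars_per_gb))
    size_gb *= 2
    dollars_per_gb /= 2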
But that's only if the group can download the data in the
first place. By next year it will take 16 days to download,
and the year after, a month. There's no way a T1 can keep up
with that. Yet at about $1,000/month in this area, a T1 is all
a small company can afford. (I can't find pricing for a T3 on
the web, but it seems to be about 20x more expensive than a T1
for about 30x the bandwidth.) Those prices aren't dropping
anywhere near as fast as the data is growing.
If nothing is done, this means more centralization of data and
hence services at the large facilities like NCBI, EBI, etc.
I don't like this. It's not that there's a problem with those
centers; rather, I believe that for research you need to have
things as local and manipulable as possible. This means
getting direct access to data and algorithms and servers and ...
everything. Pushing (say) BLAST services onto NCBI's machine
means you can't experiment with modifications to BLAST, or add
your own FASTA search program or even your brilliant new SCREAM
algorithm. Or test other methods for whole-database comparisons,
or do data mining, or ...
The biggest problem seems to be pushing the data across the
network, since storage costs should stay roughly constant over
time. I have a couple of ideas on how to improve this.
The first is to switch to bzip2 instead of gzip compression.
I don't have numbers handy, but I recall that it's a decent
improvement. This lowers both the bandwidth and the storage
requirements. However, it's only a one-time, constant-factor
(maybe 10%?) improvement. Against exponential growth, that
only pushes the problem back by a month or two.
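If someone wants to put real numbers on that, something along
these lines would do it. (The filename is a placeholder; point
it at any local EMBL or GenBank flat file.)

# Compare gzip and bzip2 on one data file.
import bz2, gzip

filename = "some_embl_file.dat"   # placeholder; use a real flat file
data = open(filename, "rb").read()

gz = gzip.compress(data)
bz = bz2.compress(data)

print("original: %10d bytes" % len(data))
print("gzip:     %10d bytes (%4.1f%% of original)"
      % (len(gz), 100.0 * len(gz) / len(data)))
print("bzip2:    %10d bytes (%4.1f%% of original)"
      % (len(bz), 100.0 * len(bz) / len(data)))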
It should be possible to get even higher compression using
methods which model the data stream better, since these formats
have a lot of predictable structure. For example, use the
Martel grammar to define a DFA, run it against some data to
tune the transition parameters of the DFA, then use those
numbers to produce the compressor/decompressor. This might be
(shot in the dark) 25% better, or ... four or five months :(
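For what it's worth, the "how many months does a one-time
compression gain buy" arithmetic is easy, assuming the data
really does double every 12 months:

# Months of growth absorbed by a one-time size reduction,
# assuming the data doubles every 12 months.
from math import log

def months_bought(fraction_smaller):
    remaining = 1.0 - fraction_smaller
    return 12.0 * log(1.0 / remaining) / log(2.0)

for pct in (0.10, 0.25, 0.50):
    print("%2.0f%% smaller buys about %.1f months"
          % (pct * 100, months_bought(pct)))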
I'm told that HTTP is a better protocol than FTP because it
has less overhead. I don't know how much overhead there is.
A completely different idea, and one which could be more
practical, is to improve how updates are distributed. At
present, when there is an update everyone needs to download
the complete data file. Why isn't there a way to get deltas?
If there were, the amount transferred would track the growth
rate (the derivative of the total size) rather than the total
size itself. That's still exponential, but it saves almost a
year.
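To see why: with yearly doubling, the delta for any given year
is the size of last year's entire database, so shipping deltas
is like being a full year behind on the growth curve.

# With yearly doubling, each year's delta equals last year's total.
size_gb = 80.0   # today's rough total from above
for year in range(1, 5):
    new_size = size_gb * 2
    delta = new_size - size_gb
    print("year %d: full download %5.0f GB, deltas only %5.0f GB"
          % (year, new_size, delta))
    size_gb = new_size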
One such protocol already exists which looks promising: WebDAV.
See http://www.webdav.org/ . If my limited understanding of it
is correct, it should be possible to treat each record as its
own document, and update only the records, or parts of
records, that changed.
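I haven't tried any of this, but here's a rough sketch of what
I have in mind, in Python. The server name, collection layout
and one-document-per-record scheme are all invented for the
sake of illustration:

# Sketch: ask a (hypothetical) DAV server which record documents
# changed, then fetch only those.  Host, path and record layout
# are made up for illustration.
import http.client
import xml.etree.ElementTree as ET

HOST = "dav.example.org"            # hypothetical DAV server
COLLECTION = "/embl/release/"       # one document per record (assumed)

PROPFIND_BODY = b"""<?xml version="1.0"?>
<D:propfind xmlns:D="DAV:">
  <D:prop><D:getlastmodified/></D:prop>
</D:propfind>"""

conn = http.client.HTTPConnection(HOST)
conn.request("PROPFIND", COLLECTION, body=PROPFIND_BODY,
             headers={"Depth": "1", "Content-Type": "text/xml"})
reply = ET.fromstring(conn.getresponse().read())

# Walk the multistatus reply; the comparison against a locally
# stored timestamp is left out of this sketch.
for response in reply.findall("{DAV:}response"):
    href = response.findtext("{DAV:}href")
    modified = response.findtext(".//{DAV:}getlastmodified")
    print(href, modified)
    # if our copy is older: GET href and update the local record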
The webdav.org page mentions that DAV servers can be mounted
as file systems on MS Windows and Mac OS X. I haven't looked
into exactly what that means, but it suggests a way for
existing file-oriented software to work with DAV-based systems.
One thing that comes to mind is to operate a honking big
proxy/caching server. This is interesting because it scales.
If you are only interested in a small part of the database,
only that part ends up cached on the local network. If you
configure it for updating, it refreshes those files on demand.
And if you want the whole database, it can download all of the
files for local access and pull only the deltas when things
change. Either way, all data acts local.
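The core of the "acts local" part is just conditional fetching
plus a cache directory. A rough sketch, with the URL scheme
again invented:

# Keep a local copy of each record; refresh with a conditional GET.
# The base URL and record naming are invented for illustration.
import os
import urllib.request, urllib.error
from email.utils import formatdate

CACHE_DIR = "cache"
BASE_URL = "http://dav.example.org/embl/release/"   # hypothetical

def fetch_record(name):
    local = os.path.join(CACHE_DIR, name)
    headers = {}
    if os.path.exists(local):
        # Only send the record if it changed since our copy was made.
        headers["If-Modified-Since"] = formatdate(
            os.path.getmtime(local), usegmt=True)
    request = urllib.request.Request(BASE_URL + name, headers=headers)
    try:
        with urllib.request.urlopen(request) as reply:
            data = reply.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:        # not modified; local copy is current
            return open(local, "rb").read()
        raise
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(local, "wb") as out:
        out.write(data)
    return data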
Switching to a delta-based approach does two things. It could
push the data transfer problem back by about a year. It may
also make it easier to switch code between local and remote
resources. Because of that scalability, I could develop new
algorithms locally against a subset of the database, then go
to one of the large centers and run the same code without
having to change the I/O. This gives people a spectrum of ways
to work on a problem rather than being limited by one set of
constraints.
I searched to see if anyone had mentioned this specific
technology for bioinformatics, but could not find anything.
The closest I could find was an
Advogato entry by Thomas Down at http://www.advogato.org/person/thomasd/
who mentions WebDAV but not this application of it.
Of course, this is also related to the CORBA work, and I don't know
enough about it to judge the overlap. My ungrounded feeling is that
it may be too heavyweight for this specific task.
Anyway, those are my thoughts for this really late night/early
morning.
(If this ends up sounding like the ravings of a crazed lunatic,
I'll just blame it on a bad sleep schedule :)
Andrew
dalke@acm.org