[Bioperl-l] Best Practices for Downloading/Mirroring Genbank

Karalius, Joseph Joseph.Karalius at seminis.com
Mon Jun 14 13:31:21 EDT 2004


I'm working on setting up a local mirror of Genbank here at work and am
unsure of what the best way to go about it is.

I started off simply with wget -m ftp://genbank.sdsc.edu/pub (yes, I
wanted the BLAST-formatted databases and executables as well), and the
transfer is going just fine, albeit excruciatingly slowly at times.

But what happens:
1) between now and the next build?;
2) if I choose to mirror from an alternate source?;
3) after the next build?

For the first part, I planned on running daily wgets for the updates. It
occurred to me, though, that if I miss the last couple of days' worth of
updates before a new build, those updates get folded into the main build
files. Would I then have to download the whole thing again?

For the second, if I choose to mirror from BioMirror or NCBI instead of San
Diego, the timestamps seem to differ for what I am assuming to be the same
build.  For example,

File           Size (bytes)  Date     Time    Source
gbest1.seq.gz  19,454,020    5/22/04  5:04am  SDSC Mirror
gbest1.seq.gz  19,454,020    4/25/04  2:01am  NCBI Mirror
gbest1.seq.gz  19,454,020    4/25/04  2:01am  BioMirror

For the third part, do the build files really change, or are new entries and
revisions just added on as extra build files?  I read that the files are
non-cumulative, which would seem to confirm the latter, but the timestamps
are updated in sync with the latest build date.

How do I keep an updated mirror without losing daily builds or having to
download the whole thing every couple of months?  How do I verify that I
have the latest data, given that checking timestamps does not seem like it
will work?  Should I even bother with creating a true mirror?
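
To make the verification question concrete, here is a minimal sketch of the
kind of check I have in mind, comparing by file name and size instead of
timestamp.  The function name and the remote_sizes dictionary are my own
placeholders; how the sizes would be fetched from a given mirror is left
out, and a matching size does not prove the contents are identical.

```python
import os


def files_needing_download(remote_sizes, local_dir):
    """Return names whose local copy is missing or differs in size.

    remote_sizes is a hypothetical dict of file name -> size in bytes,
    as reported by whichever mirror is in use (obtaining it is not shown).
    """
    stale = []
    for name, remote_size in sorted(remote_sizes.items()):
        path = os.path.join(local_dir, name)
        # Timestamps are deliberately ignored, since they differ between
        # mirrors for what appears to be the same file.
        if not os.path.isfile(path) or os.path.getsize(path) != remote_size:
            stale.append(name)
    return stale
```

Called with {"gbest1.seq.gz": 19454020} and the local mirror directory, this
would list gbest1.seq.gz only if the local copy is absent or has a different
size.  If a mirror published checksums, comparing those would be a stronger
test, but I do not know whether any of them do.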

I ran across this recent thesis on some of the issues in maintaining these
types of databases accurately while minimizing file transfers:
http://if.anu.edu.au/Students/DamonSearle-2003-thesis.pdf

I know that BioMirror has some scripts to facilitate efficient transfers, but
do they handle updates?  I'm guessing this problem has already been
addressed; I just can't find the solution.

Thanks in advance for any input,

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Joseph Karalius
RA, Bioinformatics
Molecular Markers and Applied Genomics
Seminis Vegetable Seeds, Inc
37437 State Highway 16
Woodland, CA 95695-9353
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
