[BioSQL-l] load_ncbi_taxonomy
Hilmar Lapp
hlapp at gnf.org
Fri May 30 13:28:54 EDT 2003
I'm adding a bit documentation, and among the more important things,
I'm adding support for Oracle and MySQL and some more verbosity about
what's going on. I also fixed the problem with the ncbi_taxon_id being
mysteriously missing in the uploaded data (I suspect the root cause is
a perl 5.6.0-related bug).
The snag I've hit is with Postgres. It's unbelievably slow on a fresh
database, when really all it should be doing in the first pass is
inserting nodes. It starts with ~370 rows/s (this is *after* dropping
the insert rule on taxon, otherwise it's very slow right away) and
drops down to ~40 rows/s and 20 rows/s after 60k and 130k rows,
respectively. Updating the nested set values takes place at 3-5 rows/s
(!). So, it wouldn't even finish over night.
This is data from my PowerBook G4, 800 Mhz, single disk, 4096 shared
buffers in Pg, Postgres 7.3.2. Certainly not the most suitable hardware
configuration, but the rest should be OK. Note that in comparison MySQL
does ~2000 rows/s during the insert phase and ~500 rows/s during the
update phase, on the same hardware.
Is anyone else running this script under Postgres? What's the
performance? If it does finish in a reasonable time, what is your Pg
configuration, and do you take special measures, like dropping an index
before or something?
There are still various options left to speed up the whole thing, but
that must be entirely driver specific then. In particular,
- vacuum analyze after the insert phase
- drop all indexes on taxon except the primary key during the insert
phase (many of them are needed though during the update phase)
I'm just wondering whether it's only me and I'm doing something wrong.
I doubt that I screwed up the code to this effect as I haven't really
touched the meat of it (SQL statements, code flow, algorithm).
Unless Aaron wants me to, I'm not particularly happy committing the
updated version with the fixes before I can't get this to the point of
making at least one full test run on Pg... (apart from MySQL and
Oracle).
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
More information about the BioSQL-l
mailing list