[BioSQL-l] load_ncbi_taxonomy

Hilmar Lapp hlapp at gnf.org
Fri May 30 13:28:54 EDT 2003


I'm adding a bit documentation, and among the more important things, 
I'm adding support for Oracle and MySQL and some more verbosity about 
what's going on. I also fixed the problem with the ncbi_taxon_id being 
mysteriously missing in the uploaded data (I suspect the root cause is 
a perl 5.6.0-related bug).

The snag I've hit is with Postgres. It's unbelievably slow on a fresh 
database, when really all it should be doing in the first pass is 
inserting nodes. It starts with ~370 rows/s (this is *after* dropping 
the insert rule on taxon, otherwise it's very slow right away) and 
drops down to ~40 rows/s and 20 rows/s after 60k and 130k rows, 
respectively. Updating the nested set values takes place at 3-5 rows/s 
(!). So, it wouldn't even finish over night.

This is data from my PowerBook G4, 800 Mhz, single disk, 4096 shared 
buffers in Pg, Postgres 7.3.2. Certainly not the most suitable hardware 
configuration, but the rest should be OK. Note that in comparison MySQL 
does ~2000 rows/s during the insert phase and ~500 rows/s during the 
update phase, on the same hardware.

Is anyone else running this script under Postgres? What's the 
performance? If it does finish in a reasonable time, what is your Pg 
configuration, and do you take special measures, like dropping an index 
before or something?

There are still various options left to speed up the whole thing, but 
that must be entirely driver specific then. In particular,

	- vacuum analyze after the insert phase
	- drop all indexes on taxon except the primary key during the insert 
phase (many of them are needed though during the update phase)

I'm just wondering whether it's only me and I'm doing something wrong. 
I doubt that I screwed up the code to this effect as I haven't really 
touched the meat of it (SQL statements, code flow, algorithm).

Unless Aaron wants me to, I'm not particularly happy committing the 
updated version with the fixes before I can't get this to the point of 
making at least one full test run on Pg... (apart from MySQL and 
Oracle).

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



More information about the BioSQL-l mailing list