[BioSQL-l] Re: load_ncbi_taxonomy
Hilmar Lapp
hlapp at gnf.org
Sun Jun 1 04:21:08 EDT 2003
I've finally got this under control, with some optimizations in the
code, and most importantly some temporary surgical removal of
constraints.
It turned out there were two performance killer: 1) the insert rule on
the taxon table (which would need the ncbi_taxon_id index to be
analyzed regularly in order to perform properly), and 2) the
self-referential foreign key constraint (even though it was deferred!).
At first it also looked that nested set rebuild phase needs regular
VACUUM ANALYZE runs (which require a commit/begin cycle), but it seems
now that's not necessarily true. I wrote a few pgsql functions as an
API that the script uses to remove and restore constraints. You will
need to install those functions to take advantage of it.
Anyway, the whole NCBI taxon database loads now on my laptop in 30-40
mins, which is not too bad I think. It should be considerably faster on
a beefy server with multiple disks (i.e., with the transaction log on
another disk than the table files).
I also added a check that when rebuilding the nested set avoids updates
when unnecessary and obtains the children ordered so that they would
always come out the same. This should make updates much faster.
-hilmar
On Friday, May 30, 2003, at 12:33 PM, Aaron J Mackey wrote:
>
> I've had the same response from Postgres; dropping the constraint
> checking
> seemed to help, but it was still slow (on a beefy database server, it
> finishes in around 20 hours, with no other load). I suspect that
> maintaining all the indices is probably slow (but to get them back will
> still require the time, even if you drop them while loading the data).
>
> Postgres is not a speedy database. It has other nice features, though.
>
> -Aaron
>
> On Fri, 30 May 2003, Hilmar Lapp wrote:
>
>> I'm adding a bit documentation, and among the more important things,
>> I'm adding support for Oracle and MySQL and some more verbosity about
>> what's going on. I also fixed the problem with the ncbi_taxon_id being
>> mysteriously missing in the uploaded data (I suspect the root cause is
>> a perl 5.6.0-related bug).
>>
>> The snag I've hit is with Postgres. It's unbelievably slow on a fresh
>> database, when really all it should be doing in the first pass is
>> inserting nodes. It starts with ~370 rows/s (this is *after* dropping
>> the insert rule on taxon, otherwise it's very slow right away) and
>> drops down to ~40 rows/s and 20 rows/s after 60k and 130k rows,
>> respectively. Updating the nested set values takes place at 3-5 rows/s
>> (!). So, it wouldn't even finish over night.
>>
>> This is data from my PowerBook G4, 800 Mhz, single disk, 4096 shared
>> buffers in Pg, Postgres 7.3.2. Certainly not the most suitable
>> hardware
>> configuration, but the rest should be OK. Note that in comparison
>> MySQL
>> does ~2000 rows/s during the insert phase and ~500 rows/s during the
>> update phase, on the same hardware.
>>
>> Is anyone else running this script under Postgres? What's the
>> performance? If it does finish in a reasonable time, what is your Pg
>> configuration, and do you take special measures, like dropping an
>> index
>> before or something?
>>
>> There are still various options left to speed up the whole thing, but
>> that must be entirely driver specific then. In particular,
>>
>> - vacuum analyze after the insert phase
>> - drop all indexes on taxon except the primary key during the insert
>> phase (many of them are needed though during the update phase)
>>
>> I'm just wondering whether it's only me and I'm doing something wrong.
>> I doubt that I screwed up the code to this effect as I haven't really
>> touched the meat of it (SQL statements, code flow, algorithm).
>>
>> Unless Aaron wants me to, I'm not particularly happy committing the
>> updated version with the fixes before I can't get this to the point of
>> making at least one full test run on Pg... (apart from MySQL and
>> Oracle).
>>
>> -hilmar
>>
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey at virginia.edu
>
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
More information about the BioSQL-l
mailing list