[BioSQL-l] load_ncbi_taxonomy.pl

Peter biopython at maubp.freeserve.co.uk
Sat Aug 2 12:30:46 UTC 2008


On Sat, Aug 2, 2008 at 1:15 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
> These sound like reasonable times, depending on your machine configuration.
> I suspect that PostgreSQL might even be a bit faster, as that's a similar
> time to what I'm observing on my laptop.
>
> BTW if you provide --verbose=2 on the command line you'll get rows/time
> statistics. The slowest steps (recomputing nested set values, and inserting
> taxon names) average between 900-1800 rows/s on my laptop, depending on what
> else is going on (I suspect the spotlight indexer to contend for the disk
> drive on occasion). The faster steps (e.g. inserting taxon nodes) I observe
> at up to 2500-4000 rows/s.

I'm seeing about 900 rows/s when recomputing the nested set
values, which means my two-year-old desktop is slower than your laptop.
This is an AMD Athlon 64 X2 4600+ Socket 939 dual core machine, with a
Seagate Barracuda hard drive (7200rpm, 200GB, 8MB Cache, IDE Ultra
ATA100), running Ubuntu Dapper Drake (due for an upgrade soon!).
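
For readers unfamiliar with the term: nested set left/right values number
each node so that all of its descendants fall strictly between its own left
and right values. The following is a minimal Python sketch of that numbering
using a depth-first traversal. It is only an illustration and not how
load_ncbi_taxonomy.pl performs the rebuild; the function name and the
parent-to-children mapping are assumptions made for the example.

```python
def nested_set(children, root):
    """Return {node: (left, right)} for the tree rooted at `root`.

    `children` maps a node ID to a list of its child IDs. This input
    format is a hypothetical one chosen for illustration.
    """
    values = {}
    left = {root: 1}
    counter = 1
    # Each stack entry pairs a node with an iterator over its children.
    # An explicit stack avoids Python's recursion limit on deep trees.
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, kids = stack[-1]
        child = next(kids, None)
        counter += 1
        if child is None:
            # Every child has been numbered, so close this node.
            values[node] = (left[node], counter)
            stack.pop()
        else:
            # Open the child and descend into it.
            left[child] = counter
            stack.append((child, iter(children.get(child, ()))))
    return values


if __name__ == "__main__":
    tree = {1: [2, 3], 2: [4]}
    print(nested_set(tree, 1))
    # {4: (3, 4), 2: (2, 5), 3: (6, 7), 1: (1, 8)}
```

Every node is visited once on the way down and once on the way up, so a
full rebuild touches each row of the tree.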

$ time perl ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql \
    --dbuser root --verbose=2
Loading NCBI taxon database in taxdata:
        ... retrieving all taxon nodes in the database
        ... reading in taxon nodes from nodes.dmp
        ... insert / update / delete taxon nodes
                20000/448630 done (in 0 secs, 20000.0 rows/s)
                40000/448630 done (in 1 secs, 20000.0 rows/s)
                60000/448630 done (in 0 secs, 20000.0 rows/s)
                80000/448630 done (in 0 secs, 20000.0 rows/s)
                100000/448630 done (in 0 secs, 20000.0 rows/s)
                120000/448630 done (in 0 secs, 20000.0 rows/s)
                140000/448630 done (in 1 secs, 20000.0 rows/s)
                160000/448630 done (in 0 secs, 20000.0 rows/s)
                180000/448630 done (in 0 secs, 20000.0 rows/s)
                200000/448630 done (in 0 secs, 20000.0 rows/s)
                220000/448630 done (in 0 secs, 20000.0 rows/s)
                240000/448630 done (in 1 secs, 20000.0 rows/s)
                260000/448630 done (in 0 secs, 20000.0 rows/s)
                280000/448630 done (in 0 secs, 20000.0 rows/s)
                300000/448630 done (in 0 secs, 20000.0 rows/s)
                320000/448630 done (in 0 secs, 20000.0 rows/s)
                340000/448630 done (in 1 secs, 20000.0 rows/s)
                360000/448630 done (in 0 secs, 20000.0 rows/s)
                380000/448630 done (in 0 secs, 20000.0 rows/s)
                400000/448630 done (in 0 secs, 20000.0 rows/s)
                420000/448630 done (in 0 secs, 20000.0 rows/s)
                440000/448630 done (in 1 secs, 20000.0 rows/s)
        ... updating new parent IDs
        ... (committing nodes)
        ... rebuilding nested set left/right values
                20000 done (in 22 secs, 909.1 rows/s)
                40000 done (in 22 secs, 909.1 rows/s)
                60000 done (in 23 secs, 869.6 rows/s)
                80000 done (in 22 secs, 909.1 rows/s)
                100000 done (in 22 secs, 909.1 rows/s)
                120000 done (in 22 secs, 909.1 rows/s)
                140000 done (in 22 secs, 909.1 rows/s)
                160000 done (in 22 secs, 909.1 rows/s)
                180000 done (in 22 secs, 909.1 rows/s)
                200000 done (in 21 secs, 952.4 rows/s)
                220000 done (in 21 secs, 952.4 rows/s)
                240000 done (in 22 secs, 909.1 rows/s)
                260000 done (in 22 secs, 909.1 rows/s)
                280000 done (in 21 secs, 952.4 rows/s)
                300000 done (in 22 secs, 909.1 rows/s)
                320000 done (in 21 secs, 952.4 rows/s)
                340000 done (in 22 secs, 909.1 rows/s)
                360001 done (in 22 secs, 909.1 rows/s)
                380001 done (in 22 secs, 909.1 rows/s)
                400001 done (in 21 secs, 952.4 rows/s)
                420001 done (in 22 secs, 909.1 rows/s)
                440001 done (in 21 secs, 952.4 rows/s)
        ... reading in taxon names from names.dmp
        ... deleting old taxon names
        ... inserting new taxon names
                20000 done (in 3 secs, 6666.7 rows/s)
                40000 done (in 2 secs, 10000.0 rows/s)
                60000 done (in 4 secs, 5000.0 rows/s)
                80000 done (in 3 secs, 6666.7 rows/s)
                100000 done (in 5 secs, 4000.0 rows/s)
                120000 done (in 6 secs, 3333.3 rows/s)
                140000 done (in 7 secs, 2857.1 rows/s)
                160000 done (in 7 secs, 2857.1 rows/s)
                180000 done (in 8 secs, 2500.0 rows/s)
                200000 done (in 8 secs, 2500.0 rows/s)
                220000 done (in 8 secs, 2500.0 rows/s)
                240000 done (in 9 secs, 2222.2 rows/s)
                260000 done (in 9 secs, 2222.2 rows/s)
                280000 done (in 10 secs, 2000.0 rows/s)
                300000 done (in 10 secs, 2000.0 rows/s)
                320000 done (in 10 secs, 2000.0 rows/s)
                340000 done (in 10 secs, 2000.0 rows/s)
                360000 done (in 10 secs, 2000.0 rows/s)
                380000 done (in 10 secs, 2000.0 rows/s)
                400000 done (in 11 secs, 1818.2 rows/s)
                420000 done (in 11 secs, 1818.2 rows/s)
                440000 done (in 11 secs, 1818.2 rows/s)
                460000 done (in 10 secs, 2000.0 rows/s)
                480000 done (in 10 secs, 2000.0 rows/s)
                500000 done (in 11 secs, 1818.2 rows/s)
                520000 done (in 11 secs, 1818.2 rows/s)
                540000 done (in 12 secs, 1666.7 rows/s)
                560000 done (in 10 secs, 2000.0 rows/s)
                580000 done (in 12 secs, 1666.7 rows/s)
                600000 done (in 12 secs, 1666.7 rows/s)
                620000 done (in 11 secs, 1818.2 rows/s)
        ... cleaning up
Done.

real    13m13.805s
user    2m3.548s
sys     0m13.781s
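
To turn output like the above into per-step totals, a small Python sketch
along these lines might help. It assumes the --verbose=2 log format matches
what is pasted here (step headers starting with "..." and progress lines
reporting a cumulative row count plus the seconds taken by each batch); the
function name and regular expressions are illustrative and not part of the
BioSQL scripts.

```python
import re

# Progress lines, e.g. "20000 done (in 22 secs, 909.1 rows/s)" or
# "20000/448630 done (in 0 secs, 20000.0 rows/s)".
PROGRESS = re.compile(r"^\s*(\d+)(?:/\d+)? done \(in (\d+) secs?,")
# Step headers, e.g. "... rebuilding nested set left/right values".
STEP = re.compile(r"^\s*\.\.\. (.+?)\s*$")


def summarise(log_text):
    """Return {step: (rows, seconds)} for each step with progress lines."""
    totals = {}
    step = None
    for line in log_text.splitlines():
        header = STEP.match(line)
        if header:
            step = header.group(1)
            continue
        progress = PROGRESS.match(line)
        if progress and step is not None:
            rows, secs = int(progress.group(1)), int(progress.group(2))
            _, so_far = totals.get(step, (0, 0))
            # The row count is cumulative, so keep the latest value;
            # the seconds are reported per batch, so add them up.
            totals[step] = (rows, so_far + secs)
    return totals


if __name__ == "__main__":
    sample = """\
        ... rebuilding nested set left/right values
                20000 done (in 22 secs, 909.1 rows/s)
                40000 done (in 22 secs, 909.1 rows/s)
                60000 done (in 23 secs, 869.6 rows/s)
"""
    for name, (rows, secs) in summarise(sample).items():
        # Batches reported as taking 0 secs give no usable rate.
        rate = f"{rows / secs:.1f} rows/s" if secs else "n/a"
        print(f"{name}: {rows} rows in {secs} s ({rate})")
    # rebuilding nested set left/right values: 60000 rows in 67 s (895.5 rows/s)
```

Because the log rounds each batch to whole seconds, the totals are only
approximate, especially for fast steps where most batches report 0 secs.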

>
> Thanks for all the testing, it's much appreciated!
>

This is only very cursory, confirming the script runs without showing
any error messages, but it's better than no testing ;)

Peter


