[BioSQL-l] Re: load_ncbi_taxonomy
Hilmar Lapp
hlapp at gnf.org
Sun Jun 1 13:45:13 EDT 2003
:-)
I don't even have a strong feeling in this case, as with the new taxon
tables we've made a strong move towards deferring taxon data to an
authority (whose errors are beyond our control).

Obviously, if we remove the constraint completely, there is nothing
that prevents you from removing a connecting node (e.g., 'Primates' :),
so someone who wanted to mess up his taxon data wouldn't notice until
it's too late. I'd be OK with that risk in this case ...
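
For reference, removing the constraint would amount to something along
these lines (just a sketch; the constraint name below is a placeholder
and depends on how your schema instance was created):

  -- drop the self-referential FK on taxon.parent_taxon_id
  ALTER TABLE taxon DROP CONSTRAINT taxon_parent_taxon_id_fkey;

  -- and, if one ever wanted it back afterwards:
  ALTER TABLE taxon ADD CONSTRAINT taxon_parent_taxon_id_fkey
        FOREIGN KEY (parent_taxon_id) REFERENCES taxon (taxon_id)
        DEFERRABLE INITIALLY DEFERRED;
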
-hilmar
On Sunday, June 1, 2003, at 03:51 AM, Aaron J Mackey wrote:
>
> I know how much you love constraints, Hilmar, but I think that dropping
> the constraint may be the best way to go (in general, when slurping from
> outside sources, we always seem to run into things like this, various
> small violations of basic assumptions; constraints help you to recognize
> them, but also create obstacles for continuing work).
>
> -Aaron
>
> On Sun, 1 Jun 2003, Hilmar Lapp wrote:
>
>> I was testing with a download from NCBI that was a couple of months
>> old. I went 'real' now with a current download - and voila - it
>> doesn't work anymore!
>>
>> It turns out the current NCBI taxon database contains nodes that
>> reference nonexistent parents (!). This is what you call a reference
>> database. Wonderful.
>>
>> biosql=> select * from taxon t where not exists
>>            (select 1 from taxon n where n.taxon_id = t.parent_taxon_id);
>>  taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank | genetic_code | mito_genetic_code | left_value | right_value
>> ----------+---------------+-----------------+-----------+--------------+-------------------+------------+-------------
>>    120645 |        120645 |          232080 | no rank   |            1 |                 4 |     205944 |      205945
>>    169388 |        169388 |          232080 | no rank   |            1 |                 4 |     205937 |      205956
>>     13617 |         13617 |          232194 | genus     |            1 |                 5 |     280206 |      280223
>>    114837 |        114837 |          232208 | species   |            1 |                 4 |     210892 |      210893
>>    213248 |        213248 |          232115 | species   |           11 |                 0 |      79778 |       79779
>>    213860 |        213860 |          232120 | genus     |            1 |                 5 |     280412 |      280415
>>    227072 |        227072 |          232120 | genus     |            1 |                 5 |            |
>>    227073 |        227073 |          232149 | species   |            1 |                 5 |            |
>>    227074 |        227074 |          232152 | species   |            1 |                 5 |            |
>>
>> I checked some of these in the nodes.dmp file; the parents are indeed
>> absent.
>>
>> So I guess we
>>
>> 1) forget about enforcing a foreign key constraint on parent_taxon_id
>> (as NCBI apparently happily doesn't do that either)
>>
>> or
>>
>> 2) audaciously delete the nodes with bogus parents at the end of
>> uploading the nodes (see the sketch below), and leave it to NCBI to
>> bring them back next time with a better parent designation.
>>
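>> For option 2, the cleanup would be something like the following (a
>> minimal sketch, untested; note that deleting a node may in turn orphan
>> its children, so one would repeat this until no rows are affected):
>>
>>   -- remove nodes whose parent is not present in the taxon table
>>   DELETE FROM taxon
>>    WHERE parent_taxon_id IS NOT NULL
>>      AND NOT EXISTS (SELECT 1 FROM taxon n
>>                      WHERE n.taxon_id = taxon.parent_taxon_id);
>>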
>> Any votes?
>>
>> -hilmar
>>
>> On Sunday, June 1, 2003, at 03:21 AM, Hilmar Lapp wrote:
>>
>>> I've finally got this under control, with some optimizations in the
>>> code and, most importantly, some temporary surgical removal of
>>> constraints.
>>>
>>> It turned out there were two performance killers: 1) the insert rule
>>> on the taxon table (which would need the ncbi_taxon_id index to be
>>> analyzed regularly in order to perform properly), and 2) the
>>> self-referential foreign key constraint (even though it was
>>> deferred!). At first it also looked as though the nested set rebuild
>>> phase needed regular VACUUM ANALYZE runs (which require a
>>> commit/begin cycle), but it now seems that's not necessarily true. I
>>> wrote a few plpgsql functions as an API that the script uses to
>>> remove and restore constraints. You will need to install those
>>> functions to take advantage of this.
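>>>
>>> The functions are along these lines (a simplified sketch, not the
>>> literal code that's checked in; the function and constraint names
>>> here are just placeholders):
>>>
>>>   -- drop the self-referential FK so the bulk load isn't slowed down
>>>   CREATE OR REPLACE FUNCTION unconstrain_taxon() RETURNS INTEGER AS '
>>>   BEGIN
>>>       ALTER TABLE taxon DROP CONSTRAINT FKtaxon_taxon;
>>>       RETURN 1;
>>>   END;
>>>   ' LANGUAGE 'plpgsql';
>>>
>>>   -- restore the FK once the load has finished
>>>   CREATE OR REPLACE FUNCTION constrain_taxon() RETURNS INTEGER AS '
>>>   BEGIN
>>>       ALTER TABLE taxon ADD CONSTRAINT FKtaxon_taxon
>>>             FOREIGN KEY (parent_taxon_id) REFERENCES taxon (taxon_id)
>>>             DEFERRABLE INITIALLY DEFERRED;
>>>       RETURN 1;
>>>   END;
>>>   ' LANGUAGE 'plpgsql';
>>>
>>> The script would then simply do a SELECT unconstrain_taxon() before
>>> the load and a SELECT constrain_taxon() afterwards.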
>>>
>>> Anyway, the whole NCBI taxon database now loads on my laptop in
>>> 30-40 mins, which I think is not too bad. It should be considerably
>>> faster on a beefy server with multiple disks (i.e., with the
>>> transaction log on a different disk than the table files).
>>>
>>> I also added a check when rebuilding the nested set that avoids
>>> updates when they are unnecessary, and the children are now obtained
>>> in a fixed order so that they always come out the same. This should
>>> make subsequent updates much faster.
>>>
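>>> To illustrate the idea (a sketch with DBI-style placeholders, not the
>>> literal statements in the script):
>>>
>>>   -- fetch children in a stable order so the nested set values are
>>>   -- reproducible from run to run
>>>   SELECT taxon_id, left_value, right_value
>>>     FROM taxon
>>>    WHERE parent_taxon_id = ?
>>>    ORDER BY ncbi_taxon_id;
>>>
>>>   -- only touch a row if its nested set values actually changed
>>>   UPDATE taxon SET left_value = ?, right_value = ?
>>>    WHERE taxon_id = ?
>>>      AND (left_value IS NULL OR left_value <> ?
>>>           OR right_value IS NULL OR right_value <> ?);
>>>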
>>> -hilmar
>>>
>>> On Friday, May 30, 2003, at 12:33 PM, Aaron J Mackey wrote:
>>>
>>>>
>>>> I've had the same response from Postgres; dropping the constraint
>>>> checking seemed to help, but it was still slow (on a beefy database
>>>> server, it finishes in around 20 hours, with no other load). I
>>>> suspect that maintaining all the indices is probably what's slow
>>>> (but rebuilding them afterwards still takes time, even if you drop
>>>> them while loading the data).
>>>>
>>>> Postgres is not a speedy database. It has other nice features,
>>>> though.
>>>>
>>>> -Aaron
>>>>
>>>> On Fri, 30 May 2003, Hilmar Lapp wrote:
>>>>
>>>>> I'm adding a bit of documentation, and among the more important
>>>>> things, I'm adding support for Oracle and MySQL and some more
>>>>> verbosity about what's going on. I also fixed the problem with the
>>>>> ncbi_taxon_id being mysteriously missing from the uploaded data (I
>>>>> suspect the root cause is a perl 5.6.0-related bug).
>>>>>
>>>>> The snag I've hit is with Postgres. It's unbelievably slow on a
>>>>> fresh database, when really all it should be doing in the first
>>>>> pass is inserting nodes. It starts at ~370 rows/s (this is *after*
>>>>> dropping the insert rule on taxon, otherwise it's very slow right
>>>>> away) and drops down to ~40 rows/s and ~20 rows/s after 60k and
>>>>> 130k rows, respectively. Updating the nested set values proceeds
>>>>> at 3-5 rows/s (!). So, it wouldn't even finish overnight.
>>>>>
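>>>>> By 'dropping the insert rule' I mean something along these lines,
>>>>> where the rule name is just a placeholder for whatever the schema
>>>>> defines on taxon:
>>>>>
>>>>>   DROP RULE rule_taxon_i ON taxon;
>>>>>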
>>>>> This is data from my PowerBook G4, 800 MHz, single disk, 4096
>>>>> shared buffers in Pg, Postgres 7.3.2. Certainly not the most
>>>>> suitable hardware configuration, but the rest should be OK. Note
>>>>> that, in comparison, MySQL does ~2000 rows/s during the insert
>>>>> phase and ~500 rows/s during the update phase, on the same
>>>>> hardware.
>>>>>
>>>>> Is anyone else running this script under Postgres? What's the
>>>>> performance? If it does finish in a reasonable time, what is your
>>>>> Pg configuration, and do you take special measures, like dropping
>>>>> an index beforehand?
>>>>>
>>>>> There are still various options left to speed up the whole thing,
>>>>> but those would have to be entirely driver-specific. In particular
>>>>> (sketched below):
>>>>>
>>>>> - vacuum analyze after the insert phase
>>>>> - drop all indexes on taxon except the primary key during the
>>>>>   insert phase (many of them are needed during the update phase,
>>>>>   though)
>>>>>
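>>>>> A minimal sketch of the two measures (the index name is just a
>>>>> placeholder; any secondary index on taxon would be handled the
>>>>> same way):
>>>>>
>>>>>   -- refresh planner statistics once the insert phase is done
>>>>>   VACUUM ANALYZE taxon;
>>>>>
>>>>>   -- drop a secondary index before the insert phase ...
>>>>>   DROP INDEX taxon_ncbi_taxon_id_idx;
>>>>>
>>>>>   -- ... and recreate it before the nested set update phase
>>>>>   CREATE INDEX taxon_ncbi_taxon_id_idx ON taxon (ncbi_taxon_id);
>>>>>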
>>>>> I'm just wondering whether it's only me and I'm doing something
>>>>> wrong. I doubt that I screwed up the code to this effect, as I
>>>>> haven't really touched the meat of it (SQL statements, code flow,
>>>>> algorithm).
>>>>>
>>>>> Unless Aaron wants me to, I'm not particularly happy committing
>>>>> the updated version with the fixes before I can get this to the
>>>>> point of making at least one full test run on Pg... (apart from
>>>>> MySQL and Oracle).
>>>>>
>>>>> -hilmar
>>>>>
>>>>
>>>> --
>>>> Aaron J Mackey
>>>> Pearson Laboratory
>>>> University of Virginia
>>>> (434) 924-2821
>>>> amackey at virginia.edu
>>>>
>>>>
>>>>
>>> --
>>> -------------------------------------------------------------
>>> Hilmar Lapp email: lapp at gnf.org
>>> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
>>> -------------------------------------------------------------
>>>
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>>
>>
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey at virginia.edu
>
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------