[Bioperl-l] PostgreSQL & lookup & insert in bioperl-db

Charles Hauser chauser at duke.edu
Thu Feb 27 11:06:01 EST 2003


Hilmar,

A couple of suggestions:

- I would crtainly run this problem by one of the postgres user groups
and see what suggestions they have.  My experience has been they are
sharp and quite helpful.
http://www.ca.postgresql.org/users-lounge/

- I would also suggest contacting the GMOD/Chado developers as I feel
certain they would be dealing with the same issues.


Here's the possible solutions.

1) Turn off transactions for PostgreSQL, i.e., run in auto-commit mode. 
Given the fact that Pg is a natively transactional database I find this 
option possible but sad. In addition, there would have to be code that 
specifically for PostgreSQL deletes the sequence entry if there was a 
failure during a particular sequence entry. (Otherwise rolling back the 
transaction takes care of cleaning up possibly incomplete rows.) On the 
plus side, this is probably the simplest one to implement.

I would keep transactions ON - data integrity should be paramount.

2) Look up every record before insertion. On the plus side of things, 
this would maintain transactions. On the downside, entities with a 
finite number of records are already cached and looked up before insert 
(examples being dbxref, reference, ontology_term). Those that are not 
looked up are potentially infinite in number, meaning that 95% of the 
time the record to be inserted will not already exist. A prior lookup 
will be a waste of time in 95% of cases. I.e., this option will slow 
down performance.

I ran a version of this (I think) for an EST database, where I looked up
say an est_id prior to loading it and a contig_id into an est_contig
relational table. When loading ~10,000 records I crashed our server. 
Needless to say the sysadmin was not a happy camper.

3) Separate the look-up into PL/PgSQL code in the database. Instead of 
issuing a direct INSERT you call a stored procedure that looks up the 
entry and inserts it if not yet present. The advantage of this option 
is that it keeps the bioperl-db layout largely unaffected and hence 
treats all RDBMSs relatively equal (i.e., cares least about RDBMS 
idiosyncrasies). The downside is that this means that the Pg version of 
Biosql can't be fully auto-generated anymore from the MySQL version (I 
don't want to auto-generate the PL/PgSQL procedures from a MySQL schema 
definition). Also, people say PL/PgSQL is not the fastest thing out 
there.

I have no experience using SPs, but when I browse others schemas I see
it used often - perhaps the best soln?

Charles








More information about the Bioperl-l mailing list