[Bioperl-l] PostgreSQL & lookup & insert in bioperl-db

Wed Feb 26 18:25:21 EST 2003

This is a problem I stated a couple of months ago and that has 
basically remained unsolved. I try here to summarize it again, and I'll 
point out 3 alternatives I see for solving the problem. Each of those 
alternatives has different advantages and disadvantages. I'd like to 
get a community vote on what the best and/or most preferred solution 
would be. If you're using bioperl on biosql or plan doing so, please 
read and vote if you have an opinion (some of the alternatives would 
affect all RDBMSs, not just Pg). Otherwise ignore ...

Here's the background.

The current code in the bioperl-db adaptors to biosql run the 
connection with AutoCommit off, which means the client determines the 
transaction. Load_seqdatabase.pl (the main script for loading databases 
into biosql) treats one sequence entry as one transaction. If any sql 
statement required for loading the sequence fails unexpectedly, the 
whole transaction is rolled back, otherwise it is committed once the 
entry and all its annotation went in successfully. The emphasis rests 
on 'unexpectedly': INSERTs may fail due to unique key violations, which 
is caught and triggers a look-up of the affected entry by unique key. 
E.g. a dbxref may already exist; if so, it needs to be looked up in 
order to establish the association with the bioentry.

Here's the problem.

If within a transaction a particular statement fails, in Oracle and 
MySQL/InnoDB only that particular statement fails, not the previous 
ones in the same transaction, nor the subsequent ones. If you commit 
the transaction, all the succeeded statements' results get committed. 
In PostgreSQL, however, the entire transaction fails and is 
invalidated, making all previous and all subsequent statements fail 
until you call rollback (you cannot call commit on a failed 
transaction).

For bioperl-db and the granularity used by load_seqdatabase.pl, as a 
simple example this means on Postgres that if insertion of a dbxref 
fails, at that point the entire transaction is aborted, i.e., the 
previously inserted bioentry with sequence, features, comments, etc all 
go away. In practice this doesn't happen with bioperl-db because 
dbxrefs are cached and looked up first; however what does occur is the 
multiple associations of the same bioentry to the same dbxref, to give 
an example, which is prohibited by a unique key constraint. If this 
happens, the respective sequence entry is lost.

Here's the possible solutions.

1) Turn off transactions for PostgreSQL, i.e., run in auto-commit mode. 
Given the fact that Pg is a natively transactional database I find this 
option possible but sad. In addition, there would have to be code that 
specifically for PostgreSQL deletes the sequence entry if there was a 
failure during a particular sequence entry. (Otherwise rolling back the 
transaction takes care of cleaning up possibly incomplete rows.) On the 
plus side, this is probably the simplest one to implement.

2) Look up every record before insertion. On the plus side of things, 
this would maintain transactions. On the downside, entities with a 
finite number of records are already cached and looked up before insert 
(examples being dbxref, reference, ontology_term). Those that are not 
looked up are potentially infinite in number, meaning that 95% of the 
time the record to be inserted will not already exist. A prior lookup 
will be a waste of time in 95% of cases. I.e., this option will slow 
down performance.

3) Separate the look-up into PL/PgSQL code in the database. Instead of 
issuing a direct INSERT you call a stored procedure that looks up the 
entry and inserts it if not yet present. The advantage of this option 
is that it keeps the bioperl-db layout largely unaffected and hence 
treats all RDBMSs relatively equal (i.e., cares least about RDBMS 
idiosyncrasies). The downside is that this means that the Pg version of 
Biosql can't be fully auto-generated anymore from the MySQL version (I 
don't want to auto-generate the PL/PgSQL procedures from a MySQL schema 
definition). Also, people say PL/PgSQL is not the fastest thing out 
there.

Let me know your thoughts and votes. There may easily be more options, 
feel free to share.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------