[BioSQL-l] PostgreSQL & lookup & insert in bioperl-db
Hilmar Lapp
hlapp at gnf.org
Wed Feb 26 18:25:21 EST 2003
This is a problem I stated a couple of months ago and that has
basically remained unsolved. I try here to summarize it again, and I'll
point out 3 alternatives I see for solving the problem. Each of those
alternatives has different advantages and disadvantages. I'd like to
get a community vote on what the best and/or most preferred solution
would be. If you're using bioperl on biosql or plan doing so, please
read and vote if you have an opinion (some of the alternatives would
affect all RDBMSs, not just Pg). Otherwise ignore ...
Here's the background.
The current code in the bioperl-db adaptors to biosql run the
connection with AutoCommit off, which means the client determines the
transaction. Load_seqdatabase.pl (the main script for loading databases
into biosql) treats one sequence entry as one transaction. If any sql
statement required for loading the sequence fails unexpectedly, the
whole transaction is rolled back, otherwise it is committed once the
entry and all its annotation went in successfully. The emphasis rests
on 'unexpectedly': INSERTs may fail due to unique key violations, which
is caught and triggers a look-up of the affected entry by unique key.
E.g. a dbxref may already exist; if so, it needs to be looked up in
order to establish the association with the bioentry.
Here's the problem.
If within a transaction a particular statement fails, in Oracle and
MySQL/InnoDB only that particular statement fails, not the previous
ones in the same transaction, nor the subsequent ones. If you commit
the transaction, all the succeeded statements' results get committed.
In PostgreSQL, however, the entire transaction fails and is
invalidated, making all previous and all subsequent statements fail
until you call rollback (you cannot call commit on a failed
transaction).
For bioperl-db and the granularity used by load_seqdatabase.pl, as a
simple example this means on Postgres that if insertion of a dbxref
fails, at that point the entire transaction is aborted, i.e., the
previously inserted bioentry with sequence, features, comments, etc all
go away. In practice this doesn't happen with bioperl-db because
dbxrefs are cached and looked up first; however what does occur is the
multiple associations of the same bioentry to the same dbxref, to give
an example, which is prohibited by a unique key constraint. If this
happens, the respective sequence entry is lost.
Here's the possible solutions.
1) Turn off transactions for PostgreSQL, i.e., run in auto-commit mode.
Given the fact that Pg is a natively transactional database I find this
option possible but sad. In addition, there would have to be code that
specifically for PostgreSQL deletes the sequence entry if there was a
failure during a particular sequence entry. (Otherwise rolling back the
transaction takes care of cleaning up possibly incomplete rows.) On the
plus side, this is probably the simplest one to implement.
2) Look up every record before insertion. On the plus side of things,
this would maintain transactions. On the downside, entities with a
finite number of records are already cached and looked up before insert
(examples being dbxref, reference, ontology_term). Those that are not
looked up are potentially infinite in number, meaning that 95% of the
time the record to be inserted will not already exist. A prior lookup
will be a waste of time in 95% of cases. I.e., this option will slow
down performance.
3) Separate the look-up into PL/PgSQL code in the database. Instead of
issuing a direct INSERT you call a stored procedure that looks up the
entry and inserts it if not yet present. The advantage of this option
is that it keeps the bioperl-db layout largely unaffected and hence
treats all RDBMSs relatively equal (i.e., cares least about RDBMS
idiosyncrasies). The downside is that this means that the Pg version of
Biosql can't be fully auto-generated anymore from the MySQL version (I
don't want to auto-generate the PL/PgSQL procedures from a MySQL schema
definition). Also, people say PL/PgSQL is not the fastest thing out
there.
Let me know your thoughts and votes. There may easily be more options,
feel free to share.
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------
More information about the BioSQL-l
mailing list