[Bioperl-l] biosql primary key generation

Sat, 10 Aug 2002 20:23:43 -0700

I've talked about this to several people at ISMB and BOSC, including 
ThomasD and ChrisM, and at least no-one told me to stop. So, I'm 
repeating my proposal here to a wider audience, to give everyone a 
chance to shout.

First off, I'll give a backgrounder on the problem that bothers me. 
If you know what I'm talking about, skip the next paragraph.

The problem with the topic is that the MySQL idiosyncratic way of 
generating PKs behind the scenes via AUTO_INCREMENT fields is 
completely incompatible with any other reasonable RDBMS, including 
Postgres and Oracle. The problem arises if you want control over the 
new PK assigned in INSERTs in order to return them to an 
application. The 'normal' way to achieve this is that the RDBMS 
provides you with so-called sequences, from which you obtain a new, 
guaranteed unique PK and then supply this PK to the INSERT statement 
instead of leaving the field empty. Since MySQL does not have 
sequences, you need to resort to calling a certain function _after_ 
executing the INSERT statement, which is not only very different but 
also makes for a different calling flow. Hence, porting to Oracle, 
which is what I'm trying to do, is a major headache or a ticket to a 
great deal of mess and code duplication.

Now my proposal: assume for the calling flow generally that the 
actual RDBMS does have sequences, and hence before inserting a 
record with a PK obtain a new value and supply it to the insert 
statement. For RDBMSs like MySQL that do not have sequences, fake a 
sequence.

Technically, given record- or table-level locks, a sequence can be 
faked reasonably easy using a dedicated table. In fact, it turns out 
that there is already a package DBIx::MySQLSequence on CPAN, which 
does exactly that and wraps it in a nice API.

The only problem with faking sequences is if inserts are made not 
only through bioperl-db, but also e.g. biojava (or any other bio* 
package other than bioperl) into a MySQL-based database. I.e., other 
packages have to re-implement the base functionality of 
DBIx::MySQLSequence (which should be fairly easy though) to the 
extent of averting PK clashes

Other than inserts, other projects would be unaffected.

Comments appreciated.

For me this is the sanest approach to cross-RDBMS code at this time, 
and hence I've already gone ahead and started implementing it. I've 
done a driver-specific API interface, the MySQL implementation, and  
dynamic loading of the respective driver. Next thing is I'll rewrite 
all inserts to follow the scheme I'm proposing.

If people vote against this I'll branch off and commit the changes 
to the branch.

	-hilmar
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------