[Bioperl-l] tuning load_seqdatabase.db script in bioperl-db

Hilmar Lapp hlapp at gnf.org
Tue May 27 13:13:07 EDT 2003



> -----Original Message-----
> From: Hilmar Lapp 
> Sent: Tuesday, May 27, 2003 12:02 PM
> To: Nicolas Rueff; bioperl-l at bioperl.org
> Subject: RE: [Bioperl-l] tuning load_seqdatabase.db script in 
> bioperl-db
> 
> 
> It actually won't (shouldn't) lookup unless you provide the 
> --lookup switch on the command line. Was there a particular 
> observation other than the time spent that made you think it 
> looks up sequences (bioentries)?
> 
> I bet you're using PostgreSQL? If you are, that totally 
> explains the behaviour, and you won't be able to fix this in 
> code. Here's why, and what can be done about it.
> 
> 	- The Pg schema defines so-called rules on every table 
> that do a look-up by unique key first before allowing an 
> insert, because otherwise upon a UK failure the entire 
> transaction is aborted. This was a very elegant solution to a 
> problem the work-flow of the code has on PostgreSQL. I can 
> dig up links into the mailing list archive to several threads 
> discussing this issue.

Actually, thinking over my email, the preceding paragraph is most likely irrelevant. The point was (and I probably failed to make it clear) that those rules issue a lookup, and the performance of that lookup degrades dramatically once a large number of rows have been changed (insert/update/delete).

As a matter of fact, I've seen people on the Pg performance list report this performance degradation on insert without any rules defined. The problem is that any foreign key constraint will trigger a lookup too, whether or not those rules are present.

So the only real way out is to vacuum several times while the upload is running.
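
To make the idea of vacuuming during the upload concrete, here is a minimal sketch of a load loop that commits and vacuums after every so many records. It is written in Python purely for illustration and is not taken from load_seqdatabase.pl. Every name in it (load_with_periodic_vacuum, insert_record, commit, run_vacuum, vacuum_every) is hypothetical, and the database calls are passed in as plain callbacks so the loop can be run without a database.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def load_with_periodic_vacuum(
    records: Iterable[T],
    insert_record: Callable[[T], None],
    commit: Callable[[], None],
    run_vacuum: Callable[[], None],
    vacuum_every: int = 1000,
) -> int:
    """Insert every record, committing and vacuuming after each full batch.

    All names here are hypothetical; none of them exist in bioperl-db.
    Returns the number of records inserted.
    """
    if vacuum_every < 1:
        raise ValueError("vacuum_every must be a positive integer")

    count = 0
    for record in records:
        insert_record(record)
        count += 1
        if count % vacuum_every == 0:
            # PostgreSQL refuses to run VACUUM inside a transaction block,
            # so the pending inserts are committed first.
            commit()
            # On a real Pg connection this callback would issue
            # something like "VACUUM ANALYZE".
            run_vacuum()

    # Commit whatever is left over from the last, partial batch.
    commit()
    return count


if __name__ == "__main__":
    log = []
    n = load_with_periodic_vacuum(
        range(2500),
        insert_record=lambda rec: None,          # stand-in for the real insert
        commit=lambda: log.append("commit"),
        run_vacuum=lambda: log.append("vacuum"),
        vacuum_every=1000,
    )
    print(n, log.count("vacuum"), log.count("commit"))
```

With 2500 records and vacuum_every=1000, the loop vacuums twice (after records 1000 and 2000) and commits three times, the last commit covering the final partial batch. A sensible value for vacuum_every would have to be found by experiment; the 1000 used here is only a placeholder.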

	-hilmar

> 
> 	- When you insert, update, or delete large amounts of 
> data in PostgreSQL, you need to vacuum your schema regularly, 
> otherwise performance will degrade dramatically. Performance 
> problems related to this are regularly reported on the Pg 
> performance mailing list, and the usual advice is, vacuum 
> often, up to every few minutes. Since 7.2.x you can vacuum a 
> Pg database without it being locked (unless you vacuum full).
> 
> So, we could actually add a command-line switch that would 
> automatically issue a vacuum analyze after every so many 
> records if the driver is Pg. What about that?
> 
> 	-hilmar
> 
> > -----Original Message-----
> > From: Nicolas Rueff [mailto:rueff at mediagen.fr]
> > Sent: Monday, May 26, 2003 3:20 AM
> > To: bioperl-l at bioperl.org
> > Subject: [Bioperl-l] tuning load_seqdatabase.db script in bioperl-db
> > 
> > 
> > I'm using bioperl-db/script/biosql/load_seqdatabase.pl to
> > fill the biosql schema. The big issue with this script is that 
> > the time it takes is exponential, since for every new 
> > sequence, it has to search the database to check that the entry 
> > doesn't exist yet. Useful for updates, but not for a first-time fill.
> > 
> > For example, I used it with the last full swiss-prot release
> > (sprot41.dat) to spawn a new fresh database, and if the
> > computer could handle 100 inserts / sec, it drops to 2/sec 
> > near the end of the file.
> > 
> > I think it could be a good idea to add an option like
> > "--forceinsert" to avoid this problem.
> > 
> > --
> > Nicolas Rueff <rueff at mediagen.fr>
> > 
> > Mediagen SAS
> > Institut Pasteur de Lille
> > 1, rue du Professeur Calmette
> > Bâtiment Guérin, 3eme étage, BP 245
> > 59019 LILLE Cedex
> > Tel +33 3 20 87 72 76
> > Fax +33 3 20 87 72 82
> > 
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > 


