[Bioperl-l] tuning load_seqdatabase.db script in bioperl-db
Hilmar Lapp
hlapp at gnf.org
Tue May 27 13:01:44 EDT 2003
It actually won't (shouldn't) look up anything unless you provide the --lookup switch on the command line. Was there a particular observation other than the time spent that made you think it looks up sequences (bioentries)?
I bet you're using PostgreSQL? If you are, that totally explains the behaviour, and you won't be able to fix this in code. Here's why, and what can be done about it.
- The Pg schema defines so-called rules on every table. Each rule does a look-up by unique key (UK) before allowing an insert, because otherwise a UK failure aborts the entire transaction. This was a very elegant solution to a problem the work-flow of the code has on PostgreSQL. I can dig up links into the mailing list archive to several threads discussing this issue.
- When you insert, update, or delete large amounts of data in PostgreSQL, you need to vacuum your schema regularly, otherwise performance will degrade dramatically. Performance problems related to this are regularly reported on the Pg performance mailing list, and the usual advice is to vacuum often, as often as every few minutes. Since 7.2.x you can vacuum a Pg database without it being locked (unless you vacuum full).
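To make the first point concrete, here is a minimal sketch of the look-up-before-insert pattern. It is only an analogy: it uses Python with an in-memory SQLite table instead of a PostgreSQL rule, and the table name bioentry_demo and the function insert_unless_present are made up for illustration, not taken from the biosql schema. What it shows is that every insert pays for a look-up by unique key, even when the loader itself never asks for one.

```python
import sqlite3


def insert_unless_present(conn, accession, name):
    """Insert a row only if no row with the same unique key exists.

    Returns True if a row was inserted, False if the look-up found one.
    """
    # The look-up by unique key happens on every call, duplicate or not.
    cur = conn.execute(
        "SELECT 1 FROM bioentry_demo WHERE accession = ?", (accession,)
    )
    if cur.fetchone() is not None:
        return False  # skip quietly instead of raising a unique-key error
    conn.execute(
        "INSERT INTO bioentry_demo (accession, name) VALUES (?, ?)",
        (accession, name),
    )
    return True


def demo():
    # Hypothetical one-table schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bioentry_demo (accession TEXT UNIQUE, name TEXT)")
    results = [
        insert_unless_present(conn, "P12345", "first"),
        insert_unless_present(conn, "P12345", "duplicate"),
        insert_unless_present(conn, "Q67890", "second"),
    ]
    count = conn.execute("SELECT COUNT(*) FROM bioentry_demo").fetchone()[0]
    conn.close()
    return results, count
```

The duplicate is skipped without an error, so a surrounding transaction would survive it. The cost is one extra indexed query per insert.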
So, we could actually add a command-line switch that would automatically issue a vacuum analyze after every so many records if the driver is Pg. What about that?
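A rough sketch of what such a switch could do, written in Python to keep it short. The function name, its parameters, and the default interval are assumptions for illustration; nothing like this exists in load_seqdatabase.pl today. The insert and vacuum arguments are stand-ins for the real database calls.

```python
def load_with_periodic_vacuum(records, insert, vacuum, every=1000):
    """Insert each record, calling vacuum() after every `every` inserts.

    `insert` and `vacuum` are caller-supplied callables (hypothetical
    stand-ins for the real database calls). Returns the number of
    vacuum calls made.
    """
    if every < 1:
        raise ValueError("every must be a positive integer")
    vacuums = 0
    for count, record in enumerate(records, start=1):
        insert(record)
        if count % every == 0:
            # In a real loader this is where VACUUM ANALYZE would be issued.
            vacuum()
            vacuums += 1
    return vacuums
```

One thing to keep in mind for a real implementation: PostgreSQL refuses to run VACUUM inside a transaction block, so the loader would have to commit its pending inserts before each vacuum call.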
-hilmar
> -----Original Message-----
> From: Nicolas Rueff [mailto:rueff at mediagen.fr]
> Sent: Monday, May 26, 2003 3:20 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] tuning load_seqdatabase.db script in bioperl-db
>
>
> I'm using bioperl-db/script/biosql/load_seqdatabase.pl to
> fill the biosql schema. The big issue of this script is that
> the time it takes is exponential, since for every new
> sequence, it has to search in the database if the entry
> doesn't exist yet. Useful for updates, but not for first-time fill.
>
> For example, I used it with the last full swiss-prot release
> (sprot41.dat) to spawn a new fresh database, and if the
> computer could handle 100 inserts / sec, it drops to 2/sec
> near the end of the file.
>
> I think it could be a good idea to add an option like
> "--forceinsert" to avoid this problem.
>
> --
> Nicolas Rueff <rueff at mediagen.fr>
>
> Mediagen SAS
> Institut Pasteur de Lille
> 1, rue du Professeur Calmette
> Bâtiment Guérin, 3eme étage, BP 245
> 59019 LILLE Cedex
> Tel +33 3 20 87 72 76
> Fax +33 3 20 87 72 82
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>