[Bioperl-l] RE: load_seqdatabase.pl running SLOW!

Tue Jan 25 20:04:49 EST 2005

To be honest, I've never loaded a large file into a Pg installation. The problem I'd expect you to run into is that, if you started with a fresh database, the lookup queries will become slower and slower unless the stats are recomputed frequently through vacuum (which the load script won't do).
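As a rough illustration of the idea, here is a minimal Python sketch that interleaves a statistics refresh with a bulk load. It is not what load_seqdatabase.pl does; the names `load_with_periodic_analyze`, `load_one`, `run_sql` and the interval of 5000 rows are all hypothetical, and `run_sql` is assumed to execute a statement outside a transaction block, since Pg does not allow VACUUM inside one.

```python
from typing import Callable, Iterable


def load_with_periodic_analyze(
    records: Iterable,
    load_one: Callable[[object], None],
    run_sql: Callable[[str], None],
    every: int = 5000,  # hypothetical interval; tune for your data
) -> int:
    """Load records one by one, refreshing planner stats every `every` rows.

    `load_one` and `run_sql` are placeholders supplied by the caller, e.g.
    thin wrappers around a database cursor. `run_sql` is assumed to run
    in autocommit mode, because VACUUM cannot run inside a transaction.
    """
    loaded = 0
    for rec in records:
        load_one(rec)
        loaded += 1
        if loaded % every == 0:
            # Recompute statistics so lookups keep using the indexes
            # as the tables grow from empty.
            run_sql("VACUUM ANALYZE")
    return loaded
```

With an external loader like the script here, the same effect could be had by running vacuum by hand from a second session at intervals, provided the Pg version allows that alongside writes.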

I believe in more recent releases you can actually vacuum the database concurrently with write access; I'm not sure whether 7.2.x already allows this. You should strongly consider upgrading to at least 7.3, if not 7.4 or even 8.x. The Pg developers may not even answer questions about 7.2 anymore ...

Your observation that the slower machine with the later kernel is faster leaves me puzzled. In a blind test I would have guessed that the machine appearing faster has had its database vacuumed.

Not sure this is very helpful ...

 -hilmar

	-----Original Message----- 
	From: Barry Moore [mailto:barry.moore at genetics.utah.edu] 
	Sent: Tue 1/25/2005 3:15 PM 
	To: Bioperl list; Hilmar Lapp 
	Cc: 
	Subject: load_seqdatabase.pl running SLOW!

	Hilmar (or others)-

	I've set up a biosql based database using PostgreSQL 7.2 on a PC with an
	Intel Pentium 4 3.0 GHz processor, 800 MHz system Bus.  1 GB of RAM, and
	Linux (2.2 kernel - Debian woody distro).  Onto that I am loading
	~352,000 sequences from RefSeq complete rna collection using
	load_seqdatabase.pl.  It's running kind of slow - loading on average
	about 1 sequence every 2-5 seconds.  In the archives I've read your
	comments to a previous question like this suggesting two fast
	processors, a couple gigs of memory and 2-3 drives to really make things
	fly and while my system isn't that good, it seems like I should be doing
	better.  I got to experimenting on another (slower) system while waiting
	for things to load, and found that running the same script to load the
	same file goes about 3X faster on a 266 MHz Intel processor with 192 MB
	RAM.  Same installation of PostgreSQL (both installed from deb package
	with defaults), and same installation of Debian Linux (except that the
	kernel on the older slow machine has been updated to 2.4).  Another
	difference I noticed between the two is that the old 266 MHz machine is
	using about 75% CPU resources for perl and about 25% for postmaster
	whereas the faster 3 GHz machine (but slower running
	load_seqdatabase.pl) is using 95% of its CPU resources for postmaster
	and about 3% for perl.  Both systems are using up most of their memory,
	but little to no swap.  Could the kernel upgrade really be making the
	difference?  Any thoughts?  As it's going now I can wait over a week for
	all these sequences to load, or build the database on our dinosaur
	server in a couple of days and dump it across to our sexy new 3 GHz
	server.  Talk about bass ackwards!

	Barry

	--
	Barry Moore
	Dept. of Human Genetics
	University of Utah
	Salt Lake City, UT