[Bioperl-l] bioperl-db performance: load_seqdatabase.pl throughput speed
Hilmar Lapp
hlapp at gnf.org
Tue May 11 18:38:30 EDT 2004
With so little memory you'll very quickly run into memory contention
issues. I'm not exactly sure about the memory caching model of mysql,
but AFAIR it uses its own memory pool (like Oracle does) rather than
deferring that heavily to the OS (like Pg). Bioperl-db caches a lot;
specifically, it caches species, ontology terms, and dbxrefs. With a
database like swissprot that is very diverse in terms of species
(you've got several thousand in there) and richly annotated with
dbxrefs, the loading script alone will end up using *a lot* of memory.
Last time I uploaded swissprot (as Uniprot) the loading script process
consumed about 700MB. The db server was running on another machine ...
I typically see 4-10 seqs/second on a 1.8GHz CPU (1GB memory) against
an Oracle database running on an even faster machine.
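To put these rates in perspective, here is a back-of-the-envelope sketch in Python that converts a steady load rate into wall-clock hours. The entry count is an assumed round placeholder, not a figure from this thread; substitute the real count for your release. The function and variable names are made up for illustration.

```python
def load_time_hours(num_entries: int, seqs_per_second: float) -> float:
    """Wall-clock hours needed to load num_entries at a steady rate."""
    if seqs_per_second <= 0:
        raise ValueError("rate must be positive")
    return num_entries / seqs_per_second / 3600.0


# ASSUMPTION: placeholder entry count, not stated anywhere in this thread.
ASSUMED_ENTRIES = 150_000

# Rates mentioned in the thread, expressed as sequences per second.
rates = {
    "1 seq per 10 s": 0.1,
    "1 seq per 5 s": 0.2,
    "4 seqs/s": 4.0,
    "10 seqs/s": 10.0,
}

for label, rate in rates.items():
    hours = load_time_hours(ASSUMED_ENTRIES, rate)
    print(f"{label}: {hours:.1f} h")
```

Under that assumed count, 4-10 seqs/second means a load that finishes within a working day, while one sequence every 5 to 10 seconds stretches into days or weeks.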
So, my conclusion is that what you most likely saw happening is
bioperl-db being slow at the start because nothing is cached yet
(species lookups are expensive if not cached), and then not getting
any faster because memory and disk I/O contention kept compounding
the problem.
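The cache trade-off described here can be illustrated with a minimal memoizing lookup. This is a generic Python sketch, not bioperl-db's actual code: `CachingLookup` and `fake_species_query` are hypothetical names, and the "query" is a stand-in for a database round trip.

```python
class CachingLookup:
    """Memoize an expensive lookup, e.g. a per-species database query."""

    def __init__(self, fetch):
        self._fetch = fetch   # the expensive function to call on a miss
        self._cache = {}      # grows by one entry per distinct key seen
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1      # cold cache: pay for the expensive lookup
        value = self._fetch(key)
        self._cache[key] = value
        return value

    def size(self):
        return len(self._cache)


def fake_species_query(name):
    # HYPOTHETICAL stand-in for a database lookup; returns a dummy record.
    return {"name": name}


lookup = CachingLookup(fake_species_query)
stream = ["Homo sapiens", "Mus musculus", "Homo sapiens",
          "Homo sapiens", "Rattus norvegicus"]
for species in stream:
    lookup.get(species)

print(lookup.hits, lookup.misses, lookup.size())
```

Every first occurrence of a key is a miss, so a source with several thousand distinct species pays the lookup cost thousands of times and keeps every result in memory. That is the pattern behind both the slow start and the growing memory footprint.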
If you really want this to fly, either get a really fast CPU, or better
yet, two CPUs, and at least two disks. If you want the loader to run on
the same machine as the db process, then get 3 disks if you can
(sequence source file on one, db transaction log on the second, db data
files on the third). And get no less than 1GB of RAM; if you want db
and loader on the same box get at least 2GB.
-hilmar
On Tuesday, May 11, 2004, at 10:27 AM, Henry R Bigelow wrote:
> Hi,
> my name is Henry Bigelow and I recently installed bioperl-1.4,
> bioperl-db, dbi and dbd-mysql, mysql-4.0 (with InnoDB enabled),
> biosql-schema, and instantiated biosqldb-mysql.sql. i've successfully
> loaded some sequences of release43.dat, the swissprot flat file, but
> the throughput is roughly 1 sequence every 5 to 10 seconds, on a
> (admittedly slow) 400 MHz 2 CPU Pentium III with 256 MB memory. I ran
> the command:
>
> perl load_seqdatabase.pl --host localhost --dbname bioseqdb \
>   --namespace swissprot --dbuser bigelow --dbpass XXX --driver mysql \
>   --format swiss /data/swissprot/release43.dat
>
>
> I also ran it (on a set of 15 swissprot entries) with a profiler:
>
> perl -d:DProf load_seqdatabase.pl ...
> then with
> dprofpp -u
> i got this:
>
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
>  9.62   0.800  0.985  15282   0.0001 0.0001 Bio::DB::Persistent::PersistentObject::isa
>  9.54   0.793  1.403  11909   0.0001 0.0001 Bio::DB::Persistent::PersistentObject::AUTOLOAD
>  9.25   0.769  3.152   8888   0.0001 0.0004 Bio::DB::BioSQL::BasePersistenceAdaptor::_create_persistent
>  4.69   0.390  2.922   7733   0.0001 0.0004 Bio::DB::BioSQL::BasePersistentAdaptor::_process_child
>  4.59   0.382  0.382  26865   0.0000 0.0000 Bio::DB::Persistent::PersistentObject::obj
>  3.84   0.319  0.319  32822   0.0000 0.0000 UNIVERSAL::isa
>  3.69   0.307  0.372     86   0.0036 0.0043 Bio::DB::BioSQL::ReferenceAdaptor::_crc64
>  3.28   0.273  1.195    258   0.0011 0.0046 Bio::Root::Root::_load_module
>  2.80   0.233  3.545   5465   0.0000 0.0006 Bio::DB::BioSQL::BasePersistenceAdaptor::create_persistent
>  2.74   0.228  0.228    291   0.0008 0.0008 Bio::Root::RootI::stack_trace
>  1.92   0.160  0.160   1794   0.0001 0.0001 DBI::st::execute
>  1.84   0.153  0.534   1608   0.0001 0.0003 Bio::DB::Persistent::PersistentObject::new
>  1.80   0.150  0.150   7215   0.0000 0.0000 Bio::DB::Persistent::PersistentObject::primary_key
>  1.74   0.145  0.185   2640   0.0001 0.0001 Bio::Root::Root::new
>  1.71   0.142  1.078    474   0.0003 0.0023 Bio::DB::BioSQL::BaseDriver::insert_object
>
> i do realize that these perl objects are large, but it still seems
> quite slow. (i'm not even sure whether the profiler demonstrates that
> the majority of time is spent instantiating perl objects as opposed to
> running mysql commands.)
>
> all bioperl-db, bioperl, dbi and dbd-mysql tests came out ok (the vast
> majority of them anyway).
>
> incidentally, it took me a week of getting errors during
> load_seqdatabase.pl loading, before i discovered the true cause: that
> a perl executable with threading enabled does NOT work with this. (The
> author of dbd-mysql or dbi warns about this, but i didn't heed the
> warning at first).
>
>
> if anyone has any ideas about what might be making it slow, please let
> me know! i'd greatly appreciate it.
>
> Sincerely,
>
> Henry Bigelow
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------