[BioSQL-l] Timing importing GenBank files into BioSQL

Hilmar Lapp hlapp at gmx.net
Tue Aug 19 18:17:36 UTC 2008


The timings do seem a bit on the long side, but these are also whole
genomes. The first interesting question is how much of that time is
spent in the BioPerl parser, and how much is spent loading the
sequence. For typical GenBank sequences, a rate of 10-20 seqs/sec
is in the expected range; depending on your hardware setup (and db
configuration) you may see slower or faster speeds.
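
To make the parser-versus-load split concrete, here is a rough sketch of
how the two phases could be timed separately on the Biopython side. The
helper name time_phases is made up for illustration, and the stand-in
callables at the bottom only exist so the sketch runs without a database.

```python
import time


def time_phases(parse, load):
    """Time parsing and loading separately.

    parse: zero-argument callable returning an iterable of records.
    load:  callable taking a list of records, returning how many it stored.
    Returns (count, parse_seconds, load_seconds).
    """
    t0 = time.perf_counter()
    records = list(parse())  # exhaust the (possibly lazy) parser up front
    t1 = time.perf_counter()
    count = load(records)
    t2 = time.perf_counter()
    return count, t1 - t0, t2 - t1


if __name__ == "__main__":
    # Stand-ins (hypothetical) so the sketch runs without a database.
    count, parse_s, load_s = time_phases(lambda: iter(range(3)), len)
    print("records=%d parse=%.4fs load=%.4fs" % (count, parse_s, load_s))
```

With the script from the quoted message, parse would presumably be
something like lambda: SeqIO.parse(filename, "genbank") and load would be
db.load. Note that this holds all parsed records in memory at once, which
is fine for a single genome but is an assumption of the sketch.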

You can get lots of output on what it is doing by passing --debug.
Under normal operating conditions, the printed lines should fly
past much faster than you can read them, and should
start doing so right after you get the line "Loading " followed by the
filename (before that it is opening the database connection). If
something stays on the screen long enough that you can read
(or copy&paste) it, it is probably a bottleneck.
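
If watching the screen is too imprecise, the same idea can be sketched as
a small filter that timestamps each line as it arrives and reports the
lines that were followed by a long pause. This is only an illustration:
the function names and the one-second threshold are made up, and it
makes no assumption about the format of the debug lines.

```python
import sys
import time


def stamp(stream):
    """Yield (arrival_time, line) for each line read from stream."""
    for line in stream:
        yield time.monotonic(), line.rstrip("\n")


def slow_lines(stamped, threshold=1.0):
    """Return (gap_seconds, line) for lines followed by a pause >= threshold.

    The pause after a line approximates how long the step it announces took.
    """
    slow = []
    prev = None
    for when, line in stamped:
        if prev is not None and when - prev[0] >= threshold:
            slow.append((when - prev[0], prev[1]))
        prev = (when, line)
    return slow


if __name__ == "__main__":
    # Reads until end of input, then prints the slow steps.
    for gap, line in slow_lines(stamp(sys.stdin)):
        print("%6.2fs  %s" % (gap, line))
```

Saved as, say, slow_lines.py (a hypothetical name), it could be used as
perl load_seqdatabase.pl --debug ... 2>&1 | python slow_lines.py. Output
buffering in the pipe can distort the gaps, so treat the numbers as a
rough guide only.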

Bioperl-db essentially works like an object-relational mapper, and  
hence loading data happens one object at a time. There are some speed  
optimizations, for example some objects (like dbxrefs) are always  
looked up first and inserted if not found, whereas others (like seqs  
or features) are inserted first and updated if that fails. These
optimizations assume a database that you are updating (which is
what one typically does 90% of the time), not a fresh load into an
empty db.
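
The two strategies can be sketched as follows. This is not bioperl-db
code; it is a minimal illustration using Python's sqlite3 module and two
toy tables (not the BioSQL schema), with all function names invented for
the example. Lookup-first pays for an extra SELECT whenever the row is
missing; insert-first pays for a failed INSERT whenever the row already
exists.

```python
import sqlite3


def lookup_then_insert(conn, accession):
    """Look the row up first; insert only if it is missing. Returns the row id."""
    row = conn.execute(
        "SELECT id FROM dbxref WHERE accession = ?", (accession,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO dbxref (accession) VALUES (?)", (accession,))
    return cur.lastrowid


def insert_then_update(conn, accession, description):
    """Try the insert first; fall back to an update on a uniqueness conflict."""
    try:
        conn.execute(
            "INSERT INTO entry (accession, description) VALUES (?, ?)",
            (accession, description),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        conn.execute(
            "UPDATE entry SET description = ? WHERE accession = ?",
            (description, accession),
        )
        return "updated"


def make_demo_db():
    """Create an in-memory database with two toy tables (not the BioSQL schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbxref (id INTEGER PRIMARY KEY, accession TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, description TEXT)"
    )
    return conn
```

Calling insert_then_update twice with the same accession returns
"inserted" and then "updated", which is the path every existing row takes
with that strategy.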

Finally, speed comparisons aren't particularly useful as
long as you don't know how similar (or different) the resulting data
content is, so I would start by comparing that.
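
A first, coarse way to compare the content is to count rows per table
after each load and look at the differences. The sketch below assumes a
DB-API 2.0 connection and uses the table names from the truncate command
in the quoted message; the helper names are made up. Equal counts do not
prove identical content, but unequal counts point at where to look.

```python
import sqlite3  # only needed for trying the sketch out locally

# Table names taken from the truncate command in the quoted message.
BIOSQL_TABLES = (
    "bioentry", "seqfeature", "bioentry_dbxref",
    "term", "ontology", "reference", "dbxref",
)


def table_counts(conn, tables=BIOSQL_TABLES):
    """Return {table: row count} using any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as parameters, so validate them.
        if not table.isidentifier():
            raise ValueError("unexpected table name: %r" % table)
        cur.execute("SELECT COUNT(*) FROM " + table)
        counts[table] = cur.fetchone()[0]
    return counts


def diff_counts(a, b):
    """Return {table: (count_a, count_b)} for tables whose counts differ."""
    tables = sorted(set(a) | set(b))
    return {t: (a.get(t), b.get(t)) for t in tables if a.get(t) != b.get(t)}
```

One would run table_counts once after the BioPerl load and once after the
Biopython load (truncating in between, as in the quoted message) and pass
both results to diff_counts.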

	-hilmar

On Aug 18, 2008, at 12:23 PM, Peter wrote:

> Hi,
>
> I've started trying to look at BioPerl and Biopython and how well they
> agree in writing GenBank files into BioSQL.  I've been using the
> BioPerl load_seqdatabase.pl script to import sample GenBank files, but
> I was a little surprised how long this takes to run for E. coli K12,
> NC_000913.gbk (about 10 minutes!).  I'm using E. coli K12, NC_000913.2
> from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
> and Nanoarchaeum equitans, NC_005213.1 from
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk
> as my example input files.
>
> Example timing using BioPerl, after emptying most (all?) of my MySQL
> test database:
>
> $ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
> table bioentry; truncate table seqfeature; truncate table
> bioentry_dbxref; truncate table term; truncate table ontology;
> truncate table reference; truncate table dbxref;"
>
> $ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
>     --dbname test_biosql --namespace test --format genbank --dbpass biosql \
>     --dbuser gbrowse Nanoarchaeum_equitans/NC_005213.gbk
> Loading Nanoarchaeum_equitans/NC_005213.gbk ...
>
> real	0m17.116s
> user	0m13.914s
> sys	0m2.293s
>
> $ time perl ~/Downloads/Software/bioperl-db-1.5.2_100/scripts/biosql/load_seqdatabase.pl \
>     --dbname test_biosql --namespace test --format genbank --dbpass biosql \
>     --dbuser gbrowse Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
> Loading Escherichia_coli_K12_substr__MG1655/NC_000913.gbk ...
>
> real	10m0.784s
> user	6m23.898s
> sys	3m26.189s
>
> This does seem a rather unreasonable length of time (and I've repeated
> this over three times).  Is this normal?  I know this may not be a
> fair comparison, but this is what Biopython takes (code at end of
> email):
>
> $ mysql --user="gbrowse" --pass="biosql" test_biosql -e "truncate
> table bioentry; truncate table seqfeature; truncate table
> bioentry_dbxref; truncate table term; truncate table ontology;
> truncate table reference; truncate table dbxref;"
>
> $ time python load.py
> Importing Nanoarchaeum_equitans/NC_005213.gbk
> Loaded 1 records
> Took 5.32s including the commit
> Importing Escherichia_coli_K12_substr__MG1655/NC_000913.gbk
> Loaded 1 records
> Took 64.15s including the commit
>
> real	1m10.037s
> user	0m31.942s
> sys	0m6.913s
>
> I'm wondering if the BioPerl time is typical (I hope not), and if
> there are any computationally intensive or otherwise slow things it
> does that Biopython might be skipping (checksums? fetching taxonomy?)
>
> Thanks
>
> Peter
>
> ---------------------------------------------------------------------
> The contents of my load.py script:
>
> import time
> from Bio import SeqIO
> from BioSQL import BioSeqDatabase
>
> # Connect to the local MySQL test database.
> server = BioSeqDatabase.open_database(driver="MySQLdb", user="gbrowse",
>                                       passwd="biosql", host="localhost",
>                                       db="test_biosql")
>
> # Select the "test" namespace.
> db = server["test"]
>
> # Load the small genome and time it, including the commit.
> start = time.time()
> filename = "Nanoarchaeum_equitans/NC_005213.gbk"
> print("Importing %s" % filename)
> records = SeqIO.parse(open(filename), "genbank")
> print("Loaded %i records" % db.load(records))
> server.adaptor.commit()
> print("Took %0.2fs including the commit" % (time.time()-start))
>
> # Same again for the larger E. coli genome.
> start = time.time()
> filename = "Escherichia_coli_K12_substr__MG1655/NC_000913.gbk"
> print("Importing %s" % filename)
> records = SeqIO.parse(open(filename), "genbank")
> print("Loaded %i records" % db.load(records))
> server.adaptor.commit()
> print("Took %0.2fs including the commit" % (time.time()-start))
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================





