[BioSQL-l] Timing importing GenBank files into BioSQL

Peter biopython at maubp.freeserve.co.uk
Mon Aug 18 17:05:37 UTC 2008


On Mon, Aug 18, 2008 at 5:33 PM, Nick Loman <n.j.loman at bham.ac.uk> wrote:
> Peter wrote:
>
>> I'm wondering if the BioPerl time is typical (I hope not), and if
>> there are any computationally intensive or otherwise slow things it
>> does that BioPython might be skipping (checksums? fetching taxonomy?)
>
> I also found that BioPython was faster than BioPerl at importing the same
> GenBank file.

It is reassuring that you also saw a difference - do you recall how
big a difference this was on your setup?  The factor of ten I am
seeing is rather surprising.

> There are some differences in the handling of certain tables, the dbxref
> table springs to mind. It is worth doing a dump of the database after
> importing each file using the two different methods and comparing the
> results. The differences may not be significant for you depending on your
> application.

I am hoping to bring Biopython into closer agreement with BioPerl (and
thus also BioJava) in its use of BioSQL.  If you have already made
notes on any observed differences, that could be very useful.
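
As a rough sketch of the comparison Nick describes, something like the
following could report which tables end up with different row counts
after loading the same file by two routes.  It assumes SQLite databases
reached through Python's standard sqlite3 module purely so that it is
self-contained; the function names are made up for illustration, and
comparing row counts is only a first pass, not a full dump comparison.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table name: row count} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in tables
    }


def diff_row_counts(conn_a, conn_b):
    """Return {table: (count in A, count in B)} for tables whose counts differ.

    A table missing from one database is reported with a count of None.
    """
    counts_a = table_row_counts(conn_a)
    counts_b = table_row_counts(conn_b)
    return {
        name: (counts_a.get(name), counts_b.get(name))
        for name in sorted(set(counts_a) | set(counts_b))
        if counts_a.get(name) != counts_b.get(name)
    }
```

Calling diff_row_counts() on one connection per import would then point
at tables such as dbxref that deserve a closer row-by-row look.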

> I suspect the difference in speed you find is related to the number of
> object lookups done in BioPerl which is significantly more than in
> BioPython. You can specify --flatlookup to load_seqdatabase.pl which reduces
> the number of lookups.

Reading the help output from the load_seqdatabase.pl script, --lookup
and --flatlookup seem to be related to speeding up updating existing
records (whereas in my test, I am trying to start with an empty
database each time).  I tried it anyway, and it seems to make no
difference for this example.  But thanks for the suggestion, it's one
thing ruled out at least.

> You could enable DBI_TRACE to get a log of SQL statements for BioPerl.

That could help track down some differences, both in what gets written
and how it gets written.  I am hoping to avoid using too much Perl,
otherwise I'm sure profiling load_seqdatabase.pl could be informative
too.

> For my purposes, I found both Bioperl and Biopython to be a bit slow, so I
> devised a batch import script which speeds things up quite dramatically by
> eliminating most object lookups, and applying the foreign-key constraints
> post-importing.

This was your "BioSQL BatchLoader" code for PostgreSQL?  I remember
the impressive speed up you got, at the expense of a much modified
setup.
http://portal.open-bio.org/pipermail/biopython-dev/2008-April/003618.html
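
To illustrate the general idea of deferring foreign-key checks until
after a bulk load, here is a minimal sketch.  It is not the BatchLoader
code: it uses SQLite through Python's standard sqlite3 module so that it
runs on its own, and the parent/child tables and the bulk_load name are
hypothetical stand-ins rather than the BioSQL schema.

```python
import sqlite3

# Hypothetical two-table schema; child.parent_id refers to parent.id.
SCHEMA = """
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parent(id)
);
"""


def bulk_load(conn, parents, children):
    """Insert all rows without foreign-key enforcement, then check afterwards.

    Returns the rows reported by PRAGMA foreign_key_check, one per
    violation; an empty list means every reference resolved.
    """
    # Must be issued outside a transaction to take effect.
    conn.execute("PRAGMA foreign_keys = OFF")
    with conn:  # one transaction for the whole batch, committed on exit
        conn.executemany("INSERT INTO parent (id, name) VALUES (?, ?)", parents)
        conn.executemany(
            "INSERT INTO child (id, parent_id) VALUES (?, ?)", children
        )
    # Re-enable enforcement for later statements and validate what was loaded.
    conn.execute("PRAGMA foreign_keys = ON")
    return conn.execute("PRAGMA foreign_key_check").fetchall()
```

The trade-off is the one noted above: the load does no per-row
checking, so any dangling references only show up in the final check.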

Peter


