[Biopython] Legacy blastn XML outfile parsing is slow. What XML parser is actually used?
Peter Cock
p.j.a.cock at googlemail.com
Tue Sep 25 13:32:07 UTC 2012
On Tue, Sep 25, 2012 at 1:26 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> OK - so there is at least one person making heaving use of the
> NCBIXML so we shouldn't rush to deprecate it after merging
> SearchIO, and there *is* some benefit from making it faster
> (but with the same API).
>
> In principle NCBIXML would be rewritten to use cElementTree
> /ElementTree and preserve the API - if you or anyone else want
> to do that (and the unit tests still pass), then I'm happy to review
> such changes. Likewise for less dramatic optimisations.
Martin emailed me to ask about this bit of the code, and it
can be sped up - this shows about a 5% reduction:
https://github.com/biopython/biopython/commit/970364761982bf331c221b6f007e8b8e52fa9600
Summary parsing a 286MB XML file from BLASTX 2.2.26+
for 1000 genes against the NR database.
NCBIXML before change: About 162s
NCBIXML after change: About 154s
NCBIXML removing debug: About 152s
Using SearchIO: About 79s
This is probably the same test file Bow gave numbers for earlier,
although it seems SearchIO has less of an advantage on my
machine (about x2) compared to Bow's machine (almost x5).
(We should check memory usage too...)
Peter
---------------------------------------------
The full details,
Before this change:
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s
real 2m41.894s
user 2m41.208s
sys 0m0.675s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 161.8s
real 2m41.984s
user 2m41.296s
sys 0m0.677s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 162.6s
real 2m42.771s
user 2m41.995s
sys 0m0.763s
With this change:
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.4s
real 2m32.582s
user 2m31.910s
sys 0m0.663s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.5s
real 2m33.680s
user 2m32.977s
sys 0m0.695s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 153.8s
real 2m33.931s
user 2m33.258s
sys 0m0.661s
And if we go further and remove _debug_ignore_list and
this bit of debug code the saving is marginal:
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.5s
real 2m31.611s
user 2m30.934s
sys 0m0.665s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 151.2s
real 2m31.348s
user 2m30.664s
sys 0m0.674s
$ time python time_ncbixml.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 152.9s
real 2m32.994s
user 2m32.314s
sys 0m0.669s
This is the timing script I used,
$ more /tmp/time_ncbixml.py
import sys
import time
from Bio.Blast import NCBIXML
for f in sys.argv[1:]:
start = time.time()
count = 0
handle = open(f)
for record in NCBIXML.parse(handle):
count += 1
handle.close()
print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file
For comparison, here is the timing on the same setup but using
SearchIO from Bow's current branch:
$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.1s
real 1m19.259s
user 1m18.397s
sys 0m0.799s
$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 78.7s
real 1m18.878s
user 1m18.149s
sys 0m0.719s
$ time python time_searchio.py thousand_blastx_nr.xml
1000 records in thousand_blastx_nr.xml in 79.5s
real 1m19.611s
user 1m18.683s
sys 0m0.918s
And the script:
$ more /tmp/time_searchio.py
import sys
import time
from Bio import SearchIO
for f in sys.argv[1:]:
start = time.time()
count = 0
handle = open(f)
for record in SearchIO.parse(handle, "blast-xml"):
count += 1
handle.close()
print "%i records in %s in %0.1fs" % (count, f, time.time() - start)
#End of file
More information about the Biopython
mailing list