[Biopython-dev] [Bug 1747] GenBank parser is very slow and memory hungry for large input files

bugzilla-daemon at portal.open-bio.org
Wed Mar 9 05:32:39 EST 2005


http://bugzilla.open-bio.org/show_bug.cgi?id=1747





------- Additional Comments From biopython-bugzilla at maubp.freeserve.co.uk  2005-03-09 05:32 -------
The following timing and memory usage figures were measured on
Windows 2000 with Python 2.3.3, using the GenBank Iterator (script
run from IDLE). The computer was a 2.26 GHz Intel Pentium 4 with
735 MB RAM:-

BioPython 1.30 using Martel:-
NC_003065.gbk     480 kb,   4 seconds,  28 MB RAM
NC_003064.gbk   1,217 kb,  11 seconds,  56 MB RAM
NC_000854.gbk   3,391 kb,  45 seconds, 165 MB RAM
NC_003063.gbk   4,725 kb,  55 seconds, 195 MB RAM
NC_003062.gbk   6,574 kb,  88 seconds, 268 MB RAM
NC_005966.gbk   8,858 kb, 139 seconds, 372 MB RAM
NC_000913.gbk  10,267 kb, 171 seconds, 409 MB RAM
NC_000962.gbk  11,010 kb, 200 seconds, 486 MB RAM
NC_003997.gbk  12,026 kb, 228 seconds, 496 MB RAM
NC_002678.gbk  15,120 kb, 306 seconds, 586 MB RAM
NC_005027.gbk  18,211 kb, not enough RAM
NC_004463.gbk  19,500 kb, not enough RAM
NC_003888.gbk  24,390 kb, not enough RAM
NC_004354.gbk  33,139 kb, not enough RAM
NC_003074.gbk  42,281 kb, not enough RAM
NC_003070.gbk  55,149 kb, not enough RAM

BioPython 1.30 with this patch:-
NC_003065.gbk     480 kb,   1 seconds,  13 MB RAM
NC_003064.gbk   1,217 kb,   4 seconds,  16 MB RAM
NC_000854.gbk   3,391 kb,  16 seconds,  25 MB RAM
NC_003063.gbk   4,725 kb,  17 seconds,  26 MB RAM
NC_003062.gbk   6,574 kb,  27 seconds,  33 MB RAM
NC_005966.gbk   8,858 kb,  33 seconds,  40 MB RAM
NC_000913.gbk  10,267 kb,  43 seconds,  45 MB RAM
NC_000962.gbk  11,010 kb,  41 seconds,  45 MB RAM
NC_003997.gbk  12,026 kb,  55 seconds,  52 MB RAM
NC_002678.gbk  15,120 kb,  71 seconds,  61 MB RAM
NC_005027.gbk  18,211 kb,  88 seconds,  68 MB RAM
NC_004463.gbk  19,500 kb,  95 seconds,  74 MB RAM
NC_003888.gbk  24,390 kb, 146 seconds,  95 MB RAM
NC_004354.gbk  33,139 kb, 156 seconds, 121 MB RAM
NC_003074.gbk  42,281 kb, 302 seconds, 193 MB RAM
NC_003070.gbk  55,149 kb, 436 seconds, 250 MB RAM

The last three (really big) files are from Drosophila and
Arabidopsis; the rest are bacteria.

Times were recorded by the test script; memory usage was read off
by hand from Task Manager.
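
For anyone wanting to repeat this kind of measurement, here is a
minimal sketch of a timing harness. It is not the test script used
for the figures above, and it assumes a modern Python 3 rather than
2.3.3. The names iter_records and benchmark are made up for
illustration, and iter_records is only a stand-in for a real parser:
it splits the input on the "//" line that ends each GenBank record.
Note that tracemalloc only counts allocations made through Python, so
its peak value is not directly comparable to the Task Manager numbers.

```python
import io
import time
import tracemalloc


def iter_records(handle):
    """Yield one GenBank record at a time as a list of lines.

    Hypothetical stand-in for a real parser: it only splits the
    input on the '//' line that terminates each record.
    """
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield lines
            lines = []
    if any(item.strip() for item in lines):
        yield lines  # trailing record with no terminator line


def benchmark(handle):
    """Return (record_count, seconds, peak_bytes) for one pass."""
    tracemalloc.start()
    start = time.perf_counter()
    count = sum(1 for _ in iter_records(handle))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, elapsed, peak


# Tiny in-memory example with two records; a real run would pass
# an open .gbk file instead.
sample = io.StringIO(
    "LOCUS       A\nORIGIN\n//\n"
    "LOCUS       B\nORIGIN\n//\n"
)
count, seconds, peak = benchmark(sample)
print(count, "records,", round(seconds, 4), "seconds,", peak, "bytes peak")
```

Because the records are consumed one at a time and then dropped, the
peak reported here stays small however long the input is, which is
the same reason an iterator is the sensible thing to benchmark.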

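In summary, with the patch parsing is about four times faster
overall (roughly three to five times per file), and for the larger
files it uses about a tenth of the memory. The six largest files,
which previously ran out of RAM, now parse as well - quite an
improvement.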
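
The summary can be checked against the two tables with a little
arithmetic. The sketch below just copies the figures for the ten
files that both versions managed to parse (the variable names are
mine) and works out the overall speed-up and the per-file ratios.

```python
# (file, seconds with Martel, MB with Martel,
#        seconds with patch,  MB with patch)
# Only the ten files that both versions could parse are listed.
rows = [
    ("NC_003065", 4, 28, 1, 13),
    ("NC_003064", 11, 56, 4, 16),
    ("NC_000854", 45, 165, 16, 25),
    ("NC_003063", 55, 195, 17, 26),
    ("NC_003062", 88, 268, 27, 33),
    ("NC_005966", 139, 372, 33, 40),
    ("NC_000913", 171, 409, 43, 45),
    ("NC_000962", 200, 486, 41, 45),
    ("NC_003997", 228, 496, 55, 52),
    ("NC_002678", 306, 586, 71, 61),
]

total_martel = sum(r[1] for r in rows)
total_patched = sum(r[3] for r in rows)
overall_speedup = total_martel / total_patched

# Per-file ratios: how many times faster, and what fraction of
# the original memory the patched parser needs.
speedups = [r[1] / r[3] for r in rows]
memory_fractions = [r[4] / r[2] for r in rows]

print("overall speed-up: %.2f" % overall_speedup)
print("per-file speed-up: %.2f to %.2f" % (min(speedups), max(speedups)))
print("memory fraction: %.2f to %.2f"
      % (min(memory_fractions), max(memory_fractions)))
```

The memory saving grows with file size, so the smallest files show
much less than a tenfold reduction.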

The implementation details of this approach could be improved;
I have had some thoughts about this overnight.


