[Biopython-dev] Python 3 and Bio.SeqIO.index()

Peter biopython at maubp.freeserve.co.uk
Thu Jul 15 13:31:29 UTC 2010


On Wed, Jul 14, 2010 at 7:38 PM, Vince S. Buffalo <vsbuffalo at gmail.com> wrote:
> There's a slight (perhaps non-significant) speed up on OS X. In Python 2.7
> on OS X 10.5.8:
> vinceb$ python index2.py s_7_1_sequence.fasta
> s_7_1_sequence.fasta
> Indexed in 32.35s
> vinceb$ python index2b.py s_7_1_sequence.fasta
> s_7_1_sequence.fasta
> Indexed in 26.01s
> best,
> Vince

I don't have Python 3 on my Mac yet, so I've tried things out under Linux.

7 million entry FASTA file with Unix line endings (LF), on Linux:

python2.7 index2.py SRR001666_1.lf.fasta - 19s
python2.7 index2b.py SRR001666_1.lf.fasta - 19s
python3.1 index3.py SRR001666_1.lf.fasta - Over an hour (I killed it)
python3.1 index3b.py SRR001666_1.lf.fasta - 29s

Again, I gave up on the Python 3 plain text unicode string version.

7 million entry FASTA file with DOS line endings (CR LF), on Linux:

python2.7 index2.py SRR001666_1.crlf.fasta - 19 or 20s
python2.7 index2b.py SRR001666_1.crlf.fasta - 19 or 20s
python3.1 index3.py SRR001666_1.crlf.fasta - not tested
python3.1 index3b.py SRR001666_1.crlf.fasta - 29s

Interestingly, the line endings make almost no difference to the timings.

On this machine the python3.1 bytes version (29s) is slower than
either of the Python 2.7 versions (19s). This may be down to compiler
options or something (I compiled Python 3.1 myself with the defaults).
Recall that on the Windows machine, Python 3.1 (binary mode) was
faster than Python 2.7 (binary mode or universal new lines mode).

Regarding possible speed ups under Python 2 by avoiding universal
new lines mode: as you can see above, on this Linux Python 2.7 setup
the timings for index2.py and index2b.py are practically equal (~19s),
unlike on the Windows machine where this did seem to help.
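
For anyone wanting to repeat this kind of comparison, here is a rough
sketch of a timing loop. It is not the actual index2.py or index2b.py
script; the function name time_line_scan and the readline()/tell()
loop are assumptions made for illustration only.

```python
import time


def time_line_scan(path, mode):
    """Scan a file line by line, noting offsets, and time it.

    Hypothetical helper: 'mode' is passed straight to open(),
    e.g. "r" for text mode or "rb" for binary mode.
    Returns (number of ">" header lines, elapsed seconds).
    """
    # Text mode yields str lines, binary mode yields bytes lines
    marker = b">" if "b" in mode else ">"
    count = 0
    start = time.time()
    with open(path, mode) as handle:
        while True:
            offset = handle.tell()  # where this line starts
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(marker):
                count += 1
    return count, time.time() - start
```

Calling this once with "r" and once with "rb" on the same file should
give the same record count, with only the elapsed time differing.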

I think the clear message (from both Windows and Linux) is that for
Bio.SeqIO.index() to perform at a tolerable speed on Python 3, we
can't use the default text mode with unicode strings; we are going
to have to use binary mode with bytes.
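
As a minimal sketch of what binary mode with bytes could look like
(this is not the Bio.SeqIO.index() implementation nor the index3b.py
script; index_fasta_offsets is a made-up name, and treating the first
word after ">" as an ASCII identifier is an assumption):

```python
def index_fasta_offsets(path):
    """Map each FASTA record identifier to its byte offset.

    Hypothetical illustration: the file is opened in binary mode,
    so every line is a bytes object and tell() reports plain byte
    positions, whatever the line endings are.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # start of the line about to be read
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # First whitespace-separated word after ">" is the key;
                # split() also discards a trailing LF or CR LF.
                fields = line[1:].split(None, 1)
                if fields:
                    # Assumes ASCII identifiers
                    offsets[fields[0].decode("ascii")] = offset
    return offsets
```

A record can then be fetched later by reopening the file in binary
mode, calling seek() with the stored offset, and reading up to the
next line that starts with b">".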

Peter



