[Biopython-dev] Python 3 subprocess bytes vs unicode

Peter biopython at maubp.freeserve.co.uk
Fri Jul 9 09:40:10 UTC 2010


Hi all,

Many of the unit tests that fail on Python 3.1 after running 2to3
are the ones that call external command line tools. Interestingly,
in Py3k sys.stdin, sys.stdout and sys.stderr are in text mode by
default - they automatically give you unicode strings instead of
the raw bytes. This makes sense to me (and you can get at the
bytes if you want them):
http://docs.python.org/py3k/library/sys.html

However, the stdin, stdout and stderr of any child process
created with subprocess default to binary mode, and so return
or expect bytes - not unicode strings:
http://docs.python.org/py3k/library/subprocess.html
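
A minimal sketch of the default behaviour, for illustration only. It
uses sys.executable as a stand-in for an external tool (an assumption
made so the example runs without any third-party program installed):

```python
import subprocess
import sys

# Launch a child Python interpreter that prints a single line.
# sys.executable stands in for a real external command line tool.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
)
raw_output, _ = child.communicate()

# On Python 3 the pipe is in binary mode by default, so raw_output
# is a bytes object and must be decoded before use as text.
print(type(raw_output))
```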

It looks like we'll want to use universal_newlines=True when
calling subprocess so that we can treat subprocess handles
as text mode (i.e. unicode strings not bytes). This option is
also present on Python 2, where it just controls the automatic
handling of new line characters - so it should be harmless (or
even a good idea).
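
As a rough sketch of what this looks like in practice (again using
sys.executable as a placeholder for an external tool, which is an
assumption for the sake of a runnable example):

```python
import subprocess
import sys

# Same child process as before, but with universal_newlines=True so
# that the pipe is opened in text mode.
child = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
text_output, _ = child.communicate()

# On Python 3 this is now a (unicode) str rather than bytes, and
# platform line endings have been translated to "\n".
print(type(text_output))
```

No explicit decode call is needed by the caller; the text is decoded
using the platform's default encoding.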

This seems like a more elegant option than adding lots
of encode/decode calls when doing IO with child processes
(which I think Tiago has tried).

Peter

P.S. if we make our command line wrappers callable (or
add some kind of run method) as previously discussed, it
can set this option when calling subprocess.
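
One possible shape for such a callable wrapper is sketched below. The
class name ExampleCommandline and its arguments are hypothetical and
are not taken from the existing Biopython wrappers; the child command
is a Python one-liner standing in for a real tool:

```python
import subprocess
import sys


class ExampleCommandline(object):
    """Hypothetical callable wrapper (illustrative, not Biopython's API)."""

    def __init__(self, *args):
        # The program name followed by its command line arguments.
        self.args = list(args)

    def __call__(self, stdin=None):
        # The wrapper sets universal_newlines=True in one place, so
        # callers always exchange text (not bytes) with the child.
        child = subprocess.Popen(
            self.args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        stdout_text, stderr_text = child.communicate(stdin)
        return stdout_text, stderr_text


# Stand-in "tool": upper-cases whatever it reads from stdin.
cmd = ExampleCommandline(
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
)
out, err = cmd("acgt\n")
```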


