[Biopython-dev] Python 3 and encoding for online resources

Tue Jul 27 13:23:27 UTC 2010

Hi all,

One of the remaining (pure Python) problems with Biopython
under Python 3 relates to parsing online resources like the
NCBI Entrez API or even Bio.ExPASy.get_sprot_raw().
See for example test_SeqIO_online.py for a failure.

In Python 2, urlopen from urllib or urllib2 would give a
string handle. In Python 3, you get a bytes handle (not
a unicode handle, and choosing the encoding is tricky):
http://docs.python.org/py3k/library/urllib.request.html

In the case of resources like the NCBI and ExPASy we
should be able to assume an encoding (maybe UTF-8
or Latin-1) for all the plain text output, while for XML/HTML
there are ways for the data to specify this itself.
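
To make the "assume a default, unless the data says otherwise" idea
concrete, here is a rough sketch. The function name guess_encoding and
the UTF-8 default are my own placeholders, not anything in Biopython,
and it only looks at an XML declaration in an ASCII-compatible encoding
(no BOM or UTF-16 handling, no HTML meta tags):

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?>
# at the very start of the data. Hypothetical helper, ASCII-compatible input only.
_XML_DECL = re.compile(br'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def guess_encoding(head, default="utf-8"):
    """Return the encoding named in an XML declaration, else the default.

    head is the first chunk of bytes read from the resource.
    """
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    return default


xml_head = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
plain_head = b"ID   TEST_HUMAN\n"
print(guess_encoding(xml_head))
print(guess_encoding(plain_head))
```

For plain text output there is nothing to sniff, so the default is the
assumption we would have to pick per resource.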

I think we may need to transform the urllib bytes handle into
a unicode string handle for parsing. One option would be to
extend the Bio.File.UndoHandle class (or invent a subclass)
which applies the decoding. This seems simple since
Bio.Entrez and Bio.ExPASy already use this class.
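
As a rough illustration of the decoding step (not of UndoHandle
itself), the standard library's io.TextIOWrapper can wrap a binary
handle so that reads return unicode strings. The helper name
decode_handle and the UTF-8 default are assumptions of mine, and an
io.BytesIO object stands in for the network handle so the sketch runs
offline:

```python
import io


def decode_handle(byte_handle, encoding="utf-8"):
    """Wrap a binary handle so reads return str instead of bytes.

    Hypothetical helper; the encoding default is an assumption.
    """
    return io.TextIOWrapper(byte_handle, encoding=encoding)


# Stand-in for the bytes handle that urlopen would return under Python 3.
fake_response = io.BytesIO(b"ID   TEST_HUMAN\nAC   P12345;\n")

text_handle = decode_handle(fake_response)
first_line = text_handle.readline()
print(type(first_line).__name__, repr(first_line))
```

A decoding UndoHandle would do much the same thing internally, with
the undo buffer holding already-decoded lines.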

Another option (which I suggested on the Bio.SeqIO.index()
thread [1]) would be to extend our parsers to cope with both
byte and unicode handles. That could be a lot of work though...
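
For comparison, a minimal sketch of what "coping with both" might look
like at the line-reading level. The generator name iter_text_lines is
made up for illustration, and a real parser would need this kind of
check wherever it touches the handle, which is where the work comes in:

```python
import io


def iter_text_lines(handle, encoding="utf-8"):
    """Yield str lines from either a bytes handle or a unicode handle.

    Hypothetical helper; the encoding default is an assumption.
    """
    for line in handle:
        if isinstance(line, bytes):
            # Bytes handle (e.g. from the network): decode each line.
            line = line.decode(encoding)
        yield line


from_bytes = list(iter_text_lines(io.BytesIO(b">seq1\nACGT\n")))
from_text = list(iter_text_lines(io.StringIO(">seq1\nACGT\n")))
print(from_bytes == from_text)
```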

Thoughts?

Peter

[1] http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html