[Biopython-dev] Python 3 and encoding for online resources

Sun Aug 1 17:54:03 UTC 2010

On Sun, Aug 1, 2010 at 4:14 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> According to this post:
>
> http://stackoverflow.com/questions/1179305/expat-parsing-in-python-3
>
> we need only one parser which always parses a byte stream.
> Bio.Entrez uses File.UndoHandle but just to look for potential
> errors in the first few lines when opening the Entrez url, which
> in my opinion we shouldn't be doing anyway since it's the
> parser's job to decide whether the input is well-formed.
> So I'd suggest to not use File.UndoHandle (at all), ...

I disagree. The NCBI return multiple different file formats, so
there are multiple different parsers that may get an error page.
Given the NCBI return HTML error pages regardless of what
format the request was (XML, plain text, etc), I think we
have to look for errors before giving the data to the parser.
But that can be done using byte strings just as easily as with
unicode strings.
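
To illustrate, here is a minimal sketch of checking for an HTML
error page on a byte stream before the parser sees it. This is
not the existing File.UndoHandle code; the names PeekableBytes
and raise_if_html_error are made up for this example, and the
test for "<html" is only a guess at what an error page starts
with:

```python
import io


class PeekableBytes(object):
    """Hypothetical wrapper: look at the start of a byte stream
    without losing those bytes for the parser that reads it later."""

    def __init__(self, handle):
        self._handle = handle
        self._buffer = b""

    def peek(self, size):
        # Top up the buffer from the underlying handle if needed.
        missing = size - len(self._buffer)
        if missing > 0:
            self._buffer += self._handle.read(missing)
        return self._buffer[:size]

    def read(self, size=-1):
        if size is None or size < 0:
            data = self._buffer + self._handle.read()
            self._buffer = b""
            return data
        # Serve buffered bytes first, then the underlying handle.
        data = self._buffer[:size]
        self._buffer = self._buffer[size:]
        if len(data) < size:
            data += self._handle.read(size - len(data))
        return data


def raise_if_html_error(handle, peek_size=200):
    """Return a wrapped handle, or raise ValueError if the data
    looks like an HTML page (assumed marker: a leading <html tag
    or an HTML doctype)."""
    wrapped = PeekableBytes(handle)
    start = wrapped.peek(peek_size).lstrip().lower()
    if start.startswith(b"<html") or start.startswith(b"<!doctype html"):
        raise ValueError("Server returned an HTML page, not the requested data")
    return wrapped
```

The comparison is done entirely on bytes, so nothing has to be
decoded before the format-specific parser gets the handle.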

> make sure our parser works with Python 3 byte streams, and
> ask users to open any downloaded Entrez XML files in binary
> mode.

That sounds workable.
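
As a rough sketch of what this means on Python 3: expat's
ParseFile wants a handle whose read() returns bytes, so a
file opened in binary mode (or a BytesIO) works directly. The
function name and the sample XML below are made up for
illustration and are not taken from Bio.Entrez:

```python
import io
from xml.parsers import expat


def collect_element_names(handle):
    """Parse XML from a binary handle, returning element names in order."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    # ParseFile calls handle.read(), which must return bytes on Python 3.
    parser.ParseFile(handle)
    return names


# Made-up sample data standing in for a downloaded XML file.
data = (b'<?xml version="1.0" encoding="UTF-8"?>'
        b"<Result><List><Name>example</Name></List></Result>")
names = collect_element_names(io.BytesIO(data))
```

For a file on disk the equivalent would be to pass in
open(filename, "rb") rather than a text mode handle.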

> Is there a Biopython version (in trunk or otherwise) that is ready
> for Python 3? If so, I can have a look at the parser to see if it
> handles byte streams correctly.

The trunk itself -- after running 2to3 on it (as described in the
README file). Or if you just want to grab some code for a quick
play, I have a branch where I've been doing this on a semi-regular
basis:

http://github.com/peterjc/biopython/tree/auto2to3

Note that we are keeping the trunk as Python 2 code, which
can make life interesting (another option would be a Python
3 branch, but we'd then need to keep things in sync manually).
To make life a little easier, we are probably going to need some
Python 3 compatibility functions (like bytes as unicode, unicode
as bytes - see the NumPy project for other possible examples).
At the moment we are handling this on a module by module basis.
Here I'm thinking specifically of some of the things required in
Bio/SeqIO/SffIO.py, but there are other Python 3 hacks we may
want to standardise.
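
By compatibility functions I mean something along these lines.
This is only a sketch: the names as_bytes and as_unicode and
the ASCII default are assumptions for the example, not what
SffIO.py currently does:

```python
# On Python 2 this is the unicode type, on Python 3 it is str.
text_type = type(u"")


def as_bytes(s, encoding="ascii"):
    """Return s as a byte string, encoding it if it is text."""
    if isinstance(s, bytes):
        return s
    return s.encode(encoding)


def as_unicode(b, encoding="ascii"):
    """Return b as a unicode string, decoding it if it is bytes."""
    if isinstance(b, text_type):
        return b
    return b.decode(encoding)
```

A module that reads binary data could then call as_unicode()
on the fields it exposes as text, and as_bytes() on anything
it compares against the raw file contents, without caring
which Python version it is running under.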

For the C code (which we haven't looked at yet; setup.py is
ignoring the extensions on Python 3 for now) we should be
able to use the normal #ifdef approach. Again, we can learn
a lot from looking at NumPy here.

Peter