[Biopython-dev] Python 3 and encoding for online resources

Sun Aug 1 15:14:23 UTC 2010

According to this post:

http://stackoverflow.com/questions/1179305/expat-parsing-in-python-3

we need only one parser, and it should always parse a byte stream. Bio.Entrez uses File.UndoHandle, but only to look for potential errors in the first few lines when opening the Entrez URL. In my opinion we shouldn't be doing that anyway, since it's the parser's job to decide whether the input is well-formed. So I'd suggest not using File.UndoHandle at all, making sure our parser works with Python 3 byte streams, and asking users to open any downloaded Entrez XML files in binary mode. Is there a Biopython version (in trunk or otherwise) that is ready for Python 3? If so, I can have a look at the parser to see if it handles byte streams correctly.
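
As a rough illustration of the byte-stream approach, here is a minimal sketch using the standard library's expat bindings directly. It is not the Bio.Entrez parser: the helper name parse_xml_bytes and the sample XML are made up for the example. The point is only that expat reads bytes and picks up the encoding from the XML declaration itself, so no decoding is needed beforehand.

```python
import io
import xml.parsers.expat


def parse_xml_bytes(handle):
    """Return the element names found in an XML byte stream.

    Hypothetical helper for illustration; `handle` must be a binary
    handle, i.e. handle.read() returns bytes.
    """
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Record each start tag; a real parser would build records here.
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    # ParseFile reads bytes from the handle; expat takes the encoding
    # from the XML declaration rather than from the caller.
    parser.ParseFile(handle)
    return names


# Stand-in for open(filename, "rb") or a urlopen() response.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<Result><Item>example</Item></Result>"
)
print(parse_xml_bytes(sample))
```

With a downloaded file, the same function would be called on a handle opened in binary mode ("rb"), which is what the suggestion above asks of users.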

--Michiel.

--- On Tue, 7/27/10, Peter <biopython at maubp.freeserve.co.uk> wrote:

> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: [Biopython-dev] Python 3 and encoding for online resources
> To: "Biopython-Dev Mailing List" <biopython-dev at biopython.org>
> Date: Tuesday, July 27, 2010, 9:23 AM
> Hi all,
> 
> One of the remaining (pure Python) problems with Biopython
> under Python 3 relates to parsing online resources like the
> NCBI Entrez API or even Bio.ExPASy.get_sprot_raw().
> See for example test_SeqIO_online.py for a failure.
> 
> In Python 2, urlopen from urllib or urllib2 would give a
> string handle. In Python 3, you get a bytes handle (not
> a unicode handle, and choosing the encoding is tricky):
> http://docs.python.org/py3k/library/urllib.request.html
> 
> In the case of resources like the NCBI and ExPASy we
> should be able to assume an encoding (maybe UTF-8
> or Latin) for all the plain text output, while for XML/HTML
> there are ways for the data to specify this itself.
> 
> I think we may need to transform the urllib bytes handle into
> a unicode string handle for parsing. One option would be to
> extend the Bio.File.UndoHandle class (or invent a subclass)
> which applies the decoding. This seems simple since
> Bio.Entrez and Bio.ExPASy already use this class.
> 
> Another option (which I suggested on the Bio.SeqIO.index()
> thread [1]) would be to extend our parsers to cope with both
> byte and unicode handles. That could be a lot of work
> though...
> 
> Thoughts?
> 
> Peter
> 
> [1] http://lists.open-bio.org/pipermail/biopython-dev/2010-July/008004.html
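
For comparison, the decoding option raised in the quoted message (turning the bytes handle into a unicode text handle before parsing) could be sketched with the standard library alone. This is only an assumption about how it might be done, not existing Biopython code: the helper name as_text_handle, the UTF-8 default, and the sample record are all made up for the example.

```python
import io


def as_text_handle(bytes_handle, encoding="utf-8"):
    """Wrap a binary handle so that reads return str instead of bytes.

    Hypothetical helper; the encoding has to be assumed by the caller,
    which is the tricky part for plain-text resources.
    """
    return io.TextIOWrapper(bytes_handle, encoding=encoding)


# Stand-in for the bytes handle returned by urllib.request.urlopen().
raw = io.BytesIO("ID   EXAMPLE\nDE   caf\u00e9\n".encode("utf-8"))
text = as_text_handle(raw)
for line in text:
    # Each line is now a str, as a text-based parser would expect.
    print(repr(line))
```

The trade-off against the byte-stream approach is that the caller must know (or guess) the encoding up front, whereas an XML parser fed bytes can read it from the document.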