[Biopython-dev] Bytes, Strings and Unicode (Python 2 vs 3)

Peter biopython at maubp.freeserve.co.uk
Thu Jul 29 10:29:28 UTC 2010


Hi all,

I'm forwarding something from the NumPy mailing list regarding strings
and unicode:

On Thu, Jul 29, 2010 at 4:40 AM, Fernando Perez <fperez.net at gmail.com> wrote:
>
> On Wed, Jul 28, 2010 at 12:36 PM, Fernando Perez <fperez.net at gmail.com> wrote:
>> The official Python 2.x unicode story is well explained here:
>> http://docs.python.org/howto/unicode.html
>>
>> and here is the corresponding document for 3.x:
>> http://docs.python.org/release/3.1.2/howto/unicode.html
>
> Just in case you're still thirsty for more info on Unicode... :)
>
> Min Ragan-Kelley just did a great summary writeup of these questions
> from a low-level perspective: for pyzmq we need to handle strings
> (i.e. unicode) at the python level, but efficiently and unambiguously
> communicate with a networking layer written in C.  We spent a lot of
> time thinking about this, and his writeup is a great resource for
> anyone who needs to look at this from a C/low-level angle:
>
> http://ptsg.berkeley.edu/~minrk/zmq/unicode.html
>
> This adds a view that isn't made very explicit in any of the docs I'd
> previously sent.
>
> Cheers,
>
> f

On most Linux distributions Python 3's unicode strings will take
4x the memory of plain byte strings, and even on Windows and Mac
they will take 2x. That concerns me, since I've been using Biopython
for some next-generation sequencing work where memory is already
sometimes the main bottleneck.
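
For anyone who wants to check the per-character cost on their own
interpreter, here is a rough sketch. It assumes CPython (it relies on
sys.getsizeof), the helper name bytes_per_char is just for illustration,
and the figure it reports depends on the Python version and build.

```python
import sys


def bytes_per_char(unit, n=10000):
    """Estimate the storage cost of one character of `unit`.

    `unit` is a one-character str or a one-byte bytes object. Taking the
    difference between two object sizes cancels the fixed per-object
    header, leaving only the cost of the n extra characters.
    """
    small = sys.getsizeof(unit * n)
    large = sys.getsizeof(unit * (2 * n))
    return (large - small) / n


if __name__ == "__main__":
    print("bytes:", bytes_per_char(b"A"))
    print("str:  ", bytes_per_char("A"))
```

The interesting number is the ratio between the two lines of output,
since that is the multiplier a large sequence would pay if held as a
unicode string instead of bytes.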

I think we will want to make the Seq object use bytes internally,
rather than unicode strings. We'll also want to make sure the
Seq module functions accept bytes, unicode strings or Seq
objects.
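
To make the idea concrete, here is a minimal sketch of what that could
look like under Python 3. None of this is existing Biopython code:
BytesSeq, as_bytes and this reverse_complement are hypothetical names
for illustration, and I am assuming ASCII-only sequence data.

```python
def as_bytes(data, encoding="ascii"):
    """Normalise bytes, str or BytesSeq input to a bytes object.

    Hypothetical helper, not part of Biopython.
    """
    if isinstance(data, BytesSeq):
        return data._data
    if isinstance(data, (bytes, bytearray)):
        return bytes(data)
    if isinstance(data, str):
        return data.encode(encoding)
    raise TypeError("expected bytes, str or BytesSeq, got %r" % type(data))


class BytesSeq:
    """Toy sequence object that stores its letters as bytes internally."""

    def __init__(self, data):
        self._data = as_bytes(data)

    def __len__(self):
        return len(self._data)

    def __bytes__(self):
        return self._data

    def __str__(self):
        # Decode only on demand, e.g. for printing.
        return self._data.decode("ascii")

    def __eq__(self, other):
        try:
            return self._data == as_bytes(other)
        except TypeError:
            return NotImplemented

    def count(self, sub):
        # The argument may be bytes, str or another BytesSeq.
        return self._data.count(as_bytes(sub))


# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(data):
    """Module-level function that returns the same type it was given."""
    raw = as_bytes(data).translate(_COMPLEMENT)[::-1]
    if isinstance(data, str):
        return raw.decode("ascii")
    if isinstance(data, BytesSeq):
        return BytesSeq(raw)
    return raw
```

The point of funnelling everything through one conversion helper is
that the memory-heavy data stays as bytes, while callers can keep
passing plain strings if that is what they have.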

For most annotation (e.g. in SeqRecord and SeqFeature objects),
I guess the default of unicode strings will be OK. Perhaps the
SeqRecord's id/name/description might be borderline cases...

Peter



