[Biopython-dev] Newick Parser

Sat Sep 28 17:52:03 UTC 2013

Hi Peter,

I think handling the joys of Unicode might be more trouble than it's
worth, given how few Newick files are likely to be Unicode, and I think
most bioinformatics is still done in plain ASCII English anyway.

I just submitted pull request 241.  It throws an error when a BOM (byte
order mark) is detected.  Right now the parser instead reports that the
number of "(" does not equal the number of ")", which is super confusing.
With an explicit error, the user can just convert the file on their end.
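
To illustrate the idea, here is a minimal sketch of that kind of check.  It
is not the code from the pull request: the names NewickBOMError and
check_for_bom are made up for this example, and it assumes the parser can
look at the raw bytes of the file before parsing.  It only uses the BOM
constants from the standard library codecs module.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because the
# UTF-32 little-endian BOM starts with the UTF-16 little-endian BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32 (little-endian)"),
    (codecs.BOM_UTF32_BE, "UTF-32 (big-endian)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (little-endian)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (big-endian)"),
)


class NewickBOMError(ValueError):
    """Hypothetical error type, used only in this sketch."""


def check_for_bom(raw):
    """Return raw (bytes) unchanged, or raise if it starts with a BOM."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            raise NewickBOMError(
                "File starts with a %s byte order mark; please convert it "
                "to plain ASCII before parsing." % name
            )
    return raw
```

Calling check_for_bom(codecs.BOM_UTF8 + b"(A,B);") raises the error with a
message naming UTF-8, while check_for_bom(b"(A,B);") returns the bytes
untouched.  Working on bytes keeps the check the same under Python 2 and
Python 3.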

All the best,
N

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com] 
Sent: Saturday, September 28, 2013 1:28 PM
To: Nigel Delaney
Cc: Biopython-Dev Mailing List
Subject: Re: [Biopython-dev] Newick Parser

On Sat, Sep 28, 2013 at 4:55 PM, Nigel Delaney <nigel.delaney at outlook.com>
wrote:
>
>>
>> Does it even make sense to allow non-ASCII in Newick format?
>>
>
> I think that's a matter of opinion.  The specs I found discussed how 
> to parse the string, but not how to encode the string.

Right, and they probably all predate Unicode and are implicitly ASCII only.

> The advantages I can see are allowing people to use the extended 
> characters for node/tip label names, and being robust if different 
> text-editors/programs muck with the files (which I would suspect are 
> usually ASCII).

Yep.

> The disadvantage is that it's another case to handle in code, so could 
> just be ignored or throw an exception.
>
> Not sure what the preferred choice for biopython would be.

If you'd like to work on this it sounds useful - but you'll have to be extra
careful about testing under both Python 2 and Python 3 due to the joys of
Unicode.

Peter