[Bioperl-l] Root::IO handle Mac and Win32 LF

Dave Howorth dhoworth at mrc-lmb.cam.ac.uk
Tue Dec 16 12:48:33 EST 2003


Jason Stajich wrote:
> On Tue, 16 Dec 2003, Dave Howorth wrote:
>>Ah, now that's interesting. In this specific case the application,
>>newick.pm, has explicitly opted out of Perl's end-of-line handling by
>>redefining $/ so it can slurp the whole tree at once:
>>
>>    local $/ = ";\n";
>>    return unless $_ = $self->_readline;
>>
>>Which, IMHO, makes it its problem to deal with line breaks.
> 
> Hmmm - SeqIO::fasta does this sort of thing as well.
> 
> This has nothing to do with the individual fields though - it only defines
> how much to slurp in, if it weren't working we'd get two trees mooshed
> together as one record and doesn't affect the multi-lined reports since
> they only have a ; at the end.
> 
> In the end this had nothing to do with Windows LF problems once I had
> Valentin's test file in front of me.
> 
> Adding this to newick.pm after the record is slurped in takes
> care of the problem:
>  s/[\n\r]+//g
> 
> As any sort of newline needs to be stripped out as that is what is
> getting converted to spaces.  It really wasn't a windows problem but
> a problem with Allen's changes to the newick parsing code to replace WS
> with _ but not handling LF separately.
> 
> From the log:
> 
> revision 1.22
> date: 2003/08/15 17:07:27;  author: allenday;  state: Exp;  lines: +3 -2
> removed unnecessary escap char in space removing regex.  added regex to
> remove quotes and leading/trailing spaces
> from node labels as necessary.
> ----------------------------
> revision 1.21
> date: 2003/08/15 08:31:46;  author: allenday;  state: Exp;  lines: +5 -2
> fixing over-zealous whitespace removal from node labels.  we do this by
> not tampering with " quoted strings.  i'm not sure if newick allows " to
> be escaped within these labels... if so, there may be a bug here.
> ----------------------------

"Single quote characters in a quoted label are represented by two single 
quotes." See below for reference.

> My original code stripped all whitespace and thus we never had this
> problem because there shouldn't be any in the node names in Newick
> http://evolution.genetics.washington.edu/phylip/newicktree.html
>  "A name can be any string of printable characters except --->blanks<---,
>  colons, semicolons, parentheses, and square brackets."


I agree that page says that, but it also says:
  "The above description is actually of a subset of the Newick Standard"
and on the page which it points to as the closest thing to a standard:
  <http://evolution.genetics.washington.edu/phylip/newick_doc.html>
you can see that those characters *can* appear in labels. But newlines 
can't.
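
Since newlines can't appear inside labels, removing them from a slurped
record should not damage any label. Here is a minimal sketch of the
slurp-then-strip approach quoted above. The helper name
read_newick_record and the sample trees are made up for illustration;
this is not the actual newick.pm code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: read one ';'-terminated
# record from a filehandle and remove any line breaks inside it.
sub read_newick_record {
    my ($fh) = @_;
    local $/ = ";\n";              # a record ends at the tree's semicolon
    my $record = <$fh>;
    return unless defined $record; # end of input
    $record =~ s/[\n\r]+//g;       # strip line breaks, as in the quoted fix
    return $record;
}

# Invented sample input: two trees, the first spread over three lines.
my $text = "(A:0.1,\n(B:0.2,C:0.3)\n:0.4);\n(D,E);\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
while (defined(my $tree = read_newick_record($fh))) {
    print "$tree\n";
}
close $fh;
```

Each tree comes back as a single line with no embedded line breaks, so
a later whitespace substitution has no newlines left to turn into
spaces or underscores.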

And what should also happen is that underscores in unquoted labels are 
translated to spaces in the internal format, because underscore and 
space are two different valid characters in the quoted format (and thus 
look different in graphical output in a tool that can deal with it e.g. 
TreeTool). But this breaks Bio::Tree or somesuch (don't ask me how I know :(
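
To make the label rules concrete, here is a rough sketch of the
translation I mean, assuming the label has already been cut out of the
tree string. The function name decode_newick_label and the sample
labels are invented for illustration; this is not existing Bio::Tree or
Bio::TreeIO code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a label as written in a Newick file into
# its internal form.
sub decode_newick_label {
    my ($label) = @_;
    if ($label =~ /^'(.*)'$/s) {
        # Quoted label: drop the outer quotes and collapse each doubled
        # single quote to one. Underscores and spaces stay as written.
        my $inner = $1;
        $inner =~ s/''/'/g;
        return $inner;
    }
    # Unquoted label: an underscore stands for a space.
    (my $plain = $label) =~ tr/_/ /;
    return $plain;
}

for my $raw ("Homo_sapiens", "'Homo_sapiens'", "'O''Brien''s frog'") {
    printf "%-20s => %s\n", $raw, decode_newick_label($raw);
}
```

The unquoted Homo_sapiens becomes "Homo sapiens", while the quoted
'Homo_sapiens' keeps its underscore, which is why the two forms need to
stay distinct internally.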

> but apparently he wants to support this for his purposes.

... as do I :)

> I think my small change above takes care of the bug.
> 
> -jason

Cheers, Dave

PS  If anybody wants code that reads full Newick ...
-- 
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960


