[Bioperl-l] Root::IO handle Mac and Win32 LF
Dave Howorth
dhoworth at mrc-lmb.cam.ac.uk
Tue Dec 16 12:48:33 EST 2003
Jason Stajich wrote:
> On Tue, 16 Dec 2003, Dave Howorth wrote:
>>Ah, now that's interesting. In this specific case the application,
>>newick.pm, has explicitly opted out of Perl's end-of-line handling by
>>redefining $/ so it can slurp the whole tree at once:
>>
>> local $/ = ";\n";
>> return unless $_ = $self->_readline;
>>
>>Which, IMHO, makes it its problem to deal with line breaks.
>
> Hmmm - SeqIO::fasta does this sort of thing as well.
>
> This has nothing to do with the individual fields though - it only defines
> how much to slurp in, if it weren't working we'd get two trees mooshed
> together as one record and doesn't affect the multi-lined reports since
> they only have a ; at the end.
>
> In the end this had nothing to do with Windows LF problems once I had
> Valentin's test file in front of me.
>
> Adding this to newick.pm after the record is slurped in takes
> care of the problem:
> s/[\n\r]+//g
>
> As any sort of newline needs to be stripped out as that is what is
> getting converted to spaces. It really wasn't a windows problem but
> a problem with Allen's changes to the newick parsing code to replace WS
> with _ but not handling LF separately.
>
> From the log:
>
> revision 1.22
> date: 2003/08/15 17:07:27; author: allenday; state: Exp; lines: +3 -2
> removed unnecessary escap char in space removing regex. added regex to
> remove quotes and leading/trailing spaces
> from node labels as necessary.
> ----------------------------
> revision 1.21
> date: 2003/08/15 08:31:46; author: allenday; state: Exp; lines: +5 -2
> fixing over-zealous whitespace removal from node labels. we do this by
> not tampering with " quoted strings. i'm not sure if newick allows " to
> be escaped within these labels... if so, there may be a bug here.
> ----------------------------
"Single quote characters in a quoted label are represented by two single
quotes." See below for reference.
> My original code stripped all whitespace and thus we never had this
> problem because there shouldn't be any in the node names in Newick
> http://evolution.genetics.washington.edu/phylip/newicktree.html
> "A name can be any string of printable characters except --->blanks<---,
> colons, semicolons, parentheses, and square brackets."
I agree that page says that, but it also says:
"The above description is actually of a subset of the Newick Standard"
and on the page which it points to as the closest thing to a standard:
<http://evolution.genetics.washington.edu/phylip/newick_doc.html>
you can see that those characters *can* appear in labels. But newlines
can't.
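
For anyone following along, here is a minimal sketch of the record-level handling discussed above: set $/ to ";\n" so that one read returns a whole tree, then strip every line break from the record before any whitespace rewriting happens. It is not the code in newick.pm. The sub name read_trees and the sample data are made up for illustration, and it reads from an in-memory filehandle rather than going through _readline.

```perl
use strict;
use warnings;

# Hypothetical helper: split Newick text into one string per tree.
sub read_trees {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @trees;
    {
        local $/ = ";\n";            # a record ends at ';' plus newline
        while (my $rec = <$fh>) {
            $rec =~ s/[\n\r]+//g;    # drop all line breaks inside the record
            push @trees, $rec if length $rec;
        }
    }
    close $fh;
    return @trees;
}

# Made-up sample: the first tree is wrapped across three lines.
my $data = "(A:0.1,\n(B:0.2,C:0.3)\n:0.4);\n(D:1,E:2);\n";
print "$_\n" for read_trees($data);
# prints:
# (A:0.1,(B:0.2,C:0.3):0.4);
# (D:1,E:2);
```

Since newlines can't occur inside labels, removing them from the whole record should be safe even when quoted labels contain blanks.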
And what should also happen is that underscores in unquoted labels are
translated to spaces in the internal format, because underscore and
space are two different valid characters in the quoted format (and thus
look different in graphical output in a tool that can deal with it e.g.
TreeTool). But this breaks Bio::Tree or somesuch (don't ask me how I know :(
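
To make the two label rules concrete (doubled single quotes inside a quoted label, underscore standing for a blank in an unquoted one), here is a rough sketch of the decoding step. The sub name decode_label is hypothetical and nothing here is taken from Bio::TreeIO. It assumes the label has already been isolated from the surrounding tree string.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a label as written in a Newick file
# into the string it represents.
sub decode_label {
    my ($raw) = @_;
    if ($raw =~ /^'(.*)'$/s) {
        my $label = $1;
        $label =~ s/''/'/g;          # two single quotes stand for one
        return $label;               # underscores and blanks kept as-is
    }
    (my $label = $raw) =~ tr/_/ /;   # unquoted: underscore means blank
    return $label;
}

print decode_label("Homo_sapiens"), "\n";        # Homo sapiens
print decode_label("'Homo_sapiens'"), "\n";      # Homo_sapiens
print decode_label("'O''Brien''s finch'"), "\n"; # O'Brien's finch
```

The point of the distinction is that 'Homo_sapiens' and Homo_sapiens are different names once decoded, so a writer has to quote any label that contains a real underscore or a blank.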
> but apparently he wants to support this for his purposes.
... as do I :)
> I think my small change above takes care of the bug.
>
> -jason
Cheers, Dave
PS If anybody wants code that reads full Newick ...
--
Dave Howorth
MRC Centre for Protein Engineering
Hills Road, Cambridge, CB2 2QH
01223 252960