full newick support (was Re: [Bioperl-l] Root::IO handle Mac and Win32 LF)
Jason Stajich
jason at cgt.duhs.duke.edu
Tue Dec 16 13:15:01 EST 2003
Okay - I see your point Dave, supporting the quoted spaces must be done.
Clearly my flu-clouded brain is trying to do too much right now.
Apologies Allen, you were on the right track; I am clearly not reading up
on things as I should have.
Internally translating unquoted underscores to spaces can presumably be
done at the parser level. We will probably also need auto-quoting in the
writing code, so that labels which need quotes get them on output.
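
A rough sketch of the label handling, going by the rules Dave quotes below
(unquoted underscores mean spaces, quoted labels are left alone, and a
doubled single quote inside a quoted label is one literal quote). The names
decode_label and encode_label are made up for illustration and are not
existing newick.pm methods. The set of characters that forces quoting on
output is my assumption from the Newick description, not something taken
from the module:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a label as written in a Newick string
# into the form we would store internally.
sub decode_label {
    my ($raw) = @_;
    if ($raw =~ /^'(.*)'$/s) {
        my $label = $1;
        $label =~ s/''/'/g;          # '' inside quotes is one literal quote
        return $label;               # underscores and blanks kept as-is
    }
    (my $label = $raw) =~ tr/_/ /;   # unquoted underscore means a blank
    return $label;
}

# Hypothetical helper: the inverse, quoting only when needed.
# Assumption: blanks, underscores, and Newick punctuation force quoting.
sub encode_label {
    my ($label) = @_;
    return $label unless $label =~ /[\s_()\[\]':;,]/;
    (my $quoted = $label) =~ s/'/''/g;   # double any embedded quotes
    return "'$quoted'";
}

print decode_label("Homo_sapiens"), "\n";    # unquoted form
print decode_label("'Homo_sapiens'"), "\n";  # quoted form keeps the underscore
print encode_label("Homo sapiens"), "\n";    # gets quoted on output
```

Always quoting a label with a blank in it, rather than writing it back
with underscores, keeps the round trip simple: decode_label(encode_label($x))
gives back $x for any label.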
I guess we need to strip out the comments as well - not sure we have a
slot for storing them in the current Tree::Tree object right now, but that
can be easily added.
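
For the comments, something along these lines might do as a first pass. It
assumes comments are square-bracketed and may nest, which is my reading of
the newick_doc page Dave points to, and it leaves brackets inside quoted
labels alone. strip_comments is a hypothetical name, and handing the
comments back as a plain list is only a placeholder until there is a real
slot for them on the tree object:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tree text with comments removed,
# followed by the list of comment bodies that were taken out.
sub strip_comments {
    my ($text) = @_;
    my ($out, $comment, @comments) = ('', '');
    my ($depth, $in_quote) = (0, 0);
    for my $ch (split //, $text) {
        if ($depth == 0 && $ch eq "'") {
            $in_quote = !$in_quote;       # '' toggles twice, so state stays right
            $out .= $ch;
        }
        elsif ($in_quote) {
            $out .= $ch;                  # brackets in quoted labels are literal
        }
        elsif ($ch eq '[') {
            $comment .= $ch if $depth++;  # keep inner brackets of nested comments
        }
        elsif ($ch eq ']' && $depth > 0) {
            if (--$depth) { $comment .= $ch }
            else          { push @comments, $comment; $comment = '' }
        }
        elsif ($depth > 0) {
            $comment .= $ch;              # inside a comment
        }
        else {
            $out .= $ch;                  # ordinary tree text
        }
    }
    return ($out, @comments);
}

my ($tree, @found) = strip_comments("(A[first],'B[x]':0.1[outer [inner]])C;");
print "$tree\n";
print "comment: $_\n" for @found;
```

An unterminated comment is silently dropped here; real code would want to
warn about it.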
If there are other things missing that I am unaware of, speak up. I don't
suppose anyone else is keen on working on this?
-jason
On Tue, 16 Dec 2003, Dave Howorth wrote:
> Jason Stajich wrote:
> > On Tue, 16 Dec 2003, Dave Howorth wrote:
> >>Ah, now that's interesting. In this specific case the application,
> >>newick.pm, has explicitly opted out of Perl's end-of-line handling by
> >>redefining $/ so it can slurp the whole tree at once:
> >>
> >> local $/ = ";\n";
> >> return unless $_ = $self->_readline;
> >>
> >>Which, IMHO, makes it its problem to deal with line breaks.
> >
> > Hmmm - SeqIO::fasta does this sort of thing as well.
> >
> > This has nothing to do with the individual fields though - it only defines
> > how much to slurp in. If it weren't working we'd get two trees mooshed
> > together as one record. It doesn't affect the multi-line reports, since
> > they only have a ; at the end.
> >
> > In the end this had nothing to do with Windows LF problems once I had
> > Valentin's test file in front of me.
> >
> > Adding this to newick.pm after the record is slurped in takes
> > care of the problem:
> > s/[\n\r]+//g
> >
> > Any sort of newline needs to be stripped out, as that is what is
> > getting converted to spaces. It really wasn't a Windows problem but
> > a problem with Allen's changes to the newick parsing code, which replace
> > WS with _ but do not handle LF separately.
> >
> > From the log:
> >
> > revision 1.22
> > date: 2003/08/15 17:07:27; author: allenday; state: Exp; lines: +3 -2
> > removed unnecessary escap char in space removing regex. added regex to
> > remove quotes and leading/trailing spaces
> > from node labels as necessary.
> > ----------------------------
> > revision 1.21
> > date: 2003/08/15 08:31:46; author: allenday; state: Exp; lines: +5 -2
> > fixing over-zealous whitespace removal from node labels. we do this by
> > not tampering with " quoted strings. i'm not sure if newick allows " to
> > be escaped within these labels... if so, there may be a bug here.
> > ----------------------------
>
> "Single quote characters in a quoted label are represented by two single
> quotes." See below for reference.
>
> > My original code stripped all whitespace and thus we never had this
> > problem because there shouldn't be any in the node names in Newick
> > http://evolution.genetics.washington.edu/phylip/newicktree.html
> > "A name can be any string of printable characters except --->blanks<---,
> > colons, semicolons, parentheses, and square brackets."
>
>
> I agree that page says that, but it also says:
> "The above description is actually of a subset of the Newick Standard"
> and on the page which it points to as the closest thing to a standard:
> <http://evolution.genetics.washington.edu/phylip/newick_doc.html>
> you can see that those characters *can* appear in labels. But newlines
> can't.
>
> And what should also happen is that underscores in unquoted labels are
> translated to spaces in the internal format, because underscore and
> space are two different valid characters in the quoted format (and thus
> look different in graphical output in a tool that can deal with it e.g.
> TreeTool). But this breaks Bio::Tree or somesuch (don't ask me how I know :(
>
> > but apparently he wants to support this for his purposes.
>
> ... as do I :)
>
> > I think my small change above takes care of the bug.
> >
> > -jason
>
> Cheers, Dave
>
> PS If anybody wants code that reads full Newick ...
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu