full newick support (was Re: [Bioperl-l] Root::IO handle Mac and Win32 LF)

Jason Stajich jason at cgt.duhs.duke.edu
Tue Dec 16 13:15:01 EST 2003


okay - I see your point Dave, supporting the quoted spaces must be done.
Clearly my flu-clouded brain is trying to do too much right now.
Apologies Allen, you were on the right track; I am clearly not reading up
on things as I should.


Internally translating unquoted underscores to spaces can presumably be
done at the parser level.  We will probably have to add an auto-quoting
step to the code that writes trees.
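
A minimal sketch of that label handling, not the actual newick.pm code.
The helper names label_from_newick and label_to_newick are hypothetical,
and the set of characters that forces quoting on output is an assumption
based on the Newick description quoted further down this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a label as written in a Newick string into
# its internal form.
sub label_from_newick {
    my ($raw) = @_;
    if ($raw =~ /^'(.*)'$/s) {
        my $label = $1;
        $label =~ s/''/'/g;          # a doubled quote is one literal quote
        return $label;               # underscores stay as underscores
    }
    (my $label = $raw) =~ tr/_/ /;   # unquoted underscore means a blank
    return $label;
}

# Hypothetical helper: quote a label on output only when it needs it.
sub label_to_newick {
    my ($label) = @_;
    # Assumed set of characters that cannot appear in an unquoted label.
    if ($label =~ /[\s_()\[\]':;,]/) {
        (my $quoted = $label) =~ s/'/''/g;
        return "'$quoted'";
    }
    return $label;
}
```

With this, Homo_sapiens is read as "Homo sapiens" and written back as
'Homo sapiens', while 'Homo_sapiens' keeps its underscore both ways.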

I guess we need to strip out the comments as well - not sure we have a
slot for storing them in the current Tree::Tree object right now, but that
can be easily added.
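
A rough sketch of the comment stripping, assuming comments are delimited
by square brackets as in the Newick description.  The function name
strip_newick_comments is hypothetical, nested comments are not handled,
and where the collected comments would be stored on the tree object is
left open:

```perl
use strict;
use warnings;

# Hypothetical helper: remove [bracketed comments] from a Newick string
# and return the cleaned string plus a reference to the comment list.
sub strip_newick_comments {
    my ($newick) = @_;
    my @comments;
    # Try a quoted label first so brackets inside quotes are left alone.
    $newick =~ s{('(?:[^']|'')*')|\[([^\]]*)\]}{
        defined $1 ? $1 : do { push @comments, $2; '' }
    }ge;
    return ($newick, \@comments);
}

my ($tree, $comments) =
    strip_newick_comments(q{(A[first],'B[x]':1.0)[tree];});
print "$tree\n";               # (A,'B[x]':1.0);
print "$_\n" for @$comments;   # first, then tree
```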

If there are other things missing that I am unaware of, speak up.  I don't
suppose anyone else is keen on working on this?

-jason

On Tue, 16 Dec 2003, Dave Howorth wrote:

> Jason Stajich wrote:
> > On Tue, 16 Dec 2003, Dave Howorth wrote:
> >>Ah, now that's interesting. In this specific case the application,
> >>newick.pm, has explicitly opted out of Perl's end-of-line handling by
> >>redefining $/ so it can slurp the whole tree at once:
> >>
> >>    local $/ = ";\n";
> >>    return unless $_ = $self->_readline;
> >>
> >>Which, IMHO, makes it its problem to deal with line breaks.
> >
> > Hmmm - SeqIO::fasta does this sort of thing as well.
> >
> > This has nothing to do with the individual fields though - it only defines
> > how much to slurp in.  If it weren't working we'd get two trees mooshed
> > together as one record.  It doesn't affect the multi-line reports since
> > they only have a ; at the end.
> >
> > In the end this had nothing to do with Windows LF problems once I had
> > Valentin's test file in front of me.
> >
> > Adding this to newick.pm after the record is slurped in takes
> > care of the problem:
> >  s/[\n\r]+//g
> >
> > Any sort of newline needs to be stripped out, as that is what is
> > getting converted to spaces.  It really wasn't a Windows problem but
> > a problem with Allen's changes to the newick parsing code, which replace WS
> > with _ but do not handle LF separately.
> >
> > From the log:
> >
> > revision 1.22
> > date: 2003/08/15 17:07:27;  author: allenday;  state: Exp;  lines: +3 -2
> > removed unnecessary escap char in space removing regex.  added regex to
> > remove quotes and leading/trailing spaces
> > from node labels as necessary.
> > ----------------------------
> > revision 1.21
> > date: 2003/08/15 08:31:46;  author: allenday;  state: Exp;  lines: +5 -2
> > fixing over-zealous whitespace removal from node labels.  we do this by
> > not tampering with " quoted strings.  i'm not sure if newick allows " to
> > be escaped within these labels... if so, there may be a bug here.
> > ----------------------------
>
> "Single quote characters in a quoted label are represented by two single
> quotes." See below for reference.
>
> > My original code stripped all whitespace and thus we never had this
> > problem because there shouldn't be any in the node names in Newick
> > http://evolution.genetics.washington.edu/phylip/newicktree.html
> >  "A name can be any string of printable characters except --->blanks<---,
> >  colons, semicolons, parentheses, and square brackets."
>
>
> I agree that page says that, but it also says:
>   "The above description is actually of a subset of the Newick Standard"
> and on the page which it points to as the closest thing to a standard:
>   <http://evolution.genetics.washington.edu/phylip/newick_doc.html>
> you can see that those characters *can* appear in labels. But newlines
> can't.
>
> And what should also happen is that underscores in unquoted labels are
> translated to spaces in the internal format, because underscore and
> space are two different valid characters in the quoted format (and thus
> look different in graphical output in a tool that can deal with it e.g.
> TreeTool). But this breaks Bio::Tree or somesuch (don't ask me how I know :(
>
> > but apparently he wants to support this for his purposes.
>
> ... as do I :)
>
> > I think my small change above takes care of the bug.
> >
> > -jason
>
> Cheers, Dave
>
> PS  If anybody wants code that reads full Newick ...
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

