[Biopython-dev] Newick support in Bio.TreeIO?

Wed Jul 29 19:37:08 UTC 2009

On Wed, Jul 29, 2009 at 6:59 PM, Eric Talevich<eric.talevich at gmail.com> wrote:
>>
>> Bio.TreeIO.parse() should be an iterator, returning complete tree
>> objects one by one. I was thinking of having Bio.TreeIO.NewickIO
>> just take a plain text file, split it up at the ";\n" characters (or
>> similar) to get each tree as a string, which is passed to
>> Bio.Nexus.Trees.Tree to parse it.
>
> OK, I did this.
>
> http://github.com/etal/biopython/blob/phyloxml/Bio/TreeIO/NewickIO.py

OK, I haven't run the code but have a couple of points.

On a general point, are you intending to re-write parse and read
functions for each tree format? For Bio.SeqIO all I do is write a
iterator (i.e. a parse) function, and Bio.SeqIO.parse() and also
Bio.SeqIO.read() call this.

I didn't use RuntimeError for SeqIO and AlignIO, I used ValueError.
I figured the data in the handle didn't match the expectations, which
was like saying it had a bad value. It would therefore be more
consistent to do the same.

The parsing code looks weird to me - but that is probably a style
thing. Certainly I had to stare at it to work out what it was doing.
It also has a bug - consider a Newick file containing one tree but
with no trailing semi colon.

On a more serious note, your output code creates a monster string
of all the trees in memory! Don't do this as it ruins the whole memory
benefit of using iterators to keep just one tree in memory at a time:

        lines = (t.to_string(plain_newick=True, plain=plain, **kwargs)
                 for t in trees)
        file.write(';\n'.join(lines)

Instead handle the trees one by one:

    for t in trees : file.write(t.to_string(...) + ";\n")

(I'm assuming like you did that the to_string method won't add the
trailing semi colon and new line.)

>> > Sounds like an incremental parse() function over these trees would be
>> > very useful for distributed bootstrap analysis etc.
>>
>> Exactly. And Bio.TreeIO.read() would be for the special case where
>> the file format contains exactly one tree.
>
> PhyloXML has a top-level object that contains multiple phylogenies, plus
> arbitrary 'other' data; PhyloXML.read() returns one of those object
> regardless of how many phylogenies it contains. Newick doesn't have a
> top-level container, so returning one tree and raising a RuntimeError if
> there isn't exactly one tree makes sense. But Nexus has a top-level
> container with (potentially) a bunch of other info -- should NexusIO.read()
> return the complete Nexus object, or just pretend to be a Newick wrapper and
> behave that way?

Ah. The top level information about all the trees may cause trouble
for the TreeIO model I had in mind (which was *just* for trees). The
advantage of this is a consistent API, the downside is certain file
format specific things cannot be supported nicely. I think this balance
has worked nicely for SeqIO and AlignIO to date. So:
* Bio.TreeIO.read(...) would return one tree.
* Bio.TreeIO.parse(...) would iterate over trees one by one.
* Bio.TreeIO.write(...) would write trees out (ideally sequentially
if the file format allows this).

Note I am assuming it is possible to write a PhyloXML tree with
minimal (empty) top level annotation? You would need to do this
in order to convert from a Nexus or Newick tree to a (minimal)
PhyloXML tree.

So, based on how SeqIO and AlignIO work, I would expect Bio.TreeIO
would only give you the trees - you'd not get the top level information.
For parsing Nexus files, Bio.TreeIO would only give access to a
subset of the data in a Nexus file - just the trees. In the same way,
parsing a Nexus file with AlignIO only gives you the alignment. If
you want any of the other data in a Nexus file, you have to use the
Bio.Nexus module.

If you (as a user) needed the top level annotation in a PhyloXML file,
then I would say use Bio.PhyloXML (or what ever we are calling it)
directly instead of Bio.TreeIO.

>> As far as I know, Bio.Nexus just parses a whole file in one go. This
>> means either Bio.TreeIO.NexusIO would call this and then loop over
>> the list (very memory inefficient), or it would need a minimal Nexus
>> parser just to spot the TREES block, and handle them only.
>
> That's what I pictured for a Bio.Nexus refactoring -- I don't know the right
> way to do it in a memory-efficient way, though, given that there are
> multiple types of blocks and they may be needed at different times. Maybe
> make an initial pass to index the file at the block level, then call
> incremental line-level parsers on the selected blocks? Or, simpler, factor
> out the efficient line-level parsers so that they can be accessed separately
> if need be -- basically the way Nexus._tree() works now -- and let the
> block-level parsing code call those specific parsers.

Maybe. Of course, in practice Nexus files may not be that big. I don't
know if anyone uses them to store (for example) 1000 bootstrap trees.
As Brad and I have noted before, spending time on refactoring Bio.Nexus
is not the best use of your GSoC project time (plus we'd need to get
Cymon and Frank much more involved, worry more about backwards
compatibility etc).

>> I'm not actually sure where I put it... it should be on my old desktop
>> at home somewhere. However, I can elaborate in that in addition NJ
>> using quicktree, I also did parsimony bootstrap values, and drew my
>> own colourful trees using reportlab. See the three supplementary
>> figures here: http://dx.doi.org/10.1099/mic.0.2007/013672-0
>
> Hey, neat. I was about to start a project involving kinases and response
> regulators.

Cool - email me off list if you want to chat more about this aspect.

> How much trouble was it to draw trees in reportlab? Do you think
> it would be worth adding a tree-drawing module to Bio.Graphics?

I agree that tree drawing would be a nice addition to Bio.Graphics.

But that code of mine as written would not be good enough. In the
end it was a bit of a hack - it got the job done but had lots of special
cases (e.g. to get colouring by species to work, and in particular the
double bootstrap values caused me pain as I had to have two otherwise
identical trees loaded). Even ignoring this, the basic code didn't use
an object orientated approach which makes it a poor match to the
rest of Bio.Graphics. Basically I would want to rewrite it from scratch
before I felt it was fit for public reuse, and have never found the time.

Peter