[Biopython-dev] PhyloXML read/parse functions and handles

Sun May 10 05:22:46 UTC 2009

On Sat, May 9, 2009 at 6:06 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> Are you happy to have feedback on your PhyloXML code in public?

Sure am! I was just getting around to drafting up some questions for
biopython-dev, but I'm glad to receive some preemptive advice.

> I just had a look at the stub in Bio/PhyloXML/__init__.py and
> Bio/PhyloXML/Parser.py on your github branch,
> http://github.com/etal/biopython/tree/phyloxml
>
> The convention we are following in Biopython for parsing functions is
> as follows:
> read(handle, ...) - returns a single object (e.g. a tree in your case)
> parse(handle, ...) - returns an iterator (e.g. returning multiple trees)
>
>
I noticed that; I'll change the Bio.PhyloXML.Parser.parse() stub to read()
and have it behave as expected.
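
For reference, a minimal sketch of the read()/parse() split on top of
ElementTree.iterparse(), written for current Python 3. It is only an
illustration, not the code on the branch: it yields raw <phylogeny> elements
rather than real tree objects, and the namespace string is my assumption
about what the phyloXML schema declares.

```python
import xml.etree.ElementTree as ET

# Assumed phyloXML namespace; adjust if the schema says otherwise.
PHYLOGENY_TAG = "{http://www.phyloxml.org}phylogeny"


def parse(handle):
    """Iterate over the <phylogeny> elements in an open handle."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        # An "end" event means the element and its children are complete.
        if elem.tag == PHYLOGENY_TAG:
            yield elem


def read(handle):
    """Return the only <phylogeny> in the handle, or raise ValueError."""
    trees = parse(handle)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No phylogenies found in handle") from None
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one phylogeny found; use parse() instead")
```

The point of the split is that read() can be strict about "exactly one"
while parse() stays lazy for files with many trees.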

The function currently allows either filenames or file handles as the source
because ElementTree.iterparse() also accepts either object as a source. The
read() function could "assert not isinstance(infile, str)", I guess...
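
A rough sketch of that check, using a hypothetical helper name
(require_handle is not in the branch). It raises TypeError instead of using
assert, since asserts are skipped when Python runs with -O, and it tests for
a read() method rather than only ruling out strings:

```python
def require_handle(source):
    """Return source unchanged if it looks like an open file-like object.

    Hypothetical helper: rejects filenames (str/bytes) and anything
    without a read() method.
    """
    if isinstance(source, (str, bytes)) or not hasattr(source, "read"):
        raise TypeError(
            "expected an open file handle, got %s; open the file first"
            % type(source).__name__
        )
    return source
```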

The existing Java implementation in Forester/ATV has even more magic,
automatically performing Zip extraction if the given filename ends with
'.zip'. Since this looks like it will be a pretty common use case, at least
for big files, I thought it would be nice to also offer a wrapper function
that takes a filename and does the Right Thing -- that's what
__init__.read() does currently. Is there a precedent for this in Biopython?
The name should probably be something different; in the pdbtidy branch I
used load(), to match the pickle module, since the wrapper function does
more than just parse or read a file.

So how about:

from Bio import PhyloXML
handle = open('somefile', 'r') # file-like object from any source
tree = PhyloXML.read(handle)

Equivalent to:

from Bio import PhyloXML
tree = PhyloXML.load('somefile') # DTRT for xml, zip, gz, ...?

Or, to be explicit, offer a read_zip or load_zip function. I'd leave well
enough alone, but the incantation to extract a character stream from a
single zipped file is kind of unintuitive, and one of the three example
files on phyloxml.org is already zipped. (I should really ask Christian
Zmasek about this to see if that's a real convention or not.)
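
To make the idea concrete, here is a minimal sketch of the filename-to-handle
step such a wrapper would need, using only the standard zipfile and gzip
modules. The function name open_maybe_compressed is made up for this example,
and dispatching on the file extension is an assumption, not an agreed
convention.

```python
import gzip
import io
import zipfile


def open_maybe_compressed(filename):
    """Return a binary file-like object for a plain, .gz or .zip file.

    Hypothetical helper; a .zip archive must contain exactly one member.
    """
    if filename.endswith(".zip"):
        with zipfile.ZipFile(filename) as archive:
            names = archive.namelist()
            if len(names) != 1:
                raise ValueError(
                    "expected exactly one file in %s, found %d"
                    % (filename, len(names))
                )
            # Reads the whole member into memory: simple, but a streaming
            # approach would be kinder to very large files.
            return io.BytesIO(archive.read(names[0]))
    if filename.endswith(".gz"):
        return gzip.open(filename, "rb")
    return open(filename, "rb")
```

A load(filename) wrapper would then just pass the returned handle to read(),
so the handle-based function stays free of any filename magic.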

> P.S. Finally, a more general note about a possible "Bio.TreeIO"
> module. For simple Newick trees, a single file can contain one or more
> trees (e.g. from bootstrapping).  A tree can be split over multiple
> lines (but may be one long line), but multiple trees can be split up
> because they should all have a semicolon terminator.  For Nexus files,
> I'm not sure off hand if there can be more than one tree.  If you are
> going to use the Tree objects from Bio.Nexus, then we could provide a
> "Bio.TreeIO" module with read/parse/write methods coping with
> "newick", "nexus", "phyloxml" formats, all using the same tree
> objects.
>
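
To illustrate the semicolon-terminator point above, a minimal sketch of
splitting a multi-tree Newick file into one string per tree. The name
split_newick is hypothetical, and the sketch assumes no semicolons appear
inside quoted labels and that whitespace at line breaks is not significant.

```python
def split_newick(handle):
    """Yield one Newick tree string at a time from a text handle.

    Hypothetical helper: trees may span several lines, and several
    trees may share a line; each ends at a semicolon.
    """
    pending = []
    for line in handle:
        # A line may hold the end of one tree and the start of the next.
        while ";" in line:
            head, line = line.split(";", 1)
            pending.append(head.strip())
            yield "".join(pending) + ";"
            pending = []
        if line.strip():
            pending.append(line.strip())
    if pending:
        raise ValueError("trailing text without a semicolon terminator")
```

Because it is a generator, this shape would fit a parse()-style function
directly, with read() layered on top for single-tree files.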

OK, I'll give it a try. Brad recommended that I just get a simple PhyloXML
parser working first before attempting integration, but if some of Bio.Nexus
can be reused in that process, great. I'm about to go dark from the end of
this week until 3/31 (getting married, yaknow), but I'll fix all this code
when I get back and have access to git again.

Thanks for your help,
Eric