[Biopython-dev] PhyloXML helper functions

Eric Talevich eric.talevich at gmail.com
Wed Jul 8 18:58:52 UTC 2009


Hi Brad,

On Tue, Jul 7, 2009 at 8:51 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

> > 2. A find() method on Clade and maybe Phylogeny objects
> [...]
> > Enhancements:
> > - The keyword argument could be a regular expression. Would that be
> > useful?
>
> This seems useful. Often people use crazy naming convention hacks,
> and might want to pull out something like all proteins from a
> particular organism based on a common prefix in the name.
>
> > To handle numbers, I'd have to convert every sub-node attribute value to a
> > string, and that would be weird -- or else find() would have to skip
> > numerical attributes.
>
> Is this if you support regular expressions or either way? For the
> find, I think it's sufficient to define what you support and leave
> it at that set: any subset of searching will help people get their
> work done.
>

I implemented it. Here's the signature and docstring:

def find(self, cls=None, **kwargs):

"""Find all sub-nodes matching the given attributes.

The 'cls' argument specifies the class of the sub-node. Nodes that inherit
from this type will also match. (The default, Tree.PhyloElement, matches any
standard phyloXML type.)

The arbitrary keyword arguments indicate the attribute name of the sub-node
and the value to match: string, integer or boolean. Strings are evaluated as
regular expression matches; integers are compared directly for equality, and
booleans evaluate the attribute's truth value (True or False) before
comparing. To handle nonzero floats, search with a boolean argument, then
filter the result manually.

If no keyword arguments are given, then just the class type is used for
matching.

The result is an iterator over all matching objects, in depth-first order.
(Not necessarily the order in which the elements appear in the source
file!)

Example:

>>> tree = PhyloXML.read('phyloxml_examples.xml').phylogenies[5]
>>> matches = tree.clade.find(code='OCTVU')
>>> matches.next()
Taxonomy(code='OCTVU', scientific_name='Octopus vulgaris')
"""

Notes:
- Phylogeny.find just directly calls self.clade.find and returns the result.

- I still use PhyloElement instead of object for the default class. The
  recursive function uses __dict__ to walk the tree, so allowing any object
  to be searched leads to chaos (e.g. int.__dict__ has 55 keys). Restricting
  the search to Tree-related nodes still accommodates most use cases, I think.

- Depth-first search - if a node that matches has subnodes that also match,
  the higher node will be yielded first, then the first matching subnode, and
  so on. But: since the object dictionary doesn't keep XML node order, the
  order the matches are returned in isn't always what you'd expect. I think I
  can mitigate this somewhat, but still -- documented weirdness.
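
The traversal described in the notes above can be sketched roughly as follows.
This is a simplified Python 3 illustration, not the actual implementation: the
classes are hypothetical stand-ins for the real Tree types, and sorting the
attribute names is just one possible way to get a reproducible order (it is
still not the XML source order).

```python
class PhyloElement:
    """Hypothetical stand-in for Tree.PhyloElement."""

class Taxonomy(PhyloElement):
    def __init__(self, code):
        self.code = code

class Clade(PhyloElement):
    def __init__(self, name, taxonomy=None, clades=()):
        self.name = name
        self.taxonomy = taxonomy
        self.clades = list(clades)

def walk(node, cls=PhyloElement):
    """Yield node, then its sub-nodes, depth-first (parents first)."""
    if isinstance(node, cls):
        yield node
    # Walk the instance dictionary; sorted names make the order repeatable.
    for name in sorted(vars(node)):
        value = vars(node)[name]
        children = value if isinstance(value, (list, tuple)) else [value]
        for child in children:
            # Descend only into tree elements; ints, strings, None are skipped.
            if isinstance(child, PhyloElement):
                yield from walk(child, cls)

root = Clade("root", clades=[
    Clade("A", taxonomy=Taxonomy("OCTVU")),
    Clade("B"),
])
print([type(n).__name__ for n in walk(root)])
print([t.code for t in walk(root, Taxonomy)])
```

Restricting the descent to PhyloElement instances is what keeps the walk from
wandering into the attributes of built-in objects.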


Thanks,
Eric


