[Biopython-dev] [Biopython - Bug #3386] (New) NewickIO parse_tree is slow

Wed Oct 10 13:02:23 UTC 2012

Issue #3386 has been reported by Aleksey Kladov.

----------------------------------------
Bug #3386: NewickIO parse_tree is slow
https://redmine.open-bio.org/issues/3386

Author: Aleksey Kladov
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 

In the file NewickIO.py class Parser method _parse_subtree seems to be inefficient in time and space. In fact, it's running time is quadratic in respect to size of input, while it can be linear. The problem is that each symbol is read many (up to O(len(text))) times, for example here

<pre>
for posn in range(1, close_posn):
            if text[posn] == '(':
                plevel += 1
            elif text[posn] == ')':
                plevel -= 1
            elif text[posn] == ',' and plevel == 0:
                subtrees.append(text[prev:posn])
                prev = posn + 1
</pre>

or here

<pre>
comment_start = text.find(NODECOMMENT_START)
</pre>

Also, _parse_subtree relies heavily on slices and strips of strings, which gives quadratic memory consumption.

Here is my dirty patched implementation. It's incomplete in many senses, I wrote it only to prove that parsing can be done faster.

For unrooted binary tree with 15000 leaves it runs for 1 second, compared to 13 seconds from current implementation.

<pre>
def _parse_tree(self, text, rooted):
        """Parses the text representation into an Tree object."""
        # XXX Pass **kwargs along from Parser.parse?
        return Newick.Tree(root=self._parse_subtree_fast(text)[0], rooted=rooted)

    def _parse_subtree_fast(self, text):
        id = re.compile(r'[A-Za-z0-9_]+')
        children = []
        if text.startswith('('):
            text = text[1:]
            while True:
                child, text = self._parse_subtree_fast(text)
                children.append(child)
                if text.startswith(','):
                    text = text[1:]
                else:
                    text = text[1:]
                    break
        m = re.match(id, text)
        if m:
            clade = self._parse_tag(m.group())
            text = text[m.end():]
        else:
            clade = Newick.Clade(comment=None)
        clade.clades = children
        return clade, text
</pre>

PS. I don't know if someone really needs to parse huge trees with BioPython, but I need this feature for couple of http://rosalind.info problems

----------------------------------------
You have received this notification because this email was added to the New Issue Alert plugin

-- 
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here and login: http://redmine.open-bio.org