[Biopython-dev] Re: Martel performance

Andrew Dalke dalke at acm.org
Sat Dec 16 21:12:21 EST 2000


Short version.  I found the source of the slowdown.  I'll
revert my change which caused the problem, but that reintroduces
a behavioural problem I don't like.  Unfortunately, it's a
behaviour which is pretty inherent in Martel and I don't see
an easy fix, so it will have to stay in until someone better
than I figures out a good solution.

Me:
>I'm finding my PIR and the GenBank parser (the last rather modified
>from Brad's because I was trying to be more strict on whitespace)
>to be pretty slow.  The PIR parser only parses 43K of text per
>second while the GenBank one is but 6.6K/second.  Compare that
>to the SwissProt parser where I was parsing the whole file
>in 20 minutes, which is about 200K per second.

>I can only think of a few reasons which might cause this:

Ha!  There was another reason, and it turned out to be the
real answer.

The MaxRepeat expression (used for "*" repeats) had a
bug where it didn't fully allow backtracking.  For
example, suppose you have

  ([A-C][X-Z])*[A-F]

and match it against "BZA".  The bug in the code was that
it would match B against "[A-C]" and then Z against "[X-Z]".
It would next try the A against "[A-C]", which would match,
so it would try matching [X-Z], which fails since there are
no more characters.
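
For comparison, here's how Python's re module (a standard
backtracking engine) handles the same case; it undoes the
partial second iteration, so the overall match still succeeds:

    import re

    # The repeat tries a second iteration (A against [A-C]), fails
    # on [X-Z] at end-of-string, backs up to a single iteration
    # ("BZ"), and lets [A-F] take the "A".
    m = re.match(r"([A-C][X-Z])*[A-F]", "BZA")
    print(m.group(0))   # prints "BZA"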

The bug was that it wouldn't backtrack out of a *partial*
match, so it wouldn't try to see if [A-F] matches.  This
was because I was using

    if max_count == sre_parse.MAXREPEAT:
        result.append( (None, TT.SubTable, tuple(tagtable),
                        +1, 0))

which expands into the current taglist instead of creating a new
subtaglist.  (Meaning matches were added to the current list
as they were found, rather than being built up in a sublist and
merged in only if the whole subpattern matched.)

The fix I did was to replace the SubTable with a Table, and
use a fake tag name to tell mxTextTools to append the matches
upon success.

    if max_count == sre_parse.MAXREPEAT:
        result.append( (">ignore", TT.Table, tuple(tagtable),
                        +1, 0))
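
To make the SubTable/Table distinction concrete, here is a
minimal sketch using mxTextTools directly.  (These are not
Martel's actual tables - the "letters" tag and the trivial
inner table are made up for illustration.)

    from mx.TextTools import tag, AllIn, Table, SubTable

    inner = ( ("letters", AllIn, "ABC"), )

    # SubTable: the inner matches land directly in the caller's
    # taglist as they are found.
    flat = ( (None, SubTable, inner), )

    # Table: the inner matches are built up in a fresh sublist
    # which is merged in, under the tag, only if the subtable
    # succeeds.
    nested = ( (">ignore", Table, inner), )

    print(tag("ABC", flat))    # taglist: [('letters', 0, 3, None)]
    print(tag("ABC", nested))  # taglist: [('>ignore', 0, 3, [...])]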

The consequence is that every repeat creates a new list with
the tag ">ignore" associated with it.  This explains both the
memory use and the poor performance.

E.g., consider ".*\n".  This is converted to something like
  (?P<ignore>.)*\n
which means matching a line of text of length N (plus the "\n")
creates N sublists - one for each character.

When I took that fix out of Martel, the performance, which
was about 2 records per second, went up to 54 records per
second.  My test set, gbpri8, is 96MB and can be parsed in
525 seconds, or 187K/second.  This is equivalent to what I
was getting for parsing SWISS-PROT.

That was actually the clue to how I found the problem.  I was
showing off Martel yesterday and noticed the SWISS-PROT parser
was a lot slower than it used to be.  That indicated that the
shift in performance had nothing to do with the machine or the
specific file format, but with some change in Martel.  I mulled
it over enough that this morning, when I was trying to sleep in,
I ended up instead thinking about what could be the cause.
Luckily, I remembered what I was thinking about when I finally
did wake up :)

Going back to the topic, this actually points out a problem
with Martel: it isn't a true regular expression engine.
Once a full match occurs it doesn't consider other alternatives.
Consider something which parses some of the feature keys for
GenBank.  It may have something like

  ... |prim_transcript|primer|primer_bind|protomer| ...

Suppose you have the key "primer_bind".  In that case, "primer"
will match (because it's a prefix of the word).  So next it
tries to match the spaces after the key, and that fails, because
'_bind' isn't a space.  A real regular expression engine would
backtrack, throw the 'primer' match away and try again.
Martel doesn't do that.  Once it makes a match for a given
grouping, it stays matched.  Higher-level matches may discard
submatches, which is why Martel appears to do backtracking.
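
Python's re module shows the difference.  Its engine retries
the alternation when the following "\s" fails, so the order of
the alternatives doesn't matter:

    import re

    # "primer" matches first, "\s" then fails on "_", and the engine
    # backs up and tries "primer_bind" - Martel would have stopped.
    pat = re.compile(r"(prim_transcript|primer|primer_bind|protomer)\s")
    print(pat.match("primer_bind ").group(1))   # prints "primer_bind"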

The workaround for this '|' problem is to put the longer patterns
first, so place "primer_bind" before "primer".  Similarly, there
is a workaround for the "*" problem: put the subpattern in an
explicit group rather than rely on my solution of always inserting
an implicit '>ignore' group.
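
In pattern-string form the two workarounds look something like
this (illustrative fragments, not actual Martel format
definitions):

    # '|' workaround: longer keys before their prefixes, so the one
    # alternative Martel commits to is the right one.
    keys = r"prim_transcript|primer_bind|primer|protomer"

    # '*' workaround: name the repeated subpattern explicitly instead
    # of relying on the implicit '>ignore' group from my fix.
    pairs = r"(?P<pair>[A-C][X-Z])*[A-F]"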

As another example of the problem, suppose you want to match
"\s+\n".  Martel will fail because \s+ consumes the final "\n",
so there is no text left to match the \n after the \s+.
Again, a standard regexp engine would give the "\n" back from
the \s+ match and try again.  Martel does not.  The workaround
is to write something like " *\n".
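
Once more, Python's re module illustrates the difference:

    import re

    # A backtracking engine gives the final "\n" back out of "\s+":
    print(re.match(r"\s+\n", "   \n"))   # matches "   \n"

    # The Martel-safe form keeps "\n" out of the repeated set, so
    # nothing ever needs to be given back:
    print(re.match(r" *\n", "   \n"))    # also matches "   \n"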

I don't really like these workarounds because they require
people to be more aware of the differences between Martel's
regular expressions and the standard ones.  I haven't been
clever enough to figure out a good solution using mxTextTools.
On the other hand, I haven't been greatly concerned with it
because:
  - Martel's behaviour is a subset of standard regular expressions,
so if Martel matches, so will the standard one;
  - I figure someone cleverer than I may contribute a good solution;
  - At some point the whole evaluation engine may be replaced
by a C extension which can be made portable to Perl, Tcl, Java, etc.;
  - I've still been prototyping to see what's useful and what isn't;
  - and of course, I know how to do the workarounds :)

I've been reading a bit of the regular expression and pattern
matching literature.  There are a lot of terms for describing
the types of regular expression languages.  For example, the
content models in SGML DTDs are "1-unambiguous" because a parser
only needs to look ahead a single tag to determine the next
step in the DTD.  There are also "deterministic" regular
expressions.  I've decided I really need to talk to someone
who knows the field...

                    Andrew




