[BioPython] New faster Fasta parsing using Martel

Brad Chapman chapmanb at uga.edu
Wed Feb 25 19:10:09 EST 2004


Hey all;
One of the major problems I've had with the current Fasta parser in
Biopython is that it can be quite slow when parsing large files or
files with large records in them. To try to get around this problem,
I've re-implemented the Fasta parsers using Martel to do the
underlying parsing. This means that everything is now parsed using
mxTextTools and C code, so it is much faster and scales to large
files. My new implementation is API compatible with the old code, so
it should be a drop-in replacement that does not affect any already
written code, which will immediately take advantage of the speedups.

I'd like to see what people think about me submitting this to CVS so
that people can take a whirl on it and shake out
bugs/incompatibilities and so on. The new __init__.py and a new
test_Fasta.py (which is more inclusive than the old one) are at:

http://evostick.agtec.uga.edu/~chapmanb/bp/new_fasta.tar.gz

The main advantages of the new code are:

1. Faster, faster, faster

2. Indexing now uses the standard open-bio indexing that was worked
out during the hackathons the past two years (and implemented by
Andrew in Bio.Mindy). So, indexed files should be able to be read by
BioPerl, BioJava and BioRuby (and likewise, they can read ours).

As with all things there are a couple of caveats with the new code:

1. The indexing does not support specifying the Dictionary object
with a filename if the originally indexed file is moved. This
reverses a change made a few months ago in response to a problem
Chunlei Wu was having. I guess we'd just have to recommend that the
solution to this problem is to use symbolic links in the file system.

2. The new indexing uses BerkeleyDB indexing. This should be
included by default on all Pythons (since 2.1? I'm not sure exactly
when it came in) as far as I know, but I'm not sure how widely
supported it is across platforms. If it is a big problem and a lot of
people find out they don't have BerkeleyDB on their systems, we
could switch to flat file indexing (also a standard for open-bio)
by default without any problems.
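
A quick way to find out whether BerkeleyDB is a problem on a given
system is to check whether any Python binding for it can be imported.
The following is only a sketch and is not part of the new code: the
function names and the "BerkeleyDB"/"flat" labels are made up for
illustration, the list of module names is an assumption about which
bindings might be installed, and it needs a Python that has
importlib.util.

```python
import importlib.util

# Assumed candidates: module names that have provided BerkeleyDB
# bindings for Python ("bsddb" in older standard libraries, "bsddb3"
# and "berkeleydb" as third-party packages).
BERKELEYDB_MODULES = ("bsddb", "bsddb3", "berkeleydb")


def first_available(module_names):
    """Return the first importable top-level module name, or None."""
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def choose_index_type(module_names=BERKELEYDB_MODULES):
    """Pick "BerkeleyDB" if a binding is installed, otherwise "flat".

    The two labels are hypothetical and only illustrate the fallback;
    they are not Biopython settings.
    """
    if first_available(module_names) is not None:
        return "BerkeleyDB"
    return "flat"


if __name__ == "__main__":
    print("index type:", choose_index_type())
```

Running this on a few different platforms would give a rough idea of
how often the flat file fallback would actually be needed.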

I'd be very interested in feedback about whether I should go ahead
and commit this to CVS and whether it looks good and all those sorts
of things. Thanks much!
Brad

