[Biopython-dev] [Bug 2738] Speed up GenBank parsing, in particular location parsing

Wed Jan 21 18:30:27 UTC 2009

http://bugzilla.open-bio.org/show_bug.cgi?id=2738

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2009-01-21 13:30 EST -------
Created an attachment (id=1206)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1206&action=view)
Patch for Bio/GenBank/__init__.py to handle simple locations with re

This patch handles the simple cases (non-fuzzy, no database references) using
plain Python and regular expressions.  Everything else (e.g. fuzzy locations)
falls back on the old spark based Bio.GenBank.LocationParser code.
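
To illustrate the idea (this is only a minimal sketch, not the code in the
attached patch), the fast path could look something like the following.  The
function names, the two regular expressions and the (start, end, strand) return
value are all assumptions made up for this example; the real patch builds
Biopython location objects and may cover more cases.

```python
import re

# Hypothetical patterns for illustration; the attached patch may differ.
# Plain range, e.g. "123..456"
_SIMPLE_LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")
# Reverse strand range, e.g. "complement(123..456)"
_COMPLEMENT_LOCATION = re.compile(r"^complement\((\d+)\.\.(\d+)\)$")


def parse_simple_location(text):
    """Return (start, end, strand) for a simple location, else None.

    The start is converted to zero-based counting (Python slice style)
    and strand is +1 or -1.  None means "not a simple case".
    """
    match = _SIMPLE_LOCATION.match(text)
    if match:
        start, end = match.groups()
        return int(start) - 1, int(end), 1
    match = _COMPLEMENT_LOCATION.match(text)
    if match:
        start, end = match.groups()
        return int(start) - 1, int(end), -1
    return None


def parse_location(text, fallback):
    """Try the regular expression fast path, then the slow full parser.

    fallback is any callable taking the location string; it stands in
    for the existing spark based parser in this sketch.
    """
    result = parse_simple_location(text)
    if result is not None:
        return result
    return fallback(text)
```

Anything fuzzy such as "<123..456", or with a database reference such as
"J00194.1:100..202", does not match either pattern and so is handed to the
fallback unchanged.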

The new code is pretty simple, and could potentially be extended to cover all
the location strings currently found in the feature table, allowing us to
remove the use of Bio.GenBank.LocationParser, which in the long term could
lead to an overall code simplification.

In the short term, this patch does complicate the location parsing because it
means there are effectively two ways we parse the location strings (my new
code, and the old spark based Bio.GenBank.LocationParser code).

However, from my limited testing using Python 2.5 on the Mac with GenBank files
for large bacterial genomes, this may be a price worth paying.  I'd like
independent measurements (and to check this on other platforms), but this does
seem to more than halve the time taken to parse GenBank files!
