[Biopython-dev] Bio.GenBank FeatureParser vs RecordParser

Peter biopython-dev at maubp.freeserve.co.uk
Sun Sep 17 11:05:14 UTC 2006


Peter wrote:
> I've been looking at some timings for parsing GenBank files, in 
> particular FeatureParser vs RecordParser
> 
> The test file I'm using is one of the largest bacterial genomes, the 
> GenBank file is almost 24MB:
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk
> 
> On my nice new desktop:
> 
> RecordParser takes about 5s to return a Bio.GenBank.Record object.
> 
> FeatureParser takes about 45 to 50s to return a SeqRecord object.
> 
> ...
> 
> The other option (which I do plan to look into) is improving the 
> location parser so that it doesn't cause such a slow down.
> 

I started this thread on the discussion list, but this follow-up is 
probably better off on the development list...

With the following fairly small change to Bio/GenBank/LocationParser.py, 
the time taken by the FeatureParser is almost halved (from about 45 to 
50s down to about 27 or 28s).

Old code:

def scan(input):
     scanner = LocationScanner()
     return scanner.tokenize(input)

def parse(tokens):
     #print "I have", tokens
     parser = LocationParser()
     return parser.parse(tokens)


New code:

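# Build the scanner once at import time and reuse it for every call.
_cached_scanner = LocationScanner()
def scan(input):
     return _cached_scanner.tokenize(input)

# Likewise, build the parser once and reuse it.
_cached_parser = LocationParser()
def parse(tokens):
     #print "I have", tokens
     return _cached_parser.parse(tokens)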


These two functions are called once for every feature by the location 
method of the _FeatureConsumer class in Bio/GenBank/__init__.py.

I checked that test_GenBank and test_GenBankFormat still pass.

My change means the LocationScanner() and LocationParser() objects are 
created once and then reused - rather than being recreated for each feature.
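
As a rough illustration of the reuse idea (a toy sketch, not the Biopython 
code): ToyScanner below is a hypothetical stand-in for LocationScanner, and 
its regular expression is only an assumption made so the example runs on 
its own. It counts how many objects get constructed under each approach.

```python
import re


class ToyScanner:
    """Hypothetical stand-in for LocationScanner (not the real class)."""

    instances_created = 0

    def __init__(self):
        ToyScanner.instances_created += 1
        # Stands in for whatever setup work the real scanner does.
        self._pattern = re.compile(r"\d+|\.\.|[(),<>^]|[A-Za-z_]+")

    def tokenize(self, text):
        return self._pattern.findall(text)


def scan_fresh(text):
    # Old style: a new scanner object for every call.
    return ToyScanner().tokenize(text)


locations = ["join(1..10,20..30)", "complement(5..99)", "<1..>200"]

fresh = [scan_fresh(loc) for loc in locations]
fresh_count = ToyScanner.instances_created  # one object per location

# New style: one module-level scanner, reused by every call.
ToyScanner.instances_created = 0
_cached_scanner = ToyScanner()


def scan_cached(text):
    return _cached_scanner.tokenize(text)


cached = [scan_cached(loc) for loc in locations]
cached_count = ToyScanner.instances_created  # a single object in total

print(fresh_count, cached_count)
```

Both versions return the same tokens; only the number of constructed 
objects differs. For a genome with thousands of features, that difference 
is paid once per feature in the old code.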

Alternatively, the _FeatureConsumer could create its own copies of these 
objects (once) and call them directly instead of using the scan and 
parse functions.  This also works and takes a similar amount of time.
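
The alternative could look roughly like the sketch below. All three classes 
are hypothetical toys written for this illustration (the real 
_FeatureConsumer, LocationScanner and LocationParser do far more); the 
point is only that the consumer builds its helpers once in __init__ and 
calls them directly from its location method.

```python
import re


class ToyScanner:
    """Hypothetical stand-in for LocationScanner."""

    def __init__(self):
        self._pattern = re.compile(r"\d+|\.\.|[(),<>^]|[A-Za-z_]+")

    def tokenize(self, text):
        return self._pattern.findall(text)


class ToyParser:
    """Hypothetical stand-in for LocationParser: collects start..end pairs."""

    def parse(self, tokens):
        spans = []
        for i, tok in enumerate(tokens):
            if tok == ".." and 0 < i < len(tokens) - 1:
                before, after = tokens[i - 1], tokens[i + 1]
                if before.isdigit() and after.isdigit():
                    spans.append((int(before), int(after)))
        return spans


class ToyFeatureConsumer:
    """Hypothetical consumer that owns one scanner and one parser."""

    def __init__(self):
        # Created once per consumer, not once per feature.
        self._scanner = ToyScanner()
        self._parser = ToyParser()

    def location(self, content):
        tokens = self._scanner.tokenize(content)
        return self._parser.parse(tokens)


consumer = ToyFeatureConsumer()
first = consumer.location("join(1..10,20..30)")
second = consumer.location("complement(5..99)")
print(first, second)
```

Compared with module-level cached objects, this keeps the helpers private 
to each consumer, at the cost of bypassing the public scan and parse 
functions.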

If no one objects, I'll double-check this works (and is worthwhile) on 
my older, slower Windows machine, and check it in at some point next week.

Peter



