[Biopython-dev] Bio.GenBank FeatureParser vs RecordParser

Peter biopython-dev at maubp.freeserve.co.uk
Sun Sep 17 22:06:32 UTC 2006


Peter wrote:
> Peter wrote:
>> I've been looking at some timings for parsing GenBank files, in 
>> particular FeatureParser vs RecordParser
>>
>> The test file I'm using is one of the largest bacterial genomes; the 
>> GenBank file is almost 24MB:
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk
>>
>> On my nice new desktop:
>>
>> RecordParser takes about 5s to return a Bio.GenBank.Record object.
>>
>> FeatureParser takes about 45 to 50s to return a SeqRecord object.
>>
>> ...
>>
>> The other option (which I do plan to look into) is improving the 
>> location parser so that it doesn't cause such a slow down.
>>
> 
> I started this thread on the discussion list, but this follow up is 
> probably better off on the development list...
> 
> With the following fairly small change to Bio/GenBank/LocationParser.py 
> the time taken by the FeatureParser is almost halved (from about 45 to 
> 50s down to about 27 or 28s).
> 
> Old code:
> 
> def scan(input):
>      scanner = LocationScanner()
>      return scanner.tokenize(input)
> 
> def parse(tokens):
>      #print "I have", tokens
>      parser = LocationParser()
>      return parser.parse(tokens)
> 
> 
> New code:
> 
> _cached_scanner = LocationScanner()
> def scan(input):
>      return _cached_scanner.tokenize(input)
> 
> _cached_parser = LocationParser()
> def parse(tokens):
>      #print "I have", tokens
>      return _cached_parser.parse(tokens)
> 
> 
> These two functions are called for every feature by the location method 
> of the _FeatureConsumer class in Bio/GenBank/__init__.py
> 
> I checked that test_GenBank and test_GenBankFormat still pass.
> 
> My change means the LocationScanner() and LocationParser() objects are 
> created once and then reused - rather than being recreated for each feature.
> 
> Alternatively, the _FeatureConsumer could create its own copies of these 
> objects (once) and call them directly instead of using the scan and 
> parse functions.  This also works and takes a similar amount of time.
> 
> If no one objects, I'll double-check this works (and is worthwhile) on
> my older, slower Windows machine, and check it in at some point next week.

I still plan to check in the above fairly minor change.
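
To illustrate the pattern in isolation, here is a minimal sketch using a 
hypothetical stand-in class (ExpensiveScanner is not part of Biopython, and 
its setup cost is only simulated). It counts how many objects get built 
when the scanner is recreated per call versus created once and reused:

```python
class ExpensiveScanner:
    """Hypothetical stand-in for a class whose constructor does costly setup."""

    instances = 0  # how many scanners have been constructed so far

    def __init__(self):
        ExpensiveScanner.instances += 1
        # Simulated setup work: build a lookup table once per instance.
        self.table = {c: "DIGIT" for c in "0123456789"}

    def tokenize(self, text):
        return [self.table.get(c, "OTHER") for c in text]


def scan_uncached(text):
    # Old style: a fresh scanner is built for every call.
    return ExpensiveScanner().tokenize(text)


# New style: build the scanner once at import time and reuse it.
_cached_scanner = ExpensiveScanner()


def scan_cached(text):
    return _cached_scanner.tokenize(text)
```

Both functions return the same tokens; only the number of constructor 
calls differs. The reuse is only safe if the object keeps no state between 
calls, which is an assumption of this sketch.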

I've also looked deeper, and I have tweaked LocationParser.py to 
special-case the typical exact locations with regular expressions 
(falling back on the existing spark parser otherwise):

"123..456"
"function(123..456)" e.g. "complement(123..456)"

The above are enough for most bacteria. I then added:

"function(123..456,789..1066,1999..2006)" to cover joins,

and:

"function(function(123..456,789..1066,1999..2006))"

to cover the complement of joins for non-bacteria. With this in place 
the parsing time for the large example falls from about 27s to about 7s 
(compared to the 45s or more taken by the CVS edition of the parser).
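
As a rough sketch of the fast path idea (not the actual code): the 
function name, the patterns and the returned (functions, pairs) tuple are 
all made up for illustration, and "fallback" stands for whatever calls the 
existing full parser. It is written for current Python:

```python
import re

# Illustrative patterns for the exact location forms listed above.
_PAIR = r"(\d+)\.\.(\d+)"
_PAIR_LIST = r"\d+\.\.\d+(?:,\d+\.\.\d+)*"

_PLAIN = re.compile(r"^%s$" % _PAIR)
_ONE_FUNC = re.compile(r"^([a-z]+)\((%s)\)$" % _PAIR_LIST)
_TWO_FUNCS = re.compile(r"^([a-z]+)\(([a-z]+)\((%s)\)\)$" % _PAIR_LIST)


def _split_pairs(text):
    """Turn '1..2,3..4' into [(1, 2), (3, 4)]."""
    return [tuple(int(n) for n in part.split("..")) for part in text.split(",")]


def parse_location_fast(text, fallback):
    """Return (function_names, [(start, end), ...]) for the simple forms.

    Anything else (fuzzy positions, references to other records, deeper
    nesting) is handed to fallback(text) unchanged.
    """
    m = _PLAIN.match(text)
    if m:
        return ([], [(int(m.group(1)), int(m.group(2)))])
    m = _ONE_FUNC.match(text)
    if m:
        return ([m.group(1)], _split_pairs(m.group(2)))
    m = _TWO_FUNCS.match(text)
    if m:
        return ([m.group(1), m.group(2)], _split_pairs(m.group(3)))
    return fallback(text)
```

For example, parse_location_fast("complement(123..456)", fallback) returns 
(["complement"], [(123, 456)]) without calling fallback, while a fuzzy 
location such as "<123..456" is passed straight through to it.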

I'm not ready to check in this hybrid regular expression/spark parser, 
as I think it could be done more cleanly...

Peter



