[Bioperl-l] GenBankParser comparison to bioperl parser

John Kloss jkloss@sapiens.wustl.edu
Thu, 12 Sep 2002 13:22:05 -0700


The GenBankParser actually recreates its whole parse tree _after each
entry_.  If you look at the code, after an entry is parsed, I call a
reset.  Here is reset:

	sub reset { $_[0] = new $_[0] }

So if I'm going to parse the LOCUS line, the FEATURES table and the CDS
fields from the FEATURES table, after each entry is parsed I recreate
(malloc, new, whatever) a new GenBankParser, LOCUS parser, FEATURES
parser, and a new FEATURES::CDS parser for each CDS field found.  In an
average genbank flat file of about 30000 entries with two CDS fields per
entry, the GenBankParser will drop and recreate about 150000 parser
objects (not counting SVs at all).
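Concretely, the per-entry lifecycle is something like the following
sketch (the driver loop and read_entry are illustrative names, not the
actual module's API):

	# Hypothetical driver showing the drop-and-recreate strategy.
	while ( my $entry = read_entry($fh) ) {
	    $parser->parse($entry);
	    $parser->reset;   # $_[0] = new $_[0] - whole parser tree rebuilt
	}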

My code used to look like this:

	sub reset { foreach ( @{ $_[0] } ) { $_->reset } }

and each subparser had its own reset method.  By minimizing object
creation and destruction I found I could get a speed-up of a few
milliseconds, maybe.  In other words, saving objects and resetting their
state had _no_ effect whatsoever on the speed of the parser.

It is, as Aaron stated, all function calls.  I don't make many.  In
fact, I wrote my code specifically so that the perl compiler could
optimize away almost all function calls (hence the imports, BEGIN
blocks, and array accesses).  That's where the speed came from, mostly.
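The trick hinges on Perl inlining constant subs at compile time, so a
field access through a constant index compiles down to a plain array
lookup with no call overhead.  A toy illustration (not the actual
GenBankParser source):

	package Toy::Parser;
	use strict;
	# Field offsets as compile-time constants.  Constant subs are
	# inlined, so $self->[LOCUS] becomes a bare array access - no
	# method dispatch, no hash lookup.
	use constant { LOCUS => 0, FEATURES => 1 };

	sub new   { bless [ '', [] ], shift }
	sub locus { $_[0][LOCUS] }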

	John Kloss. 

-----Original Message-----
From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org]
On Behalf Of Aaron J Mackey
Sent: Thursday, September 12, 2002 10:41 AM
To: Bioperl
Subject: RE: [Bioperl-l] GenBankParser comparison to bioperl parser



[ trimmed the reply-to lines a bit ... ]

On Thu, 12 Sep 2002, Hilmar Lapp wrote:

> I'm sure that some of the parsing logic can be substantially improved
> both in readability and speed, but honestly I'd be very surprised if
> even the ultimately best regexp combined with the ultimately best
> parsing logic can speed up the whole thing by a factor of more than
> 2-3 fold.  It's the object tree construction that costs you the order
> of magnitude.

Yes (see the pICalculator thread for a little simple benchmarking of
SeqIO::fasta vs. pure-perl raw parsing - summary: 24 seconds vs. 0.5
seconds to read a 25000-sequence protein database).
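A rough harness in that spirit, using the core Benchmark module (file
name and iteration count are placeholders, not the thread's actual
test):

	use Benchmark qw(timethese);
	use Bio::SeqIO;

	timethese( 3, {
	    # Object-based parsing through Bio::SeqIO.
	    seqio => sub {
	        my $in = Bio::SeqIO->new( -file   => 'proteins.fasta',
	                                  -format => 'fasta' );
	        1 while $in->next_seq;
	    },
	    # Raw line-oriented loop: just count records.
	    raw   => sub {
	        open my $fh, '<', 'proteins.fasta' or die $!;
	        my $n = 0;
	        while (<$fh>) { $n++ if /^>/ }
	        close $fh;
	    },
	} );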

I don't believe it's object *construction* (i.e. malloc-ing new memory)
so much as all the function calls that are happening.  Having a pool of
objects is not going to help this at all (in fact, Perl is already
keeping pools of SVs around for you to use, so you're just duplicating
the effort if you go that route).  I repeat: look at the function calls,
and all the @ISA tree-walking ...

-Aaron


-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu


_______________________________________________
Bioperl-l mailing list
Bioperl-l@bioperl.org
http://bioperl.org/mailman/listinfo/bioperl-l