[Bioperl-l] GenBankParser comparison to bioperl parser

Hilmar Lapp hlapp@gnf.org
Thu, 12 Sep 2002 09:58:05 -0700


> -----Original Message-----
> From: Lincoln Stein [mailto:lstein@cshl.org]
> Sent: Thursday, September 12, 2002 6:21 AM
> To: Elia Stupka; Ewan Birney
> Cc: Ian Korf; John Kloss; bioperl-l@bioperl.org;
> gishlab@species.wustl.edu
> Subject: Re: [Bioperl-l] GenBankParser comparison to bioperl parser
> 
> 
> A separate repository is also fine with me, but I prefer 
> Bioperl-contrib, 
> because it should not just be for utility code, and nicely echoes the 
> "contrib" directory of the X Windows Consortium code distribution.
> 
> I'll put Boulder into a Bioperl-contrib if there is one.

West coast finally comes to work, looking amazed at one's inbox. I second Lincoln's suggestion; that is the way to go.

Adding modules that do the same thing as an existing one but with a different API, faster or not, is not going to be helpful. Nevertheless, I'm convinced John's parser is an extremely valuable contribution for people who parse 50MB GenBank files on a daily basis.

As some people pointed out very correctly, bioperl's generic design and unified API come at a price: you have to learn an API that does not look like the input file you're so familiar with, and creating all those objects costs execution time.

I'm sure that some of the parsing logic can be substantially improved both in readability and speed, but honestly I'd be very surprised if even the best possible regexps combined with the best possible parsing logic could speed up the whole thing by more than a factor of 2-3. It's the object tree construction that costs you the order of magnitude.

I think this is worth spending more thought on: how can object tree construction be sped up considerably in bioperl? I started thinking about 2 possible ways that may help: 1) Reusable object pools, from which clients can claim objects and release them for reuse when finished; object construction then becomes resetting the state of an already created object and setting its attributes. 2) Lazy object tree construction; if someone's not going to query $seq->top_SeqFeatures(), those feature objects would not have to be created, let alone the relevant chunks of input parsed.
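
To make the two ideas a bit more concrete, here is a minimal sketch of how they might fit together. It is written in Python rather than Perl only to keep it short, and it is an illustration under assumptions, not a proposal for the actual bioperl API: the names Feature, FeaturePool, LazySeq and top_seq_features are all hypothetical, and the "tag start end" feature lines are a made-up format, not GenBank syntax.

```python
class Feature:
    """Stand-in for a sequence feature object (hypothetical, not a bioperl class)."""

    def __init__(self):
        self.reset()

    def reset(self, primary_tag=None, start=None, end=None):
        # "Construction" of a pooled object is just overwriting its state.
        self.primary_tag = primary_tag
        self.start = start
        self.end = end
        return self


class FeaturePool:
    """Idea 1: hand out recycled objects instead of constructing new ones."""

    def __init__(self):
        self._free = []
        self.constructed = 0  # counts real constructions, for illustration

    def claim(self, **attrs):
        if self._free:
            feat = self._free.pop()
        else:
            feat = Feature()
            self.constructed += 1
        return feat.reset(**attrs)

    def release(self, feat):
        # Clear the state so stale attributes cannot leak to the next client.
        self._free.append(feat.reset())


class LazySeq:
    """Idea 2: keep the raw feature text and parse it only when asked."""

    def __init__(self, raw_features, pool):
        self._raw = raw_features  # unparsed lines in an assumed toy format
        self._pool = pool
        self._features = None     # nothing built yet

    def features_built(self):
        return self._features is not None

    def top_seq_features(self):
        if self._features is None:
            self._features = []
            for line in self._raw:
                tag, start, end = line.split()
                self._features.append(
                    self._pool.claim(primary_tag=tag, start=int(start), end=int(end))
                )
        return self._features


pool = FeaturePool()
seq = LazySeq(["gene 10 200", "CDS 50 180"], pool)
built_before = seq.features_built()  # no feature objects exist at this point
feats = seq.top_seq_features()       # parsing and object creation happen here
```

A client that only wants the sequence never pays for the features, and a client that releases features back to the pool after use lets the next record reuse them. The catch with pooling is that a released object must not be held on to anywhere else, which is a contract the API would have to spell out.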

This isn't meant as the ultimate wisdom, but rather to instigate discussion and brainstorming about what makes the most sense.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------