[Bioperl-l] SearchIO speed up

Fri Aug 11 23:39:35 UTC 2006

Amir,

The ability to customize your Sequence objects when parsing Genbank files is
already available:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Customizing_Sequence_Object_Construction

Not available for the 'embl' format, however.

Brian O.

On 8/11/06 9:06 AM, "Amir Karger" <akarger at CGR.Harvard.edu> wrote:

> Let me add my voice to the adulation here. IMO, the two main reasons
> Bioperl hasn't achieved world domination are (a) it's so huge that it's
> hard to find what you want, which the HOWTOs help with, and (b) it's so
> darn slow. Speedup is most definitely a Good Thing, and I'm sure that
> the vast majority of BLAST hits are ignored in the vast majority of
> cases, where you're just looking for hits where some criterion meets a
> certain threshold or something. It's unlikely that people want the full
> alignment for all 100k or whatever hits. (This is why I just use blast
> -m8: no parser required, and all you lose is the alignment.)
> 
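The "blast -m8, no parser required" approach above amounts to splitting tab-separated lines and applying a threshold. A minimal sketch, assuming the standard 12-column tabular layout; the field names, the thresholds, and the sample rows are made up for illustration and are not part of BLAST or Bioperl:

```python
import io

# The 12 columns of legacy BLAST -m8 tabular output, in order.
# The names used here are illustrative labels chosen for this sketch.
M8_FIELDS = (
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score",
)


def filter_hits(handle, max_evalue=1e-5, min_identity=0.0):
    """Yield a dict for each hit that passes the (hypothetical) thresholds."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(M8_FIELDS):
            continue  # skip malformed rows rather than guess at them
        hit = dict(zip(M8_FIELDS, cols))
        hit["evalue"] = float(hit["evalue"])
        hit["pct_identity"] = float(hit["pct_identity"])
        hit["bit_score"] = float(hit["bit_score"])
        if hit["evalue"] <= max_evalue and hit["pct_identity"] >= min_identity:
            yield hit


# Made-up sample rows, for demonstration only.
SAMPLE = (
    "q1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-50\t370\n"
    "q1\tsubjB\t45.00\t80\t40\t4\t5\t84\t300\t379\t0.5\t30.1\n"
    "q2\tsubjC\t88.00\t150\t18\t0\t1\t150\t1\t150\t2e-20\t210\n"
)

for hit in filter_hits(io.StringIO(SAMPLE), max_evalue=1e-5):
    print(hit["query"], hit["subject"], hit["evalue"])
```

Nothing is built for hits that fail the threshold beyond one split per line, which is the trade-off described above: the alignment is lost, but so is nearly all the per-hit cost.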
> Anyway, in your spare time, maybe you could do similar speedups for other
> pieces of Bioperl? My personal favorite would be the GenBank/EMBL
> parsers. The fungal genome ORF files I'm working with are only 20M or
> so, but using Bioperl to work with them takes so much longer than with
> non-Bioperl on the 6M FASTA files for other genomes. I have to imagine
> it's mostly creating objects for the gazillion tags, 90% of which I
> never peek at.
> 
> I know, you folks are busy, and I should be volunteering to do it
> myself. But you can at least consider it a user request.
> 
> - Amir Karger
> Research Computing
> Bauer Center for Genomics Research
> Harvard University
> 
>> -----Original Message-----
>> From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com]
>> Sent: Thursday, August 10, 2006 1:40 PM
>> To: Sendu Bala
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] SearchIO speed up
>> 
>>> ...Except I need to know if the community considers the speed problem
>>> solved or not. More radical changes will make SearchIO even faster, e.g.
>>> Chris Fields and Jason (if I interpret the Project priority list item
>>> correctly) have suggested an end to individual Hit and HSP objects,
>>> which become just data members of a Result-like object. Ideally I don't
>>> want to go down that route because we lose quite a bit of OO power;
>> 
>> As already mentioned, a lazy-evaluation approach would also work.
>> 
>> Jason and I did once talk about an entirely new parsing/object-building
>> framework, based on nested grammars; in essence, the "top-level" parser
>> simply "chunks" the input into blobs of (minimally parsed) text that
>> correspond to the top-level result object.  This chunk/blob is the input
>> to the next-level parser for Hits, which in turn has chunks for HSPs.
>> Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same
>> Generic*I-implementing objects we're already using.  Thus, if HSPs are
>> never interrogated, they're never parsed; as soon as one is interrogated,
>> it gets parsed, and so on.  In such an environment, you can imagine
>> flyweight objects that are built very quickly/easily (recall that many
>> previous analyses found BioPerl speed problems to be related not so much
>> to parsing as to heavy-weight object creation).
>> 
>> I happen to have such a nested parser lying around for
>> Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C
>> parser backend (yet another experiment in trying to get SearchIO to run
>> faster), so really isn't ready for prime time (being entirely untested,
>> and probably not even finished).
>> 
>> -Aaron
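The nested "chunk now, parse when interrogated" design described above can be sketched in a few lines. This is only an illustration of the idea under stated assumptions, not the Bio::SearchIO implementation or Aaron's parser: the RESULT/HIT/HSP line format, the class names, and the `parsed_count` counter are all invented for the demo.

```python
def split_on(text, marker):
    """Group lines into blocks, each starting at a line that begins with marker."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(marker):
            blocks.append([line])
        elif blocks:
            blocks[-1].append(line)  # lines before the first marker are dropped
    return ["\n".join(block) for block in blocks]


class HSP:
    parsed_count = 0  # demo instrumentation: how many HSPs were really parsed

    def __init__(self, text):
        self._text = text      # raw, unparsed chunk
        self._fields = None

    def _parse(self):
        if self._fields is None:  # parse at most once, on first access
            HSP.parsed_count += 1
            pairs = self._text.split()[1:]
            self._fields = dict(pair.split("=", 1) for pair in pairs)
        return self._fields

    @property
    def score(self):
        return float(self._parse()["score"])

    @property
    def evalue(self):
        return float(self._parse()["evalue"])


class Hit:
    def __init__(self, text):
        self._text = text
        self._hsps = None

    @property
    def name(self):
        # Cheap: only the header line is examined.
        return self._text.split("\n", 1)[0].split(None, 1)[1]

    @property
    def hsps(self):
        if self._hsps is None:  # chunk into HSP blobs, but do not parse them
            self._hsps = [HSP(chunk) for chunk in split_on(self._text, "HSP ")]
        return self._hsps


class Result:
    def __init__(self, text):
        self._text = text
        self._hits = None

    @property
    def query(self):
        return self._text.split("\n", 1)[0].split(None, 1)[1]

    @property
    def hits(self):
        if self._hits is None:
            self._hits = [Hit(chunk) for chunk in split_on(self._text, "HIT ")]
        return self._hits


# Invented toy report format, for demonstration only.
REPORT = """RESULT q1
HIT subjA
HSP score=50.0 evalue=1e-10
HSP score=20.0 evalue=0.5
HIT subjB
HSP score=12.0 evalue=3.0
RESULT q2
HIT subjC
HSP score=99.0 evalue=1e-30
"""

results = [Result(chunk) for chunk in split_on(REPORT, "RESULT ")]
hit_names = [hit.name for result in results for hit in result.hits]
parsed_before = HSP.parsed_count          # no HSP has been interrogated yet
top_score = results[0].hits[0].hsps[0].score
parsed_after = HSP.parsed_count           # only the one HSP we asked about
```

Listing every hit name costs no HSP parsing at all, and asking one HSP for its score parses only that HSP. The objects keep their usual Result/Hit/HSP shape, so the OO interface is preserved while the expensive work is deferred.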