[Bioperl-l] SearchIO speed up
Brian Osborne
osborne1 at optonline.net
Fri Aug 11 23:39:35 UTC 2006
Amir,
The ability to customize your Sequence objects when parsing Genbank files is
already available:
http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Customizing_Sequence_Object_Construction
Not available for the 'embl' format, however.
Brian O.
On 8/11/06 9:06 AM, "Amir Karger" <akarger at CGR.Harvard.edu> wrote:
> Let me add my voice to the adulation here. IMO, the two main reasons
> Bioperl hasn't achieved world domination are (a) it's so huge that it's
> hard to find what you want, which the HOWTOs help with, and (b) it's so
> darn slow. Speedup is most definitely a Good Thing, and I'm sure that
> the vast majority of BLAST hits are ignored in the vast majority of
> cases, where you're just looking for hits where some criterion meets a
> certain threshold or something. It's unlikely that people want the full
> alignment for all 100k or whatever hits. (This is why I just use blast
> -m8: no parser required, and all you lose is the alignment.)
>
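To make the "no parser required" point concrete, here is a minimal sketch of threshold filtering on tabular output. It is written in Python purely for illustration and is not part of Bioperl. It assumes the standard 12-column layout of legacy BLAST `-m 8` output; the function name `filter_hits`, the field names, and the sample lines are made up for this example.

```python
import io

# Column order of legacy BLAST "-m 8" tabular output (12 tab-separated fields).
# The short field names are labels chosen for this sketch.
M8_FIELDS = (
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score",
)


def filter_hits(handle, max_evalue=1e-5):
    """Yield one dict per hit whose e-value is at or below max_evalue."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(M8_FIELDS):
            continue  # skip lines that do not have exactly 12 fields
        hit = dict(zip(M8_FIELDS, cols))
        hit["evalue"] = float(hit["evalue"])
        hit["bit_score"] = float(hit["bit_score"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Two invented hits: one strong, one weak.
SAMPLE = (
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
    "q1\ts2\t35.00\t60\t39\t2\t10\t69\t1\t58\t0.5\t28.1\n"
)

sample_hits = list(filter_hits(io.StringIO(SAMPLE)))
for hit in sample_hits:
    print(hit["query"], hit["subject"], hit["evalue"])
```

Because each hit is a single line, no objects are built for hits that fail the threshold; the cost, as noted above, is that the alignment itself is not available.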
> Anyway, in your spare time, maybe you could do similar speedups for other
> pieces of Bioperl? My personal favorite would be the GenBank/EMBL
> parsers. The fungal genome ORF files I'm working with are only 20M or
> so, but using Bioperl to work with them takes so much longer than with
> non-Bioperl on the 6M FASTA files for other genomes. I have to imagine
> it's mostly creating objects for the gazillion tags, 90% of which I
> never peek at.
>
> I know, you folks are busy, and I should be volunteering to do it
> myself. But you can at least consider it a user request.
>
> - Amir Karger
> Research Computing
> Bauer Center for Genomics Research
> Harvard University
>
>> -----Original Message-----
>> From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com]
>> Sent: Thursday, August 10, 2006 1:40 PM
>> To: Sendu Bala
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] SearchIO speed up
>>
>>> ...Except I need to know if the community considers the speed problem
>>> solved or not. More radical changes will make SearchIO even faster, eg.
>>> Chris Fields and Jason (if I interpret the Project priority list item
>>> correctly) have suggested an end to individual Hit and HSP objects,
>>> which become just data members of a Result-like object. Ideally I don't
>>> want to go down that route because we lose quite a bit of OO power;
>>
>> As already mentioned, a lazy-evaluation approach would also work.
>>
>> Jason and I did once talk about an entirely new parsing/object-building
>> framework, based on nested grammars; in essence, the "top-level" parser
>> simply "chunks" the input into blobs of (minimally parsed) text that
>> correspond to the top-level result object. This chunk/blob is the input
>> to the next-level parser for Hits, which in turn has chunks for HSPs.
>> Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same
>> Generic*I-implementing objects we're already using. Thus, if HSPs are
>> never interrogated, they're never parsed; as soon as one is interrogated,
>> it gets parsed, and so on. In such an environment, you can imagine
>> flyweight objects that are built very quickly/easily (recall that many
>> previous analyses of BioPerl speed problems pointed not so much to
>> parsing as to heavy-weight object creation).
>>
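The chunk-then-parse-on-demand idea described above can be sketched roughly as follows. This is an illustration in Python, not Bioperl code and not the nested-grammar framework itself: the `RESULT`/`HIT`/`HSP` report format, the class names, and the helper `split_chunks` are all invented for the example. Each object keeps its unparsed text and only parses the next level down when that level is first asked for.

```python
from functools import cached_property


def split_chunks(lines, marker):
    """Group lines into chunks, each starting at a line that begins with marker."""
    chunks = []
    for line in lines:
        if line.startswith(marker):
            chunks.append([line])
        elif chunks:
            chunks[-1].append(line)
    return chunks


class Hit:
    def __init__(self, lines):
        self._lines = lines              # unparsed text for this hit
        self.name = lines[0].split()[1]  # only the header line is read eagerly

    @cached_property
    def hsps(self):
        """Parse the HSP lines on first access; the result is cached."""
        parsed = []
        for line in self._lines[1:]:
            if not line.startswith("HSP"):
                continue
            fields = dict(item.split("=") for item in line.split()[1:])
            parsed.append({
                "score": float(fields["score"]),
                "evalue": float(fields["evalue"]),
            })
        return parsed


class Result:
    def __init__(self, lines):
        self._lines = lines              # unparsed text for this result
        self.query = lines[0].split()[1]

    @cached_property
    def hits(self):
        """Chunk the text into Hit objects on first access; HSPs stay unparsed."""
        return [Hit(chunk) for chunk in split_chunks(self._lines[1:], "HIT")]


def parse_results(text):
    """Top-level parser: only splits the input into per-result chunks."""
    for chunk in split_chunks(text.splitlines(), "RESULT"):
        yield Result(chunk)


# Invented toy report, used only to exercise the sketch.
REPORT = (
    "RESULT query1\n"
    "HIT subjA\n"
    "HSP score=50 evalue=1e-10\n"
    "HSP score=20 evalue=0.1\n"
    "HIT subjB\n"
    "HSP score=30 evalue=1e-3\n"
)

results = list(parse_results(REPORT))
first = results[0]
print(first.query, [hit.name for hit in first.hits])
```

In this sketch, listing the hit names never touches the HSP lines; they are parsed only if some code reads `hit.hsps`. Whether this helps in practice depends on how much of the cost is object creation rather than parsing, which is the caveat raised above.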
>> I happen to have such a nested parser lying around for
>> Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C
>> parser backend (yet another experiment in trying to get SearchIO to run
>> faster), so really isn't ready for prime time (being entirely untested,
>> and probably not even finished).
>>
>> -Aaron
>>
>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l