[Bioperl-l] SearchIO Performance

Fri Mar 21 20:13:00 UTC 2008

Hi. I am pretty new to BioPerl and have a question about the
performance of BLAST (nucleotide) report parsing. My BLAST result
files usually contain around 100 or more sequence hits, and each
sequence is about 1400 nucleotides long.

After profiling my code, I find that calling next_result() after
creating a SearchIO object takes substantially longer than the
non-OO, quick-and-dirty code I currently use to parse the same BLAST
files.

How much longer? The existing code takes about 0.25 seconds, and the
BioPerl call takes about 4.5 seconds. That is a dramatic difference,
and it becomes significant when I have to parse 30 BLAST files in a
row. I understand that SearchIO parses the entire file and stores it
all for easy retrieval later, and maybe this time penalty is the
price I have to pay for that convenience and organization.

I am just wondering if there is anything other than writing custom  
code based on BioPerl to speed this up. Something I might not be aware  
of that I can do ahead of time, or during parsing, to limit what is  
parsed, or to speed up the parsing process. For instance, is there a
way to "look ahead" and parse only the alignments that meet a specific
E-value (expect value) cutoff?
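
For reference, the kind of measurement described above can be sketched
roughly as follows. This is a minimal sketch, not my actual code: the
command-line file argument and the 1e-10 cutoff are placeholders chosen
for illustration. Note that the filter here is applied only after
next_result() has already parsed the report, so it does not avoid the
parsing cost I am asking about.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SearchIO;
use Time::HiRes qw(gettimeofday tv_interval);

# Placeholder inputs (assumptions for illustration only).
my $file   = shift @ARGV or die "usage: $0 blast_report.txt\n";
my $cutoff = 1e-10;    # hypothetical E-value cutoff

my $in = Bio::SearchIO->new(-format => 'blast', -file => $file);

# Time only the next_result() call, which is where the profiler
# showed most of the time being spent.
my $t0     = [gettimeofday];
my $result = $in->next_result;
printf "next_result() took %.3f s\n", tv_interval($t0);

die "no result found in $file\n" unless defined $result;

# Filtering happens after parsing: hits above the cutoff are skipped,
# but they have already been parsed by this point.
while (my $hit = $result->next_hit) {
    next if $hit->significance > $cutoff;
    printf "%s\t%g\n", $hit->name, $hit->significance;
}
```

Run it as `perl time_blast.pl report.txt` (the script name is also
just a placeholder).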

I confess I have not read the documentation thoroughly (although  
obviously enough to make it do what I want), but am certainly willing  
to do so if someone can point me in the right direction.

Thanks

Albion