[Bioperl-l] SearchIO speed up

Torsten Seemann torsten.seemann at infotech.monash.edu.au
Thu Aug 10 23:45:33 UTC 2006


> So the only lazyness you invoke is the object instantiation (but you've 
> already done all the parsing).
> 
> My proposal involves the "chunks" being unparsed, raw text "blobs", that 
> are essentially blessed into a package that does the parsing only when 
> necessary (and even then, might choose different parsing strategies, based 
> on what's been asked for).  Thus a potentially large amount of parsing and 
> storage is skipped.  Additionally, you now have the option of not even 
> storing the blobs in memory, just file seek pointers (requiring temp. 
> storage for streaming pipe data sources), and thus can process very large 
> reports without consuming memory (currently a problem).

This approach is an excellent one, but not all file formats lend themselves to 
it. BLAST results have a semantically hierarchical layout, and the BLAST XML 
report syntax matches that layout, so the approach is well suited. Traditional 
BLAST reports are pretty similar too, i.e. most of the data for a low-level 
object is encapsulated within a certain part of the input file.
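
To make the idea concrete, here is a rough sketch of the "seek pointers plus 
parse on demand" scheme, in Python rather than Perl purely for brevity. It is 
not SearchIO code: the record format (a ">" line followed by "key = value" 
lines) and the names LazyHit and index_hits are made up for illustration. The 
point is only that each object stores a byte offset and a length, and nothing 
is parsed until a field is actually requested.

```python
import io


class LazyHit:
    """Stores only a byte range in the report; parses it on first access."""

    def __init__(self, handle, start, length):
        self._handle = handle
        self._start = start
        self._length = length
        self._fields = None  # stays None until a field is requested

    @property
    def parsed(self):
        return self._fields is not None

    def get(self, key):
        if self._fields is None:
            # Fetch the raw blob from the file and parse it only now.
            self._handle.seek(self._start)
            raw = self._handle.read(self._length).decode("ascii")
            self._fields = {}
            for line in raw.splitlines():
                if "=" in line:
                    name, value = line.split("=", 1)
                    self._fields[name.strip()] = value.strip()
        return self._fields[key]


def index_hits(handle):
    """Single pass that records where each '>' record starts and ends."""
    hits = []
    start = None
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            if start is not None:
                hits.append(LazyHit(handle, start, pos - start))
            start = pos
    if start is not None:
        hits.append(LazyHit(handle, start, handle.tell() - start))
    return hits


# Hypothetical two-record report, opened in binary mode so that
# offsets are plain byte counts.
report = io.BytesIO(
    b">hit1\nscore = 42\nevalue = 1e-5\n"
    b">hit2\nscore = 17\nevalue = 0.3\n"
)
hits = index_hits(report)
second_score = hits[1].get("score")  # only hit2 is parsed at this point
```

This only works because each record is self-contained in one contiguous 
region of the file, and because the handle is seekable. A piped source would 
first have to be spooled to temporary storage, as noted in the quoted proposal.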

However, this may not be true for other formats, perhaps HMMER reports, where 
"HSP"-related info may be spread across multiple sections of the file.

But of course, this doesn't prevent us using the approach where suitable, and 
using the "slow" method otherwise.

-- 
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia


