[Bioperl-l] Re: [Bioclusters] BioPerl and memory handling
Jason Stajich
jason.stajich at duke.edu
Tue Nov 30 08:46:19 EST 2004
That's true - it does create a lot of objects for all the components of
the report. When you have 2000 hits it needs to build quite a few
objects, and it builds them all for a single result. Steve had a lazy
parser implementation at one point, but that was more for speed when
you didn't want to actually see the HSP details for every hit.
I second Ian's comment - I also use the tabular output from BLAST when
dealing with large datasets. SearchIO is intended to give you access
to all of the data in the report, so there is an overhead in that.
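For the tabular route, here is a minimal sketch in plain Perl (no
BioPerl needed) of pulling fields out of blastall -m 8 style output -
one tab-separated line per HSP with 12 standard columns (query,
subject, % identity, alignment length, mismatches, gap opens,
query start/end, subject start/end, e-value, bit score). The example
line and values are made up for illustration:

```perl
use strict;
use warnings;

# Each line of BLAST tabular (-m 8) output is one HSP with 12
# tab-separated columns, so parsing is just a split - memory use
# stays flat no matter how many hits a query has.
sub parse_m8_line {
    my ($line) = @_;
    chomp $line;
    my @col = split /\t/, $line;
    return {
        query   => $col[0],  subject => $col[1],
        pid     => $col[2],  alnlen  => $col[3],
        evalue  => $col[10], bits    => $col[11],
    };
}

# Example line (made-up values, for illustration only):
my $hsp = parse_m8_line("q1\ts1\t98.00\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200\n");
print "$hsp->{query} hits $hsp->{subject} at e-value $hsp->{evalue}\n";
```

You would typically loop over the file line by line and filter on
e-value or bit score before keeping anything, so only the hits you
care about ever get stored.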
There are a couple of workarounds depending on what kind of data you
want. We designed SearchIO to be a modular system which separates
parsing the data from instantiating objects by throwing events (like
SAX) and having a listener build objects from these events. One can
instantiate a different listener which builds simpler objects or throws
away the data you don't want. At some point I hope we can build some
light-weight Result/Hit/HSP objects and a listener which creates these
instead of full-fledged bioperl objects. You can build your own
listener object - SearchResultEventBuilder and FastHitEventBuilder are
two implementations, and you can specify the type of Result/Hit/HSP
objects that are created by the listeners. It might be easiest to
create some lightweight Hit and HSP objects and have
SearchResultEventBuilder create these instead of the default
full-fledged ones. At some point, though, if you are getting 5-10k
hits I don't think the parser is going to play nicely, as it wasn't
really engineered with this extreme case in mind.
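To illustrate the event/listener split in miniature - note this is
just the shape of the pattern, not BioPerl's actual interface; the
class and method names below are invented - the parser throws events
as it reads, and the listener decides how much state to keep:

```perl
use strict;
use warnings;

# A toy listener: keeps only hit names, discarding everything else the
# parser reports. These class/method names are made up to show the
# pattern; they are NOT the Bio::SearchIO event interface.
package TinyListener;
sub new       { bless { hits => [] }, shift }
sub start_hit { my ($self, $name) = @_; push @{ $self->{hits} }, $name }
sub hsp_data  { }    # ignore HSP events entirely - this is the saving
sub hit_names { @{ $_[0]->{hits} } }

# A toy "parser" that fires events at whatever listener it is given.
package TinyParser;
sub new   { my ($class, $listener) = @_; bless { l => $listener }, $class }
sub parse {
    my ($self, @lines) = @_;
    for my $line (@lines) {
        if    ($line =~ /^HIT\s+(\S+)/) { $self->{l}->start_hit($1) }
        elsif ($line =~ /^HSP/)         { $self->{l}->hsp_data($line) }
    }
}

package main;
my $listener = TinyListener->new;
TinyParser->new($listener)->parse("HIT hitA", "HSP 1..100", "HIT hitB");
print join( ",", $listener->hit_names ), "\n";    # hitA,hitB
```

Swapping in a different listener changes what gets built without
touching the parser at all, which is exactly why a lighter-weight
Result/Hit/HSP builder can drop in so easily.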
Now the whole parser/listener design assumes that you want to process
all the data for a result before moving on to the next one - at least
from the listener's standpoint this means you have to store all the
data you just got from the parser - whether this is in memory, or
potentially stored in a tempfile/temp dbfile would be up to the
implementation.
Here is an example of how you can provide a different listener -
FastHitEventBuilder just throws away the HSPs and only builds Result
and Hit objects.
use Bio::SearchIO;
use Bio::SearchIO::FastHitEventBuilder;
my $searchio = Bio::SearchIO->new(-format => $format, -file => $file);
$searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
while( my $r = $searchio->next_result ) {
    while( my $h = $r->next_hit ) {
        # note that Hits will NOT have HSPs
    }
}
On Nov 30, 2004, at 5:59 AM, Michael Maibaum wrote:
> On Tue, Nov 30, 2004 at 01:24:24AM -0800, Steve Chervitz wrote:
>> Regarding SearchIO memory usage, I don't think this has been an issue
>> before, so I wonder if there is something about the installation or
>> specific
>> usage of it that is leading to memory hogging. I've run it over large
>> numbers of reports without noticing troubles. It would be useful to
>> see a
>> sample report + script using SearchIO that leads to the memory
>> troubles, so
>> we can try to reproduce it.
>
>
> FWIW - I at least didn't have a problem parsing many thousands of
> results in a stream with SearchIO - I had a problem with parsing
> certain specific result sets. Essentially anything with about 2000
> hits and alignments (or more) for a single query would kill a Linux
> box with 1 gig of RAM (it would thrash VM to death). These would run
> on an Opteron 16-gig box and used >8 gig of RAM in some cases.
> As far as I can see the majority of the memory was then returned when
> BioPerl moved on to the next record. The issue is that it takes a
> rather large amount of RAM for an individual record, and I assumed
> (rightly or wrongly) that BioPerl slurps up the entire record and
> builds the objects representing it as a whole, hence the large RAM
> usage. It may be that the objects to represent 2000+ hits are just
> very (unreasonably?) large.
>
> Michael
>
>
--
Jason Stajich
jason.stajich at duke.edu
http://www.duke.edu/~jes12/