[Bioperl-l] Re: [Bioclusters] BioPerl and memory handling
Ian Korf
iankorf at mac.com
Tue Nov 30 02:42:23 EST 2004
I tried your example (fixing the syntax error of $sel->{'foo'} to
$self->{'foo'}) and I find that I get back half the memory after undef,
which is exactly the behavior I described. This could be a difference
in Perl versions. What version of Perl are you using? perl -v gives me:
This is perl, v5.8.1-RC3 built for darwin-thread-multi-2level
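
For reference, here is the corrected test as I ran it (the memory
figures are the ones reported in this thread and will vary by system):

use strict;

package Test;

sub new {
    my $class = shift;
    my $self  = {};
    bless $self, $class;
    $self->{'foo'} = 'N' x 100000000;   # about 100 MB of string data
    return $self;
}

package main;

my $ob = Test->new();   # footprint is roughly 197 MB at this point
undef $ob;              # on my system about half of that comes back
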
On Nov 29, 2004, at 8:12 PM, Malay wrote:
> Thanks Ian for your mail. But you have missed a major point of the
> original discussion. What happens to an object? So I did the same test
> that you did, but using an object. Here is the result.
>
> use strict;
> package Test;
>
> sub new {
> my $class =shift;
> my $self = {};
> bless $self, $class;
> $sel->{'foo'} = 'N' x 100000000;
> return $self;
> }
>
> package main;
>
> my $ob = Test->new(); #uses 197 MB as you said.
>
> undef $ob; ## still uses 197 MB ???!!!!
>
> This was the original point. Perl never releases memory from the
> initial object creation. In fact, try doing this in whatever way
> possible, reusing references or undeffing it: the memory usage will
> never go down below 197 MB for the duration of the program's
> execution.
>
> So I humbly differ on building any elaborate in-memory object
> hierarchy in Perl. The language is not meant for that. But I am
> nobody; stalwarts will differ in opinion.
>
> -Malay
>
> Ian Korf wrote:
>
>> After a recent conversation about memory in Perl, I decided to do
>> some actual experiments. Here's the email I composed on the subject.
>>
>>
>> I looked into the Perl memory issue. It's true that if you allocate
>> a huge amount of memory, Perl doesn't like to give it back. But it's
>> not as bad a situation as you might think. Let's say you do
>> something like
>>
>> $FOO = 'N' x 100000000;
>>
>> That will allocate a chunk of about 192 MB on my system. It doesn't
>> matter if this is a package variable or lexical.
>>
>> our $FOO = 'N' x 100000000; # 192 MB
>> my $FOO = 'N' x 100000000; # 192 MB
>>
>> If you put this in a subroutine
>>
>> sub foo {my $FOO = 'N' x 100000000}
>>
>> and you call this a bunch of times
>>
>> foo(); foo(); foo(); foo(); foo(); foo(); foo();
>>
>> the memory footprint stays at 192 MB. So Perl's garbage collection
>> works just fine. Perl doesn't let go of the memory it has taken from
>> the OS, but it is happy to reassign the memory it has reserved.
>>
>> Here's something odd. The following labeled block looks like it
>> should use no memory.
>>
>> BLOCK: {
>> my $FOO = 'N' x 100000000;
>> }
>>
>> The weird thing is that after executing the block, the memory
>> footprint is still 192 MB as if it hadn't been garbage collected.
>>
>> Now look at this:
>>
>> my $foo = 'X' x 100000000;
>> undef $foo;
>>
>> This has a memory footprint of 96 MB. After some more
>> experimentation, I have come up with the following interpretation of
>> memory allocation and garbage collection in Perl. Perl will reuse
>> memory for a variable of a given name (either package or lexical
>> scope), so there is no fear of memory leaks in loops, for example.
>> But each differently named variable retains its own minimum memory.
>> That minimum is the size of the largest chunk ever allocated to that
>> variable, or half that amount if other variables have already taken
>> some of that space. You can get any variable to give up half its
>> memory with undef, but that costs a little more CPU time.
>> Here's some test code that shows this behavior.
>>
>> sub foo {my $FOO = 'N' x 100000000}
>> for (my $i = 0; $i < 50; $i++) {foo()} # 29.420u 1.040s
>>
>> sub bar {my $BAR = 'N' x 100000000; undef $BAR}
>> for (my $i = 0; $i < 50; $i++) {bar()} # 26.880u 21.220s
>>
>> The increase from 1 sec to 21 sec system CPU time is all the extra
>> memory allocation and freeing associated with the undef statement.
>> Why the user time is less in the undef example is a mystery to me.
>>
>> OK, to make a hideously long story short, use undef to save memory
>> and use the same variable name over and over if you can.
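>>
>> As a sketch of what that advice looks like in practice (read_report
>> and @files are just placeholder names):
>>
>> my $buffer;                       # one name, reused on every pass
>> foreach my $file (@files) {
>>     $buffer = read_report($file); # reuses the memory already reserved
>>     # ... work with $buffer ...
>> }
>> undef $buffer;                    # give back what Perl will release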
>>
>> ---
>>
>> But this email thread has turned to BPlite, of which I am the
>> original author. BPlite is designed to parse a stream and reads only
>> a minimal amount of information at a time. The disadvantage of this
>> is that if you want to know something about the statistics, you
>> can't get them until the end of the report (the original BPlite
>> ignored statistics entirely). I like the new SearchIO interface
>> better than BPlite, but for my own uses I generally work from
>> tabular output and don't use a BLAST parser very often.
>>
>> -Ian
>>
>> On Nov 29, 2004, at 3:03 PM, Mike Cariaso wrote:
>>
>>> This message is being cross posted from bioclusters to
>>> bioperl. I'd appreciate a clarification from anyone in
>>> bioperl who can speak more authoritatively than my
>>> semi-speculation.
>>>
>>>
>>> Perl does have a garbage collector. It is not wildly
>>> sophisticated. As you've suggested, it uses simple
>>> reference counting. This means that circular
>>> references will cause memory to be held until program
>>> termination.
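>>>
>>> (A minimal illustration, not taken from any BioPerl code: two
>>> hashes that point at each other never drop to a reference count of
>>> zero unless one of the links is weakened with Scalar::Util's
>>> weaken.)
>>>
>>> use Scalar::Util qw(weaken);
>>>
>>> my $node_a = {};
>>> my $node_b = { peer => $node_a };
>>> $node_a->{peer} = $node_b;   # circular: counts never reach zero
>>> weaken($node_a->{peer});     # weak link; the cycle can now be freed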
>>>
>>> However I think you are overstating the inefficiency
>>> in the system. While the Perl GC *may* not release
>>> memory to the system, it does at least allow memory to
>>> be reused within the process.
>>>
>>> If the system instead behaved as you describe, I think
>>> perl would hemorrhage memory and would be unsuitable
>>> for any long running processes.
>>>
>>> However I can say with considerable certainty that
>>> BPLite is able to handle BLAST reports which
>>> cause SearchIO to thrash. I've attributed this to
>>> BPLite being a true stream processor, while SearchIO
>>> seems to slurp the whole file and object hierarchy
>>> into memory.
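>>>
>>> (For what it's worth, this is roughly how I drive SearchIO, in case
>>> the difference comes from my usage rather than the module itself;
>>> the file name is made up:)
>>>
>>> use Bio::SearchIO;
>>>
>>> my $in = Bio::SearchIO->new(-format => 'blast',
>>>                             -file   => 'huge_report.bls');
>>> while (my $result = $in->next_result) {    # one result at a time
>>>     while (my $hit = $result->next_hit) {
>>>         while (my $hsp = $hit->next_hsp) {
>>>             # work on each HSP here
>>>         }
>>>     }
>>> }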
>>>
>>> I know that SearchIO is the preferred BLAST parser, but
>>> it seems that BPLite is not quite dead, for the
>>> reasons above. If this is in fact the unique benefit of
>>> BPLite, perhaps the documentation should be clearer
>>> about this, as I suspect I'm not the only person to
>>> have had to re-engineer a substantial piece of code to
>>> adjust between their different models. Had I known of
>>> this difference early on, I would have chosen BPLite.
>>>
>>> So, bioperlers (especially Jason Stajich), can you shed
>>> any light on this vestigial Bioperl organ?
>>>
>>>
>>>
>>> --- Malay <mbasu at mail.nih.gov> wrote:
>>>
>>>> Michael Cariaso wrote:
>>>>
>>>>> Michael Maibaum wrote:
>>>>>
>>>>>>
>>>>>> On 10 Nov 2004, at 18:25, Al Tucker wrote:
>>>>>>
>>>>>>> Hi everybody.
>>>>>>>
>>>>>>> We're new to the Inquiry Xserve scientific cluster and trying
>>>>>>> to iron out a few things.
>>>>>>>
>>>>>>> One thing we seem to be coming up against is an out-of-memory
>>>>>>> error when getting large sequence analysis results (5,000
>>>>>>> sequences and above, at least) back from BTblastall. The
>>>>>>> problem seems to be with BioPerl.
>>>>>>>
>>>>>>> Might anyone here know whether BioPerl knows enough not to try
>>>>>>> to access more than 4 GB of RAM in a single process (an OS X
>>>>>>> limit)? I'm told Blastall and BTblastall do, and will chunk
>>>>>>> problems accordingly, but we're not certain whether BioPerl
>>>>>>> does when called to merge large BLAST results back together.
>>>>>>> It's the default version 1.2.3 that's supplied, btw, and OS X
>>>>>>> 10.3.5 with all current updates just short of the latest
>>>>>>> 10.3.6 update.
>>>>>>
>>>>>> BioPerl tries to slurp up the entire result set from a BLAST
>>>>>> query, builds objects for each little bit of the result set,
>>>>>> and uses lots of memory. It doesn't have anything smart at all
>>>>>> about breaking up the job within the result set, afaik.
>>>>
>>>> This is not really true. The SearchIO module, as far as I know,
>>>> works on a stream.
>>>>
>>>>>> I ended up stripping out results that hit a certain threshold
>>>>>> size to run on a different, large-memory Opteron/Linux box, and
>>>>>> I'm experimenting with replacing BioPerl with BioPython etc.
>>>>>>
>>>>>> Michael
>>>>>
>>>>> You may find that the BPLite parser works better when dealing
>>>>> with large BLAST result files. It's not as clean or as well
>>>>> maintained, but it does the job nicely for my current needs,
>>>>> which overloaded the usual parser.
>>>>
>>>> There is basically no difference between BPLite and the other
>>>> BLAST parser interfaces in Bioperl.
>>>>
>>>> The problem lies in the core of Perl itself. Perl does not
>>>> release memory to the system even after the reference count of an
>>>> object created in memory goes to 0, unless the program is
>>>> actually over. The Perl object system is highly inefficient at
>>>> handling large numbers of objects created in memory.
>>>>
>>>> -Malay
>>>> _______________________________________________
>>>> Bioclusters maillist - Bioclusters at bioinformatics.org
>>>> https://bioinformatics.org/mailman/listinfo/bioclusters
>>>
>>> =====
>>> Mike Cariaso
>>> _______________________________________________
>>> Bioclusters maillist - Bioclusters at bioinformatics.org
>>> https://bioinformatics.org/mailman/listinfo/bioclusters
>>>