[Bioperl-l] Re: [Bioclusters] BioPerl and memory handling

Steve Chervitz sac at portal.open-bio.org
Tue Nov 30 04:24:24 EST 2004


My perl behaves like Ian's: undefing the object leads to about half of the
memory being reclaimed by the OS. Repeatedly creating and undefing never
leads to more than ~190Mb or less than ~95Mb of allocation. (My perl:
5.8.1-RC3 built for darwin-thread-multi-2level, Mac OS X 10.3.6.)
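
For reference, here's the flavor of loop I was watching (a minimal
sketch; the RSS check shells out to ps, so it assumes a Unix-like
system):

    use strict;

    sub rss_kb {
        # Resident set size of this process in Kb, via ps (Unix-specific).
        my ($rss) = `ps -o rss= -p $$` =~ /(\d+)/;
        return $rss;
    }

    for my $i (1 .. 10) {
        my $obj = { foo => 'N' x 100000000 };   # ~95 Mb payload in an object
        printf "after create: %d Kb\n", rss_kb();
        undef $obj;
        printf "after undef:  %d Kb\n", rss_kb();
    }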

I think that Perl, in attempting to improve performance, doesn't want to
give back memory resulting from deletions within a running process, but
keeps some or all of it in a pool for future allocation. This behavior can
probably be controlled by how you build perl. For example, not including any
compiler optimizations might make perl more forthcoming with memory, but
performance will suffer.
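
One relevant build knob is whether perl uses its own allocator or the
system malloc; you can check with:

    perl -V:usemymalloc    # 'y' = perl's own malloc, 'n' = the system's

A perl built with its own malloc is, as far as I know, even less likely
to hand freed memory back to the OS.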

For a definitive answer, I'd recommend checking with the perl porters:
http://www.gossamer-threads.com/lists/perl/porters/ .

Regarding SearchIO memory usage, I don't think this has been an issue
before, so I wonder if there is something about the installation or specific
usage of it that is leading to memory hogging. I've run it over large
numbers of reports without noticing troubles. It would be useful to see a
sample report + script using SearchIO that leads to the memory troubles, so
we can try to reproduce it.
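
Something along these lines, pointed at a report that triggers the
problem, would be plenty (a bare-bones sketch; the file name is just a
placeholder):

    use strict;
    use Bio::SearchIO;

    # Walk every result/hit/HSP in a BLAST report, which is roughly what
    # a memory-hungry parsing script ends up doing.
    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'report.bls');
    while (my $result = $in->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                print join("\t", $result->query_name,
                                 $hit->name, $hsp->evalue), "\n";
            }
        }
    }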

Steve


> From: Malay <mbasu at mail.nih.gov>
> Date: Mon, 29 Nov 2004 23:12:10 -0500
> To: "Clustering, compute farming & distributed computing in life science
> informatics" <bioclusters at bioinformatics.org>
> Cc: <jason.stajich at duke.edu>, <bioperl-l at bioperl.org>
> Subject: [Bioperl-l] Re: [Bioclusters] BioPerl and memory handling
> 
> Thanks, Ian, for your mail. But you have missed a major point of the
> original discussion: what happens to an object? So I did the same test
> that you did, using an object. Here is the result.
> 
> use strict;
> package Test;
> 
> sub new {
>     my $class = shift;
>     my $self  = {};
>     bless $self, $class;
>     $self->{'foo'} = 'N' x 100000000;   # allocate ~100 Mb payload
>     return $self;
> }
> 
> package main;
> 
> my $ob = Test->new();    # uses 197 MB, as you said
> 
> undef $ob;               ## still uses 197 MB ???!!!!
> 
> This was the original point. Perl never releases the memory from the
> initial object creation. In fact, try doing this in whatever way is
> possible, reusing references or undefing them: the memory usage will
> never go below 197 MB for the entire execution of the program.
> 
> So I humbly differ on building any elaborate in-memory object hierarchy
> in Perl. The language is not meant for that. But I am nobody; stalwarts
> will differ in opinion.
> 
> -Malay
> 
> Ian Korf wrote:
> 
>> After a recent conversation about memory in Perl, I decided to do some
>> actual experiments. Here's the email I composed on the subject.
>> 
>> 
>> I looked into the Perl memory issue. It's true that if you allocate a
>> huge amount of memory, Perl doesn't like to give it back. But it's
>> not as bad a situation as you might think. Let's say you do something
>> like
>> 
>>     $FOO = 'N' x 100000000;
>> 
>> That will allocate a chunk of about 192 Mb on my system. It doesn't
>> matter if this is a package variable or lexical.
>> 
>>     our $FOO = 'N' x 100000000; # 192 Mb
>>     my  $FOO = 'N' x 100000000; # 192 Mb
>> 
>> If you put this in a subroutine
>> 
>>     sub foo {my $FOO = 'N' x 100000000}
>> 
>> and you call this a bunch of times
>> 
>>     foo(); foo(); foo(); foo(); foo(); foo(); foo();
>> 
>> the memory footprint stays at 192 Mb. So Perl's garbage collection
>> works just fine. Perl doesn't let go of the memory it has taken from
>> the OS, but it is happy to reassign the memory it has reserved.
>> 
>> Here's something odd. The following labeled block looks like it should
>> give its memory back as soon as it exits.
>> 
>>     BLOCK: {
>>         my  $FOO = 'N' x 100000000;
>>     }
>> 
>> The weird thing is that after executing the block, the memory
>> footprint is still 192 Mb as if it hadn't been garbage collected.
>> 
>> Now look at this:
>> 
>>     my $foo = 'X' x 100000000;
>>     undef $foo;
>> 
>> This has a memory footprint of 96 Mb. After some more experimentation,
>> I have come up with the following interpretation of memory allocation
>> and garbage collection in Perl. Perl will reuse memory for a variable
>> of a given name (either package or lexical scope). There is no fear of
>> memory leaks in loops for example. But each different named variable
>> will retain its own minimum memory. That minimum memory is the size of
>> the largest memory allocated to that variable, or half that amount if
>> other variables have taken some of that space already. You can get any
>> variable to automatically give up half its memory with undef. But this
>> takes a little more CPU time. Here's some test code that shows this
>> behavior.
>> 
>> sub foo {my $FOO = 'N' x 100000000}
>> for (my $i = 0; $i < 50; $i++) {foo()} # 29.420u 1.040s
>> 
>> sub bar {my $BAR = 'N' x 100000000; undef $BAR}
>> for (my $i = 0; $i < 50; $i++) {bar()} # 26.880u 21.220s
>> 
>> The increase from 1 sec to 21 sec system CPU time is all the extra
>> memory allocation and freeing associated with the undef statement. Why
>> the user time is less in the undef example is a mystery to me.
>> 
>> OK, to make a hideously long story short, use undef to save memory and
>> use the same variable name over and over if you can.
>> 
>> ---
>> 
>> But this email thread has turned to BPlite, of which I am the original
>> author. BPlite is designed to parse a stream and only reads a minimal
>> amount of information at a time. The disadvantage of this is that if
>> you want to know something about statistics, you can't get it until
>> the end of the report (the original BPlite ignored statistics
>> entirely). I like the new SearchIO interface better than BPlite, but
>> for my own uses I generally work from a table format and don't use a
>> BLAST parser very often.
>> 
>> -Ian
>> 
>> On Nov 29, 2004, at 3:03 PM, Mike Cariaso wrote:
>> 
>>> This message is being cross posted from bioclusters to
>>> bioperl. I'd appreciate a clarification from anyone in
>>> bioperl who can speak more authoritatively than my
>>> semi-speculation.
>>> 
>>> 
>>> Perl does have a garbage collector. It is not wildly
>>> sophisticated. As you've suggested it uses simple
>>> reference counting. This means that circular
>>> references will cause memory to be held until program
>>> termination.
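>>> 
>>> For example, a cycle like this is never reclaimed
>>> by reference counting alone; one way out is to
>>> break the cycle with Scalar::Util::weaken (a
>>> minimal sketch):
>>> 
>>>     use Scalar::Util qw(weaken);
>>> 
>>>     my $parent = { name => 'parent' };
>>>     my $child  = { name => 'child', parent => $parent };
>>>     $parent->{child} = $child;   # cycle: refcounts never reach 0
>>> 
>>>     weaken($child->{parent});    # weak link lets both hashes be freed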
>>> 
>>> However I think you are overstating the inefficiency
>>> in the system. While the perl GC *may* not release
>>> memory to the system, it does at least allow memory to
>>> be reused within the process.
>>> 
>>> If the system instead behaved as you describe, I think
>>> perl would hemorrhage memory and would be unsuitable
>>> for any long running processes.
>>> 
>>> However I can say with considerable certainty that
>>> BPLite is able to handle blast reports which
>>> cause SearchIO to thrash. I've attributed this to
>>> BPLite being a true stream processor, while SearchIO
>>> seems to slurp the whole file and object hierarchy
>>> into memory.
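>>> 
>>> The BPLite idiom I've been relying on looks
>>> roughly like this (a sketch from memory; treat
>>> the details as approximate):
>>> 
>>>     use Bio::Tools::BPlite;
>>> 
>>>     my $report = Bio::Tools::BPlite->new(-file => 'big_report.bls');
>>>     while (my $sbjct = $report->nextSbjct) {   # one subject at a time
>>>         while (my $hsp = $sbjct->nextHSP) {    # one HSP at a time
>>>             print $hsp->score, "\n";
>>>         }
>>>     }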
>>> 
>>> I know that SearchIO is the preferred blast parser, but
>>> it seems that BPLite is not quite dead, for the
>>> reasons above. If this is in fact the unique benefit of
>>> BPLite, perhaps the documentation should be clearer
>>> about this, as I suspect I'm not the only person to
>>> have had to reengineer a substantial piece of code to
>>> adjust between their different models. Had I known of
>>> this difference early on I would have chosen BPLite.
>>> 
>>> So, bioperlers (especially Jason Stajich), can you
>>> shed any light on this vestigial bioperl organ?
>>> 
>>> 
>>> 
>>> --- Malay <mbasu at mail.nih.gov> wrote:
>>> 
>>>> Michael Cariaso wrote:
>>>> 
>>>>> Michael Maibaum wrote:
>>>>> 
>>>>>> 
>>>>>> On 10 Nov 2004, at 18:25, Al Tucker wrote:
>>>>>> 
>>>>>>> Hi everybody.
>>>>>>> 
>>>>>>> We're new to the Inquiry Xserve scientific cluster and trying to
>>>>>>> iron out a few things.
>>>>>>> 
>>>>>>> One thing we seem to be coming up against is an out of memory
>>>>>>> error when getting large sequence analysis results (5,000 seq at
>>>>>>> least, and above) back from BTblastall. The problem seems to be
>>>>>>> with BioPerl.
>>>>>>> 
>>>>>>> Might anyone here know if BioPerl knows enough not to try and
>>>>>>> access more than 4gb of RAM in a single process (an OS X limit)?
>>>>>>> I'm told Blastall and BTblastall are, and will chunk problems
>>>>>>> accordingly, but we're not certain if BioPerl is when called to
>>>>>>> merge large Blast results back together. It's the default version
>>>>>>> 1.2.3 that's supplied, btw, and OS X 10.3.5 with all current
>>>>>>> updates just short of the latest 10.3.6 update.
>>>>>> 
>>>>>> BioPerl tries to slurp up the entire results set from a BLAST
>>>>>> query, and build objects for each little bit of the result set,
>>>>>> and uses lots of memory. It doesn't have anything smart at all
>>>>>> about breaking up the job within the result set, afaik.
>>>> 
>>>> This is not really true. The SearchIO module, as far as I know,
>>>> works on a stream.
>>>> 
>>>>>> I ended up stripping out results that hit a certain threshold
>>>>>> size to run on a different, large-memory opteron/linux box, and
>>>>>> I'm experimenting with replacing BioPerl with BioPython etc.
>>>>>> 
>>>>>> Michael
>>>>>> 
>>>>> 
>>>>> You may find that the BPLite parser works better when dealing with
>>>>> large blast result files. It's not as clean or maintained, but it
>>>>> does the job nicely for my current needs, which overloaded the
>>>>> usual parser.
>>>> 
>>>> There is basically no difference between BPLite and the other BLAST
>>>> parser interfaces in Bioperl.
>>>> 
>>>> 
>>>> The problem lies in the core of Perl itself. Perl does not release
>>>> memory to the system even after the reference count of an object
>>>> created in memory goes to 0, unless the program is actually over.
>>>> Perl's object system is highly inefficient at handling large numbers
>>>> of objects created in memory.
>>>> 
>>>> -Malay
>>>> _______________________________________________
>>>> Bioclusters maillist  -  Bioclusters at bioinformatics.org
>>>> https://bioinformatics.org/mailman/listinfo/bioclusters
>>> 
>>> =====
>>> Mike Cariaso
>> 
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l



