[Bioperl-l] Bio::Restriction::Analysis::fragments() memory usage

Sun Oct 29 19:16:36 UTC 2006

Nathan S. Haigh wrote:
> Sorry for the repeat post but I haven't had a response. Just wondered if 
> anyone had any idea about this?
>
> Thanks
> Nath
>
> Nathan S. Haigh wrote:
>   
>> As you may be aware by now, i'm working with Bio::Restriction::Analysis
>> and friends.
>>
>> I'm doing restriction analysis on large sequences - chromosomes. I need
>> to identify an appropriate enzyme based on the total length of fragments
>> that are of a certain size (e.g. 100 - 500 bp). However, the amount of
>> memory used by Bio::Restriction::Analysis::fragments() is prohibative. I
>> have the following code (bottom) which downloads 2 thaliana chromosomes
>> (mito and chloro - so pretty small) and runs an analysis and then loops
>> through the fragments for all enzymes in the default collection.
>>
>> My memory usage just keep on climbing and none seems to get freed up
>> even when a $ra goes out of scope (start dealing with the next
>> sequence). Is this a memory leak of some sort, is there a way to free up
>> memory as I go? I'd appreciate any help/advice on how to reduce the
>> amount of memory being consumed as I'd like to use all the thaliana
>> chromosomes (not just mito and chloro), which at the moment probably
>> won't work.
>>
>> Cheers
>> Nath
>>
>> use strict;
>> use Bio::DB::GenBank;
>> use Bio::Restriction::Analysis;
>> use Bio::Restriction::EnzymeCollection;
>>
>> my @seq_objs;
>> my @gis = ( 7525012,  26556996 );
>>
>> my $db = Bio::DB::GenBank->new(-format => "fasta");
>> foreach my $gi (@gis) {
>>   print "Getting GI: $gi\n";
>>   push @seq_objs, $db->get_Seq_by_id($gi)
>> }
>>
>> my $min_fragment_size = 100;
>> my $max_fragment_size = 500;
>> my $enz_Coll = Bio::Restriction::EnzymeCollection->new();
>>
>> foreach my $seq (@seq_objs) {
>>   my $tot_size = 0;
>>   print "Processing ", $seq->primary_id,"\n";
>>   my $ra = Bio::Restriction::Analysis->new(
>>                                          -seq=>$seq,
>>                                          -enzymes=>$enz_Coll,
>>   );
>>  
>>   my @all_enzymes = $ra->cutters->each_enzyme;
>>   print "  Calc total length of fragments in range: $min_fragment_size -
>> $max_fragment_size\n";
>>   foreach my $enzyme ( @all_enzymes ) {
>>     # fragments() is a real memory hog
>>     foreach my $frag ($ra->fragments($enzyme)) {
>>       next if $min_fragment_size && (length $frag < $min_fragment_size);
>>       next if $max_fragment_size && (length $frag > $max_fragment_size);
>>       $tot_size += length $frag;
>>     }
>>     # do something based on value of $tot_size
>>     #print "    ", $enzyme->name, " total = $tot_size\n";
>>   }
>>   print "DONE\n";
>> }
>>
>>     
Try this code, which creates a new Bio::Restriction::Analysis object for 
each digest. On my PowerBook, this doesn't use more than 13 Mb of memory.

Reading the code for Bio::Restriction::Analysis reveals that the 
fragments() method calls the cut() method. The documentation for the cut 
method states:

Note: cut doesn't now re-initialize everything before figuring out
cuts. This is so that you can do multiple digests, or add more data or
whatever. You'll have to use new to reset everything.

This means there is no memory leak; it's just that the 
Bio::Restriction::Analysis object is retaining cut information for each 
enzyme, which takes a lot of memory.

use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::Restriction::Analysis;
use Bio::Restriction::EnzymeCollection;

my @seq_objs;
my @gis = ( 7525012,  26556996 );

my $db = Bio::DB::GenBank->new(-format => "fasta");
foreach my $gi (@gis) {
  print "Getting GI: $gi\n";
  push @seq_objs, $db->get_Seq_by_id($gi)
}

my $min_fragment_size = 100;
my $max_fragment_size = 500;
my $enz_Coll = Bio::Restriction::EnzymeCollection->new();

foreach my $seq (@seq_objs) {
  print "Processing ", $seq->primary_id, "\n";
  foreach my $enzyme ( $enz_Coll->each_enzyme() ) {
    my $ra = Bio::Restriction::Analysis->new(
      -seq => $seq,
      -enzymes => $enzyme );
    my $tot_size = 0;

    print "  Calc total length of fragments in range: $min_fragment_size 
-" .
      " $max_fragment_size\n";

    foreach my $frag ($ra->fragments($enzyme)) {
      next if $min_fragment_size && (length $frag < $min_fragment_size);
      next if $max_fragment_size && (length $frag > $max_fragment_size);
      $tot_size += length $frag;
    }
    # do something based on value of $tot_size
    print "    ", $enzyme->name, " total = $tot_size\n";
  }
  print "DONE\n";
}

-- 
Conrad Halling
chhalling at alumni.ls.berkeley.edu