[Bioperl-l] dealing with large files
Stefano Ghignone
ste.ghi at libero.it
Tue Dec 18 18:04:21 UTC 2007
Dear all,
I'm facing a really annoying problem with handling large files.
I wrote a script (below) that should read the sequences from an EMBL-formatted file and write them out in a customized FASTA format. The script works, but the input file is rather big (5.6 GB unzipped, 987 MB zipped), and after a while all the physical and virtual memory of my workstation (4 GB RAM) is filled and the script gets killed...
I really don't know how to avoid this huge memory usage... and now I'm wondering whether this is the right approach at all.
Please help me!
Best wishes,
Stefano
#################
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use Cwd;
use Bio::SeqIO;

my $infile  = $ARGV[0];
my $outfile = "$ARGV[0].fasta";
my $organism;
my $count = 0;
my $path  = cwd() . "/$outfile";
print "Working dir is: " . cwd() . ".\nCreating file: $path\n";

# Stream the gzipped EMBL file through gunzip so it never has to be
# fully decompressed on disk.
my $in = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |", -format => 'EMBL');

# Open the output file once, before the loop, rather than on every iteration.
sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT) or die "Cannot open $path: $!";

while ( my $seq = $in->next_seq() ) {
    my $id   = $seq->accession_number();
    my $desc = $seq->desc();
    chop $desc;    # drop the trailing period of the description line
    my $species    = $seq->species->binomial();
    my $subspecies = $seq->species->sub_species();
    if ($subspecies) {
        chop $subspecies;
        $organism = "$species $subspecies";
    }
    else {
        $organism = $species;
    }
    my $sequence = $seq->seq();
    print TO ">$id $desc [$organism]\n$sequence\n";
    $count++;
}
close TO;

print "Done!\n\t$count sequences have been processed. The file $ARGV[0].fasta is ready.\n";
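In case it helps to see what I'm after without the BioPerl layer, here is a minimal hand-rolled sketch that streams the EMBL records line by line and holds only one record in memory at a time. It is only a rough approximation of the script above: it keeps the first OS line verbatim instead of the binomial/sub-species from Bio::Species, and it assumes the standard EMBL line types (AC, DE, OS, SQ, and the '//' record terminator).
#################
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled EMBL -> FASTA converter: nothing outlives the current
# record, so memory use should stay flat regardless of file size.
my ($acc, $desc, $org, $seq, $in_seq) = ('', '', '', '', 0);

open(my $fh, '-|', '/bin/gunzip', '-c', $ARGV[0]) or die "gunzip: $!";
while (my $line = <$fh>) {
    if    ($line =~ /^AC\s+(\S+?);/) { $acc ||= $1 }                       # first accession
    elsif ($line =~ /^DE\s+(.*)/)    { $desc .= ($desc ? ' ' : '') . $1 }  # join multi-line DE
    elsif ($line =~ /^OS\s+(.*)/)    { $org ||= $1 }                      # first OS line only
    elsif ($line =~ /^SQ/)           { $in_seq = 1 }                      # sequence block starts
    elsif ($line =~ m{^//}) {                                             # end of record: emit FASTA
        $desc =~ s/\.$//;            # drop trailing period, as above
        print ">$acc $desc [$org]\n$seq\n";
        ($acc, $desc, $org, $seq, $in_seq) = ('', '', '', '', 0);
    }
    elsif ($in_seq) {
        $line =~ s/[\s\d]//g;        # strip spacing and position numbers
        $seq .= $line;
    }
}
close $fh;
#################
This obviously loses all the parsing robustness that Bio::SeqIO gives me, so I would much rather fix the BioPerl version if that is possible.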