[Bioperl-l] dealing with large files
Stefano Ghignone
ste.ghi at libero.it
Tue Dec 18 18:04:21 UTC 2007
Dear all,
I'm facing a really annoying problem with handling large files.
I wrote a script (below) that should read the sequences from an EMBL-formatted file and write them out in a customized FASTA format. The script works, but the input file is rather big (5.6 GB unzipped, 987 MB zipped), and after a while all the physical and virtual memory of my workstation (4 GB RAM) is filled and the script gets killed...
I really don't know how to avoid this huge memory usage... and now I'm wondering whether this is the right approach at all.
Please help me!
Best wishes,
Stefano
#################
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use Cwd;
use Bio::SeqIO;

my $infile  = $ARGV[0];
my $outfile = "$ARGV[0].fasta";
my $organism;
my $count = 0;
my $path  = cwd() . "/$outfile";
print "Working dir is: " . cwd() . ".\nCreating file: $path\n";

# Stream the gzipped EMBL file through gunzip so it never has to be
# fully decompressed on disk.
my $in = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |", -format => 'EMBL');

# Open the output file once, before the loop, rather than on every iteration.
sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT) or die "Cannot open $path: $!";

while ( my $seq = $in->next_seq() ) {
    my $id   = $seq->accession_number();
    my $desc = $seq->desc();
    chop $desc;    # drop the trailing period of the description line
    my $species    = $seq->species->binomial();
    my $subspecies = $seq->species->sub_species();
    if ($subspecies) {
        chop $subspecies;
        $organism = "$species $subspecies";
    }
    else {
        $organism = $species;
    }
    my $sequence = $seq->seq();
    print TO ">$id $desc [$organism]\n$sequence\n";
    $count++;
}
close TO;

print "Done!\n\t$count sequences have been processed. The file $ARGV[0].fasta is ready.\n";
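In case it helps to see what I'm after without the BioPerl layer, here is a minimal hand-rolled sketch that streams the EMBL records line by line and holds only one record in memory at a time. It is only a rough approximation of the script above: it keeps the first OS line verbatim instead of the binomial/sub-species from Bio::Species, and it assumes the standard EMBL line types (AC, DE, OS, SQ, and the '//' record terminator).
#################
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled EMBL -> FASTA converter: nothing outlives the current
# record, so memory use should stay flat regardless of file size.
my ($acc, $desc, $org, $seq, $in_seq) = ('', '', '', '', 0);

open(my $fh, '-|', '/bin/gunzip', '-c', $ARGV[0]) or die "gunzip: $!";
while (my $line = <$fh>) {
    if    ($line =~ /^AC\s+(\S+?);/) { $acc ||= $1 }                       # first accession
    elsif ($line =~ /^DE\s+(.*)/)    { $desc .= ($desc ? ' ' : '') . $1 }  # join multi-line DE
    elsif ($line =~ /^OS\s+(.*)/)    { $org ||= $1 }                      # first OS line only
    elsif ($line =~ /^SQ/)           { $in_seq = 1 }                      # sequence block starts
    elsif ($line =~ m{^//}) {                                             # end of record: emit FASTA
        $desc =~ s/\.$//;            # drop trailing period, as above
        print ">$acc $desc [$org]\n$seq\n";
        ($acc, $desc, $org, $seq, $in_seq) = ('', '', '', '', 0);
    }
    elsif ($in_seq) {
        $line =~ s/[\s\d]//g;        # strip spacing and position numbers
        $seq .= $line;
    }
}
close $fh;
#################
This obviously loses all the parsing robustness that Bio::SeqIO gives me, so I would much rather fix the BioPerl version if that is possible.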