[Bioperl-l] dealing with large files

Jason Stajich jason at bioperl.org
Tue Dec 18 18:31:39 UTC 2007


Not exactly clear why you aren't using Bio::SeqIO to write the  
sequence back out in FASTA format and why you are re-opening the file  
each time?

Did you look at the examples that show how to convert file formats?
http://bioperl.org/wiki/HOWTO:SeqIO

You can set the description with
$seq->description($newdescription);
and the ID with
$seq->display_id($newid);
before writing.

It isn't clear to me from your code why it would be leaking memory  
and causing a problem - is it possible that you have a huge sequence  
in the EMBL file?

-jason
On Dec 18, 2007, at 10:04 AM, Stefano Ghignone wrote:

> Dear all,
>   I'm facing with a really annoying problem regarding large files  
> handling.
> I wrote a script (below) which should keep sequences from an embl  
> formatted file and write out the sequences in a customized fasta  
> format. The script works, but since the input file is rather big  
> 5.6 GB unzipped (987 MB zipped), after a while all the physical and  
> virtual memories of my workstation (4GB RAM) are filled and the  
> script is killed...
> I really don't know how to avoid this huge memory usage...and now  
> I'm wondering if this is the right approach....
> Please help me!
> Best wishes,
> Stefano
>
>
>
> #################
> #!/usr/bin/perl -w
>
> use strict;
>
> use warnings;
>
> use Fcntl;
> use Cwd;
>
> use Bio::SeqIO;
>
> my $infile = $ARGV[0];
> my $outfile = "$ARGV[0].fasta";
> my $organism;
> my $count;
> my $path = cwd()."/$outfile";
>
> print "Working dir is: ".cwd().".\nCreating file: $path\n";
>
> my $in  = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |", - 
> format => 'EMBL');
>
> while ( my $seq = $in->next_seq() ) {
> 	sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT);
> 	my $id = $seq->accession_number();	
> 	my $desc = $seq->desc(); chop $desc;
> 	my $species = $seq->species->binomial();
> 	my $subspecies = $seq->species->sub_species();
> 	if ($seq->species->sub_species()) {chop $subspecies; $organism =  
> $species." ".$subspecies;}
> 		else {$organism = $species;}
> 	my $sequence = $seq->seq();
> 	print TO ">$id $desc [$organism]\n$sequence\n";
>     	$count++;
> 	warn $@ if $@;
> 	close TO;
> }
>
> print "Done!\n\t$count sequences have been treated. The file $ARGV 
> [0].fasta is ready.\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list