[Bioperl-l] dealing with large files
Jason Stajich
jason at bioperl.org
Tue Dec 18 18:31:39 UTC 2007
Not exactly clear why you aren't using Bio::SeqIO to write the
sequence back out in FASTA format and why you are re-opening the file
each time?
Did you look at the examples that show how to convert file formats?
http://bioperl.org/wiki/HOWTO:SeqIO
You can set the description with
$seq->description($newdescription);
and the ID with
$seq->display_id($newid);
before writing.
It isn't clear to me from your code why it would be leaking memory
and causing a problem - is it possible that you have a huge sequence
in the EMBL file?
-jason
On Dec 18, 2007, at 10:04 AM, Stefano Ghignone wrote:
> Dear all,
> I'm facing with a really annoying problem regarding large files
> handling.
> I wrote a script (below) which should keep sequences from an embl
> formatted file and write out the sequences in a customized fasta
> format. The script works, but since the input file is rather big
> 5.6 GB unzipped (987 MB zipped), after a while all the physical and
> virtual memories of my workstation (4GB RAM) are filled and the
> script is killed...
> I really don't know how to avoid this huge memory usage...and now
> I'm wondering if this is the right approach....
> Please help me!
> Best wishes,
> Stefano
>
>
>
> #################
> #!/usr/bin/perl -w
>
> use strict;
>
> use warnings;
>
> use Fcntl;
> use Cwd;
>
> use Bio::SeqIO;
>
> my $infile = $ARGV[0];
> my $outfile = "$ARGV[0].fasta";
> my $organism;
> my $count;
> my $path = cwd()."/$outfile";
>
> print "Working dir is: ".cwd().".\nCreating file: $path\n";
>
> my $in = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |", -
> format => 'EMBL');
>
> while ( my $seq = $in->next_seq() ) {
> sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT);
> my $id = $seq->accession_number();
> my $desc = $seq->desc(); chop $desc;
> my $species = $seq->species->binomial();
> my $subspecies = $seq->species->sub_species();
> if ($seq->species->sub_species()) {chop $subspecies; $organism =
> $species." ".$subspecies;}
> else {$organism = $species;}
> my $sequence = $seq->seq();
> print TO ">$id $desc [$organism]\n$sequence\n";
> $count++;
> warn $@ if $@;
> close TO;
> }
>
> print "Done!\n\t$count sequences have been treated. The file $ARGV
> [0].fasta is ready.\n";
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list