[Bioperl-l] dealing with large files

Albert Vilella avilella at gmail.com
Tue Dec 18 20:33:43 UTC 2007


There is a Bio::SeqIO "largefasta" format module that keeps very large
FASTA sequences in temporary files on disk instead of in memory.
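
For the EMBL-to-FASTA conversion in the script quoted below, here is a
minimal sketch of what Jason suggests: read and write through Bio::SeqIO,
open the output once, and set the ID and description on each sequence
before writing it. The helper names build_description and embl_gz_to_fasta
are made up for illustration, and the sketch assumes gunzip is on your PATH.
It is untested against your file and may not cure the memory use if a
single EMBL record is itself huge.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the "<desc> [<organism>]" description used in
# the original script. It strips a trailing period rather than calling chop
# unconditionally, so text without a final "." is left intact.
sub build_description {
    my ($desc, $binomial, $sub_species) = @_;
    $desc = '' unless defined $desc;
    $desc =~ s/\.\s*$//;
    my $organism = defined $binomial ? $binomial : 'unknown';
    if (defined $sub_species && length $sub_species) {
        $sub_species =~ s/\.\s*$//;
        $organism .= " $sub_species";
    }
    return "$desc [$organism]";
}

# Hypothetical helper: stream a gzipped EMBL file to FASTA, one record at a
# time, with a single input handle and a single output handle.
sub embl_gz_to_fasta {
    my ($infile, $outfile) = @_;
    require Bio::SeqIO;    # loaded lazily so build_description works alone

    # List-form pipe open: the file name is not passed through a shell.
    open(my $gz, '-|', 'gunzip', '-c', $infile)
        or die "Cannot run gunzip on $infile: $!";

    my $in  = Bio::SeqIO->new(-fh   => $gz,          -format => 'EMBL');
    my $out = Bio::SeqIO->new(-file => ">$outfile",  -format => 'fasta');

    my $count = 0;
    while (my $seq = $in->next_seq) {
        my $species = $seq->species;
        my ($binomial, $sub) =
            $species ? ($species->binomial, $species->sub_species) : ();

        # Set the ID and description, then let SeqIO do the formatting.
        $seq->display_id($seq->accession_number);
        $seq->desc(build_description($seq->desc, $binomial, $sub));
        $out->write_seq($seq);
        $count++;
    }
    close $gz;
    return $count;
}
```

You would call it as embl_gz_to_fasta($ARGV[0], "$ARGV[0].fasta"). If the
input were a very large FASTA file rather than EMBL, the same loop should
work with -format => 'largefasta' on the reader.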

On Dec 18, 2007 6:31 PM, Jason Stajich <jason at bioperl.org> wrote:
> It's not exactly clear why you aren't using Bio::SeqIO to write the
> sequence back out in FASTA format, or why you are re-opening the
> output file for every sequence.
>
> Did you look at the examples that show how to convert file formats?
> http://bioperl.org/wiki/HOWTO:SeqIO
>
> You can set the description with
> $seq->description($newdescription);
> and the ID with
> $seq->display_id($newid);
> before writing.
>
> It isn't clear to me from your code why it would be leaking memory
> and causing a problem - is it possible that you have a huge sequence
> in the EMBL file?
>
> -jason
>
> On Dec 18, 2007, at 10:04 AM, Stefano Ghignone wrote:
>
> > Dear all,
> >   I'm facing a really annoying problem with handling large files.
> > I wrote a script (below) which should read sequences from an
> > EMBL-formatted file and write them out in a customized FASTA
> > format. The script works, but since the input file is rather big
> > (5.6 GB unzipped, 987 MB zipped), after a while all the physical and
> > virtual memory of my workstation (4 GB RAM) is filled and the
> > script is killed...
> > I really don't know how to avoid this huge memory usage, and now
> > I'm wondering if this is the right approach.
> > Please help me!
> > Best wishes,
> > Stefano
> >
> >
> >
> > #################
> > #!/usr/bin/perl -w
> >
> > use strict;
> >
> > use warnings;
> >
> > use Fcntl;
> > use Cwd;
> >
> > use Bio::SeqIO;
> >
> > my $infile = $ARGV[0];
> > my $outfile = "$ARGV[0].fasta";
> > my $organism;
> > my $count;
> > my $path = cwd()."/$outfile";
> >
> > print "Working dir is: ".cwd().".\nCreating file: $path\n";
> >
> > my $in  = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |",
> >                           -format => 'EMBL');
> >
> > while ( my $seq = $in->next_seq() ) {
> >       sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT);
> >       my $id = $seq->accession_number();
> >       my $desc = $seq->desc(); chop $desc;
> >       my $species = $seq->species->binomial();
> >       my $subspecies = $seq->species->sub_species();
> >       if ($seq->species->sub_species()) {
> >               chop $subspecies;
> >               $organism = $species." ".$subspecies;
> >       } else {
> >               $organism = $species;
> >       }
> >       my $sequence = $seq->seq();
> >       print TO ">$id $desc [$organism]\n$sequence\n";
> >       $count++;
> >       warn $@ if $@;
> >       close TO;
> > }
> >
> > print "Done!\n\t$count sequences have been treated. The file $ARGV[0].fasta is ready.\n";
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


