[Bioperl-l] dealing with large files

Stefano Ghignone ste.ghi at libero.it
Wed Dec 19 16:45:15 UTC 2007


> Not exactly clear why you aren't using Bio::SeqIO to write the  
> sequence back out in FASTA format and why you are re-opening the file  
> each time?
It was to avoid tho keep the out file always opened...

> Did you look at the examples that show how to convert file formats?
> http://bioperl.org/wiki/HOWTO:SeqIO
yes I did...but I didn't realized how to set a customized description...

> You can set the description with
> $seq->description($newdescription);
> and the ID with
> $seq->display_id($newid);
> before writing.
Thanks for the hint. Anyway, just using the simple code reported to convert embl to fasta format, the results are the same...I remember you that I'm using a huge input file: the uniprot_trembl_bacteria.dat.gz...it contains 13101418 sequences!

> It isn't clear to me from your code why it would be leaking memory  
> and causing a problem - is it possible that you have a huge sequence  
> in the EMBL file?
> -jason

At the end, I succeeded in the format conversion using this command:

gunzip -c uniprot_trembl_bacteria.dat.gz | perl -ne 'print ">$1 " if
(/^AC\s+(\S+);/); print " $1" if (/^DE\s+(.*)/);print " [$1]\n" if
(/^OS\s+(.*)/); if (($a)=/^\s+(.*)/){$a=~s/ //g; print "$a\n"};'

(Thanks to Riccardo Percudani). It's not bioperl...but it works!

My best wishes,
Stefano


> On Dec 18, 2007, at 10:04 AM, Stefano Ghignone wrote:
> 
> > Dear all,
> >   I'm facing with a really annoying problem regarding large files  
> > handling.
> > I wrote a script (below) which should keep sequences from an embl  
> > formatted file and write out the sequences in a customized fasta  
> > format. The script works, but since the input file is rather big  
> > 5.6 GB unzipped (987 MB zipped), after a while all the physical and  
> > virtual memories of my workstation (4GB RAM) are filled and the  
> > script is killed...
> > I really don't know how to avoid this huge memory usage...and now  
> > I'm wondering if this is the right approach....
> > Please help me!
> > Best wishes,
> > Stefano
> >
> >
> >
> > #################
> > #!/usr/bin/perl -w
> >
> > use strict;
> >
> > use warnings;
> >
> > use Fcntl;
> > use Cwd;
> >
> > use Bio::SeqIO;
> >
> > my $infile = $ARGV[0];
> > my $outfile = "$ARGV[0].fasta";
> > my $organism;
> > my $count;
> > my $path = cwd()."/$outfile";
> >
> > print "Working dir is: ".cwd().".\nCreating file: $path\n";
> >
> > my $in  = Bio::SeqIO->new(-file => "/bin/gunzip -c $infile |", - 
> > format => 'EMBL');
> >
> > while ( my $seq = $in->next_seq() ) {
> > 	sysopen(TO, $path, O_WRONLY | O_APPEND | O_CREAT);
> > 	my $id = $seq->accession_number();	
> > 	my $desc = $seq->desc(); chop $desc;
> > 	my $species = $seq->species->binomial();
> > 	my $subspecies = $seq->species->sub_species();
> > 	if ($seq->species->sub_species()) {chop $subspecies; $organism =  
> > $species." ".$subspecies;}
> > 		else {$organism = $species;}
> > 	my $sequence = $seq->seq();
> > 	print TO ">$id $desc [$organism]\n$sequence\n";
> >     	$count++;
> > 	warn $@ if $@;
> > 	close TO;
> > }
> >
> > print "Done!\n\t$count sequences have been treated. The file $ARGV 
> > [0].fasta is ready.\n";
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 





More information about the Bioperl-l mailing list