[Bioperl-l] Performance problems with BioPerl and Perl 5.8 on Windows
David_Waner/San_Diego/Accelrys at scitegic.com
Thu May 18 19:30:46 UTC 2006
BioPerl Users/Developers,
In our testing we have found severe performance problems using BioPerl
with Perl 5.8 on Windows (but not on Linux). They show up especially in
SeqIO when reading or writing Fasta files containing large (~16 MB)
sequences. The same files that can be read in 1 or 2 seconds with Windows
Perl 5.6 or Linux Perl 5.8 take minutes with Windows Perl 5.8.
Although the fault is clearly with Perl, not with BioPerl, I have
identified a couple of places where BioPerl could be modified in order to
save Windows Perl 5.8 users a lot of time, while not affecting other
users.
For example, in my testing the following excerpt from
Bio::Root::IO::_readline() takes 50 seconds (!) to execute (when reading a
16 MB sequence):
    if( (!$param{-raw}) && (defined $line) ) {
        $line =~ s/\015?\012/\n/g;
        $line =~ s/\015/\n/g unless $ONMAC;
    }
whereas the following replacement code should be equivalent:
    if( (!$param{-raw}) && (defined $line) ) {
        $line =~ s/\015\012/\012/g;          # Change all CR/LF pairs to LF
        $line =~ tr/\015/\n/ unless $ONMAC;  # Change all single CRs to NEWLINE
    }
but executes in less than 1 second.
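A minimal sketch for checking that the two variants agree on a small input. The helper names normalize_regex and normalize_tr are hypothetical (they are not part of BioPerl), and the $ONMAC guard is dropped on the assumption that the script runs on a platform where "\n" is LF, such as Windows or Linux:

```perl
use strict;
use warnings;

# Original approach: two regex substitutions.
sub normalize_regex {
    my ($line) = @_;
    $line =~ s/\015?\012/\n/g;   # CR/LF or bare LF -> newline
    $line =~ s/\015/\n/g;        # remaining bare CR -> newline
    return $line;
}

# Proposed approach: one substitution plus one transliteration.
sub normalize_tr {
    my ($line) = @_;
    $line =~ s/\015\012/\012/g;  # CR/LF pairs -> LF
    $line =~ tr/\015/\n/;        # remaining bare CR -> newline
    return $line;
}

# Mixed line endings: CR/LF, bare CR, bare LF.
my $input     = "ACGT\015\012TTGA\015CCAA\012GG";
my $via_regex = normalize_regex($input);
my $via_tr    = normalize_tr($input);

print $via_regex eq $via_tr ? "same\n" : "different\n";
```

This only compares the results, not the timings; the speed difference reported above would have to be measured on the affected Windows Perl 5.8 build with a large input.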
In addition, changing:
    defined $sequence && $sequence =~ s/\s//g;  # Remove whitespace
to:
    defined $sequence && $sequence =~ tr/ \t\n\r//d;  # Remove whitespace
in Bio::SeqIO::fasta (Bio/SeqIO/fasta.pm) saves an additional ~20 seconds.
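One caveat with this substitution: \s also matches a form feed (\f), which the tr/ \t\n\r//d list does not include, so the two are interchangeable only if the sequence data never contains form feeds. A small sketch of the comparison, with hypothetical helper names strip_regex and strip_tr:

```perl
use strict;
use warnings;

# Original approach: regex removal of any whitespace character.
sub strip_regex {
    my ($s) = @_;
    $s =~ s/\s//g;
    return $s;
}

# Proposed approach: delete space, tab, LF and CR by transliteration.
sub strip_tr {
    my ($s) = @_;
    $s =~ tr/ \t\n\r//d;
    return $s;
}

# Typical sequence text: both approaches give the same result.
my $seq = "ACGT ACGT\tTT\r\nGG\n";
print strip_regex($seq), "\n";
print strip_tr($seq), "\n";

# A form feed is removed by \s but left in place by the tr list.
my $with_ff = "AC\fGT";
print length(strip_regex($with_ff)), " vs ", length(strip_tr($with_ff)), "\n";
```

If form feeds could occur in the input, adding \f to the tr list would bring the two back in line.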
There are also problems in reading files with the <> operator when $/ is
redefined to "\n>", where reading the first line of Fasta files containing
large sequences takes ~50 seconds, but reading subsequent lines or files
takes about 1 second. I don't have a work-around for this.
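For readers unfamiliar with this reading style, here is a minimal sketch of record-at-a-time reading with $/ set to "\n>". It uses an in-memory filehandle and made-up sample data purely for illustration; it is not the BioPerl code and it is not a work-around for the slowdown:

```perl
use strict;
use warnings;

# Made-up sample data; a real test would use a file with a ~16 MB sequence.
my $fasta = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";              # each read returns one whole Fasta record
    while (my $record = <$fh>) {
        chomp $record;             # strips the trailing "\n>" separator, if present
        $record =~ s/^>//;         # the first record keeps its leading '>'
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records read\n";
```

With $/ redefined this way, each <> call has to scan the entire sequence for the next "\n>", so a single read covers the whole 16 MB record.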
I would like to ask the mailing list:
1. Has anyone else run into this problem? Any fixes?
2. Do you think BioPerl should incorporate these changes?
I plan to submit a bug report to perlbug, but don't know when or if the
problem will be fixed.
- David