[Bioperl-l] Performance problems with BioPerl and Perl 5.8 on Windows
Chris Fields
cjfields at uiuc.edu
Thu May 18 20:07:14 UTC 2006
David,
I have seen some slowdowns in Bio::SeqIO when reading GenBank files, which
could be related to this. I can't test or commit changes until next week,
but someone else using Windows might be able to (though we are few and far
between, and I'm switching to Mac OS X in the fall). It would be nice to try
the changes and test them on a few platforms.
Chris
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of
> David_Waner/San_Diego/Accelrys at scitegic.com
> Sent: Thursday, May 18, 2006 2:31 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Performance problems with BioPerl and Perl 5.8
> on Windows
>
> BioPerl Users/Developers,
>
> In our testing we have found severe performance problems using BioPerl
> with Perl 5.8 on Windows (but not on Linux). They show up especially in
> SeqIO when reading or writing Fasta files containing large (~16 MB)
> sequences. The same files that can be read in 1 or 2 seconds with Windows
> Perl 5.6 or Linux Perl 5.8 take minutes in Windows Perl 5.8.
>
> Although the fault is clearly with Perl, not with BioPerl, I have
> identified a couple of places where BioPerl could be modified in order to
> save Windows Perl 5.8 users a lot of time, while not affecting other
> users.
>
> For example, in my testing the following excerpt from
> Bio::Root::IO::_readline() takes 50 seconds (!) to execute (when reading a
> 16 MB sequence):
>
> if( (!$param{-raw}) && (defined $line) ) {
>     $line =~ s/\015?\012/\n/g;
>     $line =~ s/\015/\n/g unless $ONMAC;
> }
>
> whereas the following replacement code should be equivalent:
>
> if( (!$param{-raw}) && (defined $line) ) {
>     $line =~ s/\015\012/\012/g;          # Change all CR/LF pairs to LF
>     $line =~ tr/\015/\n/ unless $ONMAC;  # Change all single CRs to NEWLINE
> }
>
> but executes in less than 1 second.
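The claimed equivalence of the two snippets above can be checked on a small input. The following is a minimal sketch, not BioPerl code: the helper names normalize_regex and normalize_tr are hypothetical, and the $on_mac argument stands in for the $ONMAC flag used in Bio::Root::IO. It assumes a platform where "\n" is \012.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the original two-substitution version.
sub normalize_regex {
    my ($line, $on_mac) = @_;
    $line =~ s/\015?\012/\n/g;
    $line =~ s/\015/\n/g unless $on_mac;
    return $line;
}

# Hypothetical wrapper around the proposed s/// plus tr/// version.
sub normalize_tr {
    my ($line, $on_mac) = @_;
    $line =~ s/\015\012/\012/g;            # CR/LF pairs -> LF
    $line =~ tr/\015/\n/ unless $on_mac;   # remaining lone CRs -> newline
    return $line;
}

# Mixed line endings: CR/LF, lone CR, lone LF.
my $sample = "ACGT\015\012TTGA\015CCAT\012GG";

for my $on_mac (0, 1) {
    my $same = normalize_regex($sample, $on_mac) eq normalize_tr($sample, $on_mac);
    print "on_mac=$on_mac: ", ($same ? "identical" : "different"), "\n";
}
```

This only checks that the outputs match on one sample; it says nothing about the timing difference, which would need to be measured on the affected Windows Perl 5.8 build with a large sequence.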
>
> In addition, changing:
>
> defined $sequence && $sequence =~ s/\s//g; # Remove whitespace
>
> to:
>
> defined $sequence && $sequence =~ tr/ \t\n\r//d; # Remove whitespace
>
> in Bio::SeqIO::fasta.pm saves an additional ~20 seconds.
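One detail worth checking before adopting the tr/// form: \s also matches a form feed (\f), which the character list " \t\n\r" does not include. The sketch below compares the two forms on ordinary sequence text and on a string containing a form feed. The helper names strip_regex and strip_tr are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the original regex-based whitespace removal.
sub strip_regex {
    my ($s) = @_;
    $s =~ s/\s//g;
    return $s;
}

# Hypothetical helper: the proposed tr///-based whitespace removal.
sub strip_tr {
    my ($s) = @_;
    $s =~ tr/ \t\n\r//d;
    return $s;
}

# Typical FASTA body text: spaces, tabs, CR/LF and LF line endings.
my $seq = "ACGT ACGT\tTTGA\r\nCCAT\n";
my $typical_same = strip_regex($seq) eq strip_tr($seq);
print "typical input: ", ($typical_same ? "same" : "differ"), "\n";

# A form feed is removed by \s but kept by the tr/// list.
my $with_ff = "ACGT\fTTGA";
my $ff_same = strip_regex($with_ff) eq strip_tr($with_ff);
print "form feed input: ", ($ff_same ? "same" : "differ"), "\n";
```

If form feeds can occur in the input, adding \f to the tr/// list would presumably close the gap, at the cost of diverging slightly from the code as posted.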
>
> There are also problems in reading files with the <> operator when $/ is
> redefined to "\n>", where reading the first line of Fasta files containing
> large sequences takes ~50 seconds, but reading subsequent lines or files
> takes about 1 second. I don't have a work-around for this.
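For readers unfamiliar with the record-reading idiom described above, here is a minimal sketch of reading FASTA entries with $/ set to "\n>". It is not the Bio::SeqIO::fasta implementation and it is not a work-around for the slowdown; the sample data and variable names are made up for illustration, and it reads from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Made-up two-record FASTA text for illustration.
my $fasta = ">seq1 first record\nACGT\nTTGA\n>seq2\nCCAT\n";

# In-memory filehandle (available in Perl 5.8 and later).
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";                 # each <$fh> now returns one FASTA entry
    while (defined(my $entry = <$fh>)) {
        chomp $entry;                 # drops the trailing "\n>" separator, if present
        $entry =~ s/^>//;             # the first entry still carries its leading ">"
        my ($header, $body) = split /\n/, $entry, 2;
        $body = '' unless defined $body;
        $body =~ tr/\n//d;            # join the sequence lines
        push @records, [$header, $body];
    }
}
close $fh;

print "$_->[0]: $_->[1]\n" for @records;
```

With this separator, the first read has to scan the whole first sequence to find the next "\n>", which is where the reported delay would show up for a 16 MB record.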
>
> I would like to ask the mailing list:
>
> 1. Has anyone else run into this problem? Any fixes?
> 2. Do you think BioPerl should incorporate these changes?
>
> I plan to submit a bug report to perlbug, but don't know when or if the
> problem will be fixed.
>
> - David