[Bioperl-l] Performance problems with BioPerl and Perl 5.8 on Windows

Brian Osborne osborne1 at optonline.net
Thu May 18 20:27:57 UTC 2006


David,

What are the results from the relevant t/*.t files before and after these
patches?

Brian O.


On 5/18/06 3:30 PM, "David_Waner/San_Diego/Accelrys at scitegic.com"
<David_Waner/San_Diego/Accelrys at scitegic.com> wrote:

> BioPerl Users/Developers,
> 
> In our testing we have found severe performance problems using BioPerl
> with Perl 5.8 on Windows (but not on Linux). They show up especially in
> SeqIO when reading or writing Fasta files containing large (~16 MB)
> sequences.  The same files that can be read in 1 or 2 seconds with Windows
> Perl 5.6 or Linux Perl 5.8, take minutes in Windows Perl 5.8.
> 
> Although the fault is clearly with Perl, not with BioPerl, I have
> identified a couple of places where BioPerl could be modified in order to
> save Windows Perl 5.8 users a lot of time, while not affecting other
> users. 
> 
> For example, in my testing the following excerpt from
> Bio::Root::IO::_readline() takes 50 seconds (!) to execute (when reading a
> 16 MB sequence):
> 
>     if( (!$param{-raw}) && (defined $line) ) {
>         $line =~ s/\015?\012/\n/g;
>         $line =~ s/\015/\n/g unless $ONMAC;
>     }
>  
> whereas the following replacement code should be equivalent:
> 
>     if( (!$param{-raw}) && (defined $line) ) {
>         $line =~ s/\015\012/\012/g;           # Change all CR/LF pairs to LF
>         $line =~ tr/\015/\n/ unless $ONMAC;   # Change all single CRs to NEWLINE
>     }
>  
> but executes in less than 1 second.
> 
> In addition, changing:
> 
>     defined $sequence && $sequence =~ s/\s//g;        # Remove whitespace
>  
> to:
> 
>     defined $sequence && $sequence =~ tr/ \t\n\r//d;        # Remove whitespace
>  
> in Bio::SeqIO::fasta.pm saves an additional ~20 seconds.
> 
> There are also problems in reading files with the <> operator when $/ is
> redefined to "\n>", where reading the first line of Fasta files containing
> large sequences takes ~50 seconds, but reading subsequent lines or files
> takes about 1 second. I don't have a work-around for this.
> 
> I would like to ask the mailing list:
> 
> 1. Has anyone else run into this problem? Any fixes?
> 2. Do you think BioPerl should incorporate these changes?
> 
> I plan to submit a bug report to perlbug, but don't know when or if the
> problem will be fixed.
> 
> - David
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
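
One way to check the "should be equivalent" claims in the quoted message,
independently of the t/*.t results, is to run the old and new code over a
few small inputs and compare the outputs. The script below is only a sketch
under stated assumptions: the sub names (normalize_regex, normalize_tr,
strip_regex, strip_tr, compare) are made up for illustration and are not
part of BioPerl, the $ONMAC guard is left out on the assumption that "\n"
is "\012" on the platform, and the inputs are toy strings rather than real
Fasta data. Note that \s also matches form feed, which the tr/// list does
not include, so an input containing "\f" is expected to show up as a
difference in the whitespace comparison.

```perl
use strict;
use warnings;

# Hypothetical helper names for illustration; none of these subs exist in BioPerl.
# Assumes a platform where "\n" is "\012", so the $ONMAC guard is left out.

# Line-ending handling as in the original _readline() excerpt.
sub normalize_regex {
    my ($line) = @_;
    $line =~ s/\015?\012/\n/g;
    $line =~ s/\015/\n/g;
    return $line;
}

# Line-ending handling as in the proposed replacement.
sub normalize_tr {
    my ($line) = @_;
    $line =~ s/\015\012/\012/g;   # CR/LF pairs become LF
    $line =~ tr/\015/\n/;         # remaining lone CRs become newline
    return $line;
}

# Whitespace removal, original and proposed forms.
sub strip_regex { my ($seq) = @_; $seq =~ s/\s//g;         return $seq; }
sub strip_tr    { my ($seq) = @_; $seq =~ tr/ \t\n\r//d;   return $seq; }

# Run both code refs over each input and count how many outputs disagree.
sub compare {
    my ($label, $old, $new, @inputs) = @_;
    my $mismatches = 0;
    for my $input (@inputs) {
        $mismatches++ if $old->($input) ne $new->($input);
    }
    printf "%s: %d of %d inputs differ\n", $label, $mismatches, scalar @inputs;
    return $mismatches;
}

compare('line endings', \&normalize_regex, \&normalize_tr,
    "ACGT\015\012", "ACGT\012", "ACGT\015",
    "A\015\012C\015G\012T", "\015\015\012", "");

compare('whitespace', \&strip_regex, \&strip_tr,
    "A C\tG\nT\r\n", "ACGT", "AC\fGT");
```

If form feeds can occur in the input, adding \f to the tr/// list would
close that gap; whether that matters for Fasta files is a separate question.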




