[Bioperl-l] Performance problems with BioPerl and Perl 5.8 on Windows

Chris Fields cjfields at uiuc.edu
Thu May 18 20:07:14 UTC 2006


David,

I have seen some slowdowns in Bio::SeqIO with GenBank files, which could be
related to this.  I can't do anything about it (test or commit changes) until
next week, but someone else using Windows might (though we are few and far
between, and I'm switching to Mac OS X in the fall).  It would be nice to try
the changes and test them on a few platforms.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of
> David_Waner/San_Diego/Accelrys at scitegic.com
> Sent: Thursday, May 18, 2006 2:31 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Performance problems with BioPerl and Perl 5.8
> on Windows
> 
> BioPerl Users/Developers,
> 
> In our testing we have found severe performance problems using BioPerl
> with Perl 5.8 on Windows (but not on Linux). They show up especially in
> SeqIO when reading or writing Fasta files containing large (~16 MB)
> sequences.  The same files that can be read in 1 or 2 seconds with Windows
> Perl 5.6 or Linux Perl 5.8, take minutes in Windows Perl 5.8.
> 
> Although the fault is clearly with Perl, not with BioPerl, I have
> identified a couple of places where BioPerl could be modified in order to
> save Windows Perl 5.8 users a lot of time, while not affecting other
> users.
> 
> For example, in my testing the following excerpt from
> Bio::Root::IO::_readline() takes 50 seconds (!) to execute (when reading a
> 16 MB sequence):
> 
>     if( (!$param{-raw}) && (defined $line) ) {
>         $line =~ s/\015?\012/\n/g;
>         $line =~ s/\015/\n/g unless $ONMAC;
>     }
> 
> whereas the following replacement code should be equivalent:
> 
>     if( (!$param{-raw}) && (defined $line) ) {
>         $line =~ s/\015\012/\012/g;          # Change all CR/LF pairs to LF
>         $line =~ tr/\015/\n/ unless $ONMAC;  # Change all single CRs to NEWLINE
>     }
> 
> but executes in less than 1 second.
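
A minimal sketch for checking that the two snippets above agree, assuming a
platform where "\n" is "\012" (Unix or Windows) and taking only the non-Mac
branch. The subroutine names and the sample strings are made up for
illustration; they are not part of Bio::Root::IO.

```perl
use strict;
use warnings;

# Original approach from the quoted excerpt (non-Mac branch only).
sub normalize_regex {
    my ($line) = @_;
    $line =~ s/\015?\012/\n/g;
    $line =~ s/\015/\n/g;
    return $line;
}

# Proposed replacement from the quoted excerpt (non-Mac branch only).
sub normalize_tr {
    my ($line) = @_;
    $line =~ s/\015\012/\012/g;   # CR/LF pairs -> LF
    $line =~ tr/\015/\n/;         # remaining lone CRs -> newline
    return $line;
}

# Hypothetical inputs covering CR/LF, LF-only, CR-only and mixed endings.
my @samples = (
    "ACGT\015\012TTGA\015\012",
    "ACGT\012TTGA\012",
    "ACGT\015TTGA\015",
    "AC\015\015\012GT",
    "",
);

my $mismatches = 0;
for my $s (@samples) {
    $mismatches++ if normalize_regex($s) ne normalize_tr($s);
}
print "mismatches: $mismatches\n";
```

This only checks that the outputs match on a few small strings. To compare
timings on a large sequence, the same two subroutines could be passed to
cmpthese() from the core Benchmark module.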
> 
> In addition, changing:
> 
>     defined $sequence && $sequence =~ s/\s//g;        # Remove whitespace
> 
> to:
> 
>     defined $sequence && $sequence =~ tr/ \t\n\r//d;  # Remove whitespace
> 
> in Bio::SeqIO::fasta.pm saves an additional ~20 seconds.
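
One thing that may be worth checking before swapping these: \s matches a form
feed ("\f") as well as space, tab, newline and carriage return, whereas the
tr list above names only those four. The sketch below shows the difference.
The subroutine names and test strings are hypothetical, and whether a form
feed can ever appear in real Fasta input is left as an open question.

```perl
use strict;
use warnings;

# Regex-based removal, as in the current code.
sub strip_regex {
    my ($s) = @_;
    $s =~ s/\s//g;
    return $s;
}

# tr-based removal, as in the proposed change.
sub strip_tr {
    my ($s) = @_;
    $s =~ tr/ \t\n\r//d;
    return $s;
}

# Typical sequence text: both versions give the same result.
my $seq = "ACGT ACGT\tTTGA\r\nGGCC\n";
print "typical input: ",
    (strip_regex($seq) eq strip_tr($seq) ? "same" : "different"), "\n";

# Input containing a form feed: only the regex version removes it.
my $odd = "ACGT\fTTGA";
print "form feed input: ",
    (strip_regex($odd) eq strip_tr($odd) ? "same" : "different"), "\n";
```

If form feeds matter, adding \f to the tr list would presumably close the gap
while keeping tr.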
> 
> There are also problems in reading files with the <> operator when $/ is
> redefined to "\n>", where reading the first line of Fasta files containing
> large sequences takes ~50 seconds, but reading subsequent lines or files
> takes about 1 second. I don't have a work-around for this.
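
For anyone who wants to try reproducing this, here is a small sketch of the
reading pattern described above: <> with $/ set to "\n>". It reads from an
in-memory filehandle with made-up data, so it only illustrates how records
are split. It does not reproduce the slowdown and is not a work-around.

```perl
use strict;
use warnings;

# Hypothetical two-record Fasta text, held in memory for the example.
my $fasta = ">seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n";
open(my $fh, '<', \$fasta) or die "cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";            # each read now ends at the next "\n>"
    while (my $rec = <$fh>) {
        chomp $rec;              # strips a trailing "\n>" if present
        $rec =~ s/^>//;          # the first record keeps its leading ">"
        push @records, $rec;
    }
}
close $fh;

print scalar(@records), " records read\n";
```

To time the first read against later ones on a real file, the in-memory
handle could be replaced with a handle on a large Fasta file and each read
wrapped with time() or Time::HiRes.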
> 
> I would like to ask the mailing list:
> 
> 1. Has anyone else run into this problem? Any fixes?
> 2. Do you think BioPerl should incorporate these changes?
> 
> I plan to submit a bug report to perlbug, but don't know when or if the
> problem will be fixed.
> 
> - David
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l



