[Bioperl-l] Bio::PopGen modules performance

Jason Stajich jason.stajich at duke.edu
Fri Nov 4 15:35:48 EST 2005


My guess is that it has more to do with object creation and teardown
than with the code that actually calculates the statistics.  I'm not
entirely sure how we solve this, since I chose to use rich objects so
that you can pass lots of different kinds of data in.

I wrote simple methods to calculate Tajima's D, Fu & Li's D, etc. just
from the simple counts - see the XX_counts methods for more information.

For example, for Tajima's D, you can call tajima_D_counts with the
number of samples, the number of segregating sites, and pi to just get
back D.  But of course to calculate pi you need to pass a population
object to the pi method, so it doesn't really solve it for you.  Maybe
we can figure out a way to simplify it.  I embraced the object-oriented
approach here to support a flexible design, but I didn't realize the
speed was going to be so bad.
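
To make the counts-only idea concrete, here is a minimal sketch of
Tajima's D computed from just the sample size, the number of
segregating sites, and pi, plus a way to get pi from per-site allele
counts without building any objects.  It is written in Python purely
for illustration and is not the Bio::PopGen code; the function names
are hypothetical, pi is assumed to be the mean number of pairwise
differences summed over sites (not per-site), and sites are assumed to
be biallelic.

```python
import math

def tajima_d_from_counts(n, seg_sites, pi):
    """Tajima's D (Tajima 1989) from n sampled sequences, the number of
    segregating sites, and pi (mean pairwise differences over all sites).
    Returns None when D is undefined (no variation or too few samples)."""
    if n < 2 or seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return None
    return (pi - seg_sites / a1) / math.sqrt(variance)

def pi_from_site_counts(n, allele_counts):
    """pi from per-site counts of one allele at biallelic sites.
    A site where k of n sequences carry the allele contributes
    k * (n - k) differing pairs out of n * (n - 1) / 2 pairs."""
    pairs = n * (n - 1) / 2.0
    return sum(k * (n - k) for k in allele_counts) / pairs
```

The point of the second function is that pi only needs one integer per
site, so the per-individual and per-genotype objects are not required
for this particular statistic.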


You can certainly use Kevin Thornton's msstats, which is going to be a
bazillion (approx.) times faster than the bioperl object code:
http://molpopgen.org/software/libsequence_html/libsequence.htm
(search down the page for msstats).
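
If you would rather stay in a scripting language, the same shortcut of
skipping the object layer can be applied to the ms output directly.
The following is only a rough Python sketch, not msstats and not
Bio::PopGen: it assumes the usual ms text layout, where each replicate
starts with a "//" line followed by "segsites:" and "positions:" lines
and then one string of 0/1 characters per sampled sequence.  The
function names are hypothetical.

```python
def ms_replicates(handle):
    """Yield one list of 0/1 haplotype strings per ms replicate."""
    haplotypes = None
    for line in handle:
        line = line.strip()
        if line.startswith("//"):
            # A new replicate begins; hand back the previous one, if any.
            if haplotypes is not None:
                yield haplotypes
            haplotypes = []
        elif haplotypes is not None and line and set(line) <= {"0", "1"}:
            haplotypes.append(line)
    if haplotypes is not None:
        yield haplotypes

def summarize(handle):
    """Yield (n, segregating_sites, pi) for each replicate, where pi is
    the mean number of pairwise differences summed over sites."""
    for haps in ms_replicates(handle):
        n = len(haps)
        if n < 2:
            yield n, 0, 0.0
            continue
        # Count the '1' allele at each site (column) across sequences.
        counts = [sum(h[j] == "1" for h in haps) for j in range(len(haps[0]))]
        seg = [k for k in counts if 0 < k < n]
        pairs = n * (n - 1) / 2.0
        yield n, len(seg), sum(k * (n - k) for k in seg) / pairs

# Typical use, reading ms output piped on standard input:
#   import sys
#   for n, s, pi in summarize(sys.stdin):
#       print(n, s, pi)
```

Each replicate is reduced to three numbers in a single pass, which is
exactly the input the counts-based Tajima's D calculation needs.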

I am hoping someone will have a magic Perl insight on how to do OO  
better one day - and may that be a day before Perl6!



On Nov 4, 2005, at 2:18 PM, Bingshan Li wrote:

> Hi all,
>
> I used the Bio::PopGen modules to calculate various statistics such as
> Tajima's D, pi, and so on. For a single data set, the performance is
> fine. But to get a sense of significance, I used Hudson's "ms" program
> to generate 10000 simulated populations. When I ran the Bio::PopGen
> modules on the 10000 samples, it took a long time (600 samples
> finished in about 10 hours, with a population size of about 200 and
> about 500 segregating sites). If I have, say, 100 data sets, and each
> one needs 10000 simulated populations, I do not think it is doable. I
> am wondering whether this is expected for these modules, or whether I
> can improve the performance by optimizing my code. I think 10000
> simulations are typical for population genetics analysis. Does anybody
> have experience with this issue, and can anyone give me suggestions
> about the performance?
>
> Thanks a lot!
>
> --bs

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12



