[Bioperl-l] Bio::PopGen modules performance

Albert Vilella avilella at ub.edu
Fri Nov 4 16:29:45 EST 2005


If your datasets are annotations inside large syntenic regions (Mbp-Gbp
scale) and you are interested in sliding-window or multiresolution
analysis...

<self_promotion=on>
you may be interested in trying VariScan:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15814564&query_hl=1
<self_promotion=off>

Otherwise, for "gene-like" sequences (kbp-Mbp scale), Kevin's
libsequence+analyse is a fantastic tool and, I believe, the best choice.

Cheers,

    Albert.

On Fri, 4 Nov 2005 at 15:35 -0500, Jason Stajich wrote:
> My guess is that it has more to do with object creation/teardown than
> with the actual code calculating the statistics. I'm not entirely sure
> how we solve this, as I chose to use rich objects so that you can
> pass lots of different kinds of data in.
> 
> I wrote simple methods to calculate Tajima's D, Fu & Li's D, etc. just
> from the simple counts; see the XX_counts methods for more information.
> 
> For example, for Tajima's D, you can call tajima_D_counts with the
> number of samples, the number of sites, and pi to get back just D. But
> of course, to calculate pi you need to pass a population object to the
> pi method, so it doesn't really solve the problem for you. Maybe we can
> figure out a way to simplify it. I embraced the object-oriented
> approach here to support a flexible design, but I didn't realize the
> speed was going to be so bad.
> 
> 
> You can certainly use Kevin Thornton's msstats, which is going to be
> a bazillion (approx.) times faster than the bioperl object code:
> http://molpopgen.org/software/libsequence_html/libsequence.htm
> (search the page for msstats)
> 
> I am hoping someone will have a magic Perl insight on how to do OO  
> better one day - and may that be a day before Perl6!
> 
> 
> 
> On Nov 4, 2005, at 2:18 PM, Bingshan Li wrote:
> 
> > Hi all,
> >
> > I used the Bio::PopGen modules to calculate various statistics such
> > as Tajima's D, pi, and so on. For a single data set, the performance
> > is fine. But to get a sense of significance, I used Hudson's "ms"
> > program to generate 10000 simulated populations. When I ran the
> > Bio::PopGen modules on the 10000 samples, it took a long time (600
> > samples finished in about 10 hours; population size about 200, about
> > 500 segregating sites). If I have a set of, say, 100 data sets, and
> > for each one I need 10000 simulated populations, I do not think it is
> > doable. I am wondering whether this is expected for these modules or
> > whether I can improve the performance by optimizing my code. I think
> > 10000 simulations are typical for population genetics analysis. Does
> > anybody have experience with this issue, and can anyone give me any
> > suggestions about the performance?
> >
> > Thanks a lot!
> >
> > --bs
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12
> 
> 

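To illustrate the counts-based shortcut Jason describes above, here is a
minimal sketch of computing Tajima's D directly from the number of
samples, the number of segregating sites, and pi, with pi itself taken
from plain 0/1 haplotype strings such as those ms prints, so no
per-individual objects are created. This is Python, not BioPerl, and it
is not the Bio::PopGen implementation: the function names are
hypothetical, pi here means the mean number of pairwise differences per
locus (not per site), and the formula is the standard one from Tajima
(1989). The last helper shows one way to turn simulated D values into a
one-sided empirical p-value.

```python
from math import sqrt


def segregating_sites_and_pi(haplotypes):
    """Return (S, pi) for equal-length strings of '0'/'1' characters.

    S is the number of polymorphic columns; pi is the mean number of
    pairwise differences per locus. Hypothetical helper, not a BioPerl call.
    """
    n = len(haplotypes)
    pairs = n * (n - 1) / 2
    s, pi = 0, 0.0
    for column in zip(*haplotypes):      # iterate site by site
        k = column.count("1")            # copies of the derived allele
        if 0 < k < n:
            s += 1
            pi += k * (n - k) / pairs    # pairs that differ at this site
    return s, pi


def tajima_d_from_counts(n, s, pi):
    """Tajima's D from sample size n, segregating sites s, and pi.

    Returns None when s == 0, where D is undefined.
    """
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    if s == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def empirical_p_lower(observed, simulated):
    """Fraction of simulated values less than or equal to the observed one."""
    return sum(1 for d in simulated if d <= observed) / len(simulated)


if __name__ == "__main__":
    # Toy sample of four haplotypes; real input would come from ms output.
    sample = ["1100", "1010", "0110", "0001"]
    s, pi = segregating_sites_and_pi(sample)
    print(s, pi, tajima_d_from_counts(len(sample), s, pi))
```

Because each replicate is reduced to column counts, the cost per
simulated sample is a single pass over the haplotype matrix. Whether
that is fast enough for 10000 replicates per data set would need to be
measured; msstats remains the tool recommended in the thread.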