[Bioperl-l] Bio::PopGen modules performance
Albert Vilella
avilella at ub.edu
Fri Nov 4 16:29:45 EST 2005
If your datasets are annotations inside large syntenic regions (Mbp-Gbp
scale) and you are interested in sliding-window or multiresolution
analysis...
<self_promotion=on>
you may be interested in trying VariScan:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15814564&query_hl=1
<self_promotion=off>
Otherwise, for "gene-like" sequences (kbp-Mbp scale), Kevin's
libsequence+analyse is a fantastic tool, and I believe the best choice.
Cheers,
Albert.
On Fri, 4 Nov 2005 at 15:35 -0500, Jason Stajich wrote:
> My guess is it has more to do with the object creation/teardown than
> the actual code calculating the statistics. I'm not entirely sure
> how we solve this, as I chose to use rich objects so that you can
> pass lots of different kinds of data in.
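
A minimal sketch (Python, not BioPerl code) of how the two inputs most of these statistics need, the number of segregating sites S and pi, could be computed straight from 0/1 haplotype strings such as the rows printed by ms, without building an object per individual or genotype. The function name is hypothetical, and the code assumes biallelic sites with no missing data.

```python
from typing import List, Tuple


def seg_sites_and_pi(haplotypes: List[str]) -> Tuple[int, float]:
    """Return (S, pi) for equal-length 0/1 haplotype strings.

    S is the number of segregating sites. pi is the mean number of
    pairwise differences summed over sites (not divided by length).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    if len({len(h) for h in haplotypes}) != 1:
        raise ValueError("haplotypes must all have the same length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pairwise_diffs = 0
    # zip(*haplotypes) walks the alignment one column (site) at a time.
    for column in zip(*haplotypes):
        derived = column.count("1")
        if 0 < derived < n:
            seg_sites += 1
            # Every pairing of a '1' with a '0' is one pairwise difference.
            pairwise_diffs += derived * (n - derived)
    return seg_sites, pairwise_diffs / n_pairs


if __name__ == "__main__":
    print(seg_sites_and_pi(["0101", "0011", "0001"]))
```

Counting alleles per column costs time proportional to samples times sites, rather than comparing every pair of sequences, which matters when the same calculation is repeated over thousands of simulated replicates.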
>
> I wrote simple methods to calc Tajima's D, Fu & Li's D, etc just from
> the simple counts - see the XX_counts method for more information.
>
> For example, for Tajima's D, you can call tajima_D_counts with the #
> samples, # sites, and pi to just get back D. But of course to
> calculate pi you need to pass in a population object to the pi method
> so it doesn't really solve it for you. Maybe we can figure out a way
> to simplify it. I embraced the object-oriented approach here to
> support a flexible design, but I didn't realize the speed was going
> to be so bad.
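
For illustration, here is a sketch in Python of the counts-only calculation described above, using the standard formula from Tajima (1989). It is not the Bio::PopGen implementation. The function name is hypothetical, and the argument order simply follows the description (number of samples, number of segregating sites, pi). Here pi is taken to be the average number of pairwise differences over the whole region, not a per-site value.

```python
import math


def tajima_d_from_counts(n_samples: int, n_seg_sites: int, pi: float) -> float:
    """Tajima's D from sample size, segregating sites and pi (Tajima 1989)."""
    n, s = n_samples, n_seg_sites
    if n < 2:
        raise ValueError("need at least two sampled sequences")

    # Harmonic-type sums over 1..n-1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))

    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        # D is undefined, e.g. with no segregating sites or only two samples.
        return math.nan
    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    print(tajima_d_from_counts(10, 16, 3.888889))
```

Because the function takes only three numbers, it can be applied to each simulated replicate once S and pi have been counted, with no population object involved.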
>
>
> You can certainly use Kevin Thornton's msstats, which is going to be a
> bazillion (approx.) times faster than the bioperl object code.
> http://molpopgen.org/software/libsequence_html/libsequence.htm
> search down for msstats
>
> I am hoping someone will have a magic Perl insight on how to do OO
> better one day - and may that be a day before Perl6!
>
>
>
> On Nov 4, 2005, at 2:18 PM, Bingshan Li wrote:
>
> > Hi all,
> >
> > I used the Bio::PopGen modules to calculate various statistics such
> > as Tajima's D, pi and so on. For a single data set, the performance
> > is fine. But to get a sense of significance, I used Hudson's "ms"
> > program to generate 10000 simulated populations. When I ran the
> > Bio::PopGen modules on the 10000 samples, it took a long time (600
> > samples finished in about 10 hours; population size about 200, about
> > 500 segregating sites). If I have, say, 100 data sets and need 10000
> > simulated populations for each, I do not think it is doable. I am
> > wondering whether this is expected for these modules or whether I
> > can improve the performance by optimizing my code. I think 10000
> > simulations are typical for population genetics analysis. Does
> > anybody have experience with this issue, and can anyone give me any
> > suggestions about the performance?
> >
> > Thanks a lot!
> >
> > --bs
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
>
> --
> Jason Stajich
> Duke University
> http://www.duke.edu/~jes12