[Bioperl-l] Fwd: [Bioperl-guts-l] [BioPerl - Bug #3328] (New) segregating sites calculation fails on gapped sequences

Fri Feb 17 17:42:29 UTC 2012

This should be an easy bug for someone to fix. I am pretty sure the solution is to ignore gapped columns, but I haven't looked deeper and I don't have time right now to work on BioPerl fixes, so it would be great if someone wanted to help out here.
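
To make "ignore gapped columns" concrete, here is a rough sketch in Python. It is not the Bio::PopGen code and not a patch. The function name site_counts, the skip_gaps flag and the choice of gap characters are all made up for illustration, and "singleton site" is assumed to mean a segregating column in which some allele appears in exactly one sequence. Treating the gap as an ordinary allele (skip_gaps=False) is only my guess at how the counts get inflated; I have not confirmed that in the Perl source.

```python
from collections import Counter

# Alignment from the bug report: only Genotype1 has the first 12 bases.
ALIGNMENT = [
    "ATGATCGTAGCTGATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
    "------------GATGCTGTGATCGATCGCTAGCTAGCTCGA",
]


def site_counts(seqs, skip_gaps=True, gap_chars="-?"):
    """Return (segregating, singletons, skipped) for aligned sequences.

    Hypothetical helper, not a BioPerl API. A column containing any
    character from gap_chars is skipped when skip_gaps is true.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same aligned length")

    segregating = singletons = skipped = 0
    for column in zip(*seqs):
        if skip_gaps and any(c in gap_chars for c in column):
            skipped += 1
            continue
        counts = Counter(c.upper() for c in column)
        if len(counts) < 2:
            continue  # monomorphic column
        segregating += 1
        if 1 in counts.values():
            singletons += 1  # some allele is seen in exactly one sequence
    return segregating, singletons, skipped


print(site_counts(ALIGNMENT))                   # (0, 0, 12)
print(site_counts(ALIGNMENT, skip_gaps=False))  # (12, 12, 0)
```

With gapped columns skipped, the example alignment has no segregating sites and no singletons. With the gap counted as an allele, all 12 leading columns show up as singleton sites.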

The redmine bug info is appended below.

Jason

Begin forwarded message:

> From: redmine at redmine.open-bio.org
> Subject: [Bioperl-guts-l] [BioPerl - Bug #3328] (New) segregating sites calculation fails on gapped sequences
> Date: February 17, 2012 9:39:42 AM PST
> To: bioperl-guts-l at lists.open-bio.org
> 
> 
> Issue #3328 has been reported by Jason Stajich.
> 
> ----------------------------------------
> Bug #3328: segregating sites calculation fails on gapped sequences
> https://redmine.open-bio.org/issues/3328
> 
> Author: Jason Stajich
> Status: New
> Priority: Normal
> Assignee: Bioperl Guts
> Category: Bio::PopGen
> Target version: 
> URL: 
> 
> 
> 
>   I am Cheng-Ruei Lee, a graduate student in Duke Biology. I am analyzing many DNA alignments of a plant species.
>   I first used Bio::PopGen::Utilities->aln_to_population() to read in the FASTA-format alignments, and then used Bio::PopGen::Statistics to calculate some statistics without an outgroup. Most genes work fine, but I think there is a bug when it meets alignments like this:
> 
> >Genotype1
> ATGATCGTAGCTGATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype2
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype3
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> >Genotype4
> ------------GATGCTGTGATCGATCGCTAGCTAGCTCGA
> 
>   I got this data set from other people. I guess that, because of the annotation program they used, the coding sequence is defined as much longer in genotype 1 than in the other genotypes. This creates a long stretch of gaps at the very beginning. Whenever Bio::PopGen meets this kind of gene, the singleton count increases a lot; it seems the long stretch of gapped sites is also counted as singletons. Some Fu & Li statistics are inflated as well. The "number of segregating sites" seems not to be affected. (As a result, there are genes with hundreds of singleton sites but only a few total segregating sites.)
>   Might this be a bug in Bio::PopGen::Utilities when reading in the data? Or in the singleton calculation?
> 
> Sincerely,
> Cheng-Ruei Lee <cl134 at duke.edu>
> 
> 

Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org