[Bioperl-l] SiteMatrix changes

Thu Aug 31 17:48:49 UTC 2006

On Aug 31, 2006, at 12:47 PM, Sendu Bala wrote:

> skirov wrote:
>>  Add 1 (or the user supplied correction
>> value) to any position that has 0. Perhaps you are right (if I
>> understood correctly) and 1 should be added to everything if any
>> position contains 0. I am not really sure abut this.
>
> If you're going to do the correction, you always do it, not just when
> one of the positions contains 0.

That's mostly true. As the wikipedia piece indicated, you need pseudo- 
counts mostly in applications of probabilistic methods to count data  
for which at least one of the events has a low (but non-zero)  
likelihood of being observed. That is so because when you compute  
joint probabilities and only a single probability of many is zero,  
then the joint probability is automatically zero too, which is seldom  
if ever correct or helpful.

Adding the same pseudo-count to each event count in a Bayesian sense  
is like starting with a uniform prior distribution over all events  
and then adjust the likelihood estimates based on observed data. If  
there are many observations (i.e., N(obs) >> N(events)), the  
contribution of the pseudo-counts (i.e., prior distribution) to the  
resulting estimates will be negligible. If there are few  
observations, the prior (non-informative) distribution will still be  
visible in the frequency estimates (reflecting that the amount of  
data then doesn't support extreme estimates).

You may add non-uniform pseudo-counts to reflect a non-uniform prior  
distribution. However, having zero in a pseudo-count slot means the  
prior probability for that event is exactly zero, which would be a  
gap in the distribution and I'm not sure you would ever desire that.

BTW just to state the obvious, when computing the frequency estimate,  
your sum of pseudo-counts must be added in the denominator.

As for applying pseudo-counts by default, I would certainly not do  
this. For people who don't read the documentation (which would  
document the pseudo-count option) and feed the module with  
insufficient data, I think it is always better to return obviously  
non-sensical results, than return results that at least on the  
surface look sensible but require interpretation, since with using  
pseudo-counts you need to know how much data you have seen in order  
to disambiguate whether the estimates you are seeing truly reflect  
your observations or rather mirror the influence of the pseudo-counts.

My $0.02.

	-hilmar

>
>
>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>> N. If you want it to return 'R' in this case, the IUPAC method
>>> would need to be extended to allow input of a user-defined
>>> threshold defining what frequencies to ignore.
>>
>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>> T is 0.00004 you would have N???
>
> Yes, because that is correct. There's no way to judge automatically  
> what
> frequencies should be ignored, unless we know exactly how and if  
> psuedo
> count correction was done. Not guessing and being correct is better  
> than
> guessing to give 'nicer' answers but sometimes being completely wrong
> (and always being strictly wrong). Really, if you want a nice IUPAC
> string you must simply chose not to do pseudo count correction.
>
>
>> Allowing customer supplied thresholds is not a bad idea, you could
>> implement it if you wish. But please do not fix something that is not
>> broken.
>
> It was giving the wrong answers, so needed fixing. I'll add the
> threshold option.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================