[Bioperl-l] SiteMatrix changes

skirov skirov at utk.edu
Thu Aug 31 17:26:48 UTC 2006


>===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
>skirov wrote:
>>> What has adding the number 1 to some but not all input numbers got
>>> to do with pseudo counts? Can you explain your thinking?
>>
>> The code was: if ($self->{_corrected}) { ${$self->{probA}}[$i] +=
>> $self->{_correction}; ${$self->{probC}}[$i] += $self->{_correction};
>> ${$self->{probG}}[$i] += $self->{_correction}; ${$self->{probT}}[$i]
>> += $self->{_correction}; } Add 1 (or the user supplied correction
>> value) to any position that has 0. Perhaps you are right (if I
>> understood correctly) and 1 should be added to everything if any
>> position contains 0. I am not really sure abut this.
>
>If you're going to do the correction, you always do it, not just when
>one of the positions contains 0. I imagine you were detecting 0 as an
>indicator that no correction had been done, but its possible no
>correction has been done even if none of the positions has a 0.

Well, if none of the positions is 0, no correction is really necessary.

>
>>> Why is having 0 a bad idea?
>>
>> Here is a wikipedia explanation: "In any observed data set or sample
>> there is the possibility, especially with low-probability events
>> and/or small data sets, of a possible event not occurring. Its
>> observed frequency is therefore 0, implying a probability of 0. This
>> is an oversimplification and is often unhelpful, particularly in
>> probability-based machine learning techniques such as artificial
>> neural networks and hidden Markov models."
>
>Yes, and it goes on to say:
>
>"The simplest approach is to add 1 to each observed number of events
>including the zero-count one. [...] A more complex approach is to
>estimate the probability of the events from other factors and adjust
>accordingly. Neither approach is completely satisfactory and both are a
>bit of a fudge."
>
>
>>> I don't think the module should be trying to do any kind of
>>> analysis, especially given that it has no idea of the source of its
>>> input data. It must just accept what it is given. If a user or
>>> other module wants to do pseudo-count correction, they can do it
>>> themselves in the most appropriate way for their data.
>>
>> You are wrong here- this gives an option to the user since correction
>> can be disabled (which should be the case with frequencies.). In most
>> cases pseudo counts are necessary and that is why this should be the
>> default behavior.
>
>But there are lots of possible algorithms for doing pseudo count
>correction, adding a number being the worst. There's no gain to adding
>the number. The option only serves to a) hide the problem but not fix
>it, and b) give the wrong frequencies. Two dangerous things.
>
>But ok, I'll revert to the previous behaviour and make the docs very
>clear about what is going on. My main problem was your implementation of
>it, which gave really bad results since you didn't apply the correction
>to everything.
Bad results meaning? There was an error in the constructor method, because 
corrections were used even if the input is a frequency. Even in this case you 
can say -correction=>0.0000001 and be fine. So the real problem here is lack 
of documentation. I wish you would have asked me before changing the code, 
then most of this email exchange would have been unnecessary.

>
>
>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>> N. If you want it to return 'R' in this case, the IUPAC method
>>> would need to be extended to allow input of a user-defined
>>> threshold defining what frequencies to ignore.
>>
>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>> T is 0.00004 you would have N???
>
>Yes, because that is correct.

No it is not- IUPAC is an approximation of a PFM, which is very different from 
what you are doing. It is flat out wrong and IUPAC consensus has nothing to do 
with the correction.

There's no way to judge automatically what
>frequencies should be ignored, unless we know exactly how and if psuedo
>count correction was done. Not guessing and being correct is better than
>guessing to give 'nicer' answers but sometimes being completely wrong
>(and always being strictly wrong). Really, if you want a nice IUPAC
>string you must simply chose not to do pseudo count correction.
Sure, when you create the object add -correction=>0.
>
>
>> Allowing customer supplied thresholds is not a bad idea, you could
>> implement it if you wish. But please do not fix something that is not
>> broken.
>
>It was giving the wrong answers, so needed fixing. I'll add the
>threshold option.

No it was not giving wrong answers, the error was not in this function, so it 
did not need fixing. Please add the default thresholds or leave it unchanged 
and I will do it.





More information about the Bioperl-l mailing list