[Bioperl-l] SiteMatrix changes
skirov
skirov at utk.edu
Thu Aug 31 15:57:51 UTC 2006
>===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
>Stefan Kirov wrote:
>> Perhaps I do not understand your idea, but it seems to me the changes
>> you made to SiteMatrix are wrong. Why did you have to remove the
>> pseudo-counts? The correction can be set to 0 which will disable it
>> in case this is necessary. Pseudo counts are intended to account for
>> the probabilistic uncertainty.
>
>What has adding the number 1 to some but not all input numbers got to do
>with pseudo counts? Can you explain your thinking?
The code was:
    if ($self->{_corrected}) {
        ${$self->{probA}}[$i] += $self->{_correction};
        ${$self->{probC}}[$i] += $self->{_correction};
        ${$self->{probG}}[$i] += $self->{_correction};
        ${$self->{probT}}[$i] += $self->{_correction};
    }
Add 1 (or the user-supplied correction value) to any position that has 0.
Perhaps you are right (if I understood correctly) and 1 should be added to
everything if any position contains 0. I am not really sure about this.
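To make the two alternatives concrete, here is a minimal sketch. The helper
correct_column and its 'zeros'/'all' modes are hypothetical names made up for
this illustration; they are not part of SiteMatrix.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SiteMatrix.
# $column is a hashref of raw counts for one position,
# e.g. { A => 4, C => 0, G => 1, T => 0 }.
# Mode 'zeros' corrects only the zero cells; mode 'all' corrects every
# cell of a column that contains at least one zero.
sub correct_column {
    my ($column, $correction, $mode) = @_;
    my %out = %$column;                                # work on a copy
    my $has_zero = grep { $out{$_} == 0 } keys %out;   # number of zero cells
    return \%out unless $has_zero && $correction;
    for my $base (keys %out) {
        next if $mode eq 'zeros' && $out{$base} != 0;
        $out{$base} += $correction;
    }
    return \%out;
}

my %col = (A => 4, C => 0, G => 1, T => 0);
my $zeros_only = correct_column(\%col, 1, 'zeros');    # A=4 C=1 G=1 T=1
my $all_cells  = correct_column(\%col, 1, 'all');      # A=5 C=1 G=2 T=1
```

Note that in 'zeros' mode the unobserved C and T end up level with G, which
was observed once, while 'all' mode keeps the order of the counts.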
>
>
>> On the other hand the correction should be disabled by default if
>> frequencies instead of raw counts are used for the construction of the
>> object (still having 0 is a bad idea).
>
>Why is having 0 a bad idea?
Here is Wikipedia's explanation:
"In any observed data set or sample there is the possibility, especially with
low-probability events and/or small data sets, of a possible event not
occurring. Its observed frequency is therefore 0, implying a probability of 0.
This is an oversimplification and is often unhelpful, particularly in
probability-based machine learning techniques such as artificial neural
networks and hidden Markov models."
>It is correct if the user is creating a
>simple count-based matrix. I don't think the module should be trying to
>do any kind of analysis, especially given that it has no idea of the
>source of its input data. It must just accept what it is given. If a
>user or other module wants to do pseudo-count correction, they can do it
>themselves in the most appropriate way for their data.
You are wrong here: this gives the user an option, since the correction can be
disabled (which should be the case with frequencies). In most cases pseudo
counts are necessary, and that is why this should be the default behavior.
>
>I can't imagine that sometimes adding 1 is /ever/ an appropriate way of
>doing it, but please explain if it is.
This is a parameter, so it can be changed. Why 1? Search for Laplace's rule of
succession.
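As a rough illustration of the rule of succession applied to one column: each
of the four bases gets the pseudo count added before normalizing, so the
estimate is (count + 1) / (total + 4) and no base gets a probability of
exactly 0. The helper smoothed_probs is a hypothetical name for this sketch,
not something in the module.

```perl
use strict;
use warnings;

# Hypothetical helper: additive (Laplace) smoothing of one column of counts.
# With $pseudo = 1 this is the rule of succession: (n + 1) / (N + 4).
sub smoothed_probs {
    my ($counts, $pseudo) = @_;
    my @bases = sort keys %$counts;
    my $total = 0;
    $total += $counts->{$_} for @bases;
    my $denom = $total + $pseudo * scalar(@bases);
    my %prob;
    $prob{$_} = ($counts->{$_} + $pseudo) / $denom for @bases;
    return \%prob;
}

# Counts 4, 0, 1, 0 give A = 5/9, C = 1/9, G = 2/9, T = 1/9.
my $p = smoothed_probs({ A => 4, C => 0, G => 1, T => 0 }, 1);
printf "%s %.4f\n", $_, $p->{$_} for sort keys %$p;
```

This matters as soon as a log-odds score is computed from the matrix, because
the logarithm of a zero probability is undefined.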
>
>
>> Next, the rules you have enforced for the IUPAC do not make sense to
>> me. For example in case the frequency for A is 0.45, G 0.45, C 0.05
>> and T 0.05, according to your rules the result would be N, which makes
>> no sense.
>
>Why does that make no sense? IUPAC has no concept of frequencies, nor does it
>have a cutoff. When there is a chance of all four bases (complete ambiguity),
>the IUPAC code is N. If you want it to return 'R' in this case, the
>IUPAC method would need to be extended to allow input of a user-defined
>threshold defining what frequencies to ignore.
So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and T is
0.00004 you would have N? Allowing user-supplied thresholds is not a bad
idea; you could implement it if you wish. But please do not fix something that
is not broken.
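For what it is worth, a threshold version could look roughly like the sketch
below. The name iupac_with_threshold and the strict "greater than" comparison
are assumptions made for illustration, not the module's existing IUPAC method;
the letter table is the standard IUPAC nucleotide ambiguity code.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases present.
my %IUPAC = (
    A   => 'A', C   => 'C', G   => 'G', T   => 'T',
    AC  => 'M', AG  => 'R', AT  => 'W',
    CG  => 'S', CT  => 'Y', GT  => 'K',
    ACG => 'V', ACT => 'H', AGT => 'D', CGT => 'B',
    ACGT => 'N',
);

# Hypothetical helper: only bases whose frequency is strictly greater than
# $threshold count as present. A threshold of 0 keeps every non-zero base.
sub iupac_with_threshold {
    my ($freq, $threshold) = @_;
    my $key = join '', grep { $freq->{$_} > $threshold } sort keys %$freq;
    return $IUPAC{$key};    # undef if no base exceeds the threshold
}

my %two_way = (A => 0.45, C => 0.05, G => 0.45, T => 0.05);
my %one_way = (A => 0.9999, C => 0.00002, G => 0.00004, T => 0.00004);

print iupac_with_threshold(\%two_way, 0.1), "\n";   # R
print iupac_with_threshold(\%two_way, 0),   "\n";   # N
print iupac_with_threshold(\%one_way, 0.1), "\n";   # A
print iupac_with_threshold(\%one_way, 0),   "\n";   # N
```

With a threshold of 0 this reproduces the "any chance of all four bases gives
N" rule, so both behaviors are available from the same call.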
Stefan
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l