[Bioperl-l] SiteMatrix changes
Sendu Bala
bix at sendu.me.uk
Thu Aug 31 18:50:28 UTC 2006
skirov wrote:
>
>>>> IUPAC has no concept of frequencies or have a cutoff. When there is
>>>> a chance of all four bases (complete ambiguity), the IUPAC code is
>>>> N. If you want it to return 'R' in this case, the IUPAC method
>>>> would need to be extended to allow input of a user-defined
>>>> threshold defining what frequencies to ignore.
>
>>> So are you saying that if A is 0.9999, C is 0.00002, G is 0.00004 and
>>> T is 0.00004 you would have N???
>
>> Yes, because that is correct.
>
> No it is not- IUPAC is an approximation of a PFM, which is very different from
> what you are doing.
The IUPAC-IUB nomenclature for incompletely specified bases in nucleic
acid sequences is just that and no more. They were not trying to
'approximate a PFM'. It is a nomenclature for describing uncertainty.
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
If you tell me that you are uncertain what the nucleic acid is by giving
me a non-zero probability of having all nucleic acids, the appropriate
IUPAC symbol is 'N'.
>> There's no way to judge automatically what
>> frequencies should be ignored, unless we know exactly how and if psuedo
>> count correction was done. Not guessing and being correct is better than
>> guessing to give 'nicer' answers but sometimes being completely wrong
>> (and always being strictly wrong). Really, if you want a nice IUPAC
>> string you must simply chose not to do pseudo count correction.
>
> Sure, when you create the object add -correction=>0.
Right, so you're happy with getting IUPAC strings of all 'N' unless
-correction => 0 has been used? When a -correction is requested, the
only way to get non Ns should be to supply a threshold. There could be a
default threshold, but what would it be?
>>> Allowing customer supplied thresholds is not a bad idea, you could
>>> implement it if you wish. But please do not fix something that is not
>>> broken.
>>
>> It was giving the wrong answers, so needed fixing. I'll add the
>> threshold option.
>
> No it was not giving wrong answers, the error was not in this function, so it
> did not need fixing.
If you revert just _calculate_consensus() and _to_IUPAC() to the
previous implementation, test 34 on line 144 of t/psm.t gives the answer:
VVDCAGGTGBYD
having parsed t/data/transfac.dat
That file has the matrix:
P0 A C G T
01 1 2 2 0
02 2 1 2 0
03 3 0 1 1
04 0 5 0 0
05 5 0 0 0
06 0 0 4 1
07 0 1 4 0
08 0 0 0 5
09 0 0 5 0
10 0 1 2 2
11 0 2 0 3
12 1 0 3 1
The correct answer here is:
VVDCAKSTGBYD
To call positions 6 and 7 Gs is wrong. This is the problem with having a
default threshold. It is dangerous in sometimes giving the wrong answer
and the user would never realize.
If we revert all my changes the answer given is:
NNNNNNNNNNNN
If you think you can fix this bug another way (that isn't specific to
Bio::Matrix::PSM::IO::transfac), I'll revert the changes and create a
bug report for you.
More information about the Bioperl-l
mailing list