[Bioperl-l] IUPAC support for DNA alignment
Alexie Papanicolaou
apapanicolaou at ice.mpg.de
Fri Jun 27 10:02:08 UTC 2008
Hello
I'm the user who asked for it. I don't know of any conventions, but
perhaps people can help with this?
I'm not an expert at all but here is my opinion:
If you don't know the codon position (or even whether the sequence is
coding) then you can't estimate the codon degeneracy. If you don't know
the frequency of the bases represented in the degenerate site then you
can't model it on the DNA level either. So any solution will be ad hoc.
Regarding 2-base degenerate positions: suppose the alignment is between,
say, a population that is polymorphic at that site and one that is not,
and the user is interested in the distance between the populations.
Then it would make sense to give the position the full match score.
Regarding 3 bases: I don't really know (see N below) but I'd go for a
full match again, assuming the user built the consensus.
Regarding N:
I think this is more likely to be missing data. I doubt you can have a
SNP occurring four times in the same position (three times are expected
under infinite sites too, for that matter). Otherwise the consensus was
derived from very diverged sequences. I therefore wouldn't score N.
Regarding X:
One shouldn't find that in a DNA alignment unless it is a mask. I'd
expect no score there as well.
My /practical/ suggestion would be to let the user define it, as you
allow for the other options, perhaps even allowing 2-fold and 3-fold
degenerate IUPAC codes to be given different scores. That might save you
(the owner) some future work when the user wants it...
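To make the suggestion concrete, here is a rough sketch of such a
user-definable scoring rule. It is written in Python for brevity and is
not Bio::Tools::dpAlign's interface: the names IUPAC and iupac_score and
the fold2, fold3 and unknown parameters are made up for illustration,
and the defaults simply encode my opinions above (full match for 2-fold
and 3-fold codes, no score for N and X).

```python
# Hypothetical sketch: user-configurable scores for IUPAC nucleotide codes.
# This is NOT the Bio::Tools::dpAlign API; every name here is illustrative.

# Standard IUPAC nucleotide codes and the bases each one stands for.
# U is treated as T so that DNA/RNA comparisons count as matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_score(x, y, match=3, mismatch=-1, fold2=None, fold3=None, unknown=0):
    """Score one aligned pair of nucleotide symbols.

    fold2 / fold3: score for a compatible pair whose most degenerate
    member is a 2-fold / 3-fold code; None means "use the full match score".
    unknown: score for any pair involving N or X (treated as missing data
    or a mask, so unscored by default).
    Symbols outside the IUPAC table (e.g. gaps) raise KeyError.
    """
    x, y = x.upper(), y.upper()
    if x in ("N", "X") or y in ("N", "X"):
        return unknown

    bases_x, bases_y = set(IUPAC[x]), set(IUPAC[y])
    if not (bases_x & bases_y):
        # No base in common, e.g. A vs C or A vs Y.
        return mismatch

    degeneracy = max(len(bases_x), len(bases_y))
    if degeneracy == 2 and fold2 is not None:
        return fold2
    if degeneracy == 3 and fold3 is not None:
        return fold3
    return match
```

With the defaults above (match +3, mismatch -1, as in Yee Man's
example), T/U, A/W and A/D all score +3, A/N and A/X score 0, and A/C
scores -1. Passing e.g. fold2=2 or fold3=1 would down-weight the
degenerate codes instead.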
many thanks to anyone who can help,
alexie
PS. Yee Man had cleverly suggested a workaround: one can use the Protein
Matrix to create a scoring matrix. That might require some caution,
such as remembering to reset the alphabet, though?
Yee Man Chan wrote:
> Hi all
>
> I am the owner of Bio::Tools::dpAlign. A user emailed me to add
> support for IUPAC nucleotide codes. I am ok to add this feature but I
> would like to know what are the conventions to handle these IUPAC codes.
>
> Suppose match is +3 and mismatch is -1. Then what should be the
> score when T matches with U, A with W, A with D, A with N and A with X?
> Does anyone know the conventions?
>
> Thanks a lot.
> Yee Man
>
>
--
"You can't find a hermit to teach you herming, because of course that rather spoils the whole thing."
-- (Terry Pratchett, Small Gods)
Alexie Papanicolaou
Department of Entomology,
Max Planck Institute for Chemical Ecology,
Hans-Knoell-Strasse 8,
D-07745 Jena, Germany.