[Bioperl-l] IUPAC support for DNA alignment
Alexie Papanicolaou
apapanicolaou at ice.mpg.de
Fri Jun 27 14:13:57 UTC 2008
Hello
I guess I didn't give enough info... (also sorry Yee Man, I forgot to CC
you before)
Scenario 1 - polymorphic allele vs non-polymorphic one. e.g.
Let [A/G] be a SNP with two alleles in population A, while the one fixed
allele [G] is in population B.
In this scenario we want to calculate the distance at one locus
between two populations, thus a degenerate site is not the result of
uncertainty but of reality. Obviously the best method is to provide a
matrix (if the user can be bothered), but Yee Man already allows this
option. Personally, I wouldn't really use an alignment score to
measure distance though... The application here is: we first want to
align those two sequences, and there should be no penalty because there
is a SNP in one population (then estimate distance with another algorithm).
Scenario 2 - uncertainty
If the scenario is that [A/G] is the result of uncertainty then I gladly
agree with you! I'm also perplexed about how to score IUPAC codes allowing
for three nucleotides (i.e. there might not be a SNP after all... but then
again infinite sites doesn't have to hold - in some species less than
others...)
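
To make the averaging idea for the uncertainty case concrete, here is a
rough sketch (in Python purely for illustration; this is not BioPerl code
and not Yee Man's implementation). It assumes equal base frequencies by
default and a simple match/mismatch score; the names IUPAC and
averaged_score are made up for this example.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def averaged_score(x, y, match=1.0, mismatch=-1.0, freqs=None):
    """Expected match/mismatch score of two IUPAC codes.

    Each code is expanded to its possible bases, weighted by the
    (assumed) base frequencies renormalised within that code.
    """
    if freqs is None:
        # Assumption: all four bases are equally frequent.
        freqs = {b: 0.25 for b in "ACGT"}
    xs, ys = IUPAC[x.upper()], IUPAC[y.upper()]
    x_total = sum(freqs[b] for b in xs)
    y_total = sum(freqs[b] for b in ys)
    score = 0.0
    for a in xs:
        for b in ys:
            weight = (freqs[a] / x_total) * (freqs[b] / y_total)
            score += weight * (match if a == b else mismatch)
    return score
```

With these defaults, averaged_score("R", "A") averages one match and one
mismatch, so a two-base code against a determined base scores halfway
between the two.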
Scenario 3 - a type of profile alignment to a consensus
In my particular case, I'm doing something different: I have the
consensus of an alignment of multiple sequences (dozens to hundreds
depending on dataset) with some mismatches including a SNP, say [A/G]. A
third sequence that I wish to align has A in that position. So
obviously, it shouldn't be penalized.
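
What I mean by "not penalized" could be sketched like this (Python just
for illustration, not BioPerl code; IUPAC and compatible_score are
hypothetical names, and the match/mismatch values are arbitrary): two
symbols get the full match score whenever the sets of bases they stand
for overlap.

```python
# IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible_score(x, y, match=1, mismatch=-1):
    """Full match score if the two IUPAC codes share at least one base."""
    shared = set(IUPAC[x.upper()]) & set(IUPAC[y.upper()])
    return match if shared else mismatch
```

So a consensus R ([A/G]) against A scores as a match and R against C as a
mismatch. Note that under this rule N matches everything, which may or
may not be what the user wants.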
So it really depends on application and the user should be able to
decide in the end... (Yee Man already provides the option for a protein
substitution matrix). It would be nice if we had the option of
specifying it much more easily though (a simple switch), so I can use it
for scenario 3.
a
ps. sorry, my English is going down the drain...
Hilmar Lapp wrote:
> Hi Alexie,
>
> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>
>> Hello
>>
>> I'm the user who asked for it. I don't know of any conventions but
>> perhaps people can help on this?
>>
>> I'm not an expert at all but here is my opinion:
>> If you don't know the codon position (or even if it is coding) then
>> you can't estimate the codon degeneracy. If you don't know the
>> frequency of the bases represented in the degenerate site then you
>> can't model it either on the DNA level. So any solution will be ad-hoc.
>>
>> Regarding 2 base degenerate positions: My suggestion is that in a
>> situation of alignment between, say a polymorphic and non polymorphic
>> population for that site, and the user is interested in the distance
>> between the populations, it would make sense to have the score to the
>> full match.
>>
>> Regarding 3 bases: I don't really know (see N below) but I'd go for
>> a full match again, assuming the user built the consensus.
>
> are you suggesting that a determined and a degenerate site aligned
> pairwise should score as much as two determined sites?
>
> My (possibly naive) default would be to average over all
> possibilities, each weighted by base frequency (if base frequencies
> are assumed unequal or independent), thus integrating out the
> uncertainty. (For standard matrices, I think this would also result in
> N receiving zero score.)
>
> In the end though, maybe there should be an option for a user to just
> provide a substitution matrix?
>
> -hilmar
>
--
--
"Eppur si evolve" ("And yet it evolves")
-Galileo Jr (ca 21st century)
--
Alexie Papanicolaou
Entomology
Max Planck Institute for Chemical Ecology
Hans Knoell Str 8
Jena 07745
Germany
Email apapanicolaou at ice.mpg.de
Tel +493641571561