[Bioperl-l] IUPAC support for DNA alignment

Fri Jun 27 14:13:57 UTC 2008

Hello

I guess I didn't give enough info... (also sorry Yee Man, I forgot to CC 
you before)

Scenario 1 - polymorphic allele vs. non-polymorphic one, e.g.
let [A/G] be a SNP with two alleles in population A, while a single fixed 
allele [G] is present in population B.

In this scenario we want to calculate the distance at one locus 
between two populations, so a degenerate site is not the result of 
uncertainty but of reality. Obviously the best method is to provide a 
matrix (if the user can be bothered), but Yee Man already allows this 
option. Personally, I wouldn't really use an alignment score to 
measure distance though... The application here is: we first want to 
align those two sequences, and there should be no penalty just because 
there is a SNP in one population (we then estimate the distance with 
another algorithm).

Scenario 2 - uncertainty
If the scenario is that [A/G] is the result of uncertainty then I gladly 
agree with you! I'm also perplexed about how to score IUPAC codes that 
allow for three nucleotides (i.e. there might not be a SNP after all... 
but then again the infinite-sites assumption doesn't have to hold - in 
some species less than in others...)
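
For the uncertainty case, the averaging Hilmar describes can be sketched roughly as below. This is only an illustration in Python, not anything from BioPerl or Yee Man's module: the function name expected_score, the match/mismatch values and the uniform default base frequencies are all my own assumptions.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_score(x, y, match=1.0, mismatch=-1.0, freqs=None):
    """Average a match/mismatch score over every base pair that the two
    IUPAC codes x and y allow, weighting each pair by base frequency.

    The scoring values and the uniform default frequencies are
    illustrative assumptions, not values taken from any real matrix.
    """
    if freqs is None:
        freqs = {base: 0.25 for base in "ACGT"}
    total = 0.0
    weight = 0.0
    for a in IUPAC[x]:
        for b in IUPAC[y]:
            w = freqs[a] * freqs[b]
            total += w * (match if a == b else mismatch)
            weight += w
    # Normalise so that the weights of the allowed pairs sum to one.
    return total / weight
```

With these defaults, R ([A/G]) against A averages one match and one mismatch. Whether N ends up at zero depends on the matrix: under uniform frequencies it does for match=3, mismatch=-1, but not for match=1, mismatch=-1.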

Scenario 3 - a type of profile alignment to a consensus
In my particular case, I'm doing something different: I have the 
consensus of an alignment of multiple sequences (dozens to hundreds 
depending on the dataset) with some mismatches, including a SNP, say 
[A/G]. A third sequence that I wish to align has A in that position. So 
obviously, it shouldn't be penalized.
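
What I mean by "not penalized" could look roughly like the sketch below: a degenerate consensus code scores as a full match whenever the aligned base is one of the bases it stands for. Again this is just an illustration in Python with made-up names (compatible, score_column, score_ungapped) and arbitrary match/mismatch values, not existing BioPerl code.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(x, y):
    """Return True if the two IUPAC codes share at least one base."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def score_column(x, y, match=1, mismatch=-1):
    """Score one aligned column: full match if the codes are compatible.

    The match/mismatch values are illustrative assumptions.
    """
    return match if compatible(x, y) else mismatch


def score_ungapped(consensus, seq, match=1, mismatch=-1):
    """Sum the column scores of two equal-length, already aligned sequences."""
    if len(consensus) != len(seq):
        raise ValueError("sequences must have the same length")
    return sum(score_column(c, b, match, mismatch)
               for c, b in zip(consensus, seq))
```

For example, a consensus ACRT (R = [A/G]) against ACAT scores the same as four determined matches, while ACCT is penalized at the third position only.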

So it really depends on the application, and the user should be able to 
decide in the end... (Yee Man already provides the option for a protein 
substitution matrix). It would be nice, though, if we had a much easier 
way of specifying it (a simple switch), so I can use it for 
scenario 3.
a
ps. sorry, my English is going down the drain...

Hilmar Lapp wrote:
> Hi Alexie,
>
> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>
>> Hello
>>
>> I'm the user who asked for it. I don't know of any conventions but 
>> perhaps people can help on this?
>>
>> I'm not an expert at all but here is my opinion:
>> If you don't know the codon position (or even if it is coding) then 
>> you can't estimate the codon degeneracy. If you don't know the 
>> frequency of the bases represented in the degenerate site then you 
>> can't model it either on the DNA level. So any solution will be ad-hoc.
>>
>> Regarding 2 base degenerate positions: My suggestion is that in a 
>> situation of alignment between, say a polymorphic and non polymorphic 
>> population for that site, and the user is interested in the distance 
>> between the populations, it would make sense to score it as a 
>> full match.
>>
>> Regarding 3 bases: I don't really know (see N below) but I'd go for 
>> a full match again, assuming the user built the consensus.
>
> are you suggesting that a determined and a degenerate site aligned 
> pairwise should score as much as two determined sites?
>
> My (possibly naive) default would be to average over all 
> possibilities, each weighted by base frequency (if base frequencies 
> are assumed unequal or independent), thus integrating out the 
> uncertainty. (For standard matrices, I think this would also result in 
> N receiving zero score.)
>
> In the end though, maybe there should be an option for a user to just 
> provide a substitution matrix?
>
>     -hilmar
>

-- 
--
"Eppur si evolve" ("And yet it evolves")
-Galileo Jr (ca 21st century)

--
Alexie Papanicolaou
Entomology
Max Planck Institute for Chemical Ecology
Hans Knoell Str 8
Jena 07745
Germany
Email apapanicolaou at ice.mpg.de
Tel +493641571561