[Bioperl-l] IUPAC support for DNA alignment
Alexie Papanicolaou
apapanicolaou at ice.mpg.de
Fri Jun 27 17:40:34 UTC 2008
Yes, I don't want to use a special scoring for what I'm doing now. The
option would let a C or A aligned with M receive the same score as
specified in -match. I guess it would be quicker if I just made my own
matrix, but there is a TODO line on IUPAC codes so I thought I'd push it a bit.
Yes, from a computational point of view, Scenarios 1 and 3 are the same.
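
To make the requested behaviour concrete, here is a minimal sketch in Python
(not BioPerl code; the names iupac_score and iupac_matrix are hypothetical and
exist only for this illustration). It assumes the simplest possible rule: two
symbols score as a full match whenever the base sets they stand for overlap,
and as a mismatch otherwise. The default scores of 1 and -1 are placeholders.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_score(x, y, match=1, mismatch=-1):
    """Return `match` if the two codes share at least one base, else `mismatch`.

    Hypothetical helper: this is the "simple switch" rule discussed in the
    thread, not an existing library function.
    """
    shared = set(IUPAC[x.upper()]) & set(IUPAC[y.upper()])
    return match if shared else mismatch


def iupac_matrix(match=1, mismatch=-1):
    """Expand the rule into an explicit matrix keyed by (symbol, symbol)."""
    return {
        (x, y): iupac_score(x, y, match, mismatch)
        for x in IUPAC
        for y in IUPAC
    }


if __name__ == "__main__":
    for pair in [("M", "A"), ("M", "C"), ("M", "G"), ("R", "Y")]:
        print(pair, iupac_score(*pair))
```

Under this rule N overlaps with every symbol and therefore always scores as a
match, which is fine for Scenarios 1 and 3 but is exactly what Scenario 2
(uncertainty) argues against. Expanding the rule into an explicit matrix also
shows that the switch and a user-supplied matrix are equivalent under the hood.
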
a
Hilmar Lapp wrote:
> So instead of the user choosing a special matrix you would like to
> have a simple argument (that would probably under the hood do exactly
> that)?
>
> BTW Scenarios #1 and #3 sound more or less the same to me (i.e., you
> believe the degenerate code to reflect site polymorphism, not sequence
> uncertainty).
>
> -hilmar
>
> On Jun 27, 2008, at 10:13 AM, Alexie Papanicolaou wrote:
>
>> Hello
>>
>> I guess I didn't give enough info... (also sorry Yee Man, I forgot to
>> CC you before)
>>
>> Scenario 1 - polymorphic allele vs non-polymorphic one, e.g.
>> let [A/G] be a SNP with two alleles in population A, while the one fixed
>> allele [G] is in population B.
>>
>> In this scenario we want to calculate the distance at one locus
>> between two populations, thus a degenerate site is not the result of
>> uncertainty but of reality. Obviously the best method is to provide
>> a matrix (if the user can be bothered), but Yee Man already allows
>> this option. Personally, I wouldn't really use an alignment score
>> to measure distance though... The application here is: we first want
>> to align those two sequences and there should be no penalty because
>> there is a SNP in one population (then estimate distance with another
>> algorithm).
>>
>> Scenario 2 - uncertainty
>> If the scenario is that [A/G] is the result of uncertainty then I
>> gladly agree with you! I'm also perplexed how to score IUPAC codes
>> allowing for three nucleotides (i.e. there might not be a SNP after
>> all... but then again infinite sites doesn't have to hold - in some
>> species less than others...)
>>
>> Scenario 3 - a type of profile alignment to a consensus
>> In my particular case, I'm doing something different: I have the
>> consensus of an alignment of multiple sequences (dozens to hundreds
>> depending on dataset) with some mismatches, including a SNP, say [A/G].
>> A third sequence that I wish to align has A in that position. So
>> obviously, it shouldn't be penalized.
>>
>> So it really depends on the application and the user should be able to
>> decide in the end... (Yee Man already provides the option for a
>> protein substitution matrix). It would be nice, though, if we had the
>> option of specifying it much more easily (a simple switch) so I can
>> use it for scenario 3.
>> a
>> ps. sorry, my English is going down the drain...
>>
>>
>> Hilmar Lapp wrote:
>>> Hi Alexie,
>>>
>>> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>>>
>>>> Hello
>>>>
>>>> I'm the user who asked for it. I don't know of any conventions but
>>>> perhaps people can help on this?
>>>>
>>>> I'm not an expert at all but here is my opinion:
>>>> If you don't know the codon position (or even if it is coding) then
>>>> you can't estimate the codon degeneracy. If you don't know the
>>>> frequency of the bases represented in the degenerate site then
>>>> you can't model it either on the DNA level. So any solution will be
>>>> ad-hoc.
>>>>
>>>> Regarding 2 base degenerate positions: My suggestion is that in a
>>>> situation of alignment between, say, a polymorphic and a
>>>> non-polymorphic population for that site, where the user is
>>>> interested in the distance between the populations, it would make
>>>> sense to score it as a full match.
>>>>
>>>> Regarding 3 bases: I don't really know (see N below) but I'd go
>>>> for a full match again, assuming the user built the consensus.
>>>
>>> are you suggesting that a determined and a degenerate site aligned
>>> pairwise should score as much as two determined sites?
>>>
>>> My (possibly naive) default would be to average over all
>>> possibilities, each weighted by base frequency (if base frequencies
>>> are assumed unequal or independent), thus integrating out the
>>> uncertainty. (For standard matrices, I think this would also result
>>> in N receiving zero score.)
>>>
>>> In the end though, maybe there should be an option for a user to
>>> just provide a substitution matrix?
>>>
>>> -hilmar
>>>
>>
>> --
>> --
>> "Eppur si evolve" ("And yet it evolves")
>> -Galileo Jr (ca 21st century)
>>
>> --
>> Alexie Papanicolaou
>> Entomology
>> Max Planck Institute for Chemical Ecology
>> Hans Knoell Str 8
>> Jena 07745
>> Germany
>> Email apapanicolaou at ice.mpg.de
>> Tel +493641571561
>
--
--
"Eppur si evolve" ("And yet it evolves")
-Galileo Jr (ca 21st century)
--
Alexie Papanicolaou
Entomology
Max Planck Institute for Chemical Ecology
Hans Knoell Str 8
Jena 07745
Germany
Email apapanicolaou at ice.mpg.de
Tel +493641571561