[Bioperl-l] IUPAC support for DNA alignment
Hilmar Lapp
hlapp at gmx.net
Fri Jun 27 17:33:46 UTC 2008
So instead of the user choosing a special matrix you would like to
have a simple argument (that would probably under the hood do exactly
that)?
BTW Scenarios #1 and #3 sound more or less the same to me (i.e., you
believe the degenerate code to reflect site polymorphism, not sequence
uncertainty).
-hilmar
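
A minimal sketch of what such a switch might do under the hood, written in Python purely for illustration: it builds a DNA substitution matrix in which two IUPAC codes score as a full match whenever the base sets they stand for overlap. The name `overlap_matrix` and the match/mismatch values are hypothetical and not part of any BioPerl module.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_matrix(match=1, mismatch=-1):
    """Return {(x, y): score}: full match if the two codes share any base.

    The scores are placeholder values chosen for this sketch.
    """
    matrix = {}
    for x, x_bases in IUPAC.items():
        for y, y_bases in IUPAC.items():
            shared = set(x_bases) & set(y_bases)
            matrix[(x, y)] = match if shared else mismatch
    return matrix


if __name__ == "__main__":
    m = overlap_matrix()
    # R stands for [A/G], so A against R is not penalized, but C is.
    print(m[("R", "A")], m[("R", "C")])
```

Under this scheme the [A/G] consensus position of scenario 3 scores the same against A as A does against A, which is the behaviour requested for polymorphism rather than uncertainty.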
On Jun 27, 2008, at 10:13 AM, Alexie Papanicolaou wrote:
> Hello
>
> I guess I didn't give enough info... (also sorry Yee Man, forgot to
> CC you before)
>
> Scenario 1 - polymorphic allele vs non-polymorphic one. e.g.
> Let [A/G] be a SNP with two alleles in population A, while the one
> fixed allele [G] is in population B.
>
> In this scenario we want to calculate the distance at one locus
> between two populations, thus a degenerate site is not the result of
> uncertainty but of reality. Obviously the best method is to provide
> a matrix (if the user can be bothered) but Yee Man already allows
> this option. Personally, I wouldn't really use an alignment score
> to measure distance though... The application here is: we first want
> to align those two sequences and there should be no penalty because
> there is a SNP in one population (then estimate distance with
> another algorithm).
>
> Scenario 2 - uncertainty
> If the scenario is that [A/G] is the result of uncertainty then I
> gladly agree with you! I'm also perplexed how to score IUPAC codes
> allowing for three nucleotides (i.e. there might not be a SNP after
> all... but then again infinite sites doesn't have to hold - in
> some species less than others...)
>
> Scenario 3 - a type profile alignment to a consensus
> In my particular case, I'm doing something different: I have the
> consensus of an alignment of multiple sequences (dozens to hundreds
> depending on dataset) with some mismatches including a SNP say
> [A/G]. A third sequence that I wish to align has A in that position. So
> obviously, it shouldn't be penalized.
>
> So it really depends on application and the user should be able to
> decide in the end... (Yee Man already provides the option for a
> protein substitution matrix). It would be nice if we had the option
> of specifying it though much more easily (a simple switch) so I can
> use it for scenario 3.
> a
> ps. sorry, my English is going down the drain...
>
>
> Hilmar Lapp wrote:
>> Hi Alexie,
>>
>> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>>
>>> Hello
>>>
>>> I'm the user who asked for it. I don't know of any conventions but
>>> perhaps people can help on this?
>>>
>>> I'm not an expert at all but here is my opinion:
>>> If you don't know the codon position (or even if it is coding)
>>> then you can't estimate the codon degeneracy. If you don't know
>>> the frequency of the bases represented in the degenerate site
>>> then you can't model it either on the DNA level. So any solution
>>> will be ad-hoc.
>>>
>>> Regarding 2 base degenerate positions: My suggestion is that in a
>>> situation of alignment between, say a polymorphic and non
>>> polymorphic population for that site, and the user is interested
>>> in the distance between the populations, it would make sense to
>>> have it score as a full match.
>>>
>>> Regarding 3 bases: I don't really know (see N below) but I'd go
>>> for a full match again, assuming the user built the consensus.
>>
>> are you suggesting that a determined and a degenerate site aligned
>> pairwise should score as much as two determined sites?
>>
>> My (possibly naive) default would be to average over all
>> possibilities, each weighted by base frequency (if base frequencies
>> are assumed unequal or independent), thus integrating out the
>> uncertainty. (For standard matrices, I think this would also result
>> in N receiving zero score.)
>>
>> In the end though, maybe there should be an option for a user to
>> just provide a substitution matrix?
>>
>> -hilmar
>>
>
> --
> "Eppur si evolve" ("And yet it evolves")
> -Galileo Jr (ca 21st century)
>
> --
> Alexie Papanicolaou
> Entomology
> Max Planck Institute for Chemical Ecology
> Hans Knoell Str 8
> Jena 07745
> Germany
> Email apapanicolaou at ice.mpg.de
> Tel +493641571561
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
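
The averaging default suggested in the quoted reply could be sketched roughly as follows, again in Python for illustration only. The score for a pair of IUPAC codes is the expectation of a base-level score over all bases each code allows, weighted by base frequency. The function names, the +1/-1 base scores, and the uniform frequencies are assumptions made for this sketch, not an existing API.

```python
# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def simple_score(a, b):
    """Placeholder base-level score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def expected_score(x, y, base_score=simple_score, freqs=None):
    """Average base_score over the bases allowed by codes x and y.

    Each base pair is weighted by the product of the base frequencies,
    renormalized within the set of bases each code permits.
    """
    if freqs is None:
        freqs = {b: 0.25 for b in "ACGT"}  # assumed uniform frequencies
    x_bases, y_bases = IUPAC[x], IUPAC[y]
    x_weight = sum(freqs[b] for b in x_bases)
    y_weight = sum(freqs[b] for b in y_bases)
    total = 0.0
    for a in x_bases:
        for b in y_bases:
            total += freqs[a] * freqs[b] * base_score(a, b)
    return total / (x_weight * y_weight)


if __name__ == "__main__":
    # [A/G] against A: half the time a match, half the time a mismatch.
    print(expected_score("R", "A"))
    print(expected_score("N", "A"))
```

Whether N ends up with a zero score depends on the base-level scores supplied; with the +1/-1 placeholder used here it does not, so that part of the suggestion would need checking against the actual matrix.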