# [Bioperl-l] Count or weight matrix in bioperl?

pedro fabre pedro.fabre at gmail.com
Fri Feb 17 18:36:37 UTC 2006

>Torsten and all,
>
>  I don't think this will work for me for it only generates
>statistics for a single sequence.  What I need is a count matrix for
>each position for a number of DNA sequences.  In other words, if I
>pass there 3 sequences to this function then it returns the count
>for each postion for each nucleotide.
>
>  For example if I pass an array of sequences say: ATC,CCC,TTT
>  then I should get a matrix back that will have count for postion
>1,2,3 for each A,C,T, or G like this:
>
>
>                  1    2   3
>       A        1    0    0
>       C        1    1    2
>       T        1    2    1
>       G        0    0    0
>
>  Any idea of this is already built somewhere in bioperl?
>
>  Thank you.
>
>

Sam,

I worked in something like that some time ago for SNP calculation

and it looks to me you are on the same way.

If you have a sequence like

A       C       G       T       C       C       A       -       T
C       G       G       T       A       G       T       G       C
C       C       C       C       C       G       T       G       C
C       G       C       T       C       G       T       G       C

Convert the sequence to numbers (0 for the first value, 1 for the
first modification (reading by columns), 2 for the second
modification and so on)
Deletions can be considered as another base if you like

After that:

0       0       0       0       0       0       0       0       0
1       1       0       0       1       1       1       1       1
1       0       1       1       0       1       1       1       1
1       1       1       0       0       1       1       1       1

Once we have the haplotype converted to numbers we have to generate the
snp type information for the haplotype.

SNP code = SUM ( value * multiplicity ^ position );>

where:
SUM is the sum of the values for the SNP
value is the SNP number code (0 [generally for the mayor allele],
1 [for the minor allele].
position is the position on the block.

For this example the code is:

0       0       0       0       0       0       0       0       0
1       1       0       0       1       1       1       1       1
1       0       1       1       0       1       1       1       1
1       1       1       0       0       1       1       1       1
------------------------------------------------------------------
14      10      12      4       2       14      14      14      14

14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
....

Once we have the families classify. We will B<take> just the SNP's B<not
redundant>.

14      10      12      4       2

http://users.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/PopGen/HtSNP.pm?rev=1.4&content-type=text/vnd.viewcvs-markup

HTH
Pedro

>  Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:>
>Say I have an array of nucleotide sequences of of length N. I want
>to calculate the count matrix (weight matrix). That is for each
>position 1..N, I want to know how many As, Cs ,Ts and Gs there are.
>Is the code to do this already written in bioperl to build this
>matrix if I pass it those strings?
>>    Please excuse my lack of knowledge as I am a new comer to bioinformatics.
>
>Use the Bio::Tools::SeqStats module. The PDoc documentation even has an
>example similar to what you want to do:
>
>http://doc.bioperl.org/releases/bioperl-1.5.0-RC1/Bio/Tools/SeqStats.html
>
>--Torsten Seemann
>
>
>
>
>Sincerely,
>Sam Al-Droubi, M.S.
>saldroubi at yahoo.com
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l