[Bioperl-l] Count or weight matrix in bioperl?

Fri Feb 17 18:36:37 UTC 2006

>Torsten and all,
>
>  I don't think this will work for me for it only generates 
>statistics for a single sequence.  What I need is a count matrix for 
>each position for a number of DNA sequences.  In other words, if I 
>pass there 3 sequences to this function then it returns the count 
>for each postion for each nucleotide.
>
>  For example if I pass an array of sequences say: ATC,CCC,TTT
>  then I should get a matrix back that will have count for postion 
>1,2,3 for each A,C,T, or G like this:
>
>
>                  1    2   3
>       A        1    0    0
>       C        1    1    2
>       T        1    2    1    
>       G        0    0    0
>
>  Any idea of this is already built somewhere in bioperl?
>
>  Thank you.
>
>

Sam,

What about this?

I worked in something like that some time ago for SNP calculation

and it looks to me you are on the same way.

If you have a sequence like

   A       C       G       T       C       C       A       -       T
   C       G       G       T       A       G       T       G       C
   C       C       C       C       C       G       T       G       C
   C       G       C       T       C       G       T       G       C

Convert the sequence to numbers (0 for the first value, 1 for the 
first modification (reading by columns), 2 for the second 
modification and so on)
Deletions can be considered as another base if you like

After that:

   0       0       0       0       0       0       0       0       0
   1       1       0       0       1       1       1       1       1
   1       0       1       1       0       1       1       1       1
   1       1       1       0       0       1       1       1       1

Once we have the haplotype converted to numbers we have to generate the
snp type information for the haplotype.

SNP code = SUM ( value * multiplicity ^ position );>

     where:
       SUM is the sum of the values for the SNP
       value is the SNP number code (0 [generally for the mayor allele],
                                     1 [for the minor allele].
       position is the position on the block.

For this example the code is:

   0       0       0       0       0       0       0       0       0
   1       1       0       0       1       1       1       1       1
   1       0       1       1       0       1       1       1       1
   1       1       1       0       0       1       1       1       1
  ------------------------------------------------------------------
   14      10      12      4       2       14      14      14      14

   14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
   12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
   ....

Once we have the families classify. We will B<take> just the SNP's B<not
redundant>.

   14      10      12      4       2

If you want to look into the code follow this link.

http://users.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/PopGen/HtSNP.pm?rev=1.4&content-type=text/vnd.viewcvs-markup

HTH
Pedro

>  Torsten Seemann <torsten.seemann at infotech.monash.edu.au> wrote:> 
>Say I have an array of nucleotide sequences of of length N. I want 
>to calculate the count matrix (weight matrix). That is for each 
>position 1..N, I want to know how many As, Cs ,Ts and Gs there are. 
>Is the code to do this already written in bioperl to build this 
>matrix if I pass it those strings?
>>    Please excuse my lack of knowledge as I am a new comer to bioinformatics.
>
>Use the Bio::Tools::SeqStats module. The PDoc documentation even has an
>example similar to what you want to do:
>
>http://doc.bioperl.org/releases/bioperl-1.5.0-RC1/Bio/Tools/SeqStats.html
>
>--Torsten Seemann
>
>
>
>
>Sincerely,
>Sam Al-Droubi, M.S.
>saldroubi at yahoo.com
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l