[Bioperl-l] seq_word and pattern counts

Torsten Seemann torsten.seemann at infotech.monash.edu.au
Tue Feb 28 22:01:38 UTC 2006


Staffa, Nick (NIH/NIEHS) [C] wrote:
> Yes 
> N matches any of the four bases.

It's still not clear to me what you want.

For simplicity, let's say we are counting words of length 1
(which means overlapping and non-overlapping counts are the same)
and our sequence is "AGTN" (i.e. 4 letters long).

The module would return the following:
{ A=>1, G=>1, T=>1, N=>1 }    # sum of counts = 4

But you want it to return this?
{ A=>2, G=>2, T=>2, C=>1 }    # sum of counts = 7
i.e. the N contributes 1 A, 1 G, 1 T and 1 C (and 0 N)

And correspondingly for all the possible ambiguity codes?

And if the word length was 2, then if we encountered an "NN"
it would add 16 to the total count, i.e. 1 AA, 1 AT, 1 AC etc?
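
To make the question concrete, here is a minimal sketch of the "expanded"
counting described above. It is written in Python purely for illustration;
it is not Bio::Tools::SeqWords code and not something the module is known
to do. The function name count_expanded_words is hypothetical, the table is
the standard IUPAC nucleotide ambiguity code, and the sketch assumes
overlapping windows and that each ambiguous word adds 1 to every
unambiguous word it could stand for.

```python
from collections import Counter
from itertools import product

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_expanded_words(seq, k=1):
    """Count overlapping k-letter words, expanding ambiguity codes.

    Hypothetical helper: raises KeyError on letters outside the IUPAC table.
    """
    seq = seq.upper()  # case-insensitive, like the plain word counts
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Each ambiguous word adds 1 to every unambiguous word it can represent.
        for combo in product(*(IUPAC[base] for base in word)):
            counts["".join(combo)] += 1
    return counts

print(dict(count_expanded_words("AGTN", 1)))        # {'A': 2, 'G': 2, 'T': 2, 'C': 1}
print(sum(count_expanded_words("NN", 2).values()))  # 16
```

Under this scheme the counts no longer sum to the number of words in the
sequence, which is the trade-off the question is really about.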

>>Does anyone know if Bio::Tools::SeqWords
>>*count_words
>>or
>>count_overlap_words
>>will do DNA pattern searches and honor ambiguity symbols
>>like exist in some restriction enzyme pattern definitions,
>>e.g. GGnnCC*

> suggests that all it does is count N-mers of any set of letters,
> and does so in a case-insensitive way ie. CAT, Cat, cat are counted as 
> the same N-mer.
> So no it does not handle ambiguity symbols in any special manner.
> What would you like it to do?
> If a N-mer has 1 "N" in it, does it count towards the 4 possible N-mers 
> it could be?
> If it has 2 "N"s in it, does it count toward all 16 possible 
> non-ambiguous N-mers?
> And so on?
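
The plain behaviour described in the quoted text (case-insensitive N-mer
counting with no special handling of ambiguity symbols) can be sketched as
follows. This is an illustrative Python sketch, not the module's source:
the function name count_words and its overlap flag are hypothetical, and it
assumes that non-overlapping counting walks the sequence from the first
letter in strides of the word length.

```python
from collections import Counter

def count_words(seq, k, overlap=False):
    """Count k-letter words case-insensitively; 'N' is just another letter."""
    seq = seq.upper()
    step = 1 if overlap else k  # stride 1 gives overlapping windows
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

print(dict(count_words("AGTN", 1)))    # {'A': 1, 'G': 1, 'T': 1, 'N': 1}
print(dict(count_words("CatcAT", 3)))  # {'CAT': 2}
print(dict(count_words("GGNNCC", 2, overlap=True)))
```

In the last call "NN" is counted once as the literal word NN, so a pattern
such as GGnnCC is never matched against the 16 sequences it stands for.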

-- 
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia
http://www.vicbioinformatics.com/
Phone: +61 3 9905 9010


