[Bioperl-l] seq_word and pattern counts

Tue Feb 28 22:08:40 UTC 2006

The real problem is this:
We want to count the sites in a long sequence where a restriction enzyme would cut.
The restriction enzyme in the example I gave recognizes GGnnCC,
that is, two Gs followed by any two bases and then two Cs.

The GCG program findpatterns will do this, but BioPerl makes certain statistics easy.
I'm sure there is a module somewhere for this purpose.
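
To make the GGnnCC example concrete, here is a minimal sketch in plain Python (not BioPerl, and not what findpatterns does internally). It assumes the pattern uses IUPAC ambiguity codes and the sequence itself contains only A, C, G and T. The function name site_positions and the test sequence are made up for illustration. Since GGnnCC is its own reverse complement, scanning one strand is enough for this pattern.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def site_positions(sequence, pattern):
    """Return 0-based start positions of every match, overlapping ones included."""
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    # A lookahead consumes no characters, so overlapping sites are all reported.
    regex = re.compile("(?=" + body + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]

seq = "ttGGATCCaaGGGGCCCCtt"   # hypothetical test sequence
hits = site_positions(seq, "GGnnCC")
print(len(hits), hits)         # 4 [2, 10, 11, 12]
```

The lookahead matters here: a plain search for the pattern would skip the sites at positions 11 and 12 because they overlap the one at position 10.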

Nick Staffa
Telephone: 919-316-4569  (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Jack L. Field (field1 at niehs.nih.gov ))
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina

-----Original Message-----
From: Torsten Seemann [mailto:torsten.seemann at infotech.monash.edu.au]
Sent: Tuesday, February 28, 2006 5:02 PM
To: Staffa, Nick (NIH/NIEHS) [C]
Cc: bioperl-l
Subject: Re: [Bioperl-l] seq_word and pattern counts

Staffa, Nick (NIH/NIEHS) [C] wrote:
> Yes 
> N matches any of the four bases.

It's still not clear to me what you want.

For simplicity, let's say we are counting words of length 1
(which means overlapping and non-overlapping counts are the same)
and our sequence is "AGTN" (i.e. 4 letters long).

The module would return the following:
{ A=>1, G=>1, T=>1, N=>1 }    # sum of counts = 4

But you want it to return this?
{ A=>2, G=>2, T=>2, C=>1 }    # sum of counts = 7
i.e. the N contributes 1 A, 1 G, 1 T and 1 C (and 0 N)

And correspondingly for all the possible ambiguity codes?

And if the word length was 2, then each time we encountered an "NN"
it would add 16 to the total count, i.e. 1 AA, 1 AT, 1 AC etc.?
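
The expanded counting described above could be sketched as follows. This is plain Python for illustration only, not how Bio::Tools::SeqWords behaves. The name expanded_word_counts is hypothetical, and the sketch assumes overlapping words and the standard IUPAC nucleotide codes.

```python
from collections import Counter
from itertools import product

# IUPAC ambiguity codes and the unambiguous bases each one stands for.
EXPANSIONS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expanded_word_counts(sequence, word_length):
    """Count overlapping words, crediting each ambiguous word to every
    unambiguous word it could represent."""
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - word_length + 1):
        window = seq[start:start + word_length]
        # product() enumerates every unambiguous spelling of this window.
        for letters in product(*(EXPANSIONS[ch] for ch in window)):
            counts["".join(letters)] += 1
    return counts

print(dict(expanded_word_counts("AGTN", 1)))        # {'A': 2, 'G': 2, 'T': 2, 'C': 1}
print(sum(expanded_word_counts("NN", 2).values()))  # 16
```

Note that the total no longer equals the number of words in the sequence: a word containing k Ns adds 4**k to the sum.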

>>Does anyone know if Bio::Tools::SeqWords
>>count_words
>>or
>>count_overlap_words
>>will do DNA pattern searches and honor ambiguity symbols
>>like those in some restriction enzyme pattern definitions,
>>e.g. GGnnCC?
> suggests that all it does is count N-mers of any set of letters,
> and does so in a case-insensitive way ie. CAT, Cat, cat are counted as 
> the same N-mer.
> So no it does not handle ambiguity symbols in any special manner.
> What would you like it to do?
> If a N-mer has 1 "N" in it, does it count towards the 4 possible N-mers 
> it could be?
> If it has 2 "N"s in it, does it count toward all 16 possible 
> non-ambiguous N-mers?
> And so on?
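
For comparison, the plain behaviour described in the quoted text (case-insensitive N-mer counting with no special treatment of ambiguity symbols) can be sketched like this. It is plain Python, not the module's actual code. The name word_counts is hypothetical, and the sketch assumes that non-overlapping words are read in frame from the first base.

```python
from collections import Counter

def word_counts(sequence, word_length, overlapping=True):
    """Count N-mers case-insensitively; every letter, N included, is literal."""
    seq = sequence.upper()
    # Overlapping words start at every position; non-overlapping words
    # are assumed to start at position 0 and advance one word at a time.
    step = 1 if overlapping else word_length
    counts = Counter()
    for start in range(0, len(seq) - word_length + 1, step):
        counts[seq[start:start + word_length]] += 1
    return counts

print(dict(word_counts("CATcatCat", 3, overlapping=False)))  # {'CAT': 3}
print(dict(word_counts("AGTN", 1)))  # {'A': 1, 'G': 1, 'T': 1, 'N': 1}
```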

-- 
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia
http://www.vicbioinformatics.com/
Phone: +61 3 9905 9010