[Bioperl-l] seq_word and pattern counts

Tue Feb 28 21:46:30 UTC 2006

Yes.
N matches any of the four bases (A, C, G, or T).
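
To make "N matches any of the four bases" concrete, here is a minimal sketch in plain Python (not BioPerl, and not part of Bio::Tools::SeqWords). It assumes the standard IUPAC nucleotide ambiguity codes and turns a pattern such as GGnnCC into a regular expression. The names pattern_to_regex and count_pattern are made up for illustration.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def pattern_to_regex(pattern):
    """Translate a pattern such as GGnnCC into a regex string (hypothetical helper)."""
    return "".join("[" + IUPAC[ch] + "]" for ch in pattern.upper())

def count_pattern(seq, pattern):
    """Count occurrences of pattern in seq, including overlapping ones."""
    # The lookahead matches at each start position without consuming
    # characters, so overlapping occurrences are all counted.
    regex = re.compile("(?=" + pattern_to_regex(pattern) + ")")
    return sum(1 for _ in regex.finditer(seq.upper()))

print(pattern_to_regex("GGnnCC"))
print(count_pattern("ggatcc", "GGnnCC"))
```

This only resolves ambiguity symbols in the pattern. An ambiguous base in the sequence itself (an N in the target DNA) would simply fail to match here.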

Nick Staffa
Telephone: 919-316-4569  (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Jack L. Field (field1 at niehs.nih.gov ))
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina

-----Original Message-----
From: Torsten Seemann [mailto:torsten.seemann at infotech.monash.edu.au]
Sent: Tuesday, February 28, 2006 4:45 PM
To: Staffa, Nick (NIH/NIEHS) [C]
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] seq_word and pattern counts

Nick

> Does anyone know if Bio::Tools::SeqWords
> count_words
> or
> count_overlap_words
> will do DNA pattern searches and honor ambiguity symbols
> like exist in some restriction enzyme pattern definitions,
> e.g. GGnnCC

Examination of the code

http://doc.bioperl.org/releases/bioperl-1.5.0-RC1/Bio/Tools/SeqWords.html#CODE4

suggests that all it does is count N-mers made of any letters,
and does so in a case-insensitive way, i.e. CAT, Cat and cat are counted as
the same N-mer.
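
As a rough illustration of that behaviour (a Python sketch, not the BioPerl source), the following counts N-mers after upper-casing the sequence. The function name count_nmers and its overlap flag are assumptions made for this example. The two modes are only meant to mirror the idea behind count_words and count_overlap_words.

```python
from collections import Counter

def count_nmers(seq, n, overlap=False):
    """Count N-mers case-insensitively; every letter is treated literally."""
    seq = seq.upper()
    # Step by n for back-to-back words, or by 1 for overlapping words.
    step = 1 if overlap else n
    return Counter(seq[i:i + n] for i in range(0, len(seq) - n + 1, step))

print(count_nmers("CATCatcat", 3))
print(count_nmers("CATCAT", 3, overlap=True))
```

Note that an N in the sequence is just another letter here: "CAN" would be tallied as its own 3-mer, separate from CAA, CAC, CAG and CAT.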

So no, it does not handle ambiguity symbols in any special way.

What would you like it to do?
If an N-mer has 1 "N" in it, does it count towards each of the 4 possible
N-mers it could be?
If it has 2 "N"s in it, does it count towards all 16 possible
non-ambiguous N-mers?
And so on?
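
One possible reading of that question, sketched in Python under the assumption that each ambiguity symbol expands to every base it can stand for (standard IUPAC codes). The helper expand_nmer is hypothetical and does not exist in BioPerl.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_nmer(nmer):
    """List every non-ambiguous N-mer that an ambiguous N-mer could be."""
    choices = [IUPAC[ch] for ch in nmer.upper()]
    # The Cartesian product picks one concrete base per position.
    return ["".join(bases) for bases in product(*choices)]

print(expand_nmer("CAN"))          # one N: 4 candidates
print(len(expand_nmer("GGnnCC")))  # two Ns: 16 candidates
```

Whether each candidate should then receive a full count or a fractional one (1/4, 1/16, and so on) is exactly the design decision the questions above leave open.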

-- 
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia
http://www.vicbioinformatics.com/
Phone: +61 3 9905 9010