[EMBOSS] relative abundance/word bias application

Thu Jul 26 16:44:00 UTC 2007

Hello,

I'm wondering whether there is any interest in adding an application to
EMBOSS that calculates the relative abundance (bias) of words.

The measure I have in mind is the one used by Karlin and others (for
example, Burge, C. et al., PNAS 1992). It is the observed frequency of a
particular word divided by its expected frequency, where the expectation
is based on the frequencies of all its subwords, including gapped
subwords. This gives the bias at a particular word size with the effects
of smaller word sizes removed.

For small word sizes there are explicit formulas one can use, but as you
get to larger sizes these become unwieldy. I've been working on some code
which is able to calculate this measure for words of up to 10 or 11 bp
in a reasonable amount of time. If there is interest, I would be happy
to contribute it.
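
As a rough illustration of the measure described above (a naive sketch,
not the code offered here and not an EMBOSS interface), the ratio can be
written as a product over every non-empty subset of word positions:
subwords whose length differs from the full word by an even number go in
the numerator, the others in the denominator. For a dinucleotide this
reduces to f(xy) / (f(x) f(y)). The function names below are
hypothetical. The sketch assumes frequencies are taken from a single
strand with no strand symmetrization, and that a gapped subword is
counted over all windows as wide as its span.

```python
from itertools import combinations


def pattern_frequency(seq, word, positions):
    """Frequency of the (possibly gapped) subword of `word` made of the
    letters at `positions`, counted over all windows spanning it."""
    first = positions[0]
    span = positions[-1] - first + 1
    windows = len(seq) - span + 1
    if windows <= 0:
        return 0.0
    hits = sum(
        all(seq[start + p - first] == word[p] for p in positions)
        for start in range(windows)
    )
    return hits / windows


def relative_abundance(seq, word):
    """Observed frequency of `word` divided by the expectation built
    from all of its subwords, including gapped ones (assumed form)."""
    k = len(word)
    if k == 0:
        raise ValueError("word must not be empty")
    # If the word never occurs, the ratio is taken to be zero.
    if pattern_frequency(seq, word, tuple(range(k))) == 0.0:
        return 0.0
    ratio = 1.0
    for size in range(1, k + 1):
        # Numerator when (k - size) is even, denominator when it is odd.
        exponent = 1 if (k - size) % 2 == 0 else -1
        for positions in combinations(range(k), size):
            ratio *= pattern_frequency(seq, word, positions) ** exponent
    return ratio
```

For example, relative_abundance("ACGTACGT", "CG") evaluates to 32/7,
about 4.57. Each word needs 2^k - 1 subword frequencies and this sketch
rescans the sequence for every one of them, so it is only practical for
short words and short sequences.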

Eliot