Etandem in EMBOSS packages

Peter Rice peter.rice at uk.lionbioscience.com
Tue Mar 6 18:10:18 UTC 2001


Maija Lahtela wrote:

> I would like to know how the score (i.e. 120) is calculated in the output
> file of etandem
> 120        793        936  6  24  93.8 acccta
>  90        283        420  6  23  84.8 taaccc
> 
> The other thing that worries me is how to select the threshold value. What
> is the difference if I choose threshold=6 or threshold=10?

The algorithm is buried inside Richard Durbin's original version of the
code, and was never documented by the author.

As far as I could tell, from inspecting the code (and this could turn into
part of the program documentation :-) :

Sequences are converted into ACGT or N (so ambiguity codes are ignored).
The score is +1 for a match, -1 for a mismatch.
The first copy of a repeat is ignored.
The highest score is kept for each start position and repeat size.
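
The four rules above could be sketched roughly as follows. This is only an
illustration, not the actual etandem code: the function names are made up,
comparing each base against the first copy is an assumption (the real program
may compare against something else, such as a consensus), and so is treating
N as never matching.

```python
def normalize(seq):
    """Map every character to A, C, G, T or N (ambiguity codes become N)."""
    return "".join(c if c in "ACGT" else "N" for c in seq.upper())


def best_score(seq, start, size):
    """Highest running score for a repeat of `size` beginning at `start`.

    Hypothetical helper: +1 per match, -1 per mismatch, first copy unscored.
    """
    seq = normalize(seq)
    unit = seq[start:start + size]  # first copy: reference only, not scored
    score = best = 0
    for i in range(start + size, len(seq)):
        ref = unit[(i - start) % size]
        # Assumption: an N never counts as a match.
        score += 1 if (seq[i] == ref and ref != "N") else -1
        best = max(best, score)  # keep the highest score seen so far
    return best


print(best_score("acgtacgtacgt", 0, 4))  # 8: two perfect copies after the first
print(best_score("acgtacgaacgt", 0, 4))  # 6: one mismatch costs 2 points
```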

The threshold is the lowest score that will be reported. A perfect repeat
scores its total length minus the first copy, so set the threshold a little
below that to allow mismatches. Each mismatch scores -1 instead of +1, so it
costs 2 points relative to a perfect match over the same number of bases.
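
As a rough check, the arithmetic can be applied to the two output lines quoted
in the question. This assumes the columns are score, start, end, repeat size,
copy count, percent identity and consensus, and that the identity is taken
over the whole reported span; neither is confirmed from the code. The helper
names are invented for the example.

```python
def mismatches_from_identity(start, end, identity_pct):
    """Estimate the mismatch count from the span and the percent identity."""
    length = end - start + 1
    return round(length * (1 - identity_pct / 100))


def tandem_score(size, count, mismatches):
    """Score = scored bases (first copy ignored) minus 2 per mismatch."""
    scored_bases = (count - 1) * size
    return scored_bases - 2 * mismatches


# (start, end, size, count, identity) taken from the quoted output lines
for start, end, size, count, identity in [(793, 936, 6, 24, 93.8),
                                          (283, 420, 6, 23, 84.8)]:
    mm = mismatches_from_identity(start, end, identity)
    print(mm, tandem_score(size, count, mm))
# prints "9 120" and then "21 90", matching the reported scores
```

Under those assumptions, the first line is 23 scored copies of 6 bases (138)
with 9 mismatches, giving 138 - 18 = 120.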

Running with a wide range of repeat sizes is inefficient. That is why
equicktandem was written - to give a rapid estimate of the major repeat
sizes.

-- 
------------------------------------------------
Peter Rice, LION Bioscience Ltd, Cambridge, UK
peter.rice at uk.lionbioscience.com +44 1223 224723





