[EMBOSS] how strong is shuffleseq (Summary)

David Mathog mathog at caltech.edu
Thu Aug 7 19:29:35 UTC 2008


Derek Gatherer wrote:
> Therefore, the answer to the original question, I reckon, is: 
> shuffleseq is just as good if you choose to shuffle once as to 
> shuffle 100 times. The same is true for make_random_dna.  

Actually, running make_random_seq twice in a row to generate a single
sequence is counterproductive.  If you do that, the first run will use
a transition table which exactly matches the transition frequencies of
the input sequence, while the second run will build its transition
table from the first randomized sequence.  Since that sequence is of
finite length, the second run will obtain only an approximation of the
originally observed transition frequencies for use in generating the
final randomized sequence.  The shorter the sequence, the greater this
effect will be.
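
To make the re-estimation effect concrete, here is a minimal Python
sketch.  It is not the make_random_seq source; it assumes a plain
first-order model in which each base is drawn according to the observed
frequencies of what follows the previous base.  The function names and
the input sequence are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def transition_counts(seq):
    """Count first-order transitions: each base -> the base that follows it."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts

def transition_freqs(seq):
    """Normalize the counts so that each row sums to 1."""
    freqs = {}
    for a, row in transition_counts(seq).items():
        total = sum(row.values())
        freqs[a] = {b: n / total for b, n in row.items()}
    return freqs

def generate(seq, rng, start=None):
    """Emit a sequence of len(seq) bases from the table estimated from seq."""
    counts = transition_counts(seq)
    alphabet = sorted(set(seq))
    out = [start if start is not None else rng.choice(alphabet)]
    while len(out) < len(seq):
        row = counts.get(out[-1])
        if row:
            bases, weights = zip(*sorted(row.items()))
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            # Dead end (base never seen with a successor): pick any base.
            out.append(rng.choice(alphabet))
    return "".join(out)

rng = random.Random(0)
original = "AAGAAGCCTTAGGATCCAAGTTTACGAAGA"  # made-up input, not real data
first = generate(original, rng)    # uses the table of the original
second = generate(first, rng)      # uses the table of the first output

print("table used by run 1:", transition_freqs(original))
print("table used by run 2:", transition_freqs(first))
```

The second printed table is estimated from a finite sample drawn from
the first, so it will generally not match it, and the mismatch tends to
grow as the sequence gets shorter.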

>  There is 
> nothing to separate the two programs in performance.

The two randomized sequences should have slightly different properties. 

The output of shuffleseq will maintain composition (exactly), while the
output of make_random_seq, with the parameters you used, will maintain
dimer composition (approximately).  How much the random sequences
produced by the two programs differ will depend to a great extent on
how skewed the dimer composition of the input sequence is with respect
to the expected dimer composition (as calculated from the monomer
composition).  That is, if A, G, C, T are all 25%, and all dimers are
6.25%, the outputs of the two programs would be very similar.  However,
consider an extreme case which illustrates how much they can differ:

% echo AGCTAGCTAGCTAGCT \
  | make_random_seq -in - -inproc 2 -order 1 -n
>random_sequence_0
TAGCTAGCTAGCTAGC

This is just the original sequence (phase shifted).  Similarly, for
this very short sequence, even -order 0 output would be distinguishable
from that of shuffleseq:

% echo AGCTAGCTAGCTAGCT | make_random_seq -in - -inproc 2 -order 0 -n  
>random_sequence_0
GAAAGACTCTGTATGG

In this case the result is a sequence with 5 G, 5 A, 4 T, and 2 C,
whereas shuffleseq output would still have exactly 4 of each.
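
The distinction between exact and approximate composition can be
sketched in a few lines of Python.  This is only an illustration under
stated assumptions: shuffle_seq stands in for a permutation of the input
letters, sample_order0 stands in for independent draws weighted by the
monomer frequencies, and neither is taken from the shuffleseq or
make_random_seq sources.  All names are hypothetical.

```python
import random
from collections import Counter

def shuffle_seq(seq, rng):
    """Permute the letters of seq; the composition cannot change."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

def sample_order0(seq, rng):
    """Draw len(seq) letters independently, weighted by monomer counts."""
    counts = Counter(seq)
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))

def dimer_counts(seq):
    """Count overlapping dimers as they occur in seq."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

def expected_dimer_freqs(seq):
    """Dimer frequencies expected from the monomer composition alone."""
    n = len(seq)
    mono = {b: c / n for b, c in Counter(seq).items()}
    return {a + b: mono[a] * mono[b] for a in mono for b in mono}

rng = random.Random(0)
seq = "AGCTAGCTAGCTAGCT"
shuffled = shuffle_seq(seq, rng)
sampled = sample_order0(seq, rng)

print("input   ", sorted(Counter(seq).items()))
print("shuffled", sorted(Counter(shuffled).items()))  # always 4 of each
print("sampled ", sorted(Counter(sampled).items()))   # varies with the seed
print("observed dimers:", sorted(dimer_counts(seq).items()))
print("expected AG frequency:", expected_dimer_freqs(seq)["AG"])
```

For this input only four of the sixteen possible dimers ever occur,
although the monomer composition predicts 6.25% for each, which is why
the two kinds of randomization diverge so strongly here.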

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


