[EMBOSS] shuffleseq questions

Sun Dec 30 04:12:11 UTC 2018

Greetings EMBOSS users!

   - I am using shuffleseq on entire genomic DNA multifasta input files
   (EMBOSS ver 6.6.0).
   - For just one genome, which is relatively large (~2 GB) with several
   pseudomolecules in the 150-250 Mb size range, I am splitting the input into
   individual sequences and running them as an array job.
   - All runs are on a UNIX-based compute cluster using the SLURM queue controller.
   - My syntax is simply: srun shuffleseq -sformat pearson $IN
   $OUT
   - For the most part, all is well.
   - With that as context, I have a few questions about the use of
   shuffleseq:

*Q1.* What is the calculation for RAM required, based on input file size?
Is there an approximate formula? Or have users figured it out empirically?

*Q2.* When I performed some downstream analyses of shuffled genomes from 5
independent runs of shuffleseq, 4/5 gave me no DNA sequence matches -
suggesting shuffling worked well, but in 1/5 this was not at all the case.
So I wonder whether the randomization step during shuffling is quirky in
any way!?
I came across this link <http://eyegene.ophthy.med.umich.edu/shuffle/> -
describing possible issues with lack of true randomization in an old EMBOSS
release. It makes me wonder whether these sorts of issues still play any role
in version 6.6.0 as well?
Or could there be other explanation(s) for why 4 are good shuffles but 1 is
not at all? The scripts for the repetitions are copies of one another with
minor modifications, so a copy-and-paste slip was possible. Nevertheless,
I have checked and re-checked the syntax and found no errors there.
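
For reference, a per-sequence check along the following lines might help
separate a failed shuffle from a downstream artifact. This is only a sketch in
Python, not part of EMBOSS: the helper names read_fasta and shuffle_report are
hypothetical, and it assumes plain Pearson/FASTA files that fit in memory. It
compares an original sequence with its shuffled counterpart for length, base
composition, and the fraction of positions that still carry the same base.

```python
from collections import Counter


def read_fasta(path):
    """Return {record_id: sequence} for a plain FASTA file (hypothetical helper)."""
    records = {}
    name = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if name is not None:
                    records[name] = "".join(chunks)
                parts = line[1:].split()
                name = parts[0] if parts else ""
                chunks = []
            else:
                chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def shuffle_report(original, shuffled):
    """Compare one original sequence with its shuffled counterpart."""
    # Count positions where the base is the same before and after shuffling.
    identical = sum(a == b for a, b in zip(original, shuffled))
    return {
        "same_length": len(original) == len(shuffled),
        # A shuffle only reorders residues, so the base counts should match.
        "same_composition": Counter(original) == Counter(shuffled),
        "identical_fraction": identical / len(original) if original else 0.0,
        "unchanged": original == shuffled,
    }
```

For a uniformly random shuffle, the expected identical fraction is the sum of
the squared base frequencies, so roughly 0.25 for a sequence with equal base
composition. A value close to 1.0, or "unchanged" being true, would suggest
that the output was not shuffled at all, for example because the wrong file
was written or read in that run.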

Thanks, in advance, for advice and pointers from forum members.
And, in advance, best wishes for a happy and productive 2019.

Cheers!
Anand