[EMBOSS] shuffleseq questions
Anandkumar Surendrarao
aksrao at ucdavis.edu
Sun Dec 30 04:12:11 UTC 2018
Greetings EMBOSS users!
- I am using shuffleseq on entire genomic DNA multifasta input files
(EMBOSS ver 6.6.0).
- For just one genome that is relatively large (~2 GB), with several
pseudomolecules in the 150-250 Mb size range, I am splitting the input into
individual sequences and running them as an array job.
- All runs are on a UNIX-based compute cluster using the SLURM queue controller.
- My syntax is simply: srun shuffleseq -sformat pearson $IN
$OUT
- For the most part, all is well.
- With that as context, I have a few questions about the use of
shuffleseq:
*Q1.* What is the calculation for RAM required, based on input file size?
Is there an approximate formula? Or have users figured it out empirically?
*Q2.* When I performed some downstream analyses of shuffled genomes from 5
independent runs of shuffleseq, 4/5 gave me no DNA sequence matches -
suggesting shuffling worked well, but in 1/5 this was not at all the case.
So I wonder whether the randomization step during shuffling is quirky in
any way!?
I came across this link <http://eyegene.ophthy.med.umich.edu/shuffle/> -
describing possible issues with lack of true randomization in an old EMBOSS
release. It makes me wonder whether this sort of issue still plays any role in
version 6.6.0 as well?
Or could there be other explanation(s) for why 4 are good shuffles but 1 is
not at all? The scripts for the repetitions are simple copies of one another,
modified only where needed. Nevertheless, I've checked and re-checked the
syntax; no errors there.
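
One way to narrow down Q2 is to compare each shuffled sequence directly with its input before running any downstream analysis. The following is only a minimal sketch in Python, not part of EMBOSS: the function name `shuffle_report` and the default `k=12` are hypothetical choices, and both sequences are assumed to be already loaded as plain strings. Building the k-mer set for a 150-250 Mb pseudomolecule would need a lot of memory, so it is probably best tried on a subsampled region first.

```python
from collections import Counter


def shuffle_report(original, shuffled, k=12):
    """Compare an original sequence with its shuffled counterpart.

    Hypothetical helper for illustration; not an EMBOSS function.
    """
    # Fraction of aligned positions that carry the same base.
    n = min(len(original), len(shuffled))
    identical = sum(a == b for a, b in zip(original, shuffled))

    # Fraction of k-mers in the shuffled sequence that also occur
    # somewhere in the original sequence.
    original_kmers = {original[i:i + k] for i in range(len(original) - k + 1)}
    n_kmers = max(0, len(shuffled) - k + 1)
    shared = sum(shuffled[i:i + k] in original_kmers for i in range(n_kmers))

    return {
        # A shuffle should keep the base counts exactly the same.
        "same_composition": Counter(original) == Counter(shuffled),
        "positional_identity": identical / n if n else 0.0,
        "shared_kmer_fraction": shared / n_kmers if n_kmers else 0.0,
    }
```

For a well-shuffled sequence I would expect `same_composition` to be true, `positional_identity` to sit near the sum of the squared base frequencies (about 0.25 for equal base frequencies), and `shared_kmer_fraction` to be low for a reasonably long k. A run where the last two values are close to 1 would suggest that the output is essentially the unshuffled input.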
Thanks, in advance, for advice and pointers from forum members.
And, in advance, best wishes for a happy and productive 2019.
Cheers!
Anand