[EMBOSS] shuffleseq for multifasta?
David Mathog
mathog at caltech.edu
Fri Nov 9 17:32:22 UTC 2018
On 08-Nov-2018 19:19, Anandkumar Surendrarao wrote:
> I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle entire
> genomes (one-by-one). My input genomic sequences are in multifasta format.
> And I wish to retain the same multifasta format for the output file as
> well, containing the shuffled DNA sequences.
This isn't an EMBOSS solution, but I use
http://saf.bio.caltech.edu/pub/software/molbio/fastaselecth.c
for many similar tasks. It reads sequence entries from -in and writes
the selected ones to -out, choosing entries by the headers supplied
through -sel and emitting them in that order. It can also reject just
the sequences listed in -sel, in which case the input order carries over
to the output. Assuming the headers are simple (so that the alternate
header parsing flags are not needed), and also using my extract program
(from drm_tools on sourceforge; there are lots of other ways of doing
this) to make a list of the header names:
extract -in source.fasta -if '>' -ifonly -mt -dl '> ' -fmt '[1]' \
| shuf \
| fastaselecth -in source.fasta -out shuffled.fasta -sel -
If you want a random subset of 1000 sequences, change the second line to
| shuf | head -n 1000 \
and so forth.
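
For readers who do not have fastaselecth or extract installed, the same
reordering can be sketched in plain Python. This is only an illustration
and not how those tools work internally: the names read_fasta,
shuffle_entries, and write_fasta and the script name shuffle_fasta.py are
made up for this example, and the whole file is assumed to fit in memory.
Like the pipeline above, it shuffles the order of whole entries (optionally
keeping only a random subset) and leaves the residues of each sequence
untouched.

```python
import random
import sys


def read_fasta(handle):
    """Return a list of (header, sequence_lines) tuples from a FASTA stream."""
    entries = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            entries.append((line, []))
        elif entries and line:
            # Sequence line belonging to the most recent header.
            entries[-1][1].append(line)
    return entries


def shuffle_entries(entries, count=None, seed=None):
    """Return the entries in random order; keep only `count` of them if given."""
    rng = random.Random(seed)
    shuffled = list(entries)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled if count is None else shuffled[:count]


def write_fasta(entries, handle):
    """Write entries back out with their original headers and line breaks."""
    for header, seq_lines in entries:
        handle.write(header + "\n")
        for seq_line in seq_lines:
            handle.write(seq_line + "\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python shuffle_fasta.py source.fasta > shuffled.fasta
    with open(sys.argv[1]) as source:
        write_fasta(shuffle_entries(read_fasta(source)), sys.stdout)
```

Passing count=1000 to shuffle_entries gives the random subset of 1000
described above; passing a seed makes the order reproducible.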
Bump up the -wl parameter if sequences longer than 10Mbp are possible.
I'm sure that somewhere there are fasta files with headers so complex
that the alternative header parsing options in the program are not
sufficient. If that happens, use extract (or perl or awk or ...) to
simplify the headers first.
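
As one possible way to do that simplification, here is a small Python
sketch. It assumes that "simple" means keeping only the first
whitespace-delimited token after the '>', which is a guess and not
something fastaselecth requires; the function names simplify_header and
simplify_fasta_headers are hypothetical. Check that the shortened names are
still unique before selecting on them.

```python
def simplify_header(header):
    """Reduce '>id free text ...' to '>id' (first whitespace-delimited token)."""
    fields = header[1:].split()
    return ">" + (fields[0] if fields else "")


def simplify_fasta_headers(lines):
    """Yield FASTA lines with header lines simplified; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\r\n")
        yield simplify_header(line) if line.startswith(">") else line
```

For example, simplify_header(">seq1 Homo sapiens chr1") returns ">seq1".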
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech