[EMBOSS] shuffleseq for multifasta?

David Mathog mathog at caltech.edu
Fri Nov 9 17:32:22 UTC 2018


On 08-Nov-2018 19:19, Anandkumar Surendrarao wrote:
> I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle
> entire genomes (one-by-one). My input genomic sequences are in
> multifasta format. And I wish to retain the same multifasta format for
> the output file as well, containing the shuffled DNA sequences.

This isn't an EMBOSS solution, but I use

   http://saf.bio.caltech.edu/pub/software/molbio/fastaselecth.c

to do many similar tasks.  It reads sequence entries from -in, selects 
them by header in the order given through -sel, and writes the selected 
ones to -out.  It can also reject just the sequences named in -sel, in 
which case the input order carries over to the output.  Assuming the 
headers are simple (so that the alternate header parsing flags are not 
needed), and using my extract program (from drm_tools on SourceForge; 
there are lots of other ways of doing this) to make a list of the header 
names:

extract -in source.fasta -if '>' -ifonly -mt -dl '> ' -fmt '[1]' \
  | shuf \
  | fastaselecth -in source.fasta -out shuffled.fasta -sel -

If you want a random subset of 1000, change the second line to

  | shuf | head -1000 \

and so forth.
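
For readers who do not have fastaselecth or extract installed, the same 
reordering can be sketched with awk and shuf alone.  This is only an 
illustration, not the method used above: the function name 
shuffle_records is made up here, and it assumes that the file starts 
with a '>' header, that no line contains a tab character, and that GNU 
shuf is available.  Like the pipeline, it shuffles the order of whole 
entries and leaves the bases within each entry untouched.

```shell
# shuffle_records FILE [COUNT]
# Hypothetical helper (not part of fastaselecth or drm_tools).
# Writes the entries of FILE to stdout in random order; with COUNT,
# writes only COUNT randomly chosen entries.
# Assumptions: FILE starts with a '>' header and contains no tabs.
#
# Steps:
#   1. awk joins each entry (header plus its sequence lines) into one
#      tab-separated line, so that shuf moves entries as units.
#   2. shuf randomizes the one-line entries (shuf -n keeps COUNT of them).
#   3. tr turns the tabs back into newlines, restoring the line layout.
shuffle_records() {
  awk '
    /^>/ { if (rec != "") print rec; rec = $0; next }
         { rec = rec "\t" $0 }
    END  { if (rec != "") print rec }
  ' "$1" |
  if [ -n "${2:-}" ]; then shuf -n "$2"; else shuf; fi |
  tr '\t' '\n'
}
```

Usage would look like

  shuffle_records source.fasta > shuffled.fasta
  shuffle_records source.fasta 1000 > subset.fasta

Each entry is held as a single line while it passes through shuf, so 
very large entries cost a corresponding amount of memory.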

Bump up the -wl parameter if sequences longer than 10Mbp are possible.
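
To see whether that applies to a given file, it can help to measure the 
longest sequence first.  The awk sketch below is an illustration only: 
the function name max_seq_length is made up here, and it simply counts 
the characters on the non-header lines of each entry, so it says nothing 
about how fastaselecth itself interprets -wl.

```shell
# max_seq_length FILE
# Hypothetical helper: prints the length of the longest sequence in FILE,
# counting every character on the non-header lines of each entry.
max_seq_length() {
  awk '
    /^>/ { if (len > max) max = len; len = 0; next }  # new entry: close the previous one
         { len += length($0) }                        # sequence line: add its length
    END  { if (len > max) max = len; print max + 0 }  # close the last entry; 0 if empty
  ' "$1"
}
```

If "max_seq_length source.fasta" prints a number above 10000000, the 
file contains a sequence longer than 10Mbp.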

I'm sure that somewhere there are fasta files with headers so complex 
that the alternative header parsing options in the program are not 
sufficient.  If that happens use extract (or perl or awk or ...) to 
simplify the headers.
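
As one example of such a simplification, the awk sketch below keeps only 
the first whitespace-delimited token of each header and passes the 
sequence lines through unchanged.  The function name simplify_headers is 
made up for illustration, and which part of a header is worth keeping 
depends on the file.

```shell
# simplify_headers FILE
# Hypothetical helper: rewrites each '>' header line of FILE so that only
# its first whitespace-delimited token remains; other lines are unchanged.
simplify_headers() {
  awk '
    /^>/ { sub(/^>[ \t]*/, ""); print ">" $1; next }  # strip ">" and keep the first token
         { print }                                    # sequence lines pass through
  ' "$1"
}
```

For example, "simplify_headers source.fasta > simple.fasta".  Shortened 
names may collide, so it is worth checking afterwards that they are 
still unique, for instance with

  grep '^>' simple.fasta | sort | uniq -d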

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

