[EMBOSS] shuffleseq for multifasta?
Anandkumar Surendrarao
aksrao at ucdavis.edu
Fri Nov 9 15:16:27 UTC 2018
Thank you, David and Peter.
My input file actually has shortened IDs (shIDs) and alternating lines of
fasta header and sequences (cleaned-up).
First, I copied my input file to a name without EMBOSS special characters:
cp Athaliana_167_TAIR9.fa.shIDscleaned-up Athaliana_167_TAIR9_UNshuffled.fa
Next, I ran shuffleseq using advice from both of you, as follows:
time shuffleseq -sformat pearson Athaliana_167_TAIR9_UNshuffled.fa EMBOSS.fa
Shuffle a set of sequences maintaining composition
real 15m13.015s
user 15m11.998s
sys 0m0.844s
And this works, so thank you both very much.
Best,
Anand
_____
*Anandk*umar *S*urendra*rao*, PhD
+1.530.574.5134
+91.91760.70887
*note to self:*
For ChrC, I compared sequences using BLAST 2 - no similarity detected, as
expected.
For Chr1 and ChrC, I used a Perl script to count a, c, g, t, n and ?, and
found the counts to be exactly the same before and after shuffling.
Perl script = summarizeACGTcontent.pl
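
A minimal Python sketch of this kind of composition check is given here for
illustration. It is not the Perl script named above, whose contents are not
shown in this thread. The function names (read_fasta, composition, compare)
are hypothetical, and treating "?" as "any character other than a, c, g, t, n"
is an assumption about what the original script counted.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines.

    The ID is taken as the header text up to the first whitespace.
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def composition(seq):
    """Count a, c, g, t, n (case-insensitive); everything else goes to '?'."""
    counts = Counter(seq.lower())
    result = {base: counts.get(base, 0) for base in "acgtn"}
    # Assumption: '?' collects every character that is not a, c, g, t or n.
    result["?"] = len(seq) - sum(result.values())
    return result


def compare(before_lines, after_lines):
    """Map each sequence ID to True if its composition is unchanged."""
    before = {sid: composition(s) for sid, s in read_fasta(before_lines)}
    after = {sid: composition(s) for sid, s in read_fasta(after_lines)}
    return {sid: before[sid] == after.get(sid) for sid in before}
```

To use it on the files from this thread, pass two open file handles, for
example the handles for Athaliana_167_TAIR9_UNshuffled.fa and EMBOSS.fa. A
value of True for every ID means the shuffle preserved composition.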
On Fri, Nov 9, 2018 at 4:03 AM Peter Rice <ricepeterm at yahoo.co.uk> wrote:
> Hi Anand,
>
> As we found when we wrote EMBOSS, "FASTA format" is actually hard to
> define. The problem is the many ways you can define the ID, and the
> other information on the first line (it is amazing how much information
> you can encode in a simple description).
>
> Our solution was to define a set of formats that all read FASTA files,
> but parse the first line in different ways, for example "ncbi format"
> tries to read the NCBI database and id syntax.
>
> We added a format to read the sequence ID as-is for really awkward
> cases, and in honour of the author of FASTA we called it "pearson".
>
> So, if you add -sformat pearson it should read the full IDs up to the
> first space. If you re-read the output, you should use -sf pearson again
> (-sf is just short for -sformat)
>
> Hope that helps.
>
> Peter Rice
> ricepeterm at yahoo.co.uk
>
> On 09/11/2018 07:30, David Bauer wrote:
> > Hi Anand,
> >
> > if you run “shuffleseq -help” you will see the type of input and output
> > sequences.
> >
> > Version: EMBOSS:6.5.7.0
> >
> > Standard (Mandatory) qualifiers:
> >   [-sequence]  seqall     Sequence(s) filename and optional format, or
> >                           reference (input USA)
> >   [-outseq]    seqoutall  [<sequence>.<format>] Sequence set(s)
> >                           filename and optional format (output USA)
> >
> > The “all” in seqall and seqoutall indicates that input and output can be
> > sequence files with multiple sequences.
> >
> > This can be fasta format or any other sequence format supported by
> > EMBOSS (genbank, embl etc.)
> >
> > The sequence names from the original file will be preserved in the
> > output file.
> >
> > If I try to reproduce your example with the file downloaded from IPK:
> >
> > shuffleseq Athaliana_167_TAIR9.fa test1.fa
> >
> > the output file contains the sequences as named in the input file:
> >
> > infoseq -only -name -desc test1.fa
> >
> > Name  Description
> > Chr1  CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2007-12-20
> > Chr2  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> > Chr3  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> > Chr4  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> > Chr5  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2007-12-20
> > ChrM  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2005-06-03
> > ChrC  CHROMOSOME dumped from ADB: Feb/3/09 16:10; last updated: 2005-06-03
> >
> > Your input file name contains “shIDscleaned-up”. You may have
> > modified the sequence names in a way that confuses EMBOSS.
> >
> > You can test this by running infoseq as above and checking whether
> > “Name” shows what you expect.
> >
> > Make sure you don’t have any “:” characters in the sequence names in
> > your fasta file. This character has a special meaning in EMBOSS sequence
> > names.
> >
> > Hope this helps.
> >
> > Sincerely,
> >
> > David.
> >
> > *From:* EMBOSS <emboss-bounces+david.bauer=bayer.com at mailman.open-bio.org>
> > *On Behalf Of* Anandkumar Surendrarao
> > *Sent:* 09 November 2018 04:20
> > *To:* emboss at mailman.open-bio.org
> > *Subject:* [EMBOSS] shuffleseq for multifasta?
> >
> > Greetings!
> >
> > I am new to EMBOSS, and trying to use shuffleseq to randomly shuffle
> > entire genomes (one-by-one). My input genomic sequences are in
> > multifasta format. And I wish to retain the same multifasta format for
> > the output file as well, containing the shuffled DNA sequences.
> >
> > From the information at
> > http://emboss.sourceforge.net/apps/cvs/emboss/apps/shuffleseq.html, it
> > appears to me that FASTA format is supported for neither input nor
> > output. Am I mistaken?
> >
> > OR
> >
> > Is there a way to specify (multi)FASTA as both input and output formats?
> >
> > In one run that I completed with a genome assembly with 5 chromosomes -
> > Chr1 ... Chr5, the syntax I used was:
> >
> > shuffleseq -sequence Athaliana_167_TAIR9.fa.shIDscleaned-up -outseq Athaliana_167_TAIR9_EmbossShuffled.fas
> >
> > Strangely, in the output file, the fasta headers were all the same: Chr1.
> >
> > Hence my confusion. Could someone please clarify what my input
> > formatting should be and the correct syntax?
> >
> > Thanks, in advance, for your help.
> >
> > Sincerely,
> >
> > Anand
> >
> > _____
> >
> > *Anand**k*umar *S*urendra*rao*, PhD
> >
> > +1.530.574.5134
> >
> > +91.91760.70887
> >
> >
> > _______________________________________________
> > EMBOSS mailing list
> > EMBOSS at mailman.open-bio.org
> > http://mailman.open-bio.org/mailman/listinfo/emboss
> >
>