[EMBOSS] uniq sequences on a list
Peter Rice
pmr at ebi.ac.uk
Tue Oct 4 14:13:21 UTC 2011
On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> identical sequences and I want to extract only the ones in my list. How can
> I do that?
> Example:
>
> Multi.fasta file:
>
> >seq1
> atataga...
> >seq2
> ttatggttca..
> [...]
> >seq1
> atataga...
> [...]
> And I only want to take seq1 and seq2, not two times seq1!!
If you really must start from that file, then, as usual with EMBOSS,
there are several ways to do it.
1. Index with dbifasta
----------------------
You can index with the older dbifasta program. This does not allow
duplicate IDs so only one seq1 will be indexed.
% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto
Then define a database in your .embossrc file:
DB multi [
  format: "fasta"
  method: "emblcd"
  type: "nucleotide"
  directory: "."
]
Then replace "Multi.fasta" in your listfile with "multi" and you will
have the sequences you want.
2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------
% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta seq2.fasta ...
% cd multi
% seqret '*.fasta' ../Single.fasta
(Note: you do need the quotes around the wildcard file name.)
Each duplicate ID overwrites the earlier file of the same name, so this
will give you a file Single.fasta in the original directory with only
the last version of each ID.
3. Write a new application
---------------------------
Another approach is to write your own new application. A copy of seqret
which keeps a table of IDs and rejects any sequence whose ID is already
in the table will rewrite the file (in any format) with only the first
occurrence of each ID. We will add this to the next release.
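Until such an application exists, the same idea can be sketched outside
EMBOSS. The following is only a minimal illustration in Python, not the
planned EMBOSS program: it assumes plain FASTA input, takes the ID to be
the first word after '>', and the names unique_fasta_records and
uniq_fasta.py are made up for this example.

```python
import sys


def unique_fasta_records(lines):
    """Yield the lines of FASTA records, skipping any record whose ID was already seen."""
    seen = set()   # table of IDs encountered so far
    keep = False   # True while the current record is being emitted
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first word after '>'
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        # Lines before the first header are dropped because keep starts as False
        if keep:
            yield line


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python uniq_fasta.py Multi.fasta > Single.fasta
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(unique_fasta_records(handle))
```

Unlike method 2, this keeps the first occurrence of each ID rather than
the last, and it only handles FASTA, whereas a seqret copy would work
with any format EMBOSS reads.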
4. ... there may be more ways, but these will be enough to solve your
problem.
Hope that helps,
Peter Rice
EMBOSS Team