[EMBOSS] uniq sequences on a list

Tue Oct 4 14:13:21 UTC 2011

On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> identical sequences and I want to extract only the ones in my list, how can
> I do that?
> Example:
>
> Multi.fasta file:
>
>> seq1
> atataga...
>> seq2
> ttatggttca..
> [...]
>> seq1
> atataga...
> [...]
> And I only want to take seq1 and seq2, not two times seq1!!

If you really must start from that file: as usual with EMBOSS, there are
several ways to do it.

1. Index with dbifasta
----------------------

You can index with the older dbifasta program. This does not allow 
duplicate IDs so only one seq1 will be indexed.

% dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto

Then define a database in your .embossrc file:

DB multi [
   format: "fasta"
   method: "emblcd"
   type: "nucleotide"
   directory: "."
]

Then replace "Multi.fasta" in your listfile with "multi" and you will 
have the sequences you want.

2. Rewrite as single files in a new directory, then rewrite as one file
-----------------------------------------------------------------------

% mkdir multi
% seqret -ossingle -osdir multi Multi.fasta -auto
% ls multi
seq1.fasta  seq2.fasta ...

% cd multi
% seqret '*.fasta' ../Single.fasta

(Note: you do need the quotes around the wildcard file name, so that the
shell passes it to seqret unexpanded.)

This will give you a file Single.fasta in the original directory with
only the last version of each ID, because a later duplicate overwrites
the earlier file of the same name.

3. Write a new application
--------------------------

Another approach is to write your own new application. A copy of seqret
that keeps a table of IDs and rejects any sequence whose ID it has
already seen will rewrite the file (in any format) with only the first
occurrence of each ID. We will add this to the next release.
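
Until then, the table-of-IDs idea can be tried outside EMBOSS. What follows
is only a minimal Python sketch, not the EMBOSS application itself. It
assumes plain FASTA input and takes the ID to be the first word after ">".
The function names are made up for illustration.

```python
def first_occurrences(lines):
    """Yield the lines of each FASTA record whose ID has not been seen before.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    Lines that appear before the first header are dropped.
    """
    seen = set()   # table of IDs already copied to the output
    keep = False   # True while the current record is being copied
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen   # reject a record with a known ID
            seen.add(seq_id)
        if keep:
            yield line


def write_unique(in_path, out_path):
    """Copy in_path to out_path, keeping the first record for each ID."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(first_occurrences(src))
```

Calling write_unique("Multi.fasta", "Single.fasta") would keep the first seq1
and skip the second, the opposite of method 2, which keeps the last one.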

4.  ... there may be more ways, but these will be enough to solve your 
problem.

Hope that helps,

Peter Rice
EMBOSS Team