[EMBOSS] uniq sequences on a list

Fernando Martínez-Alberola fermaral1981 at gmail.com
Wed Oct 5 10:52:43 UTC 2011


On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-FASTA file where there are
> > identical sequences, and I want to extract only the ones in my list. How can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> > >seq1
> > atataga...
> > >seq2
> > ttatggttca..
> > [...]
> > >seq1
> > atataga...
> > [...]
> > And I only want to take seq1 and seq2, not seq1 twice!
> 
> If you really must start from that file, then as usual with EMBOSS there
> are several ways to do it.
> 
> 1. Index with dbifasta
> ----------------------
> 
> You can index with the older dbifasta program. This does not allow 
> duplicate IDs so only one seq1 will be indexed.
> 
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto
> 
> Then define a database in your .embossrc file:
> 
> DB multi [
>    format: "fasta"
>    method: "emblcd"
>    type: "nucleotide"
>    directory: "."
> ]
> 
> Then replace "Multi.fasta" in your listfile with "multi" and you will 
> have the sequences you want.
> 
> 
> 
> 2. Rewrite as single files in a new directory, then rewrite as one file
> -----------------------------------------------------------------------
> 
> % mkdir multi
> % seqret -ossingle -osdir multi Multi.fasta -auto
> % ls multi
> seq1.fasta  seq2.fasta ...
> 
> % cd multi
> % seqret '*.fasta' ../Single.fasta
> 
> (Note: you do need the quotes around the wildcard file name.)
> 
> This will give you a file Single.fasta in the original directory with 
> only the last version of each ID.
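
The reason the last version survives is that each record is written to a file named after its ID, so a later record with the same ID replaces the earlier file. A minimal Python sketch of that effect, with a dict standing in for the output directory (the records and the name `files` are made-up illustrations, not EMBOSS code):

```python
# Illustrative records in file order: (id, sequence). Made-up data;
# the second seq1 is deliberately different so the effect is visible.
records = [("seq1", "atataga"), ("seq2", "ttatggttca"), ("seq1", "atatagaNN")]

# Stand-in for the output directory: one "file" per ID.
files = {}
for seq_id, seq in records:
    # Writing seq1.fasta a second time replaces the first copy.
    files[seq_id + ".fasta"] = seq

# Reading '*.fasta' back gives each ID once, holding its last version.
for name in sorted(files):
    print(name, files[name])
```

If the duplicates are truly identical, as in the example above, it makes no difference which copy is kept.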
> 
> 
> 
> 3. Write a new application
> ---------------------------
> 
> Another approach is to write your own new application. A copy of seqret 
> that keeps a table of IDs and rejects any sequence whose ID it has already 
> seen will rewrite the file (in any format) with only the first occurrence 
> of each ID. We will add this to the next release.
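
Until such an application exists, the same idea can be sketched outside EMBOSS. The following is a rough Python illustration for FASTA input only, not the seqret-based program described above. The function name `unique_fasta` and its `wanted` parameter are hypothetical, and the ID is assumed to be the first word after `>`:

```python
import io


def unique_fasta(handle, wanted=None):
    """Yield (header, sequence_lines) for the first record of each ID.

    Assumes the ID is the first word after '>'. If `wanted` is given,
    only IDs in that set are kept.
    """
    seen = set()          # table of IDs already encountered
    keep = False          # whether the current record should be emitted
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if keep:
                yield header, lines
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # Keep only unseen IDs that are also in the wanted list, if any.
            keep = seq_id not in seen and (wanted is None or seq_id in wanted)
            seen.add(seq_id)
            header, lines = line, []
        elif keep:
            lines.append(line)
    if keep:
        yield header, lines


# Usage on an in-memory copy of the example file (made-up sequences).
demo = io.StringIO(">seq1\natataga\n>seq2\nttatggttca\n>seq1\natataga\n")
for header, lines in unique_fasta(demo, wanted={"seq1", "seq2"}):
    print(header)
    print("\n".join(lines))
```

Unlike method 2, this keeps the first occurrence of each ID and preserves the original record order.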
> 
> 
> 4.  ... there may be more ways, but these will be enough to solve your 
> problem.
> 
> Hope that helps,
> 
> Peter Rice
> EMBOSS Team

Thanks, your help was very useful, in particular the second method.
Best regards, Fernando



