[EMBOSS] uniq sequences on a list
Fernando Martínez-Alberola
fermaral1981 at gmail.com
Wed Oct 5 10:52:43 UTC 2011
On Tue, 04-10-2011 at 15:13 +0100, Peter Rice wrote:
> On 10/04/2011 02:38 PM, Fernando Martinez wrote:
> > Hi, I am trying to retrieve sequences from a multi-fasta file where there are
> > identical sequences, and I want to extract only the ones in my list. How can
> > I do that?
> > Example:
> >
> > Multi.fasta file:
> >
> > >seq1
> > atataga...
> > >seq2
> > ttatggttca..
> > [...]
> > >seq1
> > atataga...
> > [...]
> > And I only want to take seq1 and seq2, not two times seq1!!
>
> If you really must start from that file .... as usual with EMBOSS there
> are several ways to do it
>
> 1. Index with dbifasta
> ----------------------
>
> You can index with the older dbifasta program. This does not allow
> duplicate IDs so only one seq1 will be indexed.
>
> % dbifasta -dbname multi -dir . -index . -file Multi.fasta -idformat simple -auto
>
> Then define a database in your .embossrc file:
>
> DB multi [
> format: "fasta"
> method: "emblcd"
> type: "nucleotide"
> directory: "."
> ]
>
> Then replace "Multi.fasta" in your listfile with "multi" and you will
> have the sequences you want.
>
>
>
> 2. Rewrite as single files in a new directory, then rewrite as one file
> -----------------------------------------------------------------------
>
> % mkdir multi
> % seqret -ossingle -osdir multi Multi.fasta -auto
> % ls multi
> seq1.fasta seq2.fasta ...
>
> % cd multi
> % seqret '*.fasta' ../Single.fasta
>
> (note: you do need the quotes around the wild card file name)
>
> this will give you a file Single.fasta in the original directory with
> only the last version of each id.
>
>
>
> 3. Write a new application
> ---------------------------
>
> Another approach is to write your own new application. A copy of seqret
> which keeps a table of ids and rejects any sequence with known ID will
> rewrite the file (in any format) with only the first occurrence of each
> id. We will add this to the next release.
>
>
> 4. ... there may be more ways, but these will be enough to solve your
> problem.
>
> Hope that helps,
>
> Peter Rice
> EMBOSS Team
Thanks, your help was very useful, in particular the second method.
Best regards, Fernando
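
For reference, the idea behind approach 3 above (keep a table of IDs and reject any sequence whose ID is already known) can be sketched in plain Python. This is only a minimal illustration, not the EMBOSS application Peter mentions: the function names `read_fasta` and `unique_by_id` and the optional `wanted` parameter are hypothetical, and the sketch assumes the ID is the first whitespace-delimited token after `>` in each FASTA header.

```python
def read_fasta(lines):
    """Yield (header, body_lines) records from an iterable of FASTA lines."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, body
            header, body = line, []
        elif header is not None:
            body.append(line)
    if header is not None:
        yield header, body


def unique_by_id(lines, wanted=None):
    """Return FASTA lines keeping only the first occurrence of each ID.

    If `wanted` (a set of IDs, hypothetical parameter) is given,
    sequences whose ID is not in it are skipped as well.
    """
    seen = set()  # table of IDs already written
    out = []
    for header, body in read_fasta(lines):
        fields = header[1:].split()
        seq_id = fields[0] if fields else ""
        if seq_id in seen:
            continue  # duplicate ID: reject
        if wanted is not None and seq_id not in wanted:
            continue  # not in the user's list
        seen.add(seq_id)
        out.append(header)
        out.extend(body)
    return out
```

Passing an open file handle for Multi.fasta to `unique_by_id` and writing the returned lines to a new file gives one copy of each ID. Unlike approach 2, which keeps the last version of each ID, this keeps the first.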