[EMBOSS] how to find unique DNA sequences from a large database

Fri Dec 8 08:37:32 UTC 2006

Dear Yun Zheng,

> Are there any tools for finding unique sequences in a large database? Many
> thanks.
>
> I need to find unique DNA sequences from a large database. A short piece
> is given as follows.
>

> All these seem to be the same sequence, since BLASTN gives very small
> e-values for their alignments.

Remember that BLASTN is a local alignment tool. The small e-values only
indicate that some part of your 001 query sequence is similar to some part
of a sequence in the database.

You need to check what is matching in the alignments reported by BLASTN.
One useful test is whether the whole length of your query matches any of
the sequences in the database. For DNA, also check whether it matches in
one or both directions (as sequences can have biologically significant
inverted repeats).
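
As a rough sketch of that check in Python: the code below assumes you have
saved the BLASTN hits in the standard 12-column tab-separated format (as
produced by blastall -m 8 or BLAST+ -outfmt 6) and that you supply the query
lengths yourself, since that format does not include them. The function
names and the 95% thresholds are my own illustrative choices, not part of
any tool.

```python
from collections import namedtuple

# One parsed alignment: coverage is the fraction of the query that is aligned.
Hit = namedtuple("Hit", "query subject identity coverage strand")


def parse_tabular_hits(lines, query_lengths):
    """Yield a Hit for each row of 12-column BLAST tabular output.

    query_lengths is a dict {query_id: length}; it is an assumption of this
    sketch, because the tabular format has no query-length column.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        sstart, send = int(fields[8]), int(fields[9])
        coverage = (abs(qend - qstart) + 1) / query_lengths[query]
        # Subject coordinates run backwards for a reverse-strand match.
        strand = "plus" if sstart <= send else "minus"
        yield Hit(query, subject, identity, coverage, strand)


def full_length_hits(lines, query_lengths, min_coverage=0.95, min_identity=95.0):
    """Keep only hits to other sequences that span (nearly) the whole query."""
    return [
        hit
        for hit in parse_tabular_hits(lines, query_lengths)
        if hit.query != hit.subject
        and hit.coverage >= min_coverage
        and hit.identity >= min_identity
    ]
```

A hit with a tiny e-value but low coverage is only a local match, which is
exactly the case that should not be counted as "the same sequence". Note
that this looks at each alignment separately; a query covered by several
separate alignments to one subject would need the alignments to be merged
first.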

There are tools (not in EMBOSS) available for building non-redundant
databases - excluding sequences which are subsequences of others in the
database, or selecting one of a set of sequences that match closely over
their whole length. But you do have to decide what you mean by redundancy
and make sure that the methods you apply are appropriate.
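
To make one possible definition of redundancy concrete, here is a minimal
Python sketch that drops any sequence which is identical to, or an exact
subsequence of, a longer sequence already kept, on either strand. It is an
illustration only: the names are hypothetical, it compares every sequence
against every kept sequence (so it will not scale to a large database), and
it ignores near-identical sequences and ambiguity codes.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def remove_contained(records):
    """Return {name: sequence} with exact duplicates and subsequences removed.

    records is a dict {name: sequence}. Sequences are upper-cased before
    comparison. Longer sequences are considered first, so a sequence is
    dropped when it occurs inside a kept sequence in either orientation.
    """
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    kept = {}
    for name, seq in ordered:
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in other or rc in other for other in kept.values()):
            continue  # redundant under this definition
        kept[name] = seq
    return kept
```

Whether this is the right notion of "unique" depends on your data: if you
also want to merge sequences that differ by a few bases, exact substring
matching is not enough and an alignment-based method is needed.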

Hope that helps,

Peter Rice