[EMBOSS] how to find unique DNA sequences from a large database

Fri Dec 8 01:30:38 UTC 2006

These are not elegant approaches,
but they are workable solutions:

First, for each sequence in your database, read
the whole sequence into one long string.  Then use
a for loop to slide a window the size of your search
sequence along that string, comparing the window
with the query at each position.  Repeat this for
every sequence in the database.  It may take a few
days if you need to scan a big database such as the
human genome.
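
A minimal sketch of that window scan in Python, in case it
helps.  The function names and the dict-of-sequences input are
only my assumptions for illustration; reading the FASTA file and
checking the reverse strand are left out.

```python
def find_exact_hits(query, sequence):
    """Return the 0-based start positions where query occurs in sequence."""
    query = query.upper()
    sequence = sequence.upper()
    window = len(query)
    hits = []
    # Slide a window of the query length along the sequence, one base at a time.
    for start in range(len(sequence) - window + 1):
        if sequence[start:start + window] == query:
            hits.append(start)
    return hits


def scan_database(query, records):
    """Scan every sequence in records (a dict of name -> sequence string).

    Returns a dict of name -> hit positions, only for sequences with a hit.
    """
    result = {}
    for name, sequence in records.items():
        positions = find_exact_hits(query, sequence)
        if positions:
            result[name] = positions
    return result
```

Calling scan_database(query, records) with your short query and
the database sequences gives the names and positions of the exact
matches.  This is a plain exact-match scan, so it will not find
hits with mismatches or gaps.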

The other way is to extend your short query to
17 or 21 nt (I am not sure which is the shortest
length that blast can search).  That means, if you
have a 15 nt oligo, you can create the four x four
(16) possible 17 nt sequences by appending every
two-base combination to it.  For example (the
appended bases are in lower case):

   AAACCCGGGC CCTTTAAaa
   AAACCCGGGC CCTTTAAag
   AAACCCGGGC CCTTTAAac
   AAACCCGGGC CCTTTAAat
   AAACCCGGGC CCTTTAAga
   AAACCCGGGC CCTTTAAgg
   AAACCCGGGC CCTTTAAgc
   AAACCCGGGC CCTTTAAgt
   AAACCCGGGC CCTTTAAca
   AAACCCGGGC CCTTTAAct
   AAACCCGGGC CCTTTAAcg
   AAACCCGGGC CCTTTAAcc
   .....

Then run blast with each of the 16 17-nt sequences
and combine all the results as the hits for your
15 nt query sequence.
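
Generating the extended queries can be sketched like this.  The
function name extend_query and its parameters are hypothetical,
and the bases are appended at the 3' end only, as in the list
above.

```python
from itertools import product


def extend_query(query, extra=2, alphabet="acgt"):
    """Return the query extended by every combination of `extra` bases.

    With extra=2 and a four-letter alphabet this yields 4 x 4 = 16 sequences.
    The appended bases stay in lower case so they are easy to spot.
    """
    return [query + "".join(tail) for tail in product(alphabet, repeat=extra)]
```

For a 15 nt oligo, extend_query(oligo) returns the 16 17-nt
sequences; each one can then be written to a FASTA file and
searched separately.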

Hope this is useful.

Thanks,  TU

==================================

On Thu, 7 Dec 2006, Michael Thon wrote:

> Hi Yun , you might try a clustering algorithm like blastclust (single
> linkage clustering) or mcl (a.k.a tribe-mcl) or one of the others
> that exist.  I can't think of any EMBOSS apps that would solve this
> problem, but maybe someone else has a better answer.
> Mike
>
>
> On Dec 7, 2006, at 2:36 PM, yun zheng wrote:
>
>> Hi,
>>
>> Are there any tools for finding unique sequences in a large
>> database? Many thanks.
>>
>> I need to find unique DNA sequences from a large database. A short
>> piece is given as follows.
>>
>> >001
>> aaaagttgtgtgtgtatgacaggtt
>> >013
>> aacctgtcatacacacacaactttt
>> >289
>> gttgtgtgtgtatgacaggtt
>> >375
>> tgtgtgtatgacaggttgat
>> >319
>> tcaacctgtcatacacaca
>> >177
>> cgcagtgtgtgtatgacagg
>> >271
>> gtcctacctgtcatacacac
>> >020
>> aagacataatgtgtgtatgacag
>>
>> All these seem to be the same sequence, since BLASTN gives very small
>> e-values for their alignments.
>>
>> BLASTN 2.2.8 [Jan-05-2004]
>>
>>
>> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
>> Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
>> Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
>> protein database search programs", Nucleic Acids Res. 25:3389-3402.
>>
>> Query= 001
>>          (25 letters)
>>
>> Database: drought-clustered.fa
>>            410 sequences; 8877 total letters
>>
>> Searching.done
>>
>>
>>                                                Score     E
>> Sequences producing significant alignments:    (bits)  Value
>>
>> 013                                              50    8e-11
>> 001                                              50    8e-11
>> 289                                              42    2e-08
>> 375                                              34    5e-06
>> 319                                              34    5e-06
>> 177                                              32    2e-05
>> 271                                              30    8e-05
>> 020                                              28    3e-04
>>
>> Best regards.
>>
>> sincerely
>>
>> Zheng, Yun
>>
>> Department of Computer Science
>>
>> Washington University in St Louis
>>
>> Campus Box 1045
>>
>> 1 Brookings Drive, St Louis, MO 63130
>> _______________________________________________
>> EMBOSS mailing list
>> EMBOSS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/emboss