[ssml] RE: [Bioperl-l] Small word sizes with BLAST (WU, NCBI)

Wed Mar 3 16:14:47 EST 2004

If you are searching for an exact match to a 5-mer, the
approximate-match tools (such as BLAST) are a poor choice.  You're
much better off just reading the flat file and scanning each sequence
with a standard string-match algorithm.  You can probably even use the
built-in regular-expression search in Perl.  That should be reasonably
fast for a single search.  The Bioperl wrappers for reading the files
should make this a pretty trivial program to write, though they might
make things a little too slow for heavy-duty use.
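
A minimal sketch of the scan described above, written in Python rather
than Perl purely for illustration.  It assumes the flat file is in
FASTA format (an assumption; other formats would need a different
reader), and the names read_fasta and find_exact are hypothetical, not
part of Bioperl or any other library.

```python
import re


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines.

    Assumes FASTA input: a '>' header line followed by sequence lines.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the identifier.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def find_exact(records, word):
    """Return {identifier: [0-based start positions]} for exact matches."""
    # The lookahead makes overlapping occurrences show up as well.
    pattern = re.compile("(?=" + re.escape(word.upper()) + ")")
    hits = {}
    for name, seq in records:
        positions = [m.start() for m in pattern.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits
```

To scan a file, something like find_exact(read_fasta(open(path)), "ACGTA")
should do, where path is whatever file you are searching.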

If you need more speed, you could write a C or C++ program to do the
I/O and use the GNU regular-expression package to do the searching.

If you have many different 5-mers to search for, you could build an
index, listing for each 5-mer all the sequences that contain that 5-mer.
Building the index would take only one pass over the data and would
allow very fast lookup.  Again, one could build a prototype quickly in
Perl, and reimplement in a faster language if it turns out to be
necessary.
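
A rough prototype of such an index, again in Python for illustration
only.  It assumes the sequences are already available as (identifier,
sequence) pairs and that the whole index fits in memory; the name
build_kmer_index is hypothetical.

```python
from collections import defaultdict


def build_kmer_index(records, k=5):
    """Map each k-mer to the set of sequence identifiers containing it.

    records is an iterable of (identifier, sequence) pairs; the data is
    traversed exactly once.
    """
    index = defaultdict(set)
    for name, seq in records:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index
```

Lookup is then a single dictionary access, for example
index.get("ACGTA", set()), which gives an empty set for a 5-mer that
never occurs.  Storing positions as well as identifiers would be a
straightforward extension if you need them.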

Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.