[EMBOSS] Align 12bp DNA back to genome

David Mathog mathog at caltech.edu
Fri Feb 6 19:56:45 UTC 2009


> After trimming a bunch of illumina reads I have sequences that are
12bp long and "we" want to find 
> their genomic location

location(S!)

> 
> I have both a perl regex and fuzznuc running now for oh about a
week..........

It shouldn't take that long.  Assuming the words are perfectly known (so
identity match only), and you have more than one of these words to do,
the fastest way would be to break the genome(s) down into words, sort
that form, and then just pull out matching word positions from the
resulting table(s). 

If on the other hand, you are allowing mismatches, well, ugh.  12 is
pretty small to start with...

> 
> I have also attempted to build/configure/run without success
> maq
> soap
> blat

Here is a tool which is very simple to get you started:

http://saf.bio.caltech.edu/pub/software/molbio/fastatuples.c

Do 
  fastatuples -h 
to see the options.

Run it like (for instance)

fastatuples -n -w 12 -d -p -s <dna.nfa
       1:       0 GAATTCTGCTCC
       1:       1 AATTCTGCTCCA
       1:       2 ATTCTGCTCCAG
       1:       3 ACTGGAGCAGAA
       1:       4 AACTGGAGCAGA
       1:       5 CAACTGGAGCAG
       1:       6 GCAACTGGAGCA
       1:       7 GCTCCAGTTGCC
       1:       8 CTCCAGTTGCCA
etc.
Where the first column is the sequence entry, the second the offset, and
the third the sequence.  (With -d it emits "degenerate sequence", that
is, it compares the forward and reverse sequences and puts out the one
with the smaller collating sequence.)  The output will be of size 31 X
length of genome.  You can make it smaller by using the -b (binary
output) option.

When that is done sort it using: the 3rd column as the primary key,
the 1st column as the secondary key, and the 2nd column as the tertiary
key.  (Trickier with binary output.)

At that point if desired you can trivially reduce it with a perl script
to emit something like:

CTCCAGTTGCCA (1:8,30,501...),(2:1000,...)
etc.
Then just look up the matching rows for each word.

Or leave it unpacked, and use a binary search in the sorted file to find
the words you want.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech



More information about the EMBOSS mailing list