[EMBOSS] Probabilistic versions of needle/water?

Mon Jul 6 11:56:06 UTC 2009

Hi all,

I have another suggestion for new or enhanced EMBOSS applications,
again related to the existing pairwise sequence alignment tools needle
and water.

FASTQ and similar file formats contain per-base quality scores (often
PHRED scores) representing the probability that the associated
nucleotide call is wrong. Solexa/Illumina machines also provide another
file with a more precise breakdown of the likelihood of each of the
four bases.
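
To make the quality scores concrete, here is a minimal Python sketch of
turning a FASTQ quality string into per-base error probabilities. It
assumes true PHRED scores stored with the Sanger-style ASCII offset of
33; older Solexa/Illumina files use a different offset and scale, so
the offset parameter and the function names here are illustrative only.

```python
def phred_to_error_prob(q):
    """Convert a PHRED score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fastq_quality_to_error_probs(quality_string, offset=33):
    """Decode an ASCII quality string into per-base error probabilities.

    Assumes PHRED scores with the given ASCII offset (33 for Sanger FASTQ).
    """
    return [phred_to_error_prob(ord(c) - offset) for c in quality_string]
```

For example, a PHRED score of 20 corresponds to a 1 in 100 chance that
the base call is wrong, and a score of 40 to 1 in 10,000.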

In some cases both sequences could have probability scores (e.g.
trying to align the ends of contigs to each other), but often one
sequence will be taken as fact (e.g. mapping reads onto a reference).

It is possible to take these probabilities into account when
considering the matches in needle (or water) by using a probabilistic
version of the Needleman-Wunsch sequence alignment algorithm (or a
probabilistic Smith-Waterman).
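
As a rough illustration of what I mean, here is a minimal Python sketch
of one possible probabilistic Needleman-Wunsch for the case where only
the read has qualities and the reference is taken as fact. This is not
how needle works today, and not necessarily what GNUMAP does. The
assumptions are mine: the error probability is spread evenly over the
three other bases, the match/mismatch/gap values are arbitrary, gaps
are linear rather than affine, and only the score is returned.

```python
BASES = "ACGT"


def base_distribution(base, error_prob):
    """Per-base probabilities for one read position.

    Assumption: the error probability is split evenly over the other bases.
    """
    return {b: (1.0 - error_prob) if b == base else error_prob / 3.0
            for b in BASES}


def expected_score(dist, ref_base, match=1.0, mismatch=-1.0):
    """Expected substitution score of an uncertain base against a fixed one."""
    return sum(p * (match if b == ref_base else mismatch)
               for b, p in dist.items())


def probabilistic_nw_score(read, error_probs, reference, gap=-2.0):
    """Global alignment score using expected scores instead of fixed ones."""
    dists = [base_distribution(b, e) for b, e in zip(read, error_probs)]
    m = len(reference)
    # First row: aligning an empty read prefix against the reference.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(dists) + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = max(
                prev[j - 1] + expected_score(dists[i - 1], reference[j - 1]),
                prev[j] + gap,      # gap in the reference
                curr[j - 1] + gap,  # gap in the read
            )
        prev = curr
    return prev[m]
```

With this scoring, a mismatch at a low quality position costs less than
the same mismatch at a high quality position, which is the behaviour I
would want from a probabilistic needle or water.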

As an example of this idea, did you (Peter R) see the GNUMAP
talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I am aware of people using EMBOSS tools (I assume water) to identify
(known) adaptor sequences in raw Solexa/Illumina data. I considered
doing something similar myself when trying to remove primer sequences
from 454 data. Such a pipeline using the current EMBOSS water would
match on the called nucleotides alone, treating every base call as
certain and ignoring the qualities, which isn't ideal. Upgrading to a
probabilistic version of water should be an improvement.

Peter C.