[EMBOSS] Probabilistic versions of needle/water?

Peter Rice pmr at ebi.ac.uk
Mon Jul 6 12:32:18 UTC 2009


Peter C. wrote:
> I have another suggestion for new or enhanced EMBOSS applications,
> again related to the existing pairwise sequence alignment tools needle
> and water.
> 
> The FASTQ file format (or others) contains quality scores (often PHRED
> scores) representing the probability of an error in the associated
> nucleotide. Solexa/Illumina machines also provide another file with a
> more precise breakdown of the likelihood of each of the four bases.
> 
> In some cases both sequences could have probability scores (e.g.
> trying to align the ends of contigs to each other), but often one
> sequence will be taken as fact (e.g. mapping reads onto a reference).
> 
> It is possible to take these probabilities into account when
> considering the matches in needle (or water) by using a probabilistic
> version of the Needleman-Wunsch sequence alignment algorithm (or a
> probabilistic Smith-Waterman).
> 
> As an example of this idea, did you (Peter R) see the GNUMAP
> talk/poster at ISMB 2009? See http://dna.cs.byu.edu/gnumap/

I saw the talk, and was wondering about their algorithm. They did not
have a separate treatment for gaps in the reads and the consensus, which
seemed like an obvious extension.
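
To make the idea concrete, here is a minimal sketch of a quality-aware
global alignment score. It is not GNUMAP's algorithm and not anything in
EMBOSS. It assumes one sequence (the read) has PHRED qualities and the
other (the reference) is taken as fact, converts each quality Q to an
error probability 10^(-Q/10), and uses the expected match/mismatch score
at each position. The function names, the match/mismatch values, the
even split of the error probability over the three other bases, and the
two separate linear gap penalties (gap_in_read, gap_in_ref) are all
assumptions made for illustration.

```python
def phred_to_error_prob(q):
    """Convert a PHRED quality score to an error probability."""
    return 10 ** (-q / 10)


def expected_score(read_base, q, ref_base, match=5.0, mismatch=-4.0):
    """Expected substitution score given the read base's quality.

    Assumption: if the base call is wrong, the true base is equally
    likely to be any of the other three nucleotides.
    """
    p_err = phred_to_error_prob(q)
    p_same = (1 - p_err) if read_base == ref_base else p_err / 3
    return p_same * match + (1 - p_same) * mismatch


def nw_score(read, quals, ref, gap_in_read=-10.0, gap_in_ref=-10.0):
    """Global (Needleman-Wunsch style) score with linear gap penalties.

    gap_in_read: a reference base aligned against a gap in the read.
    gap_in_ref:  a read base aligned against a gap in the reference.
    Only the score is returned; there is no traceback in this sketch.
    """
    m = len(ref)
    # Row 0: reference bases consumed with no read bases yet.
    prev = [j * gap_in_read for j in range(m + 1)]
    for i in range(1, len(read) + 1):
        # Column 0: read bases consumed with no reference bases yet.
        cur = [i * gap_in_ref] + [0.0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + expected_score(read[i - 1], quals[i - 1], ref[j - 1])
            up = prev[j] + gap_in_ref      # read base against a gap
            left = cur[j - 1] + gap_in_read  # reference base against a gap
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]
```

With this scoring a mismatch at a low-quality base costs less than one at
a high-quality base, e.g. expected_score("A", 2, "C") is greater than
expected_score("A", 40, "C").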

> I am aware of people using EMBOSS tools (I assume water) to identify
> (known) adaptor sequences in raw Solexa/Illumina data. I considered
> doing something similar myself when trying to remove primer sequences
> from 454 data. Such a pipeline using the current EMBOSS water would be
> doing this matching at a purely fixed nucleotide level (ignoring the
> qualities), which isn't ideal. Upgrading to a probabilistic version of
> water should be an improvement.

Would be interesting.
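
As a rough sketch of what a quality-aware local search for a known
adaptor might look like (Smith-Waterman style, score only): the
find_adaptor name, the scoring values, the single linear gap penalty and
the adaptor string below are all made up for illustration, and this is
not how EMBOSS water works today.

```python
def expected_score(read_base, q, other_base, match=5.0, mismatch=-4.0):
    """Expected match/mismatch score given the read base's PHRED quality.

    Assumption: a miscalled base is equally likely to be any of the
    other three nucleotides.
    """
    p_err = 10 ** (-q / 10)
    p_same = (1 - p_err) if read_base == other_base else p_err / 3
    return p_same * match + (1 - p_same) * mismatch


def find_adaptor(read, quals, adaptor, gap=-10.0):
    """Return (best local score, 1-based end position in the read).

    Scores are floored at zero as in Smith-Waterman; there is no
    traceback, so only the end of the best hit is reported.
    """
    best, best_end = 0.0, 0
    prev = [0.0] * (len(adaptor) + 1)
    for i in range(1, len(read) + 1):
        cur = [0.0] * (len(adaptor) + 1)
        for j in range(1, len(adaptor) + 1):
            s = expected_score(read[i - 1], quals[i - 1], adaptor[j - 1])
            cur[j] = max(0.0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best:
                best, best_end = cur[j], i
        prev = cur
    return best, best_end


# Hypothetical read ending in a made-up adaptor sequence.
adaptor = "TTGACCA"
read = "ACGTACGTAC" + adaptor
score, end = find_adaptor(read, [40] * len(read), adaptor)
```

Here end is the read position where the best adaptor hit finishes, so
the read could be trimmed relative to that point. Lower qualities over
the same bases give a lower score, which is the behaviour a fixed
nucleotide-level match cannot provide.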

Where can I look up adaptor calling methods?

Peter Rice


