[EMBOSS] needle question!

Mon Feb 12 19:13:35 UTC 2007

I have an additional question about needle: I would like to exclude
non-informative bases from the final alignment score.

For example, if the aligned sequences are:
-CATTCNNNCA-
-CATTCAAACA-

With the suggested matrix weight changes I would expect to see 100%
similarity over 10/10 bases.
However, it is more informative to me to see 100% similarity over 7/7
bases (with N no longer contributing to my alignment score).  One could
imagine an artificial inflation of the similarity score if the entire
length is used to generate it, i.e. if 100 bases were aligned to a 100 bp
sequence (containing 10 "Ns"), and 5 of the remaining bases were
informative mismatches:

Needle would currently provide:
95/100 (or simply 95% similarity)

But the answer needed would be:
85/90 (or 94.4% similarity).

Does this make sense?
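
To make the arithmetic concrete, here is a rough Python sketch of the
calculation I have in mind. It is not part of needle or EMBOSS; the
function name informative_identity is made up for illustration, and it
assumes the two aligned sequences have already been pulled out of the
needle output as equal-length strings. It also assumes that a column
with a gap ("-") counts as an informative mismatch, which may or may
not be what you want.

```python
def informative_identity(seq_a, seq_b, ignore="Nn"):
    """Count matches over aligned columns, skipping non-informative ones.

    Hypothetical helper, not an EMBOSS function. A column is skipped
    when either sequence has a character listed in `ignore`.
    Returns (matches, compared_columns).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a in ignore or b in ignore:
            continue  # N in either sequence: leave the column out entirely
        compared += 1
        # Assumption: a gap never counts as a match.
        if a != "-" and a.upper() == b.upper():
            matches += 1
    return matches, compared


# The short example above: 7 informative columns, all matching.
print(informative_identity("CATTCNNNCA", "CATTCAAACA"))

# The 100 bp example: 10 Ns and 5 informative mismatches.
query = "N" * 10 + "C" * 5 + "A" * 85
target = "A" * 100
m, n = informative_identity(query, target)
print(f"{m}/{n} = {100.0 * m / n:.1f}%")
```

The first call gives 7 matches over 7 compared columns, and the second
gives 85/90, matching the numbers above.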
Thank you in advance for any help you can offer!

Karen

On 2/8/07, Karen Hayden <kehayden at gmail.com> wrote:
> Hey Peter,
>   That was absolutely perfect.  Thank you!
>
> Best wishes,
> Karen
>
>
> On 2/8/07, pmr at ebi.ac.uk <pmr at ebi.ac.uk> wrote:
> > Dear Karen,
> >
> > >  I am currently using needle to generate an alignment between two
> > > sequences which contain non-informative bases (i.e., bases identified
> > > as low quality by their phred scores and changed to "N").
> > > Presently,  these bases are penalized as any other non-matching
> > > character.  Is there any way to change needle to "overlook" these
> > > bases when generating the best scoring alignment (or, do I need to
> > > write my own version of needle?)
> >
> > There are two matrix files for nucleotide comparisons. The default is
> > EDNAFULL which counts N as an average of all possible scores (1 match
> > against 3 possible mismatches).
> >
> > The alternative is EDNAMAT which only scores exact matches like blastn
> > (use -data EDNAMAT on the command line to see the difference).
> >
> > But you can also copy EDNAFULL to your local directory with
> >
> > embossdata EDNAFULL -fetch
> > mv EDNAFULL EDNAPHRED
> > (best to do this rename or you will accidentally be using this file by
> > default for other needle runs in the same directory)
> >
> > Edit EDNAPHRED to have the scores you want for N (perhaps +1 for a small
> > match to ACGTU, +2 for a match to a 2-base code RYSWKM, +3 for a match to
> > a 3-base code BDHV and +4 for a match to another N).
> >
> > Then run with:
> >
> > needle -data EDNAPHRED
> >
> > If enough users think this is a meaningful scoring system we could add
> > such a matrix to the distribution. Let us know if it really gives you more
> > useful scores. My natural prejudice is to trust EDNAFULL. I guess you are
> > expecting to often find the base in the other sequence is the one phred
> > started with, which will indeed bias the scoring.
> >
> > Hope this helps,
> >
> > Peter
> >
> >
> >
>
>
> --
> Karen E. Hayden
> Starving Graduate Student
> Duke University
> Durham, NC  27708
>

-- 
Karen E. Hayden
Starving Graduate Student
Duke University
Durham, NC  27708