[EMBOSS] Many-to-many with needle and water

Mon Jul 6 10:12:23 UTC 2009

Hi Peter R. et al,

I gather EMBOSS is looking for feedback for new applications (given
the recent funding from the BBSRC - congratulations again). How about
suggestions for extensions to existing EMBOSS applications?

I've used bits of EMBOSS for several years now (thank you!). Something
I have sometimes wanted to do is a many-to-many pairwise sequence
alignment with the EMBOSS tools needle and water.

Right now, needle and water take two files (here referred to as A and
B): file A has just one sequence, and file B can have one or more
sequences. I'd like to be able to supply two files both with multiple
entries, and have needle/water do pairwise alignments between all the
sequences in A against all the sequences in B. This might be useful
for finding reciprocal best hits in comparative genomics (as a slower
but exact alternative to FASTA or BLAST).

From an implementation point of view, I might imagine doing sequence
A1 against all of B, then sequence A2 against all of B, etc. This
would require looping over file B many times (easy if on disk). This
would also work if the A input was stdin, but having the B input on
stdin would require caching the data if A has more than one sequence
:(
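
To make the looping order concrete, here is a rough sketch in Python
(not EMBOSS's own C code). The function name many_to_many and the
align callback are placeholders I made up for illustration; align
stands in for whatever needle or water would compute for one pair.
The B records are copied into a list only because a stream such as
stdin cannot be rewound, which is the caching cost mentioned above.

```python
def many_to_many(a_records, b_records, align):
    """Yield (a, b, result) for every A record against every B record.

    a_records and b_records may be any iterables, including one-shot
    streams. align is a hypothetical callback standing in for a single
    needle/water alignment.
    """
    # B is traversed once per A record, so it has to be held in memory
    # (or re-read from disk) when it arrives as a one-shot stream.
    b_cache = list(b_records)
    # A is traversed only once, so it can stay a stream.
    for a in a_records:
        for b in b_cache:
            yield a, b, align(a, b)
```

With B on disk the cache could instead be replaced by reopening the
file for each A record, trading memory for repeated reads.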

It may sometimes also be useful to have an all-against-all pairwise
comparison for a single set of sequences. The above suggested
enhancement would let you do this by comparing file A to file A.
However, here you only really need to do half the possible
combinations (as aligning sequence A1 to sequence A2 should be the
same as A2 to A1). This could be useful for implementing a basic
clustering algorithm, or maybe as part of a worked example in building
a simple NJ tree?
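
As a small illustration of the "half the combinations" point, the
sketch below (Python again, with a made-up all_against_all function
and a placeholder align callback) visits each unordered pair once.
It assumes the alignment result really is symmetric, and it skips
self-comparisons, so n sequences give n(n-1)/2 alignments rather
than n*n.

```python
from itertools import combinations


def all_against_all(records, align):
    """Return {(i, j): result} for every unordered pair with i < j.

    align is a hypothetical callback standing in for one needle/water
    run, assumed here to give the same result for (a, b) and (b, a).
    """
    scores = {}
    # combinations() yields each unordered pair exactly once.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        scores[i, j] = align(a, b)
    return scores
```

A clustering or NJ step could then look up the score for any pair as
scores[min(i, j), max(i, j)].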

So, does supporting many-to-many comparisons sound like a useful
enhancement to needle and water?

I should stress this isn't something I need right now. Also, it can be
worked around with a wrapper script to call needle/water once for each
sequence in file A (against all the sequences in file B), with the
added bonus that these one-to-many comparisons can then be
shared across multiple CPU cores.
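
For what it's worth, such a wrapper might look roughly like the
Python sketch below. The helper names (split_fasta, needle_command,
run_many_to_many), the temporary file naming, and the gap penalties
of 10 and 0.5 are my own choices for illustration, and it assumes
FASTA input and that needle (or water) is on the PATH. Threads are
enough here because each one just waits on an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(text):
    """Split FASTA text into a list of single-record strings."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])
        elif records and line.strip():
            records[-1].append(line)
    return ["\n".join(r) + "\n" for r in records]


def needle_command(a_file, b_file, out_file, tool="needle"):
    """Build one one-to-many command line (tool may also be 'water')."""
    return [tool, "-asequence", str(a_file), "-bsequence", str(b_file),
            "-gapopen", "10", "-gapextend", "0.5",
            "-outfile", str(out_file)]


def run_many_to_many(a_path, b_path, work_dir, tool="needle", workers=4):
    """Run one job per sequence in file A, several at a time."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    commands = []
    for i, record in enumerate(split_fasta(Path(a_path).read_text())):
        a_file = work / f"a_{i}.fasta"  # one A sequence per file
        a_file.write_text(record)
        out_file = work / f"a_{i}.{tool}"
        commands.append(needle_command(a_file, b_path, out_file, tool))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # check=True raises if any needle/water job fails.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [Path(cmd[-1]) for cmd in commands]
```

Calling run_many_to_many("A.fasta", "B.fasta", "work") would then
leave one alignment file per A sequence in the work directory.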

Regards,

Peter C.