[EMBOSS] Many-to-many with needle and water

Peter biopython at maubp.freeserve.co.uk
Mon Jul 6 10:58:06 UTC 2009


On Mon, Jul 6, 2009 at 11:35 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> Peter C wrote:
> > Hi Peter R. et al,
> >
> > I gather EMBOSS is looking for feedback for new applications (given
> > the recent funding from the BBSRC - congratulations again). How about
> > suggestions for extensions to existing EMBOSS applications?
> >
> > I've used bits of EMBOSS for several years now (thank you!). Something
> > I have sometimes wanted to do is a many-to-many pairwise sequence
> > alignment with the EMBOSS tools needle and water.
> >
> > Right now, needle and water take two files (here referred to as A and
> > B), file A has just one sequence, and file B can have one or more
> > sequences. I'd like to be able to supply two files both with multiple
> > entries, and have needle/water do pairwise alignments between all the
> > sequences in A against all the sequences in B. This might be useful
> > for finding reciprocal best hits in comparative genomics (as a slower
> > but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.

Right - and without some specific use cases it would be difficult to
decide whether holding one input in memory or re-reading the file many
times is best in general.

[I suppose you could do something a bit more cunning, like start by
caching the sequences as you read them, ready for re-use, but if the
number of sequences crosses a threshold, stop caching and switch
to re-reading the file for subsequent loops?]
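
As a rough illustration of that caching idea (a minimal Python sketch
only, not how EMBOSS does or would implement it): the names
read_fasta, all_pairs and max_cached are hypothetical, the FASTA
parser is deliberately simplistic, and the threshold counts records
rather than bytes, which is an assumption made for brevity.

```python
import io
from typing import Callable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (identifier, sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Very small FASTA reader: yields (identifier, sequence) pairs."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Identifier is the first word of the header line (may be empty).
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def all_pairs(
    open_a: Callable[[], TextIO],
    open_b: Callable[[], TextIO],
    max_cached: int = 1000,  # hypothetical threshold, in records
) -> Iterator[Tuple[Record, Record]]:
    """Yield every (record from A, record from B) pair.

    File A is streamed once. File B is cached during the first pass; if it
    holds more than max_cached records the cache is dropped and B is
    re-read from disk for each remaining record in A.
    """
    cache: List[Record] = []
    caching = True          # still trying to hold B in memory
    cache_complete = False  # True once all of B fitted in the cache
    with open_a() as handle_a:
        for rec_a in read_fasta(handle_a):
            if cache_complete:
                for rec_b in cache:
                    yield rec_a, rec_b
                continue
            with open_b() as handle_b:
                for rec_b in read_fasta(handle_b):
                    if caching:
                        cache.append(rec_b)
                        if len(cache) > max_cached:
                            # Too many sequences: give up on caching.
                            caching = False
                            cache.clear()
                    yield rec_a, rec_b
            if caching:
                cache_complete = True


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass e.g. lambda: open("b.fasta").
    file_a = ">a1\nAAAA\n>a2\nCCCC\n"
    file_b = ">b1\nGG\n>b2\nTT\n"
    for (id_a, _), (id_b, _) in all_pairs(
        lambda: io.StringIO(file_a), lambda: io.StringIO(file_b)
    ):
        print(id_a, id_b)
```

Each yielded pair would then be handed to whatever does the actual
alignment. Passing callables that open the files, rather than open
handles, is what lets the second input be re-read from the start.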

> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

If you see many-to-many versions of water and needle as separate
applications, then those names sound fine.

> How popular would such a program be?

I don't know - as I said, this is more of a suggestion than a request.
I don't *need* this tool, but there have been occasions in the past
where I would have tried using it if it had existed.

Perhaps others on the list can think of better uses for this idea?

> How large would the smaller input set be?

Hard to say without specific examples in mind. For some hand-waving
upper limits, for comparative genomics of bacteria using protein
sequences, you might have a few thousand in each file. If I was trying
this as part of an ad-hoc clustering algorithm (all-against-all), again
maybe a few thousand sequences. In practice, a heuristic tool like
supermatcher (or FASTA or BLAST) would probably be more sensible
for large datasets like this due to the computational time.
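
To put a very rough number on that computational time, here is a
back-of-envelope Python sketch. It relies only on the fact that
Needleman-Wunsch and Smith-Waterman fill a dynamic programming matrix
of roughly len(a) x len(b) cells per pair. The mean protein length of
300 residues is purely an assumed figure for illustration, and
alignment_cost is a hypothetical helper name.

```python
def alignment_cost(n_a, n_b, mean_len_a, mean_len_b):
    """Return (number of pairwise alignments, total DP matrix cells).

    Summed over all pairs, len(a) * len(b) equals
    (n_a * mean_len_a) * (n_b * mean_len_b), so means are sufficient.
    """
    pairs = n_a * n_b
    cells = pairs * mean_len_a * mean_len_b
    return pairs, cells


# A few thousand proteins per file, assumed mean length 300 residues.
pairs, cells = alignment_cost(3000, 3000, 300, 300)
print(f"{pairs:,} alignments, about {cells:.1e} matrix cells")
# prints: 9,000,000 alignments, about 8.1e+11 matrix cells
```

How long that takes in practice depends on how many cells per second
a given needle/water build manages on a given machine, which I have
not measured.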

I see needle and water as most useful on smaller datasets where
the runtime cost of using an exact algorithm isn't too high. Therefore
many-to-many needle/water searches may be best targeted at
smaller sequence files. Things might be different with a multicore
or GPU/OpenCL version of needle and water ;)

Anyway, unless someone else thinks a many-to-many version
of needle and water would be useful, I wouldn't expect you to
implement this. I'm just putting the idea forward for discussion.

Regards,

Peter C.


