[EMBOSS] diffseq memory problem?

Tue Feb 8 10:46:32 UTC 2011

Dear Caroline,

On 08/02/2011 10:01, Barretto, Caroline, LAUSANNE, BioInformatics wrote:
> Dear EMBOSS developers,
>
> I have been using diffseq to compare two strains of the same bacterial
> species using "10" as wordsize without any problem.
>
> However, when I try to reduce this number to "4", after several hours of
> calculation the server collapses with all RAM and swap used.
>
> Is there any option to avoid that, or do you know if someone is working
> on that problem?

Depending on the input size, and the number of simple repeats, a low 
word size could easily generate too many matches for large sequence lengths.
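
As a rough illustration of how quickly the match count can grow, here is a
minimal Python sketch. It assumes uniformly random DNA (4 letters, no
repeats), so it likely underestimates what real genomes with simple repeats
would produce. The function name and the 5 Mb sequence length are
hypothetical, and this is not a description of how diffseq itself counts
matches.

```python
def expected_random_matches(len_a, len_b, word_size, alphabet_size=4):
    """Expected number of chance word matches between two random sequences.

    Assumption: every letter is independent and equally likely, so any
    given pair of positions matches with probability 1 / alphabet_size**word_size.
    """
    positions_a = max(len_a - word_size + 1, 0)  # word start positions in A
    positions_b = max(len_b - word_size + 1, 0)  # word start positions in B
    return positions_a * positions_b / alphabet_size ** word_size


if __name__ == "__main__":
    genome_length = 5_000_000  # hypothetical bacterial genome size
    for word_size in (10, 8, 6, 4):
        estimate = expected_random_matches(genome_length, genome_length, word_size)
        print(f"word size {word_size:2d}: about {estimate:.3g} chance matches")
```

Under these assumptions, each step down of one in word size multiplies the
expected number of chance matches by about four.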

We would recommend reducing the word size more slowly (maybe 10, 8, 6).

As a guideline, finding more matches than there are non-overlapping 
words in the sequence is unlikely to be useful and is a reasonable point 
to stop reducing the word size.
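
One way to apply this guideline before a long run is sketched below in
Python. It is only an illustration under stated assumptions: the function
names are hypothetical, "the sequence" is taken to mean the shorter of the
two inputs, only the forward strand is considered, and matches are counted
by exact word lookup rather than by whatever diffseq does internally.

```python
from collections import Counter


def count_word_matches(seq_a, seq_b, word_size):
    """Count pairs of positions (i, j) where the words in A and B are identical."""
    words_a = Counter(
        seq_a[i:i + word_size] for i in range(len(seq_a) - word_size + 1)
    )
    # A Counter returns 0 for words that never occur in A.
    return sum(
        words_a[seq_b[j:j + word_size]]
        for j in range(len(seq_b) - word_size + 1)
    )


def non_overlapping_words(seq_len, word_size):
    """Number of non-overlapping words that fit in a sequence of this length."""
    return seq_len // word_size


def word_size_too_small(seq_a, seq_b, word_size):
    """True if matches exceed the non-overlapping word count of the shorter input."""
    limit = non_overlapping_words(min(len(seq_a), len(seq_b)), word_size)
    return count_word_matches(seq_a, seq_b, word_size) > limit
```

For whole genomes it may be cheaper to run such a check on a sample region
first, since the word table itself grows with the sequence length.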

Meanwhile, we will take a look at diffseq in case there is some way to 
improve its performance or to warn at an early stage if the word size 
appears small for the input sequence lengths and may generate too many 
matches.

Hope this helps

Peter Rice
EMBOSS Team