[EMBOSS] Transeq and very large sequences

David Mathog mathog at caltech.edu
Mon Nov 27 20:06:06 UTC 2006


On Mon, 27 Nov 2006, michael watson (IAH-C) wrote:

> Hi
>
> I want to translate very large (eukaryotic chromosomes!) DNA sequences
> in all 6 frames.  Transeq takes about a day per large chromosome,
> running on a linux machine with 3 GB of RAM.

Well, you might try my fasttrans program, though it may not do
exactly what you want.  If the input sequence is bigger than 100kb
it automatically fragments the input into 101kb chunks with a 1kb
overlap.  You could easily modify the code to make the chunk size
large enough that the whole chromosome is read as one piece.  I just
tested it on human chromosome 10, and it took 29 seconds on an
Opteron system to do all 6 frames with the command:

% time gunzip -c  Homo_sapiens.NCBI36.41.dna.chromosome.10.fa.gz | 
fasttrans 123456 > foo.out

ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c
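
To make the chunking described above concrete, here is a minimal sketch
of how overlapping chunk boundaries could be computed.  It is not taken
from fasttrans.c: the helper name chunk_bounds, its parameters, the
choice of 1kb = 1000 bases, and the rule for dropping a trailing chunk
that lies entirely inside the previous overlap are all assumptions made
for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, NOT from fasttrans.c.
 *
 * Describes chunk number `index` of a sequence of `total` bases when
 * chunk starts are `step` bases apart and each chunk carries up to
 * `overlap` extra bases shared with the chunk that follows it.
 *
 * Returns 1 and fills *start and *len if the chunk exists, else 0. */
int chunk_bounds(size_t total, size_t step, size_t overlap,
                 size_t index, size_t *start, size_t *len)
{
    size_t s, remaining, want;

    if (step == 0 || index > total / step)
        return 0;                 /* also guards index * step overflow */

    s = index * step;
    if (s >= total)
        return 0;                 /* past the end (or empty input) */

    remaining = total - s;

    /* A later chunk that fits entirely inside the previous chunk's
     * overlap would add nothing new, so it is skipped (assumption). */
    if (index > 0 && remaining <= overlap)
        return 0;

    want = step + overlap;        /* e.g. 100000 + 1000 = 101000 */
    *start = s;
    *len = remaining < want ? remaining : want;
    return 1;
}
```

With step = 100000 and overlap = 1000, a 250000 base input gives chunks
starting at 0, 100000 and 200000 with lengths 101000, 101000 and 50000.
Setting step larger than the chromosome length yields a single chunk,
which is the effect of the modification suggested above.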


As for fixing the original problem, without looking at the code
I'm going to hazard a wild guess.  The program may be allocating
smallish chunks for a buffer and then searching from the front of
the buffer for the new end each time a chunk is added.  This bug is
never obvious when only a few chunks are added, but the run time
goes up as the square of the input length when a great many chunks
must be added.  So when presented with an input 100 times bigger
than the typical test cases, the run takes about 10000 times longer,
which sounds more or less like what you're seeing.
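
The kind of bug guessed at above can be illustrated with a small
sketch.  This is not the Transeq or EMBOSS code, which I have not
looked at; the function names append_rescan and append_tracked are
made up for the example.  The first finds the end of the buffer by
scanning from the front before every append, the second remembers
where the end is.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, NOT taken from EMBOSS.
 * Appends n_pieces copies of `piece` to buf, locating the end of the
 * data by scanning from the front each time.  Returns the total number
 * of characters scanned while looking for the end.
 * buf must hold at least n_pieces * strlen(piece) + 1 bytes. */
size_t append_rescan(char *buf, const char *piece, size_t n_pieces)
{
    size_t plen = strlen(piece);
    size_t scanned = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < n_pieces; i++) {
        size_t end = strlen(buf);          /* walks everything so far */
        scanned += end;
        memcpy(buf + end, piece, plen + 1);
    }
    return scanned;
}

/* Same result, but the end offset is carried along, so nothing is
 * rescanned.  Returns the number of characters scanned, always 0. */
size_t append_tracked(char *buf, const char *piece, size_t n_pieces)
{
    size_t plen = strlen(piece);
    size_t end = 0;

    buf[0] = '\0';
    for (size_t i = 0; i < n_pieces; i++) {
        memcpy(buf + end, piece, plen + 1);
        end += plen;
    }
    return 0;
}
```

For a piece of length p added n times, the rescanning version examines
p * n * (n - 1) / 2 characters in total.  With a 4 character piece that
is 180 characters for 10 appends and 1998000 for 1000 appends, so 100
times more data costs roughly 10000 times more scanning.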

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


