transeq/fuzzpro on large sequences

David Mathog mathog at mendel.bio.caltech.edu
Fri Mar 8 16:12:35 UTC 2002


One user here tried to do the following (switches omitted for clarity):

gunzip chr1.gz | transeq | fuzzpro

where chr1 is the Ensembl golden path for human chromosome 1.  This does
not work well because transeq tries to read the entire input sequence
before doing any translation.  Since chr1 is very large, transeq grows
to an unacceptable size before emitting anything.  In this case I worked
around it by using my own program, fasttrans,

  ftp://saf.bio.caltech.edu/pub/software/molbio/fasttrans.c

instead of transeq.  fasttrans breaks large input sequences at 1 Mb
intervals, translates each chunk in the desired frames, and then resumes
reading input.  That would be a very useful option to add to transeq,
since it would let it run in 1 or 2 Mb of memory instead of the hundreds
of Mb it now requires for large genomic sequences.
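
To make the idea concrete, here is a minimal sketch of chunked
translation.  It is written in Python for brevity and is not taken from
fasttrans.c; the names translate, translate_in_chunks and chunk_size are
made up for illustration, only the three forward frames are handled, and
carrying two bases across each chunk boundary is my own assumption about
how to keep codons from being split.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA from its first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def translate_in_chunks(lines, chunk_size=999_999, frames=(0, 1, 2)):
    """Yield (chunk_start, frame, protein) for one FASTA record.

    Only about chunk_size bases are held in memory at a time.
    chunk_size must be a multiple of 3 so each frame keeps its phase
    from one chunk to the next (hypothetical design, not fasttrans's).
    """
    if chunk_size % 3:
        raise ValueError("chunk_size must be a multiple of 3")
    buf = ""
    start = 0  # 0-based DNA position of buf[0]
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the FASTA header
        buf += line
        # Two extra bases let frames 1 and 2 finish their last codon.
        while len(buf) >= chunk_size + 2:
            for f in frames:
                yield start, f, translate(buf[f:f + chunk_size])
            buf = buf[chunk_size:]
            start += chunk_size
    # Whatever is left at end of input forms the final, shorter chunk.
    for f in frames:
        yield start, f, translate(buf[f:])
```

Concatenating the pieces for a given frame gives the same protein as
translating the whole sequence in that frame, so a downstream program
sees nothing missing at the chunk boundaries.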

That said, it might make even more sense in terms of performance, if not
in terms of clean functional separation, if fuzzpro could translate for
itself.  It could then easily emit the DNA position of the match (which,
for fragmented input, has to be calculated from the fragment number and
offset).  And since it makes no sense for fuzzpro patterns to span much
more than a couple of thousand residues from one end to the other, the
"bin size" of 1 Mb used by fasttrans could effectively be reduced to
10 kb or so, which should shrink the various buffers so much that they
would all fit in cache memory and greatly improve performance.

Also with regard to fuzzpro: the only way I know of now to show
surrounding sequence is to put X characters on both sides of the pattern
(or indicate a gap of 100 and then an X).  Either way, that makes the
regexp code perform unneeded operations, and this sort of logical hack
is too complex for the average user.  Far better if there were switches
something like this:

 fuzzpro sequence -pattern=whatever -upstream=100 -downstream=200
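
As a rough illustration of what such switches would have to compute,
here is a sketch of mapping a protein match back to DNA coordinates and
pulling out flanking sequence.  The function names dna_span and
with_flanks and their parameters are hypothetical, as are the -upstream
and -downstream switches above; none of them exist in fuzzpro.  The
sketch assumes forward-strand frames and 0-based, half-open coordinates.

```python
def dna_span(fragment_start, frame, aa_offset, aa_length):
    """Map a protein match back to DNA coordinates.

    fragment_start: DNA position where the translated fragment begins
    frame:          0, 1 or 2 (forward strand only in this sketch)
    aa_offset:      index of the first matched residue in the protein
    aa_length:      number of matched residues
    Returns (begin, end) as a 0-based, half-open DNA interval.
    """
    begin = fragment_start + frame + 3 * aa_offset
    return begin, begin + 3 * aa_length


def with_flanks(dna, begin, end, upstream=0, downstream=0):
    """Return (upstream flank, match, downstream flank) from dna.

    Flanks are clipped at the ends of the sequence rather than padded.
    """
    lo = max(0, begin - upstream)
    hi = min(len(dna), end + downstream)
    return dna[lo:begin], dna[begin:end], dna[end:hi]


# Frame 2 of this DNA reads ATG GCC ATT TAA, i.e. "MAI*".
dna = "CCATGGCCATTTAA"
begin, end = dna_span(fragment_start=0, frame=2, aa_offset=1, aa_length=2)
print(with_flanks(dna, begin, end, upstream=3, downstream=100))
# prints ('ATG', 'GCCATT', 'TAA')
```

Done this way, the flanking residues never pass through the pattern
matcher at all, which is the point of asking for separate switches.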

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech



