[EMBOSS] CODON USAGE TABLES
Guy Bottu
gbottu at ben.vub.ac.be
Wed Mar 30 17:21:18 UTC 2005
Dear Peter, dear all,
A few thoughts on the codon usage tables, now that you are working on
them.
Do you intend to drop the existing tables from the distribution in favor
of tables from CUTG? CUTG has one drawback: the entry for each
organism/organelle is built from all the genes, without taking into
account that distinct subpopulations exist. E.g. in E. coli there
are the highly expressed genes, the lowly expressed genes and the
horizontally transferred genes, which have different codon usage. I think
that the distribution contains specific files for at least some organisms
(e.g. Eeco.cut and Eeco_h.cut). The great problem with the files
from the current distribution is that it is hard to find out which file
contains what.
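To illustrate the pooling drawback, here is a minimal sketch. The two
"genes" are toy sequences invented for the example (not real E. coli
data), and the function names are hypothetical, not part of EMBOSS or
CUTG. It only shows that a table built from all genes together can hide
the preference that each subpopulation has on its own.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in one coding sequence; trailing partial codons are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def codon_fractions(genes, codons):
    """Fraction of each listed synonymous codon, summed over a set of genes."""
    total = Counter()
    for gene in genes:
        total.update(codon_counts(gene))
    n = sum(total[c] for c in codons)
    return {c: (total[c] / n if n else 0.0) for c in codons}

# Invented toy data: two leucine codons only.
high = ["CTGCTGCTGCTA"]   # stands in for "highly expressed": 3 CTG, 1 CTA
low = ["CTACTACTGCTA"]    # stands in for "lowly expressed": 1 CTG, 3 CTA
leu = ["CTG", "CTA"]

high_usage = codon_fractions(high, leu)
low_usage = codon_fractions(low, leu)
pooled = codon_fractions(high + low, leu)

print(high_usage["CTG"], low_usage["CTG"], pooled["CTG"])
```

With these made-up sequences the pooled table shows no preference
between the two codons, although each subset has a clear one.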
There is also the issue of the number of files with respect to GUIs. Some
GUIs for EMBOSS generate a selector from which the user can choose a codon
usage table. If the complete CUTG has been extracted and installed, this
no longer works well. A selector with more than 10000 entries is not
convenient and, furthermore, in a WWW interface the HTML page takes a
perceptibly long time to download.
At the BEN site I solved this in the following (not necessarily
satisfactory) way: I modified cutgextract so that it creates files with
extension .cutg rather than .cut. The wEMBOSS interface only shows the
*.cut files in the selector. If a user wants to use a CUTG file rather
than a standard distribution file under wEMBOSS, he must first copy it to
his project using embossdata (at the command line there is no problem).
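The extension trick above can be sketched as follows. This is not the
wEMBOSS code, only an assumption about how a selector could be filled.
The function name selector_entries and the file Made_up_organism.cutg
are hypothetical, and the directory is a temporary one created just for
the demonstration.

```python
import tempfile
from pathlib import Path

def selector_entries(data_dir, suffix=".cut"):
    """Return the sorted file names in data_dir whose extension is exactly suffix."""
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix == suffix
    )

# Build a throwaway directory with two .cut files and one .cutg file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Eeco.cut", "Eeco_h.cut", "Made_up_organism.cutg"):
        (Path(tmp) / name).touch()
    entries = selector_entries(tmp)

print(entries)  # ['Eeco.cut', 'Eeco_h.cut']
```

Because the comparison is on the exact extension, files ending in .cutg
stay out of the selector while remaining available on disk.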
As for formats, it would of course be nice if EMBOSS programs could read
and write codon usage tables (and other data) in any format, just as they
do for sequences. Which formats should we support besides what EMBOSS uses
now? Is there such a thing as a "native" CUTG format (with one entry per
file)? I know about GCG format (not useful for us, but other people
certainly might want it). There is Staden format. Staden format also
supports files with 2 tables (codon usage in genes + trinucleotide
frequency in noncoding DNA); what to do with these? Only read the first?
There is also the format used by CODEHOP
(http://blocks.fhcrc.org/blocks/codehop.html). Does
anyone know of other formats?
Regards,
Guy Bottu,
Belgian EMBnet Node