[EMBOSS] CODON USAGE TABLES

Wed Mar 30 17:21:18 UTC 2005

	Dear Peter, dear all,

A few thoughts on the codon usage tables, now that you are working on 
them.

Do you intend to drop the existing tables from the distribution in favor 
of tables from CUTG? CUTG has one drawback: the entry for each 
organism/organelle is built from all its genes, without taking into 
account that distinct subpopulations exist. E.g. in E. coli there are 
the highly expressed genes, the lowly expressed genes and the 
horizontally transferred genes, which differ in codon usage. I think 
that the distribution contains specific files for at least some 
organisms (e.g. Eeco.cut and Eeco_h.cut). The big problem with the files 
in the current distribution is that it is hard to find out which file 
contains what.
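
To illustrate why pooling all genes can hide such subpopulations, here is 
a minimal Python sketch. The sequences are made-up toy data (not real 
E. coli genes), and the names codon_counts, pooled and fraction are 
hypothetical helpers, not part of EMBOSS or CUTG.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in one coding sequence; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def pooled(seqs):
    """Sum the codon counts over a set of coding sequences."""
    total = Counter()
    for seq in seqs:
        total += codon_counts(seq)
    return total

def fraction(counts, codon, synonyms):
    """Fraction of one codon among its synonymous codons."""
    total = sum(counts[c] for c in synonyms)
    return counts[codon] / total if total else 0.0

# The six leucine codons of the standard genetic code.
LEU = ["CTG", "CTA", "CTT", "CTC", "TTA", "TTG"]

# Toy sequences, invented for illustration only.
high = ["ATGCTGCTGCTGAAA", "ATGCTGCTGAAA"]   # stands in for "highly expressed"
low = ["ATGCTACTATTGAAA", "ATGCTACTGAAA"]    # stands in for "lowly expressed"

for label, seqs in (("high", high), ("low", low), ("pooled", high + low)):
    print(label, round(fraction(pooled(seqs), "CTG", LEU), 2))
```

In this toy data the pooled CTG fraction falls between the two subsets and 
describes neither of them, which is the drawback of a single table per 
organism.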

There is also the issue of the number of files when GUIs are involved. 
Some GUIs for EMBOSS generate a selector from which the user can choose 
a codon usage table. If the complete CUTG has been extracted and 
installed, this no longer works well. A selector with more than 10000 
entries is not convenient and, furthermore, in a WWW interface the HTML 
page takes a perceptibly long time to download.
At the BEN site I solved this in the following (not necessarily 
satisfactory) way: I modified cutgextract so that it creates files with 
the extension .cutg rather than .cut. The wEMBOSS interface shows only 
the *.cut files in the selector. A user who wants to use a CUTG file 
rather than a standard distribution file under wEMBOSS must first copy 
it to their project using embossdata (at the command line there is no 
problem).
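
The effect of the extension trick can be sketched in a few lines of 
Python. This is only an illustration of the filtering idea, not the 
actual wEMBOSS code; selector_entries and the two .cutg file names are 
hypothetical.

```python
from fnmatch import fnmatchcase

def selector_entries(filenames, pattern="*.cut"):
    """Return the sorted file names a selector limited to `pattern` would list."""
    # "*.cut" must match the whole name, so "something.cutg" is excluded.
    return sorted(name for name in filenames if fnmatchcase(name, pattern))

# Two file names from the standard distribution plus two hypothetical
# CUTG-derived files written with the .cutg extension.
installed = ["Eeco.cut", "Eeco_h.cut", "Ehum.cutg", "Emus.cutg"]

print(selector_entries(installed))
```

Only the .cut files reach the selector, while the .cutg files stay on 
disk for anyone who asks for them by name.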

As for formats, it would of course be nice if EMBOSS programs could read 
and write codon usage tables (and other data) in any format, just as 
they do for sequences. Which formats should we support besides the one 
EMBOSS uses now? Is there such a thing as a "native" CUTG format (with 
one entry per file)? I know about GCG format (not useful for us, but 
other people certainly might want it). There is Staden format, which 
also supports files with 2 tables (codon usage in genes + trinucleotide 
frequency in noncoding DNA); what should we do with these? Read only the 
first? There is also the format used by CODEHOP 
(http://blocks.fhcrc.org/blocks/codehop.html). Does anyone know of other 
formats?

	Regards,
	Guy Bottu,
	Belgian EMBnet Node