[EMBOSS] CODON USAGE TABLES

Fri Apr 1 08:33:41 UTC 2005

Guy Bottu wrote:
> 	Dear Peter, dear all,
> 
> A few thoughts on the codon usage tables, now that you are working on 
> them.
> 
> Do you intend to drop the existing tables from the distribution in favor 
> of tables from CUTG? CUTG has one drawback: the entries for each 
> organism/organelle are built from all the genes, without taking into 
> account that distinct subpopulations exist. E.g. in E. coli there are 
> the highly expressed genes, the lowly expressed genes and the 
> horizontally transferred genes, which have different codon usage. I think 
> that the distribution has specific files for at least some organisms 
> (e.g. Eeco.cut and Eeco_h.cut). The big problem with the files in the 
> current distribution is that it is hard to find out which file contains 
> what.

The file will be annotated with the species and the source database.

The _h files will be kept (the chips program needs them, for example), but 
if we have no documentation on which genes are highly expressed, we may have 
to keep the transterm files, which are based on only a few genes.
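
To illustrate the subpopulation point, here is a minimal Python sketch (not 
EMBOSS or CUTG code) that counts codons for a pooled gene set and for a 
subset, and reports frequencies per thousand codons. The function names and 
the toy sequences are hypothetical and chosen only to show how pooling can 
mask the codon preference of a highly expressed subset.

```python
from collections import Counter


def codon_counts(cds_sequences):
    """Count codons over a list of in-frame coding sequences."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        # Step through the sequence one codon (3 bases) at a time,
        # ignoring any trailing partial codon.
        usable = len(seq) - len(seq) % 3
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    return counts


def per_thousand(counts):
    """Convert raw codon counts to frequencies per 1000 codons."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Hypothetical toy data: one "highly expressed" gene preferring CTG for
# leucine, and one other gene preferring CTA.
highly_expressed = ["ATGCTGCTGAAATAA"]
other_genes = ["ATGCTACTAAAGTAA"]

subset_table = per_thousand(codon_counts(highly_expressed))
pooled_table = per_thousand(codon_counts(highly_expressed + other_genes))
```

With this toy data the CTG frequency in the pooled table is half of what it 
is in the subset table, which is the kind of dilution a single all-genes 
entry would hide.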

> There is the issue of the number of files in the face of GUIs. Some GUIs 
> for EMBOSS generate a selector from which the user can choose a codon 
> usage table. If the complete CUTG has been extracted and installed, this 
> does not work well anymore. A selector with more than 10000 entries is not 
> convenient and furthermore, in a WWW interface the HTML page takes a 
> perceptibly long time to download.

Any cutgextract modification requests? I have added species selection.

> At the BEN site I solved this in the following (not necessarily 
> satisfactory) way: I modified cutgextract so that it creates files with 
> the extension .cutg rather than .cut. The wEMBOSS interface only shows the 
> *.cut files in the selector. If a user wants to use a CUTG file rather 
> than a standard distribution file under wEMBOSS, he must first copy it to 
> his project using embossdata (at the command line there is no problem).

I will add an option to cutgextract for the output filename extension.
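
The extension-based filtering described above can be sketched in Python as 
follows. This is only an illustration of the idea, not the actual wEMBOSS 
code; the function name selector_entries and the default extension argument 
are assumptions made for the example.

```python
from pathlib import Path


def selector_entries(data_dir, extension=".cut"):
    """Return the sorted names of files in data_dir whose suffix is
    exactly `extension`, so that .cutg files are left out of a .cut list."""
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix == extension
    )
```

Because the comparison is on the exact suffix, a file such as Eeco.cut is 
listed while a file with the .cutg extension is not, which keeps the 
selector short even when the full CUTG extraction is installed alongside 
the standard tables.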

> As for formats, it would of course be nice if EMBOSS programs could read 
> and write codon usage tables (and other data) in any format, just as they 
> do for sequences. Which formats should we support besides what EMBOSS uses 
> now? Is there such a thing as a "native" CUTG format (with one entry per 
> file)? I know about GCG format (not useful for us, but other people 
> certainly might want it). There is Staden format, which also supports 
> files with 2 tables (codon usage in genes + trinucleotide frequency in 
> noncoding DNA); what should we do with these? Read only the first? There 
> is also the format used by CODEHOP 
> (http://blocks.fhcrc.org/blocks/codehop.html). Does anyone know of other 
> formats?

CUTG has a format used on their web pages. It also has the spsum file which 
could be used.

regards,

Peter