[EMBOSS] Codon usage file improvements

Tue Apr 12 08:57:27 UTC 2005

Peter Rice wrote:
> A quick check before I make changes to the EMBOSS codon usage files.

Done.

The codon usage files are now committed to CVS (so the changes take effect 
from the next release). They have the following changes:

1. File naming is Exxxxx.cut, where xxxxx is the UniProt/SwissProt 5-letter 
name for the species. Some species in UniProt/SwissProt have more than one 
name (strains used for genome projects, for example AGRTU and AGRT5 for 
Agrobacterium tumefaciens). EMBOSS will use Eagrtu.cut for the codon usage 
table, although the table includes genes from the genome sequence.

For example:

#Species: Agrobacterium tumefaciens str. C58
#Division: gbbct
#Release: CUTG146
#CdsCount: 10705

#Coding GC 59.76%
#1st letter GC 63.11%
#2nd letter GC 44.70%
#3rd letter GC 71.47%

#Codon AA Fraction Frequency Number
GCA    A     0.132    15.154  51011
GCC    A     0.440    50.470 169886
GCG    A     0.328    37.649 126730
GCT    A     0.101    11.550  38879
TGC    C     0.783     6.486  21834
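
As an illustration of how the layout above can be read, here is a minimal 
Python sketch. It is not the EMBOSS parser: the names parse_cut and 
CodonEntry are hypothetical, and it assumes that "#Key: value" lines are 
header fields, that other "#" lines are comments, and that each data line has 
the five whitespace-separated columns shown.

```python
from dataclasses import dataclass


@dataclass
class CodonEntry:
    codon: str
    amino_acid: str
    fraction: float   # share of this codon among codons for the same amino acid
    frequency: float  # value from the "Frequency" column, kept as given
    number: int       # raw count of the codon


def parse_cut(text):
    """Parse codon usage text laid out like the excerpt above (assumed layout)."""
    header = {}
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Keep "#Key: value" lines; skip comment lines without a colon.
            key, sep, value = line[1:].partition(":")
            if sep:
                header[key.strip()] = value.strip()
            continue
        codon, aa, fraction, frequency, number = line.split()
        entries.append(
            CodonEntry(codon, aa, float(fraction), float(frequency), int(number))
        )
    return header, entries


SAMPLE = """\
#Species: Agrobacterium tumefaciens str. C58
#Division: gbbct
#Release: CUTG146
#CdsCount: 10705

#Coding GC 59.76%

#Codon AA Fraction Frequency Number
GCA    A     0.132    15.154  51011
GCC    A     0.440    50.470 169886
GCG    A     0.328    37.649 126730
GCT    A     0.101    11.550  38879
TGC    C     0.783     6.486  21834
"""

header, entries = parse_cut(SAMPLE)
print(header["Species"], len(entries))
```

Header values are kept as strings here, so a caller that needs the gene count 
would convert header["CdsCount"] with int().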

2. The old filenames will stay until release 3.0.0 for those who are used to 
them. I will add comments to their headers. They came from the CODONUSAGE and 
TRANSTERM databases, and we copied their filenames!

The attached file cut.txt lists the old file names and their species. I used 
the notes when selecting species for the new codon usage files.

3. EMBOSS will be able to read other codon usage table formats, and will 
extract the species and other information where possible.

4. Codon usage files are checked for inconsistencies - if they specify the 
number of genes, then files with too many stop codons will give a warning. 
Some formats do not include the genetic code, so for some species and formats 
the warning can be ignored. The EMBOSS and GCG formats are safe.
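
A rough sketch of what such a check could look like, in Python. The exact 
rule EMBOSS applies is not given here, so this assumes the simplest version: 
each gene ends in one stop codon, so the total stop codon count should not 
exceed the number of genes. The function name stop_codon_warning is 
hypothetical. The second call shows why a missing genetic code can produce a 
spurious warning: in the vertebrate mitochondrial code TGA encodes tryptophan 
rather than a stop.

```python
# Stop codons of the standard genetic code.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
# Stop codons of the vertebrate mitochondrial code (TGA codes for Trp there).
VERT_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def stop_codon_warning(codon_counts, cds_count, stops=STANDARD_STOPS):
    """Return a warning string if there are more stop codons than genes.

    Assumed rule (not taken from the EMBOSS source): one stop codon per gene.
    """
    total_stops = sum(codon_counts.get(codon, 0) for codon in stops)
    if total_stops > cds_count:
        return f"{total_stops} stop codons for {cds_count} genes"
    return None


# Made-up counts for a mitochondrial-style table with 10 genes.
counts = {"TAA": 6, "TAG": 3, "TGA": 100, "AGA": 1, "AGG": 0}

# Assuming the standard code, TGA is counted as a stop and a warning results.
print(stop_codon_warning(counts, 10))
# With the correct genetic code the table is consistent.
print(stop_codon_warning(counts, 10, stops=VERT_MITO_STOPS))
```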

5. Some EMBOSS programs read a codon usage file - but only use it to read a 
genetic code. These programs will instead prompt for a genetic code in the 
next release. For example, showseq and prettyseq only need a genetic code for 
translation. Backtranseq does need a codon usage table - for back translation 
it needs to know the most used codon for each amino acid.
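
To make the difference concrete, here is a small Python sketch of back 
translation using the most used codon for each amino acid. It is only an 
illustration of the idea, not the backtranseq implementation; the function 
names are hypothetical, and the counts are the five rows from the Eagrtu.cut 
excerpt in item 1.

```python
def most_used_codons(rows):
    """Map each amino acid to its most frequently used codon.

    rows: iterable of (codon, amino_acid, count) tuples.
    """
    best = {}
    for codon, aa, count in rows:
        if aa not in best or count > best[aa][1]:
            best[aa] = (codon, count)
    return {aa: codon for aa, (codon, _count) in best.items()}


def back_translate(protein, preferred):
    """Replace each residue with its preferred codon (KeyError if unknown)."""
    return "".join(preferred[aa] for aa in protein)


# Codon, amino acid, and Number columns from the excerpt above.
ROWS = [
    ("GCA", "A", 51011),
    ("GCC", "A", 169886),
    ("GCG", "A", 126730),
    ("GCT", "A", 38879),
    ("TGC", "C", 21834),
]

preferred = most_used_codons(ROWS)
print(back_translate("ACA", preferred))  # prints GCCTGCGCC
```

A program that only translates DNA to protein never needs these counts, which 
is why a genetic code alone is enough for showseq and prettyseq.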

6. A new file Cut.index (in the data/CODONS directory) will list all the codon 
usage files and their species so that a menu of installed codon usage files 
can be used by interfaces.

A copy of Cut.index is attached as Cut_index.txt.

Hope this helps

Peter

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cut.txt
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20050412/d7935cf0/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Cut_index.txt
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20050412/d7935cf0/attachment-0003.txt>