[EMBOSS] Codon usage file improvements
Peter Rice
pmr at ebi.ac.uk
Tue Apr 12 08:57:27 UTC 2005
Peter Rice wrote:
> A quick check before I make changes to the EMBOSS codon usage files.
Done.
The codon usage files now committed to CVS (so this will happen from the next
release) have the following changes:
1. File naming is Exxxxx, where xxxxx is the UniProt/SwissProt 5-letter name
for the species. Some species in UniProt/SwissProt have more than one name
(strains used for genome projects, for example AGRTU and AGRT5 for
Agrobacterium tumefaciens). EMBOSS will use Eagrtu.cut for the codon usage
table, even though the table is built from genes in the genome sequence.
For example:
#Species: Agrobacterium tumefaciens str. C58
#Division: gbbct
#Release: CUTG146
#CdsCount: 10705
#Coding GC 59.76%
#1st letter GC 63.11%
#2nd letter GC 44.70%
#3rd letter GC 71.47%
#Codon AA Fraction Frequency Number
GCA A 0.132 15.154 51011
GCC A 0.440 50.470 169886
GCG A 0.328 37.649 126730
GCT A 0.101 11.550 38879
TGC C 0.783 6.486 21834
2. The old filenames will stay until release 3.0.0 for those who are used to
them. I will add comments to their headers. They came from the CODONUSAGE and
TRANSTERM databases, and we copied their filenames!
The attached file cut.txt lists the old file names and their species. I used
the notes when selecting species for the new codon usage files.
3. EMBOSS will be able to read other codon usage table formats, and will
extract the species and other information where possible.
4. Codon usage files are checked for inconsistencies - if they specify the
number of genes, then files with too many stop codons will give a warning.
Some formats do not include the genetic code, so for some species and formats
the warning can be ignored. The EMBOSS and GCG formats are safe.
5. Some EMBOSS programs read a codon usage file but only use it to obtain a
genetic code. These programs will instead prompt for a genetic code in the
next release. For example, showseq and prettyseq only need a genetic code for
translation. backtranseq does need a codon usage table: for back translation
it needs to know the most used codon for each amino acid.
6. A new file Cut.index (in the data/CODONS directory) will list all the codon
usage files and their species so that a menu of installed codon usage files
can be used by interfaces.
A copy of Cut.index is attached as Cut_index.txt.
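
As a rough illustration of points 1, 4 and 5, here is a minimal Python sketch
that reads the header and codon rows of the excerpt shown under point 1, picks
the most used codon for each amino acid (what back translation needs), and
flags a table with more stop codons than genes. This is not the EMBOSS code:
the function names are hypothetical, the "*" symbol for stop codons is an
assumption, and so is the threshold (stop count greater than CdsCount), since
the exact rule is not stated above.

```python
# Excerpt of the new-format table from point 1 (GC header lines omitted).
SAMPLE = """\
#Species: Agrobacterium tumefaciens str. C58
#Division: gbbct
#Release: CUTG146
#CdsCount: 10705
#Codon AA Fraction Frequency Number
GCA A 0.132 15.154 51011
GCC A 0.440 50.470 169886
GCG A 0.328 37.649 126730
GCT A 0.101 11.550 38879
TGC C 0.783 6.486 21834
"""


def parse_cut(text):
    """Return (header, rows) parsed from codon usage text.

    Header lines of the form '#Key: value' go into a dict; other '#'
    lines (such as the column titles) are skipped.
    """
    header, rows = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if sep:
                header[key.strip()] = value.strip()
            continue
        fields = line.split()
        if len(fields) != 5:
            continue  # not a codon row in the layout shown above
        codon, aa, fraction, frequency, number = fields
        rows.append({
            "codon": codon,
            "aa": aa,
            "fraction": float(fraction),
            "frequency": float(frequency),
            "number": int(number),
        })
    return header, rows


def most_used_codons(rows):
    """Map each amino acid to its codon with the highest count."""
    best = {}
    for row in rows:
        current = best.get(row["aa"])
        if current is None or row["number"] > current["number"]:
            best[row["aa"]] = row
    return {aa: row["codon"] for aa, row in best.items()}


def too_many_stops(header, rows, stop_symbol="*"):
    """Assumed check: more stop codons than genes looks inconsistent."""
    if "CdsCount" not in header:
        return False  # gene count unknown, nothing to compare against
    stops = sum(r["number"] for r in rows if r["aa"] == stop_symbol)
    return stops > int(header["CdsCount"])


header, rows = parse_cut(SAMPLE)
print(header["Species"])
print(most_used_codons(rows))
print(too_many_stops(header, rows))
```

On the excerpt, the most used codons are GCC for A and TGC for C, and no
warning is raised because the excerpt contains no stop codons.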
Hope this helps
Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cut.txt
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20050412/d7935cf0/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Cut_index.txt
URL: <http://lists.open-bio.org/pipermail/emboss/attachments/20050412/d7935cf0/attachment-0003.txt>