[BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences

Brad Chapman chapmanb at 50mail.com
Thu Jan 15 11:52:49 UTC 2009


Hi Animesh;

> I have been trying to write a python script to do the codon wise alignment
> of given nucleotide sequences. I have downloaded CDS sequences (by a script
> found on biopython mailing list) from genbank for a particular protein and
> now would like to check codon usage for few specific amino acid positions.

Biopython does not ship codon usage dictionaries, since the set of
organisms and the usage frequencies themselves keep changing as
additional organisms are sequenced. Your best bet is to parse the
values from the Codon Usage Database (http://www.kazusa.or.jp/codon/)
for your organism of interest. You did not mention which organism you
are interested in, so an example for E. coli is pasted below. The
values are reported as usage per 1000 codons.
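
If you only have raw codon counts rather than ready-made frequencies, the "per 1000 codons" scaling can be sketched roughly as follows. The function name per_thousand and the counts are made up for illustration and are not real E. coli data:

```python
def per_thousand(codon_counts):
    """Convert raw codon counts to usage per 1000 codons."""
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("no codons counted")
    # Scale each count so that all values together sum to 1000
    return {codon: 1000.0 * n / total for codon, n in codon_counts.items()}


# Hypothetical counts, for illustration only
example_counts = {"AAA": 30, "AAG": 10, "CTG": 60}
example_usage = per_thousand(example_counts)
```

A dictionary built this way has the same shape as the E. coli table further down, so it can be used in its place.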

Once you have defined a usage dictionary, here is some Biopython code
to create a dictionary (positional_usage) of usage at each codon
position (using Python 0-based indexing for positions):

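from Bio import SeqIO

# usage_dict maps codon -> usage per 1000 codons (e.g. the E. coli table below)
positional_usage = {}
with open("example.fasta") as handle:  # "rU" mode no longer exists in Python 3.11+
    for record in SeqIO.parse(handle, "fasta"):
        assert len(record.seq) % 3 == 0  # CDS length must be a multiple of 3
        for cindex in range(len(record.seq) // 3):
            cur_codon = str(record.seq[cindex * 3:(cindex + 1) * 3]).upper()
            usage = usage_dict[cur_codon]
            # collect one value per record so later records do not overwrite earlier ones
            positional_usage.setdefault(cindex, []).append(usage)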

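The input to this is usage_dict, a dictionary like the one below. The
example is named Escherichia_coli, so set usage_dict = Escherichia_coli
before running the loop. Hope this helps,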
Brad

Escherichia_coli = \
{'AAA': 35.601945036625438,
 'AAC': 21.202802271903,
 'AAG': 13.045009394539333,
 'AAT': 22.831396289856265,
 'ACA': 10.700618181965975,
 'ACC': 21.387130807992541,
 'ACG': 13.784236156652,
 'ACT': 11.016200111457801,
 'AGA': 4.4652452250900074,
 'AGC': 14.997074890221718,
 'AGG': 2.5626687138052029,
 'AGT': 10.73241545213447,
 'ATA': 8.2158886416564805,
 'ATC': 22.685559186075952,
 'ATG': 25.945855225833537,
 'ATT': 29.669004762179132,
 'CAA': 14.383602745467156,
 'CAC': 8.8157333849102599,
 'CAG': 28.118110840502265,
 'CAT': 12.473375763164368,
 'CCA': 8.6299703855048442,
 'CCC': 5.630985746455262,
 'CCG': 19.354496289402018,
 'CCT': 7.8991113260680947,
 'CGA': 4.0270166820911326,
 'CGC': 18.382647392898786,
 'CGG': 6.4933372765136035,
 'CGT': 18.916506823622456,
 'CTA': 4.4733738505466141,
 'CTC': 10.083559878921733,
 'CTG': 46.036709350716478,
 'CTT': 12.48556870134928,
 'GAA': 38.019254801088948,
 'GAC': 18.833307951301883,
 'GAG': 18.80390145332651,
 'GAT': 32.883397975828814,
 'GCA': 21.603495691469892,
 'GCC': 23.869708653328228,
 'GCG': 27.990682682608973,
 'GCT': 17.355093504295862,
 'GGA': 10.60618268033774,
 'GGC': 25.658245331001215,
 'GGG': 11.57779249962166,
 'GGT': 24.92882073488034,
 'GTA': 11.897916896280412,
 'GTC': 14.044830325702069,
 'GTG': 23.467102616006844,
 'GTT': 20.038018059414991,
 'TAA': 1.9881661557984951,
 'TAC': 12.005979799409431,
 'TAG': 0.28569727707782611,
 'TAT': 18.337939952887442,
 'TCA': 9.9362883118255478,
 'TCC': 9.2876718158321232,
 'TCG': 8.51664778355096,
 'TCT': 10.941368941813147,
 'TGA': 1.0356825140595336,
 'TGC': 5.9924705020549887,
 'TGG': 13.780171843923698,
 'TGT': 5.3450493921581241,
 'TTA': 14.983925643159559,
 'TTC': 15.622261818722567,
 'TTG': 12.856616545721486,
 'TTT': 22.459153059387496
}
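
To look at only a few amino acid positions, as in the original question, something along these lines might work. It is a minimal sketch in plain Python: usage_at_positions is a hypothetical helper, the toy sequences and rounded usage values are illustrative, and it assumes the sequences are already in frame and ungapped, so that the same 0-based codon index refers to the same residue in each one:

```python
def usage_at_positions(sequences, usage_dict, positions):
    """Return {position: [(codon, usage), ...]} with one entry per sequence."""
    result = {pos: [] for pos in positions}
    for seq in sequences:
        seq = seq.upper()
        for pos in positions:
            codon = seq[pos * 3:(pos + 1) * 3]
            if len(codon) < 3:
                continue  # sequence is too short to have this position
            # usage is None for codons missing from the table (e.g. containing N)
            result[pos].append((codon, usage_dict.get(codon)))
    return result


# Toy data: rounded values, for illustration only
toy_usage = {"ATG": 25.9, "AAA": 35.6, "AAG": 13.0, "CTG": 46.0, "TAA": 2.0}
toy_seqs = ["ATGAAACTGTAA", "ATGAAGCTGTAA"]
hits = usage_at_positions(toy_seqs, toy_usage, [1, 2])
```

With real data, the sequences could be taken as str(record.seq) from SeqIO.parse, and usage_dict would be the full table for your organism.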


