[BioPython] How to check codon usage for specific amino acid positions in a given set of CDS sequences

Thu Jan 15 14:26:07 UTC 2009

On Thu, Jan 15, 2009 at 2:21 PM, Animesh Agrawal
<animesh.agrawal at anu.edu.au> wrote:
>
> Hi Marco,
> My apologies. Probably in my last mail I didn't make myself very clear. I
> have a protein which is about 475 amino acid long and is highly conserved
> (over 95%) among diffrent organisms. I have downloaded its CDS(coding
> sequence) .

ok!
I used to work with transcript and alternative splicing

> I would like to calculate codon use frequenecy for important amino acid
> positions as you have put it very nicely in your reply:
> "for a particular aminoacid position (e.g. the first, or the third,or the
> last) the codon usage for those aminoacids that are coded by more than a
> possible codon (e.g. Ala) the frequency with which every codon is used?"
> For example in a set of four sequenecs
>                 1       2       3
>               Ala    Gly     Ile
> Seq1 GCT GCT ATT
> Seq2 GCC GCC ATC
> Seq3 GCA GCA ATA
> Seq4 GCG GCG ATT

Let's see how can you do this with biopython (ehi, Peter, please
correct me if I say something wrong!! :)).

If your set of sequences is not too big, you can just put the
sequences in a dictionary:
sequences = {seq1 : <SeqObject>, seq2 = <SeqObject>}

The alignment file (align.txt) should look like this (or any other
format supported by AlignIO):
>seq1
aaacccaaa
>seq2
aaacccaaa
>seq3
tttcccaaa
>seq4
tttgggaaa

If you want, you can use biopython to parse the alignment file:
>>> from Bio import AlignIO
>>> alignment = AlignIO(open('align.txt', 'r'))

Then, you will have an AlignIO object called 'alignment', which
contains all the sequences in your file:
>>> print alignment
SingleLetterAlphabet() alignment with 4 rows and 9 columns
aaacccaaa Seq1
aaacccaaa seq2
tttcccaaa seq3
tttgggaaa seq4

You will be able to access all the sequences in your alignment by the
_records property of AlignIO:
>>> sequences = alignment._records
>>> print sequences
[SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq1',
name='seq1', description='seq1', dbxrefs=[]),
 SeqRecord(seq=Seq('aaacccaaa', SingleLetterAlphabet()), id='seq2',
name='seq2', description='seq2', dbxrefs=[]),
 SeqRecord(seq=Seq('tttcccaaa', SingleLetterAlphabet()), id='seq3',
name='seq3', description='seq3', dbxrefs=[]),
 SeqRecord(seq=Seq('tttgggaaa', SingleLetterAlphabet()), id='seq4',
name='seq4', description='seq4', dbxrefs=[])]

If you prefer, you are not obliged to use AlignIO and you can your own
parser for your alignment. However, if you use biopython's code, you
won't have to demonstrate that your parser doesn't contain errors
(somebody could ask you this).

The alignment object in biopython doesn't have any method to count
codon usage the way you want to do. However, you can implement it
easily in many ways, for example:
>>> codon_count_by_position = {}
>>> for codon_start in (range(0, len(sequences[0]), 3)):
        codon_count_by_position[codon_start] = {}
        for sequence in sequences:
                current_codon = sequence.seq[codon_start:codon_start+3]

                # note: why do I have to do .tostring() here, and in
the previous statement no?

codon_count_by_position[codon_start].setdefault(current_codon.tostring(),
0)

codon_count_by_position[codon_start][current_codon.tostring()] += 1. /
len(sequences)

>>> print codon_count_by_position
{0: {'aaa': 0.5, 'ttt': 0.5}, 3: {'ccc': 0.75, 'ggg': 0.25}, 6: {'aaa': 1.0}}

There are many other ways you can do this and you should be careful in
handling gaps and alternative splicing, and you should have a look at
the tools that already do codon-based alignment, but I hope this can
help you.

>
> For first amino acid position i.e. Ala (which is coded by 4 codons) each
> codon is used once in 4 sequences that gives you frequency of 0.25 for each
> codon or for third  amino acid position i.e. Ile ( which is coded by 3
> codons) the  ATT will give you frequency of 0.5 while other two will give
> you frequency of 0.25.
>
>
> Cheers,
> Animesh
>
>
> ----- Original Message -----
> From: Giovanni Marco Dall'Olio <dalloliogm at fastwebnet.it>
> Date: Thursday, January 15, 2009 10:45 pm
> Subject: Re: [BioPython] How to check codon usage for specific amino acid
> positions in a given set of CDS sequences
> To: Animesh Agrawal <animesh.agrawal at anu.edu.au>
> Cc: biopython at lists.open-bio.org
>
>> On Thu, Jan 15, 2009 at 10:21 AM, Animesh Agrawal
>> <animesh.agrawal at anu.edu.au> wrote:
>> > Hi,
>> >
>> > I have been trying to write a python script to do the codon
>> wise alignment
>> > of given nucleotide sequences.
>>
>> Note that there are many tools that already do a 'codon wise'
>> alignment, if it is what I think you mean by it.
>> I think t-coffee does this. It is always better to use a tool that
>> already exists rather than develop a new one, if you can, because
>> otherwise your results will be different to compare with other
>> experiments.
>>
>>
>> > I have downloaded CDS sequences (by a script
>> > found on biopython mailing list) from genbank for a particular
>> protein and
>> > now would like to check codon usage for few specific amino
>> acid positions.
>>
>> Can you provide a better example of what do you want to obtain?
>> Do you want to know:
>> - for a particular aminoacid position (e.g. the first, or the third,
>> or the last) the codon usage in a set of sequences?
>> - for those aminoacids that are coded by more than a possible codon
>> (e.g. Ala) the frequency with which every codon is used?
>> - the frequency at which every possible codon is used, in general.
>>
>> If I can give you an advice, I would spend some time in
>> developing a
>> test case first. For example, create a fake sequence and
>> calculate the
>> output that you expect from your experiment.
>> It is a lot easier to describe your experiment to other people
>> if you
>> can provide the test cases you are using, it will be easier to
>> understand what you want to do.
>>
>>
>> > Could you please provide me few pointers on how to do that. I
>> also want to
>> > take this opportunity to thank you guys for excellent work on
>> biopython> documentation. I am new to python, but I am able to
>> use cookbook/tutorial
>> > example for my work with relative ease.
>> >
>> > Cheers,
>> >
>> > Animesh Agrawal
>> >
>> > PhD Scholar
>> >
>> > Proteomics & Therapy Design Group
>> >
>> > Division of Molecular Biosciences
>> >
>> > The John Curtin School of Medical Research
>> >
>> > The Australian National University
>> >
>> > P.O. Box 334
>> >
>> > Canberra ACT 2601
>> >
>> > AUSTRALIA
>> >
>> > T: +61 2 6125 8303
>> >
>> >
>> >
>> > _______________________________________________
>> > BioPython mailing list  -  BioPython at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>>
>>
>> --
>>
>> My blog on bioinformatics (now in English): http://bioinfoblog.it

-- 

My blog on bioinformatics (now in English): http://bioinfoblog.it