[Bioperl-l] PAML + Codeml problem..
Xianjun Dong
xianjun.dong at bccs.uib.no
Thu Aug 10 16:02:39 UTC 2006
Hi, Ryan
Thanks for your reply!
But here I still have two questions about the sample code:
1. the translate() function of Bio::Seq object use generic codon table,
but for Mitochondrial DNA (mtDNA), we should use different codon table.
So, if we take the human transcript ENST00000361390 as example,
>ENST00000361390 _cDNA
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA
After translating with above function, the amino acid sequence is like
this, which contain *(stop codon) within the sequence(also at the end of
the sequence). But actually, this is a mtDNA, if we use different codon
table, the * within the sequence will change to 'W'(Trp). (Because in
vertebrate mitochondria “AGA” and “AGG” are also stop codons, but not
“UGA”, which codes for tryptophan instead.)
>ENST00000361390 aa_beforefilter
IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHLL*KNFLPLTLALLI*YVSIPITISSIPPQT*
2. My second question is:
If there are * both in the middle and end of the translated sequence
(with pattern AAAAAA*AAAAAAAAAAAAAAA*AAA*), like above case, after the
two checks for stop codon, all * will be filtered out. So, when
translate back from aa_aln to dna_aln, there should be no stop
codon included. But actually, when I track the program, it display that there are still stop codon included.
Here is the DNA alignment after recalling the aa_to_dna_aln function.
How to explain this?
>ENST00000361390 aa_to_dna_aln
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT
I attached my script for two ortholog transcripts demo and the output
(including the error msg) here. Could you kindly check for me?
Thanks!
-Xianjun
/////////////////////////////////////////////////////////////////////
/////////////////////////////// output //////////////////////////////
/////////////////////////////////////////////////////////////////////
[xianjund at lauvtre kaks]$ perl calculator.pl
>ENST00000361390
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA
>ENSMUST00000082392
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG
Calculate the Ka/Ks for ENSG00000198888 : ENSMUSG00000064341 ...
>ENSMUST00000082392 aa_beforefilter
VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFIIAPTLSLTLALSL*VPLPIPHPLINLNLGILFILATSSLSVYSIL*SG*ASNSKYSLFGALRAVAQTISYEVTIAIILLSVLLINGSYSLQTLITTQEHI*LLLPA*PIAII*FISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLL*KNFLPLTLALCM*HISLPIFTAGVPPYI*
>ENSMUST00000082392 aa_afterfilter
VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFIIAPTLSLTLALSLVPLPIPHPLINLNLGILFILATSSLSVYSILSGASNSKYSLFGALRAVAQTISYEVTIAIILLSVLLINGSYSLQTLITTQEHILLLPAPIAIIFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLLKNFLPLTLALCMHISLPIFTAGVPPYI
>ENST00000361390 aa_beforefilter
IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHLL*KNFLPLTLALLI*YVSIPITISSIPPQT*
>ENST00000361390 aa_afterfilter
IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPLTLALLIYVSIPITISSIPPQT
Print out the DNA sequences translated back from aa_to_dna function:
>ENSMUST00000082392 aa_to_dna_aln
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTT
>ENST00000361390 aa_to_dna_aln
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT
-------------------- WARNING ---------------------
MSG: There was an error - see error_string for the program output
---------------------------------------------------
------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK:
Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK:
Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116
/////////////////////////////////////////////////////////////////////
/////////////////////////////// script //////////////////////////////
/////////////////////////////////////////////////////////////////////
sub kaks_calculate
{
my %seqs=@_;
#my %seqs = %$seqs_ref;
my @prots;
my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new
('quiet'=>1);
# process each sequence
for my $seqid (keys %seqs)
{
my $seq = $seqs{$seqid};
my $protein =$seq->translate();
my $pseq = $protein->seq();
print ">$seqid aa_beforefilter \n$pseq\n";
if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
warn("provided a CDS sequence with a stop codon, PAML will
choke!");
exit(0);
}
# Tcoffee can't handle '*' even if it is trailing
$pseq =~ s/\*//g;
print ">$seqid aa_afterfilter \n$pseq\n";
$protein->seq($pseq);
push @prots, $protein;
}
if( @prots < 2 ) {
warn("Need at least 2 CDS sequences to proceed");
exit(0);
}
# open(OUT, ">align_output.txt") || die("cannot open output
align_output for writing");
# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);
# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
my @each = $dna_aln->each_seq();
print "\nPrint out the DNA sequences translated back from aa_to_dna
function:\n\n";
foreach my $s ( $dna_aln->each_seq() ) {
print ">".$s->display_id." aa_to_dna_aln\n".$s->seq()."\n";
}
my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
( -params => { 'runmode' => -2,
'seqtype' => 1,
} );
# set the alignment object
$kaks_factory->alignment($dna_aln);
# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();
my @otus = $result->get_seqs();
# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
my $c= 1;
foreach my $s ( @each ) {
last if( $s->display_id eq $_->display_id );
$c++;
}
$c;
} @otus;
# print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID
CDNA_PERCENTID)),"\n";
print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID
CDNA_PERCENTID)),"\n";
for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
# print OUT join("\t", $otus[$i]->display_id,
print join("\t", $otus[$i]->display_id,
$otus[$j]->display_id,$MLmatrix->[$i]->[$j]-
>{'dN'},
$MLmatrix->[$i]->[$j]->{'dS'},
$MLmatrix->[$i]->[$j]->{'omega'},
sprintf("%.2f",$sub_aa_aln-
>percentage_identity),
sprintf("%.2f",$sub_dna_aln-
>percentage_identity),
), "\n";
}
}
}
-------------------- WARNING ---------------------
MSG: There was an error - see error_string for the program output
---------------------------------------------------
------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK:
Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK:
Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK:
Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116
----------------------------------------------------------------
On Mon, 2006-07-31 at 11:20 -0400, Ryan Golhar wrote:
> Hi Xianjun,
>
> I just did some work on this module including the example.
>
> >> it does not occur in the codon position
> >>(say, the third codon's position is not a times of 3).
> >>Why it effect the result?
>
> If I'm interpreting your question correctly, the stop codons in your
> sequence occur in-frame. This is why it is choking.
>
> >>So, when translate back from aa_aln to dna_aln, there should be no
> stop codon included. SO, why it can not pass?
>
> The Ka and Ks statistics are not calculated based on the protein
> sequence, they are calculated based on the DNA sequence. The protein
> sequence is used to provide a alignment for the codons of the DNA
> sequence. Checking the protein sequence for * is easier to identify
> in-frame stop codons than scanning the DNA sequence.
>
> The two checks for stop codons you mentioned are to check for stop
> codons within the sequence without worry for the last amino acid. The
> second part remove the * at the end of the sequence (not the middle).
>
> If you want to remove the in-frame stop codons, you can, but do so
> before translating it to protein sequences.
>
> Ryan
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun Dong
> Sent: Monday, July 31, 2006 7:56 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] PAML + Codeml problem..
>
>
> Hi,
>
> I have a problem during running the Codeml Wiki-HOWTO code:
>
> Here is the error message:
> ////////////////////////////////////////////////////////////////
> [xianjund at lauvtre kaks]$ perl paml.pl test.fa
>
> -------------------- WARNING ---------------------
> MSG: There was an error - see error_string for the program output STACK
> Bio::Tools::Run::Phylo::PAML::Codeml::run
> /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/C
> odeml.pm:581
> STACK toplevel paml.pl:61
>
> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
> MSG: Unknown format of PAML output
> STACK: Error::throw
> STACK:
> Bio::Root::Root::throw
> /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
> STACK:
> Bio::Tools::Phylo::PAML::_parse_summary
> /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
> STACK:
> Bio::Tools::Phylo::PAML::next_result
> /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
> STACK: paml.pl:62
> ----------------------------------------------------------------
> ////////////////////////////////////////////////////////////////
>
> My test sequence is:
> >ENST00000361390
> ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA
> CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC
> TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC
> ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC
> AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG
> TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC
> ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA
> CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG
> ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC
> GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA
> ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA
> CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC
> CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC
> AGCATTCCCCCTCAAACCTAA
> >ENSMUST00000082392
> GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA
> CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA
> TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT
> ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT
> AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA
> TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA
> ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC
> CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA
> ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA
> GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT
> ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA
> CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT
> CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG
> GGAGTACCACCATACATATAG
>
> Sure, I checked it. There is some stop codon in it. If I replace it with
> non-stop codon, it works.
>
> For example,
> >ENST00000361390
> ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAA
> CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC
> TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC
> ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC
> AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG
> caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC
> ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA
> CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG
> ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC
> GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA
> ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA
> CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC
> CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC
> AGCATTCCCCCTCAAACCcaa
> >ENSMUST00000082392
> GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAA
> CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA
> TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT
> ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc
> aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA
> caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca
> aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC
> CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA
> ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA
> GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT
> ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA
> CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT
> CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG
> GGAGTACCACCATACATAcaa
>
> But my question is: it does not occur in the codon position (say, the
> third codon's position is not a times of 3). Why it effect the result?
>
> And also there is code to filter out the stop codon in the sample code
> (as the following shown) ///////////////////////////////
> if( $pseq =~ /\*/ &&
> $pseq !~ /\*$/ ) {
> warn("provided a CDS sequence with a stop codon, PAML will
> choke!");
> exit(0);
> }
> # Tcoffee can't handle '*' even if it is trailing
> $pseq =~ s/\*//g;
> /////////////////////////////
>
> So, when translate back from aa_aln to dna_aln, there should be no stop
> codon included. SO, why it can not pass?
>
> Thanks for answer!
>
> P.S: attach my code here:
> /////////////////////////////////////////////////////////
> #!/usr/bin/perl -w
> use strict;
> use Bio::Tools::Run::Phylo::PAML::Codeml;
> use Bio::Tools::Run::Alignment::Clustalw;
>
> # for projecting alignments from protein to R/DNA space
> use Bio::Align::Utilities qw(aa_to_dna_aln);
> # for input of the sequence data
> use Bio::SeqIO;
> use Bio::AlignIO;
>
> my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
> my $seqdata = shift || 'test.fa';
>
> my $seqio = new Bio::SeqIO(-file => $seqdata,
> -format => 'fasta');
> my %seqs;
> my @prots;
> # process each sequence
> while ( my $seq = $seqio->next_seq ) {
> $seqs{$seq->display_id} = $seq;
> # translate them into protein
> my $protein = $seq->translate();
> my $pseq = $protein->seq();
> if( $pseq =~ /\*/ &&
> $pseq !~ /\*$/ ) {
> warn("provided a CDS sequence with a stop codon, PAML will
> choke!");
> exit(0);
> }
> # Tcoffee can't handle '*' even if it is trailing
> $pseq =~ s/\*//g;
>
> $protein->seq($pseq);
> push @prots, $protein;
> }
>
> if( @prots < 2 ) {
> warn("Need at least 2 CDS sequences to proceed");
> exit(0);
> }
>
> # open(OUT, ">align_output.txt") || die("cannot open output
> align_output for writing"); # Align the sequences with clustalw my
> $aa_aln = $aln_factory->align(\@prots); # project the protein alignment
> back to CDS coordinates my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
>
> my @each = $dna_aln->each_seq();
>
> my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
> ( -params => { 'runmode' => -2,
> 'seqtype' => 1,
> },
> -save_tempfiles => 1,
> -verbose => 1);
>
> # set the alignment object
> $kaks_factory->alignment($dna_aln);
>
> # run the KaKs analysis
> my ($rc,$parser) = $kaks_factory->run();
> my $result = $parser->next_result;
> my $MLmatrix = $result->get_MLmatrix();
>
> my @otus = $result->get_seqs();
> # this gives us a mapping from the PAML order of sequences back to # the
> input order (since names get truncated) my @pos = map {
> my $c= 1;
> foreach my $s ( @each ) {
> last if( $s->display_id eq $_->display_id );
> $c++;
> }
> $c;
> } @otus;
>
> print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID
> CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
> for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
> my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
> my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
> print join("\t", $otus[$i]->display_id,
> $otus[$j]->display_id,$MLmatrix->[$i]->[$j]-
> >{'dN'},
> $MLmatrix->[$i]->[$j]->{'dS'},
> $MLmatrix->[$i]->[$j]->{'omega'},
> sprintf("%.2f",$sub_aa_aln-
> >percentage_identity),
> sprintf("%.2f",$sub_dna_aln-
> >percentage_identity),
> ), "\n";
> }
> }
>
More information about the Bioperl-l
mailing list