[Bioperl-l] PAML::Codeml outputs unstable value, why?

Tue May 29 15:02:21 UTC 2007

HI, Darren

The sequences are from Human and zebrafish. I currently use two 
sequences. I just want to see what's the substitution pattern there is. 
But your comment remind me whether I should get the other species 
involved, like mouse, chicken.

BTW, what's you mean 'per codon, not per site'? Do you mean the Ds(Ks) 
of Codeml is for per codon, and Yn00 is for per site?
I think there should be a possible/reasonable way to calculate the 
synonymous substitution, even if the divergence is big enough. If the 
Codeml is not a good solution for that case, do you have better suggestion?

Thanks

Xianjun

Darren Obbard wrote:
> Out of interest, what are the species, and how much sequence are you 
> using?
>
> - Estimating Ds when it is >>1 is very hard anyway, since the 
> substitutions are saturated. i.e. Regardless of the method, there will 
> be some level of divergence for which Ds can no longer be estimated. A 
> Ds of ~14 (for PAML I think this is per codon, not per site) sounds 
> very high to me - higher than I would want to try to estimate Ds.
>
> Dong Xianjun wrote:
>> Thanks for information, Albert.
>>
>> But still in two questions:
>> Albert Vilella wrote:
>>> codeml in PAML can give different results in cases where the 
>>> optimization reaches different local maxima depending on the 
>>> different starting points of each run (seed values). So, at least 
>>> for some methods and options, this instability is inherent to the 
>>> underlying algorithm.
>> 1. How to set the initial value in order to get a reasonable 
>> estimation? Do you have some experience for that?
>>> Even more, for some methods and options, it is even recommended in 
>>> PAML documentation to run the same data more than once, to see if 
>>> the results are the same, which would be a good indication that the 
>>> model is robust given the data.
>> 2. Is there a recommend way to test the significance if the results 
>> are different? For example, in my case, dS could range from 10.1852 
>> to 14.9961 for the four runtime. If that means the model is not 
>> robust(how to check this?), should I change to use another model?
>>
>> How could YN00 reach stable result? (Is it because YN00 does not 
>> require initial value for optimization?) Why could YN00 produce so 
>> different result from Codeml? (for YN00, dS=2.1300 with SE=1.2272; 
>> for Codeml, dS=10.1852-14.9961)
>>> Maybe PAML's author can give a more specific answer for your data at:
>>> http://www.rannala.org/gsf/viewforum.php?f=1
>>
>> Actually I already post my question in the author's forum. Let's wait 
>> and see.
>>>
>>> Cheers,
>>>
>>>     Albert.
>>>
>>> On 5/29/07, *Dong Xianjun* <Xianjun.Dong at bccs.uib.no 
>>> <mailto:Xianjun.Dong at bccs.uib.no>> wrote:
>>>
>>>     HI, dear all, //sorry for duplicated msg for /Jason/ and /Neil/
>>>
>>>     I'm bothering by two problems when I use PAML module to calculate
>>>     Ka/Ks for my sequences. Could you help me?
>>>
>>>     1.  Codeml could produce different Ka/Ks value if I run it twice.
>>>     I check it both in command line and in Perl wrapper of
>>>     Bio::Tools::Run::Phylo::PAML::Codeml;
>>>
>>>     The input sequences are:
>>>     >seq1
>>>     
>>> TCTCTCTGGCCCAAAATCCGGGTTCCATTAAAAGTTGTGAGGACTGCTGAAAACAAGTTAAGTAACCGTTTCTTCCCTTATGATGAAATCGAGACAGAAGCTGTTCTGGCCATTGATGATGATATCATTATGCTGACCTCTGACGAGCTGCAATTTGGTTATGAG 
>>>
>>>     >seq2
>>>     
>>> TCACTGTGGCCCAAAGTCGCAGTGCCTCTTAAAGTGGTCCGCACCAAAGAAAACAAGCTCAGCAATCGATTCTTTCCGTTTGATGAGATCGAGACAGAAGCTGTCCTGGCCATTGACGATGACATCATCATGTTAACCTCAGATGAGCTACAGTTTGGATATGAG 
>>>
>>>
>>>     For command-line program, I used Codeml in PAML3.14, with
>>>     specifications in codeml.ctl (runmode = -2, seqtype = 1). I tried
>>>     to run the program four times.  The output are like below (from
>>>     the output file). We could see that they are different from each
>>>     other. they should be same or slightly different. Right? But they
>>>     are NOT.  Weird!
>>>     
>>> ---------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>     t=11.5447  S=    42.4  N=   122.6  dN/dS= 0.0035  dN= 0.0522     
>>> dS=14.8339
>>>     t= 9.4132  S=    41.8  N=   123.2  dN/dS= 0.0041  dN= 0.0507     
>>> dS=12.2349
>>>     t=11.6305  S=    42.2  N=   122.8  dN/dS= 0.0034  dN= 0.0510     
>>> dS=14.9961
>>>     t= 7.7879  S=    41.4  N=   123.6  dN/dS= 0.0050  dN= 0.0505     
>>> dS=10.1852
>>>     
>>> ---------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>     I found the same problem when I use the Perl Wrapper of
>>>     Bio::Tools::Run::Phylo::PAML::Codeml; (I attached my Perl script
>>>     here, similar to the one in BioPerl HOWTO).
>>>
>>>     2. Another strange thing is, if I switch to use program YN00 in
>>>     the package of PAML, the output are stable. However, it's much
>>>     different from Codeml. (see below)
>>>     
>>> ---------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>     seq. seq.     S       N        t   kappa    omega      dN +- SE  
>>>            dS +- SE
>>>        2    1    40.4   124.6   1.7452  1.3163  0.0378 0.0804 +-
>>>     0.0265  2.1300 +- 1.2272
>>>     
>>> ---------------------------------------------------------------------------------------------------------------------------------- 
>>>
>>>     Why like this? Which one I should believe?
>>>
>>>
>>>     Is there any guy who would kindly help me to run the perl script
>>>     (twice to check whether they are different)? or help to run the
>>>     codeml in command line?
>>>     I don't know whether there is anyone noticed this before, or
>>>     because of the wrong version of PAML.
>>>
>>>     Regards,
>>>
>>>     Xianjun
>>>
>>>
>>>
>>>     Himanshu Ardawatia wrote:
>>>>     #!/usr/bin/perl
>>>>
>>>>     use strict;
>>>>     use warnings;
>>>>
>>>>
>>>>     use Bio::Tools::Run::Phylo::PAML::Codeml;
>>>>     use Bio::Tools::Run::Alignment::Clustalw;
>>>>
>>>>     # for projecting alignments from protein to R/DNA space
>>>>     use Bio::Align::Utilities qw(aa_to_dna_aln);
>>>>
>>>>     # for input of the sequence data
>>>>     use Bio::SeqIO;
>>>>     use Bio::AlignIO;
>>>>
>>>>     my $aln_factory = new Bio::Tools::Run::Alignment::Clustalw();
>>>>
>>>>     #my $seqdata = 'chuck.fa';
>>>>     my $seqdata = 'xianjun.fa ';
>>>>
>>>>     my $seqIO = new Bio::SeqIO(-file   => $seqdata,
>>>>                                -format => 'fasta');
>>>>     my %seqs;
>>>>     my @prots;
>>>>
>>>>     my $output;
>>>>     # process each sequence
>>>>     while( my $seq = $seqIO->next_seq ) {
>>>>         $seqs{$seq->display_id} = $seq;
>>>>         # translate them into protein
>>>>         my $protein = $seq->translate();
>>>>         my $pseq = $protein->seq();
>>>>         if( $pseq =~ /\*/ &&
>>>>         $pseq !~ /\*$/ ) {
>>>>         warn("provided a cDNA sequence with a stop codon, PAML will
>>>>     choke!");
>>>>         exit(0);
>>>>         }
>>>>         # Tcoffee can't handle '*' even if it is trailing
>>>>         $pseq =~ s/\*//g;
>>>>         $protein->seq($pseq);
>>>>         push @prots, $protein;
>>>>     }
>>>>
>>>>     if( @prots < 2 ) {
>>>>         warn("Need at least 2 cDNA sequences to proceed");
>>>>         exit(0);
>>>>     }
>>>>
>>>>     open(OUT, ">align_output.txt") ||
>>>>           die("cannot open output $output for writing");
>>>>     # Align the sequences with clustalw
>>>>
>>>>     my $aa_aln = $aln_factory->align(\@prots);
>>>>
>>>>     # project the protein alignment back to cDNA coordinates
>>>>     my $dna_aln = &aa_to_dna_aln($aa_aln, \%seqs);
>>>>
>>>>     my @each = $dna_aln->each_seq();
>>>>
>>>>     my $kaks_factory = new Bio::Tools::Run::Phylo::PAML::Codeml
>>>>                       ( -params => { 'runmode' => -2,
>>>>                              'seqtype' => 1,
>>>>                      'model' => 1,
>>>>                     }
>>>>                   );
>>>>
>>>>     # set the alignment object
>>>>     $kaks_factory->alignment($dna_aln);
>>>>
>>>>     # run the KaKs analysis
>>>>     my ($rc,$parser) = $kaks_factory->run();
>>>>     my $result = $parser->next_result;
>>>>     my $MLmatrix = $result->get_MLmatrix();
>>>>
>>>>     my @otus = $result->get_seqs();
>>>>     # this gives us a mapping from the PAML order of sequences back to
>>>>     # the input order (since names get truncated)
>>>>     my @pos = map {
>>>>         my $c= 1;
>>>>         foreach my $s ( @each ) {
>>>>         last if( $s->display_id eq $_->display_id );
>>>>         $c++;
>>>>         }
>>>>         $c;
>>>>     } @otus;
>>>>
>>>>     print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID
>>>>     CDNA_PERCENTID)), "\n";
>>>>     for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
>>>>         for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
>>>>         my $sub_aa_aln  = $aa_aln->select_noncont($pos[$i],$pos[$j]);
>>>>         my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
>>>>         print OUT join("\t",                    $otus[$i]->display_id,
>>>>                    
>>>> $otus[$j]->display_id,$MLmatrix->[$i]->[$j]->{'dN'},
>>>>                    $MLmatrix->[$i]->[$j]->{'dS'},
>>>>                    $MLmatrix->[$i]->[$j]->{'omega'},
>>>>                    sprintf("%.2f",$sub_aa_aln->percentage_identity),
>>>>                    sprintf("%.2f",$sub_dna_aln->percentage_identity),
>>>>                    ), "\n";
>>>>         }
>>>>     }
>>>>
>>>>
>>>>     On 5/29/07, *Himanshu Ardawatia* <himanshu.ardawatia at bccs.uib.no
>>>>     <mailto:himanshu.ardawatia at bccs.uib.no>> wrote:
>>>>
>>>>         Hi Xianjun,
>>>>
>>>>         I recognize this script. But it was a bit cumbersom to use
>>>>         this as many things are done in the script (like multiple
>>>>         alignment, aa to dna alignment and ka/ks calculation) so one
>>>>         does not have real control on these different aspect.
>>>>         I do not remeber getting different Ka/Ks in different runs
>>>>         though. But I remeber that one I ran the script with
>>>>         different versions of clustalw and it REALLY gave different
>>>>         results !! So please make sure if the clustalw versions are
>>>>         the same in all your runs. Best is to use the latest version.
>>>>
>>>>         Finally I wrote my simple script which would generate a
>>>>         codeml.ctl file for each set of sequences and run the codeml
>>>>         based on that and then more on. Disadvantage of this can be
>>>>         that some files keep getting over-written (like the one
>>>>         which have their names hard-coded in codeml program) and if
>>>>         one needs those files as well then one needs to run the
>>>>         codeml cycles for each set of sequences in different
>>>>         directories.
>>>>
>>>>         One advantage of this kind of script is that you can use
>>>>         whichever alignment program you want to use and so on....But
>>>>         then its also extra steps of yourself doing multiple
>>>>         alignment and aa to dna alignment etc....
>>>>
>>>>         Does it make sense? If you still get different outputs with
>>>>         same version of clustalw then I can sit with you and look at
>>>>         things together. Or else try the script method which I
>>>>         mentioned.
>>>>
>>>>         Cheers  and Fu
>>>>         Himanshu
>>>>         \\
>>>>
>>>>         On 5/28/07, *Dong Xianjun* < Xianjun.Dong at bccs.uib.no
>>>>         <mailto:Xianjun.Dong at bccs.uib.no>> wrote:
>>>>
>>>>             HI, Himanshu
>>>>
>>>>             I am sure you did some work in Ka/Ks calculation. Here I
>>>>             have a question
>>>>             bothering me; the output for
>>>>             Bio::Tools::Run::Phylo::PAML::Codeml is not
>>>>             stable(different for each runtime), and also different
>>>>             from the output
>>>>             with modeul of Bio::Tools::Run::Phylo::PAML::Yn00.
>>>>
>>>>             Here I attached the script. Could you help to have a
>>>>             look and try to run
>>>>             the script? How is your way to calculate the Kaks ratio?
>>>>
>>>>             Thanks
>>>>
>>>>             --
>>>>             ---------------------------
>>>>             Sterding (Xianjun) Dong
>>>>             PhD student, Boris Lenhard's group
>>>>             Bergen Center of Computational Science
>>>>             Bergen University, Norway
>>>>             Mobile: 0047-47361688
>>>>             Telephone: 0047-55276381
>>>>             Skype: xianjun.dong
>>>>
>>>>
>>>>
>>>>
>>>
>>>     --     ---------------------------
>>>     Sterding (Xianjun) Dong
>>>     PhD student, Boris Lenhard's group
>>>     Bergen Center of Computational Science
>>>     Bergen University, Norway
>>>     Mobile: 0047-47361688
>>>     Telephone: 0047-55276381
>>>
>>>     Skype: xianjun.dong
>>>        
>>>
>>>     _______________________________________________
>>>     Bioperl-l mailing list
>>>     Bioperl-l at lists.open-bio.org <mailto:Bioperl-l at lists.open-bio.org>
>>>     http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>> -- 
>> ---------------------------
>> Sterding (Xianjun) Dong
>> PhD student, Boris Lenhard's group
>> Bergen Center of Computational Science
>> Bergen University, Norway
>> Mobile: 0047-47361688
>> Telephone: 0047-55276381
>> Skype: xianjun.dong
>>   
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

-- 
---------------------------
Sterding (Xianjun) Dong
PhD student, Boris Lenhard's group
Bergen Center of Computational Science
Bergen University, Norway
Mobile: 0047-47361688
Telephone: 0047-55276381
Skype: xianjun.dong