[Bioperl-l] PAML/Codeml parsing

Wed Dec 5 19:56:47 UTC 2007

Here is a patch that seems to be working and does not break the existing
tests:

--- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm   2007-12-05
10:16:53.120720000 -0500
+++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm  
2007-12-05 14:46:31.436278000 -0500
@@ -419,7 +419,10 @@
     # CODONML (in paml 3.12 February 2002)  <<-- what we want to see!
 
     my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
YN00 )x;
+    my $line;
+    $self->{'_already_parsed_seqs'}=$self->{'_already_parsed_seqs'}?1:0;
     while ($_ = $self->_readline) {
+           $line++;
        if ( m/^($SEQTYPES) \s+                      # seqtype: CODONML,
AAML, BASEML, CODON2AAML, YN00, etc
               (?: \(in \s+ ([^\)]+?) \s* \) \s* )?  # version: "paml
3.12 February 2002"; not present < 3.1 or YN00
               (\S+) \s*                             # tree filename
@@ -436,8 +439,11 @@
        } elsif (m/^Data set \d$/) {
            $self->{'_summary'} = {};
            $self->{'_summary'}->{'multidata'}++;
-       } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
-           my ($phylip_header) = $self->_readline;
+       }
+       elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
+               my ($phylip_header) = $self->_readline;
+               $self->_parse_seqs;
+       } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1)) {#No gap
            $self->_parse_seqs;
        }
     }
@@ -681,7 +687,6 @@
 }
 
 sub _parse_seqs {
-
     # this should in fact be packed into a Bio::SimpleAlign object
instead of
     # an array but we'll stay with this for now
     my ($self) = @_;


What this does is trigger sequence parsing if the /Before.../ pattern is
not seen until line 4. Since phylip_header seems to be doing nothing one
could completely eliminate the first seq parse elsif (even though
counting lines is not a good thing).
 Since I am not aware of all consequences of changing the sequence
parsing and I have no idea how extensive the tests are, I am not
committing anything, but feel free to use that if you wish.
Stefan

Stefan Kirov wrote:
> Jason,
> When there is a gapless alignment we have a differently formatted output
> from codeml:
> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>
> seed used = 492211105
>       3    141
>
> ENSRNOE00000058637               GCG AGC AAG TGT GAC AGC CAT GGC ACC CAC
> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC AGT CTG
> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC ACC CTC ATA
> ENSMUSE00000366347               GCG AGC AAG TGT GAC AGC CAC GGC ACC CAC
> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC AGC CTG
> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC ACC CTC ATA
> ENSE00001279150                  GCC AGC AAG TGT GAC AGT CAT GGC ACC CAC
> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC AGC ATG
> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC ACC CTC ATA
>
> And parsing this fails...
> The next one has gaps and works fine:
>
> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>
> seed used = 492252697
>
> Before deleting alignment gaps
>       2    162
>
> ENSMUSE00000460297               AAT ATC GAT ACA TTT TAC AAG GAG GCA GAA
> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA CCG AAC
> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA GAT CTC
> CTT GGT TCA GGA GGT CAG TTC CTG
> ENSE00000939192                  AAT ATT GAC ATA CTT TGC AAT GAA GCA GAA
> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC CCA ACT
> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- --- --- ATT
> CCT GGT ACA GGA AAC AAG CTT CTG
>
> I will send both whole files as an attachment with another mail (I do
> not know if these are going to pass through).
> My guess is that the whole _parse_summary method has to be re-worked as
> there is no tag to look for before the sequences start. Ugly.
> I am not sure what else could become broken if I try to fix it, so I
> will leave it to you.
> Stefan
>   
>> should be fixed.
>>
>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>> revision 1.56
>> date: 2007/11/01 14:52:56;  author: jason;  state: Exp;  lines: +21 -14
>> Parsing PAML4 and PAML3.15 should work now.  Dealing with variable
>> order for the sequences and summary results in
>> the top of the MLC files
>>
>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>
>>     
>>> Jason Stajich wrote:
>>>       
>>>> PAML4 breaks our PAML parser right now because the order of things in
>>>> the result file has changed.  Now sequences precede the information
>>>> about the version or the program run.  This means that $result-
>>>>         
>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>           
>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>> programs it is brittle when file formats change.  Th
>>>>
>>>> -jason
>>>>
>>>> -- 
>>>> Jason Stajich
>>>> jason at bioperl.org
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>>>         
>>> Jason,
>>> I saw a commit after this post on codeml, but not on PAML.pm- I assume
>>> this is not fixed, am I correct?
>>> Thanks!
>>> Stefan
>>>       
>>     
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>