[Bioperl-l] PAML/Codeml parsing

Wed Dec 5 20:01:29 UTC 2007

sounds good - can you
- make it as a bug with the patch and sample files in bugzilla
- commit changes and I'll test as well

thanks,
-j

On Dec 5, 2007, at 11:56 AM, Stefan Kirov wrote:

> Here is a patch that seems to be working and does not break the  
> existing
> tests:
>
> --- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm   2007-12-05
> 10:16:53.120720000 -0500
> +++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm
> 2007-12-05 14:46:31.436278000 -0500
> @@ -419,7 +419,10 @@
>      # CODONML (in paml 3.12 February 2002)  <<-- what we want to see!
>
>      my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
> YN00 )x;
> +    my $line;
> +    $self->{'_already_parsed_seqs'}=$self-> 
> {'_already_parsed_seqs'}?1:0;
>      while ($_ = $self->_readline) {
> +           $line++;
>         if ( m/^($SEQTYPES) \s+                      # seqtype:  
> CODONML,
> AAML, BASEML, CODON2AAML, YN00, etc
>                (?: \(in \s+ ([^\)]+?) \s* \) \s* )?  # version: "paml
> 3.12 February 2002"; not present < 3.1 or YN00
>                (\S+) \s*                             # tree filename
> @@ -436,8 +439,11 @@
>         } elsif (m/^Data set \d$/) {
>             $self->{'_summary'} = {};
>             $self->{'_summary'}->{'multidata'}++;
> -       } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
> -           my ($phylip_header) = $self->_readline;
> +       }
> +       elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
> +               my ($phylip_header) = $self->_readline;
> +               $self->_parse_seqs;
> +       } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1))  
> {#No gap
>             $self->_parse_seqs;
>         }
>      }
> @@ -681,7 +687,6 @@
>  }
>
>  sub _parse_seqs {
> -
>      # this should in fact be packed into a Bio::SimpleAlign object
> instead of
>      # an array but we'll stay with this for now
>      my ($self) = @_;
>
>
> What this does is trigger sequence parsing if the /Before.../  
> pattern is
> not seen until line 4. Since phylip_header seems to be doing  
> nothing one
> could completely eliminate the first seq parse elsif (even though
> counting lines is not a good thing).
>  Since I am not aware of all consequences of changing the sequence
> parsing and I have no idea how extensive the tests are, I am not
> committing anything, but feel free to use that if you wish.
> Stefan
>
> Stefan Kirov wrote:
>> Jason,
>> When there is a gapless alignment we have a differently formatted  
>> output
>> from codeml:
>> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>>
>> seed used = 492211105
>>       3    141
>>
>> ENSRNOE00000058637               GCG AGC AAG TGT GAC AGC CAT GGC  
>> ACC CAC
>> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC  
>> AGT CTG
>> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC  
>> ACC CTC ATA
>> ENSMUSE00000366347               GCG AGC AAG TGT GAC AGC CAC GGC  
>> ACC CAC
>> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC  
>> AGC CTG
>> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC  
>> ACC CTC ATA
>> ENSE00001279150                  GCC AGC AAG TGT GAC AGT CAT GGC  
>> ACC CAC
>> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC  
>> AGC ATG
>> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC  
>> ACC CTC ATA
>>
>> And parsing this fails...
>> The next one has gaps and works fine:
>>
>> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>>
>> seed used = 492252697
>>
>> Before deleting alignment gaps
>>       2    162
>>
>> ENSMUSE00000460297               AAT ATC GAT ACA TTT TAC AAG GAG  
>> GCA GAA
>> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA  
>> CCG AAC
>> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA  
>> GAT CTC
>> CTT GGT TCA GGA GGT CAG TTC CTG
>> ENSE00000939192                  AAT ATT GAC ATA CTT TGC AAT GAA  
>> GCA GAA
>> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC  
>> CCA ACT
>> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- ---  
>> --- ATT
>> CCT GGT ACA GGA AAC AAG CTT CTG
>>
>> I will send both whole files as an attachment with another mail (I do
>> not know if these are going to pass through).
>> My guess is that the whole _parse_summary method has to be re- 
>> worked as
>> there is no tag to look for before the sequences start. Ugly.
>> I am not sure what else could become broken if I try to fix it, so I
>> will leave it to you.
>> Stefan
>>
>>> should be fixed.
>>>
>>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>>> revision 1.56
>>> date: 2007/11/01 14:52:56;  author: jason;  state: Exp;  lines:  
>>> +21 -14
>>> Parsing PAML4 and PAML3.15 should work now.  Dealing with variable
>>> order for the sequences and summary results in
>>> the top of the MLC files
>>>
>>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>>
>>>
>>>> Jason Stajich wrote:
>>>>
>>>>> PAML4 breaks our PAML parser right now because the order of  
>>>>> things in
>>>>> the result file has changed.  Now sequences precede the  
>>>>> information
>>>>> about the version or the program run.  This means that $result-
>>>>>
>>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>>
>>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>>> programs it is brittle when file formats change.  Th
>>>>>
>>>>> -jason
>>>>>
>>>>> -- 
>>>>> Jason Stajich
>>>>> jason at bioperl.org
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>>
>>>>>
>>>> Jason,
>>>> I saw a commit after this post on codeml, but not on PAML.pm- I  
>>>> assume
>>>> this is not fixed, am I correct?
>>>> Thanks!
>>>> Stefan
>>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>