[Bioperl-l] PAML/Codeml parsing
Stefan Kirov
stefan.kirov at bms.com
Wed Dec 5 20:33:47 UTC 2007
Done.
Jason Stajich wrote:
> sounds good - can you
> - make it as a bug with the patch and sample files in bugzilla
> - commit changes and I'll test as well
>
> thanks,
> -j
>
> On Dec 5, 2007, at 11:56 AM, Stefan Kirov wrote:
>
>
>> Here is a patch that seems to be working and does not break the
>> existing
>> tests:
>>
>> --- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm 2007-12-05
>> 10:16:53.120720000 -0500
>> +++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm
>> 2007-12-05 14:46:31.436278000 -0500
>> @@ -419,7 +419,10 @@
>> # CODONML (in paml 3.12 February 2002) <<-- what we want to see!
>>
>> my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
>> YN00 )x;
>> + my $line;
>> + $self->{'_already_parsed_seqs'}=$self->
>> {'_already_parsed_seqs'}?1:0;
>> while ($_ = $self->_readline) {
>> + $line++;
>> if ( m/^($SEQTYPES) \s+ # seqtype:
>> CODONML,
>> AAML, BASEML, CODON2AAML, YN00, etc
>> (?: \(in \s+ ([^\)]+?) \s* \) \s* )? # version: "paml
>> 3.12 February 2002"; not present < 3.1 or YN00
>> (\S+) \s* # tree filename
>> @@ -436,8 +439,11 @@
>> } elsif (m/^Data set \d$/) {
>> $self->{'_summary'} = {};
>> $self->{'_summary'}->{'multidata'}++;
>> - } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
>> - my ($phylip_header) = $self->_readline;
>> + }
>> + elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
>> + my ($phylip_header) = $self->_readline;
>> + $self->_parse_seqs;
>> + } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1))
>> {#No gap
>> $self->_parse_seqs;
>> }
>> }
>> @@ -681,7 +687,6 @@
>> }
>>
>> sub _parse_seqs {
>> -
>> # this should in fact be packed into a Bio::SimpleAlign object
>> instead of
>> # an array but we'll stay with this for now
>> my ($self) = @_;
>>
>>
>> What this does is trigger sequence parsing if the /Before.../
>> pattern is
>> not seen until line 4. Since phylip_header seems to be doing
>> nothing one
>> could completely eliminate the first seq parse elsif (even though
>> counting lines is not a good thing).
>> Since I am not aware of all consequences of changing the sequence
>> parsing and I have no idea how extensive the tests are, I am not
>> committing anything, but feel free to use that if you wish.
>> Stefan
>>
>> Stefan Kirov wrote:
>>
>>> Jason,
>>> When there is a gapless alignment we have a differently formatted
>>> output
>>> from codeml:
>>> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>>>
>>> seed used = 492211105
>>> 3 141
>>>
>>> ENSRNOE00000058637 GCG AGC AAG TGT GAC AGC CAT GGC
>>> ACC CAC
>>> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC
>>> AGT CTG
>>> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC
>>> ACC CTC ATA
>>> ENSMUSE00000366347 GCG AGC AAG TGT GAC AGC CAC GGC
>>> ACC CAC
>>> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC
>>> AGC CTG
>>> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC
>>> ACC CTC ATA
>>> ENSE00001279150 GCC AGC AAG TGT GAC AGT CAT GGC
>>> ACC CAC
>>> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC
>>> AGC ATG
>>> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC
>>> ACC CTC ATA
>>>
>>> And parsing this fails...
>>> The next one has gaps and works fine:
>>>
>>> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>>>
>>> seed used = 492252697
>>>
>>> Before deleting alignment gaps
>>> 2 162
>>>
>>> ENSMUSE00000460297 AAT ATC GAT ACA TTT TAC AAG GAG
>>> GCA GAA
>>> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA
>>> CCG AAC
>>> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA
>>> GAT CTC
>>> CTT GGT TCA GGA GGT CAG TTC CTG
>>> ENSE00000939192 AAT ATT GAC ATA CTT TGC AAT GAA
>>> GCA GAA
>>> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC
>>> CCA ACT
>>> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- ---
>>> --- ATT
>>> CCT GGT ACA GGA AAC AAG CTT CTG
>>>
>>> I will send both whole files as an attachment with another mail (I do
>>> not know if these are going to pass through).
>>> My guess is that the whole _parse_summary method has to be re-
>>> worked as
>>> there is no tag to look for before the sequences start. Ugly.
>>> I am not sure what else could become broken if I try to fix it, so I
>>> will leave it to you.
>>> Stefan
>>>
>>>
>>>> should be fixed.
>>>>
>>>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>>>> revision 1.56
>>>> date: 2007/11/01 14:52:56; author: jason; state: Exp; lines:
>>>> +21 -14
>>>> Parsing PAML4 and PAML3.15 should work now. Dealing with variable
>>>> order for the sequences and summary results in
>>>> the top of the MLC files
>>>>
>>>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>>>
>>>>
>>>>
>>>>> Jason Stajich wrote:
>>>>>
>>>>>
>>>>>> PAML4 breaks our PAML parser right now because the order of
>>>>>> things in
>>>>>> the result file has changed. Now sequences precede the
>>>>>> information
>>>>>> about the version or the program run. This means that $result-
>>>>>>
>>>>>>
>>>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>>>
>>>>>>>
>>>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>>>> programs it is brittle when file formats change. Th
>>>>>>
>>>>>> -jason
>>>>>>
>>>>>> --
>>>>>> Jason Stajich
>>>>>> jason at bioperl.org
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Jason,
>>>>> I saw a commit after this post on codeml, but not on PAML.pm- I
>>>>> assume
>>>>> this is not fixed, am I correct?
>>>>> Thanks!
>>>>> Stefan
>>>>>
>>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
More information about the Bioperl-l
mailing list