[Bioperl-l] PAML/Codeml parsing
Jason Stajich
jason at bioperl.org
Wed Dec 5 20:01:29 UTC 2007
sounds good - can you
- make it as a bug with the patch and sample files in bugzilla
- commit changes and I'll test as well
thanks,
-j
On Dec 5, 2007, at 11:56 AM, Stefan Kirov wrote:
> Here is a patch that seems to be working and does not break the
> existing
> tests:
>
> --- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm 2007-12-05
> 10:16:53.120720000 -0500
> +++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm
> 2007-12-05 14:46:31.436278000 -0500
> @@ -419,7 +419,10 @@
> # CODONML (in paml 3.12 February 2002) <<-- what we want to see!
>
> my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
> YN00 )x;
> + my $line;
> + $self->{'_already_parsed_seqs'}=$self->
> {'_already_parsed_seqs'}?1:0;
> while ($_ = $self->_readline) {
> + $line++;
> if ( m/^($SEQTYPES) \s+ # seqtype:
> CODONML,
> AAML, BASEML, CODON2AAML, YN00, etc
> (?: \(in \s+ ([^\)]+?) \s* \) \s* )? # version: "paml
> 3.12 February 2002"; not present < 3.1 or YN00
> (\S+) \s* # tree filename
> @@ -436,8 +439,11 @@
> } elsif (m/^Data set \d$/) {
> $self->{'_summary'} = {};
> $self->{'_summary'}->{'multidata'}++;
> - } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
> - my ($phylip_header) = $self->_readline;
> + }
> + elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
> + my ($phylip_header) = $self->_readline;
> + $self->_parse_seqs;
> + } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1))
> {#No gap
> $self->_parse_seqs;
> }
> }
> @@ -681,7 +687,6 @@
> }
>
> sub _parse_seqs {
> -
> # this should in fact be packed into a Bio::SimpleAlign object
> instead of
> # an array but we'll stay with this for now
> my ($self) = @_;
>
>
> What this does is trigger sequence parsing if the /Before.../
> pattern is
> not seen until line 4. Since phylip_header seems to be doing
> nothing one
> could completely eliminate the first seq parse elsif (even though
> counting lines is not a good thing).
> Since I am not aware of all consequences of changing the sequence
> parsing and I have no idea how extensive the tests are, I am not
> committing anything, but feel free to use that if you wish.
> Stefan
>
> Stefan Kirov wrote:
>> Jason,
>> When there is a gapless alignment we have a differently formatted
>> output
>> from codeml:
>> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>>
>> seed used = 492211105
>> 3 141
>>
>> ENSRNOE00000058637 GCG AGC AAG TGT GAC AGC CAT GGC
>> ACC CAC
>> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC
>> AGT CTG
>> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC
>> ACC CTC ATA
>> ENSMUSE00000366347 GCG AGC AAG TGT GAC AGC CAC GGC
>> ACC CAC
>> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC
>> AGC CTG
>> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC
>> ACC CTC ATA
>> ENSE00001279150 GCC AGC AAG TGT GAC AGT CAT GGC
>> ACC CAC
>> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC
>> AGC ATG
>> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC
>> ACC CTC ATA
>>
>> And parsing this fails...
>> The next one has gaps and works fine:
>>
>> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>>
>> seed used = 492252697
>>
>> Before deleting alignment gaps
>> 2 162
>>
>> ENSMUSE00000460297 AAT ATC GAT ACA TTT TAC AAG GAG
>> GCA GAA
>> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA
>> CCG AAC
>> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA
>> GAT CTC
>> CTT GGT TCA GGA GGT CAG TTC CTG
>> ENSE00000939192 AAT ATT GAC ATA CTT TGC AAT GAA
>> GCA GAA
>> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC
>> CCA ACT
>> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- ---
>> --- ATT
>> CCT GGT ACA GGA AAC AAG CTT CTG
>>
>> I will send both whole files as an attachment with another mail (I do
>> not know if these are going to pass through).
>> My guess is that the whole _parse_summary method has to be re-
>> worked as
>> there is no tag to look for before the sequences start. Ugly.
>> I am not sure what else could become broken if I try to fix it, so I
>> will leave it to you.
>> Stefan
>>
>>> should be fixed.
>>>
>>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>>> revision 1.56
>>> date: 2007/11/01 14:52:56; author: jason; state: Exp; lines:
>>> +21 -14
>>> Parsing PAML4 and PAML3.15 should work now. Dealing with variable
>>> order for the sequences and summary results in
>>> the top of the MLC files
>>>
>>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>>
>>>
>>>> Jason Stajich wrote:
>>>>
>>>>> PAML4 breaks our PAML parser right now because the order of
>>>>> things in
>>>>> the result file has changed. Now sequences precede the
>>>>> information
>>>>> about the version or the program run. This means that $result-
>>>>>
>>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>>
>>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>>> programs it is brittle when file formats change. Th
>>>>>
>>>>> -jason
>>>>>
>>>>> --
>>>>> Jason Stajich
>>>>> jason at bioperl.org
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>>
>>>>>
>>>> Jason,
>>>> I saw a commit after this post on codeml, but not on PAML.pm- I
>>>> assume
>>>> this is not fixed, am I correct?
>>>> Thanks!
>>>> Stefan
>>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>
More information about the Bioperl-l
mailing list