[Bioperl-l] PAML/Codeml parsing
Stefan Kirov
stefan.kirov at bms.com
Wed Dec 5 19:56:47 UTC 2007
Here is a patch that seems to be working and does not break the existing
tests:
--- /home/kirovs/bioperl-live/Bio/Tools/Phylo/PAML.pm 2007-12-05
10:16:53.120720000 -0500
+++ /home/kirovs/bioperl/bioperl-live/Bio/Tools/Phylo/PAML.pm
2007-12-05 14:46:31.436278000 -0500
@@ -419,7 +419,10 @@
# CODONML (in paml 3.12 February 2002) <<-- what we want to see!
my $SEQTYPES = qr( (?: (?: CODON | AA | BASE | CODON2AA ) ML ) |
YN00 )x;
+ my $line;
+ $self->{'_already_parsed_seqs'}=$self->{'_already_parsed_seqs'}?1:0;
while ($_ = $self->_readline) {
+ $line++;
if ( m/^($SEQTYPES) \s+ # seqtype: CODONML,
AAML, BASEML, CODON2AAML, YN00, etc
(?: \(in \s+ ([^\)]+?) \s* \) \s* )? # version: "paml
3.12 February 2002"; not present < 3.1 or YN00
(\S+) \s* # tree filename
@@ -436,8 +439,11 @@
} elsif (m/^Data set \d$/) {
$self->{'_summary'} = {};
$self->{'_summary'}->{'multidata'}++;
- } elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {
- my ($phylip_header) = $self->_readline;
+ }
+ elsif( m/^Before\s+deleting\s+alignment\s+gaps/ ) {#Gap
+ my ($phylip_header) = $self->_readline;
+ $self->_parse_seqs;
+ } elsif (($line>3)&&($self->{'_already_parsed_seqs'}!=1)) {#No gap
$self->_parse_seqs;
}
}
@@ -681,7 +687,6 @@
}
sub _parse_seqs {
-
# this should in fact be packed into a Bio::SimpleAlign object
instead of
# an array but we'll stay with this for now
my ($self) = @_;
What this does is trigger sequence parsing if the /Before.../ pattern is
not seen until line 4. Since phylip_header seems to be doing nothing one
could completely eliminate the first seq parse elsif (even though
counting lines is not a good thing).
Since I am not aware of all consequences of changing the sequence
parsing and I have no idea how extensive the tests are, I am not
committing anything, but feel free to use that if you wish.
Stefan
Stefan Kirov wrote:
> Jason,
> When there is a gapless alignment we have a differently formatted output
> from codeml:
> kirovs at horta:~/AESIG> head -n 10 feJRfxQl8D/mlc
>
> seed used = 492211105
> 3 141
>
> ENSRNOE00000058637 GCG AGC AAG TGT GAC AGC CAT GGC ACC CAC
> CTA GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC AGT CTG
> CAC AGT CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC ACC CTC ATA
> ENSMUSE00000366347 GCG AGC AAG TGT GAC AGC CAC GGC ACC CAC
> CTG GCA GGT GTG GTC AGC GGC CGG GAT GCT GGT GTG GCC AAG GGC ACC AGC CTG
> CAC AGC CTG CGT GTG CTC AAC TGT CAA GGG AAG GGC ACA GTC AGC GGC ACC CTC ATA
> ENSE00001279150 GCC AGC AAG TGT GAC AGT CAT GGC ACC CAC
> CTG GCA GGG GTG GTC AGC GGC CGG GAT GCC GGC GTG GCC AAG GGT GCC AGC ATG
> CGC AGC CTG CGC GTG CTC AAC TGC CAA GGG AAG GGC ACG GTT AGC GGC ACC CTC ATA
>
> And parsing this fails...
> The next one has gaps and works fine:
>
> kirovs at horta:~/AESIG> head -n 10 4z6ZX7s1B6/mlc
>
> seed used = 492252697
>
> Before deleting alignment gaps
> 2 162
>
> ENSMUSE00000460297 AAT ATC GAT ACA TTT TAC AAG GAG GCA GAA
> AAG AAG CTT ATA CAC GTG CTT GAG GGA GAC AGT CCC AAG TGG TCC ACA CCG AAC
> AAA GAC CCC ACC CGA GAG CCC CAT GCA GCC TCC ACT TGC TGT GCT TCA GAT CTC
> CTT GGT TCA GGA GGT CAG TTC CTG
> ENSE00000939192 AAT ATT GAC ATA CTT TGC AAT GAA GCA GAA
> AAC AAG CTT ATG CAT ATA CTG CAT GCA AAT GAT CCC AAG TGG TCC ACC CCA ACT
> AAA GAC TGT ACT TCA GGG CCG TAC ACT GCT CAA ATC --- --- --- --- --- ATT
> CCT GGT ACA GGA AAC AAG CTT CTG
>
> I will send both whole files as an attachment with another mail (I do
> not know if these are going to pass through).
> My guess is that the whole _parse_summary method has to be re-worked as
> there is no tag to look for before the sequences start. Ugly.
> I am not sure what else could become broken if I try to fix it, so I
> will leave it to you.
> Stefan
>
>> should be fixed.
>>
>> $ cvs log -r HEAD Bio/Tools/Phylo/PAML.pm
>> revision 1.56
>> date: 2007/11/01 14:52:56; author: jason; state: Exp; lines: +21 -14
>> Parsing PAML4 and PAML3.15 should work now. Dealing with variable
>> order for the sequences and summary results in
>> the top of the MLC files
>>
>> On Dec 4, 2007, at 11:25 AM, Stefan Kirov wrote:
>>
>>
>>> Jason Stajich wrote:
>>>
>>>> PAML4 breaks our PAML parser right now because the order of things in
>>>> the result file has changed. Now sequences precede the information
>>>> about the version or the program run. This means that $result-
>>>>
>>>>> get_seqs() fails because we don't parse the sequences.
>>>>>
>>>> We'll see what we can do, but as usual with supporting 3rd party
>>>> programs it is brittle when file formats change. Th
>>>>
>>>> -jason
>>>>
>>>> --
>>>> Jason Stajich
>>>> jason at bioperl.org
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>>>
>>> Jason,
>>> I saw a commit after this post on codeml, but not on PAML.pm- I assume
>>> this is not fixed, am I correct?
>>> Thanks!
>>> Stefan
>>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
More information about the Bioperl-l
mailing list