[Bioperl-l] Fwd: BLAST parsing broken

Chris Fields cjfields at illinois.edu
Sun May 9 20:43:29 UTC 2010


If the patch is against the main trunk, it isn't a problem; otherwise, the diff should be made against that code.

chris

On May 9, 2010, at 2:23 PM, Razi Khaja wrote:

> Attached (blast.pm.diff) is a patch that fixes Heikki's problem.
> Can someone advise an appropriate way to have this patch applied, given that
> it is an amendment to a previous patch?
> Thanks
> Razi
> 
> 
> ---------- Forwarded message ----------
> From: Heikki Lehvaslaiho <heikki.lehvaslaiho at gmail.com>
> Date: Wed, May 5, 2010 at 2:11 AM
> Subject: Re: [Bioperl-l] BLAST parsing broken
> To: Razi Khaja <razi.khaja at gmail.com>
> 
> 
> Hi Razi,
> 
> Thanks for trying to fix this.
> 
> I am attaching an example output file to this message. I just tested again:
> master from the git repository fails to get the query ID, but the previous
> version works.
> 
> bala ~/src/bioperl-live> git checkout master
> Previous HEAD position was 5e278f5... Robson's patch for buggy blastpgp output
> Switched to branch 'master'
> 
> When I started using the latest mpiBLAST code a few months ago, I compared
> its format 0 output to that of standard NCBI BLAST, and they were identical.
> 
> 
> 
> 
> Also, I've noticed a discrepancy within BioPerl BLAST parsing that
> I have not had time to work on. Would you be interested in having a look?
> 
> I am creating output from mpiBLAST in format 0 and then converting it into
> tab-delimited format 8. When I compare the converted output to the format 8
> output produced directly by mpiBLAST, the two are not identical in all cases.
> Sometimes the mismatch and gap values are off by one.
> 
> I am attaching a script that does the conversion. It is the same one I was
> using when I noticed the problem above. I was going to put the code into
> bioperl but that got delayed when I noticed the discrepancies.
> 
> 
> Cheers,
> 
> 
>   -Heikki
> 
> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
> cell: +966 545 595 849  office: +966 2 808 2429
> 
> Computational Bioscience Research Centre (CBRC), Building #2, Office #4216
> 4700 King Abdullah University of Science and Technology (KAUST)
> Thuwal 23955-6900, Kingdom of Saudi Arabia
> 
> 
> 
> On 4 May 2010 20:55, Razi Khaja <razi.khaja at gmail.com> wrote:
> 
>> That is odd.  Heikki, do you have a BLAST output file that produces this
>> error?
>> Could you attach the file and send it either to the list or to me (if the
>> list does not accept attachments)?
>> Thanks,
>> Razi
>> 
>> 
>> On Mon, May 3, 2010 at 8:08 AM, Chris Fields <cjfields at illinois.edu> wrote:
>> 
>>> Odd, I ran tests on that prior to commit.  I'll work on fixing that (in svn,
>>> of course, until the migration is complete).
>>> 
>>> chris
>>> 
>>> On May 3, 2010, at 6:45 AM, Heikki Lehvaslaiho wrote:
>>> 
>>>> Chris,
>>>> 
>>>> The latest additions to Bio::SearchIO::blast.pm broke the parsing of normal
>>>> BLAST output.  $result->query_name now returns undef.
>>>> 
>>>> (Using the anonymous git now). This change still works:
>>>> 
>>>> commit 5e278f5dbb9afc4dc0359cd3fdc8fb0d0f4cad74
>>>> Author: cjfields <cjfields at eb9725d8-4842-0410-9bbb-c0b52e2da49b>
>>>> Date:   Sun Dec 20 04:39:58 2009 +0000
>>>> 
>>>>  Robson's patch for buggy blastpgp output
>>>> 
>>>> But this does not:
>>>> 
>>>> commit 9a89c3434597104dd50553e3562983d78d14a544
>>>> Author: cjfields <cjfields at eb9725d8-4842-0410-9bbb-c0b52e2da49b>
>>>> Date:   Thu Apr 15 04:21:17 2010 +0000
>>>> 
>>>>  [bug 3031]
>>>> 
>>>>  patches for catching algorithm ref, courtesy Razi Khaja.
>>>> 
>>>> That makes it easy to find the diffs:
>>>> 
>>>> $ git diff 5e278f5dbb9afc4dc0359cd3fdc8fb0d0f4cad74 9a89c3434597104dd50553e3562983d78d14a544 Bio/SearchIO/blast.pm
>>>> diff --git a/Bio/SearchIO/blast.pm b/Bio/SearchIO/blast.pm
>>>> index 378023a..6f7eeeb 100644
>>>> --- a/Bio/SearchIO/blast.pm
>>>> +++ b/Bio/SearchIO/blast.pm
>>>> @@ -209,6 +209,7 @@ BEGIN {
>>>> 
>>>>       'BlastOutput_program'             => 'RESULT-algorithm_name',
>>>>       'BlastOutput_version'             => 'RESULT-algorithm_version',
>>>> +        'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
>>>>       'BlastOutput_query-def'           => 'RESULT-query_name',
>>>>       'BlastOutput_query-len'           => 'RESULT-query_length',
>>>>       'BlastOutput_query-acc'           => 'RESULT-query_accession',
>>>> @@ -504,6 +505,26 @@ sub next_result {
>>>>               }
>>>>           );
>>>>       }
>>>> +        # parse the BLAST algorithm reference
>>>> +        elsif(/^Reference:\s+(.*)$/) {
>>>> +            # want to preserve newlines for the BLAST algorithm reference
>>>> +            my $algorithm_reference = "$1\n";
>>>> +            $_ = $self->_readline;
>>>> +            # while the current line, does not match an empty line, a RID:, or a Database:, we are still looking at the
>>>> +            # algorithm_reference, append it to what we parsed so far
>>>> +            while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/) {
>>>> +                $algorithm_reference .= "$_";
>>>> +                $_ = $self->_readline;
>>>> +            }
>>>> +            # if we exited the while loop, we saw an empty line, a RID:, or a Database:, so push it back
>>>> +            $self->_pushback($_);
>>>> +            $self->element(
>>>> +                {
>>>> +                    'Name' => 'BlastOutput_algorithm-reference',
>>>> +                    'Data' => $algorithm_reference
>>>> +                }
>>>> +            );
>>>> +        }
>>>>       # added Windows workaround for bug 1985
>>>>       elsif (/^(Searching|Results from round)/) {
>>>>           next unless $1 =~ /Results from round/;
>>>> 
>>>> 
>>>> I am not sure why reference parsing messes things up. Maybe it eats too
>>>> many lines from the result file.
>>>> 
>>>> Yours,
>>>> 
>>>>  -Heikki
>>>> 
>>>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>>>> cell: +966 545 595 849  office: +966 2 808 2429
>>>> 
>>>> Computational Bioscience Research Centre (CBRC), Building #2, Office #4216
>>>> 4700 King Abdullah University of Science and Technology (KAUST)
>>>> Thuwal 23955-6900, Kingdom of Saudi Arabia
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>>> 
>> 
> <mpiblast.out><blastparser028.pl><blast.pm.diff>
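
On the off-by-one mismatch and gap values Heikki mentions when converting format 0 to format 8: the thread does not establish the cause. One thing that may be worth checking is that the tabular format reports gap openings, while a converter working from the pairwise alignment can easily count gap characters instead. The sketch below shows the two counts side by side for a pair of aligned strings. The function name tabular_counts and the example sequences are hypothetical and are not taken from blastparser028.pl or from BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): derive tabular-style counts from
# two aligned strings of equal length, with '-' marking a gap position.
sub tabular_counts {
    my ($query, $subject) = @_;
    die "aligned strings must have equal length\n"
        unless length($query) == length($subject);

    my ($identical, $mismatches, $gap_chars, $gap_opens) = (0, 0, 0, 0);
    my $in_gap = '';    # row holding the current gap run: 'q', 's' or ''

    for my $i (0 .. length($query) - 1) {
        my $q = substr($query,   $i, 1);
        my $s = substr($subject, $i, 1);
        my $gap_row = $q eq '-' ? 'q' : $s eq '-' ? 's' : '';

        if ($gap_row) {
            $gap_chars++;
            # a new opening starts whenever the gap run begins or switches rows
            $gap_opens++ if $gap_row ne $in_gap;
        }
        elsif (uc($q) eq uc($s)) {
            $identical++;
        }
        else {
            $mismatches++;
        }
        $in_gap = $gap_row;
    }

    return {
        length     => length($query),
        identical  => $identical,
        mismatches => $mismatches,
        gap_chars  => $gap_chars,
        gap_opens  => $gap_opens,
    };
}

# Made-up alignment: one mismatch, a two-column gap and a one-column gap.
my $counts = tabular_counts('ACGT--ACGTA', 'ACCTGGAC-TA');
printf "%s=%d\n", $_, $counts->{$_} for sort keys %$counts;
```

For this made-up alignment the gap character count is 3 while the gap opening count is 2, so a converter that reports one where the other is expected would differ by one on this hit. Whether that is what happens in the script attached to this thread is an assumption that would need checking against the attached files.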

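Heikki's guess that the reference parsing "eats too many lines" can be illustrated outside BioPerl. The sketch below copies only the stop conditions from the loop in the quoted diff (blank line, RID:, Database:) and runs them over two made-up headers. The function collect_reference, the array-based line buffer, and both inputs are hypothetical; the thread does not confirm that a missing blank line is what broke query_name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the read-ahead in the patch (not BioPerl code).
# Takes chomped lines, the first of which must start with "Reference:".
# Returns the collected reference text and an array ref of the lines left over.
sub collect_reference {
    my ($first, @rest) = @_;
    my ($reference) = $first =~ /^Reference:\s+(.*)$/
        or die "first line must start with 'Reference:'\n";
    $reference .= "\n";

    # Same stop conditions as the patch: empty line, RID:, or Database:
    while (@rest
        && $rest[0] !~ /^$/
        && $rest[0] !~ /^RID:/
        && $rest[0] !~ /^Database:/)
    {
        $reference .= shift(@rest) . "\n";
    }
    return ($reference, \@rest);
}

# Case A: a blank line ends the reference, so the Query= line is left alone.
my ($ref_a, $left_a) = collect_reference(
    'Reference: Altschul, Stephen F., et al. (1997),',
    'Nucleic Acids Res. 25:3389-3402.',
    '',
    'Query= seq1',
);

# Case B: no blank line before Query=, so the loop keeps reading past it.
my ($ref_b, $left_b) = collect_reference(
    'Reference: Altschul, Stephen F., et al. (1997),',
    'Nucleic Acids Res. 25:3389-3402.',
    'Query= seq1',
    '',
    'Database: nr',
);

print "case A: Query= line still available to the parser\n"
    if grep { /^Query=/ } @$left_a;
print "case B: Query= line swallowed into the reference text\n"
    if $ref_b =~ /^Query=/m;
```

If the attached mpiblast.out has a header shaped like case B, adding the query line to the loop's stop conditions would be one way to keep it from being consumed. That is a suggestion to test against the attachment, not a description of what blast.pm.diff does.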



