[Bioperl-l] Error parsing BLAST report

Freimuth, Robert freimuth at pathology.wustl.edu
Wed Aug 16 19:56:37 UTC 2006


Hello,

I'm trying to parse a BLAST report using the following code:

use warnings;
use strict;

use Bio::SearchIO;

my $file = 'NP_006065_blast.out';

my $searchio = new Bio::SearchIO( -format => 'blast',
                                  -file   => $file );

while( my $result = $searchio->next_result() )
{
    while( my $hit = $result->next_hit )
    {
            my $hit_acc_num = $hit->accession();

            # get the total length of the aligned region for query or
sbjct seq
            # (includes all HSPs, calculated after tiling)

            my $align_len = $hit->length_aln( 'query' );

            print "Alignment length for $hit_acc_num is $align_len\n";
    }
}

There are 104 one-line descriptions in the report, and alignments for
each one of them (the blast report was created using
b_num_alignments_shown => 500 and v_num_descriptions_shown => 500).
However, when I run the above code I get 14 errors like the following:

-------------------- WARNING ---------------------
MSG: There is no HSP data for hit 'ENSP00000327738'.
You have called a method (Bio::Search::Hit::GenericHit::length_aln)
that requires HSP data and there was no HSP data for this hit,
most likely because it was absent from the BLAST report.
Note that by default, BLAST lists alignments for the first 250 hits,
but it lists descriptions for 500 hits. If this is the case,
and you care about these hits, you should re-run BLAST using the
-b option (or equivalent if not using blastall) to increase the number
of alignments.
 
---------------------------------------------------

There is an alignment for this (and the other 13 sequences) in the
report.  In fact, if I edit the report and delete all but the
description and the alignment for ENSP00000327738, it parses fine (no
error).

I continued editing the report and produced the following minimal test
case that reproduces the error.  Note that the description for
ENSP00000350182 appears twice, BUT THE ERROR IS FOR ENSP00000327738.

*********** BLAST REPORT FOR TEST CASE ***********

BLASTP 2.2.11 [Jun-05-2005]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= NP_006065
         (442 letters)

Database: Homo_sapiens.NCBI36.apr.pep.fa 
           48,851 sequences; 23,910,368 total letters

Searching..................................................done

                                                                 Score
E
Sequences producing significant alignments:                      (bits)
Value

ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E...
120   3e-27
ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E...
120   3e-27
ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189...
115   8e-26

>ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1
gene:ENSG00000137397
           transcript:ENST00000357569 
          Length = 425

 Score =  120 bits (301), Expect = 3e-27
 Identities = 76/261 (29%), Positives = 140/261 (53%), Gaps = 21/261
(8%)

Query: 9   IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL
68
           +++EV CPICL++L +P+++DCGH+FC  CIT +I E+   S G   CP+C+T  +   +
Sbjct: 10  LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI
65

Query: 69  RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG
127
           R N  L N+VE+++ ++ S  Q  +K   C  H +    FC++DGK +C+VC  S++H+ 
Sbjct: 66  RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS
125

Query: 128 HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK
181
           H    I E  +  Q ++Q  +Q L ++++E  +++       D+  ++   + E+Q+IL 
Sbjct: 126 HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT
183

Query: 182 GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL
237
            F  +  +L+ E+   L ++     EG       +A+   QL     D   L+  L+ + 
Sbjct: 184 EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ
239

Query: 238 TGSSVEMLQDVIDVMKRSESW 258
                ++L+     + RSE +
Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260


>ENSP00000327738 pep:known-ccds
chromosome:NCBI36:4:189297592:189305643:1
           gene:ENSG00000184108 transcript:ENST00000332517
           CCDS3851.1
          Length = 468

 Score =  115 bits (289), Expect = 8e-26
 Identities = 101/410 (24%), Positives = 180/410 (43%), Gaps = 39/410
(9%)

Query: 8   DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN
67
           ++ +E+TC ICL+  + P++ +CGHSFC  C+    +E         SCP C    +  +
Sbjct: 9   NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH
64

Query: 68  LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS
122
            + N  L  +    ++++   Q  Q  D    +G+     K     ++ G        ++
Sbjct: 65  FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA
117

Query: 123 QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER
176
           Q H  ++    +E  +  +EKLQ  L  L    +EA      EK    + QE T  K  +
Sbjct: 118 QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK
175

Query: 177 QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR
236
           Q ++  + +M   L  EEQ +LQ LE+ E   +  L     +L QQ +  S +I+ ++  
Sbjct: 176 QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS
235

Query: 237 LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW
296
              S+ E L++V   ++RSE                   +  ++GM ++L++ +      
Sbjct: 236 SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------
285

Query: 297 VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV
355
            ++ L+P +A + + +S D + VK   +      NP  F  +  V G Q F+SG++YWEV
Sbjct: 286 TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV
345

Query: 356 DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405
           +V  K  W +G+     S    +       P   +S +  +    Y  WV
Sbjct: 346 EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391


  Database: Homo_sapiens.NCBI36.apr.pep.fa
    Posted date:  Jun 15, 2006  8:56 PM
  Number of letters in database: 23,910,368
  Number of sequences in database:  48,851
  
Lambda     K      H
   0.319    0.133    0.398 

Gapped
Lambda     K      H
   0.267   0.0410    0.140 


Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 20,900,506
Number of Sequences: 48851
Number of extensions: 899179
Number of successful extensions: 6075
Number of sequences better than 1.0e-25: 105
Number of HSP's better than  0.0 without gapping: 18
Number of HSP's successfully gapped in prelim test: 87
Number of HSP's that attempted gapping in prelim test: 5632
Number of HSP's gapped (non-prelim): 157
length of query: 442
length of database: 23,910,368
effective HSP length: 107
effective length of query: 335
effective length of database: 18,683,311
effective search space: 6258909185
effective search space used: 6258909185
T: 11
A: 40
X1: 16 ( 7.4 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.8 bits)
S2: 289 (115.9 bits)

*********** END BLAST REPORT FOR TEST CASE ***********

Any ideas?

Thanks,

Bob





More information about the Bioperl-l mailing list