[Bioperl-l] Remote Blast Failing

paul.boutros at utoronto.ca paul.boutros at utoronto.ca
Fri Sep 9 00:46:50 EDT 2005


Hello,

NCBI has changed their format for RemoteBlasts, and in some cases this is 
causing SearchIO to fail.  I think this is related to Jason's email from a few 
weeks back:
http://bioperl.org/pipermail/bioperl-l/2005-August/019634.html

All nucleotide queries I tried fail on Perl 5.8.7 on both AIX and WinXP using 
BioPerl 1.4 (the last stable release).  The reason appears to be a change in the 
HSP alignment format: the colon after "Query" and "Sbjct" has been removed.  A 
work-around for BioPerl 1.4 is to change line 1145 of Bio\SearchIO\blast.pm 
this way:
-1145:        if( /^((Query|Sbjct):\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/ ) {
+1145:        if( /^((Query|Sbjct):{0,1}\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/ ) {
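
To see what the one-character change does, here is a small stand-alone sketch 
that runs both patterns against an alignment line with and without the colon.  
The two sample lines and the helper name matches() are made up for illustration 
(the colon-less line is modelled on the report excerpt further down); only the 
two regular expressions come from blast.pm.

```perl
use strict;
use warnings;

# Pattern from BioPerl 1.4: a colon is required after Query/Sbjct.
my $old_re = qr/^((Query|Sbjct):\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/;
# Patched pattern: the colon is optional.
my $new_re = qr/^((Query|Sbjct):{0,1}\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/;

# Illustrative alignment lines (assumed, not taken from a real report file).
my %lines = (
    old_format => 'Query: 1       AGGCCGTTCACCAGTATGA  19',
    new_format => 'Query  1       AGGCCGTTCACCAGTATGA  19',
);

# Hypothetical helper: returns 1 if $line matches $re, 0 otherwise.
sub matches {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $name (sort keys %lines) {
    printf "%-10s old=%d new=%d\n", $name,
        matches($old_re, $lines{$name}),
        matches($new_re, $lines{$name});
}
```

The old pattern only accepts the line that still has the colon, while the 
patched pattern accepts both, so the change should not break parsing of 
older reports.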

I downloaded the CVS tarball, and this change is already in bioperl-live.  
However, one class of queries that *doesn't* work with bioperl-live is genomic 
BLASTs.  Here, NCBI has added several extra lines to the output.

Here's an example of the new format:
#################################################
gi|63489990|ref|NT_039206.4|Mm2_39246_34  Mus musculus chromosome  28.2       25
gi|63482841|ref|NT_078297.3|Mm1_78362_34  Mus musculus chromosome  28.2       25

ALIGNMENTS
>gi|63543231|ref|NT_039343.4|Mm6_39383_34 Mus musculus chromosome 6 genomic 
contig, strain C57BL/6J
          Length=21478308

 Features flanking this part of subject sequence:
   60669 bp at 5' side: hypothetical protein LOC101197
   386242 bp at 3' side: RIKEN cDNA A930040G15

 Score = 38.2 bits (19),  Expect = 0.026
 Identities = 19/19 (100%), Gaps = 0/19 (0%)
 Strand=Plus/Plus

Query  1       AGGCCGTTCACCAGTATGA  19
               |||||||||||||||||||
Sbjct  246489  AGGCCGTTCACCAGTATGA  246507
#################################################

And parsing a report containing this gives the error message:
#################################################
------------- EXCEPTION  -------------
MSG: no data for midline  Features flanking this part of subject sequence:
STACK Bio::SearchIO::blast::next_result C:/Perl/site/lib/Bio\SearchIO\blast.pm:1173
STACK toplevel test_blast.pl:9
--------------------------------------
#################################################

I can submit a patch, but I wanted to get input on the best way to handle this:  
should the feature-data be stored somewhere, or just skipped?
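
To make the two options concrete, here is a rough sketch of a pre-processing 
step that pulls the "Features flanking" block out of the report lines before 
they reach the parser.  The caller can then either discard the collected 
features (skip) or keep them (store).  This is not the bioperl-live code: 
split_flanking_features() is a hypothetical helper, it assumes the block 
always ends at the next blank line, and I have not checked that SearchIO then 
parses the filtered report cleanly.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): separates report lines into the
# lines to keep for parsing and the feature annotations that were set aside.
sub split_flanking_features {
    my @lines = @_;
    my (@kept, @features);
    my $in_block = 0;
    for my $line (@lines) {
        # Header line that opens the new block.
        if ($line =~ /^\s*Features flanking this part of subject sequence:/) {
            $in_block = 1;
            next;
        }
        if ($in_block) {
            # Assumption: a blank line ends the block; drop it as well so
            # only one blank line remains between Length and Score.
            if ($line =~ /^\s*$/) {
                $in_block = 0;
                next;
            }
            (my $feature = $line) =~ s/^\s+|\s+$//g;   # trim whitespace
            push @features, $feature;
            next;
        }
        push @kept, $line;
    }
    return (\@kept, \@features);
}

# Lines copied from the report excerpt above.
my @report = (
    '          Length=21478308',
    '',
    ' Features flanking this part of subject sequence:',
    "   60669 bp at 5' side: hypothetical protein LOC101197",
    "   386242 bp at 3' side: RIKEN cDNA A930040G15",
    '',
    ' Score = 38.2 bits (19),  Expect = 0.026',
);

my ($kept, $features) = split_flanking_features(@report);
print "$_\n" for @$kept;
print "feature: $_\n" for @$features;
```

If the features are stored rather than skipped, they would still need a home 
on the hit or HSP object, which is the design question above.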

Here are the parameters used for this query in case somebody wants to recreate 
it.  I can also forward the blast report file if you're interested.
Sequence: aggccgttcaccagtatgac
Database: mouse_contig/ref_contig
Entrez Query: Mus musculus [ORGN]

A short-term fix if anybody else is having this problem is to BLAST against the 
database 'chromosome' instead of 'mouse_contig/ref_contig', and likewise for 
other species.

Sorry for the long message!
Paul

