[Bioperl-l] BLAST Parsing Bug?

Paul Boutros pcboutro@engmail.uwaterloo.ca
Tue, 20 Aug 2002 15:40:37 -0400 (EDT)


Excellent, I got it to work now.

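Here's a quick fix in case anybody reads through the archives looking for
a solution.  Run the BLAST report through the script below before handing
it to Bio::SearchIO.  It drops everything from the "Subset of the
database(s) listed below" line through the first "Number of sequences in
database:" line, so the parser never sees the Subset block it chokes on.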

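use strict;
use warnings;

# Usage: perl <this script> <raw BLAST report> <cleaned report>
open(my $in,  '<', $ARGV[0]) or die "Cannot read $ARGV[0]: $!";
open(my $out, '>', $ARGV[1]) or die "Cannot write $ARGV[1]: $!";

# True while we are inside the block being stripped
my $flag = 0;

while (my $line = <$in>) {

	if ($flag == 1) {
		# Skip lines until the first "Number of sequences in database:"
		# line, which is also dropped; copying resumes after it
		if ($line =~ /Number of sequences in database: /) {
			$flag = 0;
			}

		}

	else {
		# The "Subset of the database" line starts the skipped block
		if ($line =~ /Subset of the database/) {
			$flag = 1;
			}

		else {
			print $out $line;
			}

		}

	}

close($in);
close($out) or die "Cannot close $ARGV[1]: $!";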
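
If you would rather keep every Database: block in the footer, a narrower
filter is possible.  The following is only a sketch: it assumes the Subset
block always ends at the first blank line (as it does in the report quoted
below), the name strip_subset_block is made up for illustration, and I have
not run its output through Bio::SearchIO.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the "Subset of the database(s) listed below"
# line and everything up to and including the next blank line.
# All other lines are returned unchanged.
sub strip_subset_block {
    my @lines = @_;
    my @kept;
    my $skipping = 0;
    for my $line (@lines) {
        if ($skipping) {
            # A whitespace-only line ends the Subset block.
            $skipping = 0 if $line =~ /^\s*$/;
            next;
        }
        if ($line =~ /Subset of the database/) {
            $skipping = 1;
            next;
        }
        push @kept, $line;
    }
    return @kept;
}

# Small fragment shaped like the end of a report.
my @report = (
    "Sbjct: 330 tcagg 384\n",
    "\n",
    "  Subset of the database(s) listed below\n",
    "     Number of letters searched: 123,827,604\n",
    "     Number of sequences searched:  285,629\n",
    "\n",
    "  Database: est_others\n",
    "    Posted date:  Aug 15, 2002 12:08 PM\n",
);
print strip_subset_block(@report);
```

Unlike the script above, this leaves the first "Database: est_others" block
in place, so the cleaned file stays closer to what BLAST wrote.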

Thanks for the help getting this sorted out.
Paul

On Tue, 20 Aug 2002, Jason Stajich wrote:

> yep - never tried it on output with the subset lines, will have to add an
> exit condition for the HSP loop on either lines starting with
> "Database"
> or
> "Subset"
> 
> currently it only looked for
> "Database:"
> lines.
> 
> 
> >   Database: c:\docume~1\paul\blast\data\est_others.03
> >     Posted date:  Aug 15, 2002 12:45 PM
> >   Number of letters in database: 333,332,998
> >   Number of sequences in database:  651,575
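
The exit condition described above might look roughly like the check below.
This is a sketch of the idea only, not the code in Bio::SearchIO::blast; the
helper name is_report_footer is made up, and the check is meant to be applied
only while reading HSP lines, since a "Database:" line also appears in the
report header.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): true when a line marks the
# start of the report footer, i.e. the end of the last HSP.
sub is_report_footer {
    my ($line) = @_;
    return $line =~ /^\s*(?:Database:|Subset of the database)/ ? 1 : 0;
}

for my $line (
    "  Subset of the database(s) listed below\n",
    "  Database: est_others\n",
    "Sbjct: 330 tcaggacctctgg 384\n",
) {
    printf "%-6s %s", (is_report_footer($line) ? 'footer' : 'body'), $line;
}
```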
> 
> On Tue, 20 Aug 2002, Paul Boutros wrote:
> 
> > Okay, same code, new Blast record, new error.  The new blast record was
> > run with parameters:
> > -l (restricting it to a subset of GIs)
> > -v 10 (restrict to 10 hits)
> > -e 0.3 (expectation value)
> >
> > The error seems to be suggesting that there is an empty line somewhere
> > where it isn't expected (i.e. midline = "\n").
> >
> > Any comments?  I've followed the suggestion of testing this both with
> > ActiveState and by downloading the libraries and installing them directly.
> > I get the same results either way.
> >
> > I also verified all the dependencies are present and up-to-date.
> >
> > Paul
> >
> >
> I've never seen the '
> > error:
> > ------------- EXCEPTION  -------------
> > MSG: no data for midline   Subset of the database(s) listed below
> > STACK Bio::SearchIO::blast::next_result
> > C:/Perl/site/lib/Bio/SearchIO/blast.pm:566
> > STACK toplevel blastp~1.pl:9
> > --------------------------------------
> >
> > new blast file:
> > BLASTN 2.2.3 [Apr-24-2002]
> >
> >
> > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> > "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> > programs",  Nucleic Acids Res. 25:3389-3402.
> >
> > Query= H3001A01-3(C0001A09-3)
> >          (362 letters)
> >
> > Database: est_others
> >            5,032,538 sequences; 2,449,699,975 total letters
> >
> >
> >
> >                                                                  Score    E
> > Sequences producing significant alignments:                      (bits) Value
> >
> > gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus...   274   1e-072
> > gb|BF290726.1|BF290726 EST455317 Rat Gene Index, normalized rat,...   266   3e-070
> > gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus...   260   2e-068
> >
> > >gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus norvegicus cDNA clone
> >            UI-R-DN0-civ-m-09-0-UI 3'
> >           Length = 468
> >
> >  Score =  274 bits (138), Expect = 1e-072
> >  Identities = 168/178 (94%)
> >  Strand = Plus / Plus
> >
> >
> > Query: 1   aagatttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccag 60
> >            ||||||||||||||||||  ||| ||| |||||||||||||||||||||||| |||||||
> > Sbjct: 20  aagatttatttatttattttatgcatatgaatacactgtagctgtcttcagatacaccag 79
> >
> >
> > Query: 61  aagagggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttg 120
> >            ||||||||||||||||| ||| ||||||| ||||||||||||||||||||||||||||||
> > Sbjct: 80  aagagggcatcagatcttattacagatggttgtgagccaccatgtggttgctgggatttg 139
> >
> >
> > Query: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
> >            |||||||||||||||||||||||||| |||||||||||| ||||||||||||||||||
> > Sbjct: 140 aactcaggacctctggaagagcagtcagtgctcttaaccactgagccatctctccagc 197
> >
> >
> > >gb|BF290726.1|BF290726 EST455317 Rat Gene Index, normalized rat, Rattus norvegicus cDNA
> >            Rattus norvegicus cDNA clone RGIIB68 3' sequence
> >           Length = 223
> >
> >  Score =  266 bits (134), Expect = 3e-070
> >  Identities = 167/178 (93%)
> >  Strand = Plus / Plus
> >
> >
> > Query: 1   aagatttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccag 60
> >            |||||||||||||||||| |||||||  || |||||||||||||||||||||||||||||
> > Sbjct: 1   aagatttatttatttatttcatgtatgtgagtacactgtagctgtcttcagacacaccag 60
> >
> >
> > Query: 61  aagagggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttg 120
> >            |||||||||||||||| |||  | ||||| |||||| ||||||||||||||||||| |||
> > Sbjct: 61  aagagggcatcagatcccatcacggatggttgtgaggcaccatgtggttgctgggaattg 120
> >
> >
> > Query: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
> >            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> > Sbjct: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
> >
> >
> > >gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus norvegicus cDNA clone
> >            UI-R-DN0-cit-e-07-0-UI 3'
> >           Length = 524
> >
> >  Score =  260 bits (131), Expect = 2e-068
> >  Identities = 164/175 (93%)
> >  Strand = Plus / Plus
> >
> >
> > Query: 4   atttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccagaag 63
> >            |||||||||||||||   | |||| || |||| |||||||||||||||||||||||||||
> > Sbjct: 210 atttatttatttatttattatatatgagtacattgtagctgtcttcagacacaccagaag 269
> >
> >
> > Query: 64  agggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttgaac 123
> >            |||||||||||||||||| ||||||| |||||||||||||||||||||||||||||||||
> > Sbjct: 270 agggcatcagatctcattacagatggttgtgagccaccatgtggttgctgggatttgaac 329
> >
> >
> > Query: 124 tcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
> >            ||||||||||||||||||||||| ||||||||||||||||||||| |||||||||
> > Sbjct: 330 tcaggacctctggaagagcagtcagtgctcttaaccgctgagccacctctccagc 384
> >
> >
> >
> >
> >   Subset of the database(s) listed below
> >      Number of letters searched: 123,827,604
> >      Number of sequences searched:  285,629
> >
> >   Database: est_others
> >     Posted date:  Aug 15, 2002 12:08 PM
> >   Number of letters in database: 333,332,922
> >   Number of sequences in database:  0
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.01
> >     Posted date:  Aug 15, 2002 12:21 PM
> >   Number of letters in database: 333,333,126
> >   Number of sequences in database:  734,123
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.02
> >     Posted date:  Aug 15, 2002 12:33 PM
> >   Number of letters in database: 333,332,951
> >   Number of sequences in database:  710,185
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.03
> >     Posted date:  Aug 15, 2002 12:45 PM
> >   Number of letters in database: 333,332,998
> >   Number of sequences in database:  651,575
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.04
> >     Posted date:  Aug 15, 2002 12:56 PM
> >   Number of letters in database: 333,332,826
> >   Number of sequences in database:  637,159
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.05
> >     Posted date:  Aug 15, 2002  1:07 PM
> >   Number of letters in database: 333,333,104
> >   Number of sequences in database:  630,795
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.06
> >     Posted date:  Aug 15, 2002  1:19 PM
> >   Number of letters in database: 333,332,943
> >   Number of sequences in database:  650,535
> >
> >   Database: c:\docume~1\paul\blast\data\est_others.07
> >     Posted date:  Aug 15, 2002  1:28 PM
> >   Number of letters in database: 116,369,105
> >   Number of sequences in database:  227,351
> >
> > Lambda     K      H
> >     1.37    0.711     1.31
> >
> > Gapped
> > Lambda     K      H
> >     1.37    0.711     1.31
> >
> >
> > Matrix: blastn matrix:1 -3
> > Gap Penalties: Existence: 5, Extension: 2
> > Number of Hits to DB: 49,566
> > Number of Sequences: 4241723
> > Number of extensions: 49566
> > Number of successful extensions: 21421
> > Number of sequences better than  0.3: 2335
> > length of query: 362
> > length of database: 123,827,604
> > effective HSP length: 18
> > effective length of query: 344
> > effective length of database: 118,686,282
> > effective search space: 40828081008
> > effective search space used: 40828081008
> > T: 0
> > A: 40
> > X1: 6 (11.9 bits)
> > X2: 15 (29.7 bits)
> > S1: 12 (24.3 bits)
> > S2: 19 (38.2 bits)
> > BLASTN 2.2.3 [Apr-24-2002]
> >
> >
> >
> >
> > On Tue, 20 Aug 2002, Jason Stajich wrote:
> >
> > > Because the parser expects to be parsing a full blast report - you are
> > > only providing it with a report which has hits but no hsps.
> > >
> > > At some point we can adapt the module to parse these types of reports, but
> > > for now it is only going to work with reports that have the full
> > > alignments included.
> > >
> > > -jason
> > >
> > > On Tue, 20 Aug 2002, Paul Boutros wrote:
> > >
> > > > Hello,
> > > >
> > > > I am just starting with Bioperl, trying to evaluate how useful it will be
> > > > for our group.  I'm struggling with getting it to work on my first few
> > > > steps here, though.  I would like to use the SearchIO system to parse a
> > > > blast-results file and I get strange results.
> > > >
> > > > System: Win2k Pro (sp3)
> > > > Perl: 5.6.1 ActiveState build 631 (all packages are updated)
> > > > BioPerl: 1.00.2
> > > >
> > > > The basic problem is that the parser isn't finding any of the hits.  At
> > > > all.  So the code below comes back with $count=0 for every record in the
> > > > BLAST output file.  Any ideas what I'm doing wrong?
> > > >
> > > > Paul
> > > >
> > > >
> > > > Code:
> > > > use strict;
> > > > use Bio::SearchIO;
> > > >
> > > > my $searchio = new Bio::SearchIO(
> > > > 			'-format'	=> 'blast',
> > > > 			'-file'		=> '15k5prime.out',
> > > > 			);
> > > >
> > > > while (my $result = $searchio->next_result()) {
> > > >
> > > > 	my $count = 0;
> > > >
> > > > 	print "Name: ", $result->query_name(), "\n";
> > > >
> > > > 	while (my $hit = $result->next_hit()) {
> > > > 		$count++;
> > > > 		}
> > > >
> > > > 	print "Count: $count\n";
> > > >
> > > > 	}
> > > >
> > > > Blast File Fragment:
> > > >
> > > > BLASTN 2.2.3 [Apr-24-2002]
> > > >
> > > >
> > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> > > > "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> > > > programs",  Nucleic Acids Res. 25:3389-3402.
> > > >
> > > > Query= H3001A01-5
> > > >          (589 letters)
> > > >
> > > > Database: est_others
> > > >            5,032,538 sequences; 2,449,699,975 total letters
> > > >
> > > >
> > > >
> > > >                                                                  Score    E
> > > > Sequences producing significant alignments:                      (bits) Value
> > > >
> > > > gb|BQ206993.1|BQ206993 UI-R-DZ1-cnm-h-16-0-UI.s1 UI-R-DZ1 Rattus...   200   3e-050
> > > > gb|BM386877.1|BM386877 UI-R-CN1-cjh-d-20-0-UI.s1 UI-R-CN1 Rattus...   198   1e-049
> > > > gb|BI301905.1|BI301905 UI-R-DL0-cio-k-03-0-UI.s1 UI-R-DL0 Rattus...   198   1e-049
> > > > gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus...   198   1e-049
> > > > gb|BG371847.1|BG371847 UI-R-CV0-brj-a-09-0-UI.s1 UI-R-CV0 Rattus...   198   1e-049
> > > > gb|BE115424.1|BE115424 UI-R-BS1-axu-f-02-0-UI.s1 UI-R-BS1 Rattus...   198   1e-049
> > > > gb|AA819696.1|AA819696 UI-R-A0-bh-d-10-0-UI.s1 UI-R-A0 Rattus no...   192   6e-048
> > > > gb|BM383271.1|BM383271 UI-R-DS0-cje-i-16-0-UI.s1 UI-R-DS0 Rattus...   190   2e-047
> > > > gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus...   190   2e-047
> > > > gb|BI284655.1|BI284655 UI-R-DE0-cac-f-05-0-UI.s1 UI-R-DE0 Rattus...   190   2e-047
> > > >
> > > >   Subset of the database(s) listed below
> > > >      Number of letters searched: 123,827,604
> > > >      Number of sequences searched:  285,629
> > > >
> > > >   Database: est_others
> > > >     Posted date:  Aug 15, 2002 12:08 PM
> > > >   Number of letters in database: 333,332,922
> > > >   Number of sequences in database:  0
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.01
> > > >     Posted date:  Aug 15, 2002 12:21 PM
> > > >   Number of letters in database: 333,333,126
> > > >   Number of sequences in database:  734,123
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.02
> > > >     Posted date:  Aug 15, 2002 12:33 PM
> > > >   Number of letters in database: 333,332,951
> > > >   Number of sequences in database:  710,185
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.03
> > > >     Posted date:  Aug 15, 2002 12:45 PM
> > > >   Number of letters in database: 333,332,998
> > > >   Number of sequences in database:  651,575
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.04
> > > >     Posted date:  Aug 15, 2002 12:56 PM
> > > >   Number of letters in database: 333,332,826
> > > >   Number of sequences in database:  637,159
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.05
> > > >     Posted date:  Aug 15, 2002  1:07 PM
> > > >   Number of letters in database: 333,333,104
> > > >   Number of sequences in database:  630,795
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.06
> > > >     Posted date:  Aug 15, 2002  1:19 PM
> > > >   Number of letters in database: 333,332,943
> > > >   Number of sequences in database:  650,535
> > > >
> > > >   Database: c:\docume~1\paul\blast\data\est_others.07
> > > >     Posted date:  Aug 15, 2002  1:28 PM
> > > >   Number of letters in database: 116,369,105
> > > >   Number of sequences in database:  227,351
> > > >
> > > > Lambda     K      H
> > > >     1.37    0.711     1.31
> > > >
> > > > Gapped
> > > > Lambda     K      H
> > > >     1.37    0.711     1.31
> > > >
> > > >
> > > > Matrix: blastn matrix:1 -3
> > > > Gap Penalties: Existence: 5, Extension: 2
> > > > Number of Hits to DB: 69,708
> > > > Number of Sequences: 4241723
> > > > Number of extensions: 69708
> > > > Number of successful extensions: 25280
> > > > Number of sequences better than  0.3: 2163
> > > > length of query: 589
> > > > length of database: 123,827,604
> > > > effective HSP length: 18
> > > > effective length of query: 571
> > > > effective length of database: 118,686,282
> > > > effective search space: 67769867022
> > > > effective search space used: 67769867022
> > > > T: 0
> > > > A: 40
> > > > X1: 6 (11.9 bits)
> > > > X2: 15 (29.7 bits)
> > > > S1: 12 (24.3 bits)
> > > > S2: 19 (38.2 bits)
> > > > BLASTN 2.2.3 [Apr-24-2002]
> > > >
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > > >
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> >
> 
> -- 
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>