[Bioperl-l] BLAST Parsing Bug?

Jason Stajich jason@cgt.mc.duke.edu
Tue, 20 Aug 2002 14:29:28 -0400 (EDT)


Yep, I never tried it on output with the "Subset" lines. I will have to add
an exit condition for the HSP loop on lines starting with either "Database"
or "Subset".

Currently it only looks for "Database:" lines.
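
For illustration, the exit condition described above could look roughly like
the sketch below. This is not the actual Bio::SearchIO::blast code: the helper
names is_footer_line and read_alignment_lines are hypothetical, and the sample
lines are copied from the report quoted further down. It assumes the footer
lines may carry leading whitespace, as they do in that report.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::SearchIO::blast): true when a line
# starts the report footer, i.e. begins with "Database:" or "Subset of the
# database", optionally preceded by whitespace.
sub is_footer_line {
    my ($line) = @_;
    return $line =~ /^\s*(?:Database:|Subset of the database)/ ? 1 : 0;
}

# Collect lines up to, but not including, the first footer line.
sub read_alignment_lines {
    my (@lines) = @_;
    my @kept;
    for my $line (@lines) {
        last if is_footer_line($line);    # the exit condition for the HSP loop
        push @kept, $line;
    }
    return @kept;
}

# Sample input: the last HSP block of the report, followed by the footer.
my @report = (
    'Query: 124 tcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178',
    '           ||||||||||||||||||||||| ||||||||||||||||||||| |||||||||',
    'Sbjct: 330 tcaggacctctggaagagcagtcagtgctcttaaccgctgagccacctctccagc 384',
    '',
    '  Subset of the database(s) listed below',
    '     Number of letters searched: 123,827,604',
    '  Database: est_others',
);

my @alignment = read_alignment_lines(@report);
print scalar(@alignment), " lines read before the footer\n";
```

With a check like this, the "Subset of the database(s) listed below" line
would end the HSP loop instead of being treated as a midline.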


On Tue, 20 Aug 2002, Paul Boutros wrote:

> Okay, same code, new Blast record, new error.  The new blast record was
> run with parameters:
> -l (restricting it to a subset of GIs)
> -v 10 (restrict to 10 hits)
> -e 0.3 (expectation value)
>
> The error seems to be suggesting that there is an empty line somewhere
> where it isn't expected (i.e. midline = "\n").
>
> Any comments?  I've followed the suggestion of testing this both with
> ActiveState and by downloading the libraries and installing them directly.
> I get the same results either way.
>
> I also verified all the dependencies are present and up-to-date.
>
> Paul
>
>
I've never seen the '
> error:
> ------------- EXCEPTION  -------------
> MSG: no data for midline   Subset of the database(s) listed below
> STACK Bio::SearchIO::blast::next_result C:/Perl/site/lib/Bio/SearchIO/blast.pm:566
> STACK toplevel blastp~1.pl:9
> --------------------------------------
>
> new blast file:
> BLASTN 2.2.3 [Apr-24-2002]
>
>
> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> programs",  Nucleic Acids Res. 25:3389-3402.
>
> Query= H3001A01-3(C0001A09-3)
>          (362 letters)
>
> Database: est_others
>            5,032,538 sequences; 2,449,699,975 total letters
>
>
>
>                                                                  Score    E
> Sequences producing significant alignments:                      (bits) Value
>
> gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus...   274  1e-072
> gb|BF290726.1|BF290726 EST455317 Rat Gene Index, normalized rat,...   266  3e-070
> gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus...   260  2e-068
>
> >gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus
> norvegicus cDNA clone
>            UI-R-DN0-civ-m-09-0-UI 3'
>           Length = 468
>
>  Score =  274 bits (138), Expect = 1e-072
>  Identities = 168/178 (94%)
>  Strand = Plus / Plus
>
>
> Query: 1   aagatttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccag 60
>            ||||||||||||||||||  ||| ||| |||||||||||||||||||||||| |||||||
> Sbjct: 20  aagatttatttatttattttatgcatatgaatacactgtagctgtcttcagatacaccag 79
>
>
> Query: 61  aagagggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttg 120
>            ||||||||||||||||| ||| ||||||| ||||||||||||||||||||||||||||||
> Sbjct: 80  aagagggcatcagatcttattacagatggttgtgagccaccatgtggttgctgggatttg 139
>
>
> Query: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
>            |||||||||||||||||||||||||| |||||||||||| ||||||||||||||||||
> Sbjct: 140 aactcaggacctctggaagagcagtcagtgctcttaaccactgagccatctctccagc 197
>
>
> >gb|BF290726.1|BF290726 EST455317 Rat Gene Index, normalized rat, Rattus
> norvegicus cDNA
>            Rattus norvegicus cDNA clone RGIIB68 3' sequence
>           Length = 223
>
>  Score =  266 bits (134), Expect = 3e-070
>  Identities = 167/178 (93%)
>  Strand = Plus / Plus
>
>
> Query: 1   aagatttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccag 60
>            |||||||||||||||||| |||||||  || |||||||||||||||||||||||||||||
> Sbjct: 1   aagatttatttatttatttcatgtatgtgagtacactgtagctgtcttcagacacaccag 60
>
>
> Query: 61  aagagggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttg 120
>            |||||||||||||||| |||  | ||||| |||||| ||||||||||||||||||| |||
> Sbjct: 61  aagagggcatcagatcccatcacggatggttgtgaggcaccatgtggttgctgggaattg 120
>
>
> Query: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
>            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> Sbjct: 121 aactcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
>
>
> >gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus
> norvegicus cDNA clone
>            UI-R-DN0-cit-e-07-0-UI 3'
>           Length = 524
>
>  Score =  260 bits (131), Expect = 2e-068
>  Identities = 164/175 (93%)
>  Strand = Plus / Plus
>
>
> Query: 4   atttatttatttattccatgtataggaatacactgtagctgtcttcagacacaccagaag 63
>            |||||||||||||||   | |||| || |||| |||||||||||||||||||||||||||
> Sbjct: 210 atttatttatttatttattatatatgagtacattgtagctgtcttcagacacaccagaag 269
>
>
> Query: 64  agggcatcagatctcattgcagatggctgtgagccaccatgtggttgctgggatttgaac 123
>            |||||||||||||||||| ||||||| |||||||||||||||||||||||||||||||||
> Sbjct: 270 agggcatcagatctcattacagatggttgtgagccaccatgtggttgctgggatttgaac 329
>
>
> Query: 124 tcaggacctctggaagagcagtcggtgctcttaaccgctgagccatctctccagc 178
>            ||||||||||||||||||||||| ||||||||||||||||||||| |||||||||
> Sbjct: 330 tcaggacctctggaagagcagtcagtgctcttaaccgctgagccacctctccagc 384
>
>
>
>
>   Subset of the database(s) listed below
>      Number of letters searched: 123,827,604
>      Number of sequences searched:  285,629
>
>   Database: est_others
>     Posted date:  Aug 15, 2002 12:08 PM
>   Number of letters in database: 333,332,922
>   Number of sequences in database:  0
>
>   Database: c:\docume~1\paul\blast\data\est_others.01
>     Posted date:  Aug 15, 2002 12:21 PM
>   Number of letters in database: 333,333,126
>   Number of sequences in database:  734,123
>
>   Database: c:\docume~1\paul\blast\data\est_others.02
>     Posted date:  Aug 15, 2002 12:33 PM
>   Number of letters in database: 333,332,951
>   Number of sequences in database:  710,185
>
>   Database: c:\docume~1\paul\blast\data\est_others.03
>     Posted date:  Aug 15, 2002 12:45 PM
>   Number of letters in database: 333,332,998
>   Number of sequences in database:  651,575
>
>   Database: c:\docume~1\paul\blast\data\est_others.04
>     Posted date:  Aug 15, 2002 12:56 PM
>   Number of letters in database: 333,332,826
>   Number of sequences in database:  637,159
>
>   Database: c:\docume~1\paul\blast\data\est_others.05
>     Posted date:  Aug 15, 2002  1:07 PM
>   Number of letters in database: 333,333,104
>   Number of sequences in database:  630,795
>
>   Database: c:\docume~1\paul\blast\data\est_others.06
>     Posted date:  Aug 15, 2002  1:19 PM
>   Number of letters in database: 333,332,943
>   Number of sequences in database:  650,535
>
>   Database: c:\docume~1\paul\blast\data\est_others.07
>     Posted date:  Aug 15, 2002  1:28 PM
>   Number of letters in database: 116,369,105
>   Number of sequences in database:  227,351
>
> Lambda     K      H
>     1.37    0.711     1.31
>
> Gapped
> Lambda     K      H
>     1.37    0.711     1.31
>
>
> Matrix: blastn matrix:1 -3
> Gap Penalties: Existence: 5, Extension: 2
> Number of Hits to DB: 49,566
> Number of Sequences: 4241723
> Number of extensions: 49566
> Number of successful extensions: 21421
> Number of sequences better than  0.3: 2335
> length of query: 362
> length of database: 123,827,604
> effective HSP length: 18
> effective length of query: 344
> effective length of database: 118,686,282
> effective search space: 40828081008
> effective search space used: 40828081008
> T: 0
> A: 40
> X1: 6 (11.9 bits)
> X2: 15 (29.7 bits)
> S1: 12 (24.3 bits)
> S2: 19 (38.2 bits)
> BLASTN 2.2.3 [Apr-24-2002]
>
>
>
>
> On Tue, 20 Aug 2002, Jason Stajich wrote:
>
> > Because the parser expects to be parsing a full blast report - you are
> > only providing it with a report which has hits but no hsps.
> >
> > At some point we can adapt the module to parse these types of reports, but
> > for now it is only going to work with reports that have the full
> > alignments included.
> >
> > -jason
> >
> > On Tue, 20 Aug 2002, Paul Boutros wrote:
> >
> > > Hello,
> > >
> > > I am just starting with Bioperl, trying to evaluate how useful it will be
> > > for our group.  I'm struggling with getting it to work on my first few
> > > steps here, though.  I would like to use the SearchIO system to parse a
> > > blast-results file and I get strange results.
> > >
> > > System: Win2k Pro (sp3)
> > > Perl: 5.6.1 ActiveState build 631 (all packages are updated)
> > > BioPerl: 1.00.2
> > >
> > > The basic problem is that the parser isn't finding any of the hits.  At
> > > all.  So the code below comes back with $count=0 for every record in the
> > > BLAST output file.  Any ideas what I'm doing wrong?
> > >
> > > Paul
> > >
> > >
> > > Code:
> > > use strict;
> > > use Bio::SearchIO;
> > >
> > > my $searchio = new Bio::SearchIO(
> > > 			'-format'	=> 'blast',
> > > 			'-file'		=> '15k5prime.out',
> > > 			);
> > >
> > > while (my $result = $searchio->next_result()) {
> > >
> > > 	my $count = 0;
> > >
> > > 	print "Name: ", $result->query_name(), "\n";
> > >
> > > 	while (my $hit = $result->next_hit()) {
> > > 		$count++;
> > > 		}
> > >
> > > 	print "Count: $count\n";
> > >
> > > 	}
> > >
> > > Blast File Fragment:
> > >
> > > BLASTN 2.2.3 [Apr-24-2002]
> > >
> > >
> > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
> > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
> > > "Gapped BLAST and PSI-BLAST: a new generation of protein database search
> > > programs",  Nucleic Acids Res. 25:3389-3402.
> > >
> > > Query= H3001A01-5
> > >          (589 letters)
> > >
> > > Database: est_others
> > >            5,032,538 sequences; 2,449,699,975 total letters
> > >
> > >
> > >
> > >                                                                  Score    E
> > > Sequences producing significant alignments:                      (bits) Value
> > >
> > > gb|BQ206993.1|BQ206993 UI-R-DZ1-cnm-h-16-0-UI.s1 UI-R-DZ1 Rattus...   200  3e-050
> > > gb|BM386877.1|BM386877 UI-R-CN1-cjh-d-20-0-UI.s1 UI-R-CN1 Rattus...   198  1e-049
> > > gb|BI301905.1|BI301905 UI-R-DL0-cio-k-03-0-UI.s1 UI-R-DL0 Rattus...   198  1e-049
> > > gb|BI301460.1|BI301460 UI-R-DN0-cit-e-07-0-UI.s1 UI-R-DN0 Rattus...   198  1e-049
> > > gb|BG371847.1|BG371847 UI-R-CV0-brj-a-09-0-UI.s1 UI-R-CV0 Rattus...   198  1e-049
> > > gb|BE115424.1|BE115424 UI-R-BS1-axu-f-02-0-UI.s1 UI-R-BS1 Rattus...   198  1e-049
> > > gb|AA819696.1|AA819696 UI-R-A0-bh-d-10-0-UI.s1 UI-R-A0 Rattus no...   192  6e-048
> > > gb|BM383271.1|BM383271 UI-R-DS0-cje-i-16-0-UI.s1 UI-R-DS0 Rattus...   190  2e-047
> > > gb|BI292210.1|BI292210 UI-R-DN0-civ-m-09-0-UI.s1 UI-R-DN0 Rattus...   190  2e-047
> > > gb|BI284655.1|BI284655 UI-R-DE0-cac-f-05-0-UI.s1 UI-R-DE0 Rattus...   190  2e-047
> > >
> > >   Subset of the database(s) listed below
> > >      Number of letters searched: 123,827,604
> > >      Number of sequences searched:  285,629
> > >
> > >   Database: est_others
> > >     Posted date:  Aug 15, 2002 12:08 PM
> > >   Number of letters in database: 333,332,922
> > >   Number of sequences in database:  0
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.01
> > >     Posted date:  Aug 15, 2002 12:21 PM
> > >   Number of letters in database: 333,333,126
> > >   Number of sequences in database:  734,123
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.02
> > >     Posted date:  Aug 15, 2002 12:33 PM
> > >   Number of letters in database: 333,332,951
> > >   Number of sequences in database:  710,185
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.03
> > >     Posted date:  Aug 15, 2002 12:45 PM
> > >   Number of letters in database: 333,332,998
> > >   Number of sequences in database:  651,575
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.04
> > >     Posted date:  Aug 15, 2002 12:56 PM
> > >   Number of letters in database: 333,332,826
> > >   Number of sequences in database:  637,159
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.05
> > >     Posted date:  Aug 15, 2002  1:07 PM
> > >   Number of letters in database: 333,333,104
> > >   Number of sequences in database:  630,795
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.06
> > >     Posted date:  Aug 15, 2002  1:19 PM
> > >   Number of letters in database: 333,332,943
> > >   Number of sequences in database:  650,535
> > >
> > >   Database: c:\docume~1\paul\blast\data\est_others.07
> > >     Posted date:  Aug 15, 2002  1:28 PM
> > >   Number of letters in database: 116,369,105
> > >   Number of sequences in database:  227,351
> > >
> > > Lambda     K      H
> > >     1.37    0.711     1.31
> > >
> > > Gapped
> > > Lambda     K      H
> > >     1.37    0.711     1.31
> > >
> > >
> > > Matrix: blastn matrix:1 -3
> > > Gap Penalties: Existence: 5, Extension: 2
> > > Number of Hits to DB: 69,708
> > > Number of Sequences: 4241723
> > > Number of extensions: 69708
> > > Number of successful extensions: 25280
> > > Number of sequences better than  0.3: 2163
> > > length of query: 589
> > > length of database: 123,827,604
> > > effective HSP length: 18
> > > effective length of query: 571
> > > effective length of database: 118,686,282
> > > effective search space: 67769867022
> > > effective search space used: 67769867022
> > > T: 0
> > > A: 40
> > > X1: 6 (11.9 bits)
> > > X2: 15 (29.7 bits)
> > > S1: 12 (24.3 bits)
> > > S2: 19 (38.2 bits)
> > > BLASTN 2.2.3 [Apr-24-2002]
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> >
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu