[Biojava-l] blast parsing and empty hits

Doug Rusch drusch@tcag.org
Thu, 3 Oct 2002 11:39:19 -0400


I am using NCBI Blast 2.2.4 with the demo app BlastLike2XML. The version of blast should not matter; parsing a normal blast output works fine. However, when the blast search returns no hits (see the sample blast output below), the parser fails. I should note that a comment in BlastSAXParser.java states that the parser will not work on blast output that does not contain a summary field. However, even if I put an artificial summary field into the blast output, it still does not work, because the parser has no way to handle empty blast reports (the "No hits found" line).

That is why I am asking what other people do. The blast parser is almost 2 years old and has not been modified since. Does everyone just write their own parser? Do people just not use BioJava to parse blast output?
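
For what it is worth, one stopgap that avoids touching the parser would be to pre-screen each report and only hand it to the parser when it actually contains hits. The following is a minimal sketch under that assumption, using only the standard Java library. The class and method names are hypothetical and not part of BioJava; the marker text is taken from the sample blast output below and may differ in other blast flavours or versions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of BioJava): detects an empty BLAST report. */
final class BlastReportPrescreen {

    // Marker text as it appears in the NCBI BLAST 2.2.4 sample report (assumption:
    // other versions print the same phrase).
    private static final String NO_HITS_MARKER = "No hits found";

    /** True if this single line is the starred "No hits found" banner. */
    static boolean isNoHitsLine(String line) {
        String trimmed = line.trim();
        return trimmed.startsWith("*") && trimmed.contains(NO_HITS_MARKER);
    }

    /** Scans report text already held in memory. */
    static boolean hasNoHits(String reportText) {
        for (String line : reportText.split("\n")) {
            if (isNoHitsLine(line)) {
                return true;
            }
        }
        return false;
    }

    /** Scans a report line by line from a Reader. */
    static boolean hasNoHits(Reader report) throws IOException {
        BufferedReader in = new BufferedReader(report);
        String line;
        while ((line = in.readLine()) != null) {
            if (isNoHitsLine(line)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BlastReportPrescreen <blast-output-file>");
            return;
        }
        try (Reader r = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (hasNoHits(r)) {
                System.out.println("empty report, skipping parser");
            } else {
                System.out.println("report has hits, pass it on to the parser");
            }
        }
    }
}
```

This only sidesteps the problem: the empty report never reaches the parser, so no XML is produced for it, and the caller has to record the "no hits" result some other way.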

Thanks,
Doug Rusch
TCAG.org

Sample XML output:

<?xml version="1.0"?>

<biojava:BlastLikeDataSetCollection xmlns=""
                                    xmlns:biojava="http://www.biojava.org">
  <biojava:BlastLikeDataSet program="ncbi-blastn"
                            version="2.2.4">
    <biojava:Header></biojava:Header>
  </biojava:BlastLikeDataSet>


Example blast result:

BLASTN 2.2.4 [Aug-26-2002]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.

Query= chimp|31266849857412 12 3149 bases
         (3147 letters)

Database: ../HumanGenome/ChrY.fasta 
           25 sequences; 22,743,943 total letters

Searching.done

 ***** No hits found ******

  Database: ../HumanGenome/ChrY.fasta
    Posted date:  Sep 20, 2002  4:59 PM
  Number of letters in database: 22,743,943
  Number of sequences in database:  25
  
Lambda     K      H
    1.39    0.750     1.39 

Gapped
Lambda     K      H
    1.39    0.750     1.39 


Matrix: blastn matrix:1 -8
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 43,559
Number of Sequences: 25
Number of extensions: 43559
Number of successful extensions: 2
Number of sequences better than 1.0e-100: 0
length of query: 3147
length of database: 22,743,943
effective HSP length: 17
effective length of query: 3130
effective length of database: 22,743,518
effective search space: 71187211340
effective search space used: 71187211340
T: 0
A: 0
X1: 6 (12.0 bits)
X2: 15 (30.0 bits)
S1: 12 (24.4 bits)
S2: 184 (368.4 bits)


-----Original Message-----
From:	Simon Brocklehurst [mailto:simon.brocklehurst@CambridgeAntibody.com]
Sent:	Thu 10/3/2002 11:41 AM
To:	Doug Rusch
Cc:	biojava-l@biojava.org
Subject:	Re: [Biojava-l] blast parsing and empty hits


Doug Rusch wrote:
> 
> Hi all,
> 
> I am relatively new to the biojava community and I have a question about the blast parser in biojava. Looking at the code and judging from its behavior, the blast parser has no way to deal with empty blast reports (where no hit was found). My questions are: Am I missing something obvious? If not, are people really using this blast parser? If so, how do you handle this behavior? If not, what do people actually use to parse blast output?
> 

Doug,

Can you be a bit more specific, e.g. say what you're trying to do, what
behaviour you're observing, which classes you're using, and which version
of Blast (NCBI Blast, Wu Blast, program version numbers) you're using?

The parser is not designed to do anything interesting with empty blast
reports - what would you *like* to be able to do with them?

Simon
--
Simon M. Brocklehurst, Ph.D.
Director of Informatics & Robotics
Cambridge Antibody Technology
The Science Park, Melbourn, Cambridgeshire, UK
http://www.CambridgeAntibody.com/
mailto:simon.brocklehurst@CambridgeAntibody.com