[Biojava-l] SAXException with BLAST errors

W. Eric Trull wetrull at yahoo.com
Mon Dec 12 19:22:48 EST 2005


Hello all,

Some of you may remember that I've been creating a Java application to front
a BLAST web service.  Everything is working great, except that a user found a
sequence that causes problems (gotta love those users).  I'm using the
BlastXMLParserFacade to parse NCBI BLAST (2.2.12) XML output.  I think I have
two problems: one is an NCBI BLAST problem and the other is with BioJava's
BlastXMLParserFacade.  Any help/advice would be appreciated, especially if I
have to explain the problem to NCBI; biology is not my strong suit.

Here is the relevant BioJava stack trace:

org.xml.sax.SAXException: <Hsp> is non-compliant.
	at org.biojava.bio.program.sax.blastxml.HspHandler.endElementHandler(HspHandler.java:362)
	at org.biojava.bio.program.sax.blastxml.StAXFeatureHandler.endElement(StAXFeatureHandler.java:235)
	at org.biojava.utils.stax.SAX2StAXAdaptor.endElement(SAX2StAXAdaptor.java:153)
	at org.apache.xerces.parsers.SAXParser.endElement(SAXParser.java:1403)
	at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1456)
	at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1260)
	at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
	at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1081)
	at org.biojava.bio.program.sax.blastxml.BlastXMLParserFacade.parse(BlastXMLParserFacade.java:180)

Here is STDERR from NCBI BLAST on Sun Solaris:

[blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[blastall] ERROR:  [065.106]  : /var/tmp/blast39961.tmpOutput
BlastOutput.iterations.E.hits.E.hsps.E.<hseq>
Invalid value(s) [-3] in VisibleString
[ýýýýýýýýýýýýýýýýý----------ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý ...]

Here is what I get from NCBI BLAST on Windows XP:

[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(280) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(313) >= len(256)

Here is how I started BLAST:

/home/etrull/developer/blast-sparc64-solaris-2.2.12/bin/blastall -p blastp -d /home/etrull/developer/blast/current/pdb -i /var/tmp/fasta39960.tmp -m 7 -o /var/tmp/blast39961.tmp -b 0
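
Since blastall reports the problem on STDERR while still writing an XML file,
a front end could look at STDERR before handing the file to the parser.  The
following is only a rough sketch using the plain JDK; the class and method
names (BlastStderr, errorLines, runAndCollectErrors) are made up for
illustration, and matching on "] ERROR:" is an assumption based on the
messages shown in this post.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BioJava or NCBI BLAST.
class BlastStderr {

    /** Returns every STDERR line that looks like a "[program] ERROR: ..." message. */
    static List<String> errorLines(String stderr) {
        List<String> errors = new ArrayList<>();
        for (String line : stderr.split("\\R")) {
            if (line.contains("] ERROR:")) {
                errors.add(line.trim());
            }
        }
        return errors;
    }

    /** Runs the given command, captures its STDERR in a temp file, and returns the error lines. */
    static List<String> runAndCollectErrors(List<String> command)
            throws IOException, InterruptedException {
        File err = File.createTempFile("blast", ".stderr");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                    .redirectError(err)
                    .start();
            process.waitFor();
            // ISO-8859-1 maps every byte to a character, so stray bytes in the output cannot break decoding.
            String stderr = new String(Files.readAllBytes(err.toPath()), StandardCharsets.ISO_8859_1);
            return errorLines(stderr);
        } finally {
            err.delete();
        }
    }
}
```

If runAndCollectErrors returns a non-empty list for the command above, the
application could report those lines instead of (or before) attempting to
parse the XML.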

Here is my input sequence:

MLPRETDEEP EEPGRRGSFV EMVDNLRGKS GQGYYVEMTV GSPPQTLNIL VDTGSSNFAV GAAPHPFLHR
YYQRQLSSTY RDLRKGVYVP YTQGAWAGEL GTDLVSIPHG PNVTVRANIA AITESDKFFI NGSNWEGILG
LAYAEIARPD DSLEPFFDSL VKQTHVPNLF SLQLCGAGFP LNQSEVLASV GGSMIIGGID HSLYTGSLWY
TPIRREWYYE VIIVRVEING QDLKMDCKEY NYDKSIVDSG TTNLRLPKKV FEAAVKSIKA ASSTEKFPDG
FWLGEQLVCW QAGTTPWNIF PVISLYLMGE VTNQSFRITI LPQQYLRPVE DVATSQDDCY KFAISQSSTG
TVMGAVIMEG FYVVFDRARK RIGFAVSACH VHDEFRTAAV EGPFVTLDME
DCGYN

Here is the regular BLAST output for pdb|1ML5|E.  It seems odd to me that the
identities and positives are both zero - why is this even showing up as a
similar sequence?

>pdb|1ML5|E 30S Ribosomal Protein S2
          Length = 256

 Score = 28.1 bits (61), Expect = 5.8
 Identities = 0/71 (0%), Positives = 0/71 (0%), Gaps = 10/71 (14%)

Query: 99  ELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFD 158

Sbjct: 264 ----------                                                   313

Query: 159 SLVKQTHVPNL 169

Sbjct: 314             324


Here is the XML BLAST output for pdb|1ML5|E.  Notice the second <Hsp_hseq>
has a bunch of "#" signs.  Is this valid in BioJava?

        <Hit>
          <Hit_num>146</Hit_num>
          <Hit_id>pdb|1ML5|E</Hit_id>
          <Hit_def>30S Ribosomal Protein S2</Hit_def>
          <Hit_accession>1ML5_E</Hit_accession>
          <Hit_len>256</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>28.1054</Hsp_bit-score>
              <Hsp_score>61</Hsp_score>
              <Hsp_evalue>5.76848</Hsp_evalue>
              <Hsp_query-from>99</Hsp_query-from>
              <Hsp_query-to>169</Hsp_query-to>
              <Hsp_hit-from>264</Hsp_hit-from>
              <Hsp_hit-to>324</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>71</Hsp_align-len>
              <Hsp_qseq>ELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFDSLVKQTHVPNL</Hsp_qseq>
              <Hsp_hseq>#################----------############################################</Hsp_hseq>
              <Hsp_midline>                                                  
                    </Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
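
As a stopgap, hits like this one could be flagged before the file reaches
BlastXMLParserFacade.  Below is a minimal sketch with the JDK's own SAX
parser; it is not BioJava code.  The class name HspSanityCheck is made up,
and the two rules it applies are my assumptions: a hit coordinate must not
exceed <Hit_len>, and <Hsp_hseq> should contain only letters, "-" and "*".

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical pre-check, not part of BioJava.
class HspSanityCheck {

    /** Returns the Hit_id of every hit with out-of-range coordinates or odd characters in Hsp_hseq. */
    static Set<String> suspectHits(String xml) {
        final Set<String> suspects = new LinkedHashSet<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String hitId = "";
            private int hitLen = Integer.MAX_VALUE;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String value = text.toString().trim();
                switch (qName) {
                    case "Hit_id":
                        hitId = value;
                        break;
                    case "Hit_len":
                        hitLen = Integer.parseInt(value);
                        break;
                    case "Hsp_hit-from":
                    case "Hsp_hit-to":
                        // Assumed rule: a hit coordinate past the end of the subject is suspect.
                        if (Integer.parseInt(value) > hitLen) suspects.add(hitId);
                        break;
                    case "Hsp_hseq":
                        // Assumed rule: only residue letters, gaps ("-") and stops ("*") are expected.
                        if (!value.matches("[A-Za-z*-]+")) suspects.add(hitId);
                        break;
                    default:
                        break;
                }
                text.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not scan BLAST XML", e);
        }
        return suspects;
    }
}
```

For the hit above, both rules would fire: 264 and 324 are larger than the
<Hit_len> of 256, and the <Hsp_hseq> consists mostly of "#".  Real BLAST XML
starts with a DOCTYPE declaration, so the parser may need an EntityResolver
(or a local copy of the DTD) before this works on a complete output file.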

Thanks.

-Eric Trull

