[Biojava-l] SAXException with BLAST errors

mark.schreiber at novartis.com mark.schreiber at novartis.com
Mon Dec 12 20:37:59 EST 2005


I'm not sure exactly what the problem is here, but it looks like your input
is not in FASTA format, which might be causing the problem.
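
If the file passed to blastall really does contain the space-separated
blocks shown in the quoted message, a small helper along these lines could
rewrite it as FASTA first. This is only a sketch: the class name, the method
name, and the "query1" identifier are made up for illustration, and it
assumes that whitespace and digits are the only non-residue characters in
the listing.

```java
public class FastaFormatter {

    /**
     * Builds a FASTA record: a ">" header line followed by the residues
     * wrapped at lineWidth characters per line.
     */
    public static String toFasta(String id, String rawSequence, int lineWidth) {
        // Strip spaces, newlines and position numbers that block-formatted
        // listings often contain (assumption: nothing else needs removing).
        String residues = rawSequence.replaceAll("[\\s\\d]+", "").toUpperCase();

        StringBuilder sb = new StringBuilder();
        sb.append('>').append(id).append('\n');
        for (int i = 0; i < residues.length(); i += lineWidth) {
            int end = Math.min(i + lineWidth, residues.length());
            sb.append(residues, i, end).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First blocks of the query from the quoted message, as an example.
        String raw = "MLPRETDEEP EEPGRRGSFV\nEMVDNLRGKS";
        System.out.print(toFasta("query1", raw, 60));
    }
}
```

Whether blastall actually needs this depends on what is in the temporary
input file; if that file already has a ">" header line, the format is
probably not the culprit.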





"W. Eric Trull" <wetrull at yahoo.com>
Sent by: biojava-l-bounces at portal.open-bio.org
12/13/2005 08:22 AM

 
        To:     biojava-l at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] SAXException with BLAST errors


Hello all,

Some of you may remember that I've been creating a Java application to front
a BLAST web service.  Everything is working great, except that some user
found a random sequence that causes problems (gotta love those users).  I'm
using the BlastXMLParserFacade to parse NCBI BLAST (2.2.12) XML output.  I
think I have two problems: one is an NCBI BLAST problem and the other is with
BioJava's BlastXMLParserFacade.  Any help/advice would be appreciated,
especially if I have to explain the problem to NCBI - biology is not my
strong suit.

Here is the relevant BioJava stack trace:

org.xml.sax.SAXException: <Hsp> is non-compliant.
        at org.biojava.bio.program.sax.blastxml.HspHandler.endElementHandler(HspHandler.java:362)
        at org.biojava.bio.program.sax.blastxml.StAXFeatureHandler.endElement(StAXFeatureHandler.java:235)
        at org.biojava.utils.stax.SAX2StAXAdaptor.endElement(SAX2StAXAdaptor.java:153)
        at org.apache.xerces.parsers.SAXParser.endElement(SAXParser.java:1403)
        at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1456)
        at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1260)
        at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
        at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1081)
        at org.biojava.bio.program.sax.blastxml.BlastXMLParserFacade.parse(BlastXMLParserFacade.java:180)

Here is STDERR from NCBI BLAST on Sun Solaris:

[blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[blastall] ERROR:  [065.106]  : /var/tmp/blast39961.tmpOutput
BlastOutput.iterations.E.hits.E.hsps.E.<hseq>
Invalid value(s) [-3] in VisibleString
[ýýýýýýýýýýýýýýýýý----------ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý 
...]

Here is what I get from NCBI BLAST on Windows XP:

[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(280) >= len(256)
[NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(313) >= len(256)

Here is how I started BLAST:

/home/etrull/developer/blast-sparc64-solaris-2.2.12/bin/blastall -p blastp -d /home/etrull/developer/blast/current/pdb -i /var/tmp/fasta39960.tmp -m 7 -o /var/tmp/blast39961.tmp -b 0

Here is my input sequence:

MLPRETDEEP EEPGRRGSFV EMVDNLRGKS GQGYYVEMTV GSPPQTLNIL VDTGSSNFAV GAAPHPFLHR
YYQRQLSSTY RDLRKGVYVP YTQGAWAGEL GTDLVSIPHG PNVTVRANIA AITESDKFFI NGSNWEGILG
LAYAEIARPD DSLEPFFDSL VKQTHVPNLF SLQLCGAGFP LNQSEVLASV GGSMIIGGID HSLYTGSLWY
TPIRREWYYE VIIVRVEING QDLKMDCKEY NYDKSIVDSG TTNLRLPKKV FEAAVKSIKA ASSTEKFPDG
FWLGEQLVCW QAGTTPWNIF PVISLYLMGE VTNQSFRITI LPQQYLRPVE DVATSQDDCY KFAISQSSTG
TVMGAVIMEG FYVVFDRARK RIGFAVSACH VHDEFRTAAV EGPFVTLDME DCGYN

Here is the regular BLAST output for pdb|1ML5|E.  It seems odd to me that the
identities and positives are both zero - why is this even showing up as a
similar sequence?

>pdb|1ML5|E 30S Ribosomal Protein S2
          Length = 256

 Score = 28.1 bits (61), Expect = 5.8
 Identities = 0/71 (0%), Positives = 0/71 (0%), Gaps = 10/71 (14%)

Query: 99  ELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFD 158

Sbjct: 264 ---------- 313

Query: 159 SLVKQTHVPNL 169

Sbjct: 314             324


Here is the XML BLAST output for pdb|1ML5|E.  Notice the second <Hsp_hseq>
has a bunch of "#" signs.  Is this valid in BioJava?

        <Hit>
          <Hit_num>146</Hit_num>
          <Hit_id>pdb|1ML5|E</Hit_id>
          <Hit_def>30S Ribosomal Protein S2</Hit_def>
          <Hit_accession>1ML5_E</Hit_accession>
          <Hit_len>256</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>28.1054</Hsp_bit-score>
              <Hsp_score>61</Hsp_score>
              <Hsp_evalue>5.76848</Hsp_evalue>
              <Hsp_query-from>99</Hsp_query-from>
              <Hsp_query-to>169</Hsp_query-to>
              <Hsp_hit-from>264</Hsp_hit-from>
              <Hsp_hit-to>324</Hsp_hit-to>
              <Hsp_query-frame>1</Hsp_query-frame>
              <Hsp_hit-frame>1</Hsp_hit-frame>
              <Hsp_gaps>10</Hsp_gaps>
              <Hsp_align-len>71</Hsp_align-len>
              <Hsp_qseq>ELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFDSLVKQTHVPNL</Hsp_qseq>
              <Hsp_hseq>#################----------############################################</Hsp_hseq>
              <Hsp_midline>  
                    </Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
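
One possible way to catch a hit like this before it reaches
BlastXMLParserFacade is to pre-scan the XML with the plain JDK SAX parser.
In the <Hit> above, Hsp_hit-from (264) is already past Hit_len (256), which
looks consistent with the "start(263) >= len(256)" messages from blastall.
The sketch below flags such HSPs, along with any Hsp_hseq containing
characters other than letters, "-" and "*". It does not use BioJava at all;
the class name, the allowed character set, and the range rule are
assumptions for illustration, not a description of what HspHandler checks
internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HspSanityCheck extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> problems = new ArrayList<>();
    private String hitId = "?";
    private int hitLen = -1;
    private int hitFrom = -1;
    private int hitTo = -1;

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Do not fetch an external DTD; hand the parser an empty one instead.
        return new InputSource(new StringReader(""));
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.toString().trim();
        switch (qName) {
            case "Hit_id": hitId = value; break;
            case "Hit_len": hitLen = Integer.parseInt(value); break;
            case "Hsp_hit-from": hitFrom = Integer.parseInt(value); break;
            case "Hsp_hit-to": hitTo = Integer.parseInt(value); break;
            case "Hsp_hseq":
                // Assumed alphabet: letters, gap '-' and stop '*'.
                if (!value.matches("[A-Za-z*-]*")) {
                    problems.add(hitId + ": unexpected characters in Hsp_hseq");
                }
                break;
            case "Hsp":
                if (hitLen >= 0 && Math.max(hitFrom, hitTo) > hitLen) {
                    problems.add(hitId + ": hit range " + hitFrom + ".." + hitTo
                            + " exceeds Hit_len " + hitLen);
                }
                hitFrom = -1;
                hitTo = -1;
                break;
            default:
                break;
        }
        text.setLength(0);
    }

    /** Parses the XML text and returns a list of suspicious HSPs. */
    public static List<String> check(String xml) {
        HspSanityCheck handler = new HspSanityCheck();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse BLAST XML", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String xml = "<Hit><Hit_id>pdb|1ML5|E</Hit_id><Hit_len>256</Hit_len>"
                + "<Hit_hsps><Hsp><Hsp_hit-from>264</Hsp_hit-from>"
                + "<Hsp_hit-to>324</Hsp_hit-to><Hsp_hseq>###---###</Hsp_hseq>"
                + "</Hsp></Hit_hsps></Hit>";
        for (String problem : check(xml)) {
            System.out.println(problem);
        }
    }
}
```

An application could run such a check first and skip or report the
offending hits, rather than letting the whole parse fail on one bad <Hsp>.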

Thanks.

-Eric Trull
_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l





