[Biojava-l] SAXException with BLAST errors

W. Eric Trull wetrull at yahoo.com
Wed Dec 14 11:58:28 EST 2005


Thanks for the suggestion, Mark.  I emailed NCBI and the gist of the reply
was:

These SeqPortNew errors usually indicate a problem in the formatting process;
the #'s are certainly not normal. Is this the only database entry that
generates errors?

So I dug a little deeper on 1ML5 to discover that it has a chain 'e' and a
chain 'E'.  When I created my FASTA file to feed to formatdb I made the
deflines of the form pdb|<id>|<chain>, but in uppercase.  So I had two
entries with the same defline but different sequences.  I think this is my
problem and am working on fixing it now.
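
A minimal sketch of how such colliding deflines could be spotted before
running formatdb.  It only illustrates the check; it is not the actual fix.
The class name DeflineCollisions, the method findCollisions, and the sample
sequences in main are made up for illustration.  It assumes deflines of the
form >pdb|<id>|<chain> and groups them by their uppercased identifier:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of BioJava or the NCBI tools.
public class DeflineCollisions {

    /**
     * Groups FASTA defline identifiers by their uppercased form and keeps
     * only the groups with more than one entry, i.e. identifiers that are
     * (or would become) identical once forced to uppercase.
     */
    public static Map<String, List<String>> findCollisions(List<String> fastaLines) {
        Map<String, List<String>> byUpper = new LinkedHashMap<>();
        for (String line : fastaLines) {
            if (!line.startsWith(">")) {
                continue; // sequence data, not a defline
            }
            // The identifier is the text between '>' and the first whitespace.
            String id = line.substring(1).trim().split("\\s+")[0];
            byUpper.computeIfAbsent(id.toUpperCase(Locale.ROOT), k -> new ArrayList<>())
                   .add(id);
        }
        // Drop identifiers that occur only once.
        byUpper.values().removeIf(ids -> ids.size() < 2);
        return byUpper;
    }

    public static void main(String[] args) {
        // Placeholder records; the sequences are not real PDB data.
        List<String> lines = Arrays.asList(
            ">pdb|1ML5|e placeholder description", "MKV",
            ">pdb|1ML5|E 30S Ribosomal Protein S2", "MAT",
            ">pdb|1ABC|A placeholder description", "GGS");
        System.out.println(findCollisions(lines));
    }
}
```

Because the grouping key is the uppercased identifier, the check flags both
mixed-case pairs such as 'e'/'E' and exact duplicates in a file that has
already been uppercased.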

Thanks.

-Eric Trull

--- mark.schreiber at novartis.com wrote:

> I would send NCBI your test sequence, the blast output and the version of 
> BLAST and ask them if this is "normal". I have found them to be very 
> responsive in the past. If it is normal then we need to fix biojava to 
> cope.
> 
> - Mark
> 
> 
> 
> 
> 
> "W. Eric Trull" <wetrull at yahoo.com>
> 12/13/2005 09:42 AM
> 
>  
>         To:     Mark Schreiber/GP/Novartis at PH
>         cc:     biojava-l at biojava.org,
> biojava-l-bounces at portal.open-bio.org
>         Subject:        Re: [Biojava-l] SAXException with BLAST errors
> 
> 
> No, I use BioJava to write the user's query sequence as a fasta file before
> feeding it to BLAST.  I just copied a differently formatted sequence into my
> post.
> 
> Thanks.
> 
> -Eric Trull
> 
> --- mark.schreiber at novartis.com wrote:
> 
> > Not exactly sure what the problem is here but it looks like your input is
> > not in FASTA format so that might be causing a problem??
> > 
> > 
> > 
> > 
> > 
> > "W. Eric Trull" <wetrull at yahoo.com>
> > Sent by: biojava-l-bounces at portal.open-bio.org
> > 12/13/2005 08:22 AM
> > 
> > 
> >         To:     biojava-l at biojava.org
> >         cc:     (bcc: Mark Schreiber/GP/Novartis)
> >         Subject:        [Biojava-l] SAXException with BLAST errors
> > 
> > 
> > Hello all,
> > 
> > Some of you may remember that I've been creating a Java application to front
> > a BLAST web service.  Everything is working great except some user found the
> > random sequence that causes problems (gotta love those users).  I'm using the
> > BlastXMLParserFacade to parse NCBI BLAST (2.2.12) XML output.  I think I have
> > two problems; one is an NCBI BLAST problem and the other is with BioJava's
> > BlastXMLParserFacade.  Any help/advice would be appreciated, especially if I
> > have to explain the problem to NCBI - biology is not my strong suit.
> > 
> > Here is the relevant BioJava stack trace:
> > 
> > org.xml.sax.SAXException: <Hsp> is non-compliant.
> >     at org.biojava.bio.program.sax.blastxml.HspHandler.endElementHandler(HspHandler.java:362)
> >     at org.biojava.bio.program.sax.blastxml.StAXFeatureHandler.endElement(StAXFeatureHandler.java:235)
> >     at org.biojava.utils.stax.SAX2StAXAdaptor.endElement(SAX2StAXAdaptor.java:153)
> >     at org.apache.xerces.parsers.SAXParser.endElement(SAXParser.java:1403)
> >     at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1456)
> >     at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1260)
> >     at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
> >     at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1081)
> >     at org.biojava.bio.program.sax.blastxml.BlastXMLParserFacade.parse(BlastXMLParserFacade.java:180)
> > 
> > Here is STDERR from NCBI BLAST on Sun Solaris:
> > 
> > [blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
> > [blastall] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
> > [blastall] ERROR:  [065.106]  : /var/tmp/blast39961.tmpOutput
> > BlastOutput.iterations.E.hits.E.hsps.E.<hseq>
> > Invalid value(s) [-3] in VisibleString
> > [ýýýýýýýýýýýýýýýýý----------ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý 
> 
> > ...]
> > 
> > Here is what I get from NCBI BLAST on Windows XP:
> > 
> > [NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
> > [NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(263) >= len(256)
> > [NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(280) >= len(256)
> > [NULL_Caption] ERROR: ncbiapi [000.000]  : SeqPortNew: pdb|1ML5|E start(313) >= len(256)
> > 
> > Here is how I started BLAST:
> > 
> > /home/etrull/developer/blast-sparc64-solaris-2.2.12/bin/blastall -p blastp -d
> > /home/etrull/developer/blast/current/pdb -i /var/tmp/fasta39960.tmp -m 7 -o
> > /var/tmp/blast39961.tmp -b 0
> > 
> > Here is my input sequence:
> > 
> > MLPRETDEEP EEPGRRGSFV EMVDNLRGKS GQGYYVEMTV GSPPQTLNIL VDTGSSNFAV GAAPHPFLHR
> > YYQRQLSSTY RDLRKGVYVP YTQGAWAGEL GTDLVSIPHG PNVTVRANIA AITESDKFFI NGSNWEGILG
> > LAYAEIARPD DSLEPFFDSL VKQTHVPNLF SLQLCGAGFP LNQSEVLASV GGSMIIGGID HSLYTGSLWY
> > TPIRREWYYE VIIVRVEING QDLKMDCKEY NYDKSIVDSG TTNLRLPKKV FEAAVKSIKA ASSTEKFPDG
> > FWLGEQLVCW QAGTTPWNIF PVISLYLMGE VTNQSFRITI LPQQYLRPVE DVATSQDDCY KFAISQSSTG
> > TVMGAVIMEG FYVVFDRARK RIGFAVSACH VHDEFRTAAV EGPFVTLDME DCGYN
> > 
> > Here is the regular BLAST output for pdb|1ML5|E.  It seems odd to me that the
> > identities and positives are both zero - why is this even showing up as a
> > similar sequence?
> > 
> > >pdb|1ML5|E 30S Ribosomal Protein S2
> >           Length = 256
> > 
> >  Score = 28.1 bits (61), Expect = 5.8
> >  Identities = 0/71 (0%), Positives = 0/71 (0%), Gaps = 10/71 (14%)
> > 
> > Query: 99  ELGTDLVSIPHGPNVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDDSLEPFFD 158
> > 
> > Sbjct: 264 ---------- 313
> > 
> > Query: 159 SLVKQTHVPNL 169
> > 
> > Sbjct: 314             324
> > 
> > 
> > Here is the XML BLAST output for pdb|1ML5|E.  Notice the second <Hsp_hseq>
> > has a bunch of "#" signs.  Is this valid in BioJava?
> > 
> >         <Hit>
> >           <Hit_num>146</Hit_num>
> >           <Hit_id>pdb|1ML5|E</Hit_id>
> >           <Hit_def>30S Ribosomal Protein S2</Hit_def>
> >           <Hit_accession>1ML5_E</Hit_accession>
> >           <Hit_len>256</Hit_len>
> >           <Hit_hsps>
> >             <Hsp>
> >               <Hsp_num>1</Hsp_num>
> >               <Hsp_bit-score>28.1054</Hsp_bit-score>
> >               <Hsp_score>61</Hsp_score>
> >               <Hsp_evalue>5.76848</Hsp_evalue>
> >               <Hsp_query-from>99</Hsp_query-from>
> 
=== message truncated ===


