[Biopython-dev] [Biopython - Bug #3354] (New) Legacy blast XML parser returns prematurely StopIteration

redmine at redmine.open-bio.org
Wed May 16 10:19:14 UTC 2012


Issue #3354 has been reported by Martin Mokrejš.

----------------------------------------
Bug #3354: Legacy blast XML parser returns prematurely StopIteration
https://redmine.open-bio.org/issues/3354

Author: Martin Mokrejš
Status: New
Priority: Normal
Assignee: 
Category: 
Target version: 
URL: 


Hi,
  I am parsing some BLAST 2.2.24 XML output and the last record I get is the one from
iteration 124. In the file, that entry is followed by an extra <Iteration> section whose
<Iteration_iter-num> is 1 and which has no query ID, definition or length; this is
probably the culprit. I will try a newer legacy BLAST, but could Biopython perhaps
work around this bug in the XML input?



<pre>
blastall -p blastn -A 4 -i SRR068315.fasta -d my_targets.fasta -F 0 -S 1 -r 2 -e 10e-30 -m 7
</pre>
<pre>

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>blastn 2.2.24 [Aug-08-2010]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&quot;,  Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>my_targets.fasta</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>FYUQ5C204IQCOE length=283 xy=3463_2076 region=4 run=R_2009_07_08_19_30_38_</BlastOutput_query-def>
  <BlastOutput_query-len>318</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>1e-29</Parameters_expect>
      <Parameters_sc-match>2</Parameters_sc-match>
      <Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
      <Parameters_gap-open>5</Parameters_gap-open>
      <Parameters_gap-extend>2</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
[cut]
    <Iteration>
      <Iteration_iter-num>124</Iteration_iter-num>
      <Iteration_query-ID>lcl|124_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204JXGMI length=44 xy=3954_2264 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>350</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>125</Iteration_iter-num>
      <Iteration_query-ID>lcl|125_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204JFG82 length=173 xy=3749_2948 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>208</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>126</Iteration_iter-num>
      <Iteration_query-ID>lcl|126_0</Iteration_query-ID>
      <Iteration_query-def>FYUQ5C204I2D3A length=146 xy=3600_2628 region=4 run=R_2009_07_08_19_30_38_</Iteration_query-def>
      <Iteration_query-len>205</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>22</Statistics_db-num>
          <Statistics_db-len>9262</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.41</Statistics_kappa>
          <Statistics_lambda>0.625</Statistics_lambda>
          <Statistics_entropy>0.78</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>


</pre>

Grepping for the iteration numbers, I see a few more cases like that further on in the XML file:
<pre>

      <Iteration_iter-num>234</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>235</Iteration_iter-num>
      <Iteration_iter-num>236</Iteration_iter-num>

      <Iteration_iter-num>345</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>346</Iteration_iter-num>
      <Iteration_iter-num>347</Iteration_iter-num>

      <Iteration_iter-num>450</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>451</Iteration_iter-num>
      <Iteration_iter-num>452</Iteration_iter-num>

      <Iteration_iter-num>555</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>556</Iteration_iter-num>
      <Iteration_iter-num>557</Iteration_iter-num>

      <Iteration_iter-num>655</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>656</Iteration_iter-num>
      <Iteration_iter-num>657</Iteration_iter-num>

      <Iteration_iter-num>759</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>760</Iteration_iter-num>
      <Iteration_iter-num>761</Iteration_iter-num>

      <Iteration_iter-num>859</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>860</Iteration_iter-num>
      <Iteration_iter-num>861</Iteration_iter-num>

      <Iteration_iter-num>956</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>957</Iteration_iter-num>
      <Iteration_iter-num>958</Iteration_iter-num>

      <Iteration_iter-num>1050</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1051</Iteration_iter-num>
      <Iteration_iter-num>1052</Iteration_iter-num>

      <Iteration_iter-num>1145</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1146</Iteration_iter-num>
      <Iteration_iter-num>1147</Iteration_iter-num>

      <Iteration_iter-num>1239</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1240</Iteration_iter-num>
      <Iteration_iter-num>1241</Iteration_iter-num>

      <Iteration_iter-num>1333</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1334</Iteration_iter-num>
      <Iteration_iter-num>1335</Iteration_iter-num>

      <Iteration_iter-num>1430</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1431</Iteration_iter-num>
      <Iteration_iter-num>1432</Iteration_iter-num>

      <Iteration_iter-num>1523</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1524</Iteration_iter-num>
      <Iteration_iter-num>1525</Iteration_iter-num>

      <Iteration_iter-num>1610</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1611</Iteration_iter-num>
      <Iteration_iter-num>1612</Iteration_iter-num>

      <Iteration_iter-num>1703</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1704</Iteration_iter-num>
      <Iteration_iter-num>1705</Iteration_iter-num>

      <Iteration_iter-num>1792</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1793</Iteration_iter-num>
      <Iteration_iter-num>1794</Iteration_iter-num>

      <Iteration_iter-num>1881</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>1882</Iteration_iter-num>
      <Iteration_iter-num>1883</Iteration_iter-num>


</pre>
After that, the problem does not recur up to the end of the XML file at:
<pre>
     <Iteration_iter-num>25698</Iteration_iter-num>
</pre>
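
The grep over the iteration numbers can also be reproduced in Python. The following is only a minimal sketch, not part of Biopython: the function name find_iter_num_resets is made up for illustration, and it assumes each <Iteration_iter-num> element sits on a single line, as in the excerpts above. It reports every place where the iteration number fails to increase.

```python
import re

# Matches one <Iteration_iter-num> element written on a single line.
ITER_NUM = re.compile(r"<Iteration_iter-num>(\d+)</Iteration_iter-num>")


def find_iter_num_resets(lines):
    """Yield (previous, current) pairs where iter-num does not increase.

    Hypothetical helper for illustration; `lines` is any iterable of
    strings, for example an open file handle.
    """
    previous = None
    for line in lines:
        match = ITER_NUM.search(line)
        if match is None:
            continue
        current = int(match.group(1))
        if previous is not None and current <= previous:
            yield previous, current
        previous = current


# Small in-memory sample mirroring the excerpt around iteration 124.
sample = """\
      <Iteration_iter-num>124</Iteration_iter-num>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_iter-num>125</Iteration_iter-num>
      <Iteration_iter-num>126</Iteration_iter-num>
"""
print(list(find_iter_num_resets(sample.splitlines())))  # prints [(124, 1)]
```

For a real file, an open handle can be passed instead of the sample lines, since the function reads line by line and does not load the whole XML into memory.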


I am attaching the XML file with the entries after roughly the last problematic place removed, and with the two closing XML lines added so that the file should be valid XML again.
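
One possible workaround on the input side is to strip the extra sections before parsing. The sketch below uses only the standard library and rests on an assumption taken from the excerpt above: that the spurious <Iteration> blocks are exactly those without an <Iteration_query-ID> child. The function name drop_iterations_without_query is hypothetical, and whether the Biopython parser then reads the cleaned file to the end has not been tested here.

```python
import xml.etree.ElementTree as ET


def drop_iterations_without_query(root):
    """Remove <Iteration> elements that lack <Iteration_query-ID>.

    Hypothetical helper for illustration. Returns the number of
    removed elements. Assumes the spurious blocks are exactly the
    ones without a query ID, as in the excerpt from the report.
    """
    dropped = 0
    # Copy to a list first so the tree is not modified while walking it.
    for container in list(root.iter("BlastOutput_iterations")):
        for iteration in list(container.findall("Iteration")):
            if iteration.find("Iteration_query-ID") is None:
                container.remove(iteration)
                dropped += 1
    return dropped


# Small in-memory sample mirroring the structure around iteration 124.
sample_xml = """<BlastOutput_iterations>
  <Iteration>
    <Iteration_iter-num>124</Iteration_iter-num>
    <Iteration_query-ID>lcl|124_0</Iteration_query-ID>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>1</Iteration_iter-num>
  </Iteration>
  <Iteration>
    <Iteration_iter-num>125</Iteration_iter-num>
    <Iteration_query-ID>lcl|125_0</Iteration_query-ID>
  </Iteration>
</BlastOutput_iterations>"""

sample_root = ET.fromstring(sample_xml)
dropped_count = drop_iterations_without_query(sample_root)
remaining = [it.findtext("Iteration_iter-num") for it in sample_root.findall("Iteration")]
print(dropped_count, remaining)  # prints 1 ['124', '125']
```

For a whole file, the tree could be loaded with ET.parse and written back with its write method. Note that ElementTree does not keep the DOCTYPE line when writing, and loading the full tree needs enough memory for the entire XML file.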


----------------------------------------