[Bioperl-l] problems parsing XML results from BLAST+ version of psiblast running in batch mode

Scott Markel Scott.Markel at accelrys.com
Tue Jun 18 15:54:24 UTC 2013


Short version -

How do I use the Bio::Search::* modules to parse the XML results from the BLAST+ version of psiblast running in batch mode?  A single running sequence of iteration numbers is used for all queries, so I can't tell which iteration goes with which query sequence.

Long version -

I'm running NCBI BLAST+ psiblast (version 2.2.27+) in batch mode with XML output.  Unlike the legacy BLAST version, which creates a <BlastOutput>...</BlastOutput> tag pair for each query sequence, the BLAST+ version creates a single <BlastOutput>...</BlastOutput> tag pair containing all iterations for all query sequences.  The iteration numbers run continuously across the query sequences, i.e., they don't restart for a new query sequence.

So how can I tell which iteration goes with which query sequence?

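There are <BlastOutput_query-ID>...</BlastOutput_query-ID> and <BlastOutput_query-def>...</BlastOutput_query-def> tag pairs, along with the matching per-iteration <Iteration_query-ID>...</Iteration_query-ID> and <Iteration_query-def>...</Iteration_query-def> tag pairs, that could be used to match iterations to query sequences, but I can't find any subroutines in Bio::Search::Iteration::GenericIteration that provide access to these values.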

An XML output file fragment showing the tag pairs is pasted below.

Any suggestions on workarounds or a pointer to something obvious that I'm missing would be greatly appreciated.

Scott

#########################

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>psiblast</BlastOutput_program>
...
  <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
  <BlastOutput_query-def>lcl|1 no description available</BlastOutput_query-def>
  <BlastOutput_query-len>100</BlastOutput_query-len>
  <BlastOutput_param>
...
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
      <Iteration_query-def>lcl|1 no description available</Iteration_query-def>
      <Iteration_query-len>100</Iteration_query-len>
      <Iteration_hits>
...
      </Iteration_hits>
      <Iteration_stat>
...
      </Iteration_stat>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
      <Iteration_query-def>lcl|2 no description available</Iteration_query-def>
      <Iteration_query-len>100</Iteration_query-len>
      <Iteration_hits>
...
      </Iteration_hits>
      <Iteration_stat>
...
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
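
As a possible stopgap outside of BioPerl, the per-iteration <Iteration_query-ID> values shown in the fragment can be read straight from the XML and used to group the running iteration numbers by query. The following is only a minimal Python sketch using the standard library. The function names and the sample data are made up for illustration (the sample assumes Query_1 ran two rounds and Query_2 one), and it does not touch the Bio::Search::* API at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the same layout as the fragment;
# the hit and statistics elements are omitted.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>psiblast</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>3</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def group_iterations_by_query(xml_text):
    """Map each Iteration_query-ID to its list of running iteration numbers."""
    root = ET.fromstring(xml_text)
    groups = {}
    for iteration in root.iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-ID")
        iter_num = int(iteration.findtext("Iteration_iter-num"))
        groups.setdefault(query_id, []).append(iter_num)
    return groups


def label_rounds(groups):
    """Return (query_id, running_iter_num, round_within_query) tuples."""
    labelled = []
    for query_id, iter_nums in groups.items():
        # Restart the round counter at 1 for every query.
        for round_num, iter_num in enumerate(iter_nums, start=1):
            labelled.append((query_id, iter_num, round_num))
    return labelled


if __name__ == "__main__":
    for row in label_rounds(group_iterations_by_query(SAMPLE_XML)):
        print(row)
```

Grouping on the query ID rather than on the iteration number sidesteps the fact that the numbers don't restart; the per-query round number is then just the position within each group.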


Scott Markel, Ph.D.
Principal Bioinformatics Architect  email:  smarkel at accelrys.com
Accelrys (Pipeline Pilot R&D)       mobile: +1 858 205 3653
10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
San Diego, CA 92121                 fax:    +1 858 799 5222
USA                                 web:    http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Secretary, Board of Directors:
    International Society for Computational Biology
Chair: ISCB Publications and Communications Committee
Associate Editor: PLOS Computational Biology
Editorial Board: Briefings in Bioinformatics






