[Bioperl-l] problems parsing XML results from BLAST+ version of psiblast running in batch mode
Scott Markel
Scott.Markel at accelrys.com
Tue Jun 18 15:54:24 UTC 2013
Short version -
How do I use Bio::Search::* modules to parse the XML results from the BLAST+ version of psiblast running in batch mode? Only one set of iteration numbers is used, so I can't tell which iteration goes with which query sequence.
Long version -
I'm running NCBI BLAST+ psiblast (version 2.2.27+) in batch mode with XML output. Unlike the legacy BLAST version, which creates a <BlastOutput>...</BlastOutput> tag pair for each query sequence, the BLAST+ version creates a single <BlastOutput>...</BlastOutput> tag pair containing all iterations for all query sequences. The iteration numbers run continuously across the query sequences, i.e., they don't restart for a new query sequence.
So how can I tell which iteration goes with which query sequence?
There are <BlastOutput_query-ID>...</BlastOutput_query-ID> and <BlastOutput_query-def>...</BlastOutput_query-def> tag pairs that could be used to inspect the iterations, but there are no subroutines in Bio::Search::Iteration::GenericIteration providing access to these values.
An XML output file fragment showing the tag pairs is pasted below.
Any suggestions on workarounds or a pointer to something obvious that I'm missing would be greatly appreciated.
Scott
#########################
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>psiblast</BlastOutput_program>
...
<BlastOutput_query-ID>Query_1</BlastOutput_query-ID>
<BlastOutput_query-def>lcl|1 no description available</BlastOutput_query-def>
<BlastOutput_query-len>100</BlastOutput_query-len>
<BlastOutput_param>
...
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>lcl|1 no description available</Iteration_query-def>
<Iteration_query-len>100</Iteration_query-len>
<Iteration_hits>
...
</Iteration_hits>
<Iteration_stat>
...
</Iteration_stat>
</Iteration>
<Iteration>
<Iteration_iter-num>2</Iteration_iter-num>
<Iteration_query-ID>Query_2</Iteration_query-ID>
<Iteration_query-def>lcl|2 no description available</Iteration_query-def>
<Iteration_query-len>100</Iteration_query-len>
<Iteration_hits>
...
</Iteration_hits>
<Iteration_stat>
...
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
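
One possible workaround outside Bio::Search, offered only as a minimal sketch: in the fragment above each <Iteration> carries its own <Iteration_query-ID>, so the iterations can be grouped by that value with a plain XML parser. The example uses Python's standard library rather than BioPerl. The function name group_iterations_by_query and the SAMPLE_XML string are hypothetical, and the code assumes every <Iteration> contains both <Iteration_iter-num> and <Iteration_query-ID>, as in the fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the fragment above, with the elided parts left out.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_program>psiblast</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>Query_1</Iteration_query-ID>
    </Iteration>
    <Iteration>
      <Iteration_iter-num>2</Iteration_iter-num>
      <Iteration_query-ID>Query_2</Iteration_query-ID>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def group_iterations_by_query(xml_text):
    """Return {query ID: [iteration numbers]} in document order.

    Assumes each <Iteration> has <Iteration_iter-num> and
    <Iteration_query-ID> children, as in the pasted fragment.
    """
    root = ET.fromstring(xml_text)
    groups = {}
    for iteration in root.iter("Iteration"):
        query_id = iteration.findtext("Iteration_query-ID")
        iter_num = int(iteration.findtext("Iteration_iter-num"))
        groups.setdefault(query_id, []).append(iter_num)
    return groups


if __name__ == "__main__":
    print(group_iterations_by_query(SAMPLE_XML))
    # prints {'Query_1': [1], 'Query_2': [2]}
```

The position of an iteration number within a query's list gives the per-query round (first entry is round 1, and so on), which is the information the running iteration numbers alone do not provide.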
Scott Markel, Ph.D.