[Biopython-dev] Blast records

Michiel de Hoon mjldehoon at yahoo.com
Wed Sep 23 13:51:04 UTC 2009


--- On Tue, 9/22/09, Peter <biopython at maubp.freeserve.co.uk> wrote:
> As I recall (backed up by what I wrote in the tutorial),
> when I last checked, the plain text PSI-BLAST output
> (i.e. from the command line tool blastpgp) included a
> lot of information missing in the XML output. Perhaps
> this has improved? If it hasn't, I am inclined to leave
> things as they are. If the current PSI-BLAST outputs
> more details in the XML we may be able to do a better job.

As far as I can tell, the XML contains the same information as the plain-text psi-blast output, but our XML parser doesn't parse it correctly, since it assumes it is dealing with regular blast rather than psi-blast.

> The next bit is my recollection of some of the background
> to this:
> Classic BLAST (and also RPS-BLAST) allow multiple queries
> and use the "iterator" block in the XML file for each query.
> This was an odd choice of naming, but I think the XML tag was
> originally only intended for the PSI-BLAST output where each
> "iteration" block in the XML corresponds to each step of the 
> algorithm. You may recall early versions of BLAST would output 
> "concatenated" XML files for multiple queries - which were not
> true XML files.

That is correct. To make things more complex, if you run psi-blast with multiple queries you get concatenated XML files again, with the iteration blocks corresponding to the psi-blast iterations for each query.
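To illustrate the concatenation issue, here is a rough sketch of how such a file could be split back into individual XML documents before parsing. It assumes that every concatenated document starts with its own "<?xml" declaration; the function name split_concatenated_xml is made up for this example and is not part of Biopython.

```python
def split_concatenated_xml(text, marker="<?xml"):
    """Split a string holding several XML documents into one string each.

    Assumption: every document begins with an XML declaration ("<?xml").
    """
    # Collect the offset of every XML declaration in the text.
    starts = []
    pos = text.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = text.find(marker, pos + len(marker))

    # No declaration found: treat the whole text as a single document.
    if not starts:
        return [text] if text.strip() else []

    # Each document runs from its declaration to the next one (or the end).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

Each returned string could then be handed to an XML parser separately, so that the iteration blocks of one query never get mixed with those of the next.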

> I guess they fixed this by reusing the existing "iteration"
> structure for multiple queries (rather than adding new XML
> tags). With this in mind the current parsing of the XML from
> PSI-BLAST makes sense.

I don't know if it really makes sense. For a single psi-blast query, we're getting multiple Blast records. For multiple psi-blast queries, we're iterating over the iteration blocks while ignoring the fact that they can come from different queries.

Ideally, we should be able to see from the XML whether it was regular blast with multiple queries, or psi-blast with a single query. Right now that is possible by looking at the query-def lines, but I wonder if NCBI is considering a better solution for this. I'll write an email to them to find out.
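As a minimal sketch of the query-def approach, the snippet below groups consecutive iteration blocks that share the same query-def, using only the standard library rather than the Biopython parser. It assumes each Iteration element carries Iteration_query-def and Iteration_iter-num children; the function name group_iterations_by_query is hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def group_iterations_by_query(xml_text):
    """Return [(query_def, [iter_num, ...]), ...] for one BLAST XML document.

    Assumption: each <Iteration> has <Iteration_query-def> and
    <Iteration_iter-num> children.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query_def = iteration.findtext("Iteration_query-def", default="")
        iter_num = int(iteration.findtext("Iteration_iter-num", default="0"))
        rows.append((query_def, iter_num))
    # Consecutive blocks with the same query-def are taken to belong together.
    return [
        (query_def, [num for _, num in group])
        for query_def, group in groupby(rows, key=lambda row: row[0])
    ]
```

Under these assumptions, a single group holding several iteration blocks would suggest psi-blast, while several groups of one block each would suggest regular blast with multiple queries.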

--Michiel


      

