[Biopython-dev] Using BLAST et al with multiple queries

Mon Dec 13 13:56:02 EST 2004

In short: does anyone else use FASTA files containing multiple sequences 
as input queries to standalone Blast?  How do you parse the output?

-----------------------------------------------------------------------

I have recently been doing a lot of repeated blast searches on the same
database, using a succession of different protein queries.

I have been using the standalone blast programs, and suspect that the 
database is reloaded from the disc each time blast is run.  This would 
explain the rather heavy disc usage.

From the NCBI's FAQ:
http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.shtml#Batch

> Q: How can I search a batch of sequences with BLAST?
> 
> There are three options for "Batch" BLAST searches:
> 
> 1) Web MegaBLAST EST analysis tool...

I want to use protein sequences, so this is N/A to me.

> 3) BLAST Network Client 'blastcl3'

I want to use local databases, so this is N/A to me.

> 2) Standalone BLAST executables: The Standalone BLAST executables are 
> command line programs which run BLAST searches against local 
> downloaded copies of the NCBI BLAST databases. The programs will 
> handle either a single large file with multiple FASTA query 
> sequences,

This is what I would like to do, but the current version of
Bio/Blast/NCBIStandalone only appears to look at the output for the
first query.

[In fact, I would argue that silently ignoring the results for the
subsequent queries is a bug...]

In the case of blastp (and therefore, I would expect, blastall in 
general), the output for the second and subsequent queries is in almost 
the same format, but with the database information omitted from the 
header.

That is, the first record's header has:
* Version string
* Reference
* Query= ... (NNN letters)
* Database

Subsequent records have:
* Version string
* Reference
* Query= ... (NNN letters)
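
As a rough illustration of the layout above, here is a minimal sketch of
splitting concatenated blastall text output into one chunk per query. It
assumes every record starts with a version line such as
"BLASTP 2.2.10 [Oct-19-2004]". The function name split_blast_records is
hypothetical and is not part of Bio/Blast/NCBIStandalone.

```python
import re

# Assumption: each record begins with a version line, i.e. the program
# name in upper case followed by a version number.
VERSION_LINE = re.compile(r"^T?BLAST[NPX] \d")


def split_blast_records(lines):
    """Yield one list of lines per query record (hypothetical helper)."""
    record = []
    for line in lines:
        # A new version line closes the previous record, if there is one.
        if VERSION_LINE.match(line) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record
```

Each chunk could then be handed to a parser that tolerates a missing
Database section in the header.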

For blastn and tblastn (blastall) there is also the -B option, which may
complicate things.

The rpsblast output is similar.  The first record has a header of:
* Version string
* Query= ... (NNN letters)

Subsequent records have a header of just:
* Query= ... (NNN letters)
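
Since only the first rpsblast record carries a version string, one option
is to split on the "Query=" lines and share the leading lines between all
the records. This is a minimal sketch under that assumption. The name
split_rps_records is hypothetical, and real output may need more care.

```python
def split_rps_records(lines):
    """Yield (header, record) pairs, one per "Query=" line.

    header holds the lines seen before the first "Query=" (the version
    string) and is the same list object for every record.
    """
    header = []
    record = None
    for line in lines:
        if line.startswith("Query="):
            # A new query closes the previous record, if there is one.
            if record is not None:
                yield header, record
            record = [line]
        elif record is None:
            header.append(line)
        else:
            record.append(line)
    if record is not None:
        yield header, record
```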

Incidentally, I have emailed the NCBI about the missing reference and 
database information, but got an automated reply that the recipient was 
on holiday until late December.

See also Bug 1715, where I have got rpsblast support working in Biopython:

http://bugzilla.open-bio.org/show_bug.cgi?id=1715

> or you can create a script to send multiple files one at a time...

This is effectively what I am doing with Biopython at the moment. It
works fine, but the speed is rather poor due to all the disc access.
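
For reference, the one-query-at-a-time approach can be sketched roughly as
below. This is not my actual script. The helper names iter_fasta and
blastall_command are hypothetical, and the only things assumed about
blastall are its -p (program), -d (database) and -i (query file) options.

```python
def iter_fasta(lines):
    """Yield (title, sequence) pairs from FASTA-formatted lines."""
    title, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new title line closes the previous sequence, if any.
            if title is not None:
                yield title, "".join(seq)
            title, seq = line[1:], []
        elif line and title is not None:
            seq.append(line)
    if title is not None:
        yield title, "".join(seq)


def blastall_command(program, database, query_path):
    """Build the argument list for one blastall run (hypothetical helper)."""
    # -p program, -d database, -i input query file
    return ["blastall", "-p", program, "-d", database, "-i", query_path]
```

Each sequence from iter_fasta would be written to its own temporary file
and blastall launched once per file, which is why the database ends up
being read again for every query.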

Peter
-- 
PhD Student
MOAC Doctoral Training Centre
University of Warwick