[Bioperl-l] getting database hit sequences - from BLAST databases

Mon, 21 Oct 2002 09:35:20 +0100 (BST)

> I would like to manually align (Needleman-Wunsch) my query sequences with 
> the database hits that Blast found, so somehow I need to get the sequence 
> of the database hits (Subjects). I want the entire sequence that is in the 
> database for this subject, not only the portion that formed a HSP with 
> some part of my query sequence.
> 
> After excessive reading of the documentation I concluded that there is no 
> direct possibility to do so.

Modern versions of blast allow you to index databases and retrieve
sequences by accession number. For NCBI-BLAST, inculde the -o option with
formatdb and retrieve with the fastacmd program. For WU-BLAST, include the
-I flag with xdformat or re-index later with -X, and use the xdget
program. Note that if you're using your own identifiers like ">FOO", when
you fetch these from an NCBI-BLAST database, they will come out of the
database like ">lcl|FOO" because there are internal namespaces. This does
not happen with xdget as the internal namespace is nameless. One downside
to xdget is that it does not currently support virtual databases, but with
WU-BLAST you ought not to need virtual databases because large files are
supported well.

I'm not an active Bioperl developer/user, but in my opinion it's probably
a good idea for Bioperl to provide transparent access to database
sequences (using fastacmd, xdget, or other means) through the object
interface. This would be a relatively easy modification to the system
requiring (1) determining which flavor of BLAST was used (2) figuring out
from the program name if the database is aa or nt (3) binding
$sbjct->sequence() to make a call to fastacmd or xdget (or some other
mechanism). This could be done directly from the report, or explicitly set
at the level of the report (or multi-report).

There are potential errors to deal with such as: (a) there is no local
database (b) the database was not indexed (c) the BLAST databases are not
accessible from the filesystem (d) the report is out of sync with the
database, (e) virtual databases were used, and (f) there are redundant
identifiers (so what you get isn't what you hope for). The issues of 'd'
and 'f' are particularly difficult because they don't show up as errors at
run time though it would be possible to trap 'd' in WU-BLAST using
xdformat -i, though that would be a real pain for each call. And it is
possible to get duplicates with fastacmd with -a.

But for your purposes, in the absense of such a slick interface, you ought
to be able to manage a workaround pretty easily using one of these
commands:

	open(GET, "xdget -n $db \"$id\" |");
	open(GET, "xdget -p $db \"$id\" |");
	open(GET, "fastacmd -d \"$db\" -s \"$id\" |");
	open(GET, "fastacmd -d \"$db\" -s \"$id\" |");

Note that I included quotes around the identifier because if you're using
NCBI-style deflines they contain pipe symbols, and I also put quotes
around the NCBI database as this is how one uses virtual databases.

Lastly, if you're getting unexpected behavior from your N-W alignment,
consider using a version that doesn't penalize end-gaps.

-Ian