[Open-bio-l] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
Peter Cock
p.j.a.cock at googlemail.com
Wed May 4 10:36:57 UTC 2011
On Wed, May 4, 2011 at 10:59 AM, Michal <mictadlo at gmail.com> wrote:
> Hi Peter,
> Do you have the script which read
>
> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>
>
> and what would be the correct output?
>
> Thank you in advance.
>
> Cheers,
> Michal
Hi Michal,
I'm not quite sure what you're asking, but I'll try. First, the three
data files:
$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta
$ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/rhodopsin_proteins.fasta
The query file has four sequences,
$ grep -c "^>" four_human_proteins.fasta
4
$ grep "^>" four_human_proteins.fasta
>sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1
>sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2
>sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4
>sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1
Based on past experience, I would expect 4 iteration blocks in the
XML, but in this case I have 24:
$ grep "<Iteration>" -c blastp_four_human_vs_rhodopsin.xml
24
Notice we get 6 iterations for each query (4 times 6 is 24):
$ grep "<Iteration_query-ID>" blastp_four_human_vs_rhodopsin.xml
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
<Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
Now, using the two FASTA files directly and re-running blastp, what do I get?
$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
| grep "<Iteration>" -c
24
Or again with -parse_deflines, which changes how the hit ID/def is presented:
$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
-parse_deflines | grep "<Iteration>" -c
24
How about older versions?
$ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query
four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
BLAST engine error: XML formatting is only supported for a database search
I'll have to make a blast database first...
$ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in
rhodopsin_proteins.fasta -dbtype prot
Building a new DB, current time: 05/04/2011 11:22:57
New DB name: rhodopsin_proteins.fasta
New DB title: rhodopsin_proteins.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 6 sequences in 0.105655 seconds.
$ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 |
grep "<Iteration>" -c
4
Look - just four identifiers as I expect! This also works if the database
is built with the -parse_seqids switch.
The same happens with older versions of BLAST+, one <Iteration>
block per query, so four iteration blocks for this example. I tried all
of 2.2.21+, 2.2.22+, 2.2.23+ and 2.2.24+ (running makeblastdb to
give a fresh database, then blastp).
That seems to demonstrate that bug is specific to the XML output
from FASTA vs FASTA (not FASTA vs DB), which is a new feature
in NCBI BLAST 2.2.25+
I will raise this with the NCBI, and report back.
However, even if the NCBI fix it in the next release, we (Bio*) may
want to update our parsers to cope with this quirk, or at least put a
warning in our BLAST XML parser documentation, as there will be
lots of installations of NCBI BLAST 2.2.25+ in the wild.
Peter
More information about the Open-Bio-l
mailing list