[Open-bio-l] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
Chris Fields
cjfields at illinois.edu
Tue Aug 9 20:09:05 UTC 2011
I'm reviving this thread to see what the current status is (if anything has changed). The bioperl parser has the same problem; at the moment we're bascially stuck until NCBI gives some indication as to whether this is a bug or not. Any word back from them yet?
(and agreed, it would be nice to have an external bug tracker from NCBI).
chris
On May 4, 2011, at 5:36 AM, Peter Cock wrote:
> On Wed, May 4, 2011 at 10:59 AM, Michal <mictadlo at gmail.com> wrote:
>> Hi Peter,
>> Do you have the script which read
>>
>> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>>
>>
>> and what would be the correct output?
>>
>> Thank you in advance.
>>
>> Cheers,
>> Michal
>
> Hi Michal,
>
> I'm not quite sure what you're asking, but I'll try. First, the three
> data files:
>
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta
> $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/rhodopsin_proteins.fasta
>
> The query file has four sequences,
>
> $ grep -c "^>" four_human_proteins.fasta
> 4
>
> $ grep "^>" four_human_proteins.fasta
>> sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1
>> sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2
>> sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4
>> sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1
>
> Based on past experience, I would expect 4 iteration blocks in the
> XML, but in this case I have 24:
>
> $ grep "<Iteration>" -c blastp_four_human_vs_rhodopsin.xml
> 24
>
> Notice we get 6 iterations for each query (4 times 6 is 24):
>
> $ grep "<Iteration_query-ID>" blastp_four_human_vs_rhodopsin.xml
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|Q9NSY1|BMP2K_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P06213|INSR_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
> <Iteration_query-ID>sp|P08100|OPSD_HUMAN</Iteration_query-ID>
>
> Now, using the two FASTA files directly and re-running blastp, what do I get?
>
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> | grep "<Iteration>" -c
> 24
>
> Or again with -parse_deflines, which changes how the hit ID/def is presented:
>
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> -parse_deflines | grep "<Iteration>" -c
> 24
>
> How about older versions?
>
> $ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query
> four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5
> BLAST engine error: XML formatting is only supported for a database search
>
> I'll have to make a blast database first...
>
> $ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in
> rhodopsin_proteins.fasta -dbtype prot
>
> Building a new DB, current time: 05/04/2011 11:22:57
> New DB name: rhodopsin_proteins.fasta
> New DB title: rhodopsin_proteins.fasta
> Sequence type: Protein
> Keep Linkouts: T
> Keep MBits: T
> Maximum file size: 1073741824B
> Adding sequences from FASTA; added 6 sequences in 0.105655 seconds.
>
> $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query
> four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 |
> grep "<Iteration>" -c
> 4
>
> Look - just four identifiers as I expect! This also works if the database
> is built with the -parse_seqids switch.
>
> The same happens with older versions of BLAST+, one <Iteration>
> block per query, so four iteration blocks for this example. I tried all
> of 2.2.21+, 2.2.22+, 2.2.23+ and 2.2.24+ (running makeblastdb to
> give a fresh database, then blastp).
>
> That seems to demonstrate that bug is specific to the XML output
> from FASTA vs FASTA (not FASTA vs DB), which is a new feature
> in NCBI BLAST 2.2.25+
>
> I will raise this with the NCBI, and report back.
>
> However, even if the NCBI fix it in the next release, we (Bio*) may
> want to update our parsers to cope with this quirk, or at least put a
> warning in our BLAST XML parser documentation, as there will be
> lots of installations of NCBI BLAST 2.2.25+ in the wild.
>
> Peter
More information about the Open-Bio-l
mailing list