[Bioperl-l] blast output -> blast -m8 output
Amir Karger
akarger at CGR.Harvard.edu
Fri Dec 23 11:24:17 EST 2005
I'm writing a script that will take regular blast output and translate it to
blast -m8 tabular form. (The reverse transform won't work without re-doing
the alignments.)
I've attached the blast output for running 3 sequences against month.aa.
Below are the script, the script output, and the blast -m8 output. (Output
is the same for bioperl-1.4 and 1.5-RC1.)
So there are two questions:
1) To get the number of openings, I count /-+/g in $hsp->query_string .
$hsp->hit_string. Is this really the best way to do it? (A $hsp->ngaps
method would be nicer :)
2) Why does blast -m8 use longer IDs than regular blast? (See second column)
I guess this is an NCBI question, really. Other than the difference in IDs
and a couple space characters, the two outputs are the same.
3) Do you think this will work on all blast outputs, or will something
break?
Thanks,
-Amir Karger
---------
Here's the script:
#!/usr/local/bin/perl -w
use strict;
use Bio::SearchIO;
my $in = new Bio::SearchIO(-format => 'blast', -file => $ARGV[0]);
while (my $result = $in->next_result ) {
while (my $hit = $result->next_hit ) {
while (my $hsp = $hit->next_hsp ) {
my $pcid = $hsp->percent_identity;
my $len = $hsp->length('total');
my $mismatches = $len - $hsp->num_identical -
$hsp->gaps('total');
# Find runs of '-' in query, hit strings
my @gap_list = (($hsp->query_string . $hsp->hit_string) =~
/-+/g);
my $ngaps = @gap_list;
my @fields = (
$result->query_name, $hit->name, sprintf("%.2f",$pcid),
$len, $mismatches, $ngaps,
$hsp->start('query'), $hsp->end('query'),
$hsp->start('hit'), $hsp->end('hit'),
$hsp->evalue, $hsp->bits
);
print join("\t", @fields), "\n";
}
}
}
----------
Here's the output from running perl change_blast_to_tab.pl seqs.blp
Bacteriophage_1[M19348] ref|NP_037061.1| 40.32 62 33 1
28 89 1050 1107 6e-05 46.6
Bacteriophage_1[M19348] ref|XP_193814.5| 48.89 45 17 1
57 95 320 364 0.001 42.7
Bacteriophage_1[M19348] ref|XP_912463.1| 48.89 45 17 1
57 95 866 910 0.001 42.7
Bacteriophage_1[M19348] ref|XP_619329.2| 48.89 45 17 1
57 95 676 720 0.001 42.7
C.elegans_1_[Z49071] ref|XP_917828.1| 29.61 412 242 9
40 410 52 456 6e-43 173
C.elegans_1_[Z49071] gb|AAI10184.1| 31.99 347 213 11 40
373 53 389 6e-42 169
-----------
Here's the -m8 output.
Bacteriophage_1[M19348] gi|6978677|ref|NP_037061.1| 40.32 62 33
1 28 89 1050 1107 6e-05 46.6
Bacteriophage_1[M19348] gi|82958039|ref|XP_193814.5| 48.89 45 17
1 57 95 320 364 0.001 42.7
Bacteriophage_1[M19348] gi|82958037|ref|XP_912463.1| 48.89 45 17
1 57 95 866 910 0.001 42.7
Bacteriophage_1[M19348] gi|82957449|ref|XP_619329.2| 48.89 45 17
1 57 95 676 720 0.001 42.7
C.elegans_1_[Z49071] gi|82802536|ref|XP_917828.1| 29.61 412 242
9 40 410 52 456 6e-43 173
C.elegans_1_[Z49071] gi|82571607|gb|AAI10184.1| 31.99 347 213
11 40 373 53 389 6e-42 169
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqs.blp
Type: application/octet-stream
Size: 10400 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20051223/a95c381f/seqs-0001.obj
More information about the Bioperl-l
mailing list