[Bioperl-l] Bug in Bio::Tools::BPlite.pm ?
Jason Eric Stajich
jason@cgt.mc.duke.edu
Tue, 2 Oct 2001 16:45:25 -0400 (EDT)
Leonardo -
I think I got it. 2 things happen.
1) Our decision was that sequence similarity features (in BPlite) are
always created s.t. start < end and strand is set to -1 if start > end
on initialization in Bio::Tools::BPlite::HSP. So you are getting
things correctly in terms of start and end, but you need to evaluate
strand. However, see next point...
2) We were not allowing strand to be set to '-1' in the HSP initialization
b/c we were not also checking for 'BLASTN' as a report type. This was
part of the recent changes to get strand/frame stuff in correctly that
hadn't been stress tested - I added BLASTN and things seem to work (in
the sense that the output from your program is identical, but now the
strand is correct for the subject - you will need to change your
program and decide how you want to output the information either
rearranging based on strand or printing strand out as well).
My changes basically look like this
< if ($blasttype eq 'TBLASTN' || $blasttype eq 'TBLASTX') {
---
> if ($blasttype eq 'TBLASTN' || $blasttype eq 'TBLASTX' ||
> $blasttype eq 'BLASTN' ) {
All tests pass which means either this is the RIGHT thing to do, or we
don't have enough tests for all scenarios when parsing blast reports
(This is really something that someone could do if they wanted to
volunteer to help on something!!!)
I'm checking this in, but I think we might need some more evaluation at
some point and better tests to be sure we've covered all the bases.
-jason
On Tue, 25 Sep 2001, Leonardo Marino-Ramirez wrote:
> Hello,
>
> I am trying to get subject start end end coordinates from blast reports
> using the Bio::Tools::BPlite module.
>
> I am using a standard script to parse blast reports:
>
> #!/usr/bin/perl
> # parse_blast
> # Reads and parse blast reports
> # usage parse_blast -d <directory> > <Name_of_parse_output>
> # Output format
> #
>
> use strict;
> use Bio::Tools::BPlite;
> use Getopt::Std;
>
> use vars qw($opt_d
> $file
> @blast_reports
> $fname
> $report
> @tmp
> $qn
> $gi
> $sc
> $ev
> $pid
> $qfr
> $qstr
> $qsta
> $qend
> $ssta
> $send
> $sbjct
> $type
> $db
> $anal1
> $anal2
> $anal3
> $anal4
> $anal
> );
>
> getopt('d');
>
> my $dirname = "$opt_d";
>
> ## Get a working directory containing blast reports
> if (opendir (DIR, $dirname)) {
> while ($file = readdir(DIR)) {
> ## blast reports must contain the .br suffix
> push (@blast_reports, "$file") if ($file =~ /\.br$/);
> }
> }
> closedir(DIR);
>
> foreach $fname (@blast_reports) {
>
> parse_blast ( $report );
>
> }
>
>
> ## parse_blast subroutine
> sub parse_blast {
>
> my $report = new Bio::Tools::BPlite(-file=>"$fname");
>
> my $query = $report->query;
>
> ## Get only the clone id from all the query name
> @tmp = split ' ', $query; $qn = $tmp[0];
>
> $db = $report->database;
>
> ## get analysis code for database
> $anal1 = $anal2 = $anal3 = $anal4 = ();
>
> if ($db eq "ecoli.aa") {
> $anal1 = "1";
> } elsif ($db eq "pdbaa") {
> $anal2 = "2";
> } elsif ($db eq "ecoli.na") {
> $anal3 = "3";
> } else {
> $anal1 = $anal2 = $anal3 = "0";
> }
>
> while(my $sbjct = $report->nextSbjct) {
> my $type = $sbjct->report_type(); #print "$type\n";
>
> ## get analysis code for blast type
> if ($type eq "BLASTX") {
> $anal4 = "1";
> } elsif ($type eq "BLASTN") {
> $anal4 = "2";
> } else {
> $anal4 = "0";
> }
>
> my $blast_hit = $sbjct->name;
> ## get gi's from blast report
> @tmp = split /\|/, $blast_hit; $gi = $tmp[1];
>
> while(my $hsp = $sbjct->nextHSP) {
> $sc = $hsp->bits; #print "score is $sc\n";
> $ev = $hsp->P; #print "e-value is $ev\n";
> $pid = $hsp->percent; #print "% id is $pid\n";
> $qfr = $hsp->query->frame; #print "query frame is $qfr\n";
> $qstr = $hsp->query->strand; #print "query strand is $qstr\n";
> $qsta = $hsp->query->start;
> $qend = $hsp->query->end;
> $ssta = $hsp->subject->start;
> $send = $hsp->subject->end;
>
> $anal = "$anal4$anal1$anal2$anal3";
>
> print
> "$qsta\t$qend\t$sc\t$qstr\t$qfr\t$anal\t$qn\t$ssta\t$send\t$blast_hit\t$gi\t$ev\t$pid\n";
>
> }
> }
>
> }
>
>
> The problem is that when I am reading a blast report (see attachment) the
> start and end coordinates are inverted!
>
> My output of the script above looks like this:
>
> 102 613 944 1 23 EB10001G04.Seq 3503126
> 3503640gi|6626251|gb|U00096.1|U00096 Escherichia coli K-12 MG1655 complete
> genome 6626251 0.0 98.6
>
> What is the problem? Note that there is also a tab missing between the
> subject end and the query name.
>
> uname -a
> Linux tofu.tamu.edu 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999 i686
> unknow
> n
>
> perl 5.005_03 built for i386-linux
>
> $Id: BPlite.pm,v 1.25 2001/09/05 11:38:57 heikki Exp
>
> Thanks, Leonardo
>
--
Jason Stajich
Duke University
jason@cgt.mc.duke.edu