[Bioperl-l] parsing multiple blast reports

Jason Stajich jason@cgt.mc.duke.edu
Fri, 26 Apr 2002 08:18:24 -0400 (EDT)


On Fri, 26 Apr 2002, Gert Thijs wrote:

> Hi,
>
> I have a question concerning the parsing of multiple blast reports in one
> large file. I have already scanned the list and the doc files but I did not
> find an answer.
> I use the program 'blastcl3' to send blast requests to the server at NCBI.
> Typically I submit a file with several sequences instead of submitting many
> requests with one sequence. The result is a large file with many concatenated
> blast reports.

This should be the same report as generated by a local blastall.... ah but
as I discover later on it isn't exactly the same.

> To me it seems that Bio::SearchIO is the right candidate to step
> through this file and select the right HSPs for each query
> sequence. Below is the code I use to test this approach. This script
> works fine for the first report but it stops at the end of the first
> report. I switched on verbose to see if there was a problem but no
> warning/error message was printed.
>

That's how the objects are designed - I'm not sure why it isn't working
for you - I am able to run fine on a number of multi-report files.

Oh wait I checked it out - This is just AWESOME work by NCBI here, I mean
why shouldn't the same program which runs locally and remotely have
slightly different formats?

Normal local blast (blastall) code produces reports that look like this:

BLASTP (12234) .....

Query= BLAH

Report... here

BLASTP (12234) ....

So we can detect when we hit the next report (and detect the algorithm)
by seeing each [T]?BLAST[PXN] line. The blastcl3 code does not repeat that
line, though. (Just want to note that all these apps are part of the same
codebase, folks. Of course, blastcl3 reports probably have to be funnelled
into ASN.1 over the wire and then back out into a report.)

blastcl3 reports look like this instead, with a single BLAST line at the
top and only a Query= line starting each later report:
BLASTP (12234) .....

Query= BLAH

Report ... here

Query= BLAH

Report ... here.

etc...


I'll have to see about how we can detect this without breaking the code
that assumes a /BLAST/ line signifies the start of a new report.

I guess we can probably add a flag which indicates this is blastcl3
format not 'standard' blast ...
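
In the meantime, a rough pre-processing sketch along these lines might work
as a stopgap. It is not part of Bio::SearchIO; split_blast_reports is a
made-up helper name, and it assumes that repeating only the single BLAST
banner line in front of each later Query= is enough for a downstream parser
(a real report also carries reference lines after the banner, which this
does not copy). It treats either a new banner line or a second Query= under
the same banner as the start of a new report:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): split concatenated BLAST text
# output into one chunk per query. A new chunk starts at a banner line such
# as "BLASTP ..." or, in blastcl3-style output, at a repeated "Query=" line.
sub split_blast_reports {
    my ($text) = @_;
    my @reports;
    my $header     = '';   # most recent banner line
    my $current;           # text of the report being collected
    my $seen_query = 0;    # has this banner already had a Query= line?

    for my $line ( split /^/, $text ) {
        if ( $line =~ /^T?BLAST[PXN]\b/ ) {
            # 'standard' boundary: a fresh banner line
            push @reports, $current if defined $current;
            ( $header, $current, $seen_query ) = ( $line, $line, 0 );
            next;
        }
        next unless defined $current;    # ignore text before the first banner
        if ( $line =~ /^Query=/ ) {
            if ($seen_query) {
                # blastcl3-style boundary: re-use the last banner line
                push @reports, $current;
                $current = $header . "\n";
            }
            $seen_query = 1;
        }
        $current .= $line;
    }
    push @reports, $current if defined $current;
    return @reports;
}

# Toy input shaped like the blastcl3 example above
my $blastcl3_style = <<'END';
BLASTP (12234)

Query= seq1

Report for seq1

Query= seq2

Report for seq2
END

my @reports = split_blast_reports($blastcl3_style);
printf "%d reports found\n", scalar @reports;
print "-----\n$_" for @reports;
```

Each chunk could then be written to its own file, or the chunks joined back
together, before handing the result to the parser. Whether the parser is
happy with only the banner line repeated is untested here.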

You might consider outputting the data with -m 7 for XML output, which
should solve this problem. If you still want to eyeball the reports, you
can generate human-readable HTML reports from the XML with the SearchWriter
code, and I think you'll have a better time parsing the XML.

-j

> ----
> use Bio::SearchIO;
>
> # $blastReport holds the path to the concatenated BLAST report file
> my $searchObj = Bio::SearchIO->new( -format  => 'blast',
>                                     -file    => "<$blastReport",
>                                     -verbose => 1
>                                   );
> # Step through every report in the file, then every hit in that report
> while ( my $result = $searchObj->next_result ){
>      my $query = $result->query_name;
>      print $query,"\n";
>      while ( my $hit = $result->next_hit ){
>          print "$hit: ",$hit->name,"|",$hit->description,"\n";
>      }
> }
> $searchObj->close;
> ----
>
>
> Has anyone any idea what I am missing or doing wrong?
>
> Many thanks,
> Gert
>
>
>
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu