[Bioperl-l] SearchIO and legacy parser minute overview

Jason Stajich jason@cgt.mc.duke.edu
Mon, 13 May 2002 20:00:58 -0400 (EDT)


OO may seem hard at first but I promise it is worth it.

For what I refer to below, the online documentation at
http://docs.bioperl.org is a great place to look, as Raphael's Pdoc
generated documentation actually provides links to the whole inheritance
tree, so you can walk up it and see what methods each of these classes
implements.

The OO schema that is linked at:
http://www.bioperl.org/Core/Latest/modules.html
should also shed a little light on the object structures.  I'll keep
updating it as things change and I find magic blocks of free time to do
this stuff.

====
In the new SearchIO system we have separated blast parsing into 2
sets of components - the parsers and the objects they create.

The historical modules, Bio::Tools::Blast (Steve's old system) and
Bio::Tools::BPlite (a relatively lightweight BLAST parsing module), also
separated the parser from the objects, but they each created their own set
of Hit and HSP objects, which were not necessarily compatible.  I wanted to
create a single parser system that would create the same types of objects
for FASTA, BLAST, HMMer, and other searching program output.

Thus was born Bio::SearchIO: an extensible framework for parsing
FASTA, BLAST (WU and NCBI, in XML and text formats), and (eventually)
HMMer style reports.

Bio::SearchIO::blast    - for reading text NCBI and WU-BLAST reports:
                          [t]blast[xpn]
Bio::SearchIO::blastxml - for reading XML BLAST reports.
Bio::SearchIO::psiblast - SteveC's alternate implementation of BLAST
                          parsing.
Bio::SearchIO::fasta    - for reading text TFASTAXY reports.

use Bio::SearchIO;

# $filename holds the path to your report
my $in = Bio::SearchIO->new(-file => $filename, -format => 'blast');
while( my $result = $in->next_result ) {
 # to see what type of object $result is you can use introspection in Perl
 print ref($result), "\n";
 while( my $hit = $result->next_hit ) {
  while( my $hsp = $hit->next_hsp ) {
   # work with each HSP here
  }
 }
}

The SearchIO code creates objects that are compliant with the
Bio::Search::Result::ResultI interface.  You can look at this interface to
see a summary of its specific methods.  You can look at the inheritance
(the @ISA array) to see what interfaces this object inherits from.  This is
also linked nicely in the header of the Pdoc documentation.
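
As a minimal sketch of walking the @ISA arrays by hand, the snippet below
prints a class followed by everything it inherits from.  The My::* packages
and the inheritance_tree helper are hypothetical stand-ins so that it runs
without BioPerl installed; with a real report you would pass the $result
object returned by next_result() instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in classes, defined here only so the sketch runs
# without BioPerl installed.  They are not real BioPerl modules.
package My::RootI;    sub new { my $class = shift; return bless {}, $class }
package My::ResultI;  our @ISA = ('My::RootI');   sub next_hit { return }
package My::Result;   our @ISA = ('My::ResultI'); sub query_name { return 'q1' }

package main;

# Return the class of the argument followed by every class it inherits
# from, found by walking each @ISA array depth-first.
sub inheritance_tree {
    my ($obj_or_class) = @_;
    my $class = ref($obj_or_class) || $obj_or_class;
    my ( @order, %seen );
    my @todo = ($class);
    while ( defined( my $c = shift @todo ) ) {
        next if $seen{$c}++;
        push @order, $c;
        no strict 'refs';    # needed to look up @ISA by package name
        unshift @todo, @{"${c}::ISA"};
    }
    return @order;
}

my $result = My::Result->new;
print join( ' -> ', inheritance_tree($result) ), "\n";

# can() answers "does this object support method X?" wherever it is defined
print "has next_hit\n" if $result->can('next_hit');
```

The can() check at the end is often the quickest answer to "is this method
available on my object", since it looks through the whole inheritance tree.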

The next_hit() method of the ResultI interface returns
Bio::Search::Hit::HitI compliant objects.  Similarly, look at the docs for
this interface and you will see what methods it requires and what
interfaces it inherits from.

The next_hsp() method of the HitI object similarly produces
Bio::Search::HSP::HSPI objects.  The fun with OO is that anyone can
plug their own object into this system if it implements the
interface.  So Steve implemented BlastHSP and I wrote GenericHSP.  I hope
that the code and the interface for GenericHSP are reasonably
understandable through the online docs, but let me walk you through it.

HSPs obviously carry a lot of information.  A stretch of similarity can be
coded as a feature (Bio::SeqFeatureI); in the case of an HSP there are 2
features, one for the query and one for the hit.  So we encapsulate this in
a Bio::SeqFeature::FeaturePair.  Since they describe similarity, each
of the features in the feature pair is a Bio::SeqFeature::Similarity, so
the HSP is actually a Bio::SeqFeature::SimilarityPair.  You can look at
the methods for all of those objects to see what an HSP supports.  That is
where the query() and hit() methods come from.  query() and hit()
return Bio::SeqFeatureI compliant objects (in this case they are going to
be Bio::SeqFeature::Similarity objects), so you can find the similarity,
bits, frac_identical, etc. methods.

Additionally, you can see that a Bio::SeqFeatureI has methods like
start(), end(), strand(), frame(), score().
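
To make the method chain concrete, here is a minimal sketch that collects
one tab-separated row per HSP using query(), hit(), start(), end() and
strand().  It assumes that ResultI provides query_name() and that HitI
provides name(); hsp_rows is a hypothetical helper name and the report file
name is a placeholder.  Because it only calls interface methods, it should
not care which parser format produced the objects.

```perl
use strict;
use warnings;

# hsp_rows is a hypothetical helper: it walks result -> hit -> HSP and
# returns one tab-separated row per HSP.  It only calls interface methods,
# so $in can be any object that behaves like a Bio::SearchIO stream.
sub hsp_rows {
    my ($in) = @_;
    my @rows;
    while ( my $result = $in->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                # query() and hit() return the two features of the pair
                my ( $q, $h ) = ( $hsp->query, $hsp->hit );
                push @rows, join( "\t",
                    $result->query_name,    # assumed ResultI method
                    $hit->name,             # assumed HitI method
                    $q->start, $q->end,
                    $h->start, $h->end, $h->strand );
            }
        }
    }
    return @rows;
}

# Run as: perl hsp_rows.pl report.bls   (the file name is a placeholder)
if (@ARGV) {
    require Bio::SearchIO;
    my $in = Bio::SearchIO->new( -file => $ARGV[0], -format => 'blast' );
    print "$_\n" for hsp_rows($in);
}
```

Each row holds the query name, hit name, query start and end, hit start and
end, and the hit strand, in that order.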


On Mon, 13 May 2002, Andy Nunberg wrote:

> Jason, what would be the simplest way for me to determine what methods
> are available to a Bio::SearchIO Blast object for hits and hsps? My
> biggest confusion is inheritance and I need to read more on OOP since
> I have done none in the past.
>
> Is your module comparable to Psiblast parsed object in terms of
> gaps,strands, frames etc??
>
Yep, they both should implement the same interface, so all the functionality
you are searching for is present in both classes.  I personally think mine
is pretty easy to use, and the Generic(Hit/HSP/Result) objects that it
creates are easy to read through, while SteveC's delayed parsing is a bit
scary to walk through.  They should be essentially the same, but Steve's
objects deviate from the interface a bit, so they're not completely
interchangeable.

I'm not that familiar with what is in the tables in the old Tools::Blast,
but we built some simplified objects to write out basic table information
in the Bio::SearchIO::Writer classes.  For example, if you wanted to output
a simple table from the HSPs, just use the code below.  Unfortunately Steve
and I did not connect very well on this implementation, so this code uses
methods that are specific to his BlastHit/BlastHSP objects (created with
the 'psiblast' format) and are not in the interface.  So you can't use my
implementation with it at this point.  I'll try to get this corrected
later on.


    use Bio::SearchIO;
    use Bio::SearchIO::Writer::HSPTableWriter;

    my $in = Bio::SearchIO->new(-file => $file, -format => 'psiblast');

    my $writer = Bio::SearchIO::Writer::HSPTableWriter->new(
                                  -columns => [qw(
                                                  query_name
                                                  query_length
                                                  hit_name
                                                  hit_length
                                                  rank
                                                  frac_identical_query
                                                  expect
                                                  )]  );

    my $out = Bio::SearchIO->new( -writer => $writer,
                                  -file   => ">searchio.out" );
    my $first = 1;
    while ( my $result = $in->next_result() ) {
        $out->write_result($result, $first);
        $first = 0;
    }


-jason

> Andy
> *******************************************************************
> Andy Nunberg, Ph.D
> Computational Biologist
> Orion Genomics, LLC
> (314) 615-6989
> http://www.oriongenomics.com
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu