[Biojava-l] getting % identity from blast search results

Keith James kdj@sanger.ac.uk
10 Jun 2002 10:07:07 +0100


>>>>> "Simon" == Simon Brocklehurst <simon.brocklehurst@CambridgeAntibody.com> writes:

    Simon> Susan Glass wrote:
    >>  Hello there, Last year I wrote a class that parsed a BLAST
    >> search file and returned a List of SequenceDBSearchResults,
    >> using BlastLikeSAXParser and BlastLikeSearchBuilder.  I
    >> modelled it on demo code that David Waring had written
    >> (thanks!).  Once the results were returned, I picked the hits
    >> that were below a given evalue cutoff.  My problem now is that
    >> the client has requested that instead of an evalue cutoff, the
    >> program should pick all hits that match, say, with 90% identity
    >> over 90% of the query sequence length.  I'm not sure how to get

[...]

    Simon> So, your changing needs are exactly the reason why it's a
    Simon> good idea to use a SAX approach to parsing i.e. one object
    Simon> does not fit all needs.  Your kind of requirement is not
    Simon> unusual at all. So, I would say don't be afraid to make
    Simon> your own objects, and populate them from SAX events passed
    Simon> to your own ContentHandler.  These objects don't have to be
    Simon> reusable across loads of use cases, they just have to meet
    Simon> your needs.

Absolutely. The original interfaces for search result/hit/subhit were
even more sparse than the existing ones, which are still not expected
to cover all situations. I'm in a similar situation this week: I want
%id and %coverage of both query and hit from Fasta search results, but
written to a database rather than held in Java objects. So it's a case
of writing a new ContentHandler.
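
As a rough sketch of what such a handler might look like for the 90%
identity over 90% of the query case: the class name and the element and
attribute names used here (Hit, HSP, identical, alignLength, queryStart,
queryEnd) are made up for illustration and are not the events the
BioJava parsers actually emit, so they would need mapping onto the real
event stream. It uses only the standard SAX classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Collects the ids of hits that have at least one HSP meeting both a
 * percent-identity and a percent-query-coverage threshold.
 * ASSUMPTION: the XML vocabulary (Hit/HSP and their attributes) is
 * hypothetical and chosen only to keep the example self-contained.
 */
class IdentityFilterHandler extends DefaultHandler {
    private final int queryLength;
    private final double minIdentity;  // percent, e.g. 90.0
    private final double minCoverage;  // percent of query length
    private final List<String> accepted = new ArrayList<>();
    private String currentHitId;

    IdentityFilterHandler(int queryLength, double minIdentity, double minCoverage) {
        this.queryLength = queryLength;
        this.minIdentity = minIdentity;
        this.minCoverage = minCoverage;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Hit".equals(qName)) {
            // Remember which hit the following HSP elements belong to.
            currentHitId = atts.getValue("id");
        } else if ("HSP".equals(qName)) {
            int identical = Integer.parseInt(atts.getValue("identical"));
            int alignLength = Integer.parseInt(atts.getValue("alignLength"));
            int queryStart = Integer.parseInt(atts.getValue("queryStart"));
            int queryEnd = Integer.parseInt(atts.getValue("queryEnd"));

            double pctIdentity = 100.0 * identical / alignLength;
            double pctCoverage = 100.0 * (queryEnd - queryStart + 1) / queryLength;

            if (pctIdentity >= minIdentity && pctCoverage >= minCoverage
                    && !accepted.contains(currentHitId)) {
                accepted.add(currentHitId);
            }
        }
    }

    /** Parses the given XML text and returns the ids of the accepted hits. */
    static List<String> filter(String xml, int queryLength,
                               double minIdentity, double minCoverage) {
        IdentityFilterHandler handler =
            new IdentityFilterHandler(queryLength, minIdentity, minCoverage);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.accepted;
    }
}
```

Note that this computes coverage per HSP. If a hit is split over
several HSPs, the covered query ranges would have to be merged before
comparing against the threshold.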

I can recommend David Huen's XML material from the bootcamp at
http://biojava.org/bootcamp/exercises/xml/index.html which describes
in some detail how to write modular SAX ContentHandlers quickly.

Aside:

With this in mind I've reworked the ContentHandlers in the ssbind
package to use StAX because it makes them more modular and quite easy
to plug together to get new behaviour. The end-user classes are
unchanged and all the existing tests pass unmodified, so this is a
transparent change. (It's not checked in yet.)
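
To give a flavour of the "plug together" idea, here is a small sketch
of element-level delegation written against plain SAX. It is not the
ssbind code and not the BioJava StAX interfaces: ElementHandler,
DelegatingHandler, DelegationDemo and the Hit/HSP vocabulary are all
hypothetical names invented for the example. Each small handler is
registered for one element name and receives the events for that
element and for any descendants not claimed by another handler.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical per-element handler (NOT the BioJava StAX interface). */
interface ElementHandler {
    void start(String name, Attributes atts);
    void end(String name);
}

/** Routes SAX events to whichever small handler owns the current element. */
class DelegatingHandler extends DefaultHandler {
    private static final ElementHandler NOOP = new ElementHandler() {
        public void start(String name, Attributes atts) { }
        public void end(String name) { }
    };

    private final Map<String, ElementHandler> registry = new HashMap<>();
    // One entry per open element: the handler responsible for it.
    private final Deque<ElementHandler> active = new ArrayDeque<>();

    void register(String elementName, ElementHandler handler) {
        registry.put(elementName, handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        ElementHandler next = registry.get(qName);
        if (next == null) {
            // Unregistered element: stay with the enclosing element's handler.
            next = active.isEmpty() ? NOOP : active.peek();
        }
        active.push(next);
        next.start(qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        active.pop().end(qName);
    }
}

class DelegationDemo {
    /** Parses the XML and returns a log of what each plugged-in handler saw. */
    static List<String> run(String xml) {
        final List<String> log = new ArrayList<>();
        DelegatingHandler dispatcher = new DelegatingHandler();
        dispatcher.register("Hit", new ElementHandler() {
            public void start(String name, Attributes atts) {
                if ("Hit".equals(name)) log.add("hit " + atts.getValue("id"));
            }
            public void end(String name) { }
        });
        dispatcher.register("HSP", new ElementHandler() {
            public void start(String name, Attributes atts) {
                log.add("hsp " + atts.getValue("identical"));
            }
            public void end(String name) { }
        });
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), dispatcher);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log;
    }
}
```

Swapping in a different handler for "HSP" (say, one that writes to a
database) changes the behaviour without touching the handler for "Hit",
which is the kind of modularity meant above.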

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Cambridge, UK