[Bioperl-l] New hmmpfam parser

Sun Aug 20 21:56:37 UTC 2006

I've added a new hmmpfam parser to bioperl-live.

You access it with Bio::SearchIO->new(-format => 'hmmer_pull'). It uses
the new Bio::PullParserI discussed in the thread 'SearchIO speedup'.
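
For anyone who wants to try it, here is a minimal sketch of walking a
report with the new format. The helper name domain_lines, the
tab-separated output and taking the report path from the command line
are my own choices for illustration; only the -format value comes from
the description above, and the loop uses the standard SearchIO
next_result/next_hit/next_hsp iteration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: walk every result (query), hit (model) and
# HSP (domain) and return one tab-separated line per domain.
sub domain_lines {
    my ($searchio) = @_;
    my @lines;
    while (my $result = $searchio->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                push @lines,
                    join("\t", $result->query_name, $hit->name, $hsp->evalue);
            }
        }
    }
    return @lines;
}

# Only runs when a report file is given on the command line.
if (@ARGV) {
    require Bio::SearchIO;    # needs bioperl-live installed
    my $searchio = Bio::SearchIO->new(
        -format => 'hmmer_pull',
        -file   => $ARGV[0],  # path to an hmmpfam report (assumed)
    );
    print "$_\n" for domain_lines($searchio);
}
```

Since hmmer_pull reports every domain of a model, a model with several
domains should show up as several lines with the same model name.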

The major differences between it and the existing SearchIO plugin for
hmmpfam reports (hmmer.pm) are speed, memory usage and how it deals with
hits and HSPs. hmmer.pm breaks the Bio::Search::HitI API by having hit
(model) name()s that are not unique within a ResultI. It also only ever
has one domain per model. hmmer_pull.pm has unique model names and as
many domains per model as there are in the file being parsed. It also
returns correct answers from more of the HitI, GenericHit, HSPI and
GenericHSP methods.

Speed tested on one example hmmpfam report of 441 KB, comparing
hmmer_pull.pm against hmmer.pm. The figures below are for hmmer_pull.pm
relative to hmmer.pm (its memory usage was always ~1.8x lower):

# for the result for query sequence 'test5' (5th result of 10 in my
# test dataset), just get the most significant domain of the most
# significant model:
# while ($result = $searchio->next_result) {
#   if ($result->query_name eq 'test5') {
#     $result->sort_hits(sub{#sort by significance});
#     $hit = $result->next_hit;
#     $hsp = $hit->hsp('best');
#     last;
#   }
# }
23.5x faster

# while ($result = $searchio->next_result) { } # do nothing
38x faster

# while ($result = $searchio->next_result) {
#   while ($hit = $result->next_hit) {
#     while ($hsp = $hit->next_hsp) { } # do nothing
#   }
# }
5.3x faster

# while ($result = $searchio->next_result) {
#   while ($hit = $result->next_hit) {
#     while ($hsp = $hit->next_hsp) {
#       $fi = $hsp->frac_identical('query');
#     }
#   }
# }
(note that hmmer.pm returns the wrong answer, 0, for $fi)
2.2x faster