[Bioperl-l] Bio::SearchIO::hmmer hsp behaviour

Thu Jun 29 02:53:25 UTC 2006

I don't have any time to really debate this sadly - I definitely went  
back and forth on how to solve this and not many people ever spoke up  
about what the WANTED.  So glad to hear there are opinions out there  
now.

I think the bug fix you refer to had to do with not returning things  
ordered by E-value -- the creation machinery only only builds Hit  
objects when there are HSP objects being built.  Basically the  
parsing is linear in terms of the file, we read "Model" (Hit) data  
first and store them in a hash keyed by the name of the domain, but  
we only >>build<< the "Hits" when seen HSPs, hence the problem when  
the -A option limits alignments but reports Hits that don't have  
individual alignments.  This has to do with the order of things not  
syncing up and/or dealing with the -A option when there is leftover  
Hit data but no HSPs to populate them.  We also had this problem in  
BLAST reports and had to work around that, but I never bothered  
solving it in HMMER I guess.  Glad there are other people who are  
going to fix the problems!

The one "alignment" (HSP) per hit was a workaround to the problem  
that Hits were being returned in the order the HSPs came in (Sequence  
order) -- because that is the order they were being built in -- not  
in the sorted order of the Hits as seen in the report.

Feel free to propose an alternative implement for parser as you see  
fit as long as the API is preserved.  you can contibute a new  
SearchIO plugin and HMMERSearchResultListener to deal with it - or I  
guess do what I also do and just run hmmer2table and deal with things  
in a tab-delimited format.

Personally my interests lie in the actual domains so the Hit objects  
are superfluous in my own work so it never bothered me to have one  
per Hit and it flows more naturally to things like GFF, etc.  You can  
aggregate them however you like after the fact pretty simply so I  
don't find this too hard to deal with, but if this a major deterrent  
for people I guess have at it ( I think the speed of object creation  
is a larger problem that I hope that someone will work on soon).

I'd appreciate you including the salient points of how the report is  
interpreted on the wiki at some point (with 8X10 glossy pictures and  
circles and arrows on the back...http://en.wikipedia.org/wiki/Alice% 
27s_Restaurant) so the debate can be archived too.

-jason

On Jun 28, 2006, at 7:00 PM, Sendu Bala wrote:

> Chris Fields wrote:
>>> Sendu Bala wrote:
> [snip]
>>> In any case, this is extremely counter-intuitive, especially given
>>> that next_domain is a synonym of next_hsp. I think either the
>>> synonym relationship remains and hits have multiple hsps (and there
>>> is only one hit per model)
> [snip]
>
>> The model (hit-like) table scores are retained and can be retrieved
>> via $model->significance and the individual domain (hsp-like) evalues
>> via $model->evalue.
>
> I know, see my earlier post.
>
>> The reason you don't get all the individual domain evalues is that
>> only five alignments are returned by default.  You might try changing
>> the 'A' parameter to see if you can get more alignments; that may
>> work around the problem of missing domains for now.
>
> [I'm using my own data, not the OP's]
> No, I have all the alignments: 'A' isn't a problem. And I can get all
> the domains. The problem is I have to check multiple different hits to
> find them all.
>
>
>> You'll note that the Model/Domain results returned are not based on
>> top score but what looks like the position of the domain in the
>> sequence (seq-t in the last table); that's what is stated in the
>> hmmpfam docs.
> [...]
>> Well, that and SearchIO is set up as a SAX-like parser, so I believe
>> it processes the model-domain alignments as the file is parsed.
>
> Yes, this is the problem. The parser does the obvious thing, but in my
> view it does not do the correct thing.
>
>
>> Model/domain pairs really aren't Hits/HSPs by definition, like the
>> CVS commit from Jason states.  The way Pfam is set up, you actually
>> have your query(ies) scanned using a database of Pfam domains (HMM's,
>> built from protein alignments for various protein families), hence
>> the alignment in the report is not a HSP since HSPs come from
>> pairwise sequence alignments.  An HSP is a pair of sequences which,
>> when aligned, meet or exceed a maximal cutoff.  The hmmpfam report
>> has alignments of the sequence and the consensus for the alignment
>> the HMM is based on (not another sequence, so not an HSP).
>
> But this is just semantics. It doesn't /matter/ that its not really
> truly a sequence that's being aligned. The parser needs to present to
> the user the information in the file. As we see in the OP's  
> example, it
> simply fails to do this because the parser isn't model-centric  
> while the
> file it is parsing /is/.
>
> And in any case, your argument doesn't hold because even the current
> parser /does/ store domains in hsp objects! It just only stores one  
> hsp
> per hit, repeatedly, which is nonsensical.
>
> [to avoid confusion, in the following the use of 'model' is in the
> programming sense, whilst 'Model' refers to the things generated by  
> hmmer]
>
> The correct model to describe the file being parsed is one that is  
> able
> provide to the user all the available results for all Models that  
> hit a
> query sequence, even when there are no alignments in the file. To make
> this fit the SearchIO scheme, we must have one hit per Model. The hit
> has hsps which are the domains. This perfectly matches the information
> in the file. It matches something like a Blast, where you have one hit
> per database sequence/query sequence combo.
>
> A hit could end up with no hsps (no domains), but we may not even  
> care.
> Sometimes you really do just want to know if a particular model hit at
> all, and with what evalue/score. The current parsing model isn't
> guaranteed to tell you this even when you can read it yourself in the
> file being parsed.
>
> You can guess at the intent of the original authors, I think, just by
> looking at those method synonyms. next_hit == next_model. next_hsp ==
> next_domain. This makes perfect sense. This is the way to correctly
> model the information in the file. The problem is that next_model
> doesn't give you the next Model (because each Model has multiple  
> hits),
> and next_domain doesn't give you the next domain (because each hit  
> only
> has one domain).
>
>
>> I think the reasoning for keeping single model-domain pairs is that
>> you should consider each domain's location in the sequence as well as
>> the number of times they appear, regardless of whether they belong to
>> the same model or not.  One protein could have three ATP-binding
>> domains and another two, and they could be located in different
>> positions on the sequence.  But where they are on the sequence in
>> relation to other domains and to each other (i.e. positional
>> information) is just as important, maybe more so, than how many times
>> that domain appears.
>
> Well, that's for the user to decide. But the way the results are
> presented needs to make sense. If blast results came back with all  
> hsps
> listed out in sequence position order, would you have multiple hits  
> per
> database sequence each with one hsp? No, because the meaning is
> completely wrong. The 'hit' is the collection of alignments of a
> particular database sequence hitting a query sequence. The alignments
> are stored in a bunch of hsps. It is absurd to have more than one hit
> object for a database+query sequence combo, because then we have
> multiple hit objects duplicating the exact same information, and 'hit'
> no longer has any meaning - it is a collection of /some/ of the
> alignments? Yet this is exactly what we have with hmmpfam result  
> parsing.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
Jason Stajich
Duke University
http://www.duke.edu/~jes12