[Bioperl-l] hmmer3/hmmscan parser

Wed May 26 15:25:24 UTC 2010

Thanks for the feedback, Dave.

>> So this brings up an interesting point. At some point, we'll have  
>> to build out a few additional SearchIO methods to incorporate some  
>> of the additional information encoded in the HMMER v3 reports.
>
> Would the new methods need to be added to SearchIO if they're  
> specific to H3? (as opposed to just being in the H3 sub-class)

Sorry for being unclear - the methods in question would be, at least  
in my mind, specific to the H3 sub-class.

>
>> Sean talks a bit in the user manual about the importance of looking  
>> at both the full sequence and the best domain (see page 18 in the  
>> manual linked to on this page: http://hmmer.janelia.org/#documentation).
>> For example, he mentions that one should consider the e-value of  
>> both the full sequence and best domain to ascertain if the query is  
>> homologous to a profile being considered via hmmsearch.
>>
>> He also mentions that looking at the full sequence report values  
>> without consideration of the best domain report values can be  
>> misleading. I'm not saying that your approach regarding
>> Hit->raw_score is wrong - proper interpretation of the results is up to  
>> the end user and there are benefits to looking at the full sequence  
>> (again, communicated on page 18) - but we might consider how to  
>> best encode the SearchIO methods to mitigate end user confusion and  
>> mistakes.
>
> I think this is a great idea.
>
> Of course it's always best for end-users to RTFM and understand the  
> tools they're using, but it's clearly beneficial to make it easier  
> to do the right thing.
>
> Having not considered it too much, I'm not sure how to accomplish  
> this without breaking the SearchIO idiom. But presumably a way could  
> be found.
>

I'll go back to the drawing board and come up with a naming
scheme for additional H3 methods that retrieve some of the extra data
encoded in the new reports. It *probably* makes the most sense, at least
from the user's perspective, to adopt the full-length report values as
the standard hit->significance and hit->raw_score, while having
something like hit->best_significance and hit->best_score as H3 methods
that return the best-domain report values.
Again, this could use some thought/discussion.
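
To make the naming idea concrete, here is a minimal sketch of how the two sets of values could sit side by side on a hit. It is written in Python purely for illustration and is not BioPerl code. It assumes the H3 per-sequence hit table puts the full-sequence E-value, score and bias in the first three columns and the best-domain E-value, score and bias in the next three. The Hit class, parse_hit_row, both_significant, the 1e-5 cutoff and the example row are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One row of the per-sequence hit table (names mirror the proposal above)."""
    name: str
    significance: float       # full-sequence E-value
    raw_score: float          # full-sequence bit score
    best_significance: float  # best-domain E-value
    best_score: float         # best-domain bit score


def parse_hit_row(row: str) -> Hit:
    """Parse one hit row, assuming the column order:
    full E-value, score, bias | best-domain E-value, score, bias | exp, N, name, description
    """
    fields = row.split(None, 9)
    if len(fields) < 9:
        raise ValueError("expected at least 9 columns in a hit row")
    return Hit(
        name=fields[8],
        significance=float(fields[0]),
        raw_score=float(fields[1]),
        best_significance=float(fields[3]),
        best_score=float(fields[4]),
    )


def both_significant(hit: Hit, cutoff: float = 1e-5) -> bool:
    """True only if the full-sequence AND best-domain E-values pass the cutoff.
    The default cutoff is an arbitrary placeholder, not a recommendation."""
    return hit.significance <= cutoff and hit.best_significance <= cutoff


# Made-up row for illustration only (not taken from a real report).
EXAMPLE_ROW = "  1.2e-30  105.3   0.1    3.4e-20   71.8   0.0    2.1  2  seqA  made-up example protein"
hit = parse_hit_row(EXAMPLE_ROW)
```

Keeping both pairs of values on the same object means a caller who only reads significance still gets the full-sequence number, while the best-domain values are one method call away for anyone who wants to check both.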

>
>>> Some of the folks on IRC suggested that we might want to integrate  
>>> the
>>> hmmer.pm parser as well, modularizing this a bit and loading the  
>>> correct
>>> parser depending on the requested format.
>
>> This might make sense, given that HMMER v3 is now live and seems to  
>> be adopted by researchers at an increasing rate. Since I used  
>> hmmer.pm as a template for hmmer3.pm, it shouldn't be too difficult  
>> to do, either.  I think a thorough conversation on this point is  
>> warranted as others I've talked to have preferred the modules to be  
>> separate.
>>
>> I'd be interested to hear what others have to say on this point.
>
> I did not follow the IRC discussion, so I confess I'm not totally  
> clear on what "integrate the hmmer.pm parser" means. I'm taking it  
> to mean combining the code that parses HMMER2 with the code that  
> parses HMMER3.
>
> But then "modularizing this a bit and loading the correct parser  
> depending on the requested format" seems to contradict that  
> assumption.
>
> Perhaps you (or someone) could clarify a bit what the HMMER2 -  
> HMMER3 integration would look like (and the goal of doing so) ?
>

I was not a part of that conversation either and I'm also operating  
under a similar assumption about what "integrating the hmmer.pm  
parser" means.  I too am confused about the statement regarding  
modularization; I assume Kai meant that next_result would leverage the  
HMMER version number (which it already grabs) to guide the appropriate  
parsing of the datafile.  Not thinking about this too carefully, it  
might be as simple as:

next_result{
	version = get_hmmer_version
	if version == 2
		parse V2 report file
	if version == 3
		parse V3 report file
}

To make the code a bit more manageable, the various version parsers  
could be split out into independent subroutines.
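
As a rough, runnable version of the pseudocode above, the dispatch could look something like the sketch below. It is in Python for illustration only and is not the hmmer.pm code. It assumes the major version can be read from a header line such as "HMMER 2.3.2 (Oct 2003)" or "# HMMER 3.0 (March 2010)". The names detect_major_version, parse_v2_report, parse_v3_report and PARSERS are hypothetical, and the two parsers are stubs.

```python
import re


def detect_major_version(report: str) -> int:
    """Return the HMMER major version found in the report header.
    Assumes a header line like 'HMMER 2.3.2 ...' or '# HMMER 3.0 ...'."""
    match = re.search(r"^#?\s*HMMER\s+(\d+)", report, re.MULTILINE)
    if match is None:
        raise ValueError("no HMMER version line found")
    return int(match.group(1))


def parse_v2_report(report: str) -> dict:
    """Stub standing in for the HMMER2 parsing logic."""
    return {"parser": "v2", "version": 2}


def parse_v3_report(report: str) -> dict:
    """Stub standing in for the HMMER3 parsing logic."""
    return {"parser": "v3", "version": 3}


# One independent subroutine per supported major version.
PARSERS = {2: parse_v2_report, 3: parse_v3_report}


def next_result(report: str) -> dict:
    """Pick the parser that matches the report's version and run it."""
    version = detect_major_version(report)
    parser = PARSERS.get(version)
    if parser is None:
        raise ValueError(f"unsupported HMMER version: {version}")
    return parser(report)
```

A lookup table keeps next_result short and makes adding a future version a one-line change, at the cost of requiring every parser to accept the same arguments.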

Kai, is this along the lines of what you were thinking?

If this is correct (that is, merging the H2 and H3 parsers into a  
single hmmer.pm module), I see one primary benefit - the end user need  
not specify which HMMER module they want to use, just  
Bio::SearchIO::hmmer - and one secondary benefit - there's enough  
similarity between H2 and H3 reports that some code from the H2 parser  
redundantly appears in the H3 parser.  There are certainly other  
benefits that I'm overlooking.

The only real downside I see at the moment is that the hmmer.pm parser  
becomes a bit more complicated and bloated. But I suspect this can be  
remedied with careful partitioning of the code into appropriate  
subroutines and thorough documentation. I am a bit concerned about how  
the aforementioned H3 specific methods are incorporated, but that  
should be manageable.

I wonder if anyone involved in the IRC discussion cares to weigh in?

Regardless, I'd advocate getting the H3 version fully fleshed out to  
deal with the issues brought up in the first half of this message  
before attempting to merge the two modules, as the merging process  
may be affected by the structure of the H3 parser.

Best,
Tom