[Bioperl-l] Bio::Tools::BPlite::HSP - percentage IDs

Chervitz, Steve Steve_Chervitz@affymetrix.com
Fri, 20 Jul 2001 22:17:34 -0700


Speaking of arguments (bad pun intended), in Bio::Tools::Blast::HSP, the
frac_identical() method takes an argument of 'sbjct' or 'query'. It seems
natural to parameterize percent identity over the whole sequence the same
way (which, BTW, Bio::Tools::Blast::HSP doesn't do). However, read on.

An interesting point here is that the fraction returned by
Bio::Tools::Blast::HSP::frac_identical() only considers the length of the
aligned region, not the length of the whole sequence. I would argue that
it's important to interpret the percent identity within the aligned region
in conjunction with the fraction of the sequence that is aligned. For
example, I may have an HSP with 83% identity in the alignment and the
alignment itself covers 55% of the length of the query and 98% of the hit. I
like to know all of these data, rather than just knowing the percent
identity over the whole length of the query or the hit.
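
To make the three quantities concrete, here is a minimal arithmetic sketch.
It is plain Python, not Bioperl code: the HspSummary class, its method names
other than frac_identical, and all the counts and lengths are hypothetical,
chosen only so that the numbers reproduce the 83% / 55% / 98% example above.
It also assumes an ungapped alignment, so the same number of residues is
aligned on both sequences.

```python
from dataclasses import dataclass


@dataclass
class HspSummary:
    """Hypothetical per-HSP counts; this is not a Bioperl class."""
    identical: int       # identical positions in the alignment
    aligned_query: int   # query residues inside the aligned region
    aligned_sbjct: int   # hit (subject) residues inside the aligned region
    query_length: int    # full length of the query sequence
    sbjct_length: int    # full length of the hit sequence

    def _pick(self, seq_type):
        # Return (aligned residues, full length) for the chosen sequence.
        if seq_type == "query":
            return self.aligned_query, self.query_length
        if seq_type == "sbjct":
            return self.aligned_sbjct, self.sbjct_length
        raise ValueError("seq_type must be 'query' or 'sbjct'")

    def frac_identical(self, seq_type="query"):
        # Identity within the aligned region only.
        aligned, _ = self._pick(seq_type)
        return self.identical / aligned

    def frac_aligned(self, seq_type="query"):
        # Fraction of the whole sequence covered by this one HSP.
        aligned, length = self._pick(seq_type)
        return aligned / length

    def frac_identical_whole(self, seq_type="query"):
        # Identity measured against the whole sequence length.
        _, length = self._pick(seq_type)
        return self.identical / length


# Made-up numbers: 91 identities over 110 aligned positions,
# a 200-residue query and a 112-residue hit.
hsp = HspSummary(identical=91, aligned_query=110, aligned_sbjct=110,
                 query_length=200, sbjct_length=112)

for seq_type in ("query", "sbjct"):
    print(seq_type,
          f"identity in alignment: {hsp.frac_identical(seq_type):.0%}",
          f"aligned: {hsp.frac_aligned(seq_type):.0%}",
          f"identity over whole sequence: {hsp.frac_identical_whole(seq_type):.1%}")
```

With these made-up numbers, identity over the whole query comes out at about
45% while identity over the whole hit is about 81%, even though identity
within the alignment is 83% in both cases. A single whole-sequence figure
hides that difference.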

Bio::Tools::Blast::Sbjct has frac_aligned_query() and frac_aligned_sbjct()
methods that compute the fraction of the query or hit sequence that is
aligned. But these are summary statistics, created by tiling all HSPs
together. I noticed that Bio::Tools::Blast::HSP has no such per-HSP methods,
and it probably should.
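
The difference between a tiled summary and a per-HSP value can be sketched
as follows. Again this is plain Python rather than Bioperl code, and it is
not a description of how Bio::Tools::Blast::Sbjct actually does its tiling:
both function names, the 1-based inclusive coordinates, and the example
intervals are assumptions made for illustration.

```python
def frac_aligned_single(start, end, seq_length):
    """Fraction of the sequence covered by one HSP (1-based, inclusive)."""
    return (end - start + 1) / seq_length


def frac_aligned_tiled(intervals, seq_length):
    """Fraction of the sequence covered by the union of all HSP intervals.

    Overlapping intervals are merged so that no residue is counted twice.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the current block: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current block if this HSP reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / seq_length


# Hypothetical HSP coordinates on a 200-residue query.
query_hsps = [(1, 50), (40, 100), (150, 200)]
query_length = 200

per_hsp = [frac_aligned_single(s, e, query_length) for s, e in query_hsps]
tiled = frac_aligned_tiled(query_hsps, query_length)

print("per HSP:", per_hsp)
print("tiled:  ", tiled)
```

In this made-up case the tiled value (151 of 200 residues) is smaller than
the sum of the per-HSP values, because the first two HSPs overlap, and no
single HSP covers more than about 30% of the query on its own.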

Steve


> -----Original Message-----
> From: Jason Stajich [mailto:jason@chg.mc.duke.edu]
> Sent: Wednesday, July 11, 2001 10:45 AM
> To: simon potter
> Cc: bioperl-l@bioperl.org
> Subject: Re: [Bioperl-l] Bio::Tools::BPlite::HSP - percentage IDs
> 
> 
> Hmm, I really hadn't ever looked at that code - What are the arguments for
> or against.
> 
> We can have a flag be twiddled to determine whether or not to use
> subject or query, but are both technically correct? 
> 
> -Jason
> 
> On Wed, 11 Jul 2001, simon potter wrote:
> 
> > Hello.
> > 
> > I've a question about parsing Blast output and how to get percentage
> > sequence identity.
> > 
> > In HSP.pm it is calculated by dividing the number of matches by the query
> > seq length, rather than the subject seq length (i.e. a re-calculation of
> > the % given in the blast output).
> > 
> > Is there a reason for calculating it in this way? I've talked to people
> > around here and the general feeling is that it's better to calculate wrt
> > the subject seq.
> > 
> > Question is - what to do about this? Is this something we should change -
> > maybe a solution is to provide a choice for how we %id?  What do people
> > think?
> > 
> > Thanks,
> > 
> > Simon Potter,
> > EnsEMBL team, Sanger
> > 
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> > 
> 
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center 
> http://www.chg.duke.edu/ 
> 
> 