[Bioperl-l] Bio::Tools::BPlite::HSP - percentage IDs
Chervitz, Steve
Steve_Chervitz@affymetrix.com
Fri, 20 Jul 2001 22:17:34 -0700
Speaking of arguments (bad pun intended), in Bio::Tools::Blast::HSP, the
frac_identical() method takes an argument of 'sbjct' or 'query'. It seems
natural to parameterize percent identity of the whole sequence the same way
(which BTW, Bio::Tools::Blast::HSP doesn't do). However, read on.
An interesting point here is that the fraction returned by
Bio::Tools::Blast::HSP::frac_identical() only considers the length of the
aligned region, not the length of the whole sequence. I would argue that
it's important to interpret the percent identity within the aligned region
in conjunction with the fraction of the sequence that is aligned. For
example, I may have an HSP with 83% identity in the alignment and the
alignment itself covers 55% of the length of the query and 98% of the hit. I
like to know all of these data, rather than just knowing the percent
identity over the whole length of the query or the hit.
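
To make the example above concrete, here is a minimal sketch in Python
rather than Bioperl. The Hsp class, its field names, and the sequence
lengths are all hypothetical, chosen only so that the three figures come
out to 83%, 55%, and 98%; this is not the Bio::Tools::Blast::HSP API.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical HSP record; not the Bio::Tools::Blast::HSP API."""
    identical: int    # identical positions in the alignment
    align_len: int    # alignment columns, gaps included
    query_span: int   # query residues covered by the alignment
    hit_span: int     # hit residues covered by the alignment
    query_len: int    # full length of the query sequence (assumed known)
    hit_len: int      # full length of the hit sequence (assumed known)

    def frac_identical(self) -> float:
        # Identity within the aligned region only.
        return self.identical / self.align_len

    def frac_aligned_query(self) -> float:
        # Fraction of the whole query covered by this one HSP.
        return self.query_span / self.query_len

    def frac_aligned_hit(self) -> float:
        # Fraction of the whole hit covered by this one HSP.
        return self.hit_span / self.hit_len

    def frac_identical_whole_query(self) -> float:
        # Identity expressed over the full query length instead.
        return self.identical / self.query_len


# Made-up numbers that reproduce the 83% / 55% / 98% example.
hsp = Hsp(identical=249, align_len=300, query_span=275, hit_span=294,
          query_len=500, hit_len=300)

for name in ("frac_identical", "frac_aligned_query", "frac_aligned_hit",
             "frac_identical_whole_query"):
    print(f"{name}: {getattr(hsp, name)():.3f}")
```

With these made-up numbers the identity over the whole query is just under
50%, which on its own would not tell you that the aligned region is 83%
identical and covers a little over half of the query.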
Bio::Tools::Blast::Sbjct has frac_aligned_query() and frac_aligned_sbjct()
methods that compute the fraction of the query or hit sequence that is
aligned. But these are summary statistics, created by tiling all HSPs
together. I noticed that there aren't such methods on Bio::Tools::Blast::HSP
and there probably should be.
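
For what "tiling" means here, a rough sketch in Python: merge the HSP
coordinate ranges on one sequence so that overlapping regions are counted
once, then divide the covered length by the sequence length. The function
name tiled_fraction and the coordinates are hypothetical, and this is only
an illustration of the idea, not how Bio::Tools::Blast::Sbjct actually
implements it.

```python
def tiled_fraction(spans, seq_len):
    """Fraction of a sequence covered by the union of HSP spans.

    spans   -- iterable of (start, end) pairs, 1-based and inclusive
    seq_len -- full length of the sequence the spans refer to
    """
    merged = []
    for start, end in sorted(spans):
        # Extend the previous block if this span overlaps or abuts it.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start + 1 for start, end in merged)
    return covered / seq_len


# Three hypothetical HSPs on a 200-residue query; the first two overlap.
query_spans = [(1, 50), (40, 90), (120, 150)]
summary = tiled_fraction(query_spans, 200)
per_hsp = [tiled_fraction([span], 200) for span in query_spans]
print(summary, per_hsp)
```

The per-HSP values are what a method on the HSP object itself would
report, while the tiled value is the hit-level summary.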
Steve
> -----Original Message-----
> From: Jason Stajich [mailto:jason@chg.mc.duke.edu]
> Sent: Wednesday, July 11, 2001 10:45 AM
> To: simon potter
> Cc: bioperl-l@bioperl.org
> Subject: Re: [Bioperl-l] Bio::Tools::BPlite::HSP - percentage IDs
>
>
> Hmm, I really hadn't ever looked at that code - What are the arguments for
> or against.
>
> We can have a flag be twiddled to determine whether or not to use
> subject or query, but are both technically correct?
>
> -Jason
>
> On Wed, 11 Jul 2001, simon potter wrote:
>
> > Hello.
> >
> > I've a question about parsing Blast output and how to get percentage
> > sequence identity.
> >
> > In HSP.pm it is calculated by dividing the number of matches by the query
> > seq length, rather than the subject seq length (i.e. a re-calculation of
> > the % given in the blast output).
> >
> > Is there a reason for calculating it in this way? I've talked to people
> > around here and the general feeling is that it's better to calculate wrt
> > the subject seq.
> >
> > Question is - what to do about this? Is this something we should change -
> > maybe a solution is to provide a choice for how we %id? What do people
> > think?
> >
> > Thanks,
> >
> > Simon Potter,
> > EnsEMBL team, Sanger
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center
> http://www.chg.duke.edu/