[Bioperl-l] BPlite percent_id bug

Jerm jerm@fugu-sg.org
Wed, 16 Jan 2002 11:51:50 +0800


I've noticed a bug in the way BPlite parses BLAST output files (the
branch I'm using is Bioperl-072).

The percentage_ids are calculated by dividing the number of identical
matches by the query sequence length.

So for example, 
-------------------------------------------------------------------------
#  Plus Strand HSPs:
#
# Score = 247 (92.0 bits), Expect = 8.6e-88, Sum P(10) = 8.6e-88
# Identities = 48/64 (75%), Positives = 57/64 (89%), Frame = +3 / +1
#
#Query: 37125 LQTVICSYVFFQGFLNLKWSRFARVVLTRSIAIIPTLLVAVFQDVEHLTGMNDFLNVLQS 37304
#             L+ ++C     QGFLNL+WSRFARV+LTRS+AI PTLLVA+FQD++HLTGMNDFLNVLQS
#Sbjct:  3520 LKVLVC----LQGFLNLRWSRFARVLLTRSLAITPTLLVAIFQDIQHLTGMNDFLNVLQS 3687
#
#Query: 37305 LQVR 37316
#             LQVR
#Sbjct:  3688 LQVR 3699
-------------------------------------------------------------------------

The $match (48) is parsed out from the file and divided by the
$qlength (37316 - 37125 + 1 = 192), so the perc_id for this HSP comes
out as 48/192 = 25%.

But this BLAST output is from tblastx, which means the qlength is in
nucleotides (NT) while the number of matches is in amino acids (AA).
The perc_id is obviously incorrect.
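
To make the unit mismatch concrete, here is a small Python sketch of the
arithmetic (it is not BPlite code, and the variable names are my own). It
uses the coordinates printed in the HSP, 37125 to 37316, and the counts
from the "Identities = 48/64" line. Dividing the nucleotide span by three
is only an illustration of the unit conversion; in general the denominator
BLAST reports is the alignment length, which also counts gap columns.

```python
# Values copied from the HSP quoted in this message.
query_start, query_end = 37125, 37316   # query coordinates, in nucleotides
identities, align_len = 48, 64          # "Identities = 48/64", in amino acids

# Query span in nucleotides, as a length computed from the coordinates.
query_span_nt = query_end - query_start + 1

# Mixing units: amino-acid matches divided by a nucleotide length.
naive_pct = 100 * identities / query_span_nt

# Same span expressed in codons (illustrative conversion for tblastx).
query_span_aa = query_span_nt // 3
codon_pct = 100 * identities / query_span_aa

# What BLAST itself reports: matches over alignment length.
reported_pct = 100 * identities / align_len

print(f"{naive_pct:.0f}% {codon_pct:.0f}% {reported_pct:.0f}%")  # 25% 75% 75%
```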

Is there a reason why the perc_id is not parsed out from the file directly
(75%) instead?
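
As a rough illustration of what parsing it directly could look like, the
Python sketch below pulls the three numbers out of the "Identities" line
with a regular expression. The helper name parse_identities and the
pattern are assumptions of mine, not anything taken from BPlite, and the
pattern has only been checked against the line quoted in this message.

```python
import re

# Hypothetical pattern (not from BPlite): "Identities = 48/64 (75%)".
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")

def parse_identities(line):
    """Return (matches, alignment_length, percent), or None if absent."""
    m = IDENT_RE.search(line)
    if m is None:
        return None
    matches, length, percent = (int(g) for g in m.groups())
    return matches, length, percent

line = "# Identities = 48/64 (75%), Positives = 57/64 (89%), Frame = +3 / +1"
print(parse_identities(line))  # (48, 64, 75)
```

Reading the percentage this way avoids any dependence on whether the
query coordinates are in nucleotides or amino acids.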


Jer-Ming Chia
Fugu Informatics
Singapore