[Bioperl-l] Trying to get a conservation index out of a MSA
Téletchéa Stéphane
stephane.teletchea at inserm.fr
Mon Feb 3 19:26:02 UTC 2014
Dear all,
I want to produce an index of conserved residues for one protein, here
is what I do:
a) blast for it on UNIREF90
b) from the blast results, transform it into a FASTA file, and re-align
it using muscle/clustalo
c) read the output MSA, and parse it like this:
my $prot="P35613";
my $offset = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, 1);
for ( my $i=1 ; $i < ( $aln->length()-$offset ); $i++)
{
my $pos =
$aln->column_from_residue_number('lcl|UniRef90_'.$prot, $i);
my $mini_aln = $aln->slice($i,$i,1);
my $mini_id = $mini_aln->percentage_identity;
my $id = floor($mini_id / 10);
# If we have a 100% conservation, this means we have not
sampled enough
# the sequence space, so be less affirmative...
$id = 9 if $id = 10;
printf "%s -> %d\n",substr($seq,$i-$offset,1), $id;
}
The problem is that all my conservation scores are 100%, although from
the real
alignement output by muscle or clustalo, this is not true, as verified
in Jalview :-)
I think the culprit lies in the name detection (I'm using bioperl all
the way, though),
since when I want to parse the MSA from muscle/clustalo, I get a lot of :
--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_L5L884/1-0]
---------------------------------------------------
--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_UPI0003448243/1-0]
---------------------------------------------------
--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_UPI0002C45112/1-0]
---------------------------------------------------
The identifiers are typical of Uniref records, and as the "Uniref" term
means,
they are unique :-)
I think there is a problem in the heuristics of Uniref parsing, since
the only
messages I saw concerning this issue was about names being equal:
- http://bioperl.org/pipermail/bioperl-l/2005-July/019313.html
-
http://www.bioperl.org/wiki/Adding_duplicate_sequences_to_an_alignment_object
A) Is there a better way of doing this? (Getting the %id per position of
the reference sequence ?)
B) If not, of for the glory, is there an improvement possible to avoid
the message? I can provide
a more complete example if needed.
Thanks a lot in advance for you help,
Stéphane
--
Equipe DSIMB - Dynamique des Structures et
des Interactions des Macromolécules Biologiques
INTS, INSERM-Paris-Diderot UMR_S 1134
6 rue Alexandre Cabanel - 75739 Paris cedex 15 - France
Tél : +33 144 493 057
Fax : +33 143 065 019
http://www.dsimb.inserm.fr / http://www.steletch.org
More information about the Bioperl-l
mailing list