[Bioperl-l] Trying to get a conservation index out of a MSA

Mon Feb 3 19:26:02 UTC 2014

Dear all,

I want to produce an index of conserved residues for one protein, here 
is what I do:
a) blast for it on UNIREF90
b) from the blast results, transform it into a FASTA file, and re-align 
it using muscle/clustalo
c) read the output MSA, and parse it like this:

  my $prot="P35613";
  my $offset = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, 1);

     for ( my $i=1 ; $i < ( $aln->length()-$offset ); $i++)
     {
         my $pos = 
$aln->column_from_residue_number('lcl|UniRef90_'.$prot, $i);
         my $mini_aln = $aln->slice($i,$i,1);
         my $mini_id = $mini_aln->percentage_identity;
         my $id = floor($mini_id / 10);
         # If we have a 100% conservation, this means we have not 
sampled enough
         # the sequence space, so be less affirmative...
         $id = 9 if $id = 10;
         printf "%s -> %d\n",substr($seq,$i-$offset,1), $id;
     }

The problem is that all my conservation scores are 100%, although from 
the real
alignement output by muscle or clustalo, this is not true, as verified 
in Jalview :-)

I think the culprit lies in the name detection (I'm using bioperl all 
the way, though),
since when I want to parse the MSA from muscle/clustalo, I get a lot of :

--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_L5L884/1-0]

---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_UPI0003448243/1-0]

---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Replacing one sequence [lcl|UniRef90_UPI0002C45112/1-0]

---------------------------------------------------

The identifiers are typical of Uniref records, and as the "Uniref" term 
means,
they are unique :-)

I think there is a problem in the heuristics of Uniref parsing, since 
the only
messages I saw concerning this issue was about names being equal:
- http://bioperl.org/pipermail/bioperl-l/2005-July/019313.html
- 
http://www.bioperl.org/wiki/Adding_duplicate_sequences_to_an_alignment_object

A) Is there a better way of doing this? (Getting the %id per position of 
the reference sequence ?)
B) If not, of for the glory, is there an improvement possible to avoid 
the message? I can provide
a more complete example if needed.

Thanks a lot in advance for you help,

Stéphane

-- 
Equipe DSIMB - Dynamique des Structures et
des Interactions des Macromolécules Biologiques
INTS, INSERM-Paris-Diderot UMR_S 1134
6 rue Alexandre Cabanel - 75739 Paris cedex 15 - France
Tél : +33 144 493 057
Fax : +33 143 065 019
http://www.dsimb.inserm.fr / http://www.steletch.org