[Bioperl-l] Trying to get a conservation index out of a MSA

Mon Feb 3 21:39:32 UTC 2014

Stephane,

Do you see the same issue when "-displayname_flat => 1”?

I’m talking about when you create $aln.

Brian O.

On Feb 3, 2014, at 2:26 PM, Téletchéa Stéphane <stephane.teletchea at inserm.fr> wrote:

> Dear all,
> 
> I want to produce an index of conserved residues for one protein, here is what I do:
> a) blast for it on UNIREF90
> b) from the blast results, transform it into a FASTA file, and re-align it using muscle/clustalo
> c) read the output MSA, and parse it like this:
> 
> my $prot="P35613";
> my $offset = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, 1);
> 
>    for ( my $i=1 ; $i < ( $aln->length()-$offset ); $i++)
>    {
>        my $pos = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, $i);
>        my $mini_aln = $aln->slice($i,$i,1);
>        my $mini_id = $mini_aln->percentage_identity;
>        my $id = floor($mini_id / 10);
>        # If we have a 100% conservation, this means we have not sampled enough
>        # the sequence space, so be less affirmative...
>        $id = 9 if $id = 10;
>        printf "%s -> %d\n",substr($seq,$i-$offset,1), $id;
>    }
> 
> The problem is that all my conservation scores are 100%, although from the real
> alignement output by muscle or clustalo, this is not true, as verified in Jalview :-)
> 
> I think the culprit lies in the name detection (I'm using bioperl all the way, though),
> since when I want to parse the MSA from muscle/clustalo, I get a lot of :
> 
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_L5L884/1-0]
> 
> ---------------------------------------------------
> 
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_UPI0003448243/1-0]
> 
> ---------------------------------------------------
> 
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_UPI0002C45112/1-0]
> 
> ---------------------------------------------------
> 
> The identifiers are typical of Uniref records, and as the "Uniref" term means,
> they are unique :-)
> 
> I think there is a problem in the heuristics of Uniref parsing, since the only
> messages I saw concerning this issue was about names being equal:
> - http://bioperl.org/pipermail/bioperl-l/2005-July/019313.html
> - http://www.bioperl.org/wiki/Adding_duplicate_sequences_to_an_alignment_object
> 
> A) Is there a better way of doing this? (Getting the %id per position of the reference sequence ?)
> B) If not, of for the glory, is there an improvement possible to avoid the message? I can provide
> a more complete example if needed.
> 
> Thanks a lot in advance for you help,
> 
> Stéphane
> 
> -- 
> Equipe DSIMB - Dynamique des Structures et
> des Interactions des Macromolécules Biologiques
> INTS, INSERM-Paris-Diderot UMR_S 1134
> 6 rue Alexandre Cabanel - 75739 Paris cedex 15 - France
> Tél : +33 144 493 057
> Fax : +33 143 065 019
> http://www.dsimb.inserm.fr / http://www.steletch.org
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l