[Bioperl-l] Trying to get a conservation index out of a MSA
Brian Osborne
bosborne11 at verizon.net
Mon Feb 3 21:39:32 UTC 2014
Stephane,
Do you see the same issue when "-displayname_flat => 1”?
I’m talking about when you create $aln.
Brian O.
On Feb 3, 2014, at 2:26 PM, Téletchéa Stéphane <stephane.teletchea at inserm.fr> wrote:
> Dear all,
>
> I want to produce an index of conserved residues for one protein, here is what I do:
> a) blast for it on UNIREF90
> b) from the blast results, transform it into a FASTA file, and re-align it using muscle/clustalo
> c) read the output MSA, and parse it like this:
>
> my $prot="P35613";
> my $offset = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, 1);
>
> for ( my $i=1 ; $i < ( $aln->length()-$offset ); $i++)
> {
> my $pos = $aln->column_from_residue_number('lcl|UniRef90_'.$prot, $i);
> my $mini_aln = $aln->slice($i,$i,1);
> my $mini_id = $mini_aln->percentage_identity;
> my $id = floor($mini_id / 10);
> # If we have a 100% conservation, this means we have not sampled enough
> # the sequence space, so be less affirmative...
> $id = 9 if $id = 10;
> printf "%s -> %d\n",substr($seq,$i-$offset,1), $id;
> }
>
> The problem is that all my conservation scores are 100%, although from the real
> alignement output by muscle or clustalo, this is not true, as verified in Jalview :-)
>
> I think the culprit lies in the name detection (I'm using bioperl all the way, though),
> since when I want to parse the MSA from muscle/clustalo, I get a lot of :
>
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_L5L884/1-0]
>
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_UPI0003448243/1-0]
>
> ---------------------------------------------------
>
> --------------------- WARNING ---------------------
> MSG: Replacing one sequence [lcl|UniRef90_UPI0002C45112/1-0]
>
> ---------------------------------------------------
>
> The identifiers are typical of Uniref records, and as the "Uniref" term means,
> they are unique :-)
>
> I think there is a problem in the heuristics of Uniref parsing, since the only
> messages I saw concerning this issue was about names being equal:
> - http://bioperl.org/pipermail/bioperl-l/2005-July/019313.html
> - http://www.bioperl.org/wiki/Adding_duplicate_sequences_to_an_alignment_object
>
> A) Is there a better way of doing this? (Getting the %id per position of the reference sequence ?)
> B) If not, of for the glory, is there an improvement possible to avoid the message? I can provide
> a more complete example if needed.
>
> Thanks a lot in advance for you help,
>
> Stéphane
>
> --
> Equipe DSIMB - Dynamique des Structures et
> des Interactions des Macromolécules Biologiques
> INTS, INSERM-Paris-Diderot UMR_S 1134
> 6 rue Alexandre Cabanel - 75739 Paris cedex 15 - France
> Tél : +33 144 493 057
> Fax : +33 143 065 019
> http://www.dsimb.inserm.fr / http://www.steletch.org
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list