[Bioperl-l] Generating a consensus sequence from a Clustal alignment

Tue Oct 26 19:15:09 UTC 2010

I've got my script running.

In order to support the identification of degenerate primers I had to create
a new method that would alter the IUPAC consensus. We're interested in
15-30bp sections of the consensus with high, but not 100%, conservation. We
would like to have a couple of "slices" with 2-3 mismatches to allow for the
ordering of a variety of primers.

To be completed:

   - I would *love* to hide the Clustal output that is being printed to the
   screen
   - Primer candidate identification (min_primerlength to max_primerlength)
   based on our scoring matrix
   - Primer melting temperature calculation
   - 3' Clamping identification
   - Self complementary primers

This has been a nice change from Java.

Bill

*Code*

assume: I have already read fasta input, performed Clustal align and
obtained the IUPAC consensus

sub getDegenerateConsensusIupac {
    my ($aln, $maxGapPercent, $iupacConsensus, $numSequences) = @_;

  # convert consensus into array
  my @cons = split(//, $iupacConsensus);

  # loop over the alignment "columns"
  foreach my $position ( 1 .. $aln->length() ) {
      my $consCode = $cons[$position-1];
      my %slice = getSliceHash($aln, $position);

      my $out = _applyGapPercentToConsensus ($maxGapPercent, $consCode,
$numSequences, %slice);

      # Convert consensus value new value
      if ($out ne $consCode) {
          $cons[$position-1] = $out;
      }
  }

  #rebuild consensus
  my $degenerateConsensus = join '', @cons;

  return $degenerateConsensus;
}

sub _applyGapPercentToConsensus {
    my ($maxGapPercent, $consCode, $numSequences, %slice) = @_;

        my $gapCount = $slice{'.'};

        # no gaps!
        if ($gapCount !~ /\d/) {
            return $consCode;
        }

        my $pctGap = ($gapCount / $numSequences) * 100;

        if ($pctGap > $maxGapPercent) {
            if ($consCode =~ /[A-Z]/) {
                return lc $consCode;
            }
            if ($consCode =~ /[a-z]/) {
                return '.';
            }
        }

    return $consCode;
}

sub getSliceHash{
    my ($aln, $position) = @_;

    my %sliceHash;
    foreach my $seq ($aln->each_seq) {
        my $res = $seq->subseq($position, $position);
        $sliceHash{$res}++;
    }

    return %sliceHash;
}

On Tue, Oct 19, 2010 at 11:47 AM, Chris Fields <cjfields at illinois.edu>wrote:

> Bio::SimpleAlign is the class that contains the alignment data; it does not
> generate the alignment for you.  You can use modules from BioPerl-Run that
> run ClustalW, MUSCLE, T-Coffee, etc to get a Bio::SimpleAlign instance, or
> parse the already-generated alignment output via Bio::AlignIO.
>
> From the Bio::Tools::Run::Alignment::ClustalW docs:
>
> =================================================================
>
>  #  Build a clustalw alignment factory
>  @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
>  $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
>
>  #  Pass the factory a list of sequences to be aligned.
>  $inputfilename = 't/data/cysprot.fa';
>  $aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
>
>  # ...or
>  $seq_array_ref = \@seq_array;
>
>  # where @seq_array is an array of Bio::Seq objects
>  $aln = $factory->align($seq_array_ref);
>
> =================================================================
>
> $aln is a Bio::SimpleAlign derived from ClustalW output.
>
> From Bio::SimpleAlign (note the use of Bio::AlignIO):
>
> =================================================================
>
>  # Use Bio::AlignIO to read in the alignment
>  $str = Bio::AlignIO->new(-file => 't/data/testaln.pfam');
>  $aln = $str->next_aln();
>
>  # Describe
>  print $aln->length;
>  print $aln->num_residues;
>  print $aln->is_flush;
>  print $aln->num_sequences;
>  print $aln->score;
>  print $aln->percentage_identity;
>  print $aln->consensus_string(50);
>
> =================================================================
>
> Note the consensus_string() method:
>
>  Title     : consensus_string
>  Usage     : $str = $ali->consensus_string($threshold_percent)
>  Function  : Makes a strict consensus
>  Returns   : Consensus string
>  Argument  : Optional treshold ranging from 0 to 100.
>             The consensus residue has to appear at least threshold %
>             of the sequences at a given location, otherwise a '?'
>             character will be placed at that location.
>             (Default value = 0%)
>
> chris
>
> On Oct 18, 2010, at 7:02 PM, Bill Stephens wrote:
>
> > So, I've got the SimpleAlign running. It looks like it's running the
> > alignment based upon the input sequence location only (first residue from
> > each sequence).  This is not what I need.
> >
> > I'm back to to clustal, tcoffee or dalign.
> >
> > Bill
> >
> > On Mon, Oct 18, 2010 at 12:09 PM, Jun Yin <jun.yin at ucd.ie> wrote:
> >
> >> Hi, Bill,
> >>
> >> You may consider to use consensus_iupac or consensus_string methods in
> >> Bio::SimpleAlign to generate consensus sequence.
> >>
> >> Cheers,
> >> Jun Yin
> >> Ph.D. student in U.C.D.
> >>
> >> Bioinformatics Laboratory
> >> Conway Institute
> >> University College Dublin
> >>
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org
> >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> >> Sent: Monday, October 18, 2010 3:55 PM
> >> To: Bill Stephens
> >> Cc: bioperl-l at bioperl.org
> >> Subject: Re: [Bioperl-l] Generating a consensus sequence from a Clustal
> >> alignment
> >>
> >> Bill,
> >>
> >> Actually, the page you reached at Pasteur is not Pise, but Mobyle (their
> >> replacement for the older Pise tools).  The Pise modules were in
> >> BioPerl-Run, but they were deprecated a few years ago and removed from
> the
> >> latest BioPerl-Run releases b/c the remote service is no longer active;
> >> there is no Perl-based replacement for Mobyle interaction.
> >>
> >> Have you thought about just using the functionality within the
> >> Bio::SimpleAlign class to generate the consensus?  I'm pretty sure there
> >> are
> >> methods in place to do that.
> >>
> >> chris
> >>
> >> On Oct 18, 2010, at 8:58 AM, Bill Stephens wrote:
> >>
> >>> All,
> >>>
> >>> I'm in my first week with bioperl for a class project (although I've
> used
> >>> Perl for years). I've successfully run a clustal alignment of several
> DNA
> >>> sequences to produce the aln and dnd files. Now I would like to
> generate
> >> a
> >>> consensus sequence from the alignment.  I see that Pise Cons does this
> >>> satisfactorily on my example data (
> >>> http://mobyle.pasteur.fr/cgi-bin/portal.py?form=consensus) . However,
> >> I'm
> >>> not finding Bio::Tools::Run::PiseApplication::cons in the 1.6.1
> >> distribution
> >>> that I installed.
> >>>
> >>> Is this another module that I need to install separately?
> >>>
> >>> "cpan[2]> m /Pise/
> >>> Module    Bio::Tools::Run::AnalysisFactory::Pise
> >>> (BIRNEY/bioperl-run-1.4.tar.gz)
> >>> Module    Bio::Tools::Run::PiseApplication
> >> (BIRNEY/bioperl-run-1.4.tar.gz)"
> >>>
> >>> Bill S.
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> __________ Information from ESET Smart Security, version of virus
> signature
> >> database 5377 (20100818) __________
> >>
> >> The message was checked by ESET Smart Security.
> >>
> >> http://www.eset.com
> >>
> >>
> >>
> >>
> >> __________ Information from ESET Smart Security, version of virus
> signature
> >> database 5377 (20100818) __________
> >>
> >> The message was checked by ESET Smart Security.
> >>
> >> http://www.eset.com
> >>
> >>
> >>
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>