[Bioperl-l] Writing a Bio::Assembly::Contig using Bio::AlignIO

Mon Jul 5 18:15:00 EDT 2004

Hello all,

I have a couple of questions that arose during a brief discussion with
Jason Stajich a while back (Jason, thanks very much for you input, I
just got a chance to come back to this), as well as a diff of the
changes I've made so far.

What did I want to do:  
Some time back I wanted to display the Assembly in an ace file produced
by phrap.  Looking at the documentation I saw that Bio::Assembly::Contig
ISA Bio::Align::AlignI, so I thought I should be able to write it to a
file using Bio::AlignIO.  

Problems I encountered:  
There turned out to be two problems with this: 1) many of the
Bio::Align::AlignI methods were not implemented in Bio::Assembly::Contig
and 2) alignments in Bio::Assembly::Contig are not flush.

Possible solution:  
I went ahead and implemented the missing methods in
Bio::Assembly::Contig and added a pad method to makes the alignment
flush and an unpad method to undo the effects of pad (similar to how
match/unmatch work).  The diff output is at the end of this email.

Usage example:
my $in = Bio::Assembly::IO->new  (-file => "contigs.ace",
                                  -format => "ace");

my $out = Bio::AlignIO->new (-file => ">test.out",
                             -format => "clustalw");

while (my $aln = $in->next_assembly()) {
  my @contigs = $aln->all_contigs();
  if (scalar @contigs == 0) { last; }
  foreach my $contig (@contigs) {
    $contig->pad();
    $out->write_aln($contig);
    $contig->unpad();
  }
}

This seems to work for all the AlignIO modules which implemented
write_aln.

Additional questions:  
Jason and I talked about whether or not pad/unpad should be contig
specific, but after looking through the other methods present in AlignI
I went ahead and added pad/unpad there because it seemed correct to
me...  this could be removed if others disagree.

> I think leaving the stuff as contig.pm specific for now but perhaps
> re-posting the proposal to the bioperl list will elicit more response
> (hopefully).

The other question that arose during my exchange with Jason was whether
or not modules such as Bio::AlignIO::clustalw should test for flush.
Here is Jason's suggestion, however I haven't implemented it because I'm
not familiar enough with the alignment specs to know if it is a good
idea (and Jason didn't seem 100% certain)...  What do others think?

> As for the testing for flush in clustalw etc, I'm not sure what the
> alignment specs say - perhaps we can set it up so that there is a
> "strictness" flag for AlignIO which is set to on by default and will
> test for flushness.  I imagine it is hard to do things like fasta MSA
> alignments as unflush so it would be critical to have it test there...

Thank you...

Awaiting your comments,
Chris

Diff output:

Index: Bio/Align/AlignI.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/Align/AlignI.pm,v
retrieving revision 1.9
diff -r1.9 AlignI.pm
426a427,462
> =head2 pad
>
>  Title     : pad()
>  Usage     : $ali->pad()
>  Function  :
>
>              Goes through all sequences and pads them so that
>              the alignment is flush.
>
>  Returns   : 1
>  Argument  : a pad character, optional, defaults to '-'
>
> =cut
>
> sub pad {
>     my ($self) = @_;
>     $self->throw_not_implemented();
> }
>
> =head2 unpad
>
>  Title     : unpad()
>  Usage     : $ali->unpad()
>  Function  :
>
>              Undoes the effect of method pad.
>
>  Returns   : 1
>  Argument  :
>
> =cut
>
> sub unpad {
>     my ($self) = @_;
>     $self->throw_not_implemented();
> }
Index: Bio/Assembly/Contig.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/Assembly/Contig.pm,v
retrieving revision 1.4
diff -r1.4 Contig.pm
210c210,244
< use vars qw(@ISA);
---
> use vars qw(@ISA %CONSERVATION_GROUPS);
> use Dumpvalue;
>
> BEGIN {
>     # This data should probably be in a more centralized module...
>     # it is taken from Clustalw documentation
>     # These are all the positively scoring groups that occur in the
>     # Gonnet Pam250 matrix. The strong and weak groups are
>     # defined as strong score >0.5 and weak score =<0.5 respectively.
>
>     %CONSERVATION_GROUPS = ( 'strong' => [ qw(STA
>                    NEQK
>                    NHQK
>                    NDEQ
>                    QHRK
>                    MILV
>                    MILF
>                    HY
>                    FYW)
>                      ],
>             'weak' => [ qw(CSA
>                       ATV
>                       SAG
>                       STNK
>                       STPA
>                       SGND
>                       SNDEQK
>                       NDEQHK
>                       NEQHRK
>                       FVLIM
>                       HFY) ],
>             );
>
> }
>
1304c1338
<     my $seqID = $self->{'_order'}->{--$pos};
---
>     my $seqID = $self->{'_order'}->{$pos};
1404,1405c1438,1451
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my $self = shift;
>     my $from = shift;
>     my $to   = shift;
>     my ($seq,$temp);
>
>     $self->throw("Need exactly two arguments")
>       unless defined $from and defined $to;
>
>     foreach $seq ( $self->each_seq() ) {
>       $temp = $seq->seq();
>       $temp =~ s/$from/$to/g;
>       $seq->seq($temp);
>     }
>     return 1;
1407a1454
>
1436,1437c1483,1590
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self,$matchlinechar, $strong, $weak) = @_;
>
>     my %matchchars = ( 'match'  => $matchlinechar || '*',
>              'weak'   => $weak          || '.',
>              'strong' => $strong        || ':',
>              'mismatch'=> ' ',
>           );
>
>     if (!$self->is_flush()) {
>       $self->throw ("Producing a matchline for an alignment which is
not flush
 doesn't make sense IMO - Chris.");
>       return;
>     }
>
>     my @seqchars;
>     my $seqcount = 0;
>     my $alphabet;
>     foreach my $seq ( $self->each_seq ) {
>       push @seqchars, [ split(//, uc ($seq->seq)) ];
>       $alphabet = $seq->alphabet unless defined $alphabet;
>     }
>     my $refseq = shift @seqchars;
>     # let's just march down the columns
>     my $matchline;
>     POS: foreach my $pos ( 0..$self->length ) {
>    my $refchar = $refseq->[$pos];
>    next unless $refchar; # skip ''
>    my %col = ($refchar => 1);
>    my $dash = ($refchar eq '-' || $refchar eq '.' || $refchar eq ' ');
>    foreach my $seq ( @seqchars ) {
>        $dash = 1 if( $seq->[$pos] eq '-' || $seq->[$pos] eq '.' ||
>            $seq->[$pos] eq ' ' );
>        $col{$seq->[$pos]}++;
>    }
>    my @colresidues = sort keys %col;
>    my $char = $matchchars{'mismatch'};
>    # if all the values are the same
>    if( $dash ) { $char =  $matchchars{'mismatch'} }
>    elsif( @colresidues == 1 ) { $char = $matchchars{'match'} }
>    elsif( $alphabet eq 'protein' ) { # only try to do weak/strong
>                                      # matches for protein seqs
>        TYPE: foreach my $type ( qw(strong weak) ) {
>                 # iterate through categories
>       my %groups;
>       # iterate through each of the aa in the col
>       # look to see which groups it is in
>       foreach my $c ( @colresidues ) {
>           foreach my $f ( grep /\Q$c/, @{$CONSERVATION_GROUPS{$type}}
) {
>          push @{$groups{$f}},$c;
>           }
>       }
>       GRP: foreach my $cols ( values %groups ) {
>           @$cols = sort @$cols;
>           # now we are just testing to see if two arrays
>           # are identical w/o changing either one
>
>           # have to be same len
>           next if( scalar @$cols != scalar @colresidues );
>           # walk down the length and check each slot
>           for($_=0;$_ < (scalar @$cols);$_++ ) {
>          next GRP if( $cols->[$_] ne $colresidues[$_] );
>           }
>           $char = $matchchars{$type};
>           last TYPE;
>       }
>        }
>      }
>    $matchline .= $char;
>     }
>
>     return $matchline;
>
> }
>
> =head2 pad
>
>  Title     : pad
>  Usage     : if( !$contig->is_flush() ) {
>            :   $contig->pad();
>            :
>  Function  : Pad the sequences in the alignment to make it flush
>            : Does nothing if the alignment has already been padded
>            :
>  Returns   : 1
>  Argument  : $pad_char, optional, default '-'
>
> =cut
>
> sub pad {
>   my ($self,$pad_char) = @_;
>
>   $pad_char ||= '-';
>
>   if (!$self->{'pad_seq'}) {
>     $self->{'pad_seq'} = 1;
>     foreach my $seq ( $self->each_seq ) {
>       my $coord = $self->get_seq_coord($seq);
>       my $tmp_seq = $seq->seq();
>       for (my $i = 1; $i < $coord->start(); $i++) {
>          $tmp_seq = $pad_char.$tmp_seq;
>       }
>       for (my $i = $coord->end(); $i < $self->length(); $i++) {
>          $tmp_seq .= $pad_char;
>       }
>       $seq->seq($tmp_seq);
>     }
>   }
>
>   return 1;
1439a1593,1623
> =head2 unpad
>
>  Title     : unpad
>  Usage     : $contig->unpad();
>            :
>  Function  : undo the effects of the pad method.
>            : does nothing if the alignment has not been padded
>            :
>  Returns   : 1
>  Argument  :
>
> =cut
>
> sub unpad {
>   my ($self) = @_;
>
>   if ($self->{'pad_seq'}) {
>     $self->{'pad_seq'} = 0;
>     my $dumper = new Dumpvalue();
>     foreach my $seq ( $self->each_seq ) {
>       my $coord = $self->get_seq_coord($seq);
>       my @chars = split //, $seq->seq();
>       @chars = splice @chars, $coord->start()-1,
>                               $coord->end() - $coord->start() + 1;
>       $seq->seq(join '', @chars);
>     }
>   }
>   return 1;
> }
>
>
1440a1625
>     $self->throw_not_implemented();
1461,1462c1646,1676
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self, $match) = @_;
>
>     if (!$self->is_flush()) {
>       $self->throw ("Running match on an alignment which is not flush
doesn't
make sense IMO - Chris");
>       return;
>     }
>
>     $match ||= '.';
>     my ($matching_char) = $match;
>     $matching_char = "\\$match" if $match =~ /[\^.$|()\[\]]/ ;  #';
>     $self->map_chars($matching_char, '-');
>
>     my @seqs = $self->each_seq();
>     return 1 unless scalar @seqs > 1;
>
>
>     my $refseq = shift @seqs ;
>     my @refseq = split //, $refseq->seq;
>     my $gapchar = $self->gap_char;
>
>     foreach my $seq ( @seqs ) {
>       my @varseq = split //, $seq->seq();
>       for ( my $i=0; $i < scalar @varseq; $i++) {
>           $varseq[$i] = $match if defined $refseq[$i] &&
>               ( $refseq[$i] =~ /[a-zA-Z\*]/ ||
>                 $refseq[$i] =~ /$gapchar/ )
>                     && $refseq[$i] eq $varseq[$i];
>       }
>       $seq->seq(join '', @varseq);
>     }
>     $self->match_char($match);
>     return 1;
1525,1526c1739,1746
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self, $char) = @_;
>
>     if (defined $char ) {
>       $self->throw("Single missing character, not [$char]!") if
CORE::length($
char) > 1;
>       $self->{'_missing_char'} = $char;
>     }
>
>     return $self->{'_missing_char'};
1540,1541c1760,1767
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self, $char) = @_;
>
>     if (defined $char ) {
>       $self->throw("Single match character, not [$char]!") if
CORE::length($ch
ar) > 1;
>       $self->{'_match_char'} = $char;
>     }
>
>     return $self->{'_match_char'};
1555,1556c1781,1788
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self, $char) = @_;
>
>     if (defined $char || ! defined $self->{'_gap_char'} ) {
>       $char= '-' unless defined $char;
>       $self->throw("Single gap character, not [$char]!") if
CORE::length($char
) > 1;
>       $self->{'_gap_char'} = $char;
>     }
>     return $self->{'_gap_char'};
1570,1571c1802,1816
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>    my ($self,$includeextra) = @_;
>
>    unless ($self->{'_symbols'}) {
>        foreach my $seq ($self->each_seq) {
>            map { $self->{'_symbols'}->{$_} = 1; } split(//,$seq->seq);
>        }
>    }
>    my %copy = %{$self->{'_symbols'}};
>    if( ! $includeextra ) {
>        foreach my $char ( $self->gap_char, $self->match_char,
>                         $self->missing_char) {
>          delete $copy{$char} if( defined $char );
>        }
>    }
>    return keys %copy;
1641,1642c1886,1908
<     my ($self) = @_;
<     $self->throw_not_implemented();
---
>     my ($self,$report) = @_;
>     my $seq;
>     my $length = (-1);
>     my $temp;
>
>     foreach $seq ( $self->each_seq() ) {
>    if( $length == (-1) ) {
>        $length = CORE::length($seq->seq());
>        next;
>    }
>
>    $temp = CORE::length($seq->seq());
>    if( $temp != $length ) {
>        $self->warn("expecting $length not $temp from ".
>          $seq->display_id) if( $report );
>        $self->debug("expecting $length not $temp from ".
>           $seq->display_id);
>        $self->debug($seq->seq(). "\n");
>        return 0;
>    }
>     }
>     return 1;
>
1658,1659c1924
<
<     $self->throw_not_implemented();
---
>     return $self->get_consensus_length();
1676c1941
< sub maxname_length {
---
> sub maxdisplayname_length {
1678c1943,1955
<     $self->throw_not_implemented();
---
>
>     my $maxname = (-1);
>     my ($seq,$len);
>
>     foreach $seq ( $self->each_seq() ) {
>    $len = CORE::length $self->displayname($seq->get_nse());
>
>    if( $len > $maxname ) {
>        $maxname = $len;
>    }
>     }
>
>     return $maxname;
1680a1958
>
1834c2112,2128
< sub displayname { # Do nothing
---
> sub displayname {
>      my ($self, $name, $disname) = @_;
>
>     $name =~ s/\/\d+-\d+//;
>
>     $self->throw("No sequence with name [$name]")
>         unless defined $self->{'_elem'}{$name};
>
>     if(  $disname and  $name) {
>       $self->{'_dis_name'}->{$name} = $disname;
>       return $disname;
>     }
>     elsif( defined $self->{'_dis_name'}->{$name} ) {
>       return  $self->{'_dis_name'}->{$name};
>     } else {
>       return $name;
>     }
1867c2161,2169
< sub set_displayname_flat { # Do nothing!
---
> sub set_displayname_flat {
>     my $self = shift;
>     my ($nse,$seq);
>
>     foreach $seq ( $self->each_seq() ) {
>       $nse = $seq->get_nse();
>       $self->displayname($nse,$seq->id());
>     }
>     return 1;

----
Christopher T. Lewis
Agriculture and Agri-Food Canada/Agriculture et Agroalimentaire Canada
LewisCT at agr.gc.ca