[Bioperl-l] Writing a Bio::Assembly::Contig using Bio::AlignIO
Lewis, Christopher
LewisCT at AGR.GC.CA
Mon Jul 5 18:15:00 EDT 2004
Hello all,
I have a couple of questions that arose during a brief discussion with
Jason Stajich a while back (Jason, thanks very much for you input, I
just got a chance to come back to this), as well as a diff of the
changes I've made so far.
What did I want to do:
Some time back I wanted to display the Assembly in an ace file produced
by phrap. Looking at the documentation I saw that Bio::Assembly::Contig
ISA Bio::Align::AlignI, so I thought I should be able to write it to a
file using Bio::AlignIO.
Problems I encountered:
There turned out to be two problems with this: 1) many of the
Bio::Align::AlignI methods were not implemented in Bio::Assembly::Contig
and 2) alignments in Bio::Assembly::Contig are not flush.
Possible solution:
I went ahead and implemented the missing methods in
Bio::Assembly::Contig and added a pad method to makes the alignment
flush and an unpad method to undo the effects of pad (similar to how
match/unmatch work). The diff output is at the end of this email.
Usage example:
my $in = Bio::Assembly::IO->new (-file => "contigs.ace",
-format => "ace");
my $out = Bio::AlignIO->new (-file => ">test.out",
-format => "clustalw");
while (my $aln = $in->next_assembly()) {
my @contigs = $aln->all_contigs();
if (scalar @contigs == 0) { last; }
foreach my $contig (@contigs) {
$contig->pad();
$out->write_aln($contig);
$contig->unpad();
}
}
This seems to work for all the AlignIO modules which implemented
write_aln.
Additional questions:
Jason and I talked about whether or not pad/unpad should be contig
specific, but after looking through the other methods present in AlignI
I went ahead and added pad/unpad there because it seemed correct to
me... this could be removed if others disagree.
> I think leaving the stuff as contig.pm specific for now but perhaps
> re-posting the proposal to the bioperl list will elicit more response
> (hopefully).
The other question that arose during my exchange with Jason was whether
or not modules such as Bio::AlignIO::clustalw should test for flush.
Here is Jason's suggestion, however I haven't implemented it because I'm
not familiar enough with the alignment specs to know if it is a good
idea (and Jason didn't seem 100% certain)... What do others think?
> As for the testing for flush in clustalw etc, I'm not sure what the
> alignment specs say - perhaps we can set it up so that there is a
> "strictness" flag for AlignIO which is set to on by default and will
> test for flushness. I imagine it is hard to do things like fasta MSA
> alignments as unflush so it would be critical to have it test there...
Thank you...
Awaiting your comments,
Chris
Diff output:
Index: Bio/Align/AlignI.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/Align/AlignI.pm,v
retrieving revision 1.9
diff -r1.9 AlignI.pm
426a427,462
> =head2 pad
>
> Title : pad()
> Usage : $ali->pad()
> Function :
>
> Goes through all sequences and pads them so that
> the alignment is flush.
>
> Returns : 1
> Argument : a pad character, optional, defaults to '-'
>
> =cut
>
> sub pad {
> my ($self) = @_;
> $self->throw_not_implemented();
> }
>
> =head2 unpad
>
> Title : unpad()
> Usage : $ali->unpad()
> Function :
>
> Undoes the effect of method pad.
>
> Returns : 1
> Argument :
>
> =cut
>
> sub unpad {
> my ($self) = @_;
> $self->throw_not_implemented();
> }
Index: Bio/Assembly/Contig.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/Assembly/Contig.pm,v
retrieving revision 1.4
diff -r1.4 Contig.pm
210c210,244
< use vars qw(@ISA);
---
> use vars qw(@ISA %CONSERVATION_GROUPS);
> use Dumpvalue;
>
> BEGIN {
> # This data should probably be in a more centralized module...
> # it is taken from Clustalw documentation
> # These are all the positively scoring groups that occur in the
> # Gonnet Pam250 matrix. The strong and weak groups are
> # defined as strong score >0.5 and weak score =<0.5 respectively.
>
> %CONSERVATION_GROUPS = ( 'strong' => [ qw(STA
> NEQK
> NHQK
> NDEQ
> QHRK
> MILV
> MILF
> HY
> FYW)
> ],
> 'weak' => [ qw(CSA
> ATV
> SAG
> STNK
> STPA
> SGND
> SNDEQK
> NDEQHK
> NEQHRK
> FVLIM
> HFY) ],
> );
>
> }
>
1304c1338
< my $seqID = $self->{'_order'}->{--$pos};
---
> my $seqID = $self->{'_order'}->{$pos};
1404,1405c1438,1451
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my $self = shift;
> my $from = shift;
> my $to = shift;
> my ($seq,$temp);
>
> $self->throw("Need exactly two arguments")
> unless defined $from and defined $to;
>
> foreach $seq ( $self->each_seq() ) {
> $temp = $seq->seq();
> $temp =~ s/$from/$to/g;
> $seq->seq($temp);
> }
> return 1;
1407a1454
>
1436,1437c1483,1590
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self,$matchlinechar, $strong, $weak) = @_;
>
> my %matchchars = ( 'match' => $matchlinechar || '*',
> 'weak' => $weak || '.',
> 'strong' => $strong || ':',
> 'mismatch'=> ' ',
> );
>
> if (!$self->is_flush()) {
> $self->throw ("Producing a matchline for an alignment which is
not flush
doesn't make sense IMO - Chris.");
> return;
> }
>
> my @seqchars;
> my $seqcount = 0;
> my $alphabet;
> foreach my $seq ( $self->each_seq ) {
> push @seqchars, [ split(//, uc ($seq->seq)) ];
> $alphabet = $seq->alphabet unless defined $alphabet;
> }
> my $refseq = shift @seqchars;
> # let's just march down the columns
> my $matchline;
> POS: foreach my $pos ( 0..$self->length ) {
> my $refchar = $refseq->[$pos];
> next unless $refchar; # skip ''
> my %col = ($refchar => 1);
> my $dash = ($refchar eq '-' || $refchar eq '.' || $refchar eq ' ');
> foreach my $seq ( @seqchars ) {
> $dash = 1 if( $seq->[$pos] eq '-' || $seq->[$pos] eq '.' ||
> $seq->[$pos] eq ' ' );
> $col{$seq->[$pos]}++;
> }
> my @colresidues = sort keys %col;
> my $char = $matchchars{'mismatch'};
> # if all the values are the same
> if( $dash ) { $char = $matchchars{'mismatch'} }
> elsif( @colresidues == 1 ) { $char = $matchchars{'match'} }
> elsif( $alphabet eq 'protein' ) { # only try to do weak/strong
> # matches for protein seqs
> TYPE: foreach my $type ( qw(strong weak) ) {
> # iterate through categories
> my %groups;
> # iterate through each of the aa in the col
> # look to see which groups it is in
> foreach my $c ( @colresidues ) {
> foreach my $f ( grep /\Q$c/, @{$CONSERVATION_GROUPS{$type}}
) {
> push @{$groups{$f}},$c;
> }
> }
> GRP: foreach my $cols ( values %groups ) {
> @$cols = sort @$cols;
> # now we are just testing to see if two arrays
> # are identical w/o changing either one
>
> # have to be same len
> next if( scalar @$cols != scalar @colresidues );
> # walk down the length and check each slot
> for($_=0;$_ < (scalar @$cols);$_++ ) {
> next GRP if( $cols->[$_] ne $colresidues[$_] );
> }
> $char = $matchchars{$type};
> last TYPE;
> }
> }
> }
> $matchline .= $char;
> }
>
> return $matchline;
>
> }
>
> =head2 pad
>
> Title : pad
> Usage : if( !$contig->is_flush() ) {
> : $contig->pad();
> :
> Function : Pad the sequences in the alignment to make it flush
> : Does nothing if the alignment has already been padded
> :
> Returns : 1
> Argument : $pad_char, optional, default '-'
>
> =cut
>
> sub pad {
> my ($self,$pad_char) = @_;
>
> $pad_char ||= '-';
>
> if (!$self->{'pad_seq'}) {
> $self->{'pad_seq'} = 1;
> foreach my $seq ( $self->each_seq ) {
> my $coord = $self->get_seq_coord($seq);
> my $tmp_seq = $seq->seq();
> for (my $i = 1; $i < $coord->start(); $i++) {
> $tmp_seq = $pad_char.$tmp_seq;
> }
> for (my $i = $coord->end(); $i < $self->length(); $i++) {
> $tmp_seq .= $pad_char;
> }
> $seq->seq($tmp_seq);
> }
> }
>
> return 1;
1439a1593,1623
> =head2 unpad
>
> Title : unpad
> Usage : $contig->unpad();
> :
> Function : undo the effects of the pad method.
> : does nothing if the alignment has not been padded
> :
> Returns : 1
> Argument :
>
> =cut
>
> sub unpad {
> my ($self) = @_;
>
> if ($self->{'pad_seq'}) {
> $self->{'pad_seq'} = 0;
> my $dumper = new Dumpvalue();
> foreach my $seq ( $self->each_seq ) {
> my $coord = $self->get_seq_coord($seq);
> my @chars = split //, $seq->seq();
> @chars = splice @chars, $coord->start()-1,
> $coord->end() - $coord->start() + 1;
> $seq->seq(join '', @chars);
> }
> }
> return 1;
> }
>
>
1440a1625
> $self->throw_not_implemented();
1461,1462c1646,1676
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self, $match) = @_;
>
> if (!$self->is_flush()) {
> $self->throw ("Running match on an alignment which is not flush
doesn't
make sense IMO - Chris");
> return;
> }
>
> $match ||= '.';
> my ($matching_char) = $match;
> $matching_char = "\\$match" if $match =~ /[\^.$|()\[\]]/ ; #';
> $self->map_chars($matching_char, '-');
>
> my @seqs = $self->each_seq();
> return 1 unless scalar @seqs > 1;
>
>
> my $refseq = shift @seqs ;
> my @refseq = split //, $refseq->seq;
> my $gapchar = $self->gap_char;
>
> foreach my $seq ( @seqs ) {
> my @varseq = split //, $seq->seq();
> for ( my $i=0; $i < scalar @varseq; $i++) {
> $varseq[$i] = $match if defined $refseq[$i] &&
> ( $refseq[$i] =~ /[a-zA-Z\*]/ ||
> $refseq[$i] =~ /$gapchar/ )
> && $refseq[$i] eq $varseq[$i];
> }
> $seq->seq(join '', @varseq);
> }
> $self->match_char($match);
> return 1;
1525,1526c1739,1746
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self, $char) = @_;
>
> if (defined $char ) {
> $self->throw("Single missing character, not [$char]!") if
CORE::length($
char) > 1;
> $self->{'_missing_char'} = $char;
> }
>
> return $self->{'_missing_char'};
1540,1541c1760,1767
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self, $char) = @_;
>
> if (defined $char ) {
> $self->throw("Single match character, not [$char]!") if
CORE::length($ch
ar) > 1;
> $self->{'_match_char'} = $char;
> }
>
> return $self->{'_match_char'};
1555,1556c1781,1788
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self, $char) = @_;
>
> if (defined $char || ! defined $self->{'_gap_char'} ) {
> $char= '-' unless defined $char;
> $self->throw("Single gap character, not [$char]!") if
CORE::length($char
) > 1;
> $self->{'_gap_char'} = $char;
> }
> return $self->{'_gap_char'};
1570,1571c1802,1816
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self,$includeextra) = @_;
>
> unless ($self->{'_symbols'}) {
> foreach my $seq ($self->each_seq) {
> map { $self->{'_symbols'}->{$_} = 1; } split(//,$seq->seq);
> }
> }
> my %copy = %{$self->{'_symbols'}};
> if( ! $includeextra ) {
> foreach my $char ( $self->gap_char, $self->match_char,
> $self->missing_char) {
> delete $copy{$char} if( defined $char );
> }
> }
> return keys %copy;
1641,1642c1886,1908
< my ($self) = @_;
< $self->throw_not_implemented();
---
> my ($self,$report) = @_;
> my $seq;
> my $length = (-1);
> my $temp;
>
> foreach $seq ( $self->each_seq() ) {
> if( $length == (-1) ) {
> $length = CORE::length($seq->seq());
> next;
> }
>
> $temp = CORE::length($seq->seq());
> if( $temp != $length ) {
> $self->warn("expecting $length not $temp from ".
> $seq->display_id) if( $report );
> $self->debug("expecting $length not $temp from ".
> $seq->display_id);
> $self->debug($seq->seq(). "\n");
> return 0;
> }
> }
> return 1;
>
1658,1659c1924
<
< $self->throw_not_implemented();
---
> return $self->get_consensus_length();
1676c1941
< sub maxname_length {
---
> sub maxdisplayname_length {
1678c1943,1955
< $self->throw_not_implemented();
---
>
> my $maxname = (-1);
> my ($seq,$len);
>
> foreach $seq ( $self->each_seq() ) {
> $len = CORE::length $self->displayname($seq->get_nse());
>
> if( $len > $maxname ) {
> $maxname = $len;
> }
> }
>
> return $maxname;
1680a1958
>
1834c2112,2128
< sub displayname { # Do nothing
---
> sub displayname {
> my ($self, $name, $disname) = @_;
>
> $name =~ s/\/\d+-\d+//;
>
> $self->throw("No sequence with name [$name]")
> unless defined $self->{'_elem'}{$name};
>
> if( $disname and $name) {
> $self->{'_dis_name'}->{$name} = $disname;
> return $disname;
> }
> elsif( defined $self->{'_dis_name'}->{$name} ) {
> return $self->{'_dis_name'}->{$name};
> } else {
> return $name;
> }
1867c2161,2169
< sub set_displayname_flat { # Do nothing!
---
> sub set_displayname_flat {
> my $self = shift;
> my ($nse,$seq);
>
> foreach $seq ( $self->each_seq() ) {
> $nse = $seq->get_nse();
> $self->displayname($nse,$seq->id());
> }
> return 1;
----
Christopher T. Lewis
Agriculture and Agri-Food Canada/Agriculture et Agroalimentaire Canada
LewisCT at agr.gc.ca
More information about the Bioperl-l
mailing list