[Bioperl-l] Should coords be adjusted after removing alignment columns?

Chris Fields cjfields at uiuc.edu
Tue Aug 14 16:33:31 UTC 2007


Could you attach the scripts and patches to a bug report for tracking  
so anyone interested can double-check?  Having them in an email is  
problematic as the text in some clients wraps.

 From what I'm seeing I think we're in general agreement, though I'll  
reason through it to see if I'm following correctly.  The data in the  
SimpleAlign example you give is this:

a/5-20            atcgatcgatcgatcg
b/30-43           -tcgatc-atcgatcg
c/50-63           atcgatcgatc-atc-
                    ****** *** ***

Removing the gaps gives:

a/5-20            tcgatcatcatc
b/30-43           tcgatcatcatc
c/50-63           tcgatcatcatc
                   ************

The start/end is wrong, as you state.  Adjusting to map simple start/ 
ends to the original sequence won't work as we're removing gaps and  
residues in the LocatableSeqs along with it (ends and internal  
residues).  I guess if we want to map back to the original sequence  
accurately we would have to use split locations (not currently  
implemented with LocatableSeq) or maybe a cigar-like syntax against  
consensus (ugh), otherwise we wouldn't know where to map the relevant  
internal gaps (now missing from the alignment) w/o running a local  
alignment against the original sequence:

a/6-11;12-19      tcgatcatcatc
b/30-38;40-42     tcgatcatcatc
c/51-56;58-63     tcgatcatcatc
                   ************

That could get really hairy for long alignments.  We could also  
return multiple SimpleAligns which map correctly (ugh), but what we  
really want (and the API specifies) is a new single SimpleAlign.

It may come down to simply stating it 'voids the warranty' (so-to- 
speak) when modifications are made to alignments which remove/insert  
residues from LocatableSeqs via remove_gaps/remove_columns or  
similar, and either leave as is with relevant warnings or readjust  
start/end appropriately when LocatableSeq residues change.

gapless_a/1-12    tcgatcatcatc
gapless_b/1-12    tcgatcatcatc
gapless_c/1-12    tcgatcatcatc
                   ************

Not sure which is the best approach but anything would be better than  
giving an unexpectedly incorrect answer.

chris

On Aug 14, 2007, at 9:57 AM, Sendu Bala wrote:

> I'm looking at what looks like a pretty major bug in Bio::SimpleAlign,
> but before I commit the fix I wanted to check my sanity/understanding.
>
> My understanding is that an alignment may be built from just sub-parts
> of a number of sequences. So you give each sequence in the alignment a
> start and stop so you can later map back the aligned region to the
> original sequence. So, for example, the following should all pass:
>
> diff -r1.56 SimpleAlign.t
> 459a460,540
>>
>>
>> # is _remove_col really working correctly?
>> my $a = Bio::LocatableSeq->new(-id => 'a', -seq =>
> 'atcgatcgatcgatcg', -start => 5, -end => 20);
>> my $b = Bio::LocatableSeq->new(-id => 'b', -seq =>
> '-tcgatc-atcgatcg', -start => 30, -end => 43);
>> my $c = Bio::LocatableSeq->new(-id => 'c', -seq =>
> 'atcgatcgatc-atc-', -start => 50, -end => 63);
>> my $d = Bio::LocatableSeq->new(-id => 'd', -seq =>
> '--cgatcgatcgat--', -start => 80, -end => 91);
>> my $e = Bio::LocatableSeq->new(-id => 'e', -seq =>
> '-t-gatcgatcga-c-', -start => 100, -end => 111);
>> $aln = Bio::SimpleAlign->new();
>> $aln->add_seq($a);
>> $aln->add_seq($b);
>> $aln->add_seq($c);
>>
>> my $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>>       if ($seq->id eq 'a') {
>>               is $seq->start, 6;
>>               is $seq->end, 19;
>>               is $seq->seq, 'tcgatcatcatc';
>>       }
>>       elsif ($seq->id eq 'b') {
>>               is $seq->start, 30;
>>               is $seq->end, 42;
>>               is $seq->seq, 'tcgatcatcatc';
>>       }
>>       elsif ($seq->id eq 'c') {
>>               is $seq->start, 51;
>>               is $seq->end, 63;
>>               is $seq->seq, 'tcgatcatcatc';
>>       }
>> }
>>
>> $aln->add_seq($d);
>> $aln->add_seq($e);
>> $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>>       if ($seq->id eq 'a') {
>>               is $seq->start, 8;
>>               is $seq->end, 17;
>>               is $seq->seq, 'gatcatca';
>>       }
>>       elsif ($seq->id eq 'b') {
>>               is $seq->start, 32;
>>               is $seq->end, 40;
>>               is $seq->seq, 'gatcatca';
>>       }
>>       elsif ($seq->id eq 'c') {
>>               is $seq->start, 53;
>>               is $seq->end, 61;
>>               is $seq->seq, 'gatcatca';
>>       }
>>       elsif ($seq->id eq 'd') {
>>               is $seq->start, 81;
>>               is $seq->end, 90;
>>               is $seq->seq, 'gatcatca';
>>       }
>>       elsif ($seq->id eq 'e') {
>>               is $seq->start, 101;
>>               is $seq->end, 110;
>>               is $seq->seq, 'gatcatca';
>>       }
>> }
>>
>> my $f = Bio::LocatableSeq->new(-id => 'f', -seq =>
> 'a-cgatcgatcgat-g', -start => 30, -end => 43);
>> $aln = Bio::SimpleAlign->new();
>> $aln->add_seq($a);
>> $aln->add_seq($f);
>>
>> $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>>       if ($seq->id eq 'a') {
>>               is $seq->start, 5;
>>               is $seq->end, 20;
>>               is $seq->seq, 'acgatcgatcgatg';
>>       }
>>       elsif ($seq->id eq 'f') {
>>               is $seq->start, 30;
>>               is $seq->end, 43;
>>               is $seq->seq, 'acgatcgatcgatg';
>>       }
>> }
>
>
> But they don't. Once you remove certain columns the start and stop of
> the sequences in the alignment are no longer correct coordinates  
> for the
> sub-sequence in the original sequence.
>
> I propose the following patch to resolve this issue:
>
> diff -r1.136 SimpleAlign.pm
> 1116c1116,1118
> <
> ---
>>
>>     my $gap = $self->gap_char;
>>
> 1129,1137c1131,1147
> <             my $spliced;
> <             $spliced .= $start > 0 ? substr($sequence,0,$start) :  
> '';
> <             $spliced .= substr($sequence,$end+1,$seq->length-$end 
> +1);
> <             $sequence = $spliced;
> <             if ($start == 1) {
> <               $new_seq->start($end);
> <             }
> <             else {
> <               $new_seq->start( $seq->start);
> ---
>>             my $orig = $sequence;
>>             my $head =  $start > 0 ? substr($sequence, 0,  
>> $start) : '';
>>             my $tail = ($end + 1) >= length($sequence) ? '' :
> substr($sequence, $end + 1);
>>             $sequence = $head.$tail;
>>             # start
>>             unless (defined $new_seq->start) {
>>                 if ($start == 0) {
>>                     my $start_adjust = () = substr($orig, 0, $end +
> 1) =~ /$gap/g;
>>                     $new_seq->start($seq->start + $end + 1 -
> $start_adjust);
>>                 }
>>                 else {
>>                     my $start_adjust = $orig =~ /$gap+/;
>>                     if ($start_adjust) {
>>                         $start_adjust = $+[0] - 1 < $start;
>>                     }
>>                     $new_seq->start($seq->start + $start_adjust);
>>                 }
> 1140,1141c1150,1152
> <             if($end >= $seq->end){
> <              $new_seq->end( $start);
> ---
>>             if (($end + 1) >= length($orig)) {
>>                 my $end_adjust = () = substr($orig, $start) =~ / 
>> $gap/g;
>>                 $new_seq->end($seq->end - (length($orig) - $start) +
> $end_adjust);
> 1144c1155
> <              $new_seq->end($seq->end);
> ---
>>                 $new_seq->end($seq->end);
> 1148c1159
> <                 push @new, $new_seq;
> ---
>>               push @new, $new_seq;
> 1207,1209c1218,1234
> <       # sort the positions to remove columns at the end 1st
> <       @$positions = sort { $b->[0] <=> $a->[0] } @$positions;
> <       $aln = $self->_remove_col($aln,$positions);
> ---
>>       # sort the positions
>>       @$positions = sort { $a->[0] <=> $b->[0] } @$positions;
>>
>>     my @remove;
>>     my $length = 0;
>>     foreach my $pos (@{$positions}) {
>>         my ($start, $end) = @{$pos};
>>
>>         #have to offset the start and end for subsequent removes
>>         $start-=$length;
>>         $end  -=$length;
>>         $length += ($end-$start+1);
>>         push @remove, [$start,$end];
>>     }
>>
>>     #remove the segments
>>     $aln = $#remove >= 0 ? $self->_remove_col($aln,\@remove) : $self;
>
>
> This breaks 2 tests in SimpleAlign.t, but as far as I can tell, those
> tests expect the wrong answer. Changed to expect the correct answer,
> SimpleAlign.t and all other tests in the test suite pass.
>
> diff -r1.56 SimpleAlign.t
> 214,215c214,215
> <       "P84139/1-33              NEGEHQIKLDELFEKLLRARLIFKNKDVLRRC\n".
> <       "P814153/1-33             NEGMHQIKLDVLFEKLLRARLIFKNKDVLRRC\n".
> ---
>>       "P84139/2-33              NEGEHQIKLDELFEKLLRARLIFKNKDVLRRC\n".
>>       "P814153/2-33             NEGMHQIKLDVLFEKLLRARLIFKNKDVLRRC\n".
> 229c229
> <       "gb|443893|124775/1-32    -RFRIKVPPAVEGARPALLIFKSRPELGC\n",
> ---
>>       "gb|443893|124775/2-32    -RFRIKVPPAVEGARPALLIFKSRPELGC\n",
>
>
> Can someone triple-check my thinking and report back please?
>
> Cheers,
> Sendu.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






More information about the Bioperl-l mailing list