[Bioperl-l] Should coords be adjusted after removing alignment columns?
Chris Fields
cjfields at uiuc.edu
Tue Aug 14 16:33:31 UTC 2007
Could you attach the scripts and patches to a bug report for tracking
so anyone interested can double-check? Having them in an email is
problematic as the text in some clients wraps.
From what I'm seeing I think we're in general agreement, though I'll
reason through it to see if I'm following correctly. The data in the
SimpleAlign example you give is this:
a/5-20 atcgatcgatcgatcg
b/30-43 -tcgatc-atcgatcg
c/50-63 atcgatcgatc-atc-
****** *** ***
Removing the gaps gives:
a/5-20 tcgatcatcatc
b/30-43 tcgatcatcatc
c/50-63 tcgatcatcatc
************
The start/end is wrong, as you state. Adjusting to map simple start/
ends to the original sequence won't work as we're removing gaps and
residues in the LocatableSeqs along with it (ends and internal
residues). I guess if we want to map back to the original sequence
accurately we would have to use split locations (not currently
implemented with LocatableSeq) or maybe a cigar-like syntax against
consensus (ugh), otherwise we wouldn't know where to map the relevant
internal gaps (now missing from the alignment) w/o running a local
alignment against the original sequence:
a/6-11;12-19 tcgatcatcatc
b/30-38;40-42 tcgatcatcatc
c/51-56;58-63 tcgatcatcatc
************
That could get really hairy for long alignments. We could also
return multiple SimpleAligns which map correctly (ugh), but what we
really want (and the API specifies) is a new single SimpleAlign.
It may come down to simply stating it 'voids the warranty' (so-to-
speak) when modifications are made to alignments which remove/insert
residues from LocatableSeqs via remove_gaps/remove_columns or
similar, and either leave as is with relevant warnings or readjust
start/end appropriately when LocatableSeq residues change.
gapless_a/1-12 tcgatcatcatc
gapless_b/1-12 tcgatcatcatc
gapless_c/1-12 tcgatcatcatc
************
Not sure which is the best approach but anything would be better than
giving an unexpectedly incorrect answer.
chris
On Aug 14, 2007, at 9:57 AM, Sendu Bala wrote:
> I'm looking at what looks like a pretty major bug in Bio::SimpleAlign,
> but before I commit the fix I wanted to check my sanity/understanding.
>
> My understanding is that an alignment may be built from just sub-parts
> of a number of sequences. So you give each sequence in the alignment a
> start and stop so you can later map back the aligned region to the
> original sequence. So, for example, the following should all pass:
>
> diff -r1.56 SimpleAlign.t
> 459a460,540
>>
>>
>> # is _remove_col really working correctly?
>> my $a = Bio::LocatableSeq->new(-id => 'a', -seq =>
> 'atcgatcgatcgatcg', -start => 5, -end => 20);
>> my $b = Bio::LocatableSeq->new(-id => 'b', -seq =>
> '-tcgatc-atcgatcg', -start => 30, -end => 43);
>> my $c = Bio::LocatableSeq->new(-id => 'c', -seq =>
> 'atcgatcgatc-atc-', -start => 50, -end => 63);
>> my $d = Bio::LocatableSeq->new(-id => 'd', -seq =>
> '--cgatcgatcgat--', -start => 80, -end => 91);
>> my $e = Bio::LocatableSeq->new(-id => 'e', -seq =>
> '-t-gatcgatcga-c-', -start => 100, -end => 111);
>> $aln = Bio::SimpleAlign->new();
>> $aln->add_seq($a);
>> $aln->add_seq($b);
>> $aln->add_seq($c);
>>
>> my $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>> if ($seq->id eq 'a') {
>> is $seq->start, 6;
>> is $seq->end, 19;
>> is $seq->seq, 'tcgatcatcatc';
>> }
>> elsif ($seq->id eq 'b') {
>> is $seq->start, 30;
>> is $seq->end, 42;
>> is $seq->seq, 'tcgatcatcatc';
>> }
>> elsif ($seq->id eq 'c') {
>> is $seq->start, 51;
>> is $seq->end, 63;
>> is $seq->seq, 'tcgatcatcatc';
>> }
>> }
>>
>> $aln->add_seq($d);
>> $aln->add_seq($e);
>> $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>> if ($seq->id eq 'a') {
>> is $seq->start, 8;
>> is $seq->end, 17;
>> is $seq->seq, 'gatcatca';
>> }
>> elsif ($seq->id eq 'b') {
>> is $seq->start, 32;
>> is $seq->end, 40;
>> is $seq->seq, 'gatcatca';
>> }
>> elsif ($seq->id eq 'c') {
>> is $seq->start, 53;
>> is $seq->end, 61;
>> is $seq->seq, 'gatcatca';
>> }
>> elsif ($seq->id eq 'd') {
>> is $seq->start, 81;
>> is $seq->end, 90;
>> is $seq->seq, 'gatcatca';
>> }
>> elsif ($seq->id eq 'e') {
>> is $seq->start, 101;
>> is $seq->end, 110;
>> is $seq->seq, 'gatcatca';
>> }
>> }
>>
>> my $f = Bio::LocatableSeq->new(-id => 'f', -seq =>
> 'a-cgatcgatcgat-g', -start => 30, -end => 43);
>> $aln = Bio::SimpleAlign->new();
>> $aln->add_seq($a);
>> $aln->add_seq($f);
>>
>> $gapless = $aln->remove_gaps();
>> foreach my $seq ($gapless->each_seq) {
>> if ($seq->id eq 'a') {
>> is $seq->start, 5;
>> is $seq->end, 20;
>> is $seq->seq, 'acgatcgatcgatg';
>> }
>> elsif ($seq->id eq 'f') {
>> is $seq->start, 30;
>> is $seq->end, 43;
>> is $seq->seq, 'acgatcgatcgatg';
>> }
>> }
>
>
> But they don't. Once you remove certain columns the start and stop of
> the sequences in the alignment are no longer correct coordinates
> for the
> sub-sequence in the original sequence.
>
> I propose the following patch to resolve this issue:
>
> diff -r1.136 SimpleAlign.pm
> 1116c1116,1118
> <
> ---
>>
>> my $gap = $self->gap_char;
>>
> 1129,1137c1131,1147
> < my $spliced;
> < $spliced .= $start > 0 ? substr($sequence,0,$start) :
> '';
> < $spliced .= substr($sequence,$end+1,$seq->length-$end
> +1);
> < $sequence = $spliced;
> < if ($start == 1) {
> < $new_seq->start($end);
> < }
> < else {
> < $new_seq->start( $seq->start);
> ---
>> my $orig = $sequence;
>> my $head = $start > 0 ? substr($sequence, 0,
>> $start) : '';
>> my $tail = ($end + 1) >= length($sequence) ? '' :
> substr($sequence, $end + 1);
>> $sequence = $head.$tail;
>> # start
>> unless (defined $new_seq->start) {
>> if ($start == 0) {
>> my $start_adjust = () = substr($orig, 0, $end +
> 1) =~ /$gap/g;
>> $new_seq->start($seq->start + $end + 1 -
> $start_adjust);
>> }
>> else {
>> my $start_adjust = $orig =~ /$gap+/;
>> if ($start_adjust) {
>> $start_adjust = $+[0] - 1 < $start;
>> }
>> $new_seq->start($seq->start + $start_adjust);
>> }
> 1140,1141c1150,1152
> < if($end >= $seq->end){
> < $new_seq->end( $start);
> ---
>> if (($end + 1) >= length($orig)) {
>> my $end_adjust = () = substr($orig, $start) =~ /
>> $gap/g;
>> $new_seq->end($seq->end - (length($orig) - $start) +
> $end_adjust);
> 1144c1155
> < $new_seq->end($seq->end);
> ---
>> $new_seq->end($seq->end);
> 1148c1159
> < push @new, $new_seq;
> ---
>> push @new, $new_seq;
> 1207,1209c1218,1234
> < # sort the positions to remove columns at the end 1st
> < @$positions = sort { $b->[0] <=> $a->[0] } @$positions;
> < $aln = $self->_remove_col($aln,$positions);
> ---
>> # sort the positions
>> @$positions = sort { $a->[0] <=> $b->[0] } @$positions;
>>
>> my @remove;
>> my $length = 0;
>> foreach my $pos (@{$positions}) {
>> my ($start, $end) = @{$pos};
>>
>> #have to offset the start and end for subsequent removes
>> $start-=$length;
>> $end -=$length;
>> $length += ($end-$start+1);
>> push @remove, [$start,$end];
>> }
>>
>> #remove the segments
>> $aln = $#remove >= 0 ? $self->_remove_col($aln,\@remove) : $self;
>
>
> This breaks 2 tests in SimpleAlign.t, but as far as I can tell, those
> tests expect the wrong answer. Changed to expect the correct answer,
> SimpleAlign.t and all other tests in the test suite pass.
>
> diff -r1.56 SimpleAlign.t
> 214,215c214,215
> < "P84139/1-33 NEGEHQIKLDELFEKLLRARLIFKNKDVLRRC\n".
> < "P814153/1-33 NEGMHQIKLDVLFEKLLRARLIFKNKDVLRRC\n".
> ---
>> "P84139/2-33 NEGEHQIKLDELFEKLLRARLIFKNKDVLRRC\n".
>> "P814153/2-33 NEGMHQIKLDVLFEKLLRARLIFKNKDVLRRC\n".
> 229c229
> < "gb|443893|124775/1-32 -RFRIKVPPAVEGARPALLIFKSRPELGC\n",
> ---
>> "gb|443893|124775/2-32 -RFRIKVPPAVEGARPALLIFKSRPELGC\n",
>
>
> Can someone triple-check my thinking and report back please?
>
> Cheers,
> Sendu.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list