[Bioperl-l] Bio::LocatableSeq end checking inconsistency
Jun Yin
jun.yin at ucd.ie
Fri Aug 20 16:17:33 UTC 2010
Hi, Bernd,
Thx for your input.
Yes, this is one of the old bugs in Bio::SimpleAlign. $aln->slice just
simply $slice_seq =~ s/\W//g to calculate the ungapped length.
But in $seq->_ungapped_len, this method use $string =~
s{[$GAP_SYMBOLS$FRAMESHIFT_SYMBOLS]+}{}g;
Which is '\-\.=~\\\/ ' to calculate the ungapped length.
To solve this problem, first, now I use
$nonres = join("",$self->gap_char, $self->match_char,$self->missing_char);
Which is '-\.&' to remove the non-residue chars in the alignment sequence
(though if you use '=','~','\','/' will also cause problems).
Secondly, I have merged slice, remove_columns and remove_gaps, using the
same internal function. Thus it is easier to debug.
These changes will be merged into main BioPerl branch after next version.
But anyway, the confict is still there, because the non residue chars are
defined as:
In Bio::SimpleAlign, $aln->gap_char, $aln->missing_char, $aln->match_char
In Bio::LocatableSeq
$GAP_SYMBOLS = '\-\.=~';
$FRAMESHIFT_SYMBOLS = '\\\/';
so try to use '-' or '.' for your gap char at the moment, otherwise you may
encounter end warnings in calculation.
And, if you want to keep gap only sequences, you can call the method as:
$aln2 = $aln->slice(20,30,1)
The last parameter is to keep gap only sequence.
Cheers,
Jun Yin
Ph.D. student in U.C.D.
Bioinformatics Laboratory
Conway Institute
University College Dublin
-----Original Message-----
From: Bernd Web [mailto:bernd.web at gmail.com]
Sent: Friday, August 20, 2010 4:17 PM
To: Jun Yin
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Bio::LocatableSeq end checking inconsistency
Hi Yin,
I am not quite sure if the following is also related to your gapped
length issue but I found I had to adapt the calculation of
ungapped_len in Bio::LocatableSeq. If my slices did not contain any
letters or a new gap char I used, SimpleAlign could not find the
sequences when outputting the alignment. This was due to a difference
in length calculation:
SimpleAlign: uses \W: $slice_seq =~ s/\W//g;
Bio::LocatableSeq::ungapped_len uses "$string =~ s/[\.\-]+//g;"
I had to include '~' (for my local sequences) in the ungapped_len;
otherwise i would run into the end issues with SimpleAlign.
Kind regards,
Bernd
On Fri, Aug 13, 2010 at 3:36 PM, Jun Yin <jun.yin at ucd.ie> wrote:
> Hi, all,
>
>
>
> I am the google summer of code student working on Bio::Align subsystem
> refactoring. The code (Bio::SimpleAlign) I re-implemented now has passed
> nearly all the test, except a few tests on seq/start-end testing. But here
> comes a problem. This may be an old issue, that the Bio::LocatableSeq end
> assignment and checking are inconsistent.
>
>
>
> The current end checking method is based on:
>
> $end=$seq->_ungapped_len+$seq->start-1
>
> However, this checking may not fit the real world case.
>
>
>
> The inconsistency usually happens when a few columns of the sequence are
> removed.
>
>
>
> For example:
>
> my $a = Bio::LocatableSeq->new(
>
> -id => 'a',
>
> -strand => 1,
>
> -seq => '-tcgatc-atcgatcg',
>
> -start => 30,
>
> -end => 43
>
> );
>
>
>
> If we remove the 1st, 8th and the last columns
>
>
>
> $a->seq() will be 'tcgatcatcgatc'
>
> $a->_ungapped_len==12
>
>
>
> Actually, in the real world, the first residue will still be 30 (the old
> $seq->start), and the last residue is the residue before the 43 (the old
> $seq->end), thus 42.
>
>
>
> But if you call a validation, the calculation is
> $a->_ungapped_len+$a->start-1=12+30-1=41
>
> So the reassignment of the $seq->end will not pass the validation.
>
>
>
> So unless you save the information to a new sequence object, the original
> position information will be lost anyway. But in some cases, we have to
> change the sequence in its original sequence object ..
>
>
>
> What is your suggestion on this issue?
>
> A. pass the test and lose the information #convenient in coding but
the
> start-end annotation is not right any more
>
> B. keep the information and forget the test #the object will still
> remember where the last residue was in the original sequence. But is it
> really meaningful at all? Because all the other residues may come from
> nowhere
>
> C. Neither of above #any other suggestions?
>
>
>
> Cheers,
>
> Jun Yin
>
> Ph.D. student in U.C.D.
>
>
>
> Bioinformatics Laboratory
>
> Conway Institute
>
> University College Dublin
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
__________ Information from ESET Smart Security, version of virus signature
database 5377 (20100818) __________
The message was checked by ESET Smart Security.
http://www.eset.com
__________ Information from ESET Smart Security, version of virus signature
database 5377 (20100818) __________
The message was checked by ESET Smart Security.
http://www.eset.com
More information about the Bioperl-l
mailing list