[Bioperl-l] code review on LocatableSeq performance fix.

Mark A. Jensen maj at fortinbras.us
Wed Aug 26 11:39:40 UTC 2009


I think it's great. column_from_residue_number doesn't have any
secret side effects, and the patch preserves nice integer in, nice integer
out, and input and output both are 1-origin indices as far as I can tell.
I say go for it-
MAJ
----- Original Message ----- 
From: "George Hartzell" <hartzell at alerce.com>
To: "bioperl-l List" <bioperl-l at bioperl.org>
Sent: Tuesday, August 25, 2009 4:22 PM
Subject: [Bioperl-l] code review on LocatableSeq performance fix.


> 
> [For better or worse] I use pairs of locatable seq's to represent
> alignments between cDNAs (spliced mRNA) and genomic sequence.
> 
> I end up using column_from_residue_number a lot to map features back
> and forth between the coordinate system.
> 
> My sequences tend to be fairly long, and the current implementation of
> column_from_residue_number (which splits the sequences into arrays of
> individual characters) performs very badly on them.
> 
> I've included below a small variation on a patch that I've been using
> for a while (when I pulled it up to the current bioperl-live I changed
> a couple of regexps to use $GAP_SYMBOLS and $RESIDUE_SYMBOLS).  It
> passes the t/Seq/LocatableSeq.t tests and Works For Me (tm).
> 
> Instead of creating whopping big arrays and then looping over them it
> breaks the sequence down into runs of residues/gaps and strides across
> them.  It also unwinds the strandedness test and avoids the cute trick
> of using an anonymous sub (which saves a couple of lines in the source
> file but adds *signficant* overhead every time around the loop).
> 
> All hail Devel::NYTProf.
> 
> Chris et al.'s comments about the mysteries and vagaries of
> Bio::LocatableSeq makes me leary of just committing it.
> 
> Anyone want to comment on it?
> 
> g.
> 
> Index: Bio/LocatableSeq.pm
> ===================================================================
> --- Bio/LocatableSeq.pm (revision 16001)
> +++ Bio/LocatableSeq.pm (working copy)
> @@ -423,27 +423,47 @@
>     unless $resnumber =~ /^\d+$/ and $resnumber > 0;
> 
>     if ($resnumber >= $self->start() and $resnumber <= $self->end()) {
> -    my @residues = split //, $self->seq;
> -    my $count = $self->start();
> -    my $i;
> -    my ($start,$end,$inc,$test);
> -        my $strand = $self->strand || 0;
> -    # the following bit of "magic" allows the main loop logic to be the
> -    # same regardless of the strand of the sequence
> -    ($start,$end,$inc,$test)= ($strand == -1)?
> -            (scalar(@residues-1),0,-1,sub{$i >= $end}) :
> -                (0,scalar(@residues-1),1,sub{$i <= $end});
> + my @chunks;
> + my $column_incr;
> + my $current_column;
> + my $current_residue = $self->start - 1;
> + my $seq = $self->seq;
> + my $strand = $self->strand || 0;
> 
> -    for ($i=$start; $test->(); $i+= $inc) {
> -        if ($residues[$i] ne '.' and $residues[$i] ne '-') {
> -        $count == $resnumber and last;
> -        $count++;
> -        }
> -    }
> -    # $i now holds the index of the column.
> -        # The actual column number is this index + 1
> + if ($strand == -1) {
> +#     @chunks = reverse $seq =~ m/[^\.\-]+|[\.\-]+/go;
> +     @chunks = reverse $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
> +     $column_incr = -1;
> +     $current_column = (CORE::length $seq) + 1;
> + }
> + else {
> +#     @chunks = $seq =~ m/[^\.\-]+|[\.\-]+/go;
> +     @chunks = $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
> +     $column_incr = 1;
> +     $current_column = 0;
> + }
> 
> -    return $i+1;
> + while (my $chunk = shift @chunks) {
> +#     if ($chunk =~ m|^[\.\-]|o) {
> +     if ($chunk =~ m|^[$GAP_SYMBOLS]|o) {
> + $current_column += $column_incr * CORE::length($chunk);
> +     }
> +     else {
> + if ($current_residue + CORE::length($chunk) < $resnumber) {
> +     $current_column += $column_incr * CORE::length($chunk);
> +     $current_residue += CORE::length($chunk);
> + }
> + else {
> +     if ($strand == -1) {
> + $current_column -= $resnumber - $current_residue;
> +     }
> +     else {
> + $current_column += $resnumber - $current_residue;
> +     }
> +     return $current_column;
> + }
> +     }
> + }
>     }
> 
>     $self->throw("Could not find residue number $resnumber");
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
>



More information about the Bioperl-l mailing list