[Bioperl-l] code review on LocatableSeq performance fix.

Tue Aug 25 20:22:20 UTC 2009

[For better or worse] I use pairs of locatable seq's to represent
alignments between cDNAs (spliced mRNA) and genomic sequence.

I end up using column_from_residue_number a lot to map features back
and forth between the coordinate system.

My sequences tend to be fairly long, and the current implementation of
column_from_residue_number (which splits the sequences into arrays of
individual characters) performs very badly on them.

I've included below a small variation on a patch that I've been using
for a while (when I pulled it up to the current bioperl-live I changed
a couple of regexps to use $GAP_SYMBOLS and $RESIDUE_SYMBOLS).  It
passes the t/Seq/LocatableSeq.t tests and Works For Me (tm).

Instead of creating whopping big arrays and then looping over them it
breaks the sequence down into runs of residues/gaps and strides across
them.  It also unwinds the strandedness test and avoids the cute trick
of using an anonymous sub (which saves a couple of lines in the source
file but adds *signficant* overhead every time around the loop).

All hail Devel::NYTProf.

Chris et al.'s comments about the mysteries and vagaries of
Bio::LocatableSeq makes me leary of just committing it.

Anyone want to comment on it?

g.

Index: Bio/LocatableSeq.pm
===================================================================

--- Bio/LocatableSeq.pm	(revision 16001)
+++ Bio/LocatableSeq.pm	(working copy)
@@ -423,27 +423,47 @@
     unless $resnumber =~ /^\d+$/ and $resnumber > 0;
 
     if ($resnumber >= $self->start() and $resnumber <= $self->end()) {
-    my @residues = split //, $self->seq;
-    my $count = $self->start();
-    my $i;
-    my ($start,$end,$inc,$test);
-        my $strand = $self->strand || 0;
-    # the following bit of "magic" allows the main loop logic to be the
-    # same regardless of the strand of the sequence
-    ($start,$end,$inc,$test)= ($strand == -1)?
-            (scalar(@residues-1),0,-1,sub{$i >= $end}) :
-                (0,scalar(@residues-1),1,sub{$i <= $end});
+	my @chunks;
+	my $column_incr;
+	my $current_column;
+	my $current_residue = $self->start - 1;
+	my $seq = $self->seq;
+	my $strand = $self->strand || 0;
 
-    for ($i=$start; $test->(); $i+= $inc) {
-        if ($residues[$i] ne '.' and $residues[$i] ne '-') {
-        $count == $resnumber and last;
-        $count++;
-        }
-    }
-    # $i now holds the index of the column.
-        # The actual column number is this index + 1
+	if ($strand == -1) {
+#	    @chunks = reverse $seq =~ m/[^\.\-]+|[\.\-]+/go;
+	    @chunks = reverse $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
+	    $column_incr = -1;
+	    $current_column = (CORE::length $seq) + 1;
+	}
+	else {
+#	    @chunks = $seq =~ m/[^\.\-]+|[\.\-]+/go;
+	    @chunks = $seq =~ m/[$RESIDUE_SYMBOLS]+|[$GAP_SYMBOLS]+/go;
+	    $column_incr = 1;
+	    $current_column = 0;
+	}
 
-    return $i+1;
+	while (my $chunk = shift @chunks) {
+#	    if ($chunk =~ m|^[\.\-]|o) {
+	    if ($chunk =~ m|^[$GAP_SYMBOLS]|o) {
+		$current_column += $column_incr * CORE::length($chunk);
+	    }
+	    else {
+		if ($current_residue + CORE::length($chunk) < $resnumber) {
+		    $current_column += $column_incr * CORE::length($chunk);
+		    $current_residue += CORE::length($chunk);
+		}
+		else {
+		    if ($strand == -1) {
+			$current_column -= $resnumber - $current_residue;
+		    }
+		    else {
+			$current_column += $resnumber - $current_residue;
+		    }
+		    return $current_column;
+		}
+	    }
+	}
     }
 
     $self->throw("Could not find residue number $resnumber");