[Bioperl-l] Sequence matching problem!

Fri Feb 23 13:55:24 UTC 2007

James Smith wrote:
> On Fri, 23 Feb 2007, Albert Vilella wrote:
> 
>> now that we are at this pattern matching thread, I was wondering if
>> any perl guru could enlighten me on the issue of matching exact
>> sequence patterns on a gapped target sequence. E.g.:
>>
>> my $seq = "CGATCAACGAATCGTACGTACTC";
>> my $gapped_seq =
>> "GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";
>>
>> and one would like to get as a result:
>>
>> "CG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTC"
>>
>> which is the match of $seq but in $gapped_seq.
> 
> Try...
> 
>  my $seq = "CGATCAACGAATCGTACGTACTC";
>  my $gapped_seq =
>    "GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";
> 
>  my $regexp = '('.join('-*?',split//,$seq).')';
> 
>  if( $gapped_seq =~ /$regexp/ ) {
>    print "Match is $1\n";
>  } else {
>    print "No match\n";
>  }

That's great stuff. If you were matching thousands of different $seq 
against the same very large $gapped_seq, and only needed the first match 
of $seq in $gapped_seq, the alternative to the above approach (remove 
the gaps from $gapped_seq and do index() matching) will be faster.

Here's one (overly long-winded) way of implementing it, that I found to 
take ~2s vs ~22s for the above regex approach when doing the job on 
999999 copies of $seq:

#!/usr/bin/perl -w
use strict;
use warnings;

my $gapped_seq = 
"GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";

# note the total gap-length at position in gapless 0-based coords
my @gap_lengths;
my $gap_length = 0;
while ($gapped_seq =~ /(-+)/g) {
   my $match = $1;
   my $prev_length = $gap_length;
   $gap_length += length($match);
   my $end = pos($gapped_seq) - $gap_length - 1;
   push(@gap_lengths, $prev_length) for (1..$end-$#gap_lengths);
}
push(@gap_lengths, $gap_length) for (1..(length($gapped_seq) - 
@gap_lengths - $gap_length));

# remove the gaps
my $gapless_seq = $gapped_seq;
$gapless_seq =~ s/-//g;

# now for each of thousands of seqs...
my $seq = 'CGATCAACGAATCGTACGTACTC';
my @seqs;
for (1..999999) {
   push(@seqs, $seq);
}
foreach my $seq (@seqs) {
   my $start = index($gapless_seq, $seq);
   if ($start == -1) {
     print "No match found for seq '$seq'\n";
     next;
   }
   my $end = $start + length($seq) - 1;

   # calculate the coords in $gapped_seq
   $start = $start + $gap_lengths[$start];
   $end = $end + $gap_lengths[$end];

   my $result = substr($gapped_seq, $start, ($end - $start + 1));
   #print $result, "\n";
}

exit;