[Bioperl-l] Finding possible primers regex

Chris Fields cjfields at uiuc.edu
Mon Aug 4 04:08:51 UTC 2008


On Aug 2, 2008, at 3:05 PM, Benbo wrote:

>
> Hi there,
> I'm trying to write a perl script to scan an aligned multiple entry  
> fasta
> file and find possible primers. So far I've produced a string which  
> contains
> bases which match all sequences and * where they don't match e.g.
> 1) TTAGCCTAA
> 2) TTAGCAGAA
> 3) TTACCCTAA
>
> would give TTA*C**AA.
>
> I want to parse this string and pull out all sequences which are  
> 18-21 bp in
> length and have no more than 4 * in them.
>
> So far, I've got this:
>
> while($fragment_match =~ /([GTAC*]{18,21})/g){
> print "$1\n";
> }
>
> hoping to match all fragments 18-21 characters in length. However  
> even that
> doesn't work as it has essentially chunked it into 21 char blocks,  
> rather
> than what I hoped for of
> 0-18
> 0-19
> 0-20
> 0-21
> 1-19
> 1-20
> 1-21
> 1-22
>
> etc.
>
> Can anyone let me know if this is already possible in BioPerl, or  
> how one
> would go about it with regex. Sadly I'm fairly new to perl and  
> getting to
> grips with BioPerl, so please treat me gently :).
>
> Many thanks,
>
> Ben

There is a trick to this which is discussed more extensively in  
'Mastering Regular Expressions'.  Essentially you have to embed code  
into the regex and trick the parser into backtracking using a negative  
lookahead.  The match itself fails (i.e. no match is returned), but  
the embedded code is executed for each match attempt,

The following script is a slight modification of one I used which  
checks the consensus string from the input alignment (in aligned FASTA  
format here), extracts the alignment slice using that match, then spit  
the alignment out to STDOUT in clustalw format.  This should work for  
perl 5.8 and up, but it's only been tested on perl 5.10.  You should  
be able to use this to fit what you want.

my $in = Bio::AlignIO->new(-file => $file,
                            -format => 'fasta');
my $out = Bio::AlignIO->new(-fh => \*STDOUT,
                            -format => 'clustalw');

while (my $aln = $in->next_aln) {
     my $c = $aln->consensus_string(100);
     my @matches;
     $c =~ m/
         ([GTAC?]{18,21})
         (?{my $match = check_match($1);
            push @matches, [$match,
                            pos(),
                            length($match)]
               if defined $match;})
         (?!)
         /xig;
     for my $match (@matches) {
         my ($hit, $st, $end) = ($match->[0],
                                 $match->[1] - $match->[2] + 1,
                                 $match->[1]);
         my $newaln = $aln->slice($st, $end);
         $out->write_aln($newaln);
     }
}

sub check_match {
     my $match = shift;
     return unless $match;
     my $ct = $match =~ tr/?/?/;
     return $match if $ct <= 4;
}


chris




More information about the Bioperl-l mailing list