[Bioperl-l] Finding possible primers regex

Mon Aug 11 19:50:38 UTC 2008

When I can I could try generating a method which accepts a regex/Bio::Tools::SeqPattern and returns an AlignIO stream or array of SimpleAlign instances (the former could be attached to a temp file for iteration).  Any preference?

chris

---- Original message ----
>Date: Sat, 9 Aug 2008 12:07:30 -0400
>From: Hilmar Lapp <hlapp at gmx.net>  
>Subject: Re: [Bioperl-l] Finding possible primers regex  
>To: Chris Fields <cjfields at uiuc.edu>
>Cc: Benbo <btemperton at googlemail.com>, Bioperl-l at lists.open-bio.org
>
>This looks like a neat trick. Do you think it's worth including as a  
>SimpleAlign method (obviously w/o the printing to STDOUT)? I can  
>imagine that a lot of people might appreciate it.
>
>	-hilmar
>
>On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
>
>> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>>
>>>
>>> Hi there,
>>> I'm trying to write a perl script to scan an aligned multiple entry  
>>> fasta
>>> file and find possible primers. So far I've produced a string which  
>>> contains
>>> bases which match all sequences and * where they don't match e.g.
>>> 1) TTAGCCTAA
>>> 2) TTAGCAGAA
>>> 3) TTACCCTAA
>>>
>>> would give TTA*C**AA.
>>>
>>> I want to parse this string and pull out all sequences which are  
>>> 18-21 bp in
>>> length and have no more than 4 * in them.
>>>
>>> So far, I've got this:
>>>
>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>> print "$1\n";
>>> }
>>>
>>> hoping to match all fragments 18-21 characters in length. However  
>>> even that
>>> doesn't work as it has essentially chunked it into 21 char blocks,  
>>> rather
>>> than what I hoped for of
>>> 0-18
>>> 0-19
>>> 0-20
>>> 0-21
>>> 1-19
>>> 1-20
>>> 1-21
>>> 1-22
>>>
>>> etc.
>>>
>>> Can anyone let me know if this is already possible in BioPerl, or  
>>> how one
>>> would go about it with regex. Sadly I'm fairly new to perl and  
>>> getting to
>>> grips with BioPerl, so please treat me gently :).
>>>
>>> Many thanks,
>>>
>>> Ben
>>
>> There is a trick to this which is discussed more extensively in  
>> 'Mastering Regular Expressions'.  Essentially you have to embed code  
>> into the regex and trick the parser into backtracking using a  
>> negative lookahead.  The match itself fails (i.e. no match is  
>> returned), but the embedded code is executed for each match attempt,
>>
>> The following script is a slight modification of one I used which  
>> checks the consensus string from the input alignment (in aligned  
>> FASTA format here), extracts the alignment slice using that match,  
>> then spit the alignment out to STDOUT in clustalw format.  This  
>> should work for perl 5.8 and up, but it's only been tested on perl  
>> 5.10.  You should be able to use this to fit what you want.
>>
>> my $in = Bio::AlignIO->new(-file => $file,
>>                           -format => 'fasta');
>> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>>                           -format => 'clustalw');
>>
>> while (my $aln = $in->next_aln) {
>>    my $c = $aln->consensus_string(100);
>>    my @matches;
>>    $c =~ m/
>>        ([GTAC?]{18,21})
>>        (?{my $match = check_match($1);
>>           push @matches, [$match,
>>                           pos(),
>>                           length($match)]
>>              if defined $match;})
>>        (?!)
>>        /xig;
>>    for my $match (@matches) {
>>        my ($hit, $st, $end) = ($match->[0],
>>                                $match->[1] - $match->[2] + 1,
>>                                $match->[1]);
>>        my $newaln = $aln->slice($st, $end);
>>        $out->write_aln($newaln);
>>    }
>> }
>>
>> sub check_match {
>>    my $match = shift;
>>    return unless $match;
>>    my $ct = $match =~ tr/?/?/;
>>    return $match if $ct <= 4;
>> }
>>
>>
>> chris
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>-- 
>===========================================================
>: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>===========================================================
>
>
>