[Bioperl-l] Allowing One error in Sequence matching

Smithies, Russell Russell.Smithies at agresearch.co.nz
Thu Sep 17 01:46:54 UTC 2009


I misread your question. My example will match NGCT, ANCT, AGNT, or AGCN with 1 mismatch (or NGNT, NGCN, ANNT, ANCN etc. with 2 mismatches).
The eval is just doing a regex on the match string created by the loop, e.g. "[AN][GN][CN][TN]" for the motif AGCT.
If your word size is short and you're not using too many mismatches, brute-forcing it with a compiled regex would probably work.
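
For the case in the original question (AGCT with any single base substituted), the compiled-regex idea could be sketched roughly as below. This is not the code quoted further down: one_mismatch_regex is a hypothetical helper name, and the sketch assumes substitutions only (no insertions or deletions) and a motif made of plain letters, so no regex metacharacters need escaping.

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled regex that matches the motif
# with at most one substituted position (no insertions or deletions).
sub one_mismatch_regex {
    my ($motif) = @_;
    my @alternatives;
    for my $i (0 .. length($motif) - 1) {
        my $variant = $motif;
        substr($variant, $i, 1, '.');    # wildcard at position $i
        push @alternatives, $variant;
    }
    my $pattern = join '|', @alternatives;
    return qr/^(?:$pattern)$/;           # for AGCT: ^(?:.GCT|A.CT|AG.T|AGC.)$
}

my $re = one_mismatch_regex('AGCT');
for my $candidate (qw(AGCT ACCT ANCT AACT ACGT)) {
    printf "%s %s\n", $candidate, ($candidate =~ $re ? 'match' : 'no match');
}
```

With one mismatch the pattern has one alternative per motif position, so it stays small. Allowing k mismatches this way needs one alternative per combination of k positions, which is why it only suits short words and few mismatches.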


> -----Original Message-----
> From: Abhishek Pratap [mailto:abhishek.vit at gmail.com]
> Sent: Thursday, 17 September 2009 1:39 p.m.
> To: Smithies, Russell
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Allowing One error in Sequence matching
> 
> Hi Russell
> 
> Thanks for a quick reply. However I am not following the code clearly
> and the reason behind it.
> 
> Will this work for matching AGCT to ACCT | ANCT | AACT? It didn't give
> me the expected output when I ran it. I am more interested in
> understanding the logic.
> 
> It would be great if you could expand a bit more.
> 
> 
> Also, if I do it the brute-force way, as suggested to me by a friend, how
> will that work in terms of scalability?
> 
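> # Count the positions at which the two equal-length strings differ.
> my @dna1 = split //, $a;
> my @dna2 = split //, $b;
> my $x = 0;
> for (my $i = 0; $i < @dna1; $i++) {
>     $x++ if $dna1[$i] ne $dna2[$i];
> }
> 
> # Accept at most one mismatch.
> if ($x <= 1) {
>     print "RESULT: your sequence is true\n";
> }
> else {
>     print "RESULT: your sequence is false\n";
> }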
> 
> Thanks,
> -Abhi
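
On the scalability of the position-by-position comparison quoted above: a commonly used Perl idiom counts mismatches with a string XOR instead of splitting into arrays. The sketch below is only an illustration under the same assumptions (equal-length words, substitutions only). mismatches and fuzzy_positions are hypothetical names, not BioPerl functions.

```perl
use strict;
use warnings;

# Count mismatching positions between two equal-length strings.
# XOR yields "\0" wherever the bytes agree; tr with /c counts the
# bytes that are not "\0", i.e. the mismatches.
sub mismatches {
    my ($s, $t) = @_;
    die "length mismatch\n" unless length($s) == length($t);
    return (($s ^ $t) =~ tr/\0//c);
}

# Hypothetical helper: 0-based start offsets where $motif occurs in $seq
# with at most $max mismatches.
sub fuzzy_positions {
    my ($seq, $motif, $max) = @_;
    my $len = length $motif;
    my @hits;
    for my $pos (0 .. length($seq) - $len) {
        push @hits, $pos
            if mismatches(substr($seq, $pos, $len), $motif) <= $max;
    }
    return @hits;
}

print join(',', fuzzy_positions('TTAGCTTTACCTTT', 'AGCT', 1)), "\n";   # prints 2,8
```

The scan still compares the motif against every window, so the work grows with sequence length times motif length. It does not need to enumerate variants, though, and the mismatch limit is just a parameter.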
> 
> 
> On Wed, Sep 16, 2009 at 7:06 PM, Smithies, Russell
> <Russell.Smithies at agresearch.co.nz> wrote:
> > How about chunking it into overlapping words, skipping any word with 2 or more Ns, then using a regex?
> >
> > $seq = "CGATCGNATGNCGTCTAGCTGACANGTTGACTCTAGCTGATCGATCGATCGTACGTANNCGTAGTCGTACNTACGAT"
> >      . "CTNACGCACGNATGCTACGTACG";
> >
> > $motif = "ACGT";
> > foreach (split //, $motif) {$w .= "[${_}N]"}
> >
> > foreach ($seq =~ /(?=(\w{4}))/g){
> >  next if tr/N/N/ >= 2;
> >  print "$_\n" if  eval "/$w/" ;
> > }
> >
> >
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> >> bounces at lists.open-bio.org] On Behalf Of Abhishek Pratap
> >> Sent: Thursday, 17 September 2009 9:42 a.m.
> >> To: bioperl-l at lists.open-bio.org
> >> Subject: [Bioperl-l] Allowing One error in Sequence matching
> >>
> >> Hi All
> >>
> >> I am not able to think of a smart way to do sequence matching allowing
> >> a user-defined number of mismatches.
> >>
> >> For example:
> >>
> >> Given sequence AGCT will be considered a match to the reference if any
> >> one base position (1, 2, 3, or 4) has a mismatch, that is, any of [ACGTN], so
> >> the possible matches could be:
> >>
> >> This is for position 1.
> >> AGCT
> >> GGCT
> >> CGCT
> >> TGCT
> >> NGCT
> >> and likewise for each position.
> >>
> >> Is there any nice regular expression for this? One way that I could think of was to
> >> generate all the possible tags for a given sequence and then do the
> >> matching. It will be computationally expensive for a long dataset.
> >> Any neat method?
> >>
> >> Thanks,
> >> -Abhi
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l



