[Bioperl-l] Pattern search with gap

Andrew Walsh walsh at cenix-bioscience.com
Wed Jun 22 10:04:20 EDT 2005


Hello,

You could try substituting N's for your gaps and then use WU-BLAST with 
a wordsize equal to your input sequence (setting penalties for indels 
very high to ensure perfect matches).  I imagine that would be faster 
than scanning the sequence as you are doing.

Hope that helps,

Andrew

khoueiry wrote:
> Hello,
> 
> I want to parse a gapped sequence and search for a pattern in it... What
> is important for me is to get the Position of the pattern Start and End
> taking gaps into account:
> 
> i.e : 
> my $seqstring =
> "--------------------CAAAATAAATAGGTTATACAGAAACA---------------------AGATAAAAATTACA";
> my $qseq    = "CAAGATA";
> 
> so the result should give me : start  = 61 and End  = 89
> 
> I wrote a program to do that.. It works well but when working with very
> large sequences (And I have a lot of them), it take a lot of time.... 
> 
> In fact, my program parse the sequence with a sliding window equal the
> length of the pattern...
> 
> the while loop is attached :
> 
> Any suggestion will be appreciated....
> 
> 
> Pierre
> 
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> while($i<=length($seqstring) - $winSize){
> 	
> 	my $nucCount = 0;
> 	my $substring = substr($seqstring, $i, $winSize);
> 	
> 	#If the substring doesn't contain nuc or begins with a gap
> 	if($substring !~ /[AGCT]/ or $substring =~ /^-/){
> 		$i++;
> 		next;
> 	}
> 	
> 	#if the substring doesn't contain gap		
> 	if($substring !~ /-/){
> 		#print $substring."\n";
> 		if($substring eq $qseq){
> 			print "$qseq found on $i..".($i+length($qseq))."\n";
> 			last;
> 		}
> 		$i++;
> 	}
> 	
> 	if($substring =~ /-/){
> 		$nucCount++ while $substring =~ /[AGCT]/g;
> 		my $first = $i;
> 		my $j = 1;
> 		while($nucCount < length($qseq)){		
> 			$substring = substr($seqstring, $i, $winSize+$j);
> 			$j++;
> 			$nucCount = 0;
> 			$nucCount++ while $substring =~ /[AGCT]/g;
> 		}
> 		my $last = ($i + $winSize+$j) - 1;
> 		
> 		if($nucCount = length($qseq)){
> 			my $gapCount = $substring =~ s/-//g;
> 			if($substring eq $qseq){
> 				print "$qseq found on $i..";
> 				print "$last\t"."with $gapCount Gap(s)\n";
> 				last;
> 			}
> 		}
> 		
> 		$i++;
> 		
> 		print $substring."\t";
> 		print $first."..".$last."\t".$nucCount."\n";
> 		$nucCount = 0;
> 	}	
> }	
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l


-- 
------------------------------------------------------------------
Andrew Walsh, M.Sc.
Bioinformatics Software Engineer
IT Unit
Cenix BioScience GmbH
Tatzberg 47
01307 Dresden
Germany
Tel. +49-351-4173 137
Fax  +49-351-4173 109

public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
------------------------------------------------------------------



More information about the Bioperl-l mailing list