[Bioperl-l] Pattern search with gap
Andrew Walsh
walsh at cenix-bioscience.com
Wed Jun 22 10:04:20 EDT 2005
Hello,
You could try substituting N's for your gaps and then use WU-BLAST with
a wordsize equal to your input sequence (setting penalties for indels
very high to ensure perfect matches). I imagine that would be faster
than scanning the sequence as you are doing.
Hope that helps,
Andrew
khoueiry wrote:
> Hello,
>
> I want to parse a gapped sequence and search for a pattern in it... What
> is important for me is to get the Position of the pattern Start and End
> taking gaps into account:
>
> i.e :
> my $seqstring =
> "--------------------CAAAATAAATAGGTTATACAGAAACA---------------------AGATAAAAATTACA";
> my $qseq = "CAAGATA";
>
> so the result should give me : start = 61 and End = 89
>
> I wrote a program to do that.. It works well but when working with very
> large sequences (And I have a lot of them), it take a lot of time....
>
> In fact, my program parse the sequence with a sliding window equal the
> length of the pattern...
>
> the while loop is attached :
>
> Any suggestion will be appreciated....
>
>
> Pierre
>
>
>
>
>
> ------------------------------------------------------------------------
>
> while($i<=length($seqstring) - $winSize){
>
> my $nucCount = 0;
> my $substring = substr($seqstring, $i, $winSize);
>
> #If the substring doesn't contain nuc or begins with a gap
> if($substring !~ /[AGCT]/ or $substring =~ /^-/){
> $i++;
> next;
> }
>
> #if the substring doesn't contain gap
> if($substring !~ /-/){
> #print $substring."\n";
> if($substring eq $qseq){
> print "$qseq found on $i..".($i+length($qseq))."\n";
> last;
> }
> $i++;
> }
>
> if($substring =~ /-/){
> $nucCount++ while $substring =~ /[AGCT]/g;
> my $first = $i;
> my $j = 1;
> while($nucCount < length($qseq)){
> $substring = substr($seqstring, $i, $winSize+$j);
> $j++;
> $nucCount = 0;
> $nucCount++ while $substring =~ /[AGCT]/g;
> }
> my $last = ($i + $winSize+$j) - 1;
>
> if($nucCount = length($qseq)){
> my $gapCount = $substring =~ s/-//g;
> if($substring eq $qseq){
> print "$qseq found on $i..";
> print "$last\t"."with $gapCount Gap(s)\n";
> last;
> }
> }
>
> $i++;
>
> print $substring."\t";
> print $first."..".$last."\t".$nucCount."\n";
> $nucCount = 0;
> }
> }
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
--
------------------------------------------------------------------
Andrew Walsh, M.Sc.
Bioinformatics Software Engineer
IT Unit
Cenix BioScience GmbH
Tatzberg 47
01307 Dresden
Germany
Tel. +49-351-4173 137
Fax +49-351-4173 109
public key: http://www.cenix-bioscience.com/public_keys/walsh.gpg
------------------------------------------------------------------
More information about the Bioperl-l
mailing list