best way to check sequence alphabet

Chris Dagdigian cdagdigian@genetics.com
Tue, 17 Dec 1996 16:16:01 -0400


Hi folks,

There are several ways to check whether a sequence string contains any
unexpected characters. I do not know enough about Perl internals to pick
the most efficient approach and would welcome comments or suggestions
from all.

In the meantime, I may try implementing it both ways and seeing what works
fastest.

-Chris Dagdigian

------

Georg writes:
>http://www.perl.com/perl/faq/Q5.5.html
>You can simply read the alphabet into a hash, and do the checking
>w/o a regexp. What do you think ? Is there a size problem ?
>btw, alphabet_ok should return the standard 1 / 0.

The Perl FAQ attempts to answer the question "How can I tell whether an
array contains a certain element?" and its suggestion is to:

>  "...invert the original array and keep an associative array lying about
>  whose keys are the first array's values.
>
>   @blues = ('turquoise', 'teal', 'lapis lazuli');
>   undef %is_blue;
>   for (@blues) { $is_blue{$_} = 1; }
>
>  Now you can check whether $is_blue{$some_color}."

The difficulty I see here is that we have to answer the above question for
*every character* in a sequence string, and that would involve some type of
loop. (Is there another way?)
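
For comparison, here is a minimal sketch of what the hash-based, char-by-char
check might look like. The alphabet below is a made-up stand-in for the real
%Alphabets entry, and alphabet_ok is written to return 1 / 0 as Georg
suggests; treat it as an illustration, not as the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical alphabet for illustration only; the real list would come
# from the module's %Alphabets table.
my @alphabet = qw(A C G T N - ?);

# Build the lookup hash once (upper-case keys, so checks ignore case).
my %is_valid = map { uc($_) => 1 } @alphabet;

# Returns 1 if every character of $seq is in the alphabet, 0 otherwise.
sub alphabet_ok {
    my ($seq) = @_;
    foreach my $char (split //, uc $seq) {
        return 0 unless $is_valid{$char};
    }
    return 1;
}

print alphabet_ok("GAATTC"), "\n";
print alphabet_ok('GAA\TC'), "\n";   # single quotes keep the backslash
```

The loop stops at the first bad character, but it still costs one hash
lookup per character for a valid sequence.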

I ended up doing the comparison as a regular expression because I thought
that it would be more efficient than having to loop through the sequence
string checking char by char against a hash of acceptable alphabet letters.
*But* I don't actually know enough about Perl internals to say that this is
the better choice.

Example code:

>   ##Make string containing largest possible alphabet
>   my($al) = join("",@{$Alphabets{$self->_monomer . "GpMg"}});
>
>   ##Add backslash escape to the ? and - alphabet characters
>   ##(this is needed inside the regular expression)
>   $al =~ s/\?/\\?/;
>   $al =~ s/\-/\\-/;
>
>   ##Look for non-alphabet characters via regexp
>   if($seq =~ /[^$al]/i) { return "Not OK!"; }
>   else { return "ok"; }

[Here there is the problem of "\" characters in the sequence string
messing up the internal regexp, e.g. "GAA\TC" would pass the test.]
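
One possible way to make the escaping less fragile is to let Perl's built-in
quotemeta() escape the whole alphabet string instead of handling "?" and "-"
by hand. The sketch below assumes a made-up alphabet in place of the real
%Alphabets entry and returns 1 / 0; it is untested against the actual module.

```perl
use strict;
use warnings;

# Hypothetical alphabet, standing in for the module's %Alphabets entry.
my @alphabet = qw(A C G T N - ?);

# quotemeta() backslash-escapes every non-word character, so "-", "?",
# "]" or "\" in the alphabet cannot change the meaning of the class.
my $al = quotemeta(join("", @alphabet));

# Compile the negated character class once; /i ignores case.
my $bad_char = qr/[^$al]/i;

# Returns 1 if $seq contains only alphabet characters, 0 otherwise.
sub alphabet_ok {
    my ($seq) = @_;
    return $seq =~ $bad_char ? 0 : 1;
}

print alphabet_ok("gaattc"), "\n";
print alphabet_ok('GAA\TC'), "\n";   # single quotes keep the backslash
```

The test string is single-quoted on purpose, so that the backslash reaches
alphabet_ok literally rather than being processed as a string escape first.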