[BioRuby] matching against a zillion patterns

Jan Aerts jan.aerts at gmail.com
Fri Oct 16 13:51:42 UTC 2009


Hey George,
So if I understand correctly you've got a huge number of amino acid sequences
(how many?) and about 400 regular expressions. And for each of the amino acid
sequences: if it matches at least one of the regular expressions it is put in
box A, and if it matches none of the regexps, it goes into box B. Correct?

It just happens that something very similar was the subject of Jim Tisdall's
(of Beginning Perl for Bioinformatics fame) talk at the bioinformatics
course we're teaching at the moment :-)

First thing: avoid nested loops. You don't want to loop over all regexps for
each AA sequence, or the other way around.

Are all regexps of the same length? Would be nice if they are, but not
critical. My approach would be to go over the data just once. So suppose the
regexps all are of the same length.

A. Prepare your data:
  a. "Decode" the regexps into literal strings: e.g. /A[BC]D/ becomes "ABD"
and "ACD".
  b. Create a hash that contains all those things as keys.
  c. Concatenate all AA sequences together, joined with a non-AA, let's say
a semicolon ";". E.g. CAARGNDLYSKNIG;GGARGNDLYSKNIG;KKARGNDLYSKNIG
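
A rough sketch of step A in Ruby, assuming the patterns only contain literal
residues and bracketed character classes like [BC] (no quantifiers, ranges,
negation or alternation). The method names and the example pattern are made
up for illustration. Keep in mind that the number of literal strings
multiplies with every character class in a pattern.

```ruby
# A.a: expand a simple pattern such as "A[BC]D" into all literal strings
# it can match. Only plain characters and [..] classes are handled.
def expand_pattern(pattern)
  # Tokenise into lists of choices: [BC] -> ["B", "C"], A -> ["A"].
  tokens = pattern.scan(/\[([^\]]+)\]|(.)/).map do |klass, single|
    klass ? klass.chars : [single]
  end
  # Cartesian product of the choices, joined back into strings.
  first, *rest = tokens
  first.product(*rest).map(&:join)
end

# A.b: put every literal string into a hash as a key.
def build_lookup(patterns)
  lookup = {}
  patterns.each do |pattern|
    expand_pattern(pattern).each { |literal| lookup[literal] = true }
  end
  lookup
end

# A.c: join all AA sequences with a character that is not an amino acid.
def concatenate(sequences, separator = ";")
  sequences.join(separator)
end

patterns     = ["A[RK]G"]  # hypothetical pattern, not from George's data
sequences    = %w[CAARGNDLYSKNIG GGARGNDLYSKNIG KKARGNDLYSKNIG]
lookup       = build_lookup(patterns)   # keys: "ARG", "AKG"
concatenated = concatenate(sequences)
```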

B. Do the actual search
  a. If the length of the strings to match (what used to be the regexps, and
are now the keys in the hash) is 5: take the first 5 characters of your
concatenated AA string and check if that substring exists as a key in the
hash. If so: you know that the AA sequence between the surrounding ";"
characters should go in box A.
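  b. Advance 1 position: take characters 2 to 6 and do the same lookup.
  c. Repeat until you reach the end of the concatenated string. A window that
contains a ";" can never match, because none of the keys contains one.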
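
Step B could look roughly like this in Ruby. It is only a sketch: it assumes
all keys in the hash have the same length k, and the method name, the
example keys and the example sequences are made up. To know which sequence a
hit belongs to, it simply counts the ";" separators it has passed.

```ruby
# Slide a window of length k over the concatenated string and look each
# window up in the hash. Returns [box_a, box_b].
def sort_into_boxes(sequences, lookup, k)
  concatenated = sequences.join(";")
  hits = {}        # indices of sequences with at least one match
  seq_index = 0    # which sequence the current window starts in

  (0..concatenated.length - k).each do |i|
    if concatenated[i] == ";"
      seq_index += 1   # the next window starts in the next sequence
      next
    end
    # Windows spanning a ";" cannot be keys, so they never match.
    hits[seq_index] = true if lookup.key?(concatenated[i, k])
  end

  # Box A: sequences with a hit. Box B: everything else.
  sequences.each_with_index
           .partition { |_seq, idx| hits.key?(idx) }
           .map { |pairs| pairs.map(&:first) }
end

lookup    = { "ARG" => true, "AKG" => true }   # hypothetical 3-mers
sequences = %w[CAARGNDLYSKNIG GGSRGNDLYSKNIG KKAKGNDLYSKNIG]
box_a, box_b = sort_into_boxes(sequences, lookup, 3)
# box_a => ["CAARGNDLYSKNIG", "KKAKGNDLYSKNIG"]
# box_b => ["GGSRGNDLYSKNIG"]
```

The data is scanned once and every window costs a single hash lookup, no
matter how many patterns there are. A possible tweak is to jump straight to
the next ";" once a sequence has had its first hit.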

You might have to tweak this approach to exactly fit your requirements, but
if your code used to take a very long time, this might speed things up
immensely.

(George: can you forward this to the ruby mailing list it was discussed on
initially? Cheers)

Good luck,
jan.


2009/10/16 George Githinji <georgkam at gmail.com>

> Recently had this discussion on the Ruby mailing list. Any ideas or
> solutions
>
> http://www.ruby-forum.com/topic/197365#new
>
> --
> ---------------
> Sincerely
> George
>
> Skype: george_g2
> Blog: http://biorelated.wordpress.com/
