[Biojava-l] Restriction Enzyme Support

Keith James kdj@sanger.ac.uk
20 Jun 2002 09:46:56 +0100

>>>>> "Mark" == Schreiber, Mark <mark.schreiber@agresearch.co.nz> writes:

    Mark> Hi - I don't believe that there is, although conceivably one
    Mark> wouldn't be hard to make and would probably be quite
    Mark> useful. Anyone interested in making one could probably take
    Mark> a lead from the proteomics package and the protease
    Mark> digestion classes. Any takers?

I'll have a go at this - I've done enough real restriction digests in
my time. Some virtual ones won't hurt.

The BioJava Protease class does its "cutting" by annotating Features
onto the target sequence. This is just one of the conceptual "cutting"
modes which spring to mind:

1. report the locations of cuts
2. annotate features representing the product fragments
3. return a set of actual product fragments as new objects
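
To make the three modes concrete, here is a minimal sketch in plain
Java. It deliberately uses Strings and String.indexOf rather than
BioJava SymbolLists, so it only illustrates the shape of the results
and is not a proposed implementation. The class name DigestSketch, its
method names, and the cutOffset parameter (distance from the start of
the recognition site to the cut on the top strand) are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DigestSketch {

    /** Mode 1: 0-based positions of cuts; a cut falls just before the returned index. */
    static List<Integer> cutPositions(String seq, String site, int cutOffset) {
        List<Integer> cuts = new ArrayList<>();
        int from = 0;
        while (true) {
            int hit = seq.indexOf(site, from);
            if (hit < 0) {
                break;
            }
            cuts.add(hit + cutOffset);
            from = hit + 1; // step by one so overlapping sites are not missed
        }
        return cuts;
    }

    /** Mode 2: half-open [start, end) ranges that could back fragment features. */
    static List<int[]> fragmentRanges(String seq, List<Integer> cuts) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        for (int cut : cuts) {
            if (cut > start) { // skip empty fragments
                ranges.add(new int[] {start, cut});
                start = cut;
            }
        }
        if (start < seq.length()) {
            ranges.add(new int[] {start, seq.length()});
        }
        return ranges;
    }

    /** Mode 3: the product fragments as new objects (here, new Strings). */
    static List<String> fragments(String seq, List<int[]> ranges) {
        List<String> out = new ArrayList<>();
        for (int[] r : ranges) {
            out.add(seq.substring(r[0], r[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        // EcoRI recognises GAATTC and cuts after the G (offset 1), top strand only.
        String seq = "TTGAATTCAAGGGAATTCTT";
        List<Integer> cuts = cutPositions(seq, "GAATTC", 1);
        List<int[]> ranges = fragmentRanges(seq, cuts);
        System.out.println(cuts);                   // [3, 13]
        System.out.println(fragments(seq, ranges)); // [TTG, AATTCAAGGG, AATTCTT]
    }
}
```

The sketch treats the sequence as linear and looks at one strand only;
circular targets and enzymes that cut outside their recognition site
would need more bookkeeping.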

I've just had a look at the BioPerl RestrictionEnzyme class which
nicely does all three. I'd say that's a suitable reference point for a
good implementation - I'll aim for a similar API.

Given that we are using SymbolLists and not chars, regexes are out of
the picture. So how do we make this efficient for big sequences? The
Protease class uses a simple scan down the sequence, but given typical
use cases for RE digests (scanning many kb or Mb with potentially
hundreds of enzymes) I don't think the performance would be acceptable.

With a bit of test code I found that I can quickly obtain all motifs
up to n residues long using the SuffixTree class, but I haven't quite
figured out how to map their locations back to the SymbolList. After
this is done we can find ambiguous matches using BasisSymbols. Does
that sound like a reasonable approach?
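
As a rough illustration of the idea (index the target once, keep the
positions of each motif, then resolve ambiguous sites against the
index), here is a sketch in plain Java. It is a stand-in only: it uses
a HashMap of fixed-length motifs instead of the BioJava SuffixTree, and
expands IUPAC ambiguity codes by hand instead of using BasisSymbols.
The class MotifIndexSketch and all of its methods are hypothetical
names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MotifIndexSketch {

    /** Map every k-long motif in seq to the 0-based positions where it starts. */
    static Map<String, List<Integer>> index(String seq, int k) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            idx.computeIfAbsent(seq.substring(i, i + k), m -> new ArrayList<>()).add(i);
        }
        return idx;
    }

    /** Concrete bases covered by one IUPAC nucleotide code. */
    static String bases(char code) {
        switch (code) {
            case 'A': case 'C': case 'G': case 'T': return String.valueOf(code);
            case 'R': return "AG";
            case 'Y': return "CT";
            case 'S': return "CG";
            case 'W': return "AT";
            case 'K': return "GT";
            case 'M': return "AC";
            case 'B': return "CGT";
            case 'D': return "AGT";
            case 'H': return "ACT";
            case 'V': return "ACG";
            case 'N': return "ACGT";
            default: throw new IllegalArgumentException("Unknown code: " + code);
        }
    }

    /** Expand an ambiguous recognition site into every concrete motif it matches. */
    static List<String> expand(String site) {
        List<String> out = new ArrayList<>();
        out.add("");
        for (char c : site.toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : out) {
                for (char b : bases(c).toCharArray()) {
                    next.add(prefix + b);
                }
            }
            out = next;
        }
        return out;
    }

    /** Sorted start positions of site; idx must have been built with k == site.length(). */
    static List<Integer> find(Map<String, List<Integer>> idx, String site) {
        List<Integer> hits = new ArrayList<>();
        for (String motif : expand(site)) {
            hits.addAll(idx.getOrDefault(motif, Collections.emptyList()));
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        // HincII recognises GTYRAC (Y = C or T, R = A or G).
        String seq = "AAGTCGACTTGTTAACGG";
        Map<String, List<Integer>> idx = index(seq, 6);
        System.out.println(find(idx, "GTYRAC")); // [2, 10]
    }
}
```

The index is built once per site length and can then be queried for
any number of enzymes of that length, which is where the saving over a
per-enzyme scan would come from. Expansion grows quickly for sites with
many N positions, so a real implementation would probably want to match
ambiguity symbols directly.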



-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Cambridge, UK