[Biojava-l] Restriction Enzyme Support

Schreiber, Mark mark.schreiber@agresearch.co.nz
Fri, 21 Jun 2002 08:56:17 +1200


I'd say suffix trees would be the way to do it quickly.

- Mark


> -----Original Message-----
> From: Keith James [mailto:kdj@sanger.ac.uk] 
> Sent: Thursday, 20 June 2002 8:47 p.m.
> To: Schreiber, Mark
> Cc: biojava-l@biojava.org
> Subject: Re: [Biojava-l] Restriction Enzyme Support
> 
> 
> >>>>> "Mark" == Schreiber, Mark <mark.schreiber@agresearch.co.nz> writes:
> 
>     Mark> Hi - I don't believe that there is, although conceivably one
>     Mark> wouldn't be hard to make and would probably be quite
>     Mark> useful. Anyone interested in making one could probably take
>     Mark> a lead from the proteomics package and the protease
>     Mark> digestion classes. Any takers?
> 
> I'll have a go at this - I've done enough real restriction 
> digests in my time. Some virtual ones won't hurt.
> 
> The BioJava Protease class does its "cutting" by annotating 
> Features onto the target sequence. This is just one of the 
> conceptual "cutting" modes which spring to mind:
> 
> 1. report the locations of cuts
> 2. annotate features representing the product fragments
> 3. return a set of actual product fragments as new objects
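
To make the three modes above concrete, here is a minimal sketch. It is not BioJava or BioPerl code: plain Strings stand in for SymbolLists, the class and method names (DigestSketch, cutPositions, fragmentRanges, fragments) are hypothetical, and it assumes a linear molecule scanned on the top strand only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: plain Strings stand in for BioJava SymbolLists. */
final class DigestSketch {

    /** Mode 1: 0-based index of the first base downstream of each cut. */
    static List<Integer> cutPositions(String seq, String site, int cutOffset) {
        List<Integer> cuts = new ArrayList<>();
        int from = 0;
        int hit;
        while ((hit = seq.indexOf(site, from)) >= 0) {
            cuts.add(hit + cutOffset);
            from = hit + 1; // step by one so overlapping sites are also found
        }
        return cuts;
    }

    /** Mode 2: half-open [start, end) ranges, the data a fragment feature would carry. */
    static List<int[]> fragmentRanges(int seqLength, List<Integer> cuts) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        for (int cut : cuts) {
            if (cut <= start || cut >= seqLength) {
                continue; // skip cuts that would produce an empty fragment
            }
            ranges.add(new int[] {start, cut});
            start = cut;
        }
        ranges.add(new int[] {start, seqLength});
        return ranges;
    }

    /** Mode 3: the product fragments as new objects. */
    static List<String> fragments(String seq, List<Integer> cuts) {
        List<String> out = new ArrayList<>();
        for (int[] r : fragmentRanges(seq.length(), cuts)) {
            out.add(seq.substring(r[0], r[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        String seq = "AAGAATTCGGGAATTCTT";
        // EcoRI recognises GAATTC and cuts after the first G (offset 1).
        List<Integer> cuts = cutPositions(seq, "GAATTC", 1);
        System.out.println(cuts);                 // [3, 11]
        System.out.println(fragments(seq, cuts)); // [AAG, AATTCGGG, AATTCTT]
    }
}
```

Modes 2 and 3 are both derived from the cut list, so an implementation could compute the cuts once and offer the other two as views. Circular sequences and enzymes that cut outside their recognition site would need extra handling that this sketch leaves out.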
> 
> I've just had a look at the BioPerl RestrictionEnzyme class 
> which nicely does all three. I'd say that's a suitable 
> reference point for a good implementation - I'll aim for a 
> similar API.
> 
> Given that we are using SymbolLists and not chars, regexes 
> are out of the picture. So how to make this efficient for big 
> sequences? The protease class uses a simple scan down the 
> sequence, but given typical use cases for RE digests (scan 
> many Kb/Mb with potentially hundreds of enzymes) I don't 
> think the performance would be acceptable.
> 
> With a bit of test code I found that I can quickly obtain all 
> motifs up to n residues long using the SuffixTree class, but 
> I haven't quite figured out how to map their locations back 
> to the SymbolList. After this is done we can find ambiguous 
> matches using BasisSymbols. Does that sound like a reasonable 
> approach?
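
A rough sketch of the two steps described above (keeping motif locations, then matching ambiguous sites) might look like the following. It deliberately swaps in a simpler technique: a hash index of fixed-length k-mers with their positions rather than the BioJava SuffixTree, and plain Strings rather than SymbolLists and BasisSymbols. The class MotifIndexSketch and its methods are hypothetical, and the target sequence is assumed to contain only A, C, G and T.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a k-mer position index, not the BioJava SuffixTree. */
final class MotifIndexSketch {
    /** IUPAC nucleotide codes mapped to the concrete bases they denote. */
    private static final Map<Character, String> IUPAC = new HashMap<>();
    static {
        String[] codes = {"A:A", "C:C", "G:G", "T:T", "R:AG", "Y:CT", "S:CG", "W:AT",
                          "K:GT", "M:AC", "B:CGT", "D:AGT", "H:ACT", "V:ACG", "N:ACGT"};
        for (String c : codes) {
            IUPAC.put(c.charAt(0), c.substring(2));
        }
    }

    private final String seq;
    private final int k;
    private final Map<String, List<Integer>> index = new HashMap<>();

    /** Records the 0-based start position of every k-mer in the sequence. */
    MotifIndexSketch(String seq, int k) {
        this.seq = seq;
        this.k = k;
        for (int i = 0; i + k <= seq.length(); i++) {
            index.computeIfAbsent(seq.substring(i, i + k), key -> new ArrayList<>()).add(i);
        }
    }

    /** Expands an ambiguous pattern into every concrete sequence it denotes. */
    static List<String> expand(String pattern) {
        List<String> out = new ArrayList<>();
        out.add("");
        for (char code : pattern.toCharArray()) {
            String bases = IUPAC.get(code);
            if (bases == null) {
                throw new IllegalArgumentException("Unknown IUPAC code: " + code);
            }
            List<String> next = new ArrayList<>();
            for (String prefix : out) {
                for (char base : bases.toCharArray()) {
                    next.add(prefix + base);
                }
            }
            out = next;
        }
        return out;
    }

    /** Checks the part of the site beyond the first k codes against the sequence. */
    private boolean tailMatches(String site, int pos) {
        for (int i = k; i < site.length(); i++) {
            if (IUPAC.get(site.charAt(i)).indexOf(seq.charAt(pos + i)) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns ascending 0-based start positions of a site at least k codes long. */
    List<Integer> find(String site) {
        if (site.length() < k) {
            throw new IllegalArgumentException("Site is shorter than the index word size");
        }
        expand(site); // validates every code up front
        List<Integer> hits = new ArrayList<>();
        for (String seed : expand(site.substring(0, k))) {
            for (int pos : index.getOrDefault(seed, Collections.emptyList())) {
                if (pos + site.length() <= seq.length() && tailMatches(site, pos)) {
                    hits.add(pos);
                }
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        MotifIndexSketch idx = new MotifIndexSketch("CCGTTAACGGGTCGACAA", 4);
        // HincII recognises GTYRAC (Y = C or T, R = A or G).
        System.out.println(idx.find("GTYRAC")); // [2, 10]
    }
}
```

The index is built once per sequence and then reused for every enzyme, which is where the saving over a per-enzyme linear scan would come from. A real suffix tree would also handle sites shorter than k and avoid expanding highly ambiguous seeds.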
> 
> Keith
> 
> -- 
> 
> -= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
> Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Cambridge, UK
> 