[Biojava-l] How to use the RestrictionEnzyme/RestrictionEnzymeManager classes

Mon, 09 Sep 2002 15:19:08 -0400 (EDT)

Some suggestions that might be useful:
1. Digestion is usually partial (in practice no enzyme cuts 100% of its 
sites), depending on the amount of enzyme and the reaction conditions. It 
might be helpful for the code to take parameters describing the conditions 
so that the digestion efficiency can be estimated.

2. The sequence flanking a recognition site matters when digesting. 
Different enzymes require different lengths of flanking sequence around the 
recognition site: some need 2 bases, some more, some none. Without enough 
flanking sequence, efficiency is sometimes reduced, and sometimes the enzyme 
won't cut at all.

3. I don't know whether an appropriate error model to handle star activity 
is needed.
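
To make points 1 and 2 concrete, here is a minimal sketch in plain Java (no BioJava calls). The class name DigestConditions, its fields, and the model itself (a linear molecule, every site cut independently with the same probability) are my own assumptions for illustration; nothing like this exists in the current classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for digestion conditions; not part of BioJava. */
final class DigestConditions {
    final double cutProbability; // assumed per-site chance of being cut, 0..1
    final int minFlank;          // assumed bases required on each side of a site

    DigestConditions(double cutProbability, int minFlank) {
        if (cutProbability < 0.0 || cutProbability > 1.0 || minFlank < 0) {
            throw new IllegalArgumentException("bad digestion conditions");
        }
        this.cutProbability = cutProbability;
        this.minFlank = minFlank;
    }

    /** Keeps only sites (0-based start indices) with enough flanking sequence on both sides. */
    List<Integer> usableSites(List<Integer> siteStarts, int siteLength, int seqLength) {
        List<Integer> kept = new ArrayList<>();
        for (int start : siteStarts) {
            int leftFlank = start;
            int rightFlank = seqLength - (start + siteLength);
            if (leftFlank >= minFlank && rightFlank >= minFlank) {
                kept.add(start);
            }
        }
        return kept;
    }

    /** Probability that all n usable sites are cut, assuming independent sites. */
    double completeDigestProbability(int n) {
        return Math.pow(cutProbability, n);
    }
}
```

A real model would probably need per-enzyme flank requirements and a reduced (rather than zero) efficiency for short flanks; this only shows where such parameters could plug in.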

Hope this helps! ^_^
Haibo

Quoting Keith James <kdj@sanger.ac.uk>:

> >>>>> "Sylvain" == Sylvain Foisy <sylvain.foisy@bioneq.qc.ca> writes:
> 
>     Sylvain> Hi, I would like to use these classes to digest Sequence
>     Sylvain> objects. How would I do that? I tried to find something
>     Sylvain> similar to Digest (proteomics class) but I could not find
>     Sylvain> it. I would like to do a program like tacg that would
>     Sylvain> produce digests but using frames.
> 
> Hi Sylvain,
> 
> Sounds cool. tacg is the king of RE programs - I have to warn you that
> you won't get anywhere near its speed using BioJava.
> 
> The reason that these classes are a bit cryptic at the moment is that
> I have only had enough spare time to write the enzyme/enzyme-manager
> part. I have started on the RestrictionMapper (which will mark
> recognition sites and cut sites on a Sequence as Features) but not
> RestrictionDigest (which will cut a SymbolList into pieces using
> RestrictionEnzymes).
> 
> [Technical note: in order to efficiently process many
> RestrictionEnzymes at once the RestrictionMapper will need to be
> configurably multithreaded - so I need to add a ThreadPool interface
> (in case anyone wants to use a 3rd-party pool) and a SimpleThreadPool
> implementation.]
> 
> Here is a re-post of the relevant part of my initial announcement on
> this list. The key part for doing searches on a Sequence/SymbolList is
> that you create a Java 1.4 CharSequence view of the SymbolList and
> search that using a standard Java 1.4 regex Matcher using Patterns
> (one forward strand/one reverse strand) obtained from a
> RestrictionEnzyme instance (more detail below):
> 
> ----------------------------------------------------------------------
> 
> org.biojava.bio.molbio.RestrictionEnzyme
> 
> This class specifies restriction enzyme properties (recognition site,
> cut site(s), type of end produced) and also returns regex Strings
> suitable for finding forward and reverse strand recognition sites.
> 
> The constructors are public so that you can create custom enzymes, but
> the main way to get instances is through the RestrictionEnzymeManager.
> 
> org.biojava.bio.molbio.RestrictionEnzymeManager
> 
> This class allows you to get an enzyme by name, get all
> isoschizomers of an enzyme by name, get all n-cutters and get a pair
> of java.util.regex.Patterns for the forward and reverse strand
> sites. There is a properties file
> (RestrictionEnzymeManager.properties) which is loaded as a
> ResourceBundle and tells the class where to find a REBASE file
> (withrefm.### format, same format as used by EMBOSS program
> rebaseextract - see REBASE site). I have not checked in a fallback
> copy of REBASE - it's quite big and I wanted to get some feedback
> first. Do we want the whole of a specific version of REBASE, or just a
> subset of common enzymes? Anyone can override this by using their own
> copy of REBASE and putting a new properties file in their CLASSPATH.
> 
> The part which is only partly implemented is searching. You can now
> do searches using
> 
> org.biojava.bio.seq.io.SymbolListCharSequence
> 
> This class is an implementation of the Java 1.4 interface
> CharSequence. It wraps a SymbolList and allows full regex searching of
> any SymbolList whose Symbols can be tokenized to chars. It appears
> that the regex Matcher does not call the subSequence or toString
> methods, only charAt (which translates directly to symbolAt) so no
> extra copies of a big sequence get made. You need to use the regex
> engine in Java 1.4.
> 
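The idea of a CharSequence view can be sketched without BioJava. The class below is a hypothetical stand-in (BaseCodeCharSequence and its byte-code layout are my assumptions, not how SymbolListCharSequence is actually written): charAt translates a stored code to a char on demand, so the regex engine can scan the data without a String copy being built first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a SymbolList view: exposes a byte[] of base codes as chars. */
final class BaseCodeCharSequence implements CharSequence {
    private static final char[] TOKENS = {'a', 'c', 'g', 't'};
    private final byte[] codes; // values 0..3, one per base
    private final int from;     // inclusive
    private final int to;       // exclusive

    BaseCodeCharSequence(byte[] codes) {
        this(codes, 0, codes.length);
    }

    private BaseCodeCharSequence(byte[] codes, int from, int to) {
        this.codes = codes;
        this.from = from;
        this.to = to;
    }

    @Override public int length() {
        return to - from;
    }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return TOKENS[codes[from + index]]; // translate on demand, no copy
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length() || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new BaseCodeCharSequence(codes, from + start, from + end); // shares the array
    }

    @Override public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    /** Returns the start of the first regex match in seq, or -1 if there is none. */
    static int firstMatch(CharSequence seq, String regex) {
        Matcher m = Pattern.compile(regex).matcher(seq);
        return m.find() ? m.start() : -1;
    }
}
```

Calling firstMatch(new BaseCodeCharSequence(codes), "gaattc") searches the coded data directly. Note that Matcher.group() does build a String for the matched region, so only start() and end() stay copy-free.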
> Finally there's stuff to do:
> 
> org.biojava.bio.molbio.RestrictionDigest
> 
> Is not written. This will do the convenience stuff of spitting out
> SymbolList products etc. It should probably be threaded to search
> multiple enzymes (or at least both strands for one enzyme)
> simultaneously.
> 
> One thing I'm not clear on: do we want "biologically correct"
> cutting? That is, if my sequence has two different enzyme sites which
> overlap and I do sequential digests, does the second fail to cut
> because its site is now partly single-stranded, even though the regex
> still matches on one strand? It seems the right way to me, but it may
> not to everyone.
> 
> In summary, you can currently do full ambiguity searches on both
> strands with a bit of work.
> 
> 1. Get a copy of REBASE format #31
> 2. Edit the RestrictionEnzymeManager.properties file to point to it
> 
> Do something like this:
> 
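> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.biojava.bio.molbio.RestrictionEnzyme;
> import org.biojava.bio.molbio.RestrictionEnzymeManager;
> import org.biojava.bio.seq.io.SymbolListCharSequence;
> 
> RestrictionEnzyme ecoRI = RestrictionEnzymeManager.getEnzyme("EcoRI");
> // pat[0] matches the forward strand, pat[1] the reverse strand
> Pattern[] pat = RestrictionEnzymeManager.getPatterns(ecoRI);
> 
> CharSequence charSeq = new SymbolListCharSequence(mySymbolList);
> 
> Matcher forward = pat[0].matcher(charSeq);
> Matcher reverse = pat[1].matcher(charSeq);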
> 
> Then proceed to use the Matcher as normal. Right now the coordinate
> you get back will be the start of the recognition site and you will
> have to calculate the actual cut(s). There are methods in
> RestrictionEnzyme which return the position(s) of the cut site in the
> coordinate space of the recognition site SymbolList (there are some
> freaky enzymes which cut both sides of their recognition site).
> 
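Turning match starts into cut positions can be sketched with java.util.regex alone. CutSiteFinder and the cutOffset convention (number of recognition-site bases upstream of the cut) are assumptions for illustration; the RestrictionEnzyme methods may use a different coordinate convention, so check before mixing the two. The loop restarts each search one base past the previous match so that overlapping sites are not skipped, which a plain find() loop would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of BioJava. */
final class CutSiteFinder {
    /**
     * Returns, for every match of site in seq, the 0-based index of the first
     * base downstream of the cut: match start plus cutOffset.
     */
    static List<Integer> cutPositions(CharSequence seq, Pattern site, int cutOffset) {
        List<Integer> cuts = new ArrayList<>();
        Matcher m = site.matcher(seq);
        int from = 0;
        // find(int) resets the matcher and searches from the given index,
        // so stepping one base at a time also reports overlapping sites.
        while (from <= seq.length() && m.find(from)) {
            cuts.add(m.start() + cutOffset);
            from = m.start() + 1;
        }
        return cuts;
    }
}
```

For EcoRI (G^AATTC) the forward-strand offset under this convention would be 1. The same method can be called with the reverse-strand Pattern and its own offset.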
> ----------------------------------------------------------------------
> 
> 
> I have had no feedback on whether the digestion (i.e. where
> SymbolLists are actually "cut" rather than annotated with Features)
> should be biologically correct. This involves remembering the
> positions of all single-stranded regions (3' or 5' overhangs) produced
> by digestion such that followup digestion with another enzyme does not
> cut where its recognition or cut site overlaps. This is probably only
> important if we want to progress to full virtual cloning experiments.
> 
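One possible way to remember single-stranded regions is sketched below. OverhangTracker is a hypothetical name, and the sketch assumes all positions stay in the coordinates of the uncut sequence; it covers only the overhang case described above and says nothing about sites that span a blunt cut.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bookkeeping for overhangs left by earlier digests; not part of BioJava. */
final class OverhangTracker {
    // Each entry is a half-open interval {start, end} of single-stranded bases.
    private final List<int[]> singleStranded = new ArrayList<>();

    /**
     * Records the overhang between the top-strand and bottom-strand cut
     * positions (each the 0-based index of the first base after the cut).
     */
    void recordCut(int topCut, int bottomCut) {
        int start = Math.min(topCut, bottomCut);
        int end = Math.max(topCut, bottomCut);
        if (start < end) { // a blunt cut leaves no overhang
            singleStranded.add(new int[] {start, end});
        }
    }

    /** True if the site [siteStart, siteStart + siteLength) overlaps no recorded overhang. */
    boolean isIntact(int siteStart, int siteLength) {
        int siteEnd = siteStart + siteLength;
        for (int[] region : singleStranded) {
            if (siteStart < region[1] && region[0] < siteEnd) {
                return false;
            }
        }
        return true;
    }
}
```

With an EcoRI site starting at index 2, the cuts fall at 3 (top) and 7 (bottom), so recordCut(3, 7) marks the AATT overhang and a second enzyme whose site starts at index 5 would be refused.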
> In summary, you can currently use RestrictionEnzyme instances to
> locate sites on a SymbolList using the Java regex package. I'm not
> sure when the support classes for digestion will be ready - I'm kinda
> busy right now. Probably 2 - 4 weeks.
> 
> Keith
> 
> -- 
> 
> - Keith James <kdj@sanger.ac.uk> bioinformatics programming support -
> - Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -
> 
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l@biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
> 

Haibo Zhang
Computational Biology, NJIT & Rutgers University
http://afs13.njit.edu/~hz5