[Biojava-l] Restriction Enzyme Support
Ryan Cuprak
507107@corpmail.kodak.com
Thu, 20 Jun 2002 06:44:56 -0700
I did an initial pass at this a couple of weeks ago when creating a
prototype (not using biojava). This code isn't efficient and has a couple of
bugs, notably:
1. Doesn't go in reverse, only cuts the DNA in one direction.
2. Doesn't handle multiple enzymes
3. Doesn't handle enzymes which are imprecise, for instance an enzyme which
cuts AGGR (R is A or G)
-Ryan Cuprak
Note, I am calculating the weight of the fragments.
import java.util.Vector;
import java.util.List;
import java.util.StringTokenizer;
public class NucleotideRestrictionBean {
protected String m_left;
protected String m_right;
protected String m_sequence;
protected Vector m_digestionResults;
public NucleotideRestrictionBean () {
}
public void setRestrictionLeft ( String left ) {
m_left = left;
}
public void setRestrictionRight ( String right ) {
m_right = right;
}
public void setSequenceToCut ( String sequence ) {
m_digestionResults = new Vector();
m_sequence = sequence;
}
public Vector getDigestionResults () {
return ( m_digestionResults );
}
public void performDigestion () {
Vector splitResults = new Vector();
String restriction_sequence = m_left+m_right;
for ( int i = 0; i < m_sequence.length(); i++ ) {
char prospective[] = new char[restriction_sequence.length()];
if ( i+restriction_sequence.length() < m_sequence.length() ) {
m_sequence.getChars(i,i+restriction_sequence.length(),prospective,0);
} else {
m_sequence.getChars(i,m_sequence.length(),prospective,0);
}
String comparator = new String ( prospective );
if ( restriction_sequence.equals(comparator) ) {
splitResults.add ( new Integer(i));
}
}
float weight = 0;
int counter = 0;
StringBuffer bandSeq = new StringBuffer();
for ( int i = 0 ; i < m_sequence.length(); i++ ) {
char base;
base = m_sequence.charAt(i);
switch ( base ) {
case 'a': {
weight += 313.21;
bandSeq.append('a');
break;
}
case 'c': {
weight += 289.19;
bandSeq.append('c');
break;
}
case 'g': {
weight += 329.21;
bandSeq.append('g');
break;
}
case 't': {
weight += 288.20;
bandSeq.append('t');
break;
}
}
if ( counter >= splitResults.size() ) {
//-- ignore, terrible looking code eh?
} else {
if ( i ==
(((((Integer)splitResults.get(counter))).intValue()+m_left.length())-1) ) {
NucleotideStandardBand band = new
NucleotideStandardBand();
band.setSequence( bandSeq.toString() );
band.setWeight(weight);
m_digestionResults.add(band);
counter++;
weight = 0;
bandSeq = new StringBuffer();
}
}
}
if ( bandSeq.length() > 0 ) {
NucleotideStandardBand band = new NucleotideStandardBand();
band.setSequence( bandSeq.toString() );
band.setWeight(weight);
m_digestionResults.add(band);
}
}
}
On 6/20/02 1:46 AM, "Keith James" <kdj@sanger.ac.uk> wrote:
>>>>>> "Mark" == Schreiber, Mark <mark.schreiber@agresearch.co.nz> writes:
>
> Mark> Hi - I don't believe that there is, although conceivably one
> Mark> wouldn't be hard to make and would probably be quite
> Mark> useful. Anyone interested in making one could probably take
> Mark> a lead from the proteomics package and the protease
> Mark> digestion classes. Any takers?
>
> I'll have a go at this - I've done enough real restriction digests in
> my time. Some virtual ones won't hurt.
>
> The BioJava Protease class does its "cutting" by annotating Features
> onto the target sequence. This is just one of the conceptual "cutting"
> mode which springs to mind:
>
> 1. report the locations of cuts
> 2. annotate features representing the product fragments
> 3. return a set of actual product fragments as new objects
>
> I've just had a look at the BioPerl RestrictionEnzyme class which
> nicely does all three. I'd say that's a suitable reference point for a
> good implementation - I'll aim for a similar API.
>
> Given that we are using SymbolLists and not chars, regexes are out of
> the picture. So how to make this efficient for big sequences? The
> protease class uses a simple scan down the sequence, but given typical
> use cases for RE digests (scan many Kb/Mb with potentially hundreds of
> enzymes) I don't think the performance would be acceptable.
>
> With a bit of test code I found that I can quickly obtain all motifs
> up to n residues long using the SuffixTree class, but I haven't quite
> figured out how to map their locations back to the SymbolList. After
> this is done we can find ambiguous matches using BasisSymbols. Does
> that sound like a reasonable approach?
>
> Keith