[Biojava-l] Restriction Enzyme Support

Ryan Cuprak 507107@corpmail.kodak.com
Thu, 20 Jun 2002 06:44:56 -0700


 I did an initial pass at this a couple of weeks ago when creating a
prototype (not using biojava). This code isn't efficient and has a couple of
bugs, notably:
 1. Doesn't go in reverse, only cuts the DNA in one direction.
 2. Doesn't handle multiple enzymes
 3. Doesn't handle enzymes which are imprecise, for instance an enzyme which
cuts AGGR (R is A or G)
 
 -Ryan Cuprak
Note, I  am calculating the weight of the fragments.

import java.util.Vector;
import java.util.List;
import java.util.StringTokenizer;

public class NucleotideRestrictionBean {

    protected String m_left;
    protected String m_right;
    protected String m_sequence;
    protected Vector   m_digestionResults;

    public NucleotideRestrictionBean () {

    }

    public void setRestrictionLeft ( String left ) {
        m_left = left;
    }

    public void setRestrictionRight ( String right ) {
        m_right = right;
    }

    public void setSequenceToCut ( String sequence ) {
        m_digestionResults = new Vector();
        m_sequence = sequence;
    }

    public Vector getDigestionResults () {
        return ( m_digestionResults );
    }

    public void performDigestion () {
        Vector splitResults = new Vector();
        String restriction_sequence = m_left+m_right;
        for ( int i = 0; i < m_sequence.length(); i++ ) {
            char prospective[] = new char[restriction_sequence.length()];
            if ( i+restriction_sequence.length() < m_sequence.length() ) {
                
m_sequence.getChars(i,i+restriction_sequence.length(),prospective,0);
            } else {
                m_sequence.getChars(i,m_sequence.length(),prospective,0);
            }

            String comparator = new String ( prospective );
            if ( restriction_sequence.equals(comparator) ) {
                splitResults.add ( new Integer(i));
            }
        }
        float weight = 0;
        int counter = 0;
        StringBuffer bandSeq = new StringBuffer();
        for ( int i = 0 ; i < m_sequence.length(); i++ ) {
            char base;
            base = m_sequence.charAt(i);
            switch ( base ) {
                case 'a': {
                    weight += 313.21;
                    bandSeq.append('a');
                    break;
                }
                case 'c': {
                    weight += 289.19;
                    bandSeq.append('c');
                    break;
                }
                case 'g': {
                    weight += 329.21;
                    bandSeq.append('g');
                    break;
                }
                case 't': {
                    weight += 288.20;
                    bandSeq.append('t');
                    break;
                }
            }
            if ( counter >= splitResults.size() ) {
                //-- ignore, terrible looking code eh?
            } else {
                if ( i ==
(((((Integer)splitResults.get(counter))).intValue()+m_left.length())-1) ) {
                    NucleotideStandardBand band = new
NucleotideStandardBand();
                    band.setSequence( bandSeq.toString() );
                    band.setWeight(weight);
                    m_digestionResults.add(band);
                    counter++;
                    weight = 0;
                    bandSeq = new StringBuffer();
                }
            }
        }
        if ( bandSeq.length() > 0 ) {
            NucleotideStandardBand band = new NucleotideStandardBand();
            band.setSequence( bandSeq.toString() );
            band.setWeight(weight);
            m_digestionResults.add(band);
        }
    }
}
On 6/20/02 1:46 AM, "Keith James" <kdj@sanger.ac.uk> wrote:

>>>>>> "Mark" == Schreiber, Mark <mark.schreiber@agresearch.co.nz> writes:
> 
>   Mark> Hi - I don't believe that there is, although conceivably one
>   Mark> wouldn't be hard to make and would probably be quite
>   Mark> useful. Anyone interested in making one could probably take
>   Mark> a lead from the proteomics package and the protease
>   Mark> digestion classes. Any takers?
> 
> I'll have a go at this - I've done enough real restriction digests in
> my time. Some virtual ones won't hurt.
> 
> The BioJava Protease class does its "cutting" by annotating Features
> onto the target sequence. This is just one of the conceptual "cutting"
> mode which springs to mind:
> 
> 1. report the locations of cuts
> 2. annotate features representing the product fragments
> 3. return a set of actual product fragments as new objects
> 
> I've just had a look at the BioPerl RestrictionEnzyme class which
> nicely does all three. I'd say that's a suitable reference point for a
> good implementation - I'll aim for a similar API.
> 
> Given that we are using SymbolLists and not chars, regexes are out of
> the picture. So how to make this efficient for big sequences? The
> protease class uses a simple scan down the sequence, but given typical
> use cases for RE digests (scan many Kb/Mb with potentially hundreds of
> enzymes) I don't think the performance would be acceptable.
> 
> With a bit of test code I found that I can quickly obtain all motifs
> up to n residues long using the SuffixTree class, but I haven't quite
> figured out how to map their locations back to the SymbolList. After
> this is done we can find ambiguous matches using BasisSymbols. Does
> that sound like a reasonable approach?
> 
> Keith