[Bioperl-l] Re: [Bioperl-guts-l] Bio::Restriction modifications

Peter Blaiklock pblaiklo at 110.net
Wed Dec 3 22:13:28 EST 2003


On Wednesday 03 December 2003 08:16, Rob Edwards wrote:
> Peter,
>
> It sounds like we used the same approach to fix the same problems -
> great minds :)
>
> However, we both fixed extra things and so we need to merge.
>
> Here
>
> > Bio::Restriction::Analysis
> >
> >     Changed the site finding algorithm in '_new_cuts'. The new
> > algorithm
> > uses pos to find the recognition site regex and finds the actual cut
> > point
> > using $enz->cut. The cut positions are put in an array and substr is
> > used
> > to get digestion fragments.
>
> This is the best way to do it, it allows you to do a lot more with the
> sequences. I also used this approach for ambiguous enzymes. For
> non-ambiguous enzymes it is a lot more efficient to use an index based
> string search rather than a regexp, and I added an internal method to
> deal with this.

Nice.

> >    Fixed the origin bug for circular target sequences. If the sequence
> > is
> > circular the first 40 bases are appended to the end of the sequence.
> > Sites
> > that span the origin are now detected. First and last fragments are
> > reattached as before.
>
> I added a  method to EnzymeCollection to return the longest cut site in
> the collection and then added that length of sequence. This will
> correctly deal with any site > 40 bp. Using position instead of
> fragments sure makes this a lot easier, huh?

Sure does. You might want to make sure that the length includes the padding 
for offset cut points and dual cutters.

> >    Fixed the non-palindromes bug.  Non-palindromic sites are detected
> > in
> > the reverse orientation by searching the reverse complement of the
> > target
> > sequence, using $enz->complementary_cut and then subtracting from the
> > sequence length.
>
> I added a new method to Enzyme.pm to return the complementary cut site.
> I don't particularly like the way I did this. I'd like Analysis to
> correctly use the Enzyme->cut and Enzyme->complementary_cut which it
> doesn't at the moment. I think this is a lot better than using the "^"
> in the site routine.

You mean having site, cut and complementary_cut just be get/setters and 
letting the parser(s) deal with everything else? Sounds good.

> >    Fixed a bug where a site would be detected even if the
> > complementary cut
> > was off the end of the target sequence.
>
> I am not sure if I deal with this properly. I think it should be
> handled properly with circularization?

The problem is with offset cutters. If the cut point on the complementary 
strand is off the end, the target strand won't be cut either. Wouldn't matter 
if they were all blunt ended, but... There's a related problem with dual 
cutters. If either precut is off the end the enzyme won't digest at the 
postcut and should not be reported. As far as I could tell your code doesn't 
check for this but maybe I missed something?

> >    Added a method 'fragment_maps($enzyme_name)'. This method returns an
> > array of hashes where each hash corresponds to a restriction fragment.
> > The
> > hash keys are 'start', 'end' and 'seq'. 'seq' holds the sequence of the
> > fragment, while 'start' and 'end' hold the positions of the first and
> > last
> > bases of the fragment in the target sequence. This will allow
> > restriction fragments to be turned into features.
>
> This is cool.
>
> >       Added internal methods '_find_cut', '_multicut' and ' _digest'.
> > '_find_cut' figures out the cut point given the position of the
> > recognition
> > site and checks whether the cut point is off the end of the molecule.
> > This
> > has to be done twice for non-palindromic enzymes, so it makes sense to
> > make
> > it a separate method.
> >    '_digest' gets the restriction fragments from the list of cut
> > points. I
> > split this off of '_new_cuts' for the sake of readability, particularly
> > after we start dealing with overlapping sites.
> >    '_multicut' checks whether the "other" cut of a dual cut enzyme is
> > off
> > the end of the sequence. Could probably be folded back into
> > '_find_cut'.
>
> I did something similar, but probably used about 5x as many methods
> (making it much less clear I am sure).
>
> > Bio::Restriction::Enzyme
> >
> >    Changed the enzyme name correction code so only a single trailing
> > '1'
> > gets changed to an 'I'. This should prevent name mangling of enzymes
> > like Alw21I.
> >
> >    Changed '$cut && $self->cut($cut);' to 'defined $cut &&
> > $self->cut($cut);' (same for complementary cut) in the constructor.
> > This allows $self->cut to be zero.
> >
> >    Some minor changes in methods 'cut' and 'site'.
>
> These should be committed.
>
> > Bio::Restriction::IO::base
> >
> >    Changed method '_make_multicuts' so that the second enzyme is not
> > added
> > to the collection and the first enzyme is not added to the second
> > enzyme's
> > "others" array, only the other way around. The idea is to prevent dual
> > cut
> > sites from being reported twice. This will break scripts that used this
> > behavior to work around the dual cut bug, but you shouldn't have to do
> > that
> > any more.
> >
> >    Similar changes made to '_make_multisites'.
>
> These should be committed too.
>
> I'll try and merge this code with what I wrote and add it to cvs in a
> day or so. Once I get a near working version, I'll let you know,
> perhaps you can help debug the changes.

Great. I'll assemble some tests.

> Take a look at the latest versions in cvs and let me know what you
> think.
>
> Rob
 
Peter Blaiklock


More information about the Bioperl-l mailing list