[Bioperl-l] additional methods for Bio::SeqUtils for in-silico cloning

Fri Jan 13 21:47:50 UTC 2012

Hi Chris,

Thanks for merging my pull request. I now have a version that allows
using clone as an option. Would that be of interest too?
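
A minimal sketch of why cloning matters for subclasses that carry extra
attributes. It does not use BioPerl: My::Seq and its vector_name attribute
are hypothetical stand-ins for a custom Bio::Seq subclass, and the core
Storable module stands in for whichever cloning backend is actually used.
It only illustrates the difference between the two strategies.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Hypothetical stand-in for a custom Bio::Seq subclass with an extra attribute.
package My::Seq {
    sub new {
        my ($class, %args) = @_;
        return bless { %args }, $class;
    }
    sub seq         { $_[0]{seq} }
    sub vector_name { $_[0]{vector_name} }
}

my $orig = My::Seq->new(seq => 'AAAAAAAAAA', vector_name => 'pFoo');

# Sequence left after deleting positions 4..6 (1-based, inclusive).
my $product_seq = substr($orig->seq, 0, 3) . substr($orig->seq, 6);

# Strategy 1: build the product with new(), passing only the standard fields.
# The extra attribute is not carried over.
my $via_new = ref($orig)->new(seq => $product_seq);

# Strategy 2: deep-copy the object, then replace the sequence on the copy.
# The extra attribute survives because the whole object was copied.
my $via_clone = dclone($orig);
$via_clone->{seq} = $product_seq;

printf "new:   seq=%s vector_name=%s\n",
    $via_new->seq, $via_new->vector_name // '(lost)';
printf "clone: seq=%s vector_name=%s\n",
    $via_clone->seq, $via_clone->vector_name // '(lost)';
```

With new() the caller has to know about every extra attribute; with a deep
copy it does not, at the cost of copying everything the object refers to.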

Frank

On 11/01/12 21:03, Frank Schwach wrote:
> Great, I'll work on a branch that gives the user the option to use 
> clone instead of new and then we can see if we want to use that in the 
> end. In the meantime, what do you think about pulling this into 
> bioperl-live? When I have some time again I can work on the HOWTO for 
> these new features for the BioPerl wiki.
>
> Frank
>
>
> On 11/01/12 18:42, Fields, Christopher J wrote:
>> Note that Bio::Root::Root now has a clone() method that one can take 
>> advantage of for this purpose; if Storable or Clone is available, it 
>> will pick one of the two, preferably Clone over Storable.  It's 
>> fairly untested, but we haven't run into problems with it yet (I 
>> think it was in the last CPAN release).
>>
>> chris
>>
>> On Jan 11, 2012, at 12:38 PM, Roy Chaudhuri wrote:
>>
>>> Hi Frank,
>>>
>>> Looks great, I like the use of between locations, didn't think of that.
>>>
>>> It was suggested that I avoid using Clone for cat, 
>>> trunc_with_features etc. to avoid adding a dependency (which may no 
>>> longer be an issue) and because it would cause problems for Bio::Seq 
>>> implementations that use a database as the back-end. Maybe you could 
>>> add it as an option, but keep the default as is?
>>>
>>> Cheers,
>>> Roy.
>>>
>>> On 11/01/2012 18:16, Frank Schwach wrote:
>>>> Hi Roy and Chris,
>>>>
>>>> I have made the changes to the code now. As you suggested, feature ends
>>>> no longer change type and I insert a note instead to inform about the
>>>> deletion (or insertion), showing the length and position.
>>>> I have also added a feature to annotate deletion sites themselves (with
>>>> IN-BETWEEN locations).
>>>>
>>>> Roy's test script now prints:
>>>>
>>>> LOCUS       seq-accession_number            7 bp    dna     linear   UNK
>>>> ACCESSION   unknown
>>>> FEATURES             Location/Qualifiers
>>>>        CDS             join(2..3,4..6)
>>>>                        /note="3bp internal deletion between pos 3 and 4"
>>>>        CDS             2..3
>>>>                        /note="2bp deleted from feature end"
>>>>        misc_feature    3^4
>>>>                        /note="deletion of 3bp"
>>>> ORIGIN
>>>>           1 aaaaaaa
>>>> //
>>>>
>>>>
>>>> or, if you add strand information (-1 in this case) to the second feature:
>>>>
>>>> LOCUS       seq-accession_number            7 bp    dna     linear   UNK
>>>> ACCESSION   unknown
>>>> FEATURES             Location/Qualifiers
>>>>        CDS             join(2..3,4..6)
>>>>                        /note="3bp internal deletion between pos 3 and 4"
>>>>        CDS             complement(2..3)
>>>>                        /note="2bp deleted from feature 5' end"
>>>>        misc_feature    3^4
>>>>                        /note="deletion of 3bp"
>>>> ORIGIN
>>>>           1 aaaaaaa
>>>> //
>>>>
>>>> I have committed this along with some bugfixes to my master branch on GitHub
>>>> https://github.com/fschwach/bioperl-live
>>>> so it's now also in my existing pull request.
>>>>
>>>> I'm still wondering if cloning the sequence objects rather than calling
>>>> 'new' on their respective classes would be an option inside 'delete' and
>>>> 'insert'?
>>>> I'm experimenting with this for my own purposes because I have to work
>>>> with custom sub-classes of Bio::Seq which have additional attributes and
>>>> therefore set 'can_call_new' to false.
>>>> Without cloning the objects, I first have to convert the custom
>>>> Bio::Seq::Foo objects to standard Bio::Seq, which I would like to avoid.
>>>> Is there any reason why something like Clone::Fast should not be used in
>>>> this case? It seems to work for me but there may be situations where
>>>> this is going to blow up which I am not aware of.
>>>> Cloning rather than calling new could be made an option in
>>>> Bio::SeqUtils. I have most of the code for that already.
>>>>
>>>> Frank
>>>>
>>>>
>>>> On 10/01/12 17:31, Roy Chaudhuri wrote:
>>>>> Or without the typo:
>>>>>
>>>>> CDS             join(2..3,4..6)
>>>>>                  /note="3 bp internal deletion"
>>>>> CDS             2..3
>>>>>                  /note="2 bp deleted from 3' end"
>>>>>
>>>>> On 10/01/2012 17:27, Roy Chaudhuri wrote:
>>>>>> I think it's me that didn't explain very well - I was talking about
>>>>>> overlapping (rather than spanning) a deletion, although I think the same
>>>>>> principle applies to the spanning example you gave. Here's some test
>>>>>> code:
>>>>>>
>>>>>> #!/usr/bin/perl
>>>>>> use warnings FATAL=>qw(all);
>>>>>> use strict;
>>>>>> use Bio::Seq;
>>>>>> use Bio::SeqIO;
>>>>>> use Bio::SeqUtils;
>>>>>> use Bio::SeqFeature::Generic;
>>>>>> my $seq=Bio::Seq->new(-id=>'seq', -seq=>'AAAAAAAAAA');
>>>>>> # CDS spanning the region to be deleted
>>>>>> $seq->add_SeqFeature(Bio::SeqFeature::Generic->new(-primary_tag=>'CDS',
>>>>>>                                                    -start=>2,
>>>>>>                                                    -end=>9));
>>>>>> # CDS ending inside the region to be deleted
>>>>>> $seq->add_SeqFeature(Bio::SeqFeature::Generic->new(-primary_tag=>'CDS',
>>>>>>                                                    -start=>2,
>>>>>>                                                    -end=>5));
>>>>>> my $out=Bio::SeqIO->newFh(-format=>'genbank');
>>>>>> my $trunc=Bio::SeqUtils->delete($seq, 4, 6);
>>>>>> print $out $trunc;
>>>>>>
>>>>>>
>>>>>> This currently outputs:
>>>>>> LOCUS       seq-accession_number            7 bp    dna     linear   UNK
>>>>>> ACCESSION   unknown
>>>>>> FEATURES             Location/Qualifiers
>>>>>>         CDS             join(2..>3,<4..6)
>>>>>>         CDS             2..>3
>>>>>> ORIGIN
>>>>>>            1 aaaaaaa
>>>>>> //
>>>>>>
>>>>>> However, I was suggesting that the feature table should be something
>>>>>> like:
>>>>>> CDS             join(2..3,4..6)
>>>>>>                    /note="3 bp internal deletion"
>>>>>> CDS             join(2..3)
>>>>>>                    /note="2 bp deleted from 3' end"
>>>>>>
>>>>>> Fuzzy locations are intended to represent features which have boundaries
>>>>>> spanning outside of the sequence. For a defined deletion that's not the
>>>>>> case, the boundaries of the feature aren't unknown, they have been
>>>>>> specifically altered.
>>>>>>
>>>>>> Hope this is clearer.
>>>>>> Cheers,
>>>>>> Roy.
>>>>>>
>>>>>> On 10/01/2012 16:47, Frank Schwach wrote:
>>>>>>> Hi Roy,
>>>>>>>
>>>>>>> Sorry, I hadn't explained that very well: it's not the outer boundaries
>>>>>>> of the feature that become fuzzy but the "inner" ones of the split
>>>>>>> locations:
>>>>>>>
>>>>>>>     --------------------           a feature's location
>>>>>>> ==========xxxx================= sequence
>>>>>>>
>>>>>>>
>>>>>>>     ---------                     sublocation 1
>>>>>>>              --------             sublocation 2
>>>>>>> ===============================
>>>>>>>
>>>>>>> x= sequence to delete
>>>>>>> The feature's location has changed from Simple to Split.
>>>>>>>
>>>>>>> Sublocation 1:
>>>>>>> start is still EXACT and has not changed
>>>>>>> end is now AFTER because this is not a true end of the feature
>>>>>>>
>>>>>>> Sublocation 2:
>>>>>>> start is BEFORE
>>>>>>> end is EXACT (but shifted)
>>>>>>>
>>>>>>> I hope this makes more sense(?)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 2012-01-10 at 15:25 +0000, Roy Chaudhuri wrote:
>>>>>>>> Hi Frank,
>>>>>>>>
>>>>>>>> Looks good to me. One thing I'm not sure about - why do features
>>>>>>>> overlapping a deletion become fuzzy? That behaviour is in
>>>>>>>> trunc_with_features because it's intended to represent taking a
>>>>>>>> subregion of a larger sequence, but if you're representing an internal
>>>>>>>> deletion then the boundaries of the overlapping feature aren't unknown,
>>>>>>>> they have been specifically altered. Maybe you could give absolute
>>>>>>>> coordinates, but add a note indicating that the 5' or 3' end has been
>>>>>>>> truncated by however many bases.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Roy.
>>>>>>>>
>>>>>>>> On 10/01/2012 13:10, Frank Schwach wrote:
>>>>>>>>> Hi Chris,
>>>>>>>>>
>>>>>>>>> I have made the changes in a Git fork and made the pull request now.
>>>>>>>>> If this is accepted into BioPerl I can also write a little SeqUtils
>>>>>>>>> HOWTO for the BioPerl wiki.
>>>>>>>>>
>>>>>>>>> Frank
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 2012-01-09 at 18:29 +0000, Fields, Christopher J wrote:
>>>>>>>>>> Sounds very promising!  The easiest way to contribute is via a
>>>>>>>>>> fork of the code on Github with a pull request (as you already
>>>>>>>>>> know, being a contributor to the Primer3 modules).
>>>>>>>>>>
>>>>>>>>>> chris
>>>>>>>>>>
>>>>>>>>>> On Jan 9, 2012, at 11:10 AM, Frank Schwach wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I needed to manipulate Bio::Seq objects with annotations and sequence
>>>>>>>>>>> features to simulate molecular cloning techniques, e.g. to cut a vector
>>>>>>>>>>> and insert a fragment into it while preserving all the annotations and
>>>>>>>>>>> moving the features accordingly.
>>>>>>>>>>> My main aim was to split features that span deletion/insertion sites in
>>>>>>>>>>> a meaningful way, which cannot be done with the currently available
>>>>>>>>>>> methods.
>>>>>>>>>>> I have modified Bio::SeqUtils so that I have the following new methods:
>>>>>>>>>>>
>>>>>>>>>>> delete
>>>>>>>>>>> ======
>>>>>>>>>>> removes a segment from a sequence object and adjusts positions and types
>>>>>>>>>>> of locations of sequence features:
>>>>>>>>>>> - locations of features that span the deletion sites are turned into
>>>>>>>>>>>   Splits.
>>>>>>>>>>> - locations that extend into the deleted region are turned to Fuzzy to
>>>>>>>>>>>   indicate that their true start/end was lost.
>>>>>>>>>>> - locations contained inside the deleted regions are lost.
>>>>>>>>>>> - other features are shifted according to the length of the deletion.
>>>>>>>>>>>
>>>>>>>>>>> insert
>>>>>>>>>>> ======
>>>>>>>>>>> adds a Bio::Seq object into another one between specified insertion
>>>>>>>>>>> sites. This also affects the features on the recipient sequence:
>>>>>>>>>>> - locations of features that span the insertion site are split but
>>>>>>>>>>>   position types are not turned to Fuzzy because no part of the
>>>>>>>>>>>   original feature is lost.
>>>>>>>>>>> - other features are shifted according to the length of the insertion.
>>>>>>>>>>>
>>>>>>>>>>> ligate
>>>>>>>>>>> ======
>>>>>>>>>>> just for convenience. Supply a recipient, a fragment and one or two
>>>>>>>>>>> sites to cut the recipient. Can also flip the fragment if required.
>>>>>>>>>>> Simply calls delete [, reverse_complement_with_features] and insert in
>>>>>>>>>>> turn.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One situation I haven't handled yet is a deletion that spans the origin
>>>>>>>>>>> of a circular molecule but that should be a rare thing to do anyway. The
>>>>>>>>>>> code currently throws an error if this is attempted.
>>>>>>>>>>>
>>>>>>>>>>> I'm happy to contribute the code on Github if there is interest?
>>>>>>>>>>> Comments on the handling of feature locations highly welcome!
>>>>>>>>>>>
>>>>>>>>>>> Frank
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>
>
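
For reference, the location handling settled on in the thread above (exact
coordinates plus a note, rather than fuzzy ends) can be sketched in plain
Perl. The adjust_feature helper and its return format are hypothetical and
written for illustration only; this is not the Bio::SeqUtils code, and it
ignores strand and circular sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: adjust one feature (1-based, inclusive start/end)
# for a deletion of positions $left..$right. Returns nothing if the
# feature lies entirely inside the deleted region.
sub adjust_feature {
    my ($start, $end, $left, $right) = @_;
    my $len = $right - $left + 1;

    # Upstream of the deletion: unchanged.
    return { location => "$start..$end" } if $end < $left;

    # Downstream of the deletion: shifted by the deleted length.
    return { location => ($start - $len) . '..' . ($end - $len) }
        if $start > $right;

    # Contained in the deleted region: lost.
    return if $start >= $left && $end <= $right;

    # Spanning the deletion: split into two exact sublocations.
    if ($start < $left && $end > $right) {
        return {
            location => sprintf('join(%d..%d,%d..%d)',
                                $start, $left - 1, $left, $end - $len),
            note     => sprintf('%dbp internal deletion between pos %d and %d',
                                $len, $left - 1, $left),
        };
    }

    # Extending into the deletion from upstream: the feature end is truncated.
    if ($start < $left) {
        return {
            location => sprintf('%d..%d', $start, $left - 1),
            note     => sprintf('%dbp deleted from feature end',
                                $end - $left + 1),
        };
    }

    # Extending into the deletion from downstream: the start is truncated.
    return {
        location => sprintf('%d..%d', $left, $end - $len),
        note     => sprintf('%dbp deleted from feature start',
                            $right - $start + 1),
    };
}

# Roy's test case: a 10 bp sequence with positions 4..6 deleted.
for my $feat ([2, 9], [2, 5]) {
    my $new = adjust_feature(@$feat, 4, 6);
    print "CDS  $new->{location}  /note=\"$new->{note}\"\n";
}
```

For the two CDS features in Roy's script this gives join(2..3,4..6) and
2..3, matching the feature table quoted above.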
