[Bioperl-l] additional methods for Bio::SeqUtils for in-silico cloning
Frank Schwach
fs5 at sanger.ac.uk
Wed Jan 18 16:21:17 UTC 2012
Hi Roy,
I have a use-case for Bio::SeqUtils truncating a feature with negative
start location. In my case, this is happening because I transform
genomic coordinates of a feature to the coordinate frame of another
feature, so feature A that starts 10nt before the reference feature now
has a start of -10. I want to trim that to a fuzzy start = "<1"
Bio::SeqUtils::_coord_adjust can be used to trim feature A accordingly
but we need to change the regex that manipulates the coordinates
slightly by adding an optional "-":
map s/(\d+)/if ($add+$1<1) {'<1'} elsif (defined $length and $add
+$1>$length) {">$length"} else {$add+$1}/ge, @coords;
becomes
map s/(-?\d+)/if ($add+$1<1) {'<1'} elsif (defined $length and $add
+$1>$length) {">$length"} else {$add+$1}/ge, @coords;
It doesn't change anything else as far as I can tell and adds some more
flexibility to the method. If you are ok with it I can push this to my
queued pull request for Bio::SeqUtils.
Frank
On Thu, 2012-01-12 at 14:13 +0000, Frank Schwach wrote:
> I have now created a version that gives the option to create the
> products of 'delete' and 'insert' via Bio::Root::Root:clone instead of
> calling 'new' on the input seq object class. Seems to be working fine
> for me so far.
>
> 'delete' and 'insert' can now take a hashref of options.
> The only option so far is to set 'clone_obj to true, to use cloning
> instead of creating objects via 'new'.
> Setting this parameter to false or not supplying the options hashref at
> all will give you the old behaviour (call 'new').
> Example:
>
> my $product = Bio::SeqUtils->delete(
> $seq_obj,
> 11,
> 20,
> { clone_obj => 1}
> );
>
> The ligate method takes clone_obj as a named parameter:
>
> my $new_molecule = Bio::Sequtils::Pbrtools->ligate(
> -recipient => $vector,
> -fragment => $fragment,
> -left => 1000,
> -right => 1100,
> -flip => 1,
> -clone_obj => 1
> );
>
> This is in a branch of my GitHub repo if you would like to have a look:
>
> https://github.com/fschwach/bioperl-live/tree/sequtils_clone
>
> Unfortunately, I can't add this option to trunc_with_features because
> the creation of the new object is delegated to 'trunc'. I guess I could
> implement 'trunc' in Bio::SeqUtils itself(?)
>
> What do you think, could this be merged into bioperl-live?
>
>
> Frank
>
>
>
> On Wed, 2012-01-11 at 21:03 +0000, Frank Schwach wrote:
> > Great, I'll work on a branch that gives the user the option to use clone
> > instead of new and then we can see if we want to use that in the end. In
> > the meantime, what do you think about pulling this into bioperl-live?
> > When I have some time again I can work on the HOWTO for these new
> > features for the BioPerl wiki
> >
> > Frank
> >
> >
> > On 11/01/12 18:42, Fields, Christopher J wrote:
> > > Note that Bio::Root::Root now has a clone() method that one can take advantage of for this purpose; if Storable or Clone is available, it will pick one of the two, preferably Clone over Storable. It's fairly untested, but we haven't run into problems with it yet (I think it was in the last CPAN release).
> > >
> > > chris
> > >
> > > On Jan 11, 2012, at 12:38 PM, Roy Chaudhuri wrote:
> > >
> > >> Hi Frank,
> > >>
> > >> Looks great, I like the use of between locations, didn't think of that.
> > >>
> > >> It was suggested that I avoid using Clone for cat, trunc_with_features etc. to avoid adding a dependency (which may no longer be an issue) and because it would cause problems for Bio::Seq implementations that use a database as the back-end. Maybe you could add it as an option, but keep the default as is?
> > >>
> > >> Cheers,
> > >> Roy.
> > >>
> > >> On 11/01/2012 18:16, Frank Schwach wrote:
> > >>> Hi Roy and Chris,
> > >>>
> > >>> I have made the changes to the code now. As you suggested, feature ends
> > >>> no longer change type and I insert a note instead to inform about the
> > >>> deletion (or insertion), showing the length and position.
> > >>> I have also added a feature to annotate deletion sites themselves (with
> > >>> IN-BETWEEN locations).
> > >>>
> > >>> Roy's test script now prints:
> > >>>
> > >>> LOCUS seq-accession_number 7 bp dna linear UNK
> > >>> ACCESSION unknown
> > >>> FEATURES Location/Qualifiers
> > >>> CDS join(2..3,4..6)
> > >>> /note="3bp internal deletion between pos 3 and 4"
> > >>> CDS 2..3
> > >>> /note="2bp deleted from feature end"
> > >>> misc_feature 3^4
> > >>> /note="deletion of 3bp"
> > >>> ORIGIN
> > >>> 1 aaaaaaa
> > >>> //
> > >>>
> > >>>
> > >>> or, if you add strand information (-1 in this case) to the second feature:
> > >>>
> > >>> LOCUS seq-accession_number 7 bp dna linear UNK
> > >>> ACCESSION unknown
> > >>> FEATURES Location/Qualifiers
> > >>> CDS join(2..3,4..6)
> > >>> /note="3bp internal deletion between pos 3 and 4"
> > >>> CDS complement(2..3)
> > >>> /note="2bp deleted from feature 5' end"
> > >>> misc_feature 3^4
> > >>> /note="deletion of 3bp"
> > >>> ORIGIN
> > >>> 1 aaaaaaa
> > >>> //
> > >>>
> > >>> I have comitted this along with some bugfixes to my master branch on GitHub
> > >>> https://github.com/fschwach/bioperl-live
> > >>> so it's now also in my existing pull request.
> > >>>
> > >>> I'm still wondering if cloning the sequence objects rather than calling
> > >>> 'new' on their respective classes would be an option inside 'delete' and
> > >>> 'insert'?
> > >>> I'm experimenting with this for my own purposes because I have to work
> > >>> with custom sub-classes of Bio::Seq which have additional attributes and
> > >>> therefore set 'can_call_new' to false.
> > >>> Without cloning the objects, I first have to convert the custom
> > >>> Bio::Seq::Foo objects to standard Bio::Seq, which I would like to avoid.
> > >>> Is there any reason why something like Clone::Fast should not be used in
> > >>> this case? It seems to work for me but there may be situations where
> > >>> this is going to blow up which I am not aware of.
> > >>> Cloning rather than calling new could be made an option in
> > >>> Bio::SeqUtils. I have most of the code for that already.
> > >>>
> > >>> Frank
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 10/01/12 17:31, Roy Chaudhuri wrote:
> > >>>> Or without the typo:
> > >>>>
> > >>>> CDS join(2..3,4..6)
> > >>>> /note="3 bp internal deletion"
> > >>>> CDS 2..3
> > >>>> /note="2 bp deleted from 3' end"
> > >>>>
> > >>>> On 10/01/2012 17:27, Roy Chaudhuri wrote:
> > >>>>> I think it's me that didn't explain very well - I was talking about
> > >>>>> overlapping (rather than spanning) a deletion, although I think the same
> > >>>>> principle applies to the spanning example you gave. Here's some test
> > >>>>> code:
> > >>>>>
> > >>>>> #!/usr/bin/perl
> > >>>>> use warnings FATAL=>qw(all);
> > >>>>> use strict;
> > >>>>> use Bio::Seq;
> > >>>>> use Bio::SeqIO;
> > >>>>> use Bio::SeqUtils;
> > >>>>> use Bio::SeqFeature::Generic;
> > >>>>> my $seq=Bio::Seq->new(-id=>'seq', -seq=>'AAAAAAAAAA');
> > >>>>> $seq->add_SeqFeature(Bio::SeqFeature::Generic->new(-primary_tag=>'CDS',
> > >>>>> -start=>2,
> > >>>>> -end=>9));
> > >>>>>
> > >>>>> $seq->add_SeqFeature(Bio::SeqFeature::Generic->new(-primary_tag=>'CDS',
> > >>>>> -start=>2,
> > >>>>> -end=>5));
> > >>>>> my $out=Bio::SeqIO->newFh(-format=>'genbank');
> > >>>>> my $trunc=Bio::SeqUtils->delete($seq, 4, 6);
> > >>>>> print $out $trunc;
> > >>>>>
> > >>>>>
> > >>>>> This currently outputs:
> > >>>>> LOCUS seq-accession_number 7 bp dna linear UNK
> > >>>>> ACCESSION unknown
> > >>>>> FEATURES Location/Qualifiers
> > >>>>> CDS join(2..>3,<4..6)
> > >>>>> CDS 2..>3
> > >>>>> ORIGIN
> > >>>>> 1 aaaaaaa
> > >>>>> //
> > >>>>>
> > >>>>> However, I was suggesting that the feature table should be something
> > >>>>> like:
> > >>>>> CDS join(2..3,4..6)
> > >>>>> /note="3 bp internal deletion"
> > >>>>> CDS join(2..3)
> > >>>>> /note="2 bp deleted from 3' end"
> > >>>>>
> > >>>>> Fuzzy locations are intended to represent features which have boundaries
> > >>>>> spanning outside of the sequence. For a defined deletion that's not the
> > >>>>> case, the boundaries of the feature aren't unknown, they have been
> > >>>>> specifically altered.
> > >>>>>
> > >>>>> Hope this is clearer.
> > >>>>> Cheers,
> > >>>>> Roy.
> > >>>>>
> > >>>>> On 10/01/2012 16:47, Frank Schwach wrote:
> > >>>>>> Hi Roy,
> > >>>>>>
> > >>>>>> Sorry, I hadn't explained that very well: it's not the outer boundaries
> > >>>>>> of the feature that become fuzzy but the "inner" ones of the split
> > >>>>>> locations:
> > >>>>>>
> > >>>>>> -------------------- a feature's location
> > >>>>>> ==========xxxx================= sequence
> > >>>>>>
> > >>>>>>
> > >>>>>> --------- sublocation 1
> > >>>>>> -------- sublocation 2
> > >>>>>> ===============================
> > >>>>>>
> > >>>>>> x= sequence to delete
> > >>>>>> The feature's location has changed from Simple to Split.
> > >>>>>>
> > >>>>>> Sublocation 1:
> > >>>>>> start is still EXACT and has not changed
> > >>>>>> end is now AFTER because this is not a true end of the feature
> > >>>>>>
> > >>>>>> Sublocation 2:
> > >>>>>> start is BEFORE
> > >>>>>> end is EXACT (but shifted)
> > >>>>>>
> > >>>>>> I hope this makes more sense(?)
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>>
> > >>>>>> Frank
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, 2012-01-10 at 15:25 +0000, Roy Chaudhuri wrote:
> > >>>>>>> Hi Frank,
> > >>>>>>>
> > >>>>>>> Looks good to me. One thing I'm not sure about - why do features
> > >>>>>>> overlapping a deletion become fuzzy? That behaviour is in
> > >>>>>>> trunc_with_features because it's intended to represent a taking a
> > >>>>>>> subregion of a larger sequence, but if you're representing an internal
> > >>>>>>> deletion then the boundaries of the overlapping feature aren't
> > >>>>>>> unknown,
> > >>>>>>> they have been specifically altered. Maybe you could give absolute
> > >>>>>>> coordinates, but add a note indicating that the 5' or 3' end has been
> > >>>>>>> truncated by however many bases.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Roy.
> > >>>>>>>
> > >>>>>>> On 10/01/2012 13:10, Frank Schwach wrote:
> > >>>>>>>> Hi Chris,
> > >>>>>>>>
> > >>>>>>>> I have made the changes in a Git fork and made the pull request now.
> > >>>>>>>> If this is accepted into BioPerl I can also write a little SeqUtils
> > >>>>>>>> HOWTO for the BioPerl wiki.
> > >>>>>>>>
> > >>>>>>>> Frank
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Mon, 2012-01-09 at 18:29 +0000, Fields, Christopher J wrote:
> > >>>>>>>>> Sounds very promising! The easiest way to contribute is via a
> > >>>>>>>>> fork of the code on Github with a pull request (as you already
> > >>>>>>>>> know, being a contributor to the Primer3 modules).
> > >>>>>>>>>
> > >>>>>>>>> chris
> > >>>>>>>>>
> > >>>>>>>>> On Jan 9, 2012, at 11:10 AM, Frank Schwach wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi all,
> > >>>>>>>>>>
> > >>>>>>>>>> I needed to manipulate Bio::Seq objects with annotations and
> > >>>>>>>>>> sequence
> > >>>>>>>>>> features to simulate molecular cloning techniques, e.g. to cut a
> > >>>>>>>>>> vector
> > >>>>>>>>>> and insert a fragment into it while preserving all the
> > >>>>>>>>>> annotations and
> > >>>>>>>>>> moving the features accordingly.
> > >>>>>>>>>> My main aim was to split features that span deletion/insertion
> > >>>>>>>>>> sites in
> > >>>>>>>>>> a meaningful way, which can not be done with the currently availble
> > >>>>>>>>>> methods.
> > >>>>>>>>>> I have modified Bio::SeqUtils so that I have the following new
> > >>>>>>>>>> methods:
> > >>>>>>>>>>
> > >>>>>>>>>> delete
> > >>>>>>>>>> ======
> > >>>>>>>>>> removes a segment from a sequence object and adjusts positions
> > >>>>>>>>>> and types
> > >>>>>>>>>> of locations of sequence features:
> > >>>>>>>>>> - locations of features that span the deletion sites are turned
> > >>>>>>>>>> into
> > >>>>>>>>>> Splits.
> > >>>>>>>>>> - locations that extend into the deleted region are turned to
> > >>>>>>>>>> Fuzzy to
> > >>>>>>>>>> indicate that their true start/end was lost.
> > >>>>>>>>>> - locations contained inside the deleted regions are lost.
> > >>>>>>>>>> - other features are shifted according to the length of the
> > >>>>>>>>>> deletion.
> > >>>>>>>>>>
> > >>>>>>>>>> insert
> > >>>>>>>>>> ======
> > >>>>>>>>>> adds a Bio::Seq object into another one between specified insertion
> > >>>>>>>>>> sites. This also affects the features on the recipient sequence:
> > >>>>>>>>>> - locations of features that span the insertion site are split but
> > >>>>>>>>>> position types are not turned to Fuzzy because no part of the
> > >>>>>>>>>> original
> > >>>>>>>>>> feature is lost.
> > >>>>>>>>>> - other features are shifted according to the length of the
> > >>>>>>>>>> insertion.
> > >>>>>>>>>>
> > >>>>>>>>>> ligate
> > >>>>>>>>>> ======
> > >>>>>>>>>> just for convenience. Supply a recipient, a fragment and one or two
> > >>>>>>>>>> sites to cut the recipient. Can also flip the fragment if required.
> > >>>>>>>>>> Simply calls delete [, reverse_complement_with_features] and
> > >>>>>>>>>> insert in
> > >>>>>>>>>> turn.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> One situation I haven't handled yet is a deletion that spans the
> > >>>>>>>>>> origin
> > >>>>>>>>>> of a circular molecule but that should be a rare thing to do
> > >>>>>>>>>> anyway. The
> > >>>>>>>>>> code currently throws an error if this is attempted.
> > >>>>>>>>>>
> > >>>>>>>>>> I'm happy to contribute the code on Github if there is interest?
> > >>>>>>>>>> Comments on the handling of feature locations highly welcome!
> > >>>>>>>>>>
> > >>>>>>>>>> Frank
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>
> >
> >
>
>
>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
More information about the Bioperl-l
mailing list