[Bioperl-l] Trans-Splicing Coding Sequences

Tue Aug 24 06:48:16 EDT 2004

Jason,

Removing the sort was definitely a step in the right direction, but I've found
another slight problem.  The following snippet from  SeqFeatureI.pm (lines
560-567) reverse complements each of the subsequences of the CDS if the section
is on the reverse strand:

if( $loc->strand == 1 ) {
   $seqstr .= $called_seq->subseq($loc->start,$loc->end);
} else {
   $seqstr .= $called_seq->trunc($loc->start,$loc->end)->revcom->seq();
}

I don't think that the body of the second else clause does what is intended. 
Reverse complementing each individual exon and splicing them together doesn't
give the same result as splicing together all of the exons and then reversing
(and complementing) the product. The correct way to do this would be the second
way, i.e: 
   1. Splice together all of the exons in a CDS.
   2. Reverse complement the spliced sequence if necessary.

The current method breaks on reverse complemented coding sequences, such as:
     CDS             complement(join(25482..27077,28659..28733))

I've made a patch that incorporates two changes into the spliced_seq method
of SeqFeatureI.pm:
   1. Adds an optional argument for turning off the sorting of locations.
   2. Moves reverse complementing of the spliced sequence occur after the
   entire sequence has been concatenated.

If anyone has any questions on this patch or any input on what I may have 
broken, please e-mail me. 

Thanks,

James Thompson
RIT Bioinformatics

On Sun, 22 Aug 2004, Jason Stajich wrote:

> The problem is that we sort the order of the locations in the calling of
> spliced_seq at around line 697 in Bio::SeqFeatureI so all you need to do
> remove the sorting - if you want to add an optional 3rd argument to this
> method which when true does not try and sort the sub locations.  Make
> sense?  If it works, post a patch and we'll incorporate it.
> 
> -jason
> 
>  On Fri, 20 Aug 2004, James Thompson wrote:
> 
> > Dear Bioperlers,
> >
> > I'm currently working on a project involving some trans-spliced coding sequences
> > from Arabidopsis, and I was wondering if BioPerl provided an easy way of taking a
> > trans-spliced CDS feature and correctly splicing it into a Bio::Seq object.
> >
> > Here's my naive stab at doing this using the Bioperl methods that I know:
> >
> > use strict;
> > use Bio::SeqIO;
> >
> > my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format => 'genbank' );
> > my $genome = $seqio->next_seq;
> >
> > foreach my $cds (grep { $_->primary_tag =~ /CDS/i } $genome->get_SeqFeatures) {
> >    print $cds->start, " -> ", $cds->end, "\n";
> >    print $cds->spliced_seq->translate->seq, "\n";
> >
> >    last;
> > }
> >
> > This just tries to use the spliced_seq method to splice the CDS into a sequence,
> > and here's the output:
> >
> > 79740 -> 333105
> > LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGYSIPFYCA*NLYFC
> > *YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSITTNWYLYLCINDD
> > GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSSPSGSSD*RYRSFL
> > LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSDPPTHVQSFFSGFP
> > RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EGQFYIFLPNLSIIK
> > YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IFHGSRLEIFDLRCI
> > FLWNIIVW
> >
> > The translated sequence for this coding sequence should look like this:
> >
> > MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
> > YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
> > STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
> > VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
> > TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
> > ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
> > SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
> > ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
> > TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
> >
> > I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't correctly
> > recognizing the Location definition for this coding sequence, which looks like this:
> > CDS             complement(join(327890..328078,329735..330306,
> >                 332945..333105,79740..80132,81113..81297))
> >
> > Could anyone help me shed any light on this? Ideally I'd like to translate all
> > of these CDS features into individual Bio::Seq objects for further analysis,
> > and I thought I'd ask for a bit of help before I wrote my own parser. Should I
> > try sub-classing Bio::SeqFeature to overwrite the spliced_seq method, or has
> > someone else already figured out this problem? Any suggestions would be very
> > helpful.
> >
> > Thanks,
> >
> > James Thompson
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> 
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> 




-------------- next part --------------
Index: SeqFeatureI.pm
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/Bio/SeqFeatureI.pm,v
retrieving revision 1.55
diff -u -r1.55 SeqFeatureI.pm

--- SeqFeatureI.pm	13 Mar 2004 01:30:02 -0000	1.55
+++ SeqFeatureI.pm	25 Aug 2004 02:47:21 -0000
@@ -453,18 +453,24 @@
             number of N's (DNA) or X's (protein, though this is unlikely).
 
             This function is deliberately "magical" attempting to second guess
-            what a user wants as "the" sequence for this feature
+            what a user wants as "the" sequence for this feature.
 
             Implementing classes are free to override this method with their
-            own magic if they have a better idea what the user wants
+            own magic if they have a better idea what the user wants.
+
+            Note: This function sorts features according to their starting position,
+            if this is undesirable (e.g in splicing a trans-spliced CDS), pass
+            in a true value as the third argument.
 
   Args    : [optional] A Bio::DB::RandomAccessI compliant object
+            [optional} A true value as the third argument if user does not want
+                       features sorted by start location.
   Returns : A Bio::Seq
 
 =cut
 
 sub spliced_seq {
-    my ($self,$db) = @_;
+    my ($self,$db,$nosort) = @_;
    
     if( ! $self->location->isa("Bio::Location::SplitLocationI") ) {
 	return $self->seq(); # nice and easy!
@@ -496,22 +502,36 @@
 	! $self->absolute ) { 
 	$self->warn("Calling spliced_seq with a Bio::Das::SegmentI which does have absolute set to 1 -- be warned you may not be getting things on the correct strand");
     }
-    
-    my @locs = map { $_->[0] }
-    # sort so that most negative is first basically to order
-    # the features on the opposite strand 5'->3' on their strand
-    # rather than they way most are input which is on the fwd strand
-
-    sort { $a->[1] <=> $b->[1] } # Yes Tim, Schwartzian transformation
-    map { 
-	$fstrand = $_->strand unless defined $fstrand;
-	$mixed = 1 if defined $_->strand && $fstrand != $_->strand;
-	if( defined $_->seq_id ) {
-	    $mixedloc = 1 if( $_->seq_id ne $seqid );
-	}
-	[ $_, $_->start* ($_->strand || 1)];	    
-    } $self->location->each_Location; 
    
+    my @locs;
+    if ( !$nosort ) {
+        @locs = map { $_->[0] }
+        # sort so that most negative is first basically to order
+        # the features on the opposite strand 5'->3' on their strand
+        # rather than they way most are input which is on the fwd strand
+
+        sort { $a->[1] <=> $b->[1] } # Yes Tim, Schwartzian transformation
+        map { 
+            $fstrand = $_->strand unless defined $fstrand;
+            $mixed = 1 if defined $_->strand && $fstrand != $_->strand;
+            if( defined $_->seq_id ) {
+                $mixedloc = 1 if( $_->seq_id ne $seqid );
+            }
+            [ $_, $_->start * ($_->strand || 1)];	    
+        } $self->location->each_Location; 
+    } else {
+        # same as above, just without the sort
+        @locs = map { $_->[0] }
+        map { 
+            $fstrand = $_->strand unless defined $fstrand;
+            $mixed = 1 if defined $_->strand && $fstrand != $_->strand;
+            if( defined $_->seq_id ) {
+                $mixedloc = 1 if( $_->seq_id ne $seqid );
+            }
+            [ $_, $_->start * ($_->strand || 1)];	    
+        } $self->location->each_Location; 
+    }
+
     if ( $mixed ) { 
 	$self->warn("Mixed strand locations, spliced seq using the input order rather than trying to sort");    
 	@locs = $self->location->each_Location; 
@@ -520,6 +540,7 @@
 	@locs = $self->location->each_Location; 
     }
 
+    my $revcom;
     foreach my $loc ( @locs  ) {
 	if( ! $loc->isa("Bio::Location::Atomic") ) {
 	    $self->throw("Can only deal with one level deep locations");
@@ -553,21 +574,24 @@
 	} else {
 	    $called_seq = $self->entire_seq;
 	}
-	
+
 	if( $self->isa('Bio::Das::SegmentI') ) {
 	    my ($s,$e) = ($loc->start,$loc->end);	    
 	    $seqstr .= $called_seq->subseq($s,$e)->seq();
 	} else { 
-	    # This is dumb subseq should work on locations...
-	    if( $loc->strand == 1 ) {
-		$seqstr .= $called_seq->subseq($loc->start,$loc->end);
-	    } else {
-		$seqstr .= $called_seq->trunc($loc->start,$loc->end)->revcom->seq();
-	    }
+	    $seqstr .= $called_seq->subseq($loc->start,$loc->end);
+            # If we have a non-forward strand feature, revcom the entire
+            # sequence later. There may be a better way to do this.
+            if ( $loc->strand != 1 ) {
+               $revcom = 1;
+            }
 	}
     }
+
     my $out = Bio::Seq->new( -id => $self->entire_seq->display_id . "_spliced_feat",
 			     -seq => $seqstr);
+
+    if ( $revcom ) { $out = $out->revcom };
     
     return $out;
 }