[Bioperl-l] Trans-Splicing Coding Sequences

Tue Aug 24 23:20:11 EDT 2004

Yeah there is an outstanding bug about this rev-complementing I believe.

I'm not happy with how split locations and rev complement are handled but
I just don't have the time anymore to apply a good soln.

It just hasn't been fixed or addressed properly by someone so thanks for
giving this a go.

-jason

On Tue, 24 Aug 2004, James Thompson wrote:

> Jason,
>
> Removing the sort was definitely a step in the right direction, but I've found
> another slight problem.  The following snippet from  SeqFeatureI.pm (lines
> 560-567) reverse complements each of the subsequences of the CDS if the section
> is on the reverse strand:
>
> if( $loc->strand == 1 ) {
>    $seqstr .= $called_seq->subseq($loc->start,$loc->end);
> } else {
>    $seqstr .= $called_seq->trunc($loc->start,$loc->end)->revcom->seq();
> }
>
> I don't think that the body of the second else clause does what is intended.
> Reverse complementing each individual exon and splicing them together doesn't
> give the same result as splicing together all of the exons and then reversing
> (and complementing) the product. The correct way to do this would be the second
> way, i.e:
>    1. Splice together all of the exons in a CDS.
>    2. Reverse complement the spliced sequence if necessary.
>
> The current method breaks on reverse complemented coding sequences, such as:
>      CDS             complement(join(25482..27077,28659..28733))
>
> I've made a patch that incorporates two changes into the spliced_seq method
> of SeqFeatureI.pm:
>    1. Adds an optional argument for turning off the sorting of locations.
>    2. Moves reverse complementing of the spliced sequence occur after the
>    entire sequence has been concatenated.
>
> If anyone has any questions on this patch or any input on what I may have
> broken, please e-mail me.
>
> Thanks,
>
> James Thompson
> RIT Bioinformatics
>
> On Sun, 22 Aug 2004, Jason Stajich wrote:
>
> > The problem is that we sort the order of the locations in the calling of
> > spliced_seq at around line 697 in Bio::SeqFeatureI so all you need to do
> > remove the sorting - if you want to add an optional 3rd argument to this
> > method which when true does not try and sort the sub locations.  Make
> > sense?  If it works, post a patch and we'll incorporate it.
> >
> > -jason
> >
> >  On Fri, 20 Aug 2004, James Thompson wrote:
> >
> > > Dear Bioperlers,
> > >
> > > I'm currently working on a project involving some trans-spliced coding sequences
> > > from Arabidopsis, and I was wondering if BioPerl provided an easy way of taking a
> > > trans-spliced CDS feature and correctly splicing it into a Bio::Seq object.
> > >
> > > Here's my naive stab at doing this using the Bioperl methods that I know:
> > >
> > > use strict;
> > > use Bio::SeqIO;
> > >
> > > my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format => 'genbank' );
> > > my $genome = $seqio->next_seq;
> > >
> > > foreach my $cds (grep { $_->primary_tag =~ /CDS/i } $genome->get_SeqFeatures) {
> > >    print $cds->start, " -> ", $cds->end, "\n";
> > >    print $cds->spliced_seq->translate->seq, "\n";
> > >
> > >    last;
> > > }
> > >
> > > This just tries to use the spliced_seq method to splice the CDS into a sequence,
> > > and here's the output:
> > >
> > > 79740 -> 333105
> > > LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGYSIPFYCA*NLYFC
> > > *YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSITTNWYLYLCINDD
> > > GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSSPSGSSD*RYRSFL
> > > LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSDPPTHVQSFFSGFP
> > > RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EGQFYIFLPNLSIIK
> > > YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IFHGSRLEIFDLRCI
> > > FLWNIIVW
> > >
> > > The translated sequence for this coding sequence should look like this:
> > >
> > > MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
> > > YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
> > > STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
> > > VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
> > > TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
> > > ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
> > > SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
> > > ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
> > > TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
> > >
> > > I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't correctly
> > > recognizing the Location definition for this coding sequence, which looks like this:
> > > CDS             complement(join(327890..328078,329735..330306,
> > >                 332945..333105,79740..80132,81113..81297))
> > >
> > > Could anyone help me shed any light on this? Ideally I'd like to translate all
> > > of these CDS features into individual Bio::Seq objects for further analysis,
> > > and I thought I'd ask for a bit of help before I wrote my own parser. Should I
> > > try sub-classing Bio::SeqFeature to overwrite the spliced_seq method, or has
> > > someone else already figured out this problem? Any suggestions would be very
> > > helpful.
> > >
> > > Thanks,
> > >
> > > James Thompson
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at portal.open-bio.org
> > > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> >
>
>
>
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu