[Bioperl-l] Trans-Splicing Coding Sequences

Sun Aug 29 05:08:03 EDT 2004

James - hmm the revcom at the end as you've done doesn't exactly work  
because it assumes the rev-complement applies to the whole split  
feature when in fact it could just be related to one of the  
sub-locations.

I think the proper way to do it is prepend the sequence chunk to the  
assembled location which I've done and the tests appear to work fine -  
in fact sorting the sub locations may be a silly default behavior in  
the first place, I can't remember why it was coded that way.

I've committed more tests in t/SeqFeature.t and added the A.thaliana MT  
genome as test data.
-jason

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

On Aug 24, 2004, at 8:48 PM, James Thompson wrote:

> Jason,
>
> Removing the sort was definitely a step in the right direction, but  
> I've found
> another slight problem.  The following snippet from  SeqFeatureI.pm  
> (lines
> 560-567) reverse complements each of the subsequences of the CDS if  
> the section
> is on the reverse strand:
>
> if( $loc->strand == 1 ) {
>    $seqstr .= $called_seq->subseq($loc->start,$loc->end);
> } else {
>    $seqstr .= $called_seq->trunc($loc->start,$loc->end)->revcom->seq();
> }
>
> I don't think that the body of the second else clause does what is  
> intended.
> Reverse complementing each individual exon and splicing them together  
> doesn't
> give the same result as splicing together all of the exons and then  
> reversing
> (and complementing) the product. The correct way to do this would be  
> the second
> way, i.e:
>    1. Splice together all of the exons in a CDS.
>    2. Reverse complement the spliced sequence if necessary.
>
> The current method breaks on reverse complemented coding sequences,  
> such as:
>      CDS             complement(join(25482..27077,28659..28733))
>
> I've made a patch that incorporates two changes into the spliced_seq  
> method
> of SeqFeatureI.pm:
>    1. Adds an optional argument for turning off the sorting of  
> locations.
>    2. Moves reverse complementing of the spliced sequence occur after  
> the
>    entire sequence has been concatenated.
>
> If anyone has any questions on this patch or any input on what I may  
> have
> broken, please e-mail me.
>
> Thanks,
>
> James Thompson
> RIT Bioinformatics
>
> On Sun, 22 Aug 2004, Jason Stajich wrote:
>
>> The problem is that we sort the order of the locations in the calling  
>> of
>> spliced_seq at around line 697 in Bio::SeqFeatureI so all you need to  
>> do
>> remove the sorting - if you want to add an optional 3rd argument to  
>> this
>> method which when true does not try and sort the sub locations.  Make
>> sense?  If it works, post a patch and we'll incorporate it.
>>
>> -jason
>>
>>  On Fri, 20 Aug 2004, James Thompson wrote:
>>
>>> Dear Bioperlers,
>>>
>>> I'm currently working on a project involving some trans-spliced  
>>> coding sequences
>>> from Arabidopsis, and I was wondering if BioPerl provided an easy  
>>> way of taking a
>>> trans-spliced CDS feature and correctly splicing it into a Bio::Seq  
>>> object.
>>>
>>> Here's my naive stab at doing this using the Bioperl methods that I  
>>> know:
>>>
>>> use strict;
>>> use Bio::SeqIO;
>>>
>>> my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format =>  
>>> 'genbank' );
>>> my $genome = $seqio->next_seq;
>>>
>>> foreach my $cds (grep { $_->primary_tag =~ /CDS/i }  
>>> $genome->get_SeqFeatures) {
>>>    print $cds->start, " -> ", $cds->end, "\n";
>>>    print $cds->spliced_seq->translate->seq, "\n";
>>>
>>>    last;
>>> }
>>>
>>> This just tries to use the spliced_seq method to splice the CDS into  
>>> a sequence,
>>> and here's the output:
>>>
>>> 79740 -> 333105
>>> LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGY 
>>> SIPFYCA*NLYFC
>>> *YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSI 
>>> TTNWYLYLCINDD
>>> GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSS 
>>> PSGSSD*RYRSFL
>>> LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSD 
>>> PPTHVQSFFSGFP
>>> RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EG 
>>> QFYIFLPNLSIIK
>>> YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IF 
>>> HGSRLEIFDLRCI
>>> FLWNIIVW
>>>
>>> The translated sequence for this coding sequence should look like  
>>> this:
>>>
>>> MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
>>> YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
>>> STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
>>> VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
>>> TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
>>> ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
>>> SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
>>> ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
>>> TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
>>>
>>> I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't  
>>> correctly
>>> recognizing the Location definition for this coding sequence, which  
>>> looks like this:
>>> CDS             complement(join(327890..328078,329735..330306,
>>>                 332945..333105,79740..80132,81113..81297))
>>>
>>> Could anyone help me shed any light on this? Ideally I'd like to  
>>> translate all
>>> of these CDS features into individual Bio::Seq objects for further  
>>> analysis,
>>> and I thought I'd ask for a bit of help before I wrote my own  
>>> parser. Should I
>>> try sub-classing Bio::SeqFeature to overwrite the spliced_seq  
>>> method, or has
>>> someone else already figured out this problem? Any suggestions would  
>>> be very
>>> helpful.
>>>
>>> Thanks,
>>>
>>> James Thompson
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
>
>
>
>
> <jt-seqfeaturei.patch>_______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l