[Bioperl-l] Trans-Splicing Coding Sequences
Jason Stajich
jason at cgt.duhs.duke.edu
Sun Aug 29 05:08:03 EDT 2004
James - hmm the revcom at the end as you've done doesn't exactly work
because it assumes the rev-complement applies to the whole split
feature when in fact it could just be related to one of the
sub-locations.
I think the proper way to do it is prepend the sequence chunk to the
assembled location which I've done and the tests appear to work fine -
in fact sorting the sub locations may be a silly default behavior in
the first place, I can't remember why it was coded that way.
I've committed more tests in t/SeqFeature.t and added the A.thaliana MT
genome as test data.
-jason
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu
On Aug 24, 2004, at 8:48 PM, James Thompson wrote:
> Jason,
>
> Removing the sort was definitely a step in the right direction, but
> I've found
> another slight problem. The following snippet from SeqFeatureI.pm
> (lines
> 560-567) reverse complements each of the subsequences of the CDS if
> the section
> is on the reverse strand:
>
> if( $loc->strand == 1 ) {
> $seqstr .= $called_seq->subseq($loc->start,$loc->end);
> } else {
> $seqstr .= $called_seq->trunc($loc->start,$loc->end)->revcom->seq();
> }
>
> I don't think that the body of the second else clause does what is
> intended.
> Reverse complementing each individual exon and splicing them together
> doesn't
> give the same result as splicing together all of the exons and then
> reversing
> (and complementing) the product. The correct way to do this would be
> the second
> way, i.e:
> 1. Splice together all of the exons in a CDS.
> 2. Reverse complement the spliced sequence if necessary.
>
> The current method breaks on reverse complemented coding sequences,
> such as:
> CDS complement(join(25482..27077,28659..28733))
>
> I've made a patch that incorporates two changes into the spliced_seq
> method
> of SeqFeatureI.pm:
> 1. Adds an optional argument for turning off the sorting of
> locations.
> 2. Moves reverse complementing of the spliced sequence occur after
> the
> entire sequence has been concatenated.
>
> If anyone has any questions on this patch or any input on what I may
> have
> broken, please e-mail me.
>
> Thanks,
>
> James Thompson
> RIT Bioinformatics
>
> On Sun, 22 Aug 2004, Jason Stajich wrote:
>
>> The problem is that we sort the order of the locations in the calling
>> of
>> spliced_seq at around line 697 in Bio::SeqFeatureI so all you need to
>> do
>> remove the sorting - if you want to add an optional 3rd argument to
>> this
>> method which when true does not try and sort the sub locations. Make
>> sense? If it works, post a patch and we'll incorporate it.
>>
>> -jason
>>
>> On Fri, 20 Aug 2004, James Thompson wrote:
>>
>>> Dear Bioperlers,
>>>
>>> I'm currently working on a project involving some trans-spliced
>>> coding sequences
>>> from Arabidopsis, and I was wondering if BioPerl provided an easy
>>> way of taking a
>>> trans-spliced CDS feature and correctly splicing it into a Bio::Seq
>>> object.
>>>
>>> Here's my naive stab at doing this using the Bioperl methods that I
>>> know:
>>>
>>> use strict;
>>> use Bio::SeqIO;
>>>
>>> my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format =>
>>> 'genbank' );
>>> my $genome = $seqio->next_seq;
>>>
>>> foreach my $cds (grep { $_->primary_tag =~ /CDS/i }
>>> $genome->get_SeqFeatures) {
>>> print $cds->start, " -> ", $cds->end, "\n";
>>> print $cds->spliced_seq->translate->seq, "\n";
>>>
>>> last;
>>> }
>>>
>>> This just tries to use the spliced_seq method to splice the CDS into
>>> a sequence,
>>> and here's the output:
>>>
>>> 79740 -> 333105
>>> LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGY
>>> SIPFYCA*NLYFC
>>> *YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSI
>>> TTNWYLYLCINDD
>>> GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSS
>>> PSGSSD*RYRSFL
>>> LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSD
>>> PPTHVQSFFSGFP
>>> RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EG
>>> QFYIFLPNLSIIK
>>> YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IF
>>> HGSRLEIFDLRCI
>>> FLWNIIVW
>>>
>>> The translated sequence for this coding sequence should look like
>>> this:
>>>
>>> MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
>>> YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
>>> STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
>>> VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
>>> TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
>>> ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
>>> SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
>>> ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
>>> TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
>>>
>>> I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't
>>> correctly
>>> recognizing the Location definition for this coding sequence, which
>>> looks like this:
>>> CDS complement(join(327890..328078,329735..330306,
>>> 332945..333105,79740..80132,81113..81297))
>>>
>>> Could anyone help me shed any light on this? Ideally I'd like to
>>> translate all
>>> of these CDS features into individual Bio::Seq objects for further
>>> analysis,
>>> and I thought I'd ask for a bit of help before I wrote my own
>>> parser. Should I
>>> try sub-classing Bio::SeqFeature to overwrite the spliced_seq
>>> method, or has
>>> someone else already figured out this problem? Any suggestions would
>>> be very
>>> helpful.
>>>
>>> Thanks,
>>>
>>> James Thompson
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
>
>
>
>
> <jt-seqfeaturei.patch>_______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list