[Bioperl-l] Trans-Splicing Coding Sequences
James Thompson
tex at biosysadmin.com
Fri Aug 20 03:08:38 EDT 2004
Dear Bioperlers,
I'm currently working on a project involving some trans-spliced coding sequences
from Arabidopsis, and I was wondering whether BioPerl provides an easy way to take
a trans-spliced CDS feature and correctly splice it into a Bio::Seq object.
Here's my naive stab at doing this using the BioPerl methods that I know:
    use strict;
    use Bio::SeqIO;

    my $seqio = Bio::SeqIO->new( -file => 'NC_001284.gbk', -format => 'genbank' );
    my $genome = $seqio->next_seq;

    foreach my $cds (grep { $_->primary_tag =~ /CDS/i } $genome->get_SeqFeatures) {
        print $cds->start, " -> ", $cds->end, "\n";
        print $cds->spliced_seq->translate->seq, "\n";
        last;
    }
This calls the spliced_seq method on the first CDS feature in the record, then
translates and prints the result. Here's the output:
79740 -> 333105
LFHDLWVYWSYPLRSISQDFDRIRNHWCSI*WYFYGDSVYRCRIPIQDHCSSFSYVGTRYL*GFTHPGYSIPFYCA*NLYFC
*YFTCFYLWFLWSYIATNLLFLQHCFYDLRSTGRHGPNESQKTSSS*FNWTCRLYSYWFLMWNHRRNSITTNWYLYLCINDD
GCIRHSFSITANPCQIYSGFGRSSQNESYFGYYLLHYYVLIRRNTPVSRLL*QILFVLRRFGLWGLLSSPSGSSD*RYRSFL
LYTLSEKNVF*YT*DMDSI*TNGS**VVTTSNDFLFHYFILAIPLSFVLSYSSNGTQFISLNESRIRSDPPTHVQSFFSGFP
RDLYH*CNLHFAHSWSCI*YL*EI*LSAVSQ*CGLAWIT*CSNNLASARRWRTSPNYCPFILE*SF*EGQFYIFLPNLSIIK
YGWYHFDVFRFFRPREV*CF*IHCINSTSYSRYALYDLGS*FNCHVFSY*ASKFMFLCNRSIKKKV*IFHGSRLEIFDLRCI
FLWNIIVW
The correct translation for this CDS should look like this:
MKAEFVRILPHMFNLFLAVSPEIFIINATSILLIHGVVFSTSKK
YDYPPLASNVGWLGLLSVLITLLLLAAGAPLLTIAHLFWNNLFRRDNFTYFCQIFLLL
STAGTISMCFDSSDQERFDAFEFIVLIPLPTRGMLFMISAHDLIAMYLAIEPQSLCFY
VIAASKRKSEFSTEAGSKYLILGAFSSGILLFGCSMIYGSTGATHFDQLAKILTGYEI
TGARSSGIFMGILSIAVGFLFKITAVPFHMWAPDIYEGSPTPVTAFLSIAPKISISAN
ILRVSIYGSYGATLQQIFFFCSIASMILGALAAMAQTKVKRPLAHSSIGHVGYIRTGF
SCGTIEGIQSLLIGIFIYALMTMDAFAIVSALRQTRVKYIADLGALAKTNPISAITFS
ITMFSYAGIPPLAGFCSKFYLFFAALGCGAYFLAPVGVVTSVIGRFYYIRLVKRMFFD
TPRTWILYEPMDRNKSLLLAMTSFFITSSLLYPSPLFSVTHQMALSSYL
I'm guessing that the spliced_seq method in Bio::SeqFeatureI isn't correctly
recognizing the Location definition for this coding sequence, which looks like this:
     CDS             complement(join(327890..328078,329735..330306,
                     332945..333105,79740..80132,81113..81297))
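To make the suspected problem concrete, here is a minimal sketch in plain Perl (no BioPerl calls) of how the order of the join() segments changes the spliced result. It assumes, without having confirmed it in the BioPerl source, that the trouble comes from the sub-locations being reordered by coordinate before they are joined. The toy genome, the two segments, and the helper names revcomp and splice_segments are all made up for illustration; only the complement(join(...)) semantics come from the feature above.

```perl
use strict;
use warnings;

# Toy 40-base "genome" standing in for the real record (hypothetical data).
my $genome = 'AAAAACCCCCGGGGGTTTTTACGTACGTACAGTCAGTCAG';

# Two hypothetical segments in the order they appear inside join(...),
# 1-based inclusive. As in the real feature, the listed order is NOT
# ascending by coordinate.
my @segments = ( [ 31, 35 ], [ 6, 10 ] );

# Reverse-complement a DNA string.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Concatenate the given segments in exactly the order supplied.
sub splice_segments {
    my ( $seq, @segs ) = @_;
    return join '',
        map { substr( $seq, $_->[0] - 1, $_->[1] - $_->[0] + 1 ) } @segs;
}

# complement(join(a,b)): join in listed order, then reverse-complement.
my $listed = revcomp( splice_segments( $genome, @segments ) );

# Same segments, but sorted by start coordinate first (the assumed bug).
my @sorted   = sort { $a->[0] <=> $b->[0] } @segments;
my $by_coord = revcomp( splice_segments( $genome, @sorted ) );

print "listed order: $listed\n";
print "sorted order: $by_coord\n";
```

The two strings differ, so if the segments of a trans-spliced CDS are reordered in this way, the exons end up joined in the wrong order and the translation comes out as nonsense with internal stops, much like the output above.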
Could anyone shed some light on this? Ideally I'd like to translate all
of these CDS features into individual Bio::Seq objects for further analysis,
and I thought I'd ask for a bit of help before writing my own parser. Should I
try sub-classing Bio::SeqFeature to override the spliced_seq method, or has
someone else already solved this problem? Any suggestions would be very
helpful.
Thanks,
James Thompson