[Bioperl-l] Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
Jay Hannah
jay at jays.net
Thu Mar 1 15:51:38 UTC 2007
In my GenBank files when I'm sitting on a CDS usually I can just call
$feature->seq->seq;
and out pops the exact nucleotide sequence which codes my protein. Very
cool.
Unfortunately, I have a crazy GenBank file which contains a CDS with a
split range like this: CDS join(1959..2355,1..92)
When I try to use $feature->seq->seq I don't end up with just the properly
pieced together coding region, I end up with the *entire* nucleotide
sequence.
This seems to be happening because
Bio::SeqFeature::Generic::seq
506: my $seq = $self->{'_gsf_seq'}->trunc($self->start(), $self->end());
(which is calling Bio::PrimarySeqI::trunc)
works fine when Bio::SeqFeature::Generic is using
'_location' => Bio::Location::Simple=HASH(0x1804344)
'_end' => 2842
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'AE015930'
'_start' => 1601
'_strand' => 1
but when things get complicated and Bio::SeqFeature::Generic is using
'_location' => Bio::Location::Split=HASH(0x1d1f130)
'_seqid' => 'PNECG'
'_splittype' => 'JOIN'
'_sublocations' => ARRAY(0x1d1e654)
0 Bio::Location::Simple=HASH(0x1d1f290)
'_end' => 2355
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'PNECG'
'_start' => 1959
'_strand' => 1
1 Bio::Location::Simple=HASH(0x1d1f338)
'_end' => 92
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'PNECG'
'_start' => 1
'_strand' => 1
Simply passing $self->start and $self->end into trunc() will not pull
off the appropriate magic.
Question 1: Perhaps my data was bad and I should refuse to process
join(1959..2355,1..92)? My accession is M12730, and if I download that
from NCBI now it looks like they've changed it so my problem no longer
exists in that sequence anyway. There are already 71 examples of CDS join
in various files in t/data, and *none* of those examples jump backwards.
Should I write this off as bad data or try to enhance BioPerl? I'm happy
to throw my painful M12730 on the end of t/data/test.genbank and write
tests for it if anyone thinks it is important.
Question 2: Even if we can just ignore my M12730, though, I think there's
still a problem afoot. Below I demo L26462 (already siting in
t/data/test.genbank) which has a
CDS join(866..957,1088..1310,2161..2289)
In this case (as my tests below demonstrate), $feature->seq->seq is
pulling the right range of nucleotide, but it's also pulling the gaps
(introns). Isn't that wrong? Shouldn't it skip the introns?
So... is the appropriate approach to try to enhance
Bio::PrimarySeqI::trunc() for Bio::Location::Split? Or should trunc() be
left alone and Bio::SeqFeature::Generic::seq() needs to get smarter?
Or...?
Thanks, oh mighty BioWizards! :)
j
seqlab.net
http://www.bioperl.org/wiki/User:Jhannah
-----------------
Tack this on the end of t/genbank.t and the length test at the end fails:
-----------------
# Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
my $stream = Bio::SeqIO->new(-file => Bio::Root::IO->catfile
("t","data","test.genbank"),
-verbose => $verbose,
-format => 'genbank');
my $seq = $stream->next_seq;
while ($seq->accession ne "M37762") {
$seq = $stream->next_seq;
}
# M37762 has a CDS 76..819, which should work fine.
ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
my $feat;
foreach my $feat2 ( @features ) {
next unless ($feat2->primary_tag eq "CDS");
my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
if (grep { $_ eq "GI:179403" } @db_xrefs) {
$feat = $feat2;
last;
}
}
my ($protein_seq) = $feat->annotation->get_Annotations("translation");
ok($protein_seq =~ /^MTILFLTMVISYFGCMKA.*GWRFIRIDTSCVCTLTIKRGR$/, "protein sequence");
my ($nucleotide_seq) = $feat->seq->seq;
ok($nucleotide_seq =~ /^ATGACCATCCTTTTCCTT.*ACCATTAAAAGGGGAAGATAG$/, "nucleotide sequence");
is(length($nucleotide_seq), 744, "nucleotide length");
# Jump down to L26462 which has a CDS join(866..957,1088..1310,2161..2289), which is broken?
while ($seq->accession ne "L26462") {
$seq = $stream->next_seq;
}
ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
my $feat;
foreach my $feat2 ( @features ) {
next unless ($feat2->primary_tag eq "CDS");
my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
if (grep { $_ eq "GI:532506" } @db_xrefs) {
$feat = $feat2;
last;
}
}
my ($protein_seq) = $feat->annotation->get_Annotations("translation");
ok($protein_seq =~ /^MVHLTPEEKSAVTALWGK.*VQAAYQKVVAGVANALAHKYH$/, "protein sequence");
my ($nucleotide_seq) = $feat->seq->seq;
ok($nucleotide_seq =~ /^ATGGTGCATCTGACTCCT.*CTGGCCCACAAGTATCACTAA$/, "nucleotide sequence - correct CDS range");
#print "[$nucleotide_seq]\n";
ok($nucleotide_seq !~ /^ACCTCCTATTTGACACCA.*TGCTAGTCTCCCGGAACTATC$/, "nucleotide sequence - full nucleotide should not match");
is(length($nucleotide_seq), 444, "nucleotide length");
# I have an old(?) version of M12730 which lists
# CDS join(1959..2355,1..92)
# /db_xref="GI:150830"
# Crazy ranges like that don't work at all, you end up with the full nucleotide sequence...
# But NCBI doesn't list M12730 that way any more, so now I would be OK?
# ------------------
More information about the Bioperl-l
mailing list