repost - Re: [Bioperl-l] Huh? Bioperl Seq objects and strands
Wed, 20 Sep 2000 09:54:52 -0400
Ewan Birney wrote:
> > > Also, why are introns and exons top-level features of a sequence, when
> > > they are also (obviously) sub-features of a gene?
> > >
> This is an issue with GenBank/EMBL being mapped into a more interpretable
> GenBank/EMBL sometimes puts introns/exons separate from the CDS lines.
> Quite often they *disagree* with the CDS lines. What are we meant to do in
> these cases?
It may help to know that the information on the CDS line of a GenBank
text file is not a description of the splicing process, but a "SeqLoc" or
sequence location for the CDS feature. This is why it starts and ends
with start and stop codons, and not with the beginning and ending of
the first and last exons. Some published surveys of exon lengths are
actually based on interpreting the first and last intervals in the CDS
SeqLoc statements as exons, but those intervals are not exons.
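To make the distinction concrete, here is a minimal sketch in Python. The exon coordinates and the start and stop codon positions are made up for illustration (1-based, inclusive, forward strand only), and the helper names are hypothetical, not part of Bioperl or any GenBank tool. It clips a set of exons to the coding region, which is roughly how the intervals in a CDS SeqLoc relate to the exons of the transcript.

```python
def clip_exons_to_cds(exons, cds_start, cds_end):
    """Return the part of each exon that lies inside cds_start..cds_end.

    Coordinates are 1-based and inclusive; forward strand only (assumption).
    """
    intervals = []
    for start, end in exons:
        lo = max(start, cds_start)
        hi = min(end, cds_end)
        if lo <= hi:  # skip exons that lie entirely in a UTR
            intervals.append((lo, hi))
    return intervals


def as_join(intervals):
    """Format intervals the way a feature-table location is written."""
    parts = [str(s) if s == e else f"{s}..{e}" for s, e in intervals]
    return "join(" + ",".join(parts) + ")"


# Hypothetical gene: three exons, start codon at 50, stop codon ending at 700.
exons = [(1, 120), (301, 450), (601, 900)]
cds = clip_exons_to_cds(exons, cds_start=50, cds_end=700)
print(as_join(cds))  # prints join(50..120,301..450,601..700)
```

In this made-up case the first CDS interval is 71 nucleotides long while the first exon is 120, and the last CDS interval is 100 long while the last exon is 300, so a survey that read the CDS intervals as exons would underestimate both.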
Every feature in the feature table has a SeqLoc mapping it to the
sequence, and a SeqLoc of "1..4" is the same as "join(1..3,4)" or
"order(1..2,3..4)" or "join(1,2,3,4)", etc., because they all specify the
same sequence location. A CDS that results from -1 translational
frameshifting after the 45th nucleotide might be specified by "join(1..45,
45..599)", so that the 45th nucleotide is included twice. Some entries
in GenBank actually use this entirely legitimate method.
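As a rough illustration of the point about equivalent SeqLocs, the following Python sketch expands a location string into the list of positions it covers. The parser is a hypothetical helper written for this example, not a Bioperl or GenBank API, and it assumes only the simplest syntax: N, N..M, join(...) and order(...) on the forward strand, with no complement(), fuzzy ends or remote references. It treats join and order alike, since only the positions covered are compared here.

```python
import re


def parse_intervals(loc):
    """Parse a simple location string into a list of (start, end) tuples."""
    loc = "".join(loc.split())  # drop spaces and line breaks
    m = re.fullmatch(r"(?:join|order)\((.*)\)", loc)
    body = m.group(1) if m else loc
    intervals = []
    for part in body.split(","):
        if ".." in part:
            start, end = part.split("..")
        else:
            start = end = part  # single position such as "4"
        intervals.append((int(start), int(end)))
    return intervals


def positions(loc):
    """Expand a location into the ordered list of positions it covers."""
    out = []
    for start, end in parse_intervals(loc):
        out.extend(range(start, end + 1))
    return out


# The four spellings from the text all cover positions 1, 2, 3, 4.
same = ["1..4", "join(1..3,4)", "order(1..2,3..4)", "join(1,2,3,4)"]
print(all(positions(s) == [1, 2, 3, 4] for s in same))  # prints True

# The -1 frameshift example: nucleotide 45 is read twice.
shifted = positions("join(1..45, 45..599)")
print(len(shifted), shifted.count(45))  # prints 600 2
```

Because position 45 is counted twice, the frameshifted location covers 600 nucleotides, a multiple of three, even though it spans only 599 positions on the sequence.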
Also, it's not GenBank's decision whether the introns and exons
appear explicitly in the feature table. The people who submit
sequences typically annotate only the CDS, and do not annotate
mRNA or intron features (usually they have no experimental evidence
to do this anyway). Programmers can interpret the CDS SeqLoc to
get implicit information on splicing (I do it all the time), but this
has its risks.
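For what it's worth, one way to read implicit splicing out of a CDS SeqLoc is to treat the gaps between consecutive intervals as implied introns. The Python sketch below does that under stated assumptions: intervals are already parsed into 1-based inclusive (start, end) tuples on the forward strand, and the function name is hypothetical. It also shows one of the risks, since a join that encodes a frameshift has no gap at all and must not be reported as an intron.

```python
def implied_introns(intervals):
    """Guess introns from the gaps between consecutive CDS intervals.

    Returns (introns, suspicious), where suspicious holds the pairs of
    adjacent intervals that overlap or abut and so cannot flank an intron.
    """
    introns = []
    suspicious = []
    for left, right in zip(intervals, intervals[1:]):
        gap_start = left[1] + 1
        gap_end = right[0] - 1
        if gap_start <= gap_end:
            introns.append((gap_start, gap_end))
        else:
            suspicious.append((left, right))
    return introns, suspicious


# Hypothetical spliced CDS: join(50..120,301..450,601..700)
print(implied_introns([(50, 120), (301, 450), (601, 700)]))
# prints ([(121, 300), (451, 600)], [])

# Frameshift CDS: join(1..45,45..599) has no intron at all.
print(implied_introns([(1, 45), (45, 599)]))
# prints ([], [((1, 45), (45, 599))])
```

Even when the gaps look plausible, this only recovers introns inside the coding region. Introns in the untranslated regions, and the true outer boundaries of the first and last exons, are not in the CDS SeqLoc at all.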
Arlin Stoltzfus (email@example.com)
CARB (www.carb.nist.gov), 9600 Gudelsky Dr., Rockville, Md 20850
ph. 301 738-6208; fax 301 738-6255; www.molevol.org/camel