repost - Re: [Bioperl-l] Huh? Bioperl Seq objects and strands
Ewan Birney
birney@ebi.ac.uk
Wed, 20 Sep 2000 15:18:24 +0100 (GMT)
On Wed, 20 Sep 2000, Arlin Stoltzfus wrote:
> Ewan Birney wrote:
> >
> > > > Also, why are introns and exons top-level features of a sequence, when
> > > > they are also (obviously) sub-features of a gene?
> > > >
> >
> > This is an issue with GenBank/EMBL being mapped into a more interpretable
> > format.
> >
> > GenBank/EMBL sometimes puts introns/exons separate from the CDS lines.
> > Quite often they *disagree* with the CDS lines. What are we meant to do in
> > these cases?
>
> It may help to know that the information on the CDS line of a GenBank
> text file is not a description of the splicing process, but a "SeqLoc" or
> sequence location for the CDS feature. This is why it starts and ends
> with start and stop codons, and not with the beginning and ending of
> the first and last exons. Some published surveys of exon lengths are
> actually based on interpreting the first and last intervals in the CDS
> SeqLoc statements as exons, but they are not.
>
> Every feature in the feature table has a SeqLoc mapping it to the
> sequence, and a SeqLoc of "1..4" is the same as "join(1..3,4)" or
> "order(1..2,3..4)" or "join(1,2,3,4)" etc, because they all specify the
> same sequence location. A CDS that results from -1 translational
> frameshifting after the 45th nucleotide might be specified by "join(1..45,
> 45..599)", so that the 45th nucleotide is included twice. Some entries
> in GenBank actually use this entirely legitimate method.
>
> Also, it's not GenBank's decision whether the introns and exons
> appear explicitly in the feature table: the people who submit
> sequences typically annotate only the CDS, and do not annotate
> mRNA or intron features (usually they have no experimental evidence
> to do this anyway). Programmers can interpret the CDS SeqLoc to
> get implicit information on splicing (I do it all the time), but this
> has its risks.
Indeed. Well said.
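A rough sketch of the point about SeqLocs, in Python rather than Perl and
not tied to any Bioperl API. The function names (parse_location, positions,
gaps) are made up for illustration, and the parser assumes only plain
"a..b" ranges, single positions, and one level of join()/order(); it
ignores complement(), fuzzy ends and remote accessions. It shows that the
different spellings of "1..4" cover the same positions, that the frameshift
example counts nucleotide 45 twice, and that reading the gaps between
intervals as introns is an inference layered on top of the location.

```python
import re

def parse_location(loc):
    """Parse a simple location string into a list of (start, end) intervals.

    Assumption: only 'a..b', single positions, and one level of
    join(...)/order(...) are handled; everything else is out of scope.
    """
    loc = re.sub(r"\s+", "", loc)  # tolerate line-wrapped locations
    m = re.fullmatch(r"(?:join|order)\((.*)\)", loc)
    body = m.group(1) if m else loc
    intervals = []
    for part in body.split(","):
        if ".." in part:
            start, end = part.split("..")
        else:
            start = end = part  # a single position such as "4"
        intervals.append((int(start), int(end)))
    return intervals

def positions(loc):
    """Expand a location into its positions in order, keeping repeats."""
    return [p for start, end in parse_location(loc)
            for p in range(start, end + 1)]

def gaps(loc):
    """Return the non-empty gaps between consecutive intervals.

    Calling these 'introns' is an interpretation: the location only says
    where the feature maps, not how the transcript was spliced.
    """
    ivs = parse_location(loc)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ivs, ivs[1:])
            if b_start - a_end > 1]

# All of these cover the same positions 1, 2, 3, 4.
same = [positions(s) for s in
        ("1..4", "join(1..3,4)", "order(1..2,3..4)", "join(1,2,3,4)")]

# The -1 frameshift example: position 45 appears twice and there is no gap.
frameshift = positions("join(1..45, 45..599)")
```

Here gaps("join(10..20,31..40)") gives one candidate intron, 21..30, while
gaps("join(1..45, 45..599)") gives none, which is exactly the kind of case
where treating every interval boundary as a splice site goes wrong.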
To summarise:
Apart from the DNA sequence itself, EMBL/GenBank is a mess to get data
out of, because the rest of the data has been only loosely standardised
over the last 20 years, in a variety of ways, by millions of people.
Representing this in objects which are more than "an EMBL/GenBank file
as an object" is challenging but, in some sense, what we want to do.
Or in other words:
there is no silver bullet here.
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------