[Bioperl-l] CDS/exon was Re: *major* error in genbank parser or am i just insane?
Lin, Xiaoying J.
Xiaoying.Lin@celera.com
Fri, 9 Aug 2002 16:24:10 -0400
Francis,
Thanks for the clarification on the GenBank model. Otherwise I would have
been guilty of submitting several thousand genes without proper annotation ;-).
For better data handling, and to avoid having out-of-sync mRNA/CDS features,
I am thinking of not storing the exons of the CDS as separate features at
all, and instead storing only the coordinates of the mRNA and the start/stop
of translation. I will need help on two points:
1. Is this model OK? Has anyone tried this?
2. I have not found a way to translate part of an exon (feature) with
bioperl, where the remaining part of the exon is UTR. Could someone give me
a hint on how to do this?
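
To make the idea concrete, here is a minimal sketch of the model in plain
Python rather than bioperl. It assumes forward-strand features and 1-based
inclusive coordinates. The sequence, the coordinates, and the helper names
(cds_segments, splice, translate, to_join) are made up for illustration and
are not bioperl calls. The coding segments are derived by clipping the mRNA
exons to the translation start/stop, so a partly coding exon is handled
without storing a separate CDS or exon feature.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def cds_segments(exons, cds_start, cds_end):
    """Clip each mRNA exon to the translated range; drop exons that are all UTR."""
    segments = []
    for start, end in exons:
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo <= hi:
            segments.append((lo, hi))
    return segments


def splice(seq, segments):
    """Concatenate the 1-based inclusive segments of seq."""
    return "".join(seq[s - 1:e] for s, e in segments)


def translate(cds):
    """Translate complete codons, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_join(segments):
    """Format segments as a GenBank-style location string."""
    parts = ",".join(f"{s}..{e}" for s, e in segments)
    return f"join({parts})" if len(segments) > 1 else parts


# Hypothetical two-exon gene: 5' UTR + coding | intron | coding + 3' UTR.
genome = "GGCC" + "ATGGC" + "GTAAGTAG" + "TAAATGA" + "CCGG"
mrna_exons = [(1, 9), (18, 28)]   # what would be stored for the mRNA
cds_start, cds_end = 5, 24        # translation start/stop on the genome

coding = cds_segments(mrna_exons, cds_start, cds_end)
print(to_join(mrna_exons))                 # join(1..9,18..28)
print(to_join(coding))                     # join(5..9,18..24)
print(translate(splice(genome, coding)))   # MAK
```

Because the CDS join is always computed from the mRNA exons, the two cannot
drift out of sync. Reverse-strand genes would additionally need the segments
reversed and complemented, which this sketch leaves out.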
Thanks.
Xiaoying
> -----Original Message-----
> From: Francis Ouellette [mailto:francis@cmmt.ubc.ca]
> Sent: Friday, August 09, 2002 3:40 PM
> To: Lin, Xiaoying J.
> Cc: lstein@cshl.org; brian.king@animorphics.net; Brian King; Ewan
> Birney; bioperl-l@bioperl.org
> Subject: Re: [Bioperl-l] Re: *major* error in genbank parser or am i
> just insane?
>
>
>
>
> [ apologies: long reply ]
>
> "Lin, Xiaoying J." wrote:
>
> > but for CDS features but no exon features, I am not sure I understand
> > you correctly. There are lots of submissions in GenBank which come only
> > with CDS (join) features, but no separate exon features. If that is a
> > mistake, it is a systematic mistake then. How does the current parser
> > handle a record like
> >
> > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=1458097&dopt=GenBank
>
>
> Having a CDS (with join) and no exon feature is how most (the great
> majority) of the CDSs submitted to the NCBI for inclusion in GenBank
> are built.
>
> The rationale for this is that there were tooooo many cases where the
> exon feature was not valid/validated and it was a bad feature, and that
> the very best place (within NCBI's data model) to check and validate
> these was to make sure the joins that make up the CDS are valid, and make
> the right protein, with valid exons. All of the information you need/want
> is in the join statement.
>
> But "Ha" you say ... what about UTRs? Well, if you have non-coding
> exons, and you have their coordinates, you should put that information
> in a join statement in an mRNA feature.
>
> With those two features (CDS and mRNA) the exon feature becomes
> superfluous (in the NCBI data model; I know and understand this is not
> the case in the bioperl world).
>
> Another thing, which as far as I know is *not* validated in the current
> NCBI model (well, it wasn't a few years back when I was a humble civil
> servant): the join statement from the mRNA and the one from the
> corresponding CDS were not matched to make sure they were in accordance,
> and obviously you don't have a translation to validate that join.
>
> Before people get bent out of shape against NCBI for not encouraging the
> exon feature, let me state the philosophy and reasoning behind that
> (very good, imho) decision: mRNAs and proteins are real biological
> entities within the cell, and within the NCBI data model exons are not --
> they don't exist on their own. The NCBI data model (of which the GenBank
> flatfile is a *poor* text/report representation) tries to represent
> (read: validate, promote, allow computation on) biological "stuff". It
> doesn't care much for things which are not really "validatable" (an exon
> on its own is next to impossible to validate, and a CDS is much easier
> to validate).
>
> Anyway, I hope this long discourse explains a little where things
> are coming from ...
>
> cheers,
>
> f.
>
>
> --
> | B.F. Francis Ouellette francis@cmmt.ubc.ca |
> | Director, Bioinformatics Centre Tel: (604) 875-3815 |
> | University of British Columbia Fax: (604) 608-4795 |
> | Vancouver, BC Canada http://www.cmmt.ubc.ca/ouellette |
>