[Bioperl-l] Re: *major* error in genbank parser or am i just insane?
Francis Ouellette
francis@cmmt.ubc.ca
Fri, 09 Aug 2002 12:40:19 -0700
[apologies: long reply]
"Lin, Xiaoying J." wrote:
> but for CDS features but no exon features, I am not sure I understand
> you correctly. there are lots submissions in Genbank, which only comes
> with CDS (join) features, but no separate exon features. If that is a
> mistake, it is a systematic mistake then. How does the current parser
> handle a record like
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=1458097&dopt=GenBank
Having a CDS (with join) and no exon feature is how most (the great
majority) of the CDSs submitted to NCBI for inclusion in GenBank
are built.
The rationale for this is that there were tooooo many records where the
exon features were not valid/validated, which made exon a bad feature.
The very best place (within NCBI's data model) to check and validate
exons is the join that makes up the CDS: make sure it is valid and makes
the right protein, and you have valid exons. All of the information you
need/want is in the join statement.
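
To make the point concrete, here is a minimal sketch of recovering exon
coordinates from a CDS join. The function name parse_join and the
coordinates are made up for illustration; this is not how bioperl or
NCBI do it, and it only handles plain forward-strand locations.

```python
import re

def parse_join(location):
    """Return a list of (start, end) exon intervals from a simple location string."""
    # Handles only plain forms such as "join(10..50,100..180)" or "10..50";
    # complement(), fuzzy ends (<, >) and remote references are out of scope.
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

# Hypothetical CDS location, not taken from any real record.
cds_exons = parse_join("join(10..50,100..180,250..301)")

# Coordinates are 1-based and inclusive, hence the +1.
coding_length = sum(end - start + 1 for start, end in cds_exons)
```

Each pair in cds_exons is the coding part of one exon, and the summed
length is what the translation would be checked against.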
But "Ha" you say ... what about UTRs? Well, if you have non-coding
exons, and you have their coordinates, you should put that information
in a join statement in an mRNA feature.
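
As a rough illustration of how the two joins carry the UTR information,
the sketch below subtracts the CDS intervals from the mRNA intervals.
The helper name subtract and all coordinates are invented for the
example, and it assumes forward-strand, 1-based inclusive intervals.

```python
def subtract(exons, coding):
    """Return the parts of each exon interval not covered by any coding interval."""
    leftover = []
    for start, end in exons:
        pos = start  # first position of this exon not yet accounted for
        for c_start, c_end in sorted(coding):
            if c_end < pos or c_start > end:
                continue  # coding segment does not touch the rest of this exon
            if c_start > pos:
                leftover.append((pos, c_start - 1))
            pos = max(pos, c_end + 1)
        if pos <= end:
            leftover.append((pos, end))
    return leftover

# Hypothetical mRNA and CDS joins, already parsed into intervals.
mrna_exons = [(1, 50), (100, 180), (250, 400)]
cds_exons = [(10, 50), (100, 180), (250, 301)]

utr = subtract(mrna_exons, cds_exons)
```

Here utr holds the 5' piece before the start codon and the 3' piece
after the stop, without any exon feature being present.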
With those two features (CDS and mRNA) the exon feature becomes
superfluous (in the NCBI data model; I know and understand this is not
the case in the bioperl world).
Another thing, which as far as I know is *not* validated in the current
NCBI model (well, it wasn't a few years back when I was a humble civil
servant): the join statement from the mRNA and the one from the
corresponding CDS were not matched to make sure they were in accordance,
and obviously you don't have a translation to validate the mRNA join.
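
For what it's worth, a consistency check of that kind could look
something like the sketch below. This is purely my illustration, not
anything NCBI runs; the function name and coordinates are hypothetical,
and it assumes forward-strand intervals listed in order. It is also only
a partial check (it does not catch a skipped exon, for instance).

```python
def cds_consistent_with_mrna(cds, mrna):
    """True if every CDS segment lies inside an mRNA exon and shares its splice boundaries."""
    for i, (c_start, c_end) in enumerate(cds):
        # Find the mRNA exon that fully contains this CDS segment, if any.
        host = next(((m_start, m_end) for m_start, m_end in mrna
                     if m_start <= c_start and c_end <= m_end), None)
        if host is None:
            return False
        # Only the first segment may begin inside an exon (after a 5' UTR),
        # and only the last may end inside one (before a 3' UTR).
        if i > 0 and c_start != host[0]:
            return False
        if i < len(cds) - 1 and c_end != host[1]:
            return False
    return True

mrna = [(1, 50), (100, 180), (250, 400)]
good_cds = [(10, 50), (100, 180), (250, 301)]
bad_cds = [(10, 50), (105, 180), (250, 301)]  # internal boundary disagrees

ok = cds_consistent_with_mrna(good_cds, mrna)
not_ok = cds_consistent_with_mrna(bad_cds, mrna)
```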
Before people get bent out of shape against NCBI for not encouraging the
exon feature, let me state the philosophy and reasoning behind that
(very good, imho) decision: mRNAs and proteins are real biological
entities within the cell, and within the NCBI data model exons are
not -- they don't exist on their own. The NCBI data model (of which the
GenBank flatfile is a *poor* text/report representation) tries to
represent (read: validate, promote, allow computation on) biological
"stuff". It doesn't care much for things which are not really
"validatable" (an exon on its own is next to impossible to validate,
and a CDS is much easier to validate).
Anyway, I hope this long discourse explains a little where things
are coming from ...
cheers,
f.
--
| B.F. Francis Ouellette francis@cmmt.ubc.ca |
| Director, Bioinformatics Centre Tel: (604) 875-3815 |
| University of British Columbia Fax: (604) 608-4795 |
| Vancouver, BC Canada http://www.cmmt.ubc.ca/ouellette |