[Bioperl-l] Re: Getting CDS boundaries from Unflattener

Sat Dec 27 10:19:05 EST 2003

On Fri, 19 Dec 2003, Scott Cain wrote:

> On Fri, 2003-12-19 at 10:48, Scott Cain wrote:
> > On Thu, 2003-12-18 at 16:52, Chris Mungall wrote:
[snip]
> > > > The other problem is that the exons' parentage is incorrect.  The exons
> > > > should be features of the gene, not the mRNA.
> > >
> > > I think you have this the wrong way round. Again, this must be a problem
> > > with how you're assigning parent tags in the GFF output, when I try
> > > AE003644 the exons are children of the mRNA, which is correct.
> > >
> > I don't think so; here are the relevant lines from SO:
> >
> >     @is_a@gene ; SO:0000704 ; SOFA:SOFA ; SOFA:region
> >      @part_of@transcript ; SO:0000673 ; SOFA:SOFA ; SOFA:region
> >       @part_of@exon ; SO:0000147 ; SOFA:SOFA ; SOFA:region
> >       @is_a@processed_transcript ; SO:0000233 ; SOFA:SOFA ; SOFA:region
> >        @is_a@mRNA ; SO:0000234 ; SOFA:SOFA ; SOFA:region ; synonym:messenger_RNA
> >         @part_of@CDS ; SO:0000316 ; SOFA:SOFA ; SOFA:region ; synonym:coding_sequence
> >
> > Now, I am not one to be lecturing on ontologies, so I may have
> > misinterpreted something here, but it looks to me like exon is part of a
> > transcript, but not part of an mRNA.  And since we typically don't have
> > transcript features in Genbank records, exon should be part_of gene.  An
> > alternative would be to infer a transcript feature for each mRNA feature
> > and tie the exons to the transcript features, but leaving the mRNAs and
> > CDSs as is.

exon definitely shouldn't be part of gene, as this will mess up anything
involving alternative splicing: the same gene can have several transcripts
that use different subsets of its exons. It's OK to have exon part_of mRNA,
because mRNA is a subclass of transcript.

The logic here is quite subtle; we should really take this to the SO list.
Without getting too much into the logic of part_of, for now we can infer
the following:

X is_a Y
Z (necessarily) part_of Y
=>
Z (can be) part_of X

I have some code on another branch of bioperl that does this kind of
consistency checking on bioperl seqfeature hierarchies via SO... need to
migrate this over.
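
As a rough illustration of the inference rule above (this is not the
bioperl code just mentioned, only a toy sketch in Python), the check could
look something like the following. The IS_A and PART_OF tables and both
function names are made up for this example; the relations are transcribed
by hand from the SO excerpt quoted earlier in the thread. It only applies
the single rule given above and deliberately ignores transitivity of
part_of.

```python
# Hypothetical toy tables, hand-copied from the SO lines quoted above.
# child term -> its is_a parent
IS_A = {
    "processed_transcript": "transcript",
    "mRNA": "processed_transcript",
}
# part term -> set of wholes it is declared part_of
PART_OF = {
    "transcript": {"gene"},
    "exon": {"transcript"},
    "CDS": {"mRNA"},
}


def is_a_ancestors(term):
    """Yield the term itself, then each superclass reached via is_a."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def can_be_part_of(part, whole):
    """Apply the rule: X is_a Y, Z part_of Y  =>  Z can be part_of X.

    Here `whole` plays the role of X and `part` the role of Z.
    A directly declared part_of relation also counts.
    """
    declared = PART_OF.get(part, set())
    return any(ancestor in declared for ancestor in is_a_ancestors(whole))


print(can_be_part_of("exon", "mRNA"))       # exon part_of transcript, mRNA is_a ... transcript
print(can_be_part_of("exon", "gene"))       # no such relation declared for exon
print(can_be_part_of("CDS", "transcript"))  # the rule does not run up the is_a chain
```

With these tables, exon under mRNA is accepted, while exon directly under
gene is rejected, which matches the argument above.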

In a later version, SO will have distinct notions of necessarily part_of
and necessarily has_part in the inverse direction, which will allow more
powerful consistency checking.

> OK, the real problem is that the thing that is labeled an mRNA in the
> feature from Unflattener (which it is getting from the genbank record)
> is a transcript, not an mRNA/processed transcript.  That is not to say
> the genbank record is wrong--its not.  Generally, the mRNA feature is a
> collection of ranges in a join.  What Unflattener gives for an mRNA
> feature is really a primary transcript.

To a biologist it's possibly rather strange to think of an mRNA containing
exons; pre-mRNAs have exons, processed mRNAs have exon junctions.

I think it's still useful to think of the mRNA as the exon container, if
only conceptually.

In most representations, whether Ensembl, Chado, GFF3 or the bioperl
objects generated by the unflattener, we economise by having one entity
represent two entities, the pre- and post-processed forms. In actual fact,
there are often more than two. You can think of an mRNA feature either as
the processed mRNA plus its implicit causative features (much like how
introns are usually implicit), or as a primary protein-coding transcript
with the potential/destiny to form an mRNA.
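
To make the "one feature, two forms" idea concrete, here is a small Python
sketch (not bioperl code). It assumes an mRNA feature is given simply as a
list of exon (start, end) pairs, 1-based and inclusive, all on one strand.
The function names and the coordinates are invented for illustration. From
that one list you can read off both the primary transcript extent and the
processed length, and the introns fall out implicitly as the gaps.

```python
def implicit_introns(exons):
    """Return the gaps between consecutive exons as (start, end) pairs."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a real gap counts; abutting exons imply no intron.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def primary_transcript_length(exons):
    """Length of the unspliced span, first exon start to last exon end."""
    ordered = sorted(exons)
    return ordered[-1][1] - ordered[0][0] + 1


def processed_length(exons):
    """Length after splicing: the sum of the exon lengths."""
    return sum(end - start + 1 for start, end in exons)


# Made-up coordinates, in the spirit of a GenBank join(100..200,300..400,500..650)
exons = [(100, 200), (300, 400), (500, 650)]
print(implicit_introns(exons))           # [(201, 299), (401, 499)]
print(primary_transcript_length(exons))  # 551
print(processed_length(exons))           # 353
```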

The alternative is to have a full GK-like object model for representing
all entities involved in transcription/translation, which isn't
appropriate for a genome database/object model.

Cheers
Chris