[Bioperl-l] Re: Getting CDS boundaries from Unflattener

Fri Dec 19 11:33:18 EST 2003

On Fri, 2003-12-19 at 10:48, Scott Cain wrote:
> On Thu, 2003-12-18 at 16:52, Chris Mungall wrote:
> > On Thu, 18 Dec 2003, Scott Cain wrote:
> 
> > > The biggest problem with this set of data is that the CDS spans
> > > introns.  The CDS really ought to be broken up into segments to match
> > > the exon boundaries.  As it is, it breaks display in gbrowse whether it
> > > is using chado or a GFF database as a backend.
> > 
> > When I use the unflattener on AE003644, the CDSs I get out have split
> > locations which match the coding exon boundaries - are you sure this isn't
> > a problem with the GFF code? Are you doing all the usual weird stuff like:
> > 
> >         if ($sf->location->isa("Bio::Location::SplitLocationI")) {
> >             @locs = $sf->location->each_Location;
> >         }
> 
> Oops--read that documentation, Scott.  OK, I fixed Bio::Tools::GFF to
> deal with split locations.
> > 
> > > The other problem is that the exons' parentage is incorrect.  The exons
> > > should be features of the gene, not the mRNA.
> > 
> > I think you have this the wrong way round. Again, this must be a problem
> > with how you're assigning parent tags in the GFF output, when I try
> > AE003644 the exons are children of the mRNA, which is correct.
> > 
> I don't think so; here are the relevant lines from SO:
> 
>     @is_a@gene ; SO:0000704 ; SOFA:SOFA ; SOFA:region
>      @part_of@transcript ; SO:0000673 ; SOFA:SOFA ; SOFA:region
>       @part_of@exon ; SO:0000147 ; SOFA:SOFA ; SOFA:region
>       @is_a@processed_transcript ; SO:0000233 ; SOFA:SOFA ; SOFA:region
>        @is_a@mRNA ; SO:0000234 ; SOFA:SOFA ; SOFA:region ; synonym:messenger_RNA
>         @part_of@CDS ; SO:0000316 ; SOFA:SOFA ; SOFA:region ; synonym:coding_sequence
> 
> Now, I am not one to be lecturing on ontologies, so I may have
> misinterpreted something here, but it looks to me like exon is part of a
> transcript, but not part of an mRNA.  And since we typically don't have
> transcript features in GenBank records, exon should be part_of gene.  An
> alternative would be to infer a transcript feature for each mRNA feature
> and tie the exons to the transcript features, while leaving the mRNAs and
> CDSs as they are.
> 
OK, the real problem is that the thing labeled an mRNA in the
feature from Unflattener (which it gets from the GenBank record)
is a transcript, not an mRNA/processed transcript.  That is not to say
the GenBank record is wrong; it's not.  Generally, the mRNA feature is a
collection of ranges in a join.  What Unflattener gives for an mRNA
feature is really a primary transcript.
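
To make the distinction concrete, here is a minimal sketch in plain Perl
(no BioPerl required).  It is not how Unflattener or Bio::Tools::GFF work
internally; parse_join and span_of are hypothetical helpers, and the
coordinates are made up.  It only shows that a join of ranges carries the
per-exon segments, while its overall extent is a single unbroken range, which
is what a primary transcript would cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull [start, end] pairs out of a GenBank-style
# location string such as "join(100..200,300..400)".  This is a sketch:
# it ignores complement(), remote accessions, and single-base positions.
sub parse_join {
    my ($loc) = @_;
    my @ranges;
    while ($loc =~ /(\d+)\.\.(\d+)/g) {
        push @ranges, [$1, $2];
    }
    return @ranges;
}

# Hypothetical helper: overall extent of a set of ranges,
# i.e. smallest start and largest end.
sub span_of {
    my @ranges = @_;
    my ($min, $max);
    for my $r (@ranges) {
        $min = $r->[0] if !defined $min || $r->[0] < $min;
        $max = $r->[1] if !defined $max || $r->[1] > $max;
    }
    return ($min, $max);
}

# Made-up coordinates, for illustration only.
my $mrna_location = 'join(100..200,300..400,500..650)';

my @exons = parse_join($mrna_location);
my ($start, $end) = span_of(@exons);

# The unbroken extent (what a primary transcript would cover) ...
print "transcript\t$start\t$end\n";
# ... versus the individual ranges listed in the join.
print "exon\t$_->[0]\t$_->[1]\n" for @exons;
```

With real BioPerl features the segments would come from each_Location on a
split location, as in Chris's snippet above, rather than from string parsing.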
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain at cshl.org
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory