[Bioperl-l] Re: Getting CDS boundaries from Unflattener

Thu Dec 18 16:52:19 EST 2003

On Thu, 18 Dec 2003, Scott Cain wrote:

> Hi Chris,
>
> I very much what to reimplement Bio::DB::GFF::Adaptor::biofetch using
> Unflattener, but but there are a few problems I am having.  Below is a
> section of GFF that I generate using Unflattener from AE003644:
>
> AE003644        EMBL/GenBank/SwissProt  gene    20111   23268   .       +       .       ID=noc;db_xref=FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;map=35B2-35B2;note=last+curated+on+Thu+Dec+13+16:51:32+PST+2001
> AE003644        EMBL/GenBank/SwissProt  mRNA    20111   23268   .       +       .       ID=noc_mRNA_1;Parent=noc;db_xref=FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;product=CG4491-RA
> AE003644        EMBL/GenBank/SwissProt  CDS     20495   22410   .       +       .       Parent=noc_mRNA_1;codon_start=1;db_xref=GI:7298163,FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;note=noc+gene+product;product=CG4491-PA;protein_id=AAF53399.1;translation=MVVLEGGGGV...
> AE003644        EMBL/GenBank/SwissProt  exon    20111   20584   .       +       .       Parent=noc_mRNA_1
> AE003644        EMBL/GenBank/SwissProt  exon    20887   23268   .       +       .       Parent=noc_mRNA_1
>
> The biggest problem with this set of data is that the CDS spans
> introns.  The CDS really ought to be broken up into segments to match
> the exon boundaries.  As it is, it breaks display in gbrowse whether it
> is using chado or a GFF database as a backend.

When I use the unflattener on AE003644, the CDSs I get out have split
locations which match the coding exon boundaries - are you sure this isn't
a problem with the GFF code? Are you doing all the usual weird stuff like:

        if ($sf->location->isa("Bio::Location::SplitLocationI")) {
            @locs = $sf->location->each_Location;
        }

> The other problem is that the exons' parentage is incorrect.  The exons
> should be features of the gene, not the mRNA.

I think you have this the wrong way round. Again, this must be a problem
with how you're assigning parent tags in the GFF output, when I try
AE003644 the exons are children of the mRNA, which is correct.

Try this using the unflattener script (which uses SeqIO::asciitree as
default output)

cd bioperl-live
perl scripts/seq/unflatten_seq.PLS t/data/AE003644_Adh-genomic.gb

Seq: AE003644
  databank_entry                                   1..263309[+]
  gene
    mRNA CG4491-RA
      CDS CG4491-PA                                20495..22410[+] ;
SPLIT: 20495..20584[+]  20887..22410[+]
      exon                                         20111..20584[+]
      exon                                         20887..23268[+]
  gene
    tRNA tRNA-Pro
      exon                                         25127..25198[+]

exons under mRNA, and CDS boundaries matching exons

(do a cvs update for the latest asciitree.pm)

cheers
chris

> Thanks,
> Scott
>
>
>
>