[Bioperl-l] Re: Getting CDS boundaries from Unflattener
Chris Mungall
cjm at fruitfly.org
Thu Dec 18 16:52:19 EST 2003
On Thu, 18 Dec 2003, Scott Cain wrote:
> Hi Chris,
>
> I very much what to reimplement Bio::DB::GFF::Adaptor::biofetch using
> Unflattener, but but there are a few problems I am having. Below is a
> section of GFF that I generate using Unflattener from AE003644:
>
> AE003644 EMBL/GenBank/SwissProt gene 20111 23268 . + . ID=noc;db_xref=FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;map=35B2-35B2;note=last+curated+on+Thu+Dec+13+16:51:32+PST+2001
> AE003644 EMBL/GenBank/SwissProt mRNA 20111 23268 . + . ID=noc_mRNA_1;Parent=noc;db_xref=FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;product=CG4491-RA
> AE003644 EMBL/GenBank/SwissProt CDS 20495 22410 . + . Parent=noc_mRNA_1;codon_start=1;db_xref=GI:7298163,FLYBASE:FBgn0005771;gene=noc;locus_tag=CG4491;note=noc+gene+product;product=CG4491-PA;protein_id=AAF53399.1;translation=MVVLEGGGGV...
> AE003644 EMBL/GenBank/SwissProt exon 20111 20584 . + . Parent=noc_mRNA_1
> AE003644 EMBL/GenBank/SwissProt exon 20887 23268 . + . Parent=noc_mRNA_1
>
> The biggest problem with this set of data is that the CDS spans
> introns. The CDS really ought to be broken up into segments to match
> the exon boundaries. As it is, it breaks display in gbrowse whether it
> is using chado or a GFF database as a backend.
When I use the unflattener on AE003644, the CDSs I get out have split
locations which match the coding exon boundaries - are you sure this isn't
a problem with the GFF code? Are you doing all the usual weird stuff like:
if ($sf->location->isa("Bio::Location::SplitLocationI")) {
@locs = $sf->location->each_Location;
}
> The other problem is that the exons' parentage is incorrect. The exons
> should be features of the gene, not the mRNA.
I think you have this the wrong way round. Again, this must be a problem
with how you're assigning parent tags in the GFF output, when I try
AE003644 the exons are children of the mRNA, which is correct.
Try this using the unflattener script (which uses SeqIO::asciitree as
default output)
cd bioperl-live
perl scripts/seq/unflatten_seq.PLS t/data/AE003644_Adh-genomic.gb
Seq: AE003644
databank_entry 1..263309[+]
gene
mRNA CG4491-RA
CDS CG4491-PA 20495..22410[+] ;
SPLIT: 20495..20584[+] 20887..22410[+]
exon 20111..20584[+]
exon 20887..23268[+]
gene
tRNA tRNA-Pro
exon 25127..25198[+]
exons under mRNA, and CDS boundaries matching exons
(do a cvs update for the latest asciitree.pm)
cheers
chris
> Thanks,
> Scott
>
>
>
>
More information about the Bioperl-l
mailing list