[Bioperl-l] *major* error in genbank parser or am i just insane?
Brian King
brian.king@animorphics.net
Fri, 9 Aug 2002 20:10:24 -0700 (PDT)
> here's a made up example:
>
> mRNA <=> mRNA
> sublocationof(mRNA) <=> exon
> misc_feature.type=snRNA <=> snRNA
> CDS <=> CDS
> sublocationof(CDS) <=> CDS-exon
> sublocationof(5'UTR) <=> 5'UTR-exon
> 5'UTR + CDS + 3'UTR <=> mRNA
> Seq(type=mRNA) <=> feature(type=mRNA)
> Seq(type=protein) <=> feature(type=CDS)
>
> forall(mRNA), mRNA.property.gene=GeneStructure.name
> => partof(mRNA, GeneStructure),
I like the idea of starting from test cases, so I'm
going to start by proposing some XML transformations
of some of the GenBank oddities. I remember an
example from GenBank where the CDS, mRNA, the exons,
and the rest all lined up into a beautiful gene model.
All the scientists I knew who saw the graphic said
that was the way GenBank records should be, so it
seems hardly anyone knows the rationale Francis
explained for bare CDS features. Anyway, I will try
to code some XML examples and post them soon.
I agree with the idea of making the transformation
from flat representation to hierarchical into a
separate layer, but it still leaves the problem of how
to represent joins before the transformation.
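To make the join question concrete, here is a minimal sketch of one possible flat representation, written in Python rather than Perl purely for illustration. It assumes a feature simply keeps its join as an ordered list of (start, end) spans and says nothing yet about exons or genes. The name parse_location and the dict layout are hypothetical, not part of bioperl or biosql, and only plain ranges and simple joins are handled.

```python
import re

# Matches a plain GenBank-style range such as "200..300".
_RANGE = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_location(text):
    """Parse 'a..b' or 'join(a..b,c..d,...)' into a list of (start, end).

    Hypothetical helper: complement(), fuzzy ends, and remote
    references are deliberately rejected rather than guessed at.
    """
    text = text.strip()
    if text.startswith("join(") and text.endswith(")"):
        parts = text[len("join("):-1].split(",")
    else:
        parts = [text]
    spans = []
    for part in parts:
        match = _RANGE.match(part.strip())
        if match is None:
            raise ValueError("unsupported location: %r" % part)
        spans.append((int(match.group(1)), int(match.group(2))))
    return spans


# A flat feature: the join is preserved as-is, with no hierarchy implied.
mrna = {"type": "mRNA", "spans": parse_location("join(1..100,200..300)")}
```

Keeping the spans in their original order means a later layer can still decide whether each one is an exon, a CDS-exon, or something else.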
> this is a hard problem though; most of the time it's expedient for the
> programmer to make certain assumptions about how the data they are
> interested in is represented in genbank/embl, and just use bioperl/biosql
> as-is, performing their own hacky transformations.
Yes, the transformation has to be an extra step.
There still has to be an adequate representation of
the GenBank record before trying to match CDS regions
to exons or whatever.
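As a rough illustration of that extra step, the sketch below (Python, for illustration only) takes two flat features and matches each CDS span to the mRNA span that contains it, following the quoted rule that a sublocation of an mRNA is an exon. The function names and the dict layout are assumptions made up for this example. A real transformation would also have to deal with strands, partial features, and CDS spans that cross exon boundaries.

```python
def match_cds_to_exons(cds_spans, exon_spans):
    """Pair each CDS span with the first exon span that fully contains it.

    Returns a list of (cds_span, exon_span) tuples; exon_span is None
    when no exon contains the CDS span.
    """
    matches = []
    for c_start, c_end in cds_spans:
        container = None
        for e_start, e_end in exon_spans:
            if e_start <= c_start and c_end <= e_end:
                container = (e_start, e_end)
                break
        matches.append(((c_start, c_end), container))
    return matches


def build_transcript(mrna, cds):
    """Turn two flat features into one small hierarchical record."""
    # Quoted rule: sublocationof(mRNA) <=> exon
    exons = list(mrna["spans"])
    return {
        "type": "mRNA",
        "exons": exons,
        "cds_exons": match_cds_to_exons(cds["spans"], exons),
    }


# Flat input features (hypothetical layout), untouched by the transformation.
flat_mrna = {"type": "mRNA", "spans": [(1, 100), (200, 300)]}
flat_cds = {"type": "CDS", "spans": [(50, 100), (200, 250)]}
transcript = build_transcript(flat_mrna, flat_cds)
```

The flat features are only read, never modified, so the record as parsed stays available whatever hierarchy gets built on top of it.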
Today I'm fighting to make Windows, Linux, and Oracle
all live together on my PC. Once I get that
situation fixed, I'll work on some XML.
Regards,
Brian