[Bioperl-l] Gene Interface?
Ewan Birney
birney@ebi.ac.uk
Sun, 3 Dec 2000 13:12:10 +0000 (GMT)
[apologies for the cross post. I can't see a better way to do this]
For the mini-update to the ensembl web site to address a number of small
issues, it is becoming clear that we need a more "delayed get" approach to
genes, such that getting gene-id and genedblinks (say) does not require a
full trawl across the exon/exontranscript etc. tables (this makes sense to
ensembl-dev people, but will confuse bioperl people).
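
To make the "delayed get" idea concrete, here is a minimal sketch. It is
in Python purely for illustration, and every name in it (FakeGeneStore,
LazyGene, fetch_dblinks, fetch_exons) is hypothetical, not an existing
ensembl or bioperl call. The point is only that cheap attributes are
answered directly from the gene id, while the exon trawl is postponed
until somebody asks for exons.

```python
class FakeGeneStore:
    """Hypothetical stand-in for the database; counts the expensive exon queries."""

    def __init__(self):
        self.exon_queries = 0
        self._dblinks = {"gene1": ["EMBL:AL000012"]}
        self._exons = {"gene1": [("AL000012", 122, 132)]}

    def fetch_dblinks(self, gene_id):
        # Cheap lookup keyed on the gene id alone.
        return list(self._dblinks.get(gene_id, []))

    def fetch_exons(self, gene_id):
        # Stands in for the full exon/transcript trawl.
        self.exon_queries += 1
        return list(self._exons.get(gene_id, []))


class LazyGene:
    """Gene handle that only knows its id until more is requested."""

    def __init__(self, gene_id, store):
        self.gene_id = gene_id
        self._store = store
        self._exons = None  # not loaded yet

    def dblinks(self):
        # Does not touch the exon tables at all.
        return self._store.fetch_dblinks(self.gene_id)

    def exons(self):
        # Load on first use, then reuse the cached result.
        if self._exons is None:
            self._exons = self._store.fetch_exons(self.gene_id)
        return self._exons
```

Asking such an object for its id and dblinks never triggers the exon
query; calling exons() twice triggers it once.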
For the current web site, I will put in some sneaky calls on GeneObj that
work off geneid. Not nice.
Long term I am pretty sure we need to have a full blown interface
definition of gene and then have (potentially a number) of implementations
behind it, one being the current in-memory implementation. At the same
time, we might as well synchronise with the upcoming bioperl-0.7
genestructure interface as well. This is why I have cc'd in bioperl for
this. Hilmar in particular I know has views here and I want us to get to a
sensible solution.
So - I guess I am trying to open discussion of a gene interface, which
probably will have to be cross-posted between ensembl-dev and bioperl-l
(do you agree, Hilmar?). Apologies to people who will get two copies of
the email...
In addition for ensembl, I guess there is a question about whether we aim
for this being before or after branching the main trunk. I suspect if
there is a large number of changes it has to be after branching <sigh>.
Let's map out some clear use cases for the generic gene interface:
- should be able to store transcript information
(one gene has multiple transcripts)
- easy to get protein and cDNA sequences
- should be able to store exons as seqfeatures
? should have slots for DBLinks/annotation (or do we want a higher
collection interface for this? If so, how structured?)
- should not mandate an in-memory implementation
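
As a starting point for discussion, the use cases above could be
sketched roughly as follows. This is Python for illustration only; GeneI,
InMemoryGene, InMemoryTranscript and their methods are hypothetical names,
not the bioperl-0.7 genestructure interface. Protein translation is left
out because it needs a codon table; only cDNA retrieval is shown.

```python
from abc import ABC, abstractmethod


class GeneI(ABC):
    """Hypothetical gene interface; method names are illustrative only."""

    @abstractmethod
    def transcripts(self):
        """Return the transcripts of this gene (one gene, many transcripts)."""

    @abstractmethod
    def dblinks(self):
        """Return database cross-references attached to the gene."""


class InMemoryTranscript:
    def __init__(self, exons):
        # Each exon is a (start, end, sequence) tuple standing in for a seqfeature.
        self.exons = list(exons)

    def cdna(self):
        # cDNA is the exon sequences joined in transcript order.
        return "".join(seq for _start, _end, seq in self.exons)


class InMemoryGene(GeneI):
    """One possible implementation; a database-backed one could sit beside it."""

    def __init__(self, transcripts, dblinks=()):
        self._transcripts = list(transcripts)
        self._dblinks = list(dblinks)

    def transcripts(self):
        return list(self._transcripts)

    def dblinks(self):
        return list(self._dblinks)
```

Callers written against GeneI would not care whether the object behind
it is held in memory or fetched on demand.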
Here are some issues that I think could be difficult to reconcile between
bioperl and ensembl views:
- Ensembl genes and transcripts are NOT seqfeatures. The placement of
an ensembl gene on a single coordinate system is held in something called
"VirtualGene" (not a great name. It is a gene on a virtualcontig). Ensembl
has a big win by allowing a gene to be built "across" coordinate systems,
allowing the coordinate system to be by-and-large decoupled from the gene
structure. Some "magic" is used for the places where the gene structure is
highly dependent on the assembly.
(NB - the ensembl gene is reminiscent of EMBL/GenBank 'exploded'
seqfeatures, where it goes
join(AL000012:122-132,AC000002:1000023-1000015).)
- Ensembl makes a distinction between alternative transcripts and
alternative translations (two alternative transcripts can have the same
translation). This makes the objects one step more complex.
- Ensembl wants to keep track of DBLinks etc close to the gene
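
To illustrate the first two points, here is a rough Python sketch (not
ensembl code; Exon, Translation, Transcript and location_string are
hypothetical names). Each exon carries its own sequence id, so a
transcript can span coordinate systems, and two transcripts can point at
the same translation object. Coordinates are stored exactly as written
in the join() example above, without interpreting their order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Exon:
    contig_id: str  # the sequence this exon is placed on
    start: int
    end: int


@dataclass
class Translation:
    start_exon: Exon
    end_exon: Exon


@dataclass
class Transcript:
    exons: List[Exon]
    translation: Optional[Translation] = None


def location_string(transcript):
    """Render the exons in the 'exploded' join(...) style."""
    parts = [f"{e.contig_id}:{e.start}-{e.end}" for e in transcript.exons]
    return "join(" + ",".join(parts) + ")"


e1 = Exon("AL000012", 122, 132)
e2 = Exon("AC000002", 1000023, 1000015)
shared = Translation(start_exon=e1, end_exon=e2)

# Two alternative transcripts that share one translation.
t1 = Transcript(exons=[e1, e2], translation=shared)
t2 = Transcript(exons=[e1, e2, Exon("AC000002", 1000010, 1000001)], translation=shared)
```

Here location_string(t1) gives back the join() string quoted in the NB
above, and t1.translation is the same object as t2.translation.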
Here are some issues which should be a simple matter of providing
an ensembl extension of a core bioperl interface:
- Ensembl can cope with exons that do not splice correctly (due to
missing intervening genomic sequence) (needs phase information bound to
the exon for ensembl)
- Ensembl needs ensembl identifiers on all objects.
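
A small sketch of why phase bound to the exon helps, again in Python and
with hypothetical names (PhasedExon, end_phase, splices_in_frame). It
assumes phase is 0, 1 or 2, meaning how far into a codon the exon's first
base falls; that convention is an assumption of this sketch. With phase
stored on each exon, the reading frame can be checked without the
intervening genomic sequence.

```python
from dataclasses import dataclass


@dataclass
class PhasedExon:
    stable_id: str  # hypothetical identifier carried by every object
    length: int     # exon length in bases
    phase: int      # assumed convention: 0, 1 or 2

    def end_phase(self):
        # Position in the codon reached after the last base of this exon.
        return (self.phase + self.length) % 3


def splices_in_frame(upstream, downstream):
    """True if the downstream exon starts where the upstream one left off."""
    return upstream.end_phase() == downstream.phase
```

An exon pair that fails this check can still be stored; the phase simply
records where translation has to pick up again.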
So - I suspect Michele, Arne, Richard and Hilmar (along with other
people) have views on this. Let's kick around some ideas and then see
whether we can get to a strong definition for bioperl which we can extend
in ensembl where necessary.
ewan
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------