[Bioperl-l] Gene Interface?

Sun, 3 Dec 2000 13:12:10 +0000 (GMT)

[apologies for the cross post. I can't see a better way to do this]

For the mini-update to the ensembl web site to address a number of small
issues, it is becoming clear that we need a more "delayed get" approach to
genes, such that getting gene-id and genedblinks (say) does not require a
full trawl across the exon/exontranscript etc. tables (this makes sense to
ensembl-dev people, but will confuse bioperl people).
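
For the bioperl people, the "delayed get" idea could be sketched roughly as
below. This is an illustration in Python, not ensembl code: LazyGene,
FakeGeneStore, fetch_dblinks and fetch_exons are all hypothetical names, and
the fake store only counts how often each query would hit the database.

```python
class FakeGeneStore:
    """Hypothetical stand-in for the database; counts queries per table."""

    def __init__(self):
        self.calls = {"dblinks": 0, "exons": 0}

    def fetch_dblinks(self, gene_id):
        self.calls["dblinks"] += 1
        return ["EXAMPLEDB:0001"]  # placeholder value, not a real link

    def fetch_exons(self, gene_id):
        self.calls["exons"] += 1
        return [("contigA", 122, 132)]  # placeholder exon


class LazyGene:
    """Gene whose parts are fetched only when first asked for."""

    def __init__(self, gene_id, store):
        self.id = gene_id        # known up front, no query needed
        self._store = store
        self._dblinks = None     # not loaded yet
        self._exons = None       # not loaded yet

    def dblinks(self):
        # Touches only the DBLink data, never the exon tables.
        if self._dblinks is None:
            self._dblinks = self._store.fetch_dblinks(self.id)
        return self._dblinks

    def exons(self):
        # The expensive trawl happens only if a caller really wants exons.
        if self._exons is None:
            self._exons = self._store.fetch_exons(self.id)
        return self._exons
```

With something like this, a web page that only shows the gene id and its
DBLinks never triggers the exon query at all.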

For the current web site, I will put in some sneaky calls on GeneObj that
work off geneid. Not nice.

Long term I am pretty sure we need to have a full blown interface
definition of gene and then have (potentially a number of) implementations
behind it, one being the current in-memory implementation. At the same
time, we might as well synchronise with the upcoming bioperl-0.7
genestructure interface as well. This is why I have cc'd in bioperl for
this. Hilmar in particular I know has views here and I want us to get to a
sensible solution.

So - I guess I am trying to open discussion of a gene interface, which
probably will have to be cross-posted between ensembl-dev and bioperl-l
(do you agree, Hilmar?). Apologies to people who will get two copies of
the email...

In addition, for ensembl, I guess there is a question about whether we aim
for this being before or after branching the main trunk. I suspect if
there is a large number of changes it has to be after branching <sigh>.

Let's map out some clear use cases for the generic gene interface:

   - should be able to store transcript information
     (one gene has multiple transcripts)
   - easy to get protein and cDNA sequences
   - should be able to store exons as seqfeatures
   ? should have slots for DBLinks/annotation (or do we want a higher
     collection interface for this? If so, how structured?)
   - should not mandate an in memory implementation
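
As a starting point for discussion, the use cases above could be sketched as
an abstract interface plus one in-memory implementation. This is Python
rather than Perl, and every name here (GeneI, TranscriptI, MemoryGene,
MemoryTranscript and their methods) is made up for illustration; it is not
the bioperl-0.7 genestructure interface. Exons are reduced to plain sequence
strings rather than seqfeatures to keep the sketch short.

```python
from abc import ABC, abstractmethod


class TranscriptI(ABC):
    """What any transcript implementation must offer."""

    @abstractmethod
    def exons(self): ...

    @abstractmethod
    def cdna(self): ...

    @abstractmethod
    def protein(self): ...


class GeneI(ABC):
    """What any gene implementation must offer; says nothing about storage."""

    @abstractmethod
    def gene_id(self): ...

    @abstractmethod
    def transcripts(self): ...

    @abstractmethod
    def dblinks(self): ...


class MemoryTranscript(TranscriptI):
    """In-memory implementation: everything is held in plain attributes."""

    def __init__(self, exon_seqs, protein):
        self._exon_seqs = list(exon_seqs)
        self._protein = protein

    def exons(self):
        return list(self._exon_seqs)

    def cdna(self):
        # cDNA is the exon sequences joined in order.
        return "".join(self._exon_seqs)

    def protein(self):
        return self._protein


class MemoryGene(GeneI):
    """One gene holding several transcripts and its DBLinks."""

    def __init__(self, gene_id, transcripts, dblinks=()):
        self._gene_id = gene_id
        self._transcripts = list(transcripts)
        self._dblinks = list(dblinks)

    def gene_id(self):
        return self._gene_id

    def transcripts(self):
        return list(self._transcripts)

    def dblinks(self):
        return list(self._dblinks)
```

A database-backed class could implement the same GeneI methods and fetch
each part on demand, which is the point of not mandating an in-memory
implementation.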

Here are some issues that I think could be difficult to reconcile between
bioperl and ensembl views:

   - Ensembl genes and transcripts are NOT seqfeatures. The placement of
an ensembl gene on a single coordinate system is held in something called 
"VirtualGene" (not a great name. It is a gene on a virtualcontig). Ensembl
has a big win by allowing a gene to be built "across" coordinate systems,
allowing the coordinate system to be by-and-large decoupled from the gene
structure. Some "magic" is used for the places where the gene structure is
highly dependent on the assembly.

    (NB - the ensembl gene is reminiscent of EMBL/GenBank 'exploded'
seqfeatures, where the location goes
join(AL000012:122-132,AC000002:1000023-1000015).)

   - Ensembl makes a distinction between alternative transcripts and
alternative translations (two alternative transcripts can have the same
translation). This makes the objects one step more complex.

   - Ensembl wants to keep track of DBLinks etc close to the gene
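
To make the first two points concrete, here is a rough Python sketch of a
gene whose exons each carry their own contig, and of two transcripts that
share one translation. The class and field names (Exon, Translation,
Transcript, exploded_location, distinct_translations) are invented for this
email and are not the ensembl schema or object model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Exon:
    exon_id: str
    contig: str  # each exon names its own contig; the gene has no single one
    start: int
    end: int


@dataclass(frozen=True)
class Translation:
    translation_id: str
    start_exon_id: str  # exon where the coding region begins
    end_exon_id: str    # exon where the coding region ends


@dataclass
class Transcript:
    transcript_id: str
    exons: List[Exon]
    translation: Translation  # may be shared with other transcripts


def exploded_location(exons):
    """Format exon locations in the 'exploded' join(...) style."""
    parts = ["%s:%d-%d" % (e.contig, e.start, e.end) for e in exons]
    return "join(" + ",".join(parts) + ")"


def distinct_translations(transcripts):
    """Return each translation once, however many transcripts use it."""
    by_id = {}
    for t in transcripts:
        by_id.setdefault(t.translation.translation_id, t.translation)
    return list(by_id.values())
```

Calling exploded_location on exons at AL000012:122-132 and
AC000002:1000023-1000015 gives back the join(...) string from the NB above,
and two transcripts that differ only in an extra exon outside the coding
region would report a single entry from distinct_translations.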

Here are some issues which should be a simple matter of providing an
extension of a core bioperl interface for ensembl:

   - Ensembl can cope with exons that do not splice correctly (due to
missing intervening genomic sequence); this needs phase information bound
to the exon for ensembl.

   - Ensembl needs ensembl identifiers on all objects.

So - I suspect Michele, Arne, Richard and Hilmar (along with other
people) have views on this. Let's kick around some ideas and then see
whether we can get to a strong definition for bioperl which we can extend
in ensembl where necessary.

ewan

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------