[Bioperl-l] Gene Interface discussion

Ewan Birney birney@ebi.ac.uk
Wed, 13 Dec 2000 20:52:32 +0000 (GMT)



  Hilmar and I had a quick trans-timezone phone call yesterday
(Hilmar at the end of his day and me at the start of mine) to bash out the
Gene interface.

  Key points - 

  (a) We call the interface GeneStructureI, as "Gene"
is a bit too ambiguous

  (b) GeneStructureI has-a array of TranscriptI

  (c) TranscriptI has-a array of ExonI

There is a method to get out all the Exons from a GeneStructure, which
must be equivalent to getting all the Exons out of each Transcript and
making that list unique on the start/end/strand of the exons

  (d) GeneStructureI and TranscriptI inherit from SeqFeatureI

The definition of SeqFeatureI is extended somewhat. SeqFeatureI
objects are now allowed to have component SeqFeatures (sub_SeqFeature
call) which are on different sequences to the parent SeqFeature.

There is a new method to SeqFeatureI - is_single_sequence which
returns TRUE if all component SeqFeatures are on the same sequence as
this SeqFeature, and returns FALSE if not. This will allow clients to
easily find (and possibly skip) these "expanded" seq features.

The ->start and ->end calls on a non-single-sequence composite
seqfeature should return the start and end point of the component
sequence features which lie on the "focus" sequence of this
seqfeature (i.e., whatever ->entire_seq and ->seqname imply).

Clients should be aware that when is_single_sequence == 0, concepts
like "overlap" and "length" are not necessarily easy to define or
interpret. This is for the client code to deal with.  (apologies to
the clients - we can't do much more for you inside the objects)


EMBL/GenBank join(AL00012:120..123,1..5) should be parsed into
a SeqFeature::Generic structure supporting the above calls.


   (e) The interface definition does not indicate where or how
additional information (annotation, dblink) is stored. This is left
up to implementations to add if they wish, for example by inheriting
from the DBLinkContainerI interface
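
To make point (d) concrete, here is a rough sketch of how
is_single_sequence and the "focus" start/end could behave on the
join(AL00012:120..123,1..5) case. It is only an illustration: the
features are plain hashes rather than SeqFeatureI objects, the
function focus_span and the sequence name 'LOCAL' are made up for the
example, and only direct components are checked (no recursion into
nested sub features).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a feature: a hash with seqname, start, end
# and an optional 'sub' array of component features.

# Returns 1 if every direct component is on the parent's sequence, else 0.
sub is_single_sequence {
    my ($feat) = @_;
    for my $sub (@{ $feat->{sub} || [] }) {
        return 0 if $sub->{seqname} ne $feat->{seqname};
    }
    return 1;
}

# Start/end computed only from components on the parent's "focus" sequence.
# Returns an empty list if no component lies on that sequence.
sub focus_span {
    my ($feat) = @_;
    my @on_focus = grep { $_->{seqname} eq $feat->{seqname} }
                   @{ $feat->{sub} || [] };
    return unless @on_focus;
    my ($start, $end) = ($on_focus[0]{start}, $on_focus[0]{end});
    for my $sub (@on_focus) {
        $start = $sub->{start} if $sub->{start} < $start;
        $end   = $sub->{end}   if $sub->{end}   > $end;
    }
    return ($start, $end);
}

# join(AL00012:120..123,1..5), with 'LOCAL' as an assumed name for the
# sequence the entry itself describes.
my $join = {
    seqname => 'LOCAL',
    sub     => [
        { seqname => 'AL00012', start => 120, end => 123 },
        { seqname => 'LOCAL',   start => 1,   end => 5   },
    ],
};

my $single = is_single_sequence($join);
my ($start, $end) = focus_span($join);
print "single=$single start=$start end=$end\n";
```

Here the remote component on AL00012 is ignored when computing start
and end, which is why "length" and "overlap" stop being obvious for
such features.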


Here is the complete proposal

 (Note to Hilmar - I have just dreamt up this business of dealing
with the difference between utr/cds/all exons being arguments to
the exon call. An alternative could be methods exon, cds, utr. Feel free
to complain loudly).

  All interfaces in the Bio::SeqFeature:: namespace

  GeneStructureI - inherits from SeqFeatureI

  (inherited methods: start,end,strand,seq,entire_seq,seqname,primary_tag,source_tag,
   is_single_sequence, sub_SeqFeatures);

  Notes: sub_SeqFeatures must delegate to ->transcripts.
       : primary_tag must be 'genestructure'

  methods

  # returns an array of TranscriptI
  @transcripts = $gs->transcripts(); 


  # returns an array of exons. Allowed arguments 'all','cds','utr'
  # this call must be equivalent to 
  # foreach $t ( $gs->transcripts() ) { 
  #   get exons, make unique start/end/strand
  # }
                   
  @exons = $gs->exons('all');
  @cds   = $gs->exons('cds');
  @utr   = $gs->exons('utr');

  # GeneStructureI must implement this, even if it returns an empty list

  @promotors = $gs->promotors(); # could be empty

  # GeneStructureI must implement this, even if it returns an empty list
  
  @polya     = $gs->polya(); # could be empty
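
The equivalence required of $gs->exons() can be sketched roughly as
follows. This is an illustration under assumptions, not the proposed
implementation: transcripts are plain array references of exon
hashes, and unique_exons is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: a transcript is an array ref of exon hashes,
# each exon hash carrying start, end and strand.

# Collects exons from all transcripts, keeping the first exon seen for
# each distinct start/end/strand combination.
sub unique_exons {
    my (@transcripts) = @_;
    my (%seen, @unique);
    for my $t (@transcripts) {
        for my $exon (@$t) {
            my $key = join ':', $exon->{start}, $exon->{end}, $exon->{strand};
            next if $seen{$key}++;
            push @unique, $exon;
        }
    }
    return @unique;
}

# Two transcripts sharing their first exon.
my @t1 = ( { start => 10, end => 50, strand => 1 },
           { start => 100, end => 150, strand => 1 } );
my @t2 = ( { start => 10, end => 50, strand => 1 },
           { start => 200, end => 260, strand => 1 } );

my @exons = unique_exons(\@t1, \@t2);
print scalar(@exons), " unique exons\n";
```

The shared first exon is reported once, so two transcripts with two
exons each yield three exons at the gene structure level.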


  TranscriptI - inherits from SeqFeatureI

  (inherited methods: start,end,strand,seq,entire_seq,seqname,primary_tag,source_tag,
   is_single_sequence, sub_SeqFeatures);

  Notes : sub_SeqFeatures delegates to ->exons('all'),promotor,polya;
        : primary_tag must be 'transcript'

  # returns an array of exons. Allowed arguments 'all','cds','utr'
                   
  @exons = $tr->exons('all');
  @cds   = $tr->exons('cds');
  @utr   = $tr->exons('utr');

  $promotor = $tr->promotor(); # could be undef, meaning unknown
  $polya    = $tr->polya();    # could be undef, meaning unknown


  TranscriptI must also have the following two methods

  $transcript->cdna();    # returns a Bio::PrimarySeqI of the cDNA
  $transcript->protein(); # returns a Bio::PrimarySeqI of the protein



   ExonI - inherits from SeqFeatureI, cannot be composite,
           primary_tag must return one of 'exon' or 'cds' or 'utr'


   There are no additional requirements to the ExonI interface,
   though of course, implementations may require their own 
   system




To Do list:

   (a) discuss this proposal. Sane? Any more issues to be worked out?
  
   I am not 100% sure about the exons('argument') style call.

   The exon primary_tag is actually a hard thing to provide. Should
   the primary_tag change depending on the argument? That would be very
   nasty for the implementation objects.

   (b) figure out how to get these things in and out of
       EMBL/GenBank format without loss of information
 
   (c) Ditto with GAME


Implementations:

    Hilmar/Ewan to do bioperl implementations

    Hilmar to do bioperl parsing modules

    Ewan/Hilmar to do the interfaces files
    
    Ewan to do Ensembl definitions when appropriate (when Ensembl
    moves to bioperl 0.7 compliance)

    Ewan/Jason to look at EMBL/GenBank dumping issues

    Brad to look at Game dumping/reading issues
    

ewan

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------