[Biojava-l] genes and things

Ewan Birney birney@ebi.ac.uk
Wed, 27 Sep 2000 08:11:30 +0100 (GMT)


On Tue, 26 Sep 2000, Matthew Pocock wrote:

> Dear all,
> 
> At the end of this month I will be moving part-time to ensembl
> (http://www.ensembl.org). This means that I will end up having to think
> of genomes as containing genes! So far, the support for genes in BioJava
> is lacking. You can use StrandedFeature with a given type tag and a
> well-defined set of children to represent one, but we have no agreed
> structure.
> 
> How do you all want to represent genes? My approach would be to use the
> interfaces below, but this may be overkill, or may miss out important
> biological possibilities. All comments & flames gratefully accepted.

Woah there Matt...


At Ensembl we have a strong, powerful gene model (written in Perl) which
works well. This does not agree with the model you have written down
below. 


For sure at Ensembl we are not changing our gene model in a hurry as I
personally spent 2-3 months of pain getting it right. Key points to the
Ensembl gene model:

	Genes and Transcripts *do not* inherit from SeqFeature. (i.e., they
don't have start/ends). This comes back to the fact that in draft genomes
you either 

	(a) make a completely artificial coordinate system that changes
every 2 weeks when the assembly/sequence changes, and move everything (yuk)

	(b) accept that there is no overall coordinate system 

we go for (b).

	Exons do inherit from SeqFeature.
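
To make the shape of this concrete, here is a rough sketch in Java. It is
an illustration only: the real Ensembl model is written in Perl, and every
class and method name below is made up for this example rather than taken
from Ensembl or BioJava. The point it tries to show is that only exons
carry coordinates (local to a contig), while transcripts and genes are
just containers with no start/end of their own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: every class and method name here is hypothetical
// and is neither the Ensembl (Perl) API nor the BioJava API.

/** The only object with coordinates; they are local to a single contig. */
final class Exon {
    final String contigId;
    final int start;   // 1-based, inclusive, in contig coordinates
    final int end;     // 1-based, inclusive, in contig coordinates
    final int strand;  // +1 or -1

    Exon(String contigId, int start, int end, int strand) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("bad range: " + start + ".." + end);
        }
        if (strand != 1 && strand != -1) {
            throw new IllegalArgumentException("strand must be +1 or -1");
        }
        this.contigId = contigId;
        this.start = start;
        this.end = end;
        this.strand = strand;
    }

    int length() {
        return end - start + 1;
    }
}

/** An ordered list of exons. Deliberately has no start/end of its own. */
final class Transcript {
    private final List<Exon> exons = new ArrayList<>();

    void addExon(Exon exon) {
        exons.add(exon);
    }

    List<Exon> getExons() {
        return Collections.unmodifiableList(exons);
    }

    /** Sum of exon lengths; needs no overall coordinate system. */
    int splicedLength() {
        int total = 0;
        for (Exon exon : exons) {
            total += exon.length();
        }
        return total;
    }

    /** Contigs touched by this transcript, in exon order. */
    Set<String> contigIds() {
        Set<String> ids = new LinkedHashSet<>();
        for (Exon exon : exons) {
            ids.add(exon.contigId);
        }
        return ids;
    }
}

/** A collection of transcripts. Also has no start/end. */
final class Gene {
    private final List<Transcript> transcripts = new ArrayList<>();

    void addTranscript(Transcript transcript) {
        transcripts.add(transcript);
    }

    /** Every distinct exon object used by any transcript of this gene. */
    Set<Exon> getExons() {
        Set<Exon> all = new LinkedHashSet<>();
        for (Transcript transcript : transcripts) {
            all.addAll(transcript.getExons());
        }
        return all;
    }
}

class GeneModelSketch {
    public static void main(String[] args) {
        Exon e1 = new Exon("contigA", 100, 199, 1);
        Exon e2 = new Exon("contigB", 50, 99, 1);
        Exon e3 = new Exon("contigB", 300, 349, 1);

        Transcript t1 = new Transcript();  // uses all three exons
        t1.addExon(e1);
        t1.addExon(e2);
        t1.addExon(e3);

        Transcript t2 = new Transcript();  // skips the middle exon
        t2.addExon(e1);
        t2.addExon(e3);

        Gene gene = new Gene();
        gene.addTranscript(t1);
        gene.addTranscript(t2);

        System.out.println(t1.splicedLength());     // 200
        System.out.println(t1.contigIds());         // [contigA, contigB]
        System.out.println(gene.getExons().size()); // 3
    }
}
```

Note that a single transcript here spans two contigs, which is exactly
the case where a start/end on the transcript or gene would be meaningless
under option (b).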


I realise that BioJava features have a composite nature, but I'd like to
see how that can play out with Ensembl genes.


	Secondly, exons don't have to splice correctly. This makes life
particularly interesting.




I would study the Ensembl gene model a little more before you jump in
here for biojava. We have done alright in this area of Ensembl. It is one
of our best bits of modelling...




> 
> Matthew
> 
> extend Feature with:
>   /*
>    * Generate a template object that could be used to create a feature
>    * that is the same as this one. This permits features to be cloned into
>    * other contexts, e.g. from one database to another, without breaking
>    * the encapsulation.
>    */
>   Template makeTemplate();
> 
> /**
>  * A gene. This will contain zero-or-more transcript features,
>  * and may contain other things (e.g. promoter elements). It also
>  * maintains a list of all exons known to exist in this gene.
>  */
> public interface Gene extends StrandedFeature {
>   /**
>    * Retrieve the set of exons in this gene. These will be Exon objects.
>    * Only exons in this set are legal for use by an mRNA arising from
>    * this gene.
>    */
>   public Set getExons();
> }
> 
> /**
>  * A transcript represents a region of a gene that is transcribed. It
>  * will normally be contiguous, and its strand will be identical to the
>  * strand of the gene (except in odd circumstances). E.g. it is a
>  * region from where polymerase attaches to where it drops off.
>  * <P>
>  * Each transcript will have one SpliceVariant for each possible
>  * mRNA it can be turned into. A single transcript may be spliced
>  * in multiple ways, some of which may be exon-identical to how
>  * other transcripts are spliced.
>  */
> public interface Transcript extends StrandedFeature {
> }
> 
> /**
>  * A possible splicing pattern for a transcript. This should contain
>  * exons from the gene as features, to indicate which regions to
>  * splice in, and which to splice out. The location of the splice
>  * variant is from the beginning of the first exon to the end of the
>  * last one. It is possible that you don't know the transcript coordinates,
>  * but you do know the SpliceVariant produced, in which case we
>  * should either have a dummy parent transcript, or add it direct to
>  * the gene.
>  */
> public interface SpliceVariant extends StrandedFeature {
>   /**
>    * Retrieve an mRNA sequence made by splicing together this splice variant.
>    * <P>
>    * The returned sequence will contain TranslatedRegion features to indicate
>    * which bits are translated.
>    */
>   public Sequence getSplicedSequence();
> }
> 
> /**
>  * A region of mRNA that is translated into protein. This should in
>  * most cases have a contiguous location.
>  */
> public interface TranslatedRegion {
>   /**
>    * Retrieve the translation for this region of the mRNA.
>    */
>   public Sequence getTranslation();
> }
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------