[Bioperl-l] trans-spliced genes & gff2 & chado

Mon Jul 21 12:17:29 EDT 2003

Hi Charles

There is no standard/accepted format for trans-spliced genes as far as I
know, but I would be interested in pushing for a de facto standard.

I think an explicit representation should only be used in cases where
trans-splicing is not the norm. It would be totally over the top to
explicitly represent the abundant trans-splicing in C. elegans genes.

I would argue that there are two choices in how to represent these guys.

Either you think of trans-splicing as being one transcript with two distinct
sets of exons (for example, exons coming from opposite strands).

So an alternatively spliced gene, of which one transcript is trans-spliced,
may look like this:

gene
  transcript-A
    exon +
    exon +
    exon -
    exon -
    protein-A
  transcript-B
    exon +
    exon +
    protein-B

This tree/graph can easily be represented in gff3 or chado. I'm not sure
about gff2. This seems to be how you represented it in genbank format
below. The only known example of trans splicing in Drosophila, mod(mdg4),
is represented this way in genbank.

The problem with this approach is that things get weird if the two
trans-spliced portions come from spatially disparate parts of the genome - one
can no longer think of the boundaries of the transcript as being defined
by the boundaries of the exons.

The other way of representing these chaps is to explicitly represent the
separate transcripts, and have some mechanism for saying that these get
glommed together. I believe this is closer to the actual biology.

gene
  ts-transcript-1
    exon +
    exon +
  ts-transcript-2
    exon -
    exon -
  transcript-A
    ts-transcript-1
    ts-transcript-2
    protein-A
  transcript-B
    exon +
    exon +
    protein-B

Again, this tree/graph has a natural gff3 or chado implementation.
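
As a rough illustration of the second tree, here is a minimal Python sketch. The class names (Exon, TsTranscript, Transcript) and all coordinates are made up for the example; this is not chado's or gff3's actual schema, just one way the part_of relationships could be held in memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exon:
    start: int
    end: int
    strand: str  # "+" or "-"


@dataclass
class TsTranscript:
    """One independently transcribed piece; its exons share a strand."""
    name: str
    exons: List[Exon]


@dataclass
class Transcript:
    """A mature transcript: plain exons, or a list of trans-spliced parts."""
    name: str
    exons: List[Exon] = field(default_factory=list)
    parts: List[TsTranscript] = field(default_factory=list)

    def all_exons(self) -> List[Exon]:
        # Walk through the ts-transcripts if there are any,
        # otherwise return the directly attached exons.
        if self.parts:
            return [x for p in self.parts for x in p.exons]
        return list(self.exons)


# Hypothetical coordinates, chosen only to mirror the tree above.
ts1 = TsTranscript("ts-transcript-1", [Exon(100, 200, "+"), Exon(300, 400, "+")])
ts2 = TsTranscript("ts-transcript-2", [Exon(900, 1000, "-"), Exon(700, 800, "-")])
transcript_a = Transcript("transcript-A", parts=[ts1, ts2])
transcript_b = Transcript("transcript-B",
                          exons=[Exon(100, 200, "+"), Exon(300, 400, "+")])
```

The extra level only appears for the trans-spliced transcript; transcript-B keeps the ordinary transcript/exon shape.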

The advantage of this is that it allows us to preserve the constraints

forall transcript T, exon X: X part_of T => T.strand == X.strand

forall transcript T, exon X: X part_of T => T.start == min(X.start)
forall transcript T, exon X: X part_of T => T.end == max(X.end)

which is better if we ever find examples of trans-splicing where the
components are some distance away. I believe this is the case but I don't
know much about this outside Drosophila.
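
A small Python sketch of checking these constraints, taking the end constraint to mean the largest exon end. The Feature tuple, the violations function and the coordinates are all invented for illustration. A single-strand ts-transcript passes, while a single transcript carrying exons from both strands necessarily fails the strand check.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    start: int
    end: int
    strand: str  # "+" or "-"


def violations(transcript: Feature, exons: List[Feature]) -> List[str]:
    """Return the names of the constraints this transcript breaks."""
    problems = []
    if any(x.strand != transcript.strand for x in exons):
        problems.append("strand")
    if transcript.start != min(x.start for x in exons):
        problems.append("start")
    if transcript.end != max(x.end for x in exons):
        problems.append("end")
    return problems


# Second representation: one ts-transcript, all exons on "+".
ts1_exons = [Feature(100, 200, "+"), Feature(300, 400, "+")]
ts1 = Feature(100, 400, "+")

# First representation: one transcript holding exons from both strands.
mixed_exons = ts1_exons + [Feature(700, 800, "-"), Feature(900, 1000, "-")]
mixed = Feature(100, 1000, "+")

print(violations(ts1, ts1_exons))
print(violations(mixed, mixed_exons))
```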

The disadvantage is that we have to bring in special-case code for doing
things like dynamically calculating the mRNA sequence...
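
To give a feel for what that special-case code might involve, here is a hedged Python sketch. It assumes 1-based inclusive coordinates, exons listed in 5' to 3' order within each piece, and pieces listed in the order they are joined. The toy genome string and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def segment_seq(genome, start, end, strand):
    """Sequence of one exon; coordinates are 1-based and inclusive."""
    piece = genome[start - 1:end]
    return revcomp(piece) if strand == "-" else piece


def spliced_seq(genome, exons):
    """Ordinary case: concatenate the exons of one transcript in order."""
    return "".join(segment_seq(genome, *x) for x in exons)


def trans_spliced_seq(genome, parts):
    """Special case: splice each ts-transcript, then join the results."""
    return "".join(spliced_seq(genome, exons) for exons in parts)


genome = "ATGGCCTTAGCATCGA"                # toy sequence, not real data
ts1 = [(1, 3, "+"), (7, 9, "+")]           # "ATG" + "TTA"
ts2 = [(12, 15, "-"), (4, 6, "-")]         # revcomp("ATCG") + revcomp("GCC")
print(trans_spliced_seq(genome, [ts1, ts2]))  # ATGTTACGATGGC
```

The only new step over the ordinary case is the outer loop over pieces, but every consumer of transcripts would need to know about it.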

I'm not really sure which way is best.

c.

On 21 Jul 2003, Charles Hauser wrote:

> I am generating gff2 to run GBrowse  and have a trans-spliced gene to
> represent.  I can write code to deal with this particular instance, but
> was wondering if there is a generic solution?
>
> - is there a 'standard/accepted format' to display trans-spliced genes?
>
> - in chado I believe one would generate a feature for each segment -
> correct?
>
>
> Charles
>
>
>      gene            join(32737..32824,complement(174205..174384),
>                      complement(69520..71506))
>                      /gene="psaA"
>                      /note="trans splicing"
>      CDS             join(32737..32825,complement(174205..174384),
>                      complement(69520..71506))
>
>
>
> while( my $seq = $seqio->next_seq ) {
>     foreach my $f ( $seq->top_SeqFeatures() ) {
>
> <snip>
>
>
> 	$out->write_feature($f);
>     }
> }