[Bioperl-l] GFF3 preliminary

Wed Feb 19 11:27:55 EST 2003

Hi,

Following up on discussions with Jim Kent, Suzi Lewis, Michele Clamp
and Richard Durbin, here is a new version of the GFF3 proposal.

Suzi, could you post this to song.sourceforge.net, when you have a
chance?  I don't seem to have write permissions to the htdocs
directory.

Best,

Lincoln

GENERIC FEATURE FORMAT VERSION 3: A PROPOSAL

Author:  Lincoln Stein
Date:    19 February 2003
Version: 0.2

Although there are many richer ways of representing genomic features
via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
flat file formats declares the bioinformatics community's need for a
simple format that can be modified with a text editor and processed
with shell tools like grep.  The GFF format, although widely used, has
fragmented into multiple incompatible dialects.  When asked why they
have modified the published Sanger specification, bioinformaticists
frequently answer that the format was insufficient for their needs,
and they needed to extend it.  The proposed GFF3 format addresses the
most common extensions to GFF, while preserving backward compatibility
with previous formats. The new format:

    1) adds a mechanism for representing more than one level 
       of hierarchical grouping of features and subfeatures.
    2) separates the ideas of group membership and feature name/id
    3) constrains the feature type field to be taken from a controlled
       vocabulary.
    4) allows a single feature, such as an exon, to belong to more than
       one group at a time.
    5) provides one level of relative addressing for subfeatures
       (e.g. exons can be expressed in transcript coordinates).
    6) defines an explicit convention for pairwise alignments.
    7) defines an explicit convention for features that occupy
       disjunct regions.

The format consists of 9 columns, separated by tabs.  The following
unescaped characters are allowed within fields:
[a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
using the URL escaping conventions.  Unescaped quotation marks,
backslashes and other ad-hoc escaping conventions that have been added
to the GFF format are explicitly forbidden.  The =, ; and % characters
have reserved meanings as described below.
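
As a rough illustration of the escaping rule, here is a minimal Python
sketch.  The names escape_field and unescape_field, and the "reserved"
parameter, are illustrative assumptions and not part of the proposal.
Any character outside the allowed set is percent-encoded; a caller can
additionally force the reserved characters to be encoded when they
occur inside a tag or value.

```python
import re
from urllib.parse import unquote

# Characters the proposal allows unescaped within a field.
_ALLOWED = re.compile(r"[a-zA-Z0-9.:;=%^*$@!+_?-]")


def escape_field(text, reserved=""):
    """Percent-encode every character that is not allowed unescaped.

    `reserved` (hypothetical parameter) lists allowed characters that
    should still be encoded, e.g. "=;%" inside an attribute value.
    """
    out = []
    for ch in text:
        if _ALLOWED.fullmatch(ch) and ch not in reserved:
            out.append(ch)
        else:
            # Encode each UTF-8 byte as %XX, as URL escaping does.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def unescape_field(text):
    """Reverse the %XX escapes (does not treat '+' as a space)."""
    return unquote(text)
```

For example, escape_field("cTel 33B") gives "cTel%2033B", and
unescape_field reverses it.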

Undefined fields are replaced with the "." character, as described in
the original GFF spec.

Column 1: "seqid"

The ID of the landmark used to establish the coordinate system for the
current feature.  IDs must contain alphanumeric characters.
Whitespace, if present, must be escaped using URL escaping rules
(e.g. space="%20" or "+").

Column 2: "source"

The source of the feature.  This is unchanged from the older GFF specs
and is not part of a controlled vocabulary.

Column 3: "type"

The type of the feature (previously called the "method").  This is
constrained to be either: (a) a term from SOFA; or (b) a SOFA
accession number.  The latter alternative is distinguished using the
syntax SOFA:000000.

Columns 4 & 5: "start" and "end"

The start and end of the feature, in 1-based integer coordinates,
relative to the landmark given in column 1.  Start is always less
than or equal to end.

Column 6: "score"

The score of the feature, a floating point number.  As in earlier
versions of the format, the semantics of the score are ill-defined.
It is strongly recommended that E-values be used for sequence
similarity features, and that P-values be used for ab initio gene
prediction features.

Column 7: "strand"

The strand of the feature.  + for positive strand (relative to the
landmark), - for minus strand, and . for features that are not
stranded.  In addition, ? can be used for features whose strandedness
is relevant, but unknown.

Column 8: "phase"

The phase of the feature, for protein-encoding features (primarily
CDSs).  This is an integer-valued field with the values 0, 1, or 2.
The integer indicates the offset from the start of the feature to the
first base of the first codon in the reading frame.  "." is used for
features that do not correspond to a reading frame.

Column 9: "attributes"

A list of feature attributes in the format tag=value.  Multiple
tag=value pairs are separated by semicolons.  URL escaping rules are
used for tags or values containing the following characters: ",=;".
Whitespace should be replaced with the "+" character or the %20 URL
escape.  This will allow the file to survive text processing programs
that convert tabs into spaces.
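
Putting the nine column descriptions together, a single feature line
could be read roughly as follows.  This is only a sketch: the Feature
record and the parse_line function are hypothetical names, column 9 is
left as a raw string, and "." is mapped to None for the fields where
it means "undefined".

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: Optional[str]
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str            # "+", "-", "." (unstranded) or "?" (unknown)
    phase: Optional[int]
    attributes: str        # raw column 9, not yet split into tag=value


def _undef(value):
    """Map the "." placeholder to None."""
    return None if value == "." else value


def parse_line(line):
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    start, end = int(start), int(end)
    # The example file contains single-base features, so start == end is fine.
    if start > end:
        raise ValueError("start must not exceed end")
    if strand not in ("+", "-", ".", "?"):
        raise ValueError("unexpected strand %r" % strand)
    score, phase = _undef(score), _undef(phase)
    return Feature(seqid, _undef(source), ftype, start, end,
                   None if score is None else float(score),
                   strand,
                   None if phase is None else int(phase),
                   attrs)
```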

Five tags are predefined:

    ID	   Indicates the name of the feature.  IDs must be unique
	   within the scope of the GFF file.

    Alias  A descriptive name for the feature.  It is suggested that
	   this tag be used whenever a secondary identifier for the
	   feature is needed, such as display names, locus names and
	   accession numbers.  Unlike ID, there is no requirement
	   that Alias be unique within the file.

    Parent Indicates the parent of the feature.  A parent ID can be
	   used to group exons into transcripts, transcripts into
	   genes, and so forth.  A feature may have multiple parents.

    Target Indicates the target of a nucleotide to nucleotide or
	   nucleotide to protein alignment.  The format of the
	   value is "target_id:start..end".  Start may be greater
	   than end to indicate a + strand alignment to the
	   reverse complement of a target nucleotide sequence.

    Align  The alignment of the feature to the target if the two
	   are not colinear.  The alignment is a string containing
	   the four characters "|X^v", where "|" indicates an
	   aligned match, "X" indicates an aligned mismatch, "^"
	   indicates a gap in the feature, and "v" indicates a
	   gap in the target.
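
To make the Target value format concrete, here is a small Python
sketch that splits "target_id:start..end" into its parts.  The
function name parse_target and the shape of its return value are
assumptions made for illustration.  A start greater than end is
reported as a flag, and the coordinates are returned in ascending
order.

```python
import re

# target_id may itself contain ":" so anchor on the trailing start..end.
_TARGET = re.compile(r"^(?P<id>.+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_target(value):
    """Return (target_id, low, high, reversed_flag) for a Target value."""
    m = _TARGET.match(value)
    if m is None:
        raise ValueError("malformed Target value: %r" % value)
    start, end = int(m.group("start")), int(m.group("end"))
    return m.group("id"), min(start, end), max(start, end), start > end
```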

Multiple attributes of the same type are indicated by separating the
values with the comma "," character, as in:

       Parent=AF2312,AB2812,abc-3

Note that attribute names are case sensitive.  "Parent" is not the
same as "parent".
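
The attribute rules above (semicolon-separated pairs, comma-separated
multiple values, URL escaping, case-sensitive tags) can be sketched in
Python as follows.  The function name parse_attributes is hypothetical,
and decoding "+" as a space is an assumption based on the whitespace
rule for column 9.

```python
from urllib.parse import unquote_plus


def parse_attributes(column9):
    """Split column 9 into a dict mapping each tag to a list of values."""
    attrs = {}
    if column9 in ("", "."):
        return attrs
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("attribute without '=': %r" % pair)
        # Split on commas before unescaping, so an escaped comma (%2C)
        # stays inside a single value.
        values = [unquote_plus(v) for v in value.split(",")]
        attrs.setdefault(unquote_plus(tag), []).extend(values)
    return attrs
```

Tags are kept exactly as written, so a lookup for "parent" will not
find a "Parent" attribute.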

In the example GFF3 file given below, the first column contains line
numbers that I have added for the purposes of the narrative.  Here are
some common scenarios that I have attempted to illustrate:

A) a simple feature, no public ID

Line 2 in the example is a feature of type "repeat". It is located on
the coordinate system defined by feature "ctg123", has a start and an
end and no ID.  It has an attribute named "Note" with value "ALU3".

B) a simple feature with a public ID

Line 3 is a feature of type clone.  It has a start and an end.  Its
parent is undefined (no Parent attribute), but it has an ID attribute
of "clone00001" and an Alias of "cTel33B".

C) a feature with multiple attributes

Line 5 is a feature of type "gene".  It has no parent, and has
attributes of type ID, Alias, Note, and GO_term.

D) a hierarchical grouping of features

Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
line 5, which defines the extent of a "gene" with ID "gene00001".
Below this are two features of type mRNA (lines 6 and 7).  Their
Parent attributes are set to "gene00001", indicating that this feature
is their immediate parent.  Their IDs are indicated as separate
attributes.

This pattern is repeated for the exons listed on lines 8-11.  Exons
exon00001, exon00002, and exon00004 belong to both of the transcripts.
Therefore, their Parent attribute contains both the mRNA00001 and
mRNA00002 IDs separated by a comma.

Exon exon00003 belongs to mRNA00002 only, and therefore that
transcript's ID is listed as the sole Parent.

Lines 12 and 13 indicate coding_start and coding_end features.  These
subfeatures are hierarchically grouped underneath their corresponding
exons, but they do not have independent public IDs.
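
The grouping described in this scenario amounts to a parent-to-children
index in which a feature with several parents is listed under each of
them.  Here is a minimal Python sketch; the dict layout of each
feature and the name build_children are assumptions for illustration,
and the data mirror lines 5-11 of the example.

```python
from collections import defaultdict

FEATURES = [
    {"ID": "gene00001", "type": "gene"},
    {"ID": "mRNA00001", "type": "mRNA", "Parent": ["gene00001"]},
    {"ID": "mRNA00002", "type": "mRNA", "Parent": ["gene00001"]},
    {"ID": "exon00001", "type": "exon", "Parent": ["mRNA00001", "mRNA00002"]},
    {"ID": "exon00002", "type": "exon", "Parent": ["mRNA00001", "mRNA00002"]},
    {"ID": "exon00003", "type": "exon", "Parent": ["mRNA00002"]},
    {"ID": "exon00004", "type": "exon", "Parent": ["mRNA00001", "mRNA00002"]},
]


def build_children(features):
    """Return (roots, children) where children maps a parent ID to a list."""
    children = defaultdict(list)
    roots = []
    for feat in features:
        parents = feat.get("Parent", [])
        if not parents:
            roots.append(feat)
        for parent_id in parents:
            # A shared exon is appended under every parent it names.
            children[parent_id].append(feat)
    return roots, children


roots, children = build_children(FEATURES)
```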

E) Disjunct coordinates

Lines 14-16 illustrate a single feature -- the CDS corresponding to
mRNA mRNA00001 -- which occupies multiple disjunct regions.  The
Parent attribute indicates that the CDS features belong to mRNA00001.
However, the attribute column assigns each of lines 14-16 the same ID.
Because the ID is the same, this is interpreted as a single feature
that spans multiple disjunct coordinate ranges.
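
A reader of the file might implement this rule by collecting all
ranges that share an ID.  Here is a rough Python sketch; merge_by_id
and format_location are hypothetical names, and the "join(...)" string
only imitates the location notation seen in the sample program output.
It is not the BioPerl implementation.

```python
def merge_by_id(rows):
    """rows: iterable of (feature_id, start, end) tuples.

    Returns a dict mapping each ID to its sorted list of (start, end).
    """
    merged = {}
    for feature_id, start, end in rows:
        merged.setdefault(feature_id, []).append((start, end))
    for ranges in merged.values():
        ranges.sort()
    return merged


def format_location(ranges):
    """Render one or more ranges in a feature-table-like notation."""
    parts = [str(s) if s == e else "%d..%d" % (s, e) for s, e in ranges]
    return parts[0] if len(parts) == 1 else "join(%s)" % ",".join(parts)
```

Applied to lines 14-16 of the example, this yields one cds00001
feature with three ranges.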

NOTE: See "Representing Translations" for a discussion of why it might
not be a good idea to represent translations in this way.

F) Alignments

Lines 17-19 demonstrate an alignment of two sequences using the
reserved Target attribute.  Each non-gapped segment becomes a line in
the GFF3 file.  The segments each share the same ID, thereby
indicating that the segments are disjunct regions of the same feature.
The Target attribute indicates the ID of the target sequence (which
does not have to be represented in the GFF3 file) and the start and
end coordinates of the aligned target.

Line 20 shows a gapped alignment using the Align attribute.  This
attribute's value should be interpreted this way:

 1501  gatt*ctccc 1510      ctg123
       ||||^||X||
 2001  gatttctgcc 2011      af923

Unlike the GFF1 and GFF2 formats, the Parent attribute for gapped
alignments may be empty. However, a valid alternative representation
is to create a single "match" feature, and a series of "hsp" features
contained within it.  Lines 21-23 show this alternative
representation.

G) Relative coordinates

Lines 24-27 illustrate using relative coordinate addressing in
feature/subfeature relationships.  Line 24 defines an mRNA that is
positioned on sequence landmark "ctg123" from positions 5000 to 6000.
Its ID field indicates that it is mRNA03.  Lines 25-27 are exon
subfeatures of mRNA03 as indicated by their Parent attribute.
However, the seqid field specifies mRNA03 as the parent coordinate
system, thereby allowing the exons to begin at position 1.
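
Converting such relative coordinates back to the landmark's coordinate
system is a simple offset.  A minimal Python sketch, assuming the
parent lies on the + strand (the name to_absolute is hypothetical):

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the landmark.

    Assumes the parent feature is on the + strand of the landmark.
    """
    offset = parent_start - 1
    return rel_start + offset, rel_end + offset
```

With mRNA03 starting at 5000, the exon at 1..300 maps to 5000..5299.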

  0  ##gff-version 3
  1  ##sequence-region ctg123:1..1497228     

  2  ctg123  flybase repeat  5000    5100    .       .       .       Note=ALU3
  3  ctg123  flybase clone   1       2679    .       +       .       ID=clone00001;Alias=cTel33B
  4  ctg123  flybase contig  1       1497228 .       +       .       ID=contig0001;Alias=ctg123

  5  ctg123  flybase gene    43733   44677   .       +       .       ID=gene00001;Alias=ADAM1;Note=unc-3;GO_term=GO:12345,GO:33421
  6  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00001;Alias=ADAM1.t1;Parent=gene00001
  7  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00002;Alias=ADAM1.t2;Parent=gene00001
  8  ctg123  flybase exon    43733   43961   .       +       .       ID=exon00001;Parent=mRNA00001,mRNA00002
  9  ctg123  flybase exon    44030   44234   .       +       .       ID=exon00002;Parent=mRNA00001,mRNA00002
 10  ctg123  flybase exon    44281   44328   .       +       .       ID=exon00003;Parent=mRNA00002
 11  ctg123  flybase exon    44521   44677   .       +       .       ID=exon00004;Parent=mRNA00001,mRNA00002
 12  ctg123  flybase coding_start    43740   43740   .       +       .       Parent=exon00001
 13  ctg123  flybase coding_end      44677   44677   .       +       .       Parent=exon00004

 14  ctg123  flybase cds     43740   43961   .       +       0       ID=cds00001;Parent=mRNA00001
 15  ctg123  flybase cds     44030   44234   .       +       1       ID=cds00001;Parent=mRNA00001
 16  ctg123  flybase cds     44521   44677   .       +       1       ID=cds00001;Parent=mRNA00001

 17  ctg123  flybase match   1       100     100     .       .       ID=match0001;Target=af923:1001..1100
 18  ctg123  flybase match   101     500     80      .       .       ID=match0001;Target=af923:1101..1500
 19  ctg123  flybase match   501     1000    80      .       .       ID=match0001;Target=af923:1501..2000
 20  ctg123  flybase match   1501    1510    60      .       .       ID=match0001;Target=af923:2001..2011;Align=||||^||X||

 21  ctg123  flybase match   5001    6000    100     .       .       ID=match0002;Target=ua388:1..1000
 22  ctg123  flybase hsp     5001    5500    .       .       .       Parent=match0002;Target=ua388:1..500
 23  ctg123  flybase hsp     5501    6000    .       .       .       Parent=match0002;Target=ua388:501..1000

 24  ctg123  flybase mRNA    5000    6000    .       +       .       ID=mRNA03;Alias=EVE1.t1
 25  mRNA03  flybase exon    1       300     .       +       .       ID=exon00005;Parent=mRNA03
 26  mRNA03  flybase exon    301     400     .       +       .       ID=exon00006;Parent=mRNA03
 27  mRNA03  flybase exon    401     1000    .       +       .       ID=exon00007;Parent=mRNA03

=================================================================

OTHER SYNTAX:

Comments are preceded by the # symbol.  Meta-data and directives are
preceded by ##.  The following directives are recognized:

  ##gff-version 3        
	The GFF version, always 3 in this spec.  This must
	be the topmost line of the file.

  ##sequence-region seqid:start..end
        The sequence segment referred to
	by this file, in the format seqid:start..end.
	This element is optional.  If it occurs, it must be
	the second line of the file.

  ###
        This directive (three # signs in a row) indicates that all
        forward references to feature IDs that have been seen to this
        point have been resolved.  After seeing this directive, a
        program that is processing the file serially can close off any
        open objects that it has created and return them, thereby
        allowing iterative access to the file.  Otherwise, software
        cannot know that a feature has been fully populated by its
        subfeatures until the end of the file has been reached.
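
As an illustration of how a serial reader might use the ### directive,
here is a Python sketch that yields the feature lines in batches, one
batch per resolved block.  The generator name iter_batches is an
assumption, and the sketch simply skips comments and other directives.

```python
def iter_batches(lines):
    """Yield lists of feature lines, flushing whenever '###' is seen."""
    batch = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "###":
            # All forward references so far are resolved: hand back
            # what has been collected and start a new batch.
            if batch:
                yield batch
                batch = []
        elif line.startswith("#") or not line.strip():
            continue  # comment, other directive, or blank line
        else:
            batch.append(line)
    if batch:
        yield batch  # end of file closes the final batch
```

A caller can build and release the objects for each batch before
reading further, instead of holding the whole file in memory.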

=================================================================

REPRESENTING TRANSLATIONS

There are two ways of representing protein translations (e.g. ORFs,
CDS) in the various implementations of GFF2 and GTF.  One way is to
represent the translation as an interrupted "CDS" region beginning
with the first base of the first codon and ending at the last base of
the stop codon.  Another is to create a series of exons and to
indicate the position of the translational start and end on the first
and last coding exon.

An informal sampling of members of this list (Michele Clamp, Suzi
Lewis, Richard Durbin) suggests that the latter solution is cleaner
and more manageable in practice, leading to more consistent annotation
and to fewer ambiguities.  Therefore, I would propose that we
legislate that translations be represented implicitly by explicit
translational start and end positions.  For this to work properly, the
parent of the start and end sites must be the mRNA feature and NOT the
exon.

Under this model, here is a generic gene:

  gene:  a bag of features, including regulatory motifs
     mRNA
	exon
	coding_start
	coding_end
	splice_donor
	splice_acceptor
	5_utr
	3_utr

Importantly, the UTRs, coding start and coding end are all children of
the mRNA.  Making them children of the exon (which some will be
tempted to do!) creates ambiguities in the interpretation of
alternative splices.
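
Under this model the coding segments are not stored but derived.  A
rough Python sketch of that derivation follows, assuming a + strand
transcript and assuming that coding_end marks the last coding base
(both are assumptions, as is the function name cds_segments).  Each
exon of the mRNA is clipped to the coding_start..coding_end interval.

```python
def cds_segments(exons, coding_start, coding_end):
    """Clip each (start, end) exon to the coding interval.

    Exons lying wholly outside the interval contribute nothing.
    """
    segments = []
    for start, end in sorted(exons):
        low, high = max(start, coding_start), min(end, coding_end)
        if low <= high:
            segments.append((low, high))
    return segments
```

Because the result depends on which exons a given mRNA includes, the
same start and end sites give different coding segments for each
alternative splice.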

=================================================================

EXAMPLE PROGRAM

I have extended (in an experimental way) the Bio::Tools::GFF module
to accommodate this new format.  Here is a test script and its output
when run on the above file.

  #!/usr/bin/perl -w
  use strict;
  use lib '.';   # search the current directory for modules first

  use Bio::Tools::GFF;
  my $file  = 'gff3.txt';
  my $gffio = Bio::Tools::GFF->new(-file=>$file,-gff_version=>3);
  # top-level features, sorted by type
  my @f = sort {$a->primary_tag cmp $b->primary_tag} $gffio->features;
  format_features(\@f);

  # print one line per feature, then recurse into its subfeatures,
  # indenting one tab per level
  sub format_features {
    my $features = shift;
    my $tabs     = shift || 0;
    for my $f (@$features) {
      my $type  = $f->primary_tag;
      my $id    = $f->unique_id;
      $id       ||= '(no id)';
      my ($start,$end) = ($f->start,$f->end);
      # for alignments, also report the target ID and location
      my $hit = $f->can('hstart') ? $f->hunique_id.":".$f->feature2->location->to_FTstring
                                  : '';
      print "\t"x$tabs,join("\t",$id,$type,$f->location->to_FTstring,$hit),"\n";
      format_features([$f->sub_SeqFeature],$tabs+1);
    }
  }

OUTPUT:

clone00001	clone	1..2679	
contig0001	contig	1..1497228	
gene00001	gene	43733..44677	
	mRNA00001	mRNA	43733..44677	
		exon00001	exon	43733..43961	
			(no id)	coding_start	43740	
		exon00002	exon	44030..44234	
		exon00004	exon	44521..44677	
			(no id)	coding_end	44677	
		cds00001	cds	join(43740..43961,44030..44234,44521..44677)	
	mRNA00002	mRNA	43733..44677	
		exon00001	exon	43733..43961	
			(no id)	coding_start	43740	
		exon00002	exon	44030..44234	
		exon00003	exon	44281..44328	
		exon00004	exon	44521..44677	
			(no id)	coding_end	44677	
mRNA03	mRNA	5000..6000	
	exon00005	exon	5000..5299	
	exon00006	exon	5300..5399	
	exon00007	exon	5400..5999	
match0001	match	join(1..100,101..500,501..1000,1501..1510)	af923:join(1001..1100,1101..1500,1501..2000,2001..2011)
match0002	match	5001..6000	ua388:1..1000
	(no id)	hsp	5001..5500	ua388:1..500
	(no id)	hsp	5501..6000	ua388:501..1000
(no id)	repeat	5000..5100	

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein at cshl.org			                  Cold Spring Harbor, NY
	1 Bungtown Road, Cold Spring Harbor, NY 11724
========================================================================