[Bioperl-l] Re: Proposed GFF version 3

Fri Feb 7 22:04:45 EST 2003

Hi Lincoln,

I just skimmed this, but it looks good.

-S

On Friday, February 7, 2003, at 04:54 PM, Lincoln Stein wrote:

> This letter is to discuss a proposed extension to GFF.  It arises from
> conversations with Richard Durbin during last fall's Hinxton genome
> informatics meeting.
>
> Although there are many richer ways of representing genomic features
> via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> flat file formats declares the bioinformatics community's need for a
> simple format that can be modified with a text editor and processed
> with shell tools like grep.  The GFF format, although widely used, has
> fragmented into multiple incompatible dialects.  When asked why they
> have modified the published Sanger specification, bioinformaticists
> frequently answer that the format was insufficient for their needs,
> and they needed to extend it.  The proposed GFF3 format addresses the
> most common extensions to GFF, while preserving backward compatibility
> with previous formats. The new format:
>
>     1) adds a mechanism for representing more than one level
>        of hierarchical grouping of features and subfeatures.
>     2) separates the ideas of group membership and feature name/id
>     3) constrains the feature type field to be taken from a controlled
>        vocabulary.
>     4) allows a single feature, such as an exon, to belong to more than
>        one group at a time.
>     5) adds one level of relative addressing for subfeatures (e.g. exons
>        can be expressed in transcript coordinates)
>     6) adds an explicit convention for pairwise alignments
>     7) adds an explicit convention for features that occupy disjunct
>        regions
>
> The format consists of 10 columns, separated by tabs.  The following
> unescaped characters are allowed within fields:
> [a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
> using the URL escaping conventions.  Unescaped quotation marks,
> backslashes and other ad-hoc escaping conventions that have been added
> to the GFF format are explicitly forbidden.  The =, ; and % characters
> have reserved meanings as described below.
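A minimal sketch of the escaping rule above, in Python (not part of the proposal). The names escape_value and unescape_value are illustrative only. The sketch assumes it is applied to a single tag or value, so the reserved characters =, ; and % are escaped too:

```python
import re
from urllib.parse import unquote

# Characters allowed unescaped inside a single tag or value.
# '=', ';' and '%' are deliberately left out (assumption): they are reserved.
_NEEDS_ESCAPE = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?-]")

def escape_value(text):
    """Replace every disallowed character with %XX (UTF-8 bytes, uppercase hex)."""
    def _hex(match):
        return "".join("%{:02X}".format(b) for b in match.group().encode("utf-8"))
    return _NEEDS_ESCAPE.sub(_hex, text)

def unescape_value(text):
    """Reverse escape_value using standard URL unescaping."""
    return unquote(text)

print(escape_value("unc 3;x=1"))   # prints unc%203%3Bx%3D1
```

Escaping a whole column would need a different character set, since = and ; are legitimate separators at that level.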
>
> Undefined fields are replaced with the "." character, as described in
> the original GFF spec.
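As a rough illustration of the column layout and the "." convention, here is a small Python sketch. It assumes tab-separated columns, as in earlier GFF versions, and uses the column names given in the sections that follow. parse_line is a hypothetical helper, not an existing API:

```python
# Column names as used in this proposal.
COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "group", "attributes")

def parse_line(line):
    """Split one data line into a dict; '.' (undefined) becomes None."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMNS), len(fields)))
    record = {name: (None if value == "." else value)
              for name, value in zip(COLUMNS, fields)}
    # start and end are 1-based integer coordinates
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record

row = parse_line("I\twormbase\tclone\t1\t2679\t.\t+\t.\t.\tID=cTel33B")
print(row["type"], row["start"], row["end"], row["group"])
```

Lines starting with ## (such as ##gff-version) would have to be filtered out before calling this.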
>
> Column 1: "seqid"
>
> The ID of the landmark used to establish the coordinate system for the
> current feature.  IDs must contain alphanumeric characters.
> Whitespace, if present, must be escaped using URL escaping rules
> (e.g. space="%20").
>
> Column 2: "source"
>
> The source of the feature.  This is unchanged from the older GFF specs
> and is not part of a controlled vocabulary.
>
> Column 3: "type"
>
> The type of the feature (previously called the "method").  This is
> constrained to be either: (a) a term from SOFA; or (b) a SOFA
> accession number.  The latter alternative is distinguished using the
> syntax SOFA:000000.
>
> Columns 4 & 5: "start" and "end"
>
> The start and end of the feature, in 1-based integer coordinates,
> relative to the landmark given in column 1.  Start is always less
> than or equal to end.
>
> Column 6: "score"
>
> The score of the feature, a floating point number.  As in earlier
> versions of the format, the semantics of the score are ill-defined.
> It is strongly recommended that E-values be used for sequence
> similarity features, and that P-values be used for ab initio gene
> prediction features.
>
> Column 7: "strand"
>
> The strand of the feature.  + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded.  In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
>
> Column 8: "phase"
>
> The phase of the feature, for protein-encoding features (primarily
> CDSs).  This is an integer-valued field with the values 0, 1, or 2.
> The integer indicates the offset from the start of the feature to the
> first base of the first codon in the reading frame.  "." is used for
> features that do not correspond to a reading frame.
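To make the phase arithmetic concrete, a small Python sketch follows. It assumes a + strand feature, and both function names are made up for illustration. The next_phase helper is not in the proposal; it just follows from the definition when consecutive segments continue one reading frame:

```python
def first_codon_start(start, phase):
    """Coordinate of the first base of the first complete codon (+ strand)."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return start + phase

def next_phase(phase, length):
    """Phase of the following segment, given this segment's phase and length.

    After skipping `phase` bases, (length - phase) % 3 bases are left over
    at the end; the next segment must supply the rest of that codon.
    """
    return (3 - (length - phase) % 3) % 3

print(first_codon_start(100, 2))   # prints 102
print(next_phase(0, 205))          # prints 2
```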
>
> Column 9: "group"
>
> A list of the immediate parents of the current feature.  Multiple
> parents are allowed (example: one exon shared by multiple
> transcripts). Multiple parents are separated by a semicolon.
> Parentless features have a dot in this field.
>
> Column 10: "attributes"
>
> A list of feature attributes in the format tag=value.  Multiple
> tag=value pairs are separated by semicolons.  URL escaping rules are
> used for tags or values containing whitespace, "=" characters and
> semicolons.
>
> Two tags are special:
>
>     ID	 Indicates the name of the feature.  IDs must be unique
> 	 within the scope of the GFF file.
>
>     Target Indicates the target of a nucleotide to nucleotide or
>            nucleotide to protein alignment.  The format of the
>            value is "target_id:start..end".  Start may be greater
>            than end to indicate a + strand alignment to the
>            reverse complement of a target nucleotide sequence.
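One way to read the attributes column, sketched in Python. parse_attributes is a hypothetical helper; it assumes each tag appears at most once and that "." means no attributes:

```python
from urllib.parse import unquote

def parse_attributes(column):
    """Turn 'tag=value;tag=value' into a dict, undoing URL escapes."""
    attributes = {}
    if column in ("", "."):
        return attributes
    for pair in column.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        attributes[unquote(tag)] = unquote(value)
    return attributes

attrs = parse_attributes("ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345")
print(attrs["ID"], attrs["GO_term"])
```

Splitting on ";" and "=" before unescaping is what lets escaped semicolons and equals signs survive inside a value.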
>
> In the example GFF3 file given below, the first column contains line
> numbers that I have added for the purposes of the narrative.  Here are
> some common scenarios that I have attempted to illustrate:
>
> A) a simple feature, no public ID
>
> Line 2 in the example is a feature of type "repeat". It has a start
> and an end and no ID, but it does have an attribute named "Note."
>
> B) a simple feature with a public ID
>
> Line 3 is a feature of type clone.  It has a start and an end.  Its
> parent is undefined (a dot in column 9), but it has an attribute of type
> ID with value "cTel33B."
>
> C) a feature with multiple attributes
>
> Line 5 is a feature of type "gene."  It has no parent, and has
> attributes of type ID, Note, and GO_term.
>
> D) a hierarchical grouping of features
>
> Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
> line 5, which defines the extent of a "gene" with ID Y74C9A.1.  Below
> this are two features of type mRNA (lines 6 and 7).  Their group
> fields contain the ID of Y74C9A.1, indicating that this feature is
> their immediate parent.  In the 10th column, the mRNA features have
> their own IDs independent of the ID of the parent gene.
>
> This pattern is repeated for the exons listed on lines 8-11.  Exons
> e1, e2, and e4 belong to both of the transcripts.  Therefore, both
> transcript IDs are listed in the group column, separated by
> semicolons.
>
> Exon e3 belongs only to one of the transcripts, and therefore only
> that transcript's ID is listed in the group column.
>
> Lines 12 and 13 indicate coding_start and coding_end features.  These
> subfeatures are hierarchically grouped underneath their corresponding
> exons, but they do not have independent public IDs.
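The parent/child bookkeeping described above could be sketched like this in Python. The helper names are invented, and the input is a hand-picked subset of the example (the gene, both mRNAs, and exons e1 and e3):

```python
from collections import defaultdict

def build_children(rows):
    """Map each parent ID to its child IDs.

    rows is an iterable of (feature_id, group_column) pairs;
    a group of '.' means the feature has no parent.
    """
    children = defaultdict(list)
    for feature_id, group in rows:
        if group == ".":
            continue
        for parent in group.split(";"):
            children[parent].append(feature_id)
    return children

def render(children, root, depth=0):
    """Return the subtree under root as tab-indented lines."""
    lines = ["\t" * depth + root]
    for child in children.get(root, []):
        lines.extend(render(children, child, depth + 1))
    return lines

rows = [
    ("Y74C9A.1", "."),
    ("Y74C9A.1a", "Y74C9A.1"),
    ("Y74C9A.1b", "Y74C9A.1"),
    ("e1", "Y74C9A.1a;Y74C9A.1b"),
    ("e3", "Y74C9A.1b"),
]
print("\n".join(render(build_children(rows), "Y74C9A.1")))
```

A shared exon such as e1 simply shows up under each of its parents, so no feature is duplicated in the file itself.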
>
> E) Disjunct coordinates
>
> Lines 14-16 illustrate a single feature -- the CDS corresponding to
> mRNA Y74C9A.1a -- which occupies multiple disjunct regions.  The group
> column indicates that the CDS belongs to mRNA Y74C9A.1a.  However, the
> attribute column assigns each of lines 14-16 the same ID.  Because the
> ID is the same, this is to be interpreted as a single feature that
> spans multiple locations.
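A sketch of the "same ID means same feature" rule, in Python. The ID "cds1" is made up for illustration, and merge_by_id and location_string are hypothetical names:

```python
def merge_by_id(rows):
    """Collect (start, end) spans per feature ID, sorted by start.

    rows is an iterable of (feature_id, start, end) tuples.
    """
    spans_by_id = {}
    for feature_id, start, end in rows:
        spans_by_id.setdefault(feature_id, []).append((start, end))
    for spans in spans_by_id.values():
        spans.sort()
    return spans_by_id

def location_string(spans):
    """Render spans as 'a..b' or 'join(a..b,c..d)'."""
    parts = ["%d..%d" % span for span in spans]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"

rows = [
    ("cds1", 44030, 44234),   # "cds1" is an invented ID
    ("cds1", 43740, 43961),
    ("cds1", 44521, 44677),
]
print(location_string(merge_by_id(rows)["cds1"]))
# prints join(43740..43961,44030..44234,44521..44677)
```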
>
> F) Alignments
>
> Lines 17-19 demonstrate a gapped alignment of two sequences using the
> reserved Target attribute.  Each non-gapped segment becomes a line in
> the GFF3 file.  The segments each share the same ID, thereby
> indicating that the segments are disjunct regions of the same feature.
> The Target attribute indicates the ID of the target sequence (which
> does not have to be represented in the GFF3 file) and the start and
> end coordinates of the aligned target.
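For illustration, a Python sketch that pulls apart a Target value and checks that an ungapped segment covers the same number of bases on both sequences. It assumes a nucleotide to nucleotide alignment (a protein target would not be 1:1). Both function names are hypothetical:

```python
def parse_target(value):
    """Split 'target_id:start..end' into (target_id, start, end)."""
    target_id, _, span = value.rpartition(":")
    start, _, end = span.partition("..")
    return target_id, int(start), int(end)

def segment_is_consistent(start, end, target_value):
    """True if the segment and its target span have the same length."""
    _, t_start, t_end = parse_target(target_value)
    # abs() allows start > end on the target (reverse complement)
    return (end - start) == abs(t_end - t_start)

segments = [
    (1, 100, "cb123:1001..1100"),
    (101, 500, "cb123:1101..1500"),
    (501, 1000, "cb123:1501..2000"),
]
print(all(segment_is_consistent(*seg) for seg in segments))   # prints True
```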
>
> Unlike the GFF1 and GFF2 formats, the group field for gapped
> alignments can be empty. However, a valid alternative representation
> is to create a single "match" feature, and a series of "hsp" features
> underneath it via the group field.  Lines 20-22 show this alternative
> representation.
>
> G) Relative coordinates
>
> Lines 23-26 illustrate using relative coordinate addressing in
> feature/subfeature relationships.  Line 23 defines an mRNA that is
> positioned on sequence landmark "I" from positions 5000 to 6000.  Its
> ID field indicates that it is M7.3.  Lines 24-26 are exon subfeatures
> of M7.3 as indicated by their group field.  However, the seqid field
> specifies M7.3 as the parent coordinate system, thereby allowing the
> exons to begin at position 1.
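The conversion back to landmark coordinates is a simple offset. A Python sketch, assuming the parent lies on the + strand of its landmark (to_absolute is an invented name):

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark."""
    offset = parent_start - 1   # relative position 1 is the parent's first base
    return rel_start + offset, rel_end + offset

# Exons of M7.3, which starts at position 5000 on landmark I
for rel in [(1, 300), (301, 400), (401, 1000)]:
    print("%d..%d" % to_absolute(5000, *rel))
# prints 5000..5299, 5300..5399 and 5400..5999, one per line
```

These are the same exon coordinates that appear in the test script output further down.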
>
>   0  ##gff-version 3
>   1  ##sequence-region I:1..14972282
>   2  I	wormbase	repeat	5000	5100	.	.	.	.	Note=ALU3
>   3  I	wormbase	clone	1	2679	.	+	.	.	ID=cTel33B
>   4  I	wormbase	contig	1	14972282	.	+	.	.	ID=CHROMOSOME_I
>   5  I	wormbase	gene	43733	44677	.	+	.	.	ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
>   6  I	wormbase	mRNA	43733	44677	.	+	.	Y74C9A.1	ID=Y74C9A.1a
>   7  I	wormbase	mRNA	43733	44677	.	+	.	Y74C9A.1	ID=Y74C9A.1b
>   8  I	wormbase	exon	43733	43961	.	+	.	Y74C9A.1a;Y74C9A.1b	ID=e1
>   9  I	wormbase	exon	44030	44234	.	+	.	Y74C9A.1a;T:Y74C9A.1b	ID=e2
>  10  I	wormbase	exon	44281	44328	.	+	.	Y74C9A.1b	ID=e3
>  11  I	wormbase	exon	44521	44677	.	+	.	Y74C9A.1a;T:Y74C9A.1b	ID=e4
>  12  I	wormbase	coding_start	43740	43740	.	+	.	e1
>  13  I	wormbase	coding_end	44677	44677	.	+	.	e4
>  14  I	wormbase	cds	43740	43961	.	+	0	Y74C9A.1a
>  15  I	wormbase	cds	44030	44234	.	+	1	Y74C9A.1a
>  16  I	wormbase	cds	44521	44677	.	+	1	Y74C9A.1a
>  17  I	wormbase	match	1	100	100	.	.	.	ID=12345.s;Target=cb123:1001..1100
>  18  I	wormbase	match	101	500	20	.	.	.	ID=12345.s;Target=cb123:1101..1500
>  19  I	wormbase	match	501	1000	80	.	.	.	ID=12345.s;Target=cb123:1501..2000
>  20  I	wormbase	match	5001	6000	100	.	.	.	ID=abc;Target=M1:1..1000
>  21  I	wormbase	hsp	5001	5500	.	.	.	abc	Target=M1:1..500
>  22  I	wormbase	hsp	5501	6000	.	.	.	abc	Target=M1:501..1000
>  23  I	wormbase	mRNA	5000	6000	.	+	.	.	ID=M7.3
>  24  M7.3	wormbase	exon	1	300	.	+	.	M7.3	ID=M7.3.1
>  25  M7.3	wormbase	exon	301	400	.	+	.	M7.3	ID=M7.3.2
>  26  M7.3	wormbase	exon	401	1000	.	+	.	M7.3	ID=M7.3.3
>
> =================================================================
>
> I have extended (in an experimental way) the Bio::Tools::GFF module
> to accommodate this new format.  Here is a test script and its output
> when run on the above file.
>
> #!/usr/bin/perl -w
> use strict;
> use lib '.';
>
> use Bio::Tools::GFF;
> my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 3);
> my @f = $gffio->features;
> format_features(\@f);
>
> # Print one line per feature, indenting subfeatures one tab per level.
> sub format_features {
>   my $features = shift;
>   my $tabs     = shift || 0;
>   for my $f (@$features) {
>     my $type = $f->primary_tag;
>     my $id   = $f->unique_id || '(no id)';
>     my $alt  = ($f->alternative_locations)[0];
>
>     # The eval returns an empty list when there is no alternative location.
>     print "\t" x $tabs,
>           join("\t", $id, $type, $f->location->to_FTstring,
>                eval { $alt->location->seq_id, $alt->location->to_FTstring }),
>           "\n";
>     format_features([$f->sub_SeqFeature], $tabs + 1);
>   }
> }
>
> 1;
>
> OUTPUT:
>
> cTel33B	clone	1..2679
> CHROMOSOME_I	contig	1..14972282
> 12345.s	match	join(101..500,1..100,501..1000)
> M7.3	mRNA	5000..6000
> 	M7.3.1	exon	5000..5299
> 	M7.3.2	exon	5300..5399
> 	M7.3.3	exon	5400..5999
> abc	match	5001..6000
> 	(no id)	hsp	5001..5500
> 	(no id)	hsp	5501..6000
> (no id)	repeat	5000..5100
> Y74C9A.1	gene	43733..44677
> 	Y74C9A.1a	mRNA	43733..44677
> 		e1	exon	43733..43961
> 			(no id)	coding_start	43740
> 		e2	exon	44030..44234
> 		e4	exon	44521..44677
> 			(no id)	coding_end	44677
> 		(no id)	cds	43740..43961
> 		(no id)	cds	44030..44234
> 		(no id)	cds	44521..44677
> 	Y74C9A.1b	mRNA	43733..44677
> 		e1	exon	43733..43961
> 			(no id)	coding_start	43740
> 		e3	exon	44281..44328
>
>
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein at cshl.org                         Cold Spring Harbor, NY
>         1 Bungtown Road, Cold Spring Harbor, NY 11724
> ========================================================================