[Bioperl-l] Re: GFF3 preliminary

Thu Feb 20 16:26:41 EST 2003

Here are more general comments on the GFF 3 proposal:

1) I'd like to add the following comment to the definition of columns 4 
and 5 (start and end):
   When the feature is an interbase junction, such as a splice site or
   insertion point, by convention the start and end should be the
   flanking two bases.

2) Following the exchange triggered by Ian's email, I suggest the 
definition for column 8 (phase) replaces "start of the feature" by
"5' end of the feature (start if forward strand, end if reverse 
strand)".  And the end of this section could point out that if no strand 
is given phase is not meaningful and should be '.'.

3) I'm not convinced by the format for the Align string.  This requires
a character per aligned base.  There are a variety of run-length type 
encodings in common use that are much more compact.  e.g. Ensembl uses a 
string such as "60M1D8M3I15M" to mean "60 match, then 1 delete, then 8 
match, then 3 insert, then 15 match".  They call this CIGAR, but when I
talked to Guy Slater, who invented CIGAR for exonerate, his version is 
subtly different: "M 60 D 1 M 8 I 3 M 15" for the same string (see 
http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html).
Jim Kent also has something like this.  I'd prefer us to standardise on 
one of these formats, all of which are very short for ungapped matches. 
  Otherwise alignments will be too large.  I don't think we need to know 
about mismatches in the format.  If there are regions unaligned in both 
sequences we can simply have xIyD or the equivalent in whatever syntax 
is agreed.  We ought to agree how to handle split codons in DNA to 
protein alignments as part of this spec.

4) After discussion with Michele last week I propose we don't introduce 
relative coordinates with respect to an ID.  First, this adds 
significant complexity that you are requiring any parser to handle, even 
one dealing with simple features not genes with exons and coding 
sequences.  I bet many programs that read gff would not handle it. 
Second it creates the potential for getting things wrong, particularly 
with IDs corresponding to objects on the reverse strand.  Third, I don't 
see that it saves effort - someone has to map backwards and forwards 
from relative to absolute coordinates and I don't see that GFF, which we 
all agree should be kept simple as a primary goal, should carry that 
burden.  It should be done elsewhere.  Fourth, what is the scope?  Can 
you write -1..2000 for an ID that is 1000bp long? (If so this had better 
mean 2 bases before the start of the object, but that might confuse some 
people.)  Fifth, it adds interdependency between lines, making actions 
like subsetting using grep or simple perl much more likely to break.

5) I find the statement about disjunct coordinates being implied by 
sharing the same ID at odds with the statement that the ID is unique 
within the scope of the file.  In particular what would happen if this 
ID got used as the seqid for new features (although I argued above that 
we should not go that way).  I would much prefer that GFF features are 
only one per line, and are all simple intervals or interbase junctions. 
  When we want to represent something more complex we should agree on 
how to do it out of these primitives, not define GFF to itself represent 
more complex structures.

Personally I would prefer to have one CDS line span the coding region, 
with exons that (at least in part) contribute to it as children. e.g.

5  ctg123  flybase gene    43733   44677   .       +       . 
ID=gene00001;Alias=ADAM1;Note=unc-3;GO_term=GO:12345,GO:33421
6  ctg123  flybase mRNA    43733   44677   .       +       . 
ID=mRNA00001;Alias=ADAM1.t1;Parent=gene00001
7  ctg123  flybase mRNA    43733   44677   .       +       . 
ID=mRNA00002;Alias=ADAM1.t2;Parent=gene00001
14  ctg123  flybase cds     43740   44677   .       +       0 
ID=cds00001;Parent=mRNA00001,gene00001
15  ctg123  flybase cds     43740   44677   .       +       0 
ID=cds00001;Parent=mRNA00002,gene00001
8  ctg123  flybase exon    43733   43961   .       +       . 
ID=exon00001;Parent=mRNA00001,mRNA00002,cds00001,cds00002
9  ctg123  flybase exon    44030   44234   .       +       . 
ID=exon00002;Parent=mRNA00001,mRNA00002,cds00001,cds00002
10  ctg123  flybase exon    44281   44328   .       +       . 
ID=exon00003;Parent=mRNA00002,cds00002
11  ctg123  flybase exon    44521   44677   .       +       . 
ID=exon00004;Parent=mRNA00001,mRNA00002,cds00001,cds00002
12  ctg123  flybase coding_start    43739   43740   .       +       . 
     Parent=cds00001,cds00002
13  ctg123  flybase coding_end      44677   44678   .       +       . 
     Parent=cds00001,cds00002

Note that I changed the coordinates of the coding_start and coding_end 
junctions to flank the junction, and made them children of the CDS, not 
the exon.

I prefer this because it allows one CDS to corresond to multiple mRNAs, 
which is quite common with alternate promoters and poly-adenylation 
sites.  However I don't think it corresponds to Ensembl's view which 
makes a CDS on exons within an mRNA, although it is compatible.  I think 
it does fit with Flybase's ideas, at least as once explained to me by 
Bill Gelbart, and with what I would like to do for WormBase, though I 
don't think we are all agreed on that yet.  As for the genefinders, they 
almost all just predict cds's and so it simplifies considerably because 
there is no need for the mRNA features or identifiers - you just have 
cds's as children of genes.

Note that this agrees with Lincoln's prescription at the end of his mail 
that coding_start and _end information should not be with respect to the 
exon.

Note also that by including coding_start and coding_end I was intending 
to imply that there is an ATG and a stop codon encoded.  i.e. if they 
are missing then this might just be a gene fragment.

I know there are many ways to handle these things.  We should try to 
converge on a standard, but I think this may be an adjunct document, not 
part of the GFFv3 standard itself, just as GTF was to GFFv2.

Richard

Lincoln Stein wrote:
> Hi,
> 
> Following up on discussions with Jim Kent, Suzi Lewis, Michele Clamp
> and Richard Durbin, here is a new version of the GFF3 proposal.
> 
> Suzi, could you post this to song.sourceforge.net, when you have a
> chance?  I don't seem to have write permissions to the htdocs
> directory.
> 
> Best,
> 
> Lincoln
> 
> 
> GENERIC FEATURE FORMAT VERSION 3: A PROPOSAL
> 
> Author:  Lincoln Stein
> Date:    19 February 2003
> Version: 0.2
> 
> Although there are many richer ways of representing genomic features
> via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> flat file formats declares the bioinformatics community's need for a
> simple format that can be modified with a text editor and processed
> with shell tools like grep.  The GFF format, although widely used, has
> fragmented into multiple incompatible dialects.  When asked why they
> have modified the published Sanger specification, bioinformaticists
> frequently answer that the format was insufficient for their needs,
> and they needed to extend it.  The proposed GFF3 format addresses the
> most common extensions to GFF, while preserving backward compatibility
> with previous formats. The new format:
> 
>     1) adds a mechanism for representing more than one level 
>        of hierarchical grouping of features and subfeatures.
>     2) separates the ideas of group membership and feature name/id
>     3) constrains the feature type field to be taken from a controlled
>        vocabulary.
>     4) allows a single feature, such as an exon, to belong to more than
>        one group at a time.
>     5) one level of relative addressing for subfeatures (e.g. exons
>        can be expressed in transcript coordinates)
>     6) an explicit convention for pairwise alignments
>     7) an explicit convention for features that occupy disjunct regions
> 
> The format consists of 10 columns, separated by spaces.  The following
> unescaped characters are allowed within fields:
> [a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must must be escaped
> using the URL escaping conventions.  Unescaped quotation marks,
> backslashes and other ad-hoc escaping conventions that have been added
> to the GFF format are explicitly forbidden.  The =, ; and % characters
> have reserved meanings as described below.
> 
> Undefined fields are replaced with the "." character, as described in
> the original GFF spec.
> 
> Column 1: "seqid"
> 
> The ID of the landmark used to establish the coordinate system for the
> current feature.  IDs must contain alphanumeric characters.
> Whitespace, if present, must be escaped using URL escaping rules
> (e.g. space="%20" or "+").
> 
> Column 2: "source"
> 
> The source of the feature.  This is unchanged from the older GFF specs
> and is not part of a controlled vocabulary.
> 
> Column 3: "type"
> 
> The type of the feature (previously called the "method").  This is
> constrained to be either: (a) a term from SOFA; or (b) a SOFA
> accession number.  The latter alternative is distinguished using the
> syntax SOFA:000000.
> 
> Columns 4 & 5: "start" and "end"
> 
> The start and end of the feature, in 1-based integer coordinates,
> relative to the landmark given in column 1.  Start is less than end.
> 
> Column 6: "score"
> 
> The score of the feature, a floating point number.  As in earlier
> versions of the format, the semantics of the score are ill-defined.
> It is strongly recommended that E-values be used for sequence
> similarity features, and that P-values be used for ab initio gene
> prediction features.
> 
> Column 7: "strand"
> 
> The strand of the feature.  + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded.  In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
> 
> Column 8: "phase"
> 
> The phase of the feature, for protein-encoding featues (primarily
> CDSs).  This is an integer-valued field with the values 0, 1, or 2.
> The integer indicates the offset from the start of the feature to the
> first base of the first codon in the reading frame.  "." is used for
> features that do not corresponding to a reading frame.
> 
> Column 9: "attributes"
> 
> A list of feature attributes in the format tag=value.  Multiple
> tag=value pairs are separated by semicolons.  URL escaping rules are
> used for tags or values containing the following characters: ",=;".
> Whitespace should be replaced with the "+" character or the %20 URL
> escape.  This will allow the file to survive text processing programs
> that convert tabs into spaces.
> 
> Five tags are predefined:
> 
>     ID	   Indicates the name of the feature.  IDs must be unique
> 	   within the scope of the GFF file.
> 
>     Alias  A descriptive name for the feature.  It is suggested that
> 	   this tag be used whenever a secondary identifier for the
> 	   feature is needed, such as display names, locus names and
> 	   accession numbers.  Unlike ID, there is no requirement
> 	   that Alias be unique within the file.
> 
>     Parent Indicates the parent of the feature.  A parent ID can be
> 	   used to group exons into transcripts, transcripts into
> 	   genes, an so forth.  A feature may have multiple parents.
> 
>     Target Indicates the target of a nucleotide to nucleotide or
> 	   nucleotide to protein alignment.  The format of the
> 	   value is "target_id:start..end"  Start may be greater
> 	   than end to indicate a + strand alignment to the
> 	   reverse complement of a target nucleotide sequence.
> 
>     Align  The alignment of the feature to the target if the two
> 	   are not colinear.  The alignment is a string containing
> 	   the four characters "|X^v", where "|" indicates an
> 	   aligned match, "X" indicates an aligned mismatch, "^"
> 	   indicates a gap in the feature, and "v" indicates a
> 	   gap in the target.
> 
> Multiple attributes of the same type are indicated by separating the
> values with the comma "," character, as in:
> 
>        Parent=AF2312,AB2812,abc-3
> 
> Note that attribute names are case sensitive.  "Parent" is not the
> same as "parent".
> 
> In the example GFF3 file given below, the first column contains line
> numbers that I have added for the purposes of the narrative.  Here are
> some common scenarios that I have attempted to illustrate:
> 
> A) a simple feature, no public ID
> 
> Line 2 in the example is a feature of type "repeat". It is located on
> the coordinate system defined by feature "ctg123", has a start and an
> end and no ID.  It has an attribute named "Note" with value "ALU3."
> 
> B) a simple feature with a public ID
> 
> Line 3 is a feature of type clone.  It has a start and an end.  Its
> parent is undefined (no Parent attribute), but it has an ID attribute
> of "clone00001" and an Alias of "cTel33B."
> 
> C) a feature with multiple attributes
> 
> Line 5 is a feature of type "gene."  It has no parent, and has
> attributes of type ID, Note, and GO_term.
> 
> D) a hierarchical grouping of features
> 
> Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
> line 5, which defines the extent of a "gene" with ID "gene00001".
> Below this are two features of type mRNA (lines 6 and 7).  Their
> Parent attributes are set to "gene00001", indicating that this feature
> is their immediate parent.  Their IDs are indicated as separate
> attributes.
> 
> This pattern is repeated for the exons listed on lines 8-11.  Exons
> exon00001, exon00002, and exon00004 belong to both of the transcripts.
> Therefore, their Parent attribute contains both the mRNA00001 and
> mRNA00002 IDs separated by a comma.
> 
> Exon exon00003 belongs to mRNA00002 only, and therefore that
> transcript's ID is listed as the sole Parent.
> 
> Lines 12 and 13 indicate coding_start and coding_end features.  These
> subfeatures are hierarchically grouped underneath their corresponding
> exons, but they do not have independent public IDs.
> 
> E) Disjunct coordinates
> 
> Lines 14-16 illustrates a single feature -- the CDS corresponding to
> mRNA mRNA00001 -- which occupies multiple disjunct regions.  The
> Parent attribute indicates that the CDS features belong to mRNA00001.
> However, the attribute column assigns each of lines 14-16 the same ID.
> Because the ID is the same, this is interpreted as a single feature
> that spans multiple disjunct coordinate ranges.
> 
> NOTE: See "Representing Translations" for a discussion of why it might
> not be a good idea to use represent translations in this way.
> 
> F) Alignments
> 
> Lines 17-19 demonstrate an alignment of two sequences using the
> reserved Target attribute.  Each non-gapped segment becomes a line in
> the GFF3 file.  The segments each share the same ID, thereby
> indicating that the segments are disjunct regions of the same feature.
> The Target attribute indicates the ID of the target sequence (which
> does not have to be represented in the GFF3 file) and the start and
> end coordinates of the aligned target.
> 
> Line 20 shows a gapped alignment using the Align attribute.  This
> attribute's value should be interpreted this way:
> 
> 
>  1501  gatt*ctccc 1510      ctg123
>        ||||^||X||
>  2001  gatttctgcc 2011      af923
> 
> Unlike the GFF1 and GFF2 formats, the Parent attribute for gapped
> alignments may be empty. However, a valid alternative representation
> is to create a single "match" feature, and a series of "hsp" features
> contained within it.  Lines 21-23 show this alternative
> representation.
> 
> G) Relative coordinates
> 
> Lines 24-27 illustrate using relative coordinate addressing in
> feature/subfeature relationships.  Line 24 defines an mRNA that is
> positioned on sequence landmark "ctg123" from positions 5000 to 6000.
> Its ID field indicates that is mRNA03.  Lines 25-27 are exon
> subfeatures of mRNA03 as indicated by their Parent attribute.
> However, the seqid field specifies mRNA03 as the parent coordinate
> system, thereby allowing the exons to begin at position 1.
> 
>   0  ##gff-version 3
>   1  ##sequence-region ctg123:1..1497228     
> 
>   2  ctg123  flybase repeat  5000    5100    .       .       .       Note=ALU3
>   3  ctg123  flybase clone   1       2679    .       +       .       ID=clone00001;Alias=cTel33B
>   4  ctg123  flybase contig  1       1497228 .       +       .       ID=contig0001;Alias=ctg123
> 
>   5  ctg123  flybase gene    43733   44677   .       +       .       ID=gene00001;Alias=ADAM1;Note=unc-3;GO_term=GO:12345,GO:33421
>   6  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00001;Alias=ADAM1.t1;Parent=gene00001
>   7  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00002;Alias=ADAM1.t2;Parent=gene00001
>   8  ctg123  flybase exon    43733   43961   .       +       .       ID=exon00001;Parent=mRNA00001,mRNA00002
>   9  ctg123  flybase exon    44030   44234   .       +       .       ID=exon00002;Parent=mRNA00001,mRNA00002
>  10  ctg123  flybase exon    44281   44328   .       +       .       ID=exon00003;Parant=mRNA00002
>  11  ctg123  flybase exon    44521   44677   .       +       .       ID=exon00004;Parent=mRNA00001,mRNA00002
>  12  ctg123  flybase coding_start    43740   43740   .       +       .       Parent=exon00001
>  13  ctg123  flybase coding_end      44677   44677   .       +       .       Parent=exon00004
> 
>  14  ctg123  flybase cds     43740   43961   .       +       0       ID=cds00001;Parent=mRNA00001
>  15  ctg123  flybase cds     44030   44234   .       +       1       ID=cds00001;Parent=mRNA00001
>  16  ctg123  flybase cds     44521   44677   .       +       1       ID=cds00001;Parent=mRNA00001
> 
>  17  ctg123  flybase match   1       100     100     .       .       ID=match0001;Target=af923:1001..1100
>  18  ctg123  flybase match   101     500     80      .       .       ID=match0001;Target=af923:1101..1500
>  19  ctg123  flybase match   501     1000    80      .       .       ID=match0001;Target=af923:1501..2000
>  20  ctg123  flybase match   1501    1510    60      .       .       ID=match0001;Target=af923:2001..2011;Align=||||^||X||
> 
>  21  ctg123  flybase match   5001    6000    100     .       .       ID=match0002;Target=ua388:1..1000
>  22  ctg123  flybase hsp     5001    5500    .       .       .       Parent=match0002;Target=ua388:1..500
>  23  ctg123  flybase hsp     5501    6000    .       .       .       Parent=match0002;Target-ua388:501.1000
> 
>  24  ctg123  flybase mRNA    5000    6000    +       .       .       ID=mRNA03;Alias=EVE1.t1
>  25  mRNA03  flybase exon    1       300     +       .       .       ID=exon00005;Parent=mRNA03
>  26  mRNA03  flybase exon    301     400     +       .       .       ID=exon00006;Parent=mRNA03
>  27  mRNA03  flybase exon    401     1000    +       .       .       ID=exon00007;Parent=mRNA03
> 
> =================================================================
> 
> OTHER SYNTAX:
> 
> Comments are preceded by the # symbol.  Meta-data and directives are
> preceded by ##.  The following directives are recognized:
> 
>   ##gff-version 3        
> 	The GFF version, always 3 in this spec.  This must
> 	be the topmost line of the file.
> 
>   ##sequence-region seqid:start..end
>         The sequence segment referred to
> 	by this file, in the format seqid:start..end.
> 	This element is optional.  If it occurs, it must be
> 	the second line of the file.
> 
>   ###
>         This directive (three # signs in a row) indicates that all
>         forward references to feature IDs that have been seen to this
>         point have been resolved.  After seeing this directive, a
>         program that is processing the file serially can close off any
>         open objects that it has created and return them, thereby
>         allowing iterative access to the file.  Otherwise, software
>         cannot know that a feature has been fully populated by its
>         subfeatures until the end of the file has been reached.
>                  
> 
> =================================================================
> 
> REPRESENTING TRANSLATIONS
> 
> There are two ways of representing protein translations (e.g. ORFS,
> CDS) in the various implementations of GFF2 and GTF.  One way is to
> represent the translation as an interrupted "CDS" region beginning
> with the first base of the first codon and ending at the last base of
> the stop codon.  Another is to create a series of exons and to
> indicate the position of the translational start and end on the first
> and last coding exon.
> 
> An informal sampling of members of this list (Michele Clamp, Suzi
> Lewis, Richard Durbin) suggests that the latter solution is cleaner
> and more manageable in practice, leading to more consistent annotation
> and to fewer ambiguities.  Therefore, I would propose that we
> legislate that translations be represented implicitly by explicit
> translational start and end positions.  For this to work properly, the
> parent of the start and end sites must be the mRNA feature and NOT the
> exon.
> 
> Under this model, here is a generic gene
> 
>   gene:  a bag of features, including regulatory motifs
>      mRNA
> 	exon
> 	coding_start
> 	coding_end
> 	splice_donor
> 	splice_acceptor
> 	5_utr
> 	3_utr
> 
> Importantly, the UTRs, coding start and coding end are all children of
> the mRNA.  Making them children of the exon (which some will be
> tempted to do!) creates ambiguities in the interpretation of
> alternative splices.
> 
> =================================================================
> 
> EXAMPLE PROGRAM
> 
> I have extended (in an experimental way), the Bio::Tools::GFF module
> to accomodate this new format.  Here is a test script and its output
> when run on the above file.
> 
>   0  #!/usr/bin/perl -w
>   1  use strict;
>   2  use lib '.';
> 
>   3  use Bio::Tools::GFF;
>   4  my $file = 'gff3.txt';
>   5  my $gffio = Bio::Tools::GFF->new(-file=>$file,-gff_version=>3);
>   6  my @f = sort {$a->primary_tag cmp $b->primary_tag} $gffio->features;
>   7  format_features(\@f);
> 
>   8  sub format_features {
>   9    my $features = shift;
>  10    my $tabs     = shift || 0;
>  11    for my $f (@$features) {
>  12      my $type  = $f->primary_tag;
>  13      my $id    = $f->unique_id;
>  14      $id       ||= '(no id)';# if $id =~ /HASH/;
>  15      my ($start,$end) = ($f->start,$f->end);
>  16      my $hit = $f->can('hstart') ? $f->hunique_id.":".$f->feature2->location->to_FTstring
>  17                                  : '';
>  18      print "\t"x$tabs,join("\t",$id,$type,$f->location->to_FTstring,$hit),"\n";
>  19      format_features([$f->sub_SeqFeature],$tabs+1);
>  20    }
>  21  }
> 
> OUTPUT:
> 
> clone00001	clone	1..2679	
> contig0001	contig	1..1497228	
> gene00001	gene	43733..44677	
> 	mRNA00001	mRNA	43733..44677	
> 		exon00001	exon	43733..43961	
> 			(no id)	coding_start	43740	
> 		exon00002	exon	44030..44234	
> 		exon00004	exon	44521..44677	
> 			(no id)	coding_end	44677	
> 		cds00001	cds	join(43740..43961,44030..44234,44521..44677)	
> 	mRNA00002	mRNA	43733..44677	
> 		exon00001	exon	43733..43961	
> 			(no id)	coding_start	43740	
> 		exon00002	exon	44030..44234	
> 		exon00003	exon	44281..44328	
> 		exon00004	exon	44521..44677	
> 			(no id)	coding_end	44677	
> mRNA03	mRNA	5000..6000	
> 	exon00005	exon	5000..5299	
> 	exon00006	exon	5300..5399	
> 	exon00007	exon	5400..5999	
> match0001	match	join(1..100,101..500,501..1000,1501..1510)	af923:join(1001..1100,1101..1500,1501..2000,2001..2011)
> match0002	match	5001..6000	ua388:1..1000
> 	(no id)	hsp	5001..5500	ua388:1..500
> 	(no id)	hsp	5501..6000	ua388:501..1000
> (no id)	repeat	5000..5100	
> 
> 
>