[Bioperl-l] Re: [SO-devel] GFF3 preliminary

Tue Feb 18 22:25:52 EST 2003

Hi All,

".  When asked why they
> have modified the published Sanger specification, bioinformaticists
> frequently answer that the format was insufficient for their needs...",

So why not just use XML? You know, with a real DTD, like the rest of the
world, and be done with it?

--mark 

> Hi,
> 
> Following up on discussions with Jim Kent, Suzi Lewis, Michele Clamp
> and Richard Durbin, here is a new version of the GFF3 proposal.
> 
> Suzi, could you post this to song.sourceforge.net, when you have a
> chance?  I don't seem to have write permissions to the htdocs
> directory.
> 
> Best,
> 
> Lincoln
> 
> 
> GENERIC FEATURE FORMAT VERSION 3: A PROPOSAL
> 
> Author:  Lincoln Stein
> Date:    19 February 2003
> Version: 0.2
> 
> Although there are many richer ways of representing genomic features
> via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> flat file formats declares the bioinformatics community's need for a
> simple format that can be modified with a text editor and processed
> with shell tools like grep.  The GFF format, although widely used, has
> fragmented into multiple incompatible dialects.  When asked why they
> have modified the published Sanger specification, bioinformaticists
> frequently answer that the format was insufficient for their needs,
> and they needed to extend it.  The proposed GFF3 format addresses the
> most common extensions to GFF, while preserving backward compatibility
> with previous formats. The new format:
> 
>     1) adds a mechanism for representing more than one level 
>        of hierarchical grouping of features and subfeatures.
>     2) separates the ideas of group membership and feature name/id
>     3) constrains the feature type field to be taken from a controlled
>        vocabulary.
>     4) allows a single feature, such as an exon, to belong to more than
>        one group at a time.
>     5) provides one level of relative addressing for subfeatures
>        (e.g. exons can be expressed in transcript coordinates)
>     6) defines an explicit convention for pairwise alignments
>     7) defines an explicit convention for features that occupy
>        disjunct regions
> 
> The format consists of 9 columns, separated by tabs.  The following
> unescaped characters are allowed within fields:
> [a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
> using the URL escaping conventions.  Unescaped quotation marks,
> backslashes and other ad-hoc escaping conventions that have been added
> to the GFF format are explicitly forbidden.  The =, ; and % characters
> have reserved meanings as described below.
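
For illustration only, here is a minimal Python sketch of the escaping rule described above. The names escape_field and unescape_field are made up for this example and are not part of the proposal or of BioPerl. It always writes a space as %20 and does not decode "+" back to a space, which is a simplifying assumption.

```python
import re
from urllib.parse import unquote

# Anything outside the character set listed in the proposal must be escaped.
_UNSAFE = re.compile(r"[^a-zA-Z0-9.:;=%^*$@!+_?-]")


def escape_field(text, reserved=""):
    """Percent-encode characters outside the allowed set, plus any in `reserved`."""
    out = []
    for ch in text:
        if _UNSAFE.match(ch) or ch in reserved:
            # Encode each UTF-8 byte as %XX (URL escaping convention).
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def unescape_field(text):
    """Reverse the %XX escapes."""
    return unquote(text)
```

For an attribute value one would presumably pass the reserved characters explicitly, e.g. escape_field("unc-3; see note", reserved="=;%,").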
> 
> Undefined fields are replaced with the "." character, as described in
> the original GFF spec.
> 
> Column 1: "seqid"
> 
> The ID of the landmark used to establish the coordinate system for the
> current feature.  IDs must contain alphanumeric characters.
> Whitespace, if present, must be escaped using URL escaping rules
> (e.g. space="%20" or "+").
> 
> Column 2: "source"
> 
> The source of the feature.  This is unchanged from the older GFF specs
> and is not part of a controlled vocabulary.
> 
> Column 3: "type"
> 
> The type of the feature (previously called the "method").  This is
> constrained to be either: (a) a term from SOFA; or (b) a SOFA
> accession number.  The latter alternative is distinguished using the
> syntax SOFA:000000.
> 
> Columns 4 & 5: "start" and "end"
> 
> The start and end of the feature, in 1-based integer coordinates,
> relative to the landmark given in column 1.  Start is always less
> than or equal to end.
> 
> Column 6: "score"
> 
> The score of the feature, a floating point number.  As in earlier
> versions of the format, the semantics of the score are ill-defined.
> It is strongly recommended that E-values be used for sequence
> similarity features, and that P-values be used for ab initio gene
> prediction features.
> 
> Column 7: "strand"
> 
> The strand of the feature.  + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded.  In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
> 
> Column 8: "phase"
> 
> The phase of the feature, for protein-encoding features (primarily
> CDSs).  This is an integer-valued field with the values 0, 1, or 2.
> The integer indicates the offset from the start of the feature to the
> first base of the first codon in the reading frame.  "." is used for
> features that do not correspond to a reading frame.
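
As a rough sketch of how the fixed columns might be split and type-checked, the Python below assumes tab-delimited input with the attributes column last, and keeps that last column as a raw string. Feature and parse_line are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]   # None when the column is "."
    strand: str              # one of + - . ?
    phase: Optional[int]     # None when the column is "."
    attributes: str          # raw text of the last column


def parse_line(line):
    """Split one feature line into typed columns (assumes tab separation)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    if strand not in ("+", "-", ".", "?"):
        raise ValueError(f"bad strand: {strand!r}")
    if phase not in (".", "0", "1", "2"):
        raise ValueError(f"bad phase: {phase!r}")
    start_i, end_i = int(start), int(end)
    if start_i > end_i:
        raise ValueError("start must not exceed end")
    return Feature(
        seqid, source, ftype, start_i, end_i,
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attrs,
    )
```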
> 
> Column 9: "attributes"
> 
> A list of feature attributes in the format tag=value.  Multiple
> tag=value pairs are separated by semicolons.  URL escaping rules are
> used for tags or values containing the following characters: ",=;".
> Whitespace should be replaced with the "+" character or the %20 URL
> escape.  This will allow the file to survive text processing programs
> that convert tabs into spaces.
> 
> Five tags are predefined:
> 
>     ID	   Indicates the name of the feature.  IDs must be unique
> 	   within the scope of the GFF file.
> 
>     Alias  A descriptive name for the feature.  It is suggested that
> 	   this tag be used whenever a secondary identifier for the
> 	   feature is needed, such as display names, locus names and
> 	   accession numbers.  Unlike ID, there is no requirement
> 	   that Alias be unique within the file.
> 
>     Parent Indicates the parent of the feature.  A parent ID can be
> 	   used to group exons into transcripts, transcripts into
> 	   genes, and so forth.  A feature may have multiple parents.
> 
>     Target Indicates the target of a nucleotide to nucleotide or
> 	   nucleotide to protein alignment.  The format of the
> 	   value is "target_id:start..end".  Start may be greater
> 	   than end to indicate a + strand alignment to the
> 	   reverse complement of a target nucleotide sequence.
> 
>     Align  The alignment of the feature to the target if the two
> 	   are not colinear.  The alignment is a string containing
> 	   the four characters "|X^v", where "|" indicates an
> 	   aligned match, "X" indicates an aligned mismatch, "^"
> 	   indicates a gap in the feature, and "v" indicates a
> 	   gap in the target.
> 
> Multiple attributes of the same type are indicated by separating the
> values with the comma "," character, as in:
> 
>        Parent=AF2312,AB2812,abc-3
> 
> Note that attribute names are case sensitive.  "Parent" is not the
> same as "parent".
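
A minimal Python sketch of reading the attributes column as described above: pairs are split on ";", tag from value on "=", and multiple values on ",", with URL escapes decoded afterwards. parse_attributes is a hypothetical helper name, and storing every value as a list is a design choice of this sketch, not something the proposal requires.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Return {tag: [value, ...]} for a tag=value;tag=value attribute string."""
    attrs = {}
    if column9 in ("", "."):
        return attrs
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"attribute without '=': {pair!r}")
        # Split on commas before unescaping, so an escaped %2C stays in a value.
        values = [unquote(v) for v in value.split(",")]
        attrs.setdefault(unquote(tag), []).extend(values)
    return attrs
```

Because tags are kept exactly as written, "Parent" and "parent" end up as different keys, which matches the case-sensitivity rule.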
> 
> In the example GFF3 file given below, the first column contains line
> numbers that I have added for the purposes of the narrative.  Here are
> some common scenarios that I have attempted to illustrate:
> 
> A) a simple feature, no public ID
> 
> Line 2 in the example is a feature of type "repeat". It is located on
> the coordinate system defined by feature "ctg123", has a start and an
> end and no ID.  It has an attribute named "Note" with value "ALU3."
> 
> B) a simple feature with a public ID
> 
> Line 3 is a feature of type clone.  It has a start and an end.  Its
> parent is undefined (no Parent attribute), but it has an ID attribute
> of "clone00001" and an Alias of "cTel33B."
> 
> C) a feature with multiple attributes
> 
> Line 5 is a feature of type "gene."  It has no parent, and has
> attributes of type ID, Note, and GO_term.
> 
> D) a hierarchical grouping of features
> 
> Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
> line 5, which defines the extent of a "gene" with ID "gene00001".
> Below this are two features of type mRNA (lines 6 and 7).  Their
> Parent attributes are set to "gene00001", indicating that this feature
> is their immediate parent.  Their IDs are indicated as separate
> attributes.
> 
> This pattern is repeated for the exons listed on lines 8-11.  Exons
> exon00001, exon00002, and exon00004 belong to both of the transcripts.
> Therefore, their Parent attribute contains both the mRNA00001 and
> mRNA00002 IDs separated by a comma.
> 
> Exon exon00003 belongs to mRNA00002 only, and therefore that
> transcript's ID is listed as the sole Parent.
> 
> Lines 12 and 13 indicate coding_start and coding_end features.  These
> subfeatures are hierarchically grouped underneath their corresponding
> exons, but they do not have independent public IDs.
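
To make the grouping concrete, here is a small Python sketch that turns Parent attributes into a parent-to-children index. The input shape (id, list of parent IDs, type) and the name build_children are assumptions of this example only.

```python
from collections import defaultdict


def build_children(features):
    """features: iterable of (feature_id or None, [parent_ids], type).

    Returns (roots, children), where children maps a parent ID to the
    list of (feature_id, type) entries that name it in Parent.
    """
    children = defaultdict(list)
    roots = []
    for fid, parents, ftype in features:
        if parents:
            # A feature with several parents is listed under each of them.
            for parent in parents:
                children[parent].append((fid, ftype))
        else:
            roots.append((fid, ftype))
    return roots, children
```

With this layout an exon shared by two transcripts simply appears in both child lists, and a subfeature without an ID can still be attached because only the parent needs one.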
> 
> E) Disjunct coordinates
> 
> Lines 14-16 illustrate a single feature -- the CDS corresponding to
> mRNA mRNA00001 -- which occupies multiple disjunct regions.  The
> Parent attribute indicates that the CDS features belong to mRNA00001.
> However, the attribute column assigns each of lines 14-16 the same ID.
> Because the ID is the same, this is interpreted as a single feature
> that spans multiple disjunct coordinate ranges.
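
A sketch, in Python, of how lines sharing an ID might be folded into one multi-segment feature. merge_by_id and format_location are hypothetical names, and the join(...) notation is borrowed from the feature-table style used in the example output further down.

```python
def merge_by_id(rows):
    """rows: iterable of (feature_id, start, end).

    Returns {feature_id: [(start, end), ...]} with segments sorted by start.
    """
    merged = {}
    for fid, start, end in rows:
        merged.setdefault(fid, []).append((start, end))
    return {fid: sorted(segments) for fid, segments in merged.items()}


def format_location(segments):
    """Render one segment as a..b and several as join(a..b,c..d)."""
    parts = [f"{start}..{end}" for start, end in segments]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
```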
> 
> NOTE: See "Representing Translations" for a discussion of why it might
> not be a good idea to represent translations in this way.
> 
> F) Alignments
> 
> Lines 17-19 demonstrate an alignment of two sequences using the
> reserved Target attribute.  Each non-gapped segment becomes a line in
> the GFF3 file.  The segments each share the same ID, thereby
> indicating that the segments are disjunct regions of the same feature.
> The Target attribute indicates the ID of the target sequence (which
> does not have to be represented in the GFF3 file) and the start and
> end coordinates of the aligned target.
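
For what it is worth, the Target value can be taken apart with a few lines of Python. This is only a sketch: parse_target is a made-up name, and returning a separate "reversed" flag for start greater than end is one possible reading of the convention.

```python
def parse_target(value):
    """Split "target_id:start..end" into (target_id, low, high, is_reversed)."""
    # rpartition keeps any ":" that is part of the target ID itself.
    target_id, sep, span = value.rpartition(":")
    if not sep or not target_id:
        raise ValueError(f"missing target id: {value!r}")
    start_text, sep, end_text = span.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in range: {value!r}")
    start, end = int(start_text), int(end_text)
    return target_id, min(start, end), max(start, end), start > end
```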
> 
> Line 20 shows a gapped alignment using the Align attribute.  This
> attribute's value should be interpreted this way:
> 
> 
>  1501  gatt*ctccc 1510      ctg123
>        ||||^||X||
>  2001  gatttctgcc 2011      af923
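
One way to read an Align string programmatically is to tally its columns, as in this Python sketch. summarize_align is a hypothetical helper; it only counts symbols and does not check them against the start and end coordinates.

```python
def summarize_align(align):
    """Count the column types in an Align string made of | X ^ v."""
    unexpected = set(align) - set("|X^v")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    feature_gaps = align.count("^")   # gap in the feature
    target_gaps = align.count("v")    # gap in the target
    return {
        "matches": align.count("|"),
        "mismatches": align.count("X"),
        "feature_gaps": feature_gaps,
        "target_gaps": target_gaps,
        # Bases consumed on each side: every column except that side's gaps.
        "feature_bases": len(align) - feature_gaps,
        "target_bases": len(align) - target_gaps,
    }
```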
> 
> Unlike the GFF1 and GFF2 formats, the Parent attribute for gapped
> alignments may be empty. However, a valid alternative representation
> is to create a single "match" feature, and a series of "hsp" features
> contained within it.  Lines 21-23 show this alternative
> representation.
> 
> G) Relative coordinates
> 
> Lines 24-27 illustrate using relative coordinate addressing in
> feature/subfeature relationships.  Line 24 defines an mRNA that is
> positioned on sequence landmark "ctg123" from positions 5000 to 6000.
> Its ID field indicates that it is mRNA03.  Lines 25-27 are exon
> subfeatures of mRNA03 as indicated by their Parent attribute.
> However, the seqid field specifies mRNA03 as the parent coordinate
> system, thereby allowing the exons to begin at position 1.
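
The arithmetic behind relative addressing is just an offset, sketched here in Python. to_absolute is a made-up name, and the sketch assumes the parent lies on the + strand of its landmark.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark."""
    offset = parent_start - 1  # relative position 1 is the parent's first base
    return rel_start + offset, rel_end + offset
```

For a parent starting at 5000, an exon at 1..300 lands on 5000..5299, which agrees with the coordinates shown for exon00005 in the example output.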
> 
>   0  ##gff-version 3
>   1  ##sequence-region ctg123:1..1497228     
> 
>   2  ctg123  flybase repeat  5000    5100    .       .       .       Note=ALU3
>   3  ctg123  flybase clone   1       2679    .       +       .       ID=clone00001;Alias=cTel33B
>   4  ctg123  flybase contig  1       1497228 .       +       .       ID=contig0001;Alias=ctg123
> 
>   5  ctg123  flybase gene    43733   44677   .       +       .       ID=gene00001;Alias=ADAM1;Note=unc-3;GO_term=GO:12345,GO:33421
>   6  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00001;Alias=ADAM1.t1;Parent=gene00001
>   7  ctg123  flybase mRNA    43733   44677   .       +       .       ID=mRNA00002;Alias=ADAM1.t2;Parent=gene00001
>   8  ctg123  flybase exon    43733   43961   .       +       .       ID=exon00001;Parent=mRNA00001,mRNA00002
>   9  ctg123  flybase exon    44030   44234   .       +       .       ID=exon00002;Parent=mRNA00001,mRNA00002
>  10  ctg123  flybase exon    44281   44328   .       +       .       ID=exon00003;Parent=mRNA00002
>  11  ctg123  flybase exon    44521   44677   .       +       .       ID=exon00004;Parent=mRNA00001,mRNA00002
>  12  ctg123  flybase coding_start    43740   43740   .       +       .       Parent=exon00001
>  13  ctg123  flybase coding_end      44677   44677   .       +       .       Parent=exon00004
> 
>  14  ctg123  flybase cds     43740   43961   .       +       0       ID=cds00001;Parent=mRNA00001
>  15  ctg123  flybase cds     44030   44234   .       +       1       ID=cds00001;Parent=mRNA00001
>  16  ctg123  flybase cds     44521   44677   .       +       1       ID=cds00001;Parent=mRNA00001
> 
>  17  ctg123  flybase match   1       100     100     .       .       ID=match0001;Target=af923:1001..1100
>  18  ctg123  flybase match   101     500     80      .       .       ID=match0001;Target=af923:1101..1500
>  19  ctg123  flybase match   501     1000    80      .       .       ID=match0001;Target=af923:1501..2000
>  20  ctg123  flybase match   1501    1510    60      .       .       ID=match0001;Target=af923:2001..2011;Align=||||^||X||
> 
>  21  ctg123  flybase match   5001    6000    100     .       .       ID=match0002;Target=ua388:1..1000
>  22  ctg123  flybase hsp     5001    5500    .       .       .       Parent=match0002;Target=ua388:1..500
>  23  ctg123  flybase hsp     5501    6000    .       .       .       Parent=match0002;Target=ua388:501..1000
> 
>  24  ctg123  flybase mRNA    5000    6000    .       +       .       ID=mRNA03;Alias=EVE1.t1
>  25  mRNA03  flybase exon    1       300     .       +       .       ID=exon00005;Parent=mRNA03
>  26  mRNA03  flybase exon    301     400     .       +       .       ID=exon00006;Parent=mRNA03
>  27  mRNA03  flybase exon    401     1000    .       +       .       ID=exon00007;Parent=mRNA03
> 
> =================================================================
> 
> OTHER SYNTAX:
> 
> Comments are preceded by the # symbol.  Meta-data and directives are
> preceded by ##.  The following directives are recognized:
> 
>   ##gff-version 3        
> 	The GFF version, always 3 in this spec.  This must
> 	be the topmost line of the file.
> 
>   ##sequence-region seqid:start..end
>         The sequence segment referred to by this file, in the
>         format seqid:start..end.  This element is optional.  If it
>         occurs, it must be the second line of the file.
> 
>   ###
>         This directive (three # signs in a row) indicates that all
>         forward references to feature IDs that have been seen to this
>         point have been resolved.  After seeing this directive, a
>         program that is processing the file serially can close off any
>         open objects that it has created and return them, thereby
>         allowing iterative access to the file.  Otherwise, software
>         cannot know that a feature has been fully populated by its
>         subfeatures until the end of the file has been reached.
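
A rough Python sketch of the iterative access that the ### directive makes possible: feature lines are collected until ### is seen, then handed back as one self-contained batch. iter_blocks is a hypothetical name; comments, other directives and blank lines are simply skipped in this sketch.

```python
def iter_blocks(lines):
    """Yield lists of feature lines, one list per ###-terminated block."""
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "###":
            # All forward references so far are resolved: release the block.
            if block:
                yield block
                block = []
        elif line.startswith("#") or not line.strip():
            continue  # comment, ## directive, or blank line
        else:
            block.append(line)
    if block:
        yield block  # whatever remains at end of file
```

Since it accepts any iterable of lines, an open file handle can be passed directly and only one block is held in memory at a time.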
>                  
> 
> =================================================================
> 
> REPRESENTING TRANSLATIONS
> 
> There are two ways of representing protein translations (e.g. ORFs,
> CDS) in the various implementations of GFF2 and GTF.  One way is to
> represent the translation as an interrupted "CDS" region beginning
> with the first base of the first codon and ending at the last base of
> the stop codon.  Another is to create a series of exons and to
> indicate the position of the translational start and end on the first
> and last coding exon.
> 
> An informal sampling of members of this list (Michele Clamp, Suzi
> Lewis, Richard Durbin) suggests that the latter solution is cleaner
> and more manageable in practice, leading to more consistent annotation
> and to fewer ambiguities.  Therefore, I would propose that we
> legislate that translations be represented implicitly by explicit
> translational start and end positions.  For this to work properly, the
> parent of the start and end sites must be the mRNA feature and NOT the
> exon.
> 
> Under this model, here is a generic gene
> 
>   gene:  a bag of features, including regulatory motifs
>      mRNA
> 	exon
> 	coding_start
> 	coding_end
> 	splice_donor
> 	splice_acceptor
> 	5_utr
> 	3_utr
> 
> Importantly, the UTRs, coding start and coding end are all children of
> the mRNA.  Making them children of the exon (which some will be
> tempted to do!) creates ambiguities in the interpretation of
> alternative splices.
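
To show that explicit start and end positions carry the same information as an interrupted CDS, here is a Python sketch that clips a transcript's exons to the coding range. coding_segments is a made-up name; the sketch assumes a + strand transcript with all coordinates on the same landmark.

```python
def coding_segments(exons, coding_start, coding_end):
    """Clip (start, end) exons to the range coding_start..coding_end."""
    segments = []
    for start, end in sorted(exons):
        clipped_start = max(start, coding_start)
        clipped_end = min(end, coding_end)
        if clipped_start <= clipped_end:  # exon overlaps the coding range
            segments.append((clipped_start, clipped_end))
    return segments
```

Applied to the three exons of mRNA00001 with coding_start 43740 and coding_end 44677, this reproduces the three cds00001 ranges from the example file.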
> 
> =================================================================
> 
> EXAMPLE PROGRAM
> 
> I have extended (in an experimental way) the Bio::Tools::GFF module
> to accommodate this new format.  Here is a test script and its output
> when run on the above file.
> 
>   #!/usr/bin/perl -w
>   use strict;
>   use lib '.';
> 
>   use Bio::Tools::GFF;
> 
>   # Parse the example file as GFF3 and sort the top-level features by type.
>   my $file  = 'gff3.txt';
>   my $gffio = Bio::Tools::GFF->new(-file=>$file,-gff_version=>3);
>   my @f = sort {$a->primary_tag cmp $b->primary_tag} $gffio->features;
>   format_features(\@f);
> 
>   # Print one line per feature, indenting subfeatures one tab per level.
>   sub format_features {
>     my $features = shift;
>     my $tabs     = shift || 0;
>     for my $f (@$features) {
>       my $type  = $f->primary_tag;
>       my $id    = $f->unique_id;
>       $id       ||= '(no id)';# if $id =~ /HASH/;
>       my ($start,$end) = ($f->start,$f->end);
>       # Alignment features also report their target ID and location.
>       my $hit = $f->can('hstart')
>         ? $f->hunique_id.":".$f->feature2->location->to_FTstring
>         : '';
>       print "\t"x$tabs,join("\t",$id,$type,$f->location->to_FTstring,$hit),"\n";
>       # Recurse into the subfeatures, one level deeper.
>       format_features([$f->sub_SeqFeature],$tabs+1);
>     }
>   }
> 
> OUTPUT:
> 
> clone00001	clone	1..2679	
> contig0001	contig	1..1497228	
> gene00001	gene	43733..44677	
> 	mRNA00001	mRNA	43733..44677	
> 		exon00001	exon	43733..43961	
> 			(no id)	coding_start	43740	
> 		exon00002	exon	44030..44234	
> 		exon00004	exon	44521..44677	
> 			(no id)	coding_end	44677	
> 		cds00001	cds	join(43740..43961,44030..44234,44521..44677)	
> 	mRNA00002	mRNA	43733..44677	
> 		exon00001	exon	43733..43961	
> 			(no id)	coding_start	43740	
> 		exon00002	exon	44030..44234	
> 		exon00003	exon	44281..44328	
> 		exon00004	exon	44521..44677	
> 			(no id)	coding_end	44677	
> mRNA03	mRNA	5000..6000	
> 	exon00005	exon	5000..5299	
> 	exon00006	exon	5300..5399	
> 	exon00007	exon	5400..5999	
> match0001	match	join(1..100,101..500,501..1000,1501..1510)	af923:join(1001..1100,1101..1500,1501..2000,2001..2011)
> match0002	match	5001..6000	ua388:1..1000
> 	(no id)	hsp	5001..5500	ua388:1..500
> 	(no id)	hsp	5501..6000	ua388:501..1000
> (no id)	repeat	5000..5100	
> 
> 
> 
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein at cshl.org			                  Cold Spring Harbor, NY
> 	1 Bungtown Road, Cold Spring Harbor, NY 11724
> ========================================================================
> 
> 
-- 
M