[Bioperl-l] Re: Proposed GFF version 3

Tue Feb 11 17:27:40 EST 2003

OK.  Let's fix on that:

   Column 9 is "attributes" and column 10 is "parents".

Richard

Lincoln Stein wrote:
> The important thing to me is to be able to preserve some backward 
> compatibility with GFF2.  I don't think it will make much of a difference 
> which order the two columns fall in because some people used column 9 for 
> grouping and others for attributes.  How about calling column 10 "parents"?
> 
> I went to URL format mostly because Perl parsing will be a lot faster (Perl 
> likes regular expressions, but those don't play well with shell-style quote 
> and backslashing rules).  The official URL standard uses the semicolon.  The 
> very earliest CGI specification used ampersands, but this was abandoned about 
> five years ago when people realized that this violated the HTML spec.  
> Ampersands must be escaped, so the correct way to write ampersanded 
> parameter lists is:
> 
> 	<a href="/cgi-bin/foo?first=a&amp;second=b&amp;third=c">
> 
> I'm surprised to hear that Ensembl uses ampersands in its URLs.  I bet their 
> pages don't validate against the XHTML validators.
> 
> Lincoln
> 
> 
> On Tuesday 11 February 2003 07:54 am, Richard Durbin wrote:
> 
>>Swap them entirely.  i.e. put the attributes in column 9 and call that
>>"attributes" and put the new hierarchical group term in column 10 and
>>call that "group".  Or perhaps it would be better to call it something
>>else to minimise confusion, because in gff version 1 column 9 was called
>>group.  What about calling column 10 "cluster"?
>>
>>I see you have switched to URL type format for the attributes, away from
>>acedb.  That's fine - URL format is much more universal.  But is ';' a
>>standard separator in URLS?  I just looked and see that Ensembl uses '&'
>>and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
>>there is no standard.
>>
>>Richard
>>
>>Lincoln Stein wrote:
>>
>>>Hi Richard,
>>>
>>>Do you mean that we should swap columns 9 and 10 entirely, or just swap
>>>their names?  I think you mean the former, but I want to be sure.
>>>
>>>Lincoln
>>>
>>>On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
>>>
>>>>Hello all,
>>>>
>>>>This looks very nice to me.  Not surprising perhaps because I had an
>>>>earlier involvement as Lincoln says.
>>>>
>>>>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it is
>>>>the "official" GFF mailing list, although it is very little used.
>>>>
>>>>I have one major comment, that columns 9 (group) and 10 (attributes)
>>>>should be switched.  Although column 9 was called "group" in GFF
>>>>version 1, in version 2, which has been current for over two years,
>>>>it was renamed "attribute" and contains the attribute information.  For
>>>>consistency we should keep column 9 for the attributes.  Also, in many
>>>>cases there will be attributes but no group.
>>>>
>>>>I like ID and Target.  I see the idea with hsp's for gapped alignments,
>>>>though perhaps they could be called "match_block".  But there is a case
>>>>I think to also encode gapped alignments on one line, perhaps using the
>>>>CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
>>>>
>>>>		Target=M1:1..1000;Align=xxxxxxx
>>>>
>>>>(sorry, I don't know CIGAR format well enough to write a legal string).
>>>>
>>>>Richard
>>>>
>>>>Lincoln Stein wrote:
>>>>
>>>>>This letter is to discuss a proposed extension to GFF.  It arises from
>>>>>conversations with Richard Durbin during last fall's Hinxton genome
>>>>>informatics meeting.
>>>>>
>>>>>Although there are many richer ways of representing genomic features
>>>>>via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
>>>>>flat file formats declares the bioinformatics community's need for a
>>>>>simple format that can be modified with a text editor and processed
>>>>>with shell tools like grep.  The GFF format, although widely used, has
>>>>>fragmented into multiple incompatible dialects.  When asked why they
>>>>>have modified the published Sanger specification, bioinformaticists
>>>>>frequently answer that the format was insufficient for their needs,
>>>>>and they needed to extend it.  The proposed GFF3 format addresses the
>>>>>most common extensions to GFF, while preserving backward compatibility
>>>>>with previous formats. The new format:
>>>>>
>>>>>   1) adds a mechanism for representing more than one level
>>>>>      of hierarchical grouping of features and subfeatures.
>>>>>   2) separates the ideas of group membership and feature name/id
>>>>>   3) constrains the feature type field to be taken from a controlled
>>>>>      vocabulary.
>>>>>   4) allows a single feature, such as an exon, to belong to more than
>>>>>      one group at a time.
>>>>>   5) one level of relative addressing for subfeatures (e.g. exons
>>>>>      can be expressed in transcript coordinates)
>>>>>   6) an explicit convention for pairwise alignments
>>>>>   7) an explicit convention for features that occupy disjunct regions
>>>>>
>>>>>The format consists of 10 columns, separated by spaces.  The following
>>>>>unescaped characters are allowed within fields:
>>>>>[a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
>>>>>using the URL escaping conventions.  Unescaped quotation marks,
>>>>>backslashes and other ad-hoc escaping conventions that have been added
>>>>>to the GFF format are explicitly forbidden.  The =, ; and % characters
>>>>>have reserved meanings as described below.
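
A minimal Perl sketch of the escaping rule above, assuming it is applied
to one tag or value at a time.  The names escape_value and
unescape_value are hypothetical and not part of Bio::Tools::GFF.  The
reserved characters =, ; and % are escaped here as well, on the
assumption that a literal one inside a value must not be mistaken for a
separator.  Input is assumed to be single-byte text.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Bio::Tools::GFF.
# Escape every character outside the allowed set, plus the reserved
# characters '=', ';' and '%', as %XX (uppercase hex).
sub escape_value {
    my ($text) = @_;
    $text =~ s/([^a-zA-Z0-9.:^*\$\@!+_?-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Reverse the escaping: turn each %XX back into the character it encodes.
sub unescape_value {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

my $note    = 'similar to unc-3; score=high';
my $escaped = escape_value($note);
print "$escaped\n";
print unescape_value($escaped), "\n";
```

Escaping before joining tags and values with = and ; means a reader can
split on those characters first and unescape afterwards.
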
>>>>>
>>>>>Undefined fields are replaced with the "." character, as described in
>>>>>the original GFF spec.
>>>>>
>>>>>Column 1: "seqid"
>>>>>
>>>>>The ID of the landmark used to establish the coordinate system for the
>>>>>current feature.  IDs must contain alphanumeric characters.
>>>>>Whitespace, if present, must be escaped using URL escaping rules
>>>>>(e.g. space="%20").
>>>>>
>>>>>Column 2: "source"
>>>>>
>>>>>The source of the feature.  This is unchanged from the older GFF specs
>>>>>and is not part of a controlled vocabulary.
>>>>>
>>>>>Column 3: "type"
>>>>>
>>>>>The type of the feature (previously called the "method").  This is
>>>>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
>>>>>accession number.  The latter alternative is distinguished using the
>>>>>syntax SOFA:000000.
>>>>>
>>>>>Columns 4 & 5: "start" and "end"
>>>>>
>>>>>The start and end of the feature, in 1-based integer coordinates,
>>>>>relative to the landmark given in column 1.  Start is less than or
>>>>>equal to end (a one-base feature has start equal to end).
>>>>>
>>>>>Column 6: "score"
>>>>>
>>>>>The score of the feature, a floating point number.  As in earlier
>>>>>versions of the format, the semantics of the score are ill-defined.
>>>>>It is strongly recommended that E-values be used for sequence
>>>>>similarity features, and that P-values be used for ab initio gene
>>>>>prediction features.
>>>>>
>>>>>Column 7: "strand"
>>>>>
>>>>>The strand of the feature.  + for positive strand (relative to the
>>>>>landmark), - for minus strand, and . for features that are not
>>>>>stranded.  In addition, ? can be used for features whose strandedness
>>>>>is relevant, but unknown.
>>>>>
>>>>>Column 8: "phase"
>>>>>
>>>>>The phase of the feature, for protein-encoding features (primarily
>>>>>CDSs).  This is an integer-valued field with the values 0, 1, or 2.
>>>>>The integer indicates the offset from the start of the feature to the
>>>>>first base of the first codon in the reading frame.  "." is used for
>>>>>features that do not correspond to a reading frame.
>>>>>
>>>>>Column 9: "group"
>>>>>
>>>>>A list of the immediate parents of the current feature.  Multiple
>>>>>parents are allowed (example: one exon shared by multiple
>>>>>transcripts). Multiple parents are separated by a semicolon.
>>>>>Parentless features have a dot in this field.
>>>>>
>>>>>Column 10: "attributes"
>>>>>
>>>>>A list of feature attributes in the format tag=value.  Multiple
>>>>>tag=value pairs are separated by semicolons.  URL escaping rules are
>>>>>used for tags or values containing whitespace, "=" characters and
>>>>>semicolons.
>>>>>
>>>>>Two tags are special:
>>>>>
>>>>>   ID	 Indicates the name of the feature.  IDs must be unique
>>>>>	 within the scope of the GFF file.
>>>>>
>>>>>   Target Indicates the target of a nucleotide to nucleotide or
>>>>>	   nucleotide to protein alignment.  The format of the
>>>>>	   value is "target_id:start..end"  Start may be greater
>>>>>	   than end to indicate a + strand alignment to the
>>>>>	   reverse complement of a target nucleotide sequence.
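
The column layout described above could be read with something like the
following Perl sketch.  It is not the Bio::Tools::GFF implementation;
parse_gff3_line and unescape are hypothetical names.  It assumes the
layout as first proposed (group in column 9, attributes in column 10),
splits on runs of whitespace (safe only because whitespace inside a
field must be escaped), treats missing trailing columns as ".", and
keeps only the last value if a tag is repeated.

```perl
use strict;
use warnings;

# Hypothetical parser for one line of the proposed 10-column layout.
sub unescape {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

sub parse_gff3_line {
    my ($line) = @_;
    chomp $line;
    my @col = split ' ', $line;          # split on runs of whitespace
    push @col, '.' while @col < 10;      # pad missing trailing columns

    my %feature;
    @feature{qw(seqid source type start end score strand phase)} = @col[0 .. 7];

    # Column 9: semicolon-separated parent IDs, or "." for none.
    $feature{parents} =
        $col[8] eq '.' ? [] : [ map { unescape($_) } split /;/, $col[8] ];

    # Column 10: semicolon-separated tag=value pairs, or "." for none.
    my %attr;
    if ($col[9] ne '.') {
        for my $pair (split /;/, $col[9]) {
            my ($tag, $value) = split /=/, $pair, 2;
            $attr{ unescape($tag) } = unescape($value // '');
        }
    }
    $feature{attributes} = \%attr;

    # Target has the form target_id:start..end; start may exceed end.
    if (defined $attr{Target} && $attr{Target} =~ /^(.+):(\d+)\.\.(\d+)$/) {
        $feature{target} = { id => $1, start => $2, end => $3 };
    }
    return \%feature;
}

my $f = parse_gff3_line('I wormbase match 5001 6000 100 . . . ID=abc;Target=M1:1..1000');
print "$f->{attributes}{ID} aligns to $f->{target}{id} ",
      "$f->{target}{start}-$f->{target}{end}\n";
```
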
>>>>>
>>>>>In the example GFF3 file given below, the first column contains line
>>>>>numbers that I have added for the purposes of the narrative.  Here are
>>>>>some common scenarios that I have attempted to illustrate:
>>>>>
>>>>>A) a simple feature, no public ID
>>>>>
>>>>>Line 2 in the example is a feature of type "repeat". It has a start
>>>>>and an end and no ID, but it does have an attribute named "Note."
>>>>>
>>>>>B) a simple feature with a public ID
>>>>>
>>>>>Line 3 is a feature of type clone.  It has a start and an end.  Its
>>>>>parent is undefined (empty column 9), but it has an attribute of type
>>>>>ID with value "cTel33B."
>>>>>
>>>>>C) a feature with multiple attributes
>>>>>
>>>>>Line 5 is a feature of type "gene."  It has no parent, and has
>>>>>attributes of type ID, Note, and GO_term.
>>>>>
>>>>>D) a hierarchical grouping of features
>>>>>
>>>>>Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
>>>>>line 5, which defines the extent of a "gene" with ID Y74C9A.1.  Below
>>>>>this are two features of type mRNA (lines 6 and 7).  Their group
>>>>>fields contain the ID of Y74C9A.1, indicating that this feature is
>>>>>their immediate parent.  In the 10th column, the mRNA features have
>>>>>their own IDs independent of the ID of the parent gene.
>>>>>
>>>>>This pattern is repeated for the exons listed on lines 8-11.  Exons
>>>>>e1, e2, and e4 belong to both of the transcripts.  Therefore, both
>>>>>transcript IDs are listed in the group column, separated by
>>>>>semicolons.
>>>>>
>>>>>Exon e3 belongs only to one of the transcripts, and therefore only
>>>>>that transcript's ID is listed in the group column.
>>>>>
>>>>>Lines 12 and 13 indicate coding_start and coding_end features.  These
>>>>>subfeatures are hierarchically grouped underneath their corresponding
>>>>>exons, but they do not have independent public IDs.
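
The parent/child grouping in scenario D can be sketched in Perl as an
index from parent ID to child records, followed by a depth-first walk.
The record layout and the names index_children and outline are
hypothetical; only a subset of the example features (the gene, both
mRNAs, and exons e1 and e3) is included.

```perl
use strict;
use warnings;

# Each record is [ID, type, [parent IDs]], taken from the example file.
my @records = (
    [ 'Y74C9A.1',  'gene', [] ],
    [ 'Y74C9A.1a', 'mRNA', ['Y74C9A.1'] ],
    [ 'Y74C9A.1b', 'mRNA', ['Y74C9A.1'] ],
    [ 'e1',        'exon', [ 'Y74C9A.1a', 'Y74C9A.1b' ] ],
    [ 'e3',        'exon', ['Y74C9A.1b'] ],
);

# Map each parent ID to the records that list it in their group column.
# A record with two parents is filed under both.
sub index_children {
    my (@recs) = @_;
    my %children;
    for my $rec (@recs) {
        push @{ $children{$_} }, $rec for @{ $rec->[2] };
    }
    return \%children;
}

# Return an indented outline (one string per feature), depth-first.
sub outline {
    my ($rec, $children, $depth) = @_;
    $depth ||= 0;
    my $indent = "\t" x $depth;
    my @lines  = ("$indent$rec->[0]\t$rec->[1]");
    for my $child (@{ $children->{ $rec->[0] } || [] }) {
        push @lines, outline($child, $children, $depth + 1);
    }
    return @lines;
}

my $children = index_children(@records);
for my $top (grep { !@{ $_->[2] } } @records) {
    print "$_\n" for outline($top, $children);
}
```

Because e1 lists both transcripts as parents, it appears once under
each mRNA in the outline.
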
>>>>>
>>>>>E) Disjunct coordinates
>>>>>
>>>>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
>>>>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions.  The group
>>>>>column indicates that the CDS belongs to mRNA Y74C9A.1a.  However, the
>>>>>attribute column assigns each of lines 14-16 the same ID.  Because the
>>>>>ID is the same, this is to be interpreted as a single feature that
>>>>>spans multiple locations.
>>>>>
>>>>>F) Alignments
>>>>>
>>>>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
>>>>>reserved Target attribute.  Each non-gapped segment becomes a line in
>>>>>the GFF3 file.  The segments each share the same ID, thereby
>>>>>indicating that the segments are disjunct regions of the same feature.
>>>>>The Target attribute indicates the ID of the target sequence (which
>>>>>does not have to be represented in the GFF3 file) and the start and
>>>>>end coordinates of the aligned target.
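
The shared-ID convention (several lines, one feature) can be sketched
in Perl as below.  merge_by_id is a hypothetical name, and the rows are
reduced to [ID, start, end] using the values from example lines 17-19.
Segments are sorted by start before being joined, which is a choice
made for this sketch and not something the proposal requires.

```perl
use strict;
use warnings;

# Rows are [ID, start, end], from example lines 17-19.
my @rows = (
    [ '12345.s', 1,   100 ],
    [ '12345.s', 101, 500 ],
    [ '12345.s', 501, 1000 ],
);

# Collect rows that share an ID into a single location string:
# "start..end" for one segment, "join(a..b,c..d,...)" for several.
sub merge_by_id {
    my (@in) = @_;
    my %segments;
    push @{ $segments{ $_->[0] } }, [ $_->[1], $_->[2] ] for @in;

    my %location;
    for my $id (keys %segments) {
        my @sorted = sort { $a->[0] <=> $b->[0] } @{ $segments{$id} };
        my @parts  = map { $_->[0] . '..' . $_->[1] } @sorted;
        $location{$id} =
            @parts > 1 ? 'join(' . join(',', @parts) . ')' : $parts[0];
    }
    return \%location;
}

my $loc = merge_by_id(@rows);
print "$_\t$loc->{$_}\n" for sort keys %$loc;
```
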
>>>>>
>>>>>Unlike the GFF1 and GFF2 formats, the group field for gapped
>>>>>alignments can be empty. However, a valid alternative representation
>>>>>is to create a single "match" feature, and a series of "hsp" features
>>>>>underneath it via the group field.  Lines 20-22 show this alternative
>>>>>representation.
>>>>>
>>>>>G) Relative coordinates
>>>>>
>>>>>Lines 23-26 illustrate using relative coordinate addressing in
>>>>>feature/subfeature relationships.  Line 23 defines an mRNA that is
>>>>>positioned on sequence landmark "I" from positions 5000 to 6000.  Its
>>>>>ID field indicates that it is M7.3.  Lines 24-26 are exon subfeatures
>>>>>of M7.3 as indicated by their group field.  However, the seqid field
>>>>>specifies M7.3 as the parent coordinate system, thereby allowing the
>>>>>exons to begin at position 1.
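
A small Perl sketch of the arithmetic behind relative addressing,
assuming 1-based coordinates and a parent on the + strand.  to_absolute
is a hypothetical helper, not a Bio::Tools::GFF method.

```perl
use strict;
use warnings;

# Map a subfeature range given in the parent's coordinate system
# (1-based) onto the landmark that the parent itself sits on.
# Assumes the parent lies on the + strand.
sub to_absolute {
    my ($parent_start, $rel_start, $rel_end) = @_;
    return ($parent_start + $rel_start - 1, $parent_start + $rel_end - 1);
}

# mRNA M7.3 starts at 5000 on landmark I; its exons are relative to M7.3.
for my $exon ([ 'M7.3.1', 1, 300 ], [ 'M7.3.2', 301, 400 ], [ 'M7.3.3', 401, 1000 ]) {
    my ($id, $rel_start, $rel_end) = @$exon;
    my ($abs_start, $abs_end) = to_absolute(5000, $rel_start, $rel_end);
    print $id, "\t", $abs_start, '..', $abs_end, "\n";
}
```

The subtraction of 1 is needed because relative position 1 coincides
with the parent's own start coordinate.
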
>>>>>
>>>>> 0  ##gff-version 3
>>>>> 1  ##sequence-region I:1..14972282
>>>>> 2  I     wormbase  repeat        5000   5100      .    .  .  .                       Note=ALU3
>>>>> 3  I     wormbase  clone         1      2679      .    +  .  .                       ID=cTel33B
>>>>> 4  I     wormbase  contig        1      14972282  .    +  .  .                       ID=CHROMOSOME_I
>>>>> 5  I     wormbase  gene          43733  44677     .    +  .  .                       ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
>>>>> 6  I     wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1                ID=Y74C9A.1a
>>>>> 7  I     wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1                ID=Y74C9A.1b
>>>>> 8  I     wormbase  exon          43733  43961     .    +  .  Y74C9A.1a;Y74C9A.1b     ID=e1
>>>>> 9  I     wormbase  exon          44030  44234     .    +  .  Y74C9A.1a;T:Y74C9A.1b   ID=e2
>>>>>10  I     wormbase  exon          44281  44328     .    +  .  Y74C9A.1b               ID=e3
>>>>>11  I     wormbase  exon          44521  44677     .    +  .  Y74C9A.1a;T:Y74C9A.1b   ID=e4
>>>>>12  I     wormbase  coding_start  43740  43740     .    +  .  e1
>>>>>13  I     wormbase  coding_end    44677  44677     .    +  .  e4
>>>>>14  I     wormbase  cds           43740  43961     .    +  0  Y74C9A.1a
>>>>>15  I     wormbase  cds           44030  44234     .    +  1  Y74C9A.1a
>>>>>16  I     wormbase  cds           44521  44677     .    +  1  Y74C9A.1a
>>>>>17  I     wormbase  match         1      100       100  .  .  .                       ID=12345.s;Target=cb123:1001..1100
>>>>>18  I     wormbase  match         101    500       20   .  .  .                       ID=12345.s;Target=cb123:1101..1500
>>>>>19  I     wormbase  match         501    1000      80   .  .  .                       ID=12345.s;Target=cb123:1501..2000
>>>>>20  I     wormbase  match         5001   6000      100  .  .  .                       ID=abc;Target=M1:1..1000
>>>>>21  I     wormbase  hsp           5001   5500      .    .  .  abc                     Target=M1:1..500
>>>>>22  I     wormbase  hsp           5501   6000      .    .  .  abc                     Target=M1:501..100
>>>>>23  I     wormbase  mRNA          5000   6000      +    .  .  .                       ID=M7.3
>>>>>24  M7.3  wormbase  exon          1      300       +    .  .  M7.3                    ID=M7.3.1
>>>>>25  M7.3  wormbase  exon          301    400       +    .  .  M7.3                    ID=M7.3.2
>>>>>26  M7.3  wormbase  exon          401    1000      +    .  .  M7.3                    ID=M7.3.3
>>>>>
>>>>>=================================================================
>>>>>
>>>>>I have extended (in an experimental way), the Bio::Tools::GFF module
>>>>>to accommodate this new format.  Here is a test script and its output
>>>>>when run on the above file.
>>>>>
>>>>>#!/usr/bin/perl -w
>>>>>use strict;
>>>>>use lib '.';
>>>>>
>>>>>use Bio::Tools::GFF;
>>>>>my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
>>>>>my @f = $gffio->features;
>>>>>format_features(\@f);
>>>>>
>>>>>sub format_features {
>>>>>  my $features = shift;
>>>>>  my $tabs     = shift || 0;
>>>>>  for my $f (@$features) {
>>>>>    my $type  = $f->primary_tag;
>>>>>    my $id    = $f->unique_id;
>>>>>    $id       ||= '(no id)';
>>>>>    my ($start,$end) = ($f->start,$f->end);
>>>>>    my $alt = ($f->alternative_locations)[0];
>>>>>    my ($target,$tstart,$tend) =
>>>>>      ($alt->seq_id,$alt->start,$alt->end) if $alt;
>>>>>
>>>>>    print "\t"x$tabs,
>>>>>          join("\t",$id,$type,$f->location->to_FTstring,
>>>>>               eval{$alt->location->seq_id,$alt->location->to_FTstring}),
>>>>>          "\n";
>>>>>    format_features([$f->sub_SeqFeature],$tabs+1);
>>>>>  }
>>>>>}
>>>>>
>>>>>1;
>>>>>
>>>>>OUTPUT:
>>>>>
>>>>>cTel33B	clone	1..2679
>>>>>CHROMOSOME_I	contig	1..14972282
>>>>>12345.s	match	join(101..500,1..100,501..1000)
>>>>>M7.3	mRNA	5000..6000
>>>>>	M7.3.1	exon	5000..5299
>>>>>	M7.3.2	exon	5300..5399
>>>>>	M7.3.3	exon	5400..5999
>>>>>abc	match	5001..6000
>>>>>	(no id)	hsp	5001..5500
>>>>>	(no id)	hsp	5501..6000
>>>>>(no id)	repeat	5000..5100
>>>>>Y74C9A.1	gene	43733..44677
>>>>>	Y74C9A.1a	mRNA	43733..44677
>>>>>		e1	exon	43733..43961
>>>>>			(no id)	coding_start	43740
>>>>>		e2	exon	44030..44234
>>>>>		e4	exon	44521..44677
>>>>>			(no id)	coding_end	44677
>>>>>		(no id)	cds	43740..43961
>>>>>		(no id)	cds	44030..44234
>>>>>		(no id)	cds	44521..44677
>>>>>	Y74C9A.1b	mRNA	43733..44677
>>>>>		e1	exon	43733..43961
>>>>>			(no id)	coding_start	43740
>>>>>		e3	exon	44281..44328
>>>>
>