[Bioperl-l] Re: Proposed GFF version 3

Lincoln Stein lstein at cshl.org
Tue Feb 11 09:21:21 EST 2003


The important thing to me is to be able to preserve some backward 
compatibility with GFF2.  I don't think it will make much of a difference 
which order the two columns fall in because some people used column 9 for 
grouping and others for attributes.  How about calling column 10 "parents"?

I went to URL format mostly because Perl parsing will be a lot faster (Perl 
likes regular expressions, but those don't play well with shell-style quote 
and backslashing rules).  The official URL standard uses the semicolon.  The 
very earliest CGI specification used ampersands, but this was abandoned about 
five years ago when people realized that this violated the HTML spec: 
ampersands must be escaped, so the correct way to write ampersanded 
parameter lists is:

	<a href="/cgi-bin/foo?first=a&amp;second=b&amp;third=c">

I'm surprised to hear that Ensembl uses ampersands in its URLs.  I bet their 
pages don't validate against the XHTML validators.

Lincoln


On Tuesday 11 February 2003 07:54 am, Richard Durbin wrote:
> Swap them entirely.  i.e. put the attributes in column 9 and call that
> "attributes" and put the new hierarchical group term in column 10 and
> call that "group".  Or perhaps it would be better to call it something
> else to minimise confusion, because in gff version 1 column 9 was called
> group.  What about calling column 10 "cluster"?
>
> I see you have switched to URL type format for the attributes, away from
> acedb.  That's fine - URL format is much more universal.  But is ';' a
> standard separator in URLS?  I just looked and see that Ensembl uses '&'
> and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
> there is no standard.
>
> Richard
>
> Lincoln Stein wrote:
> > Hi Richard,
> >
> > Do you mean that we should swap columns 9 and 10 entirely, or just swap
> > their names?  I think you mean the former, but I want to be sure.
> >
> > Lincoln
> >
> > On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
> >>Hello all,
> >>
> >>This looks very nice to me.  Not surprising perhaps because I had an
> >>earlier involvement as Lincoln says.
> >>
> >>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it is
> >>the "official" GFF mailing list, although it is very little used.
> >>
> >>I have one major comment, that columns 9 (group) and 10 (attributes)
> >>should be switched.  Although GFF version 1 column 9 was called "group",
> >>in version 2, which has been current for over two years, this column
> >>was renamed "attribute" and contains the attribute information.  For
> >>consistency we should keep column 9 for the attributes.  Also, in many
> >>cases there will be attributes but no group.
> >>
> >>I like ID and Target.  I see the idea with hsp's for gapped alignments,
> >>though perhaps they could be called "match_block".  But there is a case
> >>I think to also encode gapped alignments on one line, perhaps using the
> >>CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
> >>
> >>		Target=M1:1..1000;Align=xxxxxxx
> >>
> >>(sorry, I don't know CIGAR format well enough to write a legal string.)
> >>
> >>Richard
> >>
> >>Lincoln Stein wrote:
> >>>This letter is to discuss a proposed extension to GFF.  It arises from
> >>>conversations with Richard Durbin during last fall's Hinxton genome
> >>>informatics meeting.
> >>>
> >>>Although there are many richer ways of representing genomic features
> >>>via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> >>>flat file formats declares the bioinformatics community's need for a
> >>>simple format that can be modified with a text editor and processed
> >>>with shell tools like grep.  The GFF format, although widely used, has
> >>>fragmented into multiple incompatible dialects.  When asked why they
> >>>have modified the published Sanger specification, bioinformaticists
> >>>frequently answer that the format was insufficient for their needs,
> >>>and they needed to extend it.  The proposed GFF3 format addresses the
> >>>most common extensions to GFF, while preserving backward compatibility
> >>>with previous formats. The new format:
> >>>
> >>>    1) adds a mechanism for representing more than one level
> >>>       of hierarchical grouping of features and subfeatures.
> >>>    2) separates the ideas of group membership and feature name/id
> >>>    3) constrains the feature type field to be taken from a controlled
> >>>       vocabulary.
> >>>    4) allows a single feature, such as an exon, to belong to more than
> >>>       one group at a time.
> >>>    5) provides one level of relative addressing for subfeatures (e.g.
> >>>       exons can be expressed in transcript coordinates)
> >>>    6) defines an explicit convention for pairwise alignments
> >>>    7) defines an explicit convention for features that occupy disjunct
> >>>       regions
> >>>The format consists of 10 columns, separated by spaces.  The following
> >>>unescaped characters are allowed within fields:
> >>>[a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
> >>>using the URL escaping conventions.  Unescaped quotation marks,
> >>>backslashes and other ad-hoc escaping conventions that have been added
> >>>to the GFF format are explicitly forbidden.  The =, ; and % characters
> >>>have reserved meanings as described below.
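
A minimal sketch of the escaping rule, in Python and under stated assumptions: the helper names escape_field and unescape_field are hypothetical, and I am assuming that a literal "=", ";" or "%" inside a single tag or value should itself be percent-encoded because those characters are reserved.

```python
import re
from urllib.parse import unquote

# Everything outside the proposal's unescaped set [a-zA-Z0-9.:;=%^*$@!+_?-]
# is percent-encoded.  '=', ';' and '%' are reserved, so this sketch also
# encodes them when they occur literally inside one tag, value, or ID.
_NEEDS_ESCAPE = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?-]")


def escape_field(text):
    """Percent-encode one tag, value, or ID (hypothetical helper)."""
    def encode(match):
        # Encode each UTF-8 byte of the offending character as %XX.
        return "".join("%{:02X}".format(b) for b in match.group().encode("utf-8"))
    return _NEEDS_ESCAPE.sub(encode, text)


def unescape_field(text):
    """Reverse escape_field using standard URL unescaping."""
    return unquote(text)


print(escape_field("unc 3;x=1"))   # prints unc%203%3Bx%3D1
```

Quotation marks and backslashes need no special case here: they fall outside the allowed set and are encoded like any other character.
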
> >>>
> >>>Undefined fields are replaced with the "." character, as described in
> >>>the original GFF spec.
> >>>
> >>>Column 1: "seqid"
> >>>
> >>>The ID of the landmark used to establish the coordinate system for the
> >>>current feature.  IDs must contain alphanumeric characters.
> >>>Whitespace, if present, must be escaped using URL escaping rules
> >>>(e.g. space="%20").
> >>>
> >>>Column 2: "source"
> >>>
> >>>The source of the feature.  This is unchanged from the older GFF specs
> >>>and is not part of a controlled vocabulary.
> >>>
> >>>Column 3: "type"
> >>>
> >>>The type of the feature (previously called the "method").  This is
> >>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
> >>>accession number.  The latter alternative is distinguished using the
> >>>syntax SOFA:000000.
> >>>
> >>>Columns 4 & 5: "start" and "end"
> >>>
> >>>The start and end of the feature, in 1-based integer coordinates,
> >>>relative to the landmark given in column 1.  Start is less than or
> >>>equal to end.
> >>>
> >>>Column 6: "score"
> >>>
> >>>The score of the feature, a floating point number.  As in earlier
> >>>versions of the format, the semantics of the score are ill-defined.
> >>>It is strongly recommended that E-values be used for sequence
> >>>similarity features, and that P-values be used for ab initio gene
> >>>prediction features.
> >>>
> >>>Column 7: "strand"
> >>>
> >>>The strand of the feature.  + for positive strand (relative to the
> >>>landmark), - for minus strand, and . for features that are not
> >>>stranded.  In addition, ? can be used for features whose strandedness
> >>>is relevant, but unknown.
> >>>
> >>>Column 8: "phase"
> >>>
> >>>The phase of the feature, for protein-encoding features (primarily
> >>>CDSs).  This is an integer-valued field with the values 0, 1, or 2.
> >>>The integer indicates the offset from the start of the feature to the
> >>>first base of the first codon in the reading frame.  "." is used for
> >>>features that do not correspond to a reading frame.
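
To make the phase definition concrete, here is a small Python sketch that derives the phase of the next coding segment from the phase and length of the current one. The function name next_phase is hypothetical, and the sketch assumes the segments are visited in the direction of translation; the lengths in the usage lines are made-up numbers, not taken from the example file.

```python
def next_phase(phase, length):
    """Phase of the segment that follows one with this phase and length.

    `phase` is the number of bases to skip before the first complete codon
    of the current segment; `length` is end - start + 1.
    """
    # Bases left over at the end of this segment, forming an unfinished codon.
    leftover = (length - phase) % 3
    # The next segment supplies the rest of that codon before its first
    # complete codon begins.
    return (3 - leftover) % 3


print(next_phase(0, 10))   # prints 2
print(next_phase(2, 10))   # prints 1
```
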
> >>>
> >>>Column 9: "group"
> >>>
> >>>A list of the immediate parents of the current feature.  Multiple
> >>>parents are allowed (example: one exon shared by multiple
> >>>transcripts). Multiple parents are separated by a semicolon.
> >>>Parentless features have a dot in this field.
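
A rough Python sketch of how a reader might turn the group column into a parent-to-children index. The names parse_parents and index_children are hypothetical, and for brevity the input is a list of (ID, group field) pairs rather than full GFF lines; the IDs are the ones used in the example file.

```python
from collections import defaultdict
from urllib.parse import unquote


def parse_parents(group_field):
    """Split the group column into a list of parent IDs ('.' means none)."""
    if group_field == ".":
        return []
    # Split on the separator first, then undo URL escaping on each ID.
    return [unquote(p) for p in group_field.split(";") if p]


def index_children(records):
    """Map each parent ID to the IDs of its immediate children.

    `records` is an iterable of (feature_id, group_field) pairs.
    """
    children = defaultdict(list)
    for feature_id, group_field in records:
        for parent in parse_parents(group_field):
            children[parent].append(feature_id)
    return dict(children)


records = [
    ("Y74C9A.1", "."),
    ("Y74C9A.1a", "Y74C9A.1"),
    ("Y74C9A.1b", "Y74C9A.1"),
    ("e1", "Y74C9A.1a;Y74C9A.1b"),   # one exon shared by both transcripts
    ("e3", "Y74C9A.1b"),
]
print(index_children(records))
```

A shared exon simply appears under each of its parents, so nothing has to be duplicated in the file itself.
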
> >>>
> >>>Column 10: "attributes"
> >>>
> >>>A list of feature attributes in the format tag=value.  Multiple
> >>>tag=value pairs are separated by semicolons.  URL escaping rules are
> >>>used for tags or values containing whitespace, "=" characters and
> >>>semicolons.
> >>>
> >>>Two tags are special:
> >>>
> >>>    ID	 Indicates the name of the feature.  IDs must be unique
> >>>	 within the scope of the GFF file.
> >>>
> >>>    Target Indicates the target of a nucleotide to nucleotide or
> >>>	   nucleotide to protein alignment.  The format of the
> >>>	   value is "target_id:start..end".  Start may be greater
> >>>	   than end to indicate a + strand alignment to the
> >>>	   reverse complement of a target nucleotide sequence.
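
A minimal Python sketch of reading the attributes column, including the Target tag. The function names are hypothetical. Two choices here are my assumptions and not part of the proposal: a repeated tag keeps its last value, and the target ID is split off at the last colon so that IDs containing colons survive.

```python
from urllib.parse import unquote


def parse_attributes(field):
    """Turn 'tag=value;tag=value' into a dict ('.' means no attributes)."""
    attrs = {}
    if field == ".":
        return attrs
    for pair in field.split(";"):
        if not pair:
            continue
        # Split on the reserved characters first, unescape afterwards, so an
        # escaped ';' or '=' inside a value cannot be mistaken for a separator.
        tag, _, value = pair.partition("=")
        attrs[unquote(tag)] = unquote(value)
    return attrs


def parse_target(value):
    """Split 'target_id:start..end' into (target_id, start, end)."""
    target_id, _, span = value.rpartition(":")
    start, end = span.split("..")
    # start may exceed end: that marks a match to the reverse complement.
    return target_id, int(start), int(end)


attrs = parse_attributes("ID=12345.s;Target=cb123:1001..1100")
print(attrs["ID"], parse_target(attrs["Target"]))
```

The point of the URL-style format is visible in the parser: two plain splits and an unescape, with no quote or backslash state to track.
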
> >>>
> >>>In the example GFF3 file given below, the first column contains line
> >>>numbers that I have added for the purposes of the narrative.  Here are
> >>>some common scenarios that I have attempted to illustrate:
> >>>
> >>>A) a simple feature, no public ID
> >>>
> >>>Line 2 in the example is a feature of type "repeat". It has a start
> >>>and an end and no ID, but it does have an attribute named "Note."
> >>>
> >>>B) a simple feature with a public ID
> >>>
> >>>Line 3 is a feature of type clone.  It has a start and an end.  Its
> >>>parent is undefined (empty column 9), but it has an attribute of type
> >>>ID with value "cTel33B."
> >>>
> >>>C) a feature with multiple attributes
> >>>
> >>>Line 5 is a feature of type "gene."  It has no parent, and has
> >>>attributes of type ID, Note, and GO_term.
> >>>
> >>>D) a hierarchical grouping of features
> >>>
> >>>Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
> >>>line 5, which defines the extent of a "gene" with ID Y74C9A.1.  Below
> >>>this are two features of type mRNA (lines 6 and 7).  Their group
> >>>fields contain the ID of Y74C9A.1, indicating that this feature is
> >>>their immediate parent.  In the 10th column, the mRNA features have
> >>>their own IDs independent of the ID of the parent gene.
> >>>
> >>>This pattern is repeated for the exons listed on lines 8-11.  Exons
> >>>e1, e2, and e4 belong to both of the transcripts.  Therefore, both
> >>>transcript IDs are listed in the group column, separated by
> >>>semicolons.
> >>>
> >>>Exon e3 belongs only to one of the transcripts, and therefore only
> >>>that transcript's ID is listed in the group column.
> >>>
> >>>Lines 12 and 13 indicate coding_start and coding_end features.  These
> >>>subfeatures are hierarchically grouped underneath their corresponding
> >>>exons, but they do not have independent public IDs.
> >>>
> >>>E) Disjunct coordinates
> >>>
> >>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
> >>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions.  The group
> >>>column indicates that the CDS belongs to mRNA Y74C9A.1a.  However, the
> >>>attribute column assigns each of lines 14-16 the same ID.  Because the
> >>>ID is the same, this is to be interpreted as a single feature that
> >>>spans multiple locations.
> >>>
> >>>F) Alignments
> >>>
> >>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
> >>>reserved Target attribute.  Each non-gapped segment becomes a line in
> >>>the GFF3 file.  The segments each share the same ID, thereby
> >>>indicating that the segments are disjunct regions of the same feature.
> >>>The Target attribute indicates the ID of the target sequence (which
> >>>does not have to be represented in the GFF3 file) and the start and
> >>>end coordinates of the aligned target.
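
The shared-ID convention used for disjunct features and gapped alignments can be sketched in Python as a simple grouping step. The name merge_by_id is hypothetical, and the input is reduced to (ID, start, end) tuples; the coordinates are those of the three "12345.s" match segments in the example file.

```python
from collections import defaultdict


def merge_by_id(rows):
    """Collect all segments that share an ID into one multi-location feature.

    `rows` is an iterable of (feature_id, start, end) tuples; the result maps
    each ID to its segments sorted by start coordinate.
    """
    spans = defaultdict(list)
    for feature_id, start, end in rows:
        spans[feature_id].append((start, end))
    return {fid: sorted(parts) for fid, parts in spans.items()}


rows = [
    ("12345.s", 101, 500),
    ("12345.s", 1, 100),
    ("12345.s", 501, 1000),
]
print(merge_by_id(rows))   # prints {'12345.s': [(1, 100), (101, 500), (501, 1000)]}
```

Lines without an ID would need separate handling, since each of them is a feature of its own.
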
> >>>
> >>>Unlike the GFF1 and GFF2 formats, the group field for gapped
> >>>alignments can be empty. However, a valid alternative representation
> >>>is to create a single "match" feature, and a series of "hsp" features
> >>>underneath it via the group field.  Lines 20-22 show this alternative
> >>>representation.
> >>>
> >>>G) Relative coordinates
> >>>
> >>>Lines 23-26 illustrate using relative coordinate addressing in
> >>>feature/subfeature relationships.  Line 23 defines an mRNA that is
> >>>positioned on sequence landmark "I" from positions 5000 to 6000.  Its
> >>>ID field indicates that it is M7.3.  Lines 24-26 are exon subfeatures
> >>>of M7.3 as indicated by their group field.  However, the seqid field
> >>>specifies M7.3 as the parent coordinate system, thereby allowing the
> >>>exons to begin at position 1.
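
A small Python sketch of the coordinate arithmetic this implies, assuming a forward-strand parent and 1-based inclusive coordinates throughout. The function name to_absolute is hypothetical.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Convert 1-based coordinates relative to a parent into landmark coordinates.

    Position 1 in the parent's coordinate system is the parent's own start,
    so the shift is parent_start - 1.  Forward strand only.
    """
    offset = parent_start - 1
    return rel_start + offset, rel_end + offset


# M7.3 starts at 5000 on landmark I; its exons are given relative to M7.3.
for rel_start, rel_end in [(1, 300), (301, 400), (401, 1000)]:
    print(to_absolute(5000, rel_start, rel_end))
```

The three results, 5000..5299, 5300..5399 and 5400..5999, agree with the exon locations in the test script OUTPUT listing.
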
> >>>
> >>>  0  ##gff-version 3
> >>>  1  ##sequence-region I:1..14972282
> >>>  2  I	wormbase	repeat	5000	5100	.	.	.	.	Note=ALU3
> >>>  3  I	wormbase	clone	1	2679	.	+	.	.	ID=cTel33B
> >>>  4  I	wormbase	contig	1	14972282	.	+	.	.	ID=CHROMOSOME_I
> >>>  5  I	wormbase	gene	43733	44677	.	+	.	.	ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
> >>>  6  I	wormbase	mRNA	43733	44677	.	+	.	Y74C9A.1	ID=Y74C9A.1a
> >>>  7  I	wormbase	mRNA	43733	44677	.	+	.	Y74C9A.1	ID=Y74C9A.1b
> >>>  8  I	wormbase	exon	43733	43961	.	+	.	Y74C9A.1a;Y74C9A.1b	ID=e1
> >>>  9  I	wormbase	exon	44030	44234	.	+	.	Y74C9A.1a;T:Y74C9A.1b	ID=e2
> >>> 10  I	wormbase	exon	44281	44328	.	+	.	Y74C9A.1b	ID=e3
> >>> 11  I	wormbase	exon	44521	44677	.	+	.	Y74C9A.1a;T:Y74C9A.1b	ID=e4
> >>> 12  I	wormbase	coding_start	43740	43740	.	+	.	e1
> >>> 13  I	wormbase	coding_end	44677	44677	.	+	.	e4
> >>> 14  I	wormbase	cds	43740	43961	.	+	0	Y74C9A.1a
> >>> 15  I	wormbase	cds	44030	44234	.	+	1	Y74C9A.1a
> >>> 16  I	wormbase	cds	44521	44677	.	+	1	Y74C9A.1a
> >>> 17  I	wormbase	match	1	100	100	.	.	.	ID=12345.s;Target=cb123:1001..1100
> >>> 18  I	wormbase	match	101	500	20	.	.	.	ID=12345.s;Target=cb123:1101..1500
> >>> 19  I	wormbase	match	501	1000	80	.	.	.	ID=12345.s;Target=cb123:1501..2000
> >>> 20  I	wormbase	match	5001	6000	100	.	.	.	ID=abc;Target=M1:1..1000
> >>> 21  I	wormbase	hsp	5001	5500	.	.	.	abc	Target=M1:1..500
> >>> 22  I	wormbase	hsp	5501	6000	.	.	.	abc	Target=M1:501..100
> >>> 23  I	wormbase	mRNA	5000	6000	+	.	.	.	ID=M7.3
> >>> 24  M7.3	wormbase	exon	1	300	+	.	.	M7.3	ID=M7.3.1
> >>> 25  M7.3	wormbase	exon	301	400	+	.	.	M7.3	ID=M7.3.2
> >>> 26  M7.3	wormbase	exon	401	1000	+	.	.	M7.3	ID=M7.3.3
> >>>
> >>>=================================================================
> >>>
> >>>I have extended (in an experimental way), the Bio::Tools::GFF module
> >>>to accommodate this new format.  Here is a test script and its output
> >>>when run on the above file.
> >>>
> >>>  #!/usr/bin/perl -w
> >>>  use strict;
> >>>  use lib '.';
> >>>
> >>>  use Bio::Tools::GFF;
> >>>  my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
> >>>  my @f = $gffio->features;
> >>>  format_features(\@f);
> >>>
> >>>  sub format_features {
> >>>    my $features = shift;
> >>>    my $tabs     = shift || 0;
> >>>    for my $f (@$features) {
> >>>      my $type  = $f->primary_tag;
> >>>      my $id    = $f->unique_id;
> >>>      $id       ||= '(no id)';
> >>>      my ($start,$end) = ($f->start,$f->end);
> >>>      my $alt = ($f->alternative_locations)[0];
> >>>      my ($target,$tstart,$tend) =
> >>>        ($alt->seq_id,$alt->start,$alt->end) if $alt;
> >>>
> >>>      print "\t"x$tabs,
> >>>        join("\t",$id,$type,$f->location->to_FTstring,
> >>>             eval{$alt->location->seq_id,$alt->location->to_FTstring}),
> >>>        "\n";
> >>>      format_features([$f->sub_SeqFeature],$tabs+1);
> >>>    }
> >>>  }
> >>>
> >>>  1;
> >>>
> >>>OUTPUT:
> >>>
> >>>cTel33B	clone	1..2679
> >>>CHROMOSOME_I	contig	1..14972282
> >>>12345.s	match	join(101..500,1..100,501..1000)
> >>>M7.3	mRNA	5000..6000
> >>>	M7.3.1	exon	5000..5299
> >>>	M7.3.2	exon	5300..5399
> >>>	M7.3.3	exon	5400..5999
> >>>abc	match	5001..6000
> >>>	(no id)	hsp	5001..5500
> >>>	(no id)	hsp	5501..6000
> >>>(no id)	repeat	5000..5100
> >>>Y74C9A.1	gene	43733..44677
> >>>	Y74C9A.1a	mRNA	43733..44677
> >>>		e1	exon	43733..43961
> >>>			(no id)	coding_start	43740
> >>>		e2	exon	44030..44234
> >>>		e4	exon	44521..44677
> >>>			(no id)	coding_end	44677
> >>>		(no id)	cds	43740..43961
> >>>		(no id)	cds	44030..44234
> >>>		(no id)	cds	44521..44677
> >>>	Y74C9A.1b	mRNA	43733..44677
> >>>		e1	exon	43733..43961
> >>>			(no id)	coding_start	43740
> >>>		e3	exon	44281..44328

-- 
Lincoln Stein
lstein at cshl.org
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)


