[Bioperl-l] Re: Proposed GFF version 3

Tue Feb 11 10:40:58 EST 2003

Ah, I see.  I'm looking at a lot of old-fashioned sites then.

----- Original Message -----
From: "Lincoln Stein" <lstein at cshl.org>
To: "Richard Durbin" <rd at sanger.ac.uk>
Cc: <bioperl-l at bioperl.org>; <suzi at fruitfly.org>; <gff-list at sanger.ac.uk>
Sent: Tuesday, February 11, 2003 6:21 AM
Subject: Re: Proposed GFF version 3

> The important thing to me is to be able to preserve some backward
> compatibility with GFF2.  I don't think it will make much of a difference
> which order the two columns fall in because some people used column 9 for
> grouping and others for attributes.  How about calling column 10
> "parents"?
>
> I went to URL format mostly because Perl parsing will be a lot faster
> (Perl likes regular expressions, but those don't play well with
> shell-style quote and backslashing rules).  The official URL standard
> uses the semicolon.  The very earliest CGI specification used
> ampersands, but this was abandoned about five years ago when people
> realized that this violated the HTML spec: ampersands must be escaped,
> so the correct way to write ampersanded parameter lists is:
>
> <a href="/cgi-bin/foo?first=a&amp;second=b&amp;third=c">
>
> I'm surprised to hear that Ensembl uses ampersands in its URLs.  I bet
> their pages don't validate against the XHTML validators.
>
> Lincoln
>
>
> On Tuesday 11 February 2003 07:54 am, Richard Durbin wrote:
> > Swap them entirely.  i.e. put the attributes in column 9 and call that
> > "attributes" and put the new hierarchical group term in column 10 and
> > call that "group".  Or perhaps it would be better to call it something
> > else to minimise confusion, because in gff version 1 column 9 was called
> > group.  What about calling column 10 "cluster"?
> >
> > I see you have switched to URL type format for the attributes, away from
> > acedb.  That's fine - URL format is much more universal.  But is ';' a
> > standard separator in URLS?  I just looked and see that Ensembl uses '&'
> > and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
> > there is no standard.
> >
> > Richard
> >
> > Lincoln Stein wrote:
> > > Hi Richard,
> > >
> > > > Do you mean that we should swap columns 9 and 10 entirely, or just
> > > > swap their names?  I think you mean the former, but I want to be sure.
> > >
> > > Lincoln
> > >
> > > On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
> > >>Hello all,
> > >>
> > >>This looks very nice to me.  Not surprising perhaps because I had an
> > >>earlier involvement as Lincoln says.
> > >>
> > > >>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it
> > > >>is the "official" GFF mailing list, although it is very little used.
> > >>
> > > >>I have one major comment, that columns 9 (group) and 10 (attributes)
> > > >>should be switched.  Although in GFF version 1 column 9 was called
> > > >>"group", in version 2, which is what has been current for over two
> > > >>years, this was renamed "attribute" and contains the attribute
> > > >>information.  For consistency we should keep column 9 for the
> > > >>attributes.  Also, in many cases there will be attributes but no group.
> > >>
> > > >>I like ID and Target.  I see the idea with hsp's for gapped
> > > >>alignments, though perhaps they could be called "match_block".  But
> > > >>there is a case I think to also encode gapped alignments on one line,
> > > >>perhaps using the CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
> > >>
> > >> Target=M1:1..1000;Align=xxxxxxx
> > >>
> > > >>(sorry I don't know cigar format well enough to write a legal string.)
> > >>
> > >>Richard
> > >>
> > >>Lincoln Stein wrote:
> > > >>>This letter is to discuss a proposed extension to GFF.  It arises
> > > >>>from conversations with Richard Durbin during last fall's Hinxton
> > > >>>genome informatics meeting.
> > >>>
> > > >>>Although there are many richer ways of representing genomic features
> > > >>>via XML, the stubborn persistence of a variety of ad-hoc
> > > >>>tab-delimited flat file formats declares the bioinformatics
> > > >>>community's need for a simple format that can be modified with a
> > > >>>text editor and processed with shell tools like grep.  The GFF
> > > >>>format, although widely used, has fragmented into multiple
> > > >>>incompatible dialects.  When asked why they have modified the
> > > >>>published Sanger specification, bioinformaticists frequently answer
> > > >>>that the format was insufficient for their needs, and they needed to
> > > >>>extend it.  The proposed GFF3 format addresses the most common
> > > >>>extensions to GFF, while preserving backward compatibility with
> > > >>>previous formats. The new format:
> > >>>
> > > >>>    1) adds a mechanism for representing more than one level
> > > >>>       of hierarchical grouping of features and subfeatures.
> > > >>>    2) separates the ideas of group membership and feature name/id
> > > >>>    3) constrains the feature type field to be taken from a
> > > >>>       controlled vocabulary.
> > > >>>    4) allows a single feature, such as an exon, to belong to more
> > > >>>       than one group at a time.
> > > >>>    5) provides one level of relative addressing for subfeatures
> > > >>>       (e.g. exons can be expressed in transcript coordinates)
> > > >>>    6) defines an explicit convention for pairwise alignments
> > > >>>    7) defines an explicit convention for features that occupy
> > > >>>       disjunct regions
> > >>>
> > > >>>The format consists of 10 columns, separated by spaces.  The
> > > >>>following unescaped characters are allowed within fields:
> > > >>>[a-zA-Z0-9.:;=%^*$@!+_?-].  All other characters must be escaped
> > > >>>using the URL escaping conventions.  Unescaped quotation marks,
> > > >>>backslashes and other ad-hoc escaping conventions that have been
> > > >>>added to the GFF format are explicitly forbidden.  The =, ; and %
> > > >>>characters have reserved meanings as described below.
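
A minimal sketch of the escaping rule just described, written in Python rather than the thread's Perl. The name `escape_field` is hypothetical, and two choices are assumptions not stated in the proposal: the reserved characters "=", ";" and "%" are escaped whenever they occur inside a value, and non-ASCII text is encoded as UTF-8 before percent-encoding.

```python
import re
from urllib.parse import unquote

# Characters the proposal allows unescaped, minus the reserved "=", ";" and "%"
# (assumption: inside a value, reserved characters must always be escaped).
SAFE_CHAR = re.compile(r"[a-zA-Z0-9.:^*$@!+_?-]")

def escape_field(text):
    """Percent-encode every character outside the safe set (UTF-8 assumed)."""
    pieces = []
    for ch in text:
        if SAFE_CHAR.fullmatch(ch):
            pieces.append(ch)
        else:
            pieces.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(pieces)

escaped = escape_field("unc-3 locus; x=1")
print(escaped)           # unc-3%20locus%3B%20x%3D1
print(unquote(escaped))  # unc-3 locus; x=1
```

Because the result contains no whitespace, quotes or backslashes, a field escaped this way can be split on the column separator without any shell-style quoting logic.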
> > >>>
> > >>>Undefined fields are replaced with the "." character, as described in
> > >>>the original GFF spec.
> > >>>
> > >>>Column 1: "seqid"
> > >>>
> > > >>>The ID of the landmark used to establish the coordinate system for
> > > >>>the current feature.  IDs must contain alphanumeric characters.
> > > >>>Whitespace, if present, must be escaped using URL escaping rules
> > > >>>(e.g. space="%20").
> > >>>
> > >>>Column 2: "source"
> > >>>
> > > >>>The source of the feature.  This is unchanged from the older GFF
> > > >>>specs and is not part of a controlled vocabulary.
> > >>>
> > >>>Column 3: "type"
> > >>>
> > >>>The type of the feature (previously called the "method").  This is
> > >>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
> > >>>accession number.  The latter alternative is distinguished using the
> > >>>syntax SOFA:000000.
> > >>>
> > >>>Columns 4 & 5: "start" and "end"
> > >>>
> > >>>The start and end of the feature, in 1-based integer coordinates,
> > > >>>relative to the landmark given in column 1.  Start is less than or
> > > >>>equal to end.
> > >>>
> > >>>Column 6: "score"
> > >>>
> > >>>The score of the feature, a floating point number.  As in earlier
> > >>>versions of the format, the semantics of the score are ill-defined.
> > >>>It is strongly recommended that E-values be used for sequence
> > >>>similarity features, and that P-values be used for ab initio gene
> > >>>prediction features.
> > >>>
> > >>>Column 7: "strand"
> > >>>
> > >>>The strand of the feature.  + for positive strand (relative to the
> > >>>landmark), - for minus strand, and . for features that are not
> > >>>stranded.  In addition, ? can be used for features whose strandedness
> > >>>is relevant, but unknown.
> > >>>
> > >>>Column 8: "phase"
> > >>>
> > > >>>The phase of the feature, for protein-encoding features (primarily
> > > >>>CDSs).  This is an integer-valued field with the values 0, 1, or 2.
> > > >>>The integer indicates the offset from the start of the feature to the
> > > >>>first base of the first codon in the reading frame.  "." is used for
> > > >>>features that do not correspond to a reading frame.
> > >>>
> > >>>Column 9: "group"
> > >>>
> > >>>A list of the immediate parents of the current feature.  Multiple
> > >>>parents are allowed (example: one exon shared by multiple
> > >>>transcripts). Multiple parents are separated by a semicolon.
> > >>>Parentless features have a dot in this field.
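
A rough sketch of how the group column might be read, in Python rather than the thread's Perl. `parse_parents` and `index_children` are hypothetical names; the rows are taken from lines 5-8 and 10 of the example file, reduced to (ID, group column) pairs.

```python
def parse_parents(column):
    """Return the parent IDs listed in the group column ("." means none)."""
    return [] if column == "." else column.split(";")

def index_children(rows):
    """Map each parent ID to the IDs of its immediate children.

    rows is a list of (feature_id, group_column) pairs.
    """
    children = {}
    for feature_id, group in rows:
        for parent in parse_parents(group):
            children.setdefault(parent, []).append(feature_id)
    return children

rows = [
    ("Y74C9A.1", "."),
    ("Y74C9A.1a", "Y74C9A.1"),
    ("Y74C9A.1b", "Y74C9A.1"),
    ("e1", "Y74C9A.1a;Y74C9A.1b"),
    ("e3", "Y74C9A.1b"),
]
print(index_children(rows))
```

Exon e1 appears under both transcripts, which is the shared-exon case the multiple-parent rule is meant to cover.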
> > >>>
> > >>>Column 10: "attributes"
> > >>>
> > >>>A list of feature attributes in the format tag=value.  Multiple
> > >>>tag=value pairs are separated by semicolons.  URL escaping rules are
> > >>>used for tags or values containing whitespace, "=" characters and
> > >>>semicolons.
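
As a sketch only, the attribute column could be decoded along these lines in Python (the thread's own tooling is Perl). `parse_attributes` is a hypothetical name; returning a list of pairs rather than a dictionary is an assumption made so that repeated tags are not lost.

```python
from urllib.parse import unquote

def parse_attributes(column):
    """Split an attribute column into a list of (tag, value) pairs."""
    if column == ".":
        return []
    pairs = []
    for item in column.split(";"):
        if not item:
            continue  # tolerate a stray trailing semicolon
        tag, _, value = item.partition("=")
        # Unescape only after splitting, so escaped ";" and "=" stay in place.
        pairs.append((unquote(tag), unquote(value)))
    return pairs

print(parse_attributes("ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345"))
print(parse_attributes("Note=a%3Bb%20c"))
```

Splitting before unescaping is the point of the URL-style encoding: a literal semicolon inside a value arrives as %3B and never collides with the separator.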
> > >>>
> > >>>Two tags are special:
> > >>>
> > >>>    ID Indicates the name of the feature.  IDs must be unique
> > >>> within the scope of the GFF file.
> > >>>
> > >>>    Target Indicates the target of a nucleotide to nucleotide or
> > >>>    nucleotide to protein alignment.  The format of the
> > >>>    value is "target_id:start..end"  Start may be greater
> > >>>    than end to indicate a + strand alignment to the
> > >>>    reverse complement of a target nucleotide sequence.
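
A small Python sketch of reading a Target value of the form "target_id:start..end". `parse_target` is a hypothetical name; splitting on the last colon is an assumption, made so that a target ID which itself contains a colon is still handled.

```python
def parse_target(value):
    """Return (target_id, start, end, reversed) for a Target attribute value."""
    target_id, _, span = value.rpartition(":")
    start_text, _, end_text = span.partition("..")
    start, end = int(start_text), int(end_text)
    # Start greater than end signals alignment to the reverse complement.
    return target_id, start, end, start > end

print(parse_target("cb123:1001..1100"))
print(parse_target("cb123:1100..1001"))
```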
> > >>>
> > > >>>In the example GFF3 file given below, the first column contains line
> > > >>>numbers that I have added for the purposes of the narrative.  Here
> > > >>>are some common scenarios that I have attempted to illustrate:
> > >>>
> > >>>A) a simple feature, no public ID
> > >>>
> > >>>Line 2 in the example is a feature of type "repeat". It has a start
> > >>>and an end and no ID, but it does have an attribute named "Note."
> > >>>
> > >>>B) a simple feature with a public ID
> > >>>
> > >>>Line 3 is a feature of type clone.  It has a start and an end.  Its
> > > >>>parent is undefined (a dot in column 9), but it has an attribute of type
> > >>>ID with value "cTel33B."
> > >>>
> > >>>C) a feature with multiple attributes
> > >>>
> > >>>Line 5 is a feature of type "gene."  It has no parent, and has
> > >>>attributes of type ID, Note, and GO_term.
> > >>>
> > >>>D) a hierarchical grouping of features
> > >>>
> > >>>Lines 5-13 demonstrate a hierarchical grouping.  At the top level is
> > >>>line 5, which defines the extent of a "gene" with ID Y74C9A.1.  Below
> > >>>this are two features of type mRNA (lines 6 and 7).  Their group
> > >>>fields contain the ID of Y74C9A.1, indicating that this feature is
> > >>>their immediate parent.  In the 10th column, the mRNA features have
> > >>>their own IDs independent of the ID of the parent gene.
> > >>>
> > >>>This pattern is repeated for the exons listed on lines 8-11.  Exons
> > >>>e1, e2, and e4 belong to both of the transcripts.  Therefore, both
> > >>>transcript IDs are listed in the group column, separated by
> > >>>semicolons.
> > >>>
> > >>>Exon e3 belongs only to one of the transcripts, and therefore only
> > >>>that transcript's ID is listed in the group column.
> > >>>
> > >>>Lines 12 and 13 indicate coding_start and coding_end features.  These
> > >>>subfeatures are hierarchically grouped underneath their corresponding
> > >>>exons, but they do not have independent public IDs.
> > >>>
> > >>>E) Disjunct coordinates
> > >>>
> > > >>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
> > > >>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions.  The
> > > >>>group column indicates that the CDS belongs to mRNA Y74C9A.1a.
> > > >>>However, the attribute column assigns each of lines 14-16 the same
> > > >>>ID.  Because the ID is the same, this is to be interpreted as a
> > > >>>single feature that spans multiple locations.
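
The shared-ID convention can be sketched in Python as a simple grouping step. `merge_by_id` is a hypothetical name, and the records use the coordinates of the three "match" lines (17-19) of the example file, which all carry the ID 12345.s; the clone line is included only to show that features with other IDs stay separate.

```python
def merge_by_id(records):
    """Collect the (start, end) spans of all lines that share an ID.

    records is a list of (feature_id, start, end) tuples in file order.
    """
    features = {}
    for feature_id, start, end in records:
        features.setdefault(feature_id, []).append((start, end))
    return features

records = [
    ("12345.s", 1, 100),
    ("12345.s", 101, 500),
    ("12345.s", 501, 1000),
    ("cTel33B", 1, 2679),
]
print(merge_by_id(records))
```

Lines without an ID cannot be merged this way, so a real reader would have to treat each of them as its own feature.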
> > >>>
> > >>>F) Alignments
> > >>>
> > > >>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
> > > >>>reserved Target attribute.  Each non-gapped segment becomes a line in
> > > >>>the GFF3 file.  The segments each share the same ID, thereby
> > > >>>indicating that the segments are disjunct regions of the same
> > > >>>feature.  The Target attribute indicates the ID of the target sequence
> > > >>>(which does not have to be represented in the GFF3 file) and the
> > > >>>start and end coordinates of the aligned target.
> > >>>
> > >>>Unlike the GFF1 and GFF2 formats, the group field for gapped
> > >>>alignments can be empty. However, a valid alternative representation
> > >>>is to create a single "match" feature, and a series of "hsp" features
> > >>>underneath it via the group field.  Lines 20-22 show this alternative
> > >>>representation.
> > >>>
> > >>>G) Relative coordinates
> > >>>
> > >>>Lines 23-26 illustrate using relative coordinate addressing in
> > >>>feature/subfeature relationships.  Line 23 defines an mRNA that is
> > >>>positioned on sequence landmark "I" from positions 5000 to 6000.  Its
> > >>>ID field indicates that it is M7.3.  Lines 24-26 are exon subfeatures
> > >>>of M7.3 as indicated by their group field.  However, the seqid field
> > >>>specifies M7.3 as the parent coordinate system, thereby allowing the
> > >>>exons to begin at position 1.
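
The arithmetic behind relative addressing, as a Python sketch. `to_landmark_coords` is a hypothetical name, and the sketch assumes a parent on the positive strand; the proposal text does not spell out the minus-strand case, so it is not handled here.

```python
def to_landmark_coords(parent_start, rel_start, rel_end):
    """Convert 1-based coordinates relative to a parent into landmark coordinates.

    Assumes the parent lies on the positive strand of the landmark.
    """
    offset = parent_start - 1  # position 1 of the child maps onto parent_start
    return rel_start + offset, rel_end + offset

# M7.3 starts at position 5000 on landmark "I"; its exons are given relative to it.
for rel_start, rel_end in [(1, 300), (301, 400), (401, 1000)]:
    print(to_landmark_coords(5000, rel_start, rel_end))
```

These values agree with the exon coordinates 5000..5299, 5300..5399 and 5400..5999 reported in the test script output further down.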
> > >>>
> > > >>>  0  ##gff-version 3
> > > >>>  1  ##sequence-region I:1..14972282
> > > >>>  2  I     wormbase  repeat        5000   5100      .    .  .  .                      Note=ALU3
> > > >>>  3  I     wormbase  clone         1      2679      .    +  .  .                      ID=cTel33B
> > > >>>  4  I     wormbase  contig        1      14972282  .    +  .  .                      ID=CHROMOSOME_I
> > > >>>  5  I     wormbase  gene          43733  44677     .    +  .  .                      ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
> > > >>>  6  I     wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1               ID=Y74C9A.1a
> > > >>>  7  I     wormbase  mRNA          43733  44677     .    +  .  Y74C9A.1               ID=Y74C9A.1b
> > > >>>  8  I     wormbase  exon          43733  43961     .    +  .  Y74C9A.1a;Y74C9A.1b    ID=e1
> > > >>>  9  I     wormbase  exon          44030  44234     .    +  .  Y74C9A.1a;T:Y74C9A.1b  ID=e2
> > > >>> 10  I     wormbase  exon          44281  44328     .    +  .  Y74C9A.1b              ID=e3
> > > >>> 11  I     wormbase  exon          44521  44677     .    +  .  Y74C9A.1a;T:Y74C9A.1b  ID=e4
> > > >>> 12  I     wormbase  coding_start  43740  43740     .    +  .  e1
> > > >>> 13  I     wormbase  coding_end    44677  44677     .    +  .  e4
> > > >>> 14  I     wormbase  cds           43740  43961     .    +  0  Y74C9A.1a
> > > >>> 15  I     wormbase  cds           44030  44234     .    +  1  Y74C9A.1a
> > > >>> 16  I     wormbase  cds           44521  44677     .    +  1  Y74C9A.1a
> > > >>> 17  I     wormbase  match         1      100       100  .  .  .                      ID=12345.s;Target=cb123:1001..1100
> > > >>> 18  I     wormbase  match         101    500       20   .  .  .                      ID=12345.s;Target=cb123:1101..1500
> > > >>> 19  I     wormbase  match         501    1000      80   .  .  .                      ID=12345.s;Target=cb123:1501..2000
> > > >>> 20  I     wormbase  match         5001   6000      100  .  .  .                      ID=abc;Target=M1:1..1000
> > > >>> 21  I     wormbase  hsp           5001   5500      .    .  .  abc                    Target=M1:1..500
> > > >>> 22  I     wormbase  hsp           5501   6000      .    .  .  abc                    Target=M1:501..100
> > > >>> 23  I     wormbase  mRNA          5000   6000      +    .  .  .                      ID=M7.3
> > > >>> 24  M7.3  wormbase  exon          1      300       +    .  .  M7.3                   ID=M7.3.1
> > > >>> 25  M7.3  wormbase  exon          301    400       +    .  .  M7.3                   ID=M7.3.2
> > > >>> 26  M7.3  wormbase  exon          401    1000      +    .  .  M7.3                   ID=M7.3.3
> > >>>
> > >>>=================================================================
> > >>>
> > >>>I have extended (in an experimental way), the Bio::Tools::GFF module
> > > >>>to accommodate this new format.  Here is a test script and its output
> > >>>when run on the above file.
> > >>>
> > > >>>#!/usr/bin/perl -w
> > > >>>use strict;
> > > >>>use lib '.';
> > > >>>
> > > >>>use Bio::Tools::GFF;
> > > >>>my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
> > > >>>my @f = $gffio->features;
> > > >>>format_features(\@f);
> > > >>>
> > > >>>sub format_features {
> > > >>>  my $features = shift;
> > > >>>  my $tabs     = shift || 0;
> > > >>>  for my $f (@$features) {
> > > >>>    my $type  = $f->primary_tag;
> > > >>>    my $id    = $f->unique_id;
> > > >>>    $id       ||= '(no id)';
> > > >>>    my ($start,$end) = ($f->start,$f->end);
> > > >>>    my $alt = ($f->alternative_locations)[0];
> > > >>>    my ($target,$tstart,$tend) = ($alt->seq_id,$alt->start,$alt->end) if $alt;
> > > >>>
> > > >>>    print "\t"x$tabs,
> > > >>>          join("\t",$id,$type,$f->location->to_FTstring,
> > > >>>               eval{$alt->location->seq_id,$alt->location->to_FTstring}),
> > > >>>          "\n";
> > > >>>    format_features([$f->sub_SeqFeature],$tabs+1);
> > > >>>  }
> > > >>>}
> > > >>>
> > > >>>1;
> > >>>
> > >>>OUTPUT:
> > >>>
> > >>>cTel33B clone 1..2679
> > >>>CHROMOSOME_I contig 1..14972282
> > >>>12345.s match join(101..500,1..100,501..1000)
> > >>>M7.3 mRNA 5000..6000
> > >>> M7.3.1 exon 5000..5299
> > >>> M7.3.2 exon 5300..5399
> > >>> M7.3.3 exon 5400..5999
> > >>>abc match 5001..6000
> > >>> (no id) hsp 5001..5500
> > >>> (no id) hsp 5501..6000
> > >>>(no id) repeat 5000..5100
> > >>>Y74C9A.1 gene 43733..44677
> > >>> Y74C9A.1a mRNA 43733..44677
> > >>> e1 exon 43733..43961
> > >>> (no id) coding_start 43740
> > >>> e2 exon 44030..44234
> > >>> e4 exon 44521..44677
> > >>> (no id) coding_end 44677
> > >>> (no id) cds 43740..43961
> > >>> (no id) cds 44030..44234
> > >>> (no id) cds 44521..44677
> > >>> Y74C9A.1b mRNA 43733..44677
> > >>> e1 exon 43733..43961
> > >>> (no id) coding_start 43740
> > >>> e3 exon 44281..44328
>
> --
> Lincoln Stein
> lstein at cshl.org
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Cold Spring Harbor, NY 11724
> (516) 367-8380 (voice)
> (516) 367-8389 (fax)
>