[Bioperl-l] Re: Proposed GFF version 3
Richard Durbin
rd at sanger.ac.uk
Wed Feb 12 20:16:36 EST 2003
This is a bit like the 9th field of GFF, which was changed from group
to attribute very quickly on the GFF specification page, but much
(most?) of the world still thinks of it as a group field.....
It's amazing how ineffective formal standards are once informal ones
have sprung up.
Richard
Jim Kent wrote:
> Ah, I see. I'm looking at a lot of old fashioned sites then.
>
> ----- Original Message -----
> From: "Lincoln Stein" <lstein at cshl.org>
> To: "Richard Durbin" <rd at sanger.ac.uk>
> Cc: <bioperl-l at bioperl.org>; <suzi at fruitfly.org>; <gff-list at sanger.ac.uk>
> Sent: Tuesday, February 11, 2003 6:21 AM
> Subject: Re: Proposed GFF version 3
>
>
>
>>The important thing to me is to be able to preserve some backward
>>compatibility with GFF2. I don't think it will make much of a difference
>>which order the two columns fall in because some people used column 9 for
>>grouping and others for attributes. How about calling column 10
>>"parents"?
>>
>>I went to URL format mostly because Perl parsing will be a lot faster
>>(Perl likes regular expressions, but those don't play well with
>>shell-style quote and backslashing rules). The official URL standard
>>uses the semicolon. The very earliest CGI specification used
>>ampersands, but this was abandoned about five years ago when people
>>realized that this violated the HTML spec (ampersands must be escaped).
>>The correct way to write ampersanded parameter lists is:
>>
>><a href="/cgi-bin/foo?first=a&amp;second=b&amp;third=c">
>>
>>I'm surprised to hear that Ensembl uses ampersands in its URLs. I bet
>>their pages don't validate against the XHTML validators.
>>
>>Lincoln
>>
>>
>>On Tuesday 11 February 2003 07:54 am, Richard Durbin wrote:
>>
>>>Swap them entirely. i.e. put the attributes in column 9 and call that
>>>"attributes" and put the new hierarchical group term in column 10 and
>>>call that "group". Or perhaps it would be better to call it something
>>>else to minimise confusion, because in gff version 1 column 9 was called
>>>group. What about calling column 10 "cluster"?
>>>
>>>I see you have switched to URL type format for the attributes, away from
>>>acedb. That's fine - URL format is much more universal. But is ';' a
>>>standard separator in URLS? I just looked and see that Ensembl uses '&'
>>>and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
>>>there is no standard.
>>>
>>>Richard
>>>
>>>Lincoln Stein wrote:
>>>
>>>>Hi Richard,
>>>>
>>>>Do you mean that we should swap columns 9 and 10 entirely, or just
>>>>swap their names? I think you mean the former, but I want to be sure.
>>>>
>>>>Lincoln
>>>>
>>>>On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
>>>>
>>>>>Hello all,
>>>>>
>>>>>This looks very nice to me. Not surprising perhaps because I had an
>>>>>earlier involvement as Lincoln says.
>>>>>
>>>>>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it
>>>>>is the "official" GFF mailing list, although it is very little used.
>>>>>
>>>>>I have one major comment, that columns 9 (group) and 10 (attributes)
>>>>>should be switched. Although in GFF version 1 column 9 was called
>>>>>"group", in version 2, which is what has been current for over two
>>>>>years, this was renamed "attribute" and contains the attribute
>>>>>information. For consistency we should keep column 9 for the
>>>>>attributes. Also, in many cases there will be attributes but no group.
>>>>>
>>>>>I like ID and Target. I see the idea with hsp's for gapped alignments,
>>>>>though perhaps they could be called "match_block". But there is a case
>>>>>I think to also encode gapped alignments on one line, perhaps using the
>>>>>CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
>>>>>
>>>>>Target=M1:1..1000;Align=xxxxxxx
>>>>>
>>>>>(sorry I don't know cigar format well enough to write a legal string).
>>>>>
>>>>>Richard
>>>>>
>>>>>Lincoln Stein wrote:
>>>>>
>>>>>>This letter is to discuss a proposed extension to GFF. It arises from
>>>>>>conversations with Richard Durbin during last fall's Hinxton genome
>>>>>>informatics meeting.
>>>>>>
>>>>>>Although there are many richer ways of representing genomic features
>>>>>>via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
>>>>>>flat file formats declares the bioinformatics community's need for a
>>>>>>simple format that can be modified with a text editor and processed
>>>>>>with shell tools like grep. The GFF format, although widely used, has
>>>>>>fragmented into multiple incompatible dialects. When asked why they
>>>>>>have modified the published Sanger specification, bioinformaticists
>>>>>>frequently answer that the format was insufficient for their needs,
>>>>>>and they needed to extend it. The proposed GFF3 format addresses the
>>>>>>most common extensions to GFF, while preserving backward compatibility
>>>>>>with previous formats. The new format:
>>>>>>
>>>>>> 1) adds a mechanism for representing more than one level
>>>>>>    of hierarchical grouping of features and subfeatures.
>>>>>> 2) separates the ideas of group membership and feature name/id.
>>>>>> 3) constrains the feature type field to be taken from a controlled
>>>>>>    vocabulary.
>>>>>> 4) allows a single feature, such as an exon, to belong to more than
>>>>>>    one group at a time.
>>>>>> 5) provides one level of relative addressing for subfeatures (e.g.
>>>>>>    exons can be expressed in transcript coordinates).
>>>>>> 6) provides an explicit convention for pairwise alignments.
>>>>>> 7) provides an explicit convention for features that occupy disjunct
>>>>>>    regions.
>>>>>>
>>>>>>The format consists of 10 columns, separated by spaces. The following
>>>>>>unescaped characters are allowed within fields:
>>>>>>[a-zA-Z0-9.:;=%^*$@!+_?-]. All other characters must be escaped
>>>>>>using the URL escaping conventions. Unescaped quotation marks,
>>>>>>backslashes and other ad-hoc escaping conventions that have been added
>>>>>>to the GFF format are explicitly forbidden. The =, ; and % characters
>>>>>>have reserved meanings as described below.
>>>>>>
>>>>>>Undefined fields are replaced with the "." character, as described in
>>>>>>the original GFF spec.
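
The escaping rule above can be sketched in Python. This is only an illustration under stated assumptions, not part of the proposal: `escape_field` and `unescape_field` are hypothetical helper names, the allowed set is copied from the character class in the text, and the `reserved` argument is an assumption about how a writer would protect =, ; and % when they occur inside a single tag or value.

```python
from urllib.parse import unquote

# Characters the draft allows unescaped inside a field.
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.:;=%^*$@!+_?-"
)

def escape_field(text, reserved="=;%"):
    """Percent-escape characters outside ALLOWED, plus any in `reserved`.

    `reserved` is an assumption of this sketch: characters that are legal
    in a field but carry structural meaning and so must not appear
    literally inside one tag or value.
    """
    out = []
    for ch in text:
        if ch in ALLOWED and ch not in reserved:
            out.append(ch)
        else:
            # Escape each UTF-8 byte as %XX, as in URL escaping.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

def unescape_field(text):
    """Return None for the undefined marker ".", else the decoded text."""
    return None if text == "." else unquote(text)

print(escape_field("unc 3"))
print(unescape_field("unc%203"))
```

Treating "." as undefined on the way in means a value that is literally a single dot cannot be round-tripped; the draft does not say how to handle that case, so the sketch leaves it alone.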
>>>>>>
>>>>>>Column 1: "seqid"
>>>>>>
>>>>>>The ID of the landmark used to establish the coordinate system for the
>>>>>>current feature. IDs must contain alphanumeric characters.
>>>>>>Whitespace, if present, must be escaped using URL escaping rules
>>>>>>(e.g. space="%20").
>>>>>>
>>>>>>Column 2: "source"
>>>>>>
>>>>>>The source of the feature. This is unchanged from the older GFF specs
>>>>>>and is not part of a controlled vocabulary.
>>>>>>
>>>>>>Column 3: "type"
>>>>>>
>>>>>>The type of the feature (previously called the "method"). This is
>>>>>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
>>>>>>accession number. The latter alternative is distinguished using the
>>>>>>syntax SOFA:000000.
>>>>>>
>>>>>>Columns 4 & 5: "start" and "end"
>>>>>>
>>>>>>The start and end of the feature, in 1-based integer coordinates,
>>>>>>relative to the landmark given in column 1. Start is less than end.
>>>>>>
>>>>>>Column 6: "score"
>>>>>>
>>>>>>The score of the feature, a floating point number. As in earlier
>>>>>>versions of the format, the semantics of the score are ill-defined.
>>>>>>It is strongly recommended that E-values be used for sequence
>>>>>>similarity features, and that P-values be used for ab initio gene
>>>>>>prediction features.
>>>>>>
>>>>>>Column 7: "strand"
>>>>>>
>>>>>>The strand of the feature. + for positive strand (relative to the
>>>>>>landmark), - for minus strand, and . for features that are not
>>>>>>stranded. In addition, ? can be used for features whose strandedness
>>>>>>is relevant, but unknown.
>>>>>>
>>>>>>Column 8: "phase"
>>>>>>
>>>>>>The phase of the feature, for protein-encoding features (primarily
>>>>>>CDSs). This is an integer-valued field with the values 0, 1, or 2.
>>>>>>The integer indicates the offset from the start of the feature to the
>>>>>>first base of the first codon in the reading frame. "." is used for
>>>>>>features that do not correspond to a reading frame.
>>>>>>
>>>>>>Column 9: "group"
>>>>>>
>>>>>>A list of the immediate parents of the current feature. Multiple
>>>>>>parents are allowed (example: one exon shared by multiple
>>>>>>transcripts). Multiple parents are separated by a semicolon.
>>>>>>Parentless features have a dot in this field.
>>>>>>
>>>>>>Column 10: "attributes"
>>>>>>
>>>>>>A list of feature attributes in the format tag=value. Multiple
>>>>>>tag=value pairs are separated by semicolons. URL escaping rules are
>>>>>>used for tags or values containing whitespace, "=" characters and
>>>>>>semicolons.
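
A minimal Python sketch of reading the attributes column as described above. `parse_attributes` is a hypothetical name, and collecting repeated tags into a list is an assumption of this sketch; the draft does not say whether a tag may appear more than once.

```python
from urllib.parse import unquote

def parse_attributes(column):
    """Split 'tag=value;tag=value' into a dict of tag -> list of values."""
    attrs = {}
    if column in ("", "."):
        return attrs  # undefined attributes column
    for pair in column.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Unescape only after splitting, so escaped ';' and '=' survive.
        attrs.setdefault(unquote(tag), []).append(unquote(value))
    return attrs

print(parse_attributes("ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345"))
```

Splitting on the raw separators first and unescaping afterwards is what lets a value carry a literal semicolon as %3B.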
>>>>>>
>>>>>>Two tags are special:
>>>>>>
>>>>>> ID      Indicates the name of the feature. IDs must be unique
>>>>>>         within the scope of the GFF file.
>>>>>>
>>>>>> Target  Indicates the target of a nucleotide to nucleotide or
>>>>>>         nucleotide to protein alignment. The format of the
>>>>>>         value is "target_id:start..end". Start may be greater
>>>>>>         than end to indicate a + strand alignment to the
>>>>>>         reverse complement of a target nucleotide sequence.
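
The Target value format can be picked apart with a short regular expression. The sketch below is in Python; `parse_target` and the `reversed` flag are hypothetical names chosen for illustration, and normalising start and end so that start is the smaller number is a design choice of the sketch rather than something the draft requires.

```python
import re

# target_id:start..end; the greedy id part allows ':' inside the id.
_TARGET = re.compile(r"^(?P<id>.+):(?P<start>\d+)\.\.(?P<end>\d+)$")

def parse_target(value):
    """Parse 'target_id:start..end' into id, ordered coordinates, and a flag."""
    m = _TARGET.match(value)
    if m is None:
        raise ValueError("not a Target value: %r" % value)
    start, end = int(m.group("start")), int(m.group("end"))
    return {
        "id": m.group("id"),
        "start": min(start, end),
        "end": max(start, end),
        # start > end marks an alignment to the reverse complement
        "reversed": start > end,
    }

print(parse_target("cb123:1001..1100"))
```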
>>>>>>
>>>>>>In the example GFF3 file given below, the first column contains line
>>>>>>numbers that I have added for the purposes of the narrative. Here are
>>>>>>some common scenarios that I have attempted to illustrate:
>>>>>>
>>>>>>A) a simple feature, no public ID
>>>>>>
>>>>>>Line 2 in the example is a feature of type "repeat". It has a start
>>>>>>and an end and no ID, but it does have an attribute named "Note."
>>>>>>
>>>>>>B) a simple feature with a public ID
>>>>>>
>>>>>>Line 3 is a feature of type clone. It has a start and an end. Its
>>>>>>parent is undefined (empty column 9), but it has an attribute of type
>>>>>>ID with value "cTel33B."
>>>>>>
>>>>>>C) a feature with multiple attributes
>>>>>>
>>>>>>Line 5 is a feature of type "gene." It has no parent, and has
>>>>>>attributes of type ID, Note, and GO_term.
>>>>>>
>>>>>>D) a hierarchical grouping of features
>>>>>>
>>>>>>Lines 5-13 demonstrate a hierarchical grouping. At the top level is
>>>>>>line 5, which defines the extent of a "gene" with ID Y74C9A.1. Below
>>>>>>this are two features of type mRNA (lines 6 and 7). Their group
>>>>>>fields contain the ID of Y74C9A.1, indicating that this feature is
>>>>>>their immediate parent. In the 10th column, the mRNA features have
>>>>>>their own IDs independent of the ID of the parent gene.
>>>>>>
>>>>>>This pattern is repeated for the exons listed on lines 8-11. Exons
>>>>>>e1, e2, and e4 belong to both of the transcripts. Therefore, both
>>>>>>transcript IDs are listed in the group column, separated by
>>>>>>semicolons.
>>>>>>
>>>>>>Exon e3 belongs only to one of the transcripts, and therefore only
>>>>>>that transcript's ID is listed in the group column.
>>>>>>
>>>>>>Lines 12 and 13 indicate coding_start and coding_end features. These
>>>>>>subfeatures are hierarchically grouped underneath their corresponding
>>>>>>exons, but they do not have independent public IDs.
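
The parent lists in the group column are enough to rebuild the hierarchy described in scenario D. A minimal Python sketch, assuming each feature has already been reduced to an (ID, group column) pair; `build_children` is a hypothetical name. The sample data follows the prose (exons e1, e2 and e4 under both transcripts), so it deliberately omits the "T:" prefix that appears in the example file.

```python
from collections import defaultdict

def build_children(features):
    """Map each parent ID to the IDs of its immediate children.

    `features` is an iterable of (feature_id, group_column) pairs, where
    the group column is "." or a semicolon-separated list of parent IDs.
    """
    children = defaultdict(list)
    for feature_id, group in features:
        if group == ".":
            continue  # parentless feature
        for parent in group.split(";"):
            children[parent].append(feature_id)
    return dict(children)

features = [
    ("Y74C9A.1", "."),
    ("Y74C9A.1a", "Y74C9A.1"),
    ("Y74C9A.1b", "Y74C9A.1"),
    ("e1", "Y74C9A.1a;Y74C9A.1b"),
    ("e2", "Y74C9A.1a;Y74C9A.1b"),
    ("e3", "Y74C9A.1b"),
    ("e4", "Y74C9A.1a;Y74C9A.1b"),
]
tree = build_children(features)
print(tree["Y74C9A.1b"])
```

Because a feature may list several parents, the result is a directed graph rather than a strict tree; a shared exon simply appears under each transcript.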
>>>>>>
>>>>>>E) Disjunct coordinates
>>>>>>
>>>>>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
>>>>>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions. The group
>>>>>>column indicates that the CDS belongs to mRNA Y74C9A.1a. However, the
>>>>>>attribute column assigns each of lines 14-16 the same ID. Because the
>>>>>>ID is the same, this is to be interpreted as a single feature that
>>>>>>spans multiple locations.
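
The "same ID means one feature" rule can be sketched as a grouping step. This Python example is an illustration only: `merge_by_id` and `location_string` are hypothetical names, and the ID "cds1" is invented for the sketch. The coordinates are those of the three cds lines in the example file.

```python
def merge_by_id(rows):
    """Collect (start, end) segments of lines that share the same ID."""
    merged = {}
    for feature_id, start, end in rows:
        merged.setdefault(feature_id, []).append((start, end))
    return merged

def location_string(segments):
    """Render one segment as 'a..b', several as 'join(a..b,c..d)'."""
    parts = ["%d..%d" % seg for seg in sorted(segments)]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"

rows = [
    ("cds1", 43740, 43961),  # "cds1" is a made-up ID for illustration
    ("cds1", 44030, 44234),
    ("cds1", 44521, 44677),
]
merged = merge_by_id(rows)
print(location_string(merged["cds1"]))
```

Lines without an ID cannot be merged this way, so a reader would have to treat each of them as a separate feature.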
>>>>>>
>>>>>>F) Alignments
>>>>>>
>>>>>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
>>>>>>reserved Target attribute. Each non-gapped segment becomes a line in
>>>>>>the GFF3 file. The segments each share the same ID, thereby
>>>>>>indicating that the segments are disjunct regions of the same feature.
>>>>>>The Target attribute indicates the ID of the target sequence (which
>>>>>>does not have to be represented in the GFF3 file) and the start and
>>>>>>end coordinates of the aligned target.
>>>>>>
>>>>>>Unlike the GFF1 and GFF2 formats, the group field for gapped
>>>>>>alignments can be empty. However, a valid alternative representation
>>>>>>is to create a single "match" feature, and a series of "hsp" features
>>>>>>underneath it via the group field. Lines 20-22 show this alternative
>>>>>>representation.
>>>>>>
>>>>>>G) Relative coordinates
>>>>>>
>>>>>>Lines 23-26 illustrate using relative coordinate addressing in
>>>>>>feature/subfeature relationships. Line 23 defines an mRNA that is
>>>>>>positioned on sequence landmark "I" from positions 5000 to 6000. Its
>>>>>>ID field indicates that it is M7.3. Lines 24-26 are exon subfeatures
>>>>>>of M7.3 as indicated by their group field. However, the seqid field
>>>>>>specifies M7.3 as the parent coordinate system, thereby allowing the
>>>>>>exons to begin at position 1.
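
Converting relative coordinates back to the landmark is a single offset. A minimal Python sketch, assuming a parent on the + strand (the text above does not describe the minus-strand case); `to_absolute` is a hypothetical name.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark.

    Assumes the parent lies on the + strand; position 1 of the child
    coordinate system is the parent's own start.
    """
    offset = parent_start - 1
    return rel_start + offset, rel_end + offset

# M7.3 starts at 5000 on landmark I; its exons are given relative to M7.3.
for rel in [(1, 300), (301, 400), (401, 1000)]:
    print(rel, "->", to_absolute(5000, *rel))
```

For the first exon this gives 5000..5299, which agrees with the coordinates printed for M7.3.1 in the test script output in this message.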
>>>>>>
>>>>>> 0 ##gff-version 3
>>>>>> 1 ##sequence-region I:1..14972282
>>>>>> 2 I    wormbase repeat       5000  5100     .   . . .                     Note=ALU3
>>>>>> 3 I    wormbase clone        1     2679     .   + . .                     ID=cTel33B
>>>>>> 4 I    wormbase contig       1     14972282 .   + . .                     ID=CHROMOSOME_I
>>>>>> 5 I    wormbase gene         43733 44677    .   + . .                     ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
>>>>>> 6 I    wormbase mRNA         43733 44677    .   + . Y74C9A.1              ID=Y74C9A.1a
>>>>>> 7 I    wormbase mRNA         43733 44677    .   + . Y74C9A.1              ID=Y74C9A.1b
>>>>>> 8 I    wormbase exon         43733 43961    .   + . Y74C9A.1a;Y74C9A.1b   ID=e1
>>>>>> 9 I    wormbase exon         44030 44234    .   + . Y74C9A.1a;T:Y74C9A.1b ID=e2
>>>>>>10 I    wormbase exon         44281 44328    .   + . Y74C9A.1b             ID=e3
>>>>>>11 I    wormbase exon         44521 44677    .   + . Y74C9A.1a;T:Y74C9A.1b ID=e4
>>>>>>12 I    wormbase coding_start 43740 43740    .   + . e1
>>>>>>13 I    wormbase coding_end   44677 44677    .   + . e4
>>>>>>14 I    wormbase cds          43740 43961    .   + 0 Y74C9A.1a
>>>>>>15 I    wormbase cds          44030 44234    .   + 1 Y74C9A.1a
>>>>>>16 I    wormbase cds          44521 44677    .   + 1 Y74C9A.1a
>>>>>>17 I    wormbase match        1     100      100 . . .                     ID=12345.s;Target=cb123:1001..1100
>>>>>>18 I    wormbase match        101   500      20  . . .                     ID=12345.s;Target=cb123:1101..1500
>>>>>>19 I    wormbase match        501   1000     80  . . .                     ID=12345.s;Target=cb123:1501..2000
>>>>>>20 I    wormbase match        5001  6000     100 . . .                     ID=abc;Target=M1:1..1000
>>>>>>21 I    wormbase hsp          5001  5500     .   . . abc                   Target=M1:1..500
>>>>>>22 I    wormbase hsp          5501  6000     .   . . abc                   Target=M1:501..100
>>>>>>23 I    wormbase mRNA         5000  6000     +   . . .                     ID=M7.3
>>>>>>24 M7.3 wormbase exon         1     300      +   . . M7.3                  ID=M7.3.1
>>>>>>25 M7.3 wormbase exon         301   400      +   . . M7.3                  ID=M7.3.2
>>>>>>26 M7.3 wormbase exon         401   1000     +   . . M7.3                  ID=M7.3.3
>>>>>>
>>>>>>=================================================================
>>>>>>
>>>>>>I have extended (in an experimental way) the Bio::Tools::GFF module
>>>>>>to accommodate this new format. Here is a test script and its output
>>>>>>when run on the above file.
>>>>>>
>>>>>> #!/usr/bin/perl -w
>>>>>> use strict;
>>>>>> use lib '.';
>>>>>>
>>>>>> use Bio::Tools::GFF;
>>>>>>
>>>>>> # Read GFF3 from STDIN using the experimentally extended module.
>>>>>> my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
>>>>>> my @f = $gffio->features;
>>>>>> format_features(\@f);
>>>>>>
>>>>>> # Print one line per feature, indenting subfeatures one tab per level.
>>>>>> sub format_features {
>>>>>>   my $features = shift;
>>>>>>   my $tabs = shift || 0;
>>>>>>   for my $f (@$features) {
>>>>>>     my $type = $f->primary_tag;
>>>>>>     my $id = $f->unique_id;
>>>>>>     $id ||= '(no id)';
>>>>>>     my ($start,$end) = ($f->start,$f->end);
>>>>>>     # First alternative location, if any (used as the alignment target).
>>>>>>     my $alt = ($f->alternative_locations)[0];
>>>>>>     my ($target,$tstart,$tend) =
>>>>>>       ($alt->seq_id,$alt->start,$alt->end) if $alt;
>>>>>>
>>>>>>     print "\t"x$tabs,
>>>>>>           join("\t",$id,$type,$f->location->to_FTstring,
>>>>>>                eval{$alt->location->seq_id,$alt->location->to_FTstring}),
>>>>>>           "\n";
>>>>>>     format_features([$f->sub_SeqFeature],$tabs+1);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> 1;
>>>>>>
>>>>>>OUTPUT:
>>>>>>
>>>>>>cTel33B clone 1..2679
>>>>>>CHROMOSOME_I contig 1..14972282
>>>>>>12345.s match join(101..500,1..100,501..1000)
>>>>>>M7.3 mRNA 5000..6000
>>>>>>    M7.3.1 exon 5000..5299
>>>>>>    M7.3.2 exon 5300..5399
>>>>>>    M7.3.3 exon 5400..5999
>>>>>>abc match 5001..6000
>>>>>>    (no id) hsp 5001..5500
>>>>>>    (no id) hsp 5501..6000
>>>>>>(no id) repeat 5000..5100
>>>>>>Y74C9A.1 gene 43733..44677
>>>>>>    Y74C9A.1a mRNA 43733..44677
>>>>>>        e1 exon 43733..43961
>>>>>>            (no id) coding_start 43740
>>>>>>        e2 exon 44030..44234
>>>>>>        e4 exon 44521..44677
>>>>>>            (no id) coding_end 44677
>>>>>>        (no id) cds 43740..43961
>>>>>>        (no id) cds 44030..44234
>>>>>>        (no id) cds 44521..44677
>>>>>>    Y74C9A.1b mRNA 43733..44677
>>>>>>        e1 exon 43733..43961
>>>>>>            (no id) coding_start 43740
>>>>>>        e3 exon 44281..44328
>>>>>
>>--
>>Lincoln Stein
>>lstein at cshl.org
>>Cold Spring Harbor Laboratory
>>1 Bungtown Road
>>Cold Spring Harbor, NY 11724
>>(516) 367-8380 (voice)
>>(516) 367-8389 (fax)
>>
>
>