[Bioperl-l] Re: Proposed GFF version 3
Richard Durbin
rd at sanger.ac.uk
Tue Feb 11 17:27:40 EST 2003
OK. Let's fix on that:
Column 9 is "attributes" and column 10 is "parents".
Richard
Lincoln Stein wrote:
> The important thing to me is to be able to preserve some backward
> compatibility with GFF2. I don't think it will make much of a difference
> which order the two columns fall in because some people used column 9 for
> grouping and others for attributes. How about calling column 10 "parents"?
>
> I went to URL format mostly because Perl parsing will be a lot faster (Perl
> likes regular expressions, but those don't play well with shell-style quote
> and backslashing rules). The official URL standard uses the semicolon. The
> very earliest CGI specification used ampersands, but this was abandoned about
> five years ago when people realized that this violated the HTML spec:
> ampersands must be escaped, so the correct way to write ampersanded
> parameter lists is:
>
> <a href="/cgi-bin/foo?first=a&amp;second=b&amp;third=c">
>
> I'm surprised to hear that Ensembl uses ampersands in its URLs. I bet their
> pages don't validate against the XHTML validators.
>
> Lincoln
>
>
> On Tuesday 11 February 2003 07:54 am, Richard Durbin wrote:
>
>>Swap them entirely. i.e. put the attributes in column 9 and call that
>>"attributes" and put the new hierarchical group term in column 10 and
>>call that "group". Or perhaps it would be better to call it something
>>else to minimise confusion, because in gff version 1 column 9 was called
>>group. What about calling column 10 "cluster"?
>>
>>I see you have switched to URL type format for the attributes, away from
>>acedb. That's fine - URL format is much more universal. But is ';' a
>>standard separator in URLs? I just looked and see that Ensembl uses '&'
>>and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
>>there is no standard.
>>
>>Richard
>>
>>Lincoln Stein wrote:
>>
>>>Hi Richard,
>>>
>>>Do you mean that we should swap columns 9 and 10 entirely, or just swap
>>>their names? I think you mean the former, but I want to be sure.
>>>
>>>Lincoln
>>>
>>>On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
>>>
>>>>Hello all,
>>>>
>>>>This looks very nice to me. Not surprising perhaps because I had an
>>>>earlier involvement as Lincoln says.
>>>>
>>>>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it is
>>>>the "official" GFF mailing list, although it is very little used.
>>>>
>>>>I have one major comment: columns 9 (group) and 10 (attributes)
>>>>should be switched. Although column 9 was called "group" in GFF
>>>>version 1, in version 2, which has been current for over two years,
>>>>it was renamed "attribute" and contains the attribute information.
>>>>For consistency we should keep column 9 for the attributes. Also, in
>>>>many cases there will be attributes but no group.
>>>>
>>>>I like ID and Target. I see the idea with hsp's for gapped alignments,
>>>>though perhaps they could be called "match_block". But there is a case
>>>>I think to also encode gapped alignments on one line, perhaps using the
>>>>CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
>>>>
>>>> Target=M1:1..1000;Align=xxxxxxx
>>>>
>>>>(Sorry, I don't know CIGAR format well enough to write a legal string.)
>>>>
>>>>Richard
>>>>
>>>>Lincoln Stein wrote:
>>>>
>>>>>This letter is to discuss a proposed extension to GFF. It arises from
>>>>>conversations with Richard Durbin during last fall's Hinxton genome
>>>>>informatics meeting.
>>>>>
>>>>>Although there are many richer ways of representing genomic features
>>>>>via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
>>>>>flat file formats declares the bioinformatics community's need for a
>>>>>simple format that can be modified with a text editor and processed
>>>>>with shell tools like grep. The GFF format, although widely used, has
>>>>>fragmented into multiple incompatible dialects. When asked why they
>>>>>have modified the published Sanger specification, bioinformaticists
>>>>>frequently answer that the format was insufficient for their needs,
>>>>>and they needed to extend it. The proposed GFF3 format addresses the
>>>>>most common extensions to GFF, while preserving backward compatibility
>>>>>with previous formats. The new format:
>>>>>
>>>>> 1) adds a mechanism for representing more than one level
>>>>>    of hierarchical grouping of features and subfeatures.
>>>>> 2) separates the ideas of group membership and feature name/id.
>>>>> 3) constrains the feature type field to be taken from a controlled
>>>>>    vocabulary.
>>>>> 4) allows a single feature, such as an exon, to belong to more than
>>>>>    one group at a time.
>>>>> 5) provides one level of relative addressing for subfeatures (e.g.
>>>>>    exons can be expressed in transcript coordinates).
>>>>> 6) provides an explicit convention for pairwise alignments.
>>>>> 7) provides an explicit convention for features that occupy disjunct
>>>>>    regions.
>>>>>
>>>>>The format consists of 10 columns, separated by spaces. The following
>>>>>unescaped characters are allowed within fields:
>>>>>[a-zA-Z0-9.:;=%^*$@!+_?-]. All other characters must be escaped
>>>>>using the URL escaping conventions. Unescaped quotation marks,
>>>>>backslashes and other ad-hoc escaping conventions that have been added
>>>>>to the GFF format are explicitly forbidden. The =, ; and % characters
>>>>>have reserved meanings as described below.
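
The escaping rule quoted above can be illustrated with a minimal Python sketch. The helper names `escape_field` and `unescape_field` are invented for this example and do not come from BioPerl or any GFF library. The sketch assumes that the reserved characters =, ; and % are escaped whenever they occur inside a value, which is one reading of the proposal rather than something it states outright.

```python
from urllib.parse import quote, unquote

# Punctuation the proposal allows unescaped, minus the reserved "=", ";"
# and "%" (assumption: those are escaped whenever they occur inside a value).
# Letters and digits are always left alone by quote().
SAFE_PUNCTUATION = ".:^*$@!+_?-"


def escape_field(text):
    """Percent-encode every character outside the allowed set (hypothetical helper)."""
    return quote(text, safe=SAFE_PUNCTUATION)


def unescape_field(text):
    """Decode %XX sequences back to the original characters (hypothetical helper)."""
    return unquote(text)


if __name__ == "__main__":
    note = "unc-3 homolog; putative"
    escaped = escape_field(note)
    print(escaped)
    print(unescape_field(escaped) == note)
```

One caveat: Python's `quote` also leaves "~" unescaped, although "~" is not in the character class given in the proposal, so a strict writer would need to handle that character separately.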
>>>>>
>>>>>Undefined fields are replaced with the "." character, as described in
>>>>>the original GFF spec.
>>>>>
>>>>>Column 1: "seqid"
>>>>>
>>>>>The ID of the landmark used to establish the coordinate system for the
>>>>>current feature. IDs must contain alphanumeric characters.
>>>>>Whitespace, if present, must be escaped using URL escaping rules
>>>>>(e.g. space="%20").
>>>>>
>>>>>Column 2: "source"
>>>>>
>>>>>The source of the feature. This is unchanged from the older GFF specs
>>>>>and is not part of a controlled vocabulary.
>>>>>
>>>>>Column 3: "type"
>>>>>
>>>>>The type of the feature (previously called the "method"). This is
>>>>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
>>>>>accession number. The latter alternative is distinguished using the
>>>>>syntax SOFA:000000.
>>>>>
>>>>>Columns 4 & 5: "start" and "end"
>>>>>
>>>>>The start and end of the feature, in 1-based integer coordinates,
>>>>>relative to the landmark given in column 1. Start must be less than
>>>>>or equal to end.
>>>>>
>>>>>Column 6: "score"
>>>>>
>>>>>The score of the feature, a floating point number. As in earlier
>>>>>versions of the format, the semantics of the score are ill-defined.
>>>>>It is strongly recommended that E-values be used for sequence
>>>>>similarity features, and that P-values be used for ab initio gene
>>>>>prediction features.
>>>>>
>>>>>Column 7: "strand"
>>>>>
>>>>>The strand of the feature. + for positive strand (relative to the
>>>>>landmark), - for minus strand, and . for features that are not
>>>>>stranded. In addition, ? can be used for features whose strandedness
>>>>>is relevant, but unknown.
>>>>>
>>>>>Column 8: "phase"
>>>>>
>>>>>The phase of the feature, for protein-encoding features (primarily
>>>>>CDSs). This is an integer-valued field with the values 0, 1, or 2.
>>>>>The integer indicates the offset from the start of the feature to the
>>>>>first base of the first codon in the reading frame. "." is used for
>>>>>features that do not correspond to a reading frame.
>>>>>
>>>>>Column 9: "group"
>>>>>
>>>>>A list of the immediate parents of the current feature. Multiple
>>>>>parents are allowed (example: one exon shared by multiple
>>>>>transcripts). Multiple parents are separated by a semicolon.
>>>>>Parentless features have a dot in this field.
>>>>>
>>>>>Column 10: "attributes"
>>>>>
>>>>>A list of feature attributes in the format tag=value. Multiple
>>>>>tag=value pairs are separated by semicolons. URL escaping rules are
>>>>>used for tags or values containing whitespace, "=" characters and
>>>>>semicolons.
>>>>>
>>>>>Two tags are special:
>>>>>
>>>>>   ID      Indicates the name of the feature. IDs must be unique
>>>>>           within the scope of the GFF file.
>>>>>
>>>>>   Target  Indicates the target of a nucleotide to nucleotide or
>>>>>           nucleotide to protein alignment. The format of the
>>>>>           value is "target_id:start..end". Start may be greater
>>>>>           than end to indicate a + strand alignment to the
>>>>>           reverse complement of a target nucleotide sequence.
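
To make the column layout concrete, here is a rough Python sketch of a parser for one record. It follows the column order of the proposal as quoted here (column 9 "group", column 10 "attributes"), not the swapped order agreed later in the thread. All function names are hypothetical, and the sketch assumes that every one of the 10 columns is present and that repeated tags do not occur (a later duplicate would overwrite an earlier one).

```python
from urllib.parse import unquote


def parse_parents(field):
    """Column 9 ("group"): semicolon-separated parent IDs, "." if none."""
    if field == ".":
        return []
    return [unquote(p) for p in field.split(";") if p]


def parse_attributes(field):
    """Column 10 ("attributes"): tag=value pairs separated by semicolons."""
    attributes = {}
    if field == ".":
        return attributes
    for pair in field.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Unescape only after splitting, so an escaped ";" or "=" stays intact.
        attributes[unquote(tag)] = unquote(value)
    return attributes


def parse_target(value):
    """Split "target_id:start..end" into (target_id, start, end)."""
    target_id, _, span = value.rpartition(":")
    start, _, end = span.partition("..")
    return target_id, int(start), int(end)


def parse_line(line):
    """Parse one 10-column record (assumes no column is missing)."""
    columns = line.split()
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, group, attrs = columns
    return {
        "seqid": unquote(seqid),
        "source": unquote(source),
        "type": unquote(ftype),
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "parents": parse_parents(group),
        "attributes": parse_attributes(attrs),
    }


if __name__ == "__main__":
    record = parse_line("I wormbase exon 43733 43961 . + . Y74C9A.1a;Y74C9A.1b ID=e1")
    print(record["parents"], record["attributes"])
```

Splitting on any whitespace works here only because the proposal requires whitespace inside a field to be escaped. A record with an omitted column is rejected rather than guessed at.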
>>>>>
>>>>>In the example GFF3 file given below, the first column contains line
>>>>>numbers that I have added for the purposes of the narrative. Here are
>>>>>some common scenarios that I have attempted to illustrate:
>>>>>
>>>>>A) a simple feature, no public ID
>>>>>
>>>>>Line 2 in the example is a feature of type "repeat". It has a start
>>>>>and an end and no ID, but it does have an attribute named "Note."
>>>>>
>>>>>B) a simple feature with a public ID
>>>>>
>>>>>Line 3 is a feature of type clone. It has a start and an end. Its
>>>>>parent is undefined (empty column 9), but it has an attribute of type
>>>>>ID with value "cTel33B."
>>>>>
>>>>>C) a feature with multiple attributes
>>>>>
>>>>>Line 5 is a feature of type "gene." It has no parent, and has
>>>>>attributes of type ID, Note, and GO_term.
>>>>>
>>>>>D) a hierarchical grouping of features
>>>>>
>>>>>Lines 5-13 demonstrate a hierarchical grouping. At the top level is
>>>>>line 5, which defines the extent of a "gene" with ID Y74C9A.1. Below
>>>>>this are two features of type mRNA (lines 6 and 7). Their group
>>>>>fields contain the ID of Y74C9A.1, indicating that this feature is
>>>>>their immediate parent. In the 10th column, the mRNA features have
>>>>>their own IDs independent of the ID of the parent gene.
>>>>>
>>>>>This pattern is repeated for the exons listed on lines 8-11. Exons
>>>>>e1, e2, and e4 belong to both of the transcripts. Therefore, both
>>>>>transcript IDs are listed in the group column, separated by
>>>>>semicolons.
>>>>>
>>>>>Exon e3 belongs only to one of the transcripts, and therefore only
>>>>>that transcript's ID is listed in the group column.
>>>>>
>>>>>Lines 12 and 13 indicate coding_start and coding_end features. These
>>>>>subfeatures are hierarchically grouped underneath their corresponding
>>>>>exons, but they do not have independent public IDs.
>>>>>
>>>>>E) Disjunct coordinates
>>>>>
>>>>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
>>>>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions. The group
>>>>>column indicates that the CDS belongs to mRNA Y74C9A.1a. However, the
>>>>>attribute column assigns each of lines 14-16 the same ID. Because the
>>>>>ID is the same, this is to be interpreted as a single feature that
>>>>>spans multiple locations.
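
The shared-ID convention can be sketched in Python as follows. The example uses the three match rows of the sample file (lines 17-19), which carry the common ID 12345.s. The function names are made up for illustration, and the sketch assumes that rows without an ID are handled elsewhere, since they cannot be merged.

```python
from collections import defaultdict


def merge_by_id(rows):
    """Group (feature_id, start, end) rows into one multi-location feature per ID.

    Returns a dict mapping each ID to its list of (start, end) spans,
    sorted by start coordinate.
    """
    spans_by_id = defaultdict(list)
    for feature_id, start, end in rows:
        spans_by_id[feature_id].append((start, end))
    return {fid: sorted(spans) for fid, spans in spans_by_id.items()}


def to_location_string(spans):
    """Render spans in feature-table style: "a..b" or "join(a..b,c..d)"."""
    parts = [f"{start}..{end}" for start, end in spans]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"


if __name__ == "__main__":
    rows = [("12345.s", 1, 100), ("12345.s", 101, 500), ("12345.s", 501, 1000)]
    for feature_id, spans in merge_by_id(rows).items():
        print(feature_id, to_location_string(spans))
```

Sorting the spans is a choice made for this sketch. The proposal does not say whether the order of the lines in the file is significant.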
>>>>>
>>>>>F) Alignments
>>>>>
>>>>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
>>>>>reserved Target attribute. Each non-gapped segment becomes a line in
>>>>>the GFF3 file. The segments each share the same ID, thereby
>>>>>indicating that the segments are disjunct regions of the same feature.
>>>>>The Target attribute indicates the ID of the target sequence (which
>>>>>does not have to be represented in the GFF3 file) and the start and
>>>>>end coordinates of the aligned target.
>>>>>
>>>>>Unlike the GFF1 and GFF2 formats, the group field for gapped
>>>>>alignments can be empty. However, a valid alternative representation
>>>>>is to create a single "match" feature, and a series of "hsp" features
>>>>>underneath it via the group field. Lines 20-22 show this alternative
>>>>>representation.
>>>>>
>>>>>G) Relative coordinates
>>>>>
>>>>>Lines 23-26 illustrate using relative coordinate addressing in
>>>>>feature/subfeature relationships. Line 23 defines an mRNA that is
>>>>>positioned on sequence landmark "I" from positions 5000 to 6000. Its
>>>>>ID field indicates that it is M7.3. Lines 24-26 are exon subfeatures
>>>>>of M7.3 as indicated by their group field. However, the seqid field
>>>>>specifies M7.3 as the parent coordinate system, thereby allowing the
>>>>>exons to begin at position 1.
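
The arithmetic behind relative addressing can be sketched in a few lines of Python. The function name `to_absolute` is hypothetical, and the sketch assumes 1-based inclusive coordinates and a parent on the + strand. A minus-strand parent would need the coordinates flipped, which is not shown.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark.

    Position 1 of the child coincides with parent_start, so the offset
    to add is parent_start - 1. Assumes a + strand parent.
    """
    offset = parent_start - 1
    return rel_start + offset, rel_end + offset


if __name__ == "__main__":
    # mRNA M7.3 starts at 5000 on landmark "I"; its exons are given
    # relative to M7.3 (lines 24-26 of the example file).
    for rel_start, rel_end in [(1, 300), (301, 400), (401, 1000)]:
        print(to_absolute(5000, rel_start, rel_end))
```

These values agree with the exon coordinates reported in the test-script output further down (5000..5299, 5300..5399, 5400..5999).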
>>>>>
>>>>>  0 ##gff-version 3
>>>>>  1 ##sequence-region I:1..14972282
>>>>>  2 I    wormbase repeat        5000     5100   . . . .                     Note=ALU3
>>>>>  3 I    wormbase clone            1     2679   . + . .                     ID=cTel33B
>>>>>  4 I    wormbase contig           1 14972282   . + . .                     ID=CHROMOSOME_I
>>>>>  5 I    wormbase gene         43733    44677   . + . .                     ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
>>>>>  6 I    wormbase mRNA         43733    44677   . + . Y74C9A.1              ID=Y74C9A.1a
>>>>>  7 I    wormbase mRNA         43733    44677   . + . Y74C9A.1              ID=Y74C9A.1b
>>>>>  8 I    wormbase exon         43733    43961   . + . Y74C9A.1a;Y74C9A.1b   ID=e1
>>>>>  9 I    wormbase exon         44030    44234   . + . Y74C9A.1a;T:Y74C9A.1b ID=e2
>>>>> 10 I    wormbase exon         44281    44328   . + . Y74C9A.1b             ID=e3
>>>>> 11 I    wormbase exon         44521    44677   . + . Y74C9A.1a;T:Y74C9A.1b ID=e4
>>>>> 12 I    wormbase coding_start 43740    43740   . + . e1
>>>>> 13 I    wormbase coding_end   44677    44677   . + . e4
>>>>> 14 I    wormbase cds          43740    43961   . + 0 Y74C9A.1a
>>>>> 15 I    wormbase cds          44030    44234   . + 1 Y74C9A.1a
>>>>> 16 I    wormbase cds          44521    44677   . + 1 Y74C9A.1a
>>>>> 17 I    wormbase match            1      100 100 . . .                     ID=12345.s;Target=cb123:1001..1100
>>>>> 18 I    wormbase match          101      500  20 . . .                     ID=12345.s;Target=cb123:1101..1500
>>>>> 19 I    wormbase match          501     1000  80 . . .                     ID=12345.s;Target=cb123:1501..2000
>>>>> 20 I    wormbase match         5001     6000 100 . . .                     ID=abc;Target=M1:1..1000
>>>>> 21 I    wormbase hsp           5001     5500   . . . abc                   Target=M1:1..500
>>>>> 22 I    wormbase hsp           5501     6000   . . . abc                   Target=M1:501..100
>>>>> 23 I    wormbase mRNA          5000     6000   + . . .                     ID=M7.3
>>>>> 24 M7.3 wormbase exon             1      300   + . . M7.3                  ID=M7.3.1
>>>>> 25 M7.3 wormbase exon           301      400   + . . M7.3                  ID=M7.3.2
>>>>> 26 M7.3 wormbase exon           401     1000   + . . M7.3                  ID=M7.3.3
>>>>>
>>>>>=================================================================
>>>>>
>>>>>I have extended (in an experimental way) the Bio::Tools::GFF module
>>>>>to accommodate this new format. Here is a test script and its output
>>>>>when run on the above file.
>>>>>
>>>>>  #!/usr/bin/perl -w
>>>>>  use strict;
>>>>>  use lib '.';
>>>>>
>>>>>  use Bio::Tools::GFF;
>>>>>
>>>>>  # Read GFF3 from STDIN and print the feature tree, one feature per line.
>>>>>  my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 3);
>>>>>  my @f = $gffio->features;
>>>>>  format_features(\@f);
>>>>>
>>>>>  # Print each feature indented by its nesting depth, then recurse
>>>>>  # into its subfeatures.
>>>>>  sub format_features {
>>>>>    my $features = shift;
>>>>>    my $tabs     = shift || 0;
>>>>>    for my $f (@$features) {
>>>>>      my $type = $f->primary_tag;
>>>>>      my $id   = $f->unique_id || '(no id)';
>>>>>      # First alternative location (e.g. an alignment target), if any.
>>>>>      my $alt = ($f->alternative_locations)[0];
>>>>>      # The eval yields an empty list when $alt is undefined.
>>>>>      my @target = eval { ($alt->location->seq_id,
>>>>>                           $alt->location->to_FTstring) };
>>>>>      print "\t" x $tabs,
>>>>>            join("\t", $id, $type, $f->location->to_FTstring, @target),
>>>>>            "\n";
>>>>>      format_features([$f->sub_SeqFeature], $tabs + 1);
>>>>>    }
>>>>>  }
>>>>>
>>>>>  1;
>>>>>
>>>>>OUTPUT:
>>>>>
>>>>>cTel33B clone 1..2679
>>>>>CHROMOSOME_I contig 1..14972282
>>>>>12345.s match join(101..500,1..100,501..1000)
>>>>>M7.3 mRNA 5000..6000
>>>>>    M7.3.1 exon 5000..5299
>>>>>    M7.3.2 exon 5300..5399
>>>>>    M7.3.3 exon 5400..5999
>>>>>abc match 5001..6000
>>>>>    (no id) hsp 5001..5500
>>>>>    (no id) hsp 5501..6000
>>>>>(no id) repeat 5000..5100
>>>>>Y74C9A.1 gene 43733..44677
>>>>>    Y74C9A.1a mRNA 43733..44677
>>>>>        e1 exon 43733..43961
>>>>>            (no id) coding_start 43740
>>>>>        e2 exon 44030..44234
>>>>>        e4 exon 44521..44677
>>>>>            (no id) coding_end 44677
>>>>>        (no id) cds 43740..43961
>>>>>        (no id) cds 44030..44234
>>>>>        (no id) cds 44521..44677
>>>>>    Y74C9A.1b mRNA 43733..44677
>>>>>        e1 exon 43733..43961
>>>>>            (no id) coding_start 43740
>>>>>        e3 exon 44281..44328
>>>>
>