[Bioperl-l] Re: Proposed GFF version 3
Suzanna Lewis
suzi at fruitfly.org
Fri Feb 7 22:04:45 EST 2003
Hi Lincoln,
I just skimmed this, but it looks good.
-S
On Friday, February 7, 2003, at 04:54 PM, Lincoln Stein wrote:
> This letter is to discuss a proposed extension to GFF. It arises from
> conversations with Richard Durbin during last fall's Hinxton genome
> informatics meeting.
>
> Although there are many richer ways of representing genomic features
> via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> flat file formats declares the bioinformatics community's need for a
> simple format that can be modified with a text editor and processed
> with shell tools like grep. The GFF format, although widely used, has
> fragmented into multiple incompatible dialects. When asked why they
> have modified the published Sanger specification, bioinformaticists
> frequently answer that the format was insufficient for their needs,
> and they needed to extend it. The proposed GFF3 format addresses the
> most common extensions to GFF, while preserving backward compatibility
> with previous formats. The new format:
>
> 1) adds a mechanism for representing more than one level
> of hierarchical grouping of features and subfeatures.
> 2) separates the ideas of group membership and feature name/id
> 3) constrains the feature type field to be taken from a controlled
> vocabulary.
> 4) allows a single feature, such as an exon, to belong to more than
> one group at a time.
> 5) provides one level of relative addressing for subfeatures (e.g. exons
> can be expressed in transcript coordinates)
> 6) defines an explicit convention for pairwise alignments
> 7) defines an explicit convention for features that occupy disjunct
> regions
>
> The format consists of 10 columns, separated by tabs. The following
> unescaped characters are allowed within fields:
> [a-zA-Z0-9.:;=%^*$@!+_?-]. All other characters must be escaped
> using the URL escaping conventions. Unescaped quotation marks,
> backslashes and other ad-hoc escaping conventions that have been added
> to the GFF format are explicitly forbidden. The =, ; and % characters
> have reserved meanings as described below.
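A minimal sketch of the escaping rule above, for illustration only. The sub names gff3_escape and gff3_unescape are hypothetical and are not part of Bio::Tools::GFF. It assumes single-byte (ASCII) input, and it assumes that "=", ";" and "%" are also escaped when they occur as literal data, since the proposal reserves them.

```perl
use strict;
use warnings;

# Escape every character outside the proposal's unescaped set.
# Assumption: the reserved characters "=", ";" and "%" are escaped as well
# when they appear as data, so they are left out of the allowed class here.
sub gff3_escape {
    my ($text) = @_;
    $text =~ s/([^a-zA-Z0-9.:^*\$\@!+_?-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Reverse the escaping: each %XX becomes the character with that hex code.
sub gff3_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

print gff3_escape("unc-3 homolog; putative"), "\n";
```

Because every escape is a "%" followed by two hex digits, a field can still be searched with grep as long as the pattern is escaped the same way.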
>
> Undefined fields are replaced with the "." character, as described in
> the original GFF spec.
>
> Column 1: "seqid"
>
> The ID of the landmark used to establish the coordinate system for the
> current feature. IDs must contain alphanumeric characters.
> Whitespace, if present, must be escaped using URL escaping rules
> (e.g. space="%20").
>
> Column 2: "source"
>
> The source of the feature. This is unchanged from the older GFF specs
> and is not part of a controlled vocabulary.
>
> Column 3: "type"
>
> The type of the feature (previously called the "method"). This is
> constrained to be either: (a) a term from SOFA; or (b) a SOFA
> accession number. The latter alternative is distinguished using the
> syntax SOFA:000000.
>
> Columns 4 & 5: "start" and "end"
>
> The start and end of the feature, in 1-based integer coordinates,
> relative to the landmark given in column 1. Start is always less than
> or equal to end.
>
> Column 6: "score"
>
> The score of the feature, a floating point number. As in earlier
> versions of the format, the semantics of the score are ill-defined.
> It is strongly recommended that E-values be used for sequence
> similarity features, and that P-values be used for ab initio gene
> prediction features.
>
> Column 7: "strand"
>
> The strand of the feature. + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded. In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
>
> Column 8: "phase"
>
> The phase of the feature, for protein-encoding features (primarily
> CDSs). This is an integer-valued field with the values 0, 1, or 2.
> The integer indicates the offset from the start of the feature to the
> first base of the first codon in the reading frame. "." is used for
> features that do not correspond to a reading frame.
>
> Column 9: "group"
>
> A list of the immediate parents of the current feature. Multiple
> parents are allowed (example: one exon shared by multiple
> transcripts). Multiple parents are separated by a semicolon.
> Parentless features have a dot in this field.
>
> Column 10: "attributes"
>
> A list of feature attributes in the format tag=value. Multiple
> tag=value pairs are separated by semicolons. URL escaping rules are
> used for tags or values containing whitespace, "=" characters and
> semicolons.
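As a rough illustration of the attribute syntax just described, the sketch below splits column 10 into tag/value pairs. The sub name parse_attributes is hypothetical. It assumes a tag appears at most once per line, and it splits on ";" and "=" before unescaping so that escaped separators inside a value are not mistaken for real ones.

```perl
use strict;
use warnings;

# Parse a column-10 string such as "ID=e1;Note=two%20words" into a hash ref.
sub parse_attributes {
    my ($column) = @_;
    my %attr;
    # "." marks an undefined field, so there is nothing to parse.
    return \%attr if !defined $column || $column eq '.';
    for my $pair (split /;/, $column) {
        my ($tag, $value) = split /=/, $pair, 2;
        next unless defined $tag && length $tag;
        $value = '' unless defined $value;
        # Undo URL escaping only after the separators have been consumed.
        for ($tag, $value) { s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge }
        $attr{$tag} = $value;
    }
    return \%attr;
}

my $attr = parse_attributes('ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345');
print "$_ => $attr->{$_}\n" for sort keys %$attr;
```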
>
> Two tags are special:
>
> ID Indicates the name of the feature. IDs must be unique
> within the scope of the GFF file.
>
> Target Indicates the target of a nucleotide to nucleotide or
> nucleotide to protein alignment. The format of the
> value is "target_id:start..end" Start may be greater
> than end to indicate a + strand alignment to the
> reverse complement of a target nucleotide sequence.
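The Target value can be taken apart with a single pattern. This is only a sketch: parse_target and the keys of the hash it returns are hypothetical names, and it assumes the value has already been URL-unescaped.

```perl
use strict;
use warnings;

# Split "target_id:start..end" into its parts.
# Returns undef (or an empty list) if the value does not match that shape.
sub parse_target {
    my ($value) = @_;
    # The greedy (.+) lets the target ID itself contain ":" characters.
    my ($id, $start, $end) = $value =~ /^(.+):(\d+)\.\.(\d+)$/;
    return unless defined $id;
    return {
        id       => $id,
        start    => $start,
        end      => $end,
        # start > end signals an alignment to the reverse complement
        reversed => $start > $end ? 1 : 0,
    };
}

my $target = parse_target('cb123:1001..1100');
print "$target->{id} $target->{start}-$target->{end}\n";
```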
>
> In the example GFF3 file given below, the first column contains line
> numbers that I have added for the purposes of the narrative. Here are
> some common scenarios that I have attempted to illustrate:
>
> A) a simple feature, no public ID
>
> Line 2 in the example is a feature of type "repeat". It has a start
> and an end and no ID, but it does have an attribute named "Note."
>
> B) a simple feature with a public ID
>
> Line 3 is a feature of type clone. It has a start and an end. Its
> parent is undefined (empty column 9), but it has an attribute of type
> ID with value "cTel33B."
>
> C) a feature with multiple attributes
>
> Line 5 is a feature of type "gene." It has no parent, and has
> attributes of type ID, Note, and GO_term.
>
> D) a hierarchical grouping of features
>
> Lines 5-13 demonstrate a hierarchical grouping. At the top level is
> line 5, which defines the extent of a "gene" with ID Y74C9A.1. Below
> this are two features of type mRNA (lines 6 and 7). Their group
> fields contain the ID of Y74C9A.1, indicating that this feature is
> their immediate parent. In the 10th column, the mRNA features have
> their own IDs independent of the ID of the parent gene.
>
> This pattern is repeated for the exons listed on lines 8-11. Exons
> e1, e2, and e4 belong to both of the transcripts. Therefore, both
> transcript IDs are listed in the group column, separated by
> semicolons.
>
> Exon e3 belongs only to one of the transcripts, and therefore only
> that transcript's ID is listed in the group column.
>
> Lines 12 and 13 indicate coding_start and coding_end features. These
> subfeatures are hierarchically grouped underneath their corresponding
> exons, but they do not have independent public IDs.
>
> E) Disjunct coordinates
>
> Lines 14-16 illustrate a single feature -- the CDS corresponding to
> mRNA Y74C9A.1a -- which occupies multiple disjunct regions. The group
> column indicates that the CDS belongs to mRNA Y74C9A.1a. However, the
> attribute column assigns each of lines 14-16 the same ID. Because the
> ID is the same, this is to be interpreted as a single feature that
> spans multiple locations.
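One way to read "same ID means one feature" is to collect the segments of every line that shares an ID. The sketch below does this for a few hypothetical pre-parsed rows; group_by_id and location_string are made-up names, and sorting the segments by start is a choice made here, not something the proposal requires.

```perl
use strict;
use warnings;

# Hypothetical pre-parsed rows: [seqid, start, end, ID].
my @rows = (
    ['I', 1,    100,  '12345.s'],
    ['I', 101,  500,  '12345.s'],
    ['I', 501,  1000, '12345.s'],
    ['I', 5001, 6000, 'abc'],
);

# Merge rows that share an ID into one feature with several segments,
# keeping features in order of first appearance.
sub group_by_id {
    my (@records) = @_;
    my (%segments, @order);
    for my $row (@records) {
        my ($seqid, $start, $end, $id) = @$row;
        push @order, $id unless exists $segments{$id};
        push @{ $segments{$id} }, [$start, $end];
    }
    return map {
        +{ id => $_, segments => [ sort { $a->[0] <=> $b->[0] } @{ $segments{$_} } ] }
    } @order;
}

# Render a feature's location; multi-segment features become a join().
sub location_string {
    my ($feature) = @_;
    my @parts = map { "$_->[0]..$_->[1]" } @{ $feature->{segments} };
    return @parts > 1 ? 'join(' . join(',', @parts) . ')' : $parts[0];
}

print "$_->{id}\t", location_string($_), "\n" for group_by_id(@rows);
```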
>
> F) Alignments
>
> Lines 17-19 demonstrate a gapped alignment of two sequences using the
> reserved Target attribute. Each non-gapped segment becomes a line in
> the GFF3 file. The segments each share the same ID, thereby
> indicating that the segments are disjunct regions of the same feature.
> The Target attribute indicates the ID of the target sequence (which
> does not have to be represented in the GFF3 file) and the start and
> end coordinates of the aligned target.
>
> Unlike the GFF1 and GFF2 formats, the group field for gapped
> alignments can be empty. However, a valid alternative representation
> is to create a single "match" feature, and a series of "hsp" features
> underneath it via the group field. Lines 20-22 show this alternative
> representation.
>
> G) Relative coordinates
>
> Lines 23-26 illustrate using relative coordinate addressing in
> feature/subfeature relationships. Line 23 defines an mRNA that is
> positioned on sequence landmark "I" from positions 5000 to 6000. Its
> ID field indicates that it is M7.3. Lines 24-26 are exon subfeatures
> of M7.3 as indicated by their group field. However, the seqid field
> specifies M7.3 as the parent coordinate system, thereby allowing the
> exons to begin at position 1.
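The arithmetic behind relative addressing can be sketched as follows, assuming the parent lies on the + strand of its landmark. to_absolute is a hypothetical helper, not part of the proposal or of Bio::Tools::GFF. Position 1 of the child coordinate system corresponds to the parent's own start.

```perl
use strict;
use warnings;

# Map 1-based coordinates relative to a parent feature onto the parent's
# landmark. Assumption: the parent is on the + strand.
sub to_absolute {
    my ($parent_start, $rel_start, $rel_end) = @_;
    my $offset = $parent_start - 1;    # relative position 1 == parent start
    return ($rel_start + $offset, $rel_end + $offset);
}

# The mRNA M7.3 starts at position 5000 on landmark "I".
my $parent_start = 5000;
for my $exon (['M7.3.1', 1, 300], ['M7.3.2', 301, 400], ['M7.3.3', 401, 1000]) {
    my ($id, $rel_start, $rel_end) = @$exon;
    my ($abs_start, $abs_end) = to_absolute($parent_start, $rel_start, $rel_end);
    print "$id\texon\t$abs_start..$abs_end\n";
}
```

With these inputs the exons land at 5000..5299, 5300..5399 and 5400..5999 on landmark I.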
>
> 0 ##gff-version 3
> 1 ##sequence-region I:1..14972282
> 2 I wormbase repeat 5000 5100 . . . . Note=ALU3
> 3 I wormbase clone 1 2679 . + . . ID=cTel33B
> 4 I wormbase contig 1 14972282 . + . . ID=CHROMOSOME_I
> 5 I wormbase gene 43733 44677 . + . . ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
> 6 I wormbase mRNA 43733 44677 . + . Y74C9A.1 ID=Y74C9A.1a
> 7 I wormbase mRNA 43733 44677 . + . Y74C9A.1 ID=Y74C9A.1b
> 8 I wormbase exon 43733 43961 . + . Y74C9A.1a;Y74C9A.1b ID=e1
> 9 I wormbase exon 44030 44234 . + . Y74C9A.1a;T:Y74C9A.1b ID=e2
> 10 I wormbase exon 44281 44328 . + . Y74C9A.1b ID=e3
> 11 I wormbase exon 44521 44677 . + . Y74C9A.1a;T:Y74C9A.1b ID=e4
> 12 I wormbase coding_start 43740 43740 . + . e1
> 13 I wormbase coding_end 44677 44677 . + . e4
> 14 I wormbase cds 43740 43961 . + 0 Y74C9A.1a
> 15 I wormbase cds 44030 44234 . + 1 Y74C9A.1a
> 16 I wormbase cds 44521 44677 . + 1 Y74C9A.1a
> 17 I wormbase match 1 100 100 . . . ID=12345.s;Target=cb123:1001..1100
> 18 I wormbase match 101 500 20 . . . ID=12345.s;Target=cb123:1101..1500
> 19 I wormbase match 501 1000 80 . . . ID=12345.s;Target=cb123:1501..2000
> 20 I wormbase match 5001 6000 100 . . . ID=abc;Target=M1:1..1000
> 21 I wormbase hsp 5001 5500 . . . abc Target=M1:1..500
> 22 I wormbase hsp 5501 6000 . . . abc Target=M1:501..1000
> 23 I wormbase mRNA 5000 6000 . + . . ID=M7.3
> 24 M7.3 wormbase exon 1 300 . + . M7.3 ID=M7.3.1
> 25 M7.3 wormbase exon 301 400 . + . M7.3 ID=M7.3.2
> 26 M7.3 wormbase exon 401 1000 . + . M7.3 ID=M7.3.3
>
> =================================================================
>
> I have extended (in an experimental way) the Bio::Tools::GFF module
> to accommodate this new format. Here is a test script and its output
> when run on the above file.
>
> #!/usr/bin/perl -w
> use strict;
> use lib '.';    # load the experimental Bio::Tools::GFF from this directory
>
> use Bio::Tools::GFF;
>
> # Read GFF3 from STDIN and print the feature hierarchy.
> my $gffio = Bio::Tools::GFF->new(-fh => \*STDIN, -gff_version => 3);
> my @f     = $gffio->features;
> format_features(\@f);
>
> # Print one line per feature, indented by one tab per nesting level,
> # then recurse into its subfeatures.
> sub format_features {
>     my $features = shift;
>     my $tabs     = shift || 0;
>     for my $f (@$features) {
>         my $type = $f->primary_tag;
>         my $id   = $f->unique_id || '(no id)';
>         my $alt  = ($f->alternative_locations)[0];
>
>         # The eval yields an empty list when $alt is undefined, so the
>         # alternative location is printed only if the feature has one.
>         print "\t" x $tabs,
>               join("\t", $id, $type, $f->location->to_FTstring,
>                    eval { $alt->location->seq_id,
>                           $alt->location->to_FTstring }),
>               "\n";
>         format_features([$f->sub_SeqFeature], $tabs + 1);
>     }
> }
>
> 1;
>
> OUTPUT:
>
> cTel33B clone 1..2679
> CHROMOSOME_I contig 1..14972282
> 12345.s match join(101..500,1..100,501..1000)
> M7.3 mRNA 5000..6000
>     M7.3.1 exon 5000..5299
>     M7.3.2 exon 5300..5399
>     M7.3.3 exon 5400..5999
> abc match 5001..6000
>     (no id) hsp 5001..5500
>     (no id) hsp 5501..6000
> (no id) repeat 5000..5100
> Y74C9A.1 gene 43733..44677
>     Y74C9A.1a mRNA 43733..44677
>         e1 exon 43733..43961
>             (no id) coding_start 43740
>         e2 exon 44030..44234
>         e4 exon 44521..44677
>             (no id) coding_end 44677
>         (no id) cds 43740..43961
>         (no id) cds 44030..44234
>         (no id) cds 44521..44677
>     Y74C9A.1b mRNA 43733..44677
>         e1 exon 43733..43961
>             (no id) coding_start 43740
>         e3 exon 44281..44328
>
>
> --
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein at cshl.org                                 Cold Spring Harbor, NY
> 1 Bungtown Road, Cold Spring Harbor, NY 11724
> ========================================================================