[Bioperl-l] RE: Proposed GFF version 3
Tony Cox
avc at sanger.ac.uk
Tue Feb 11 19:00:42 EST 2003
Largely my experience too!
Tony
+>
+> Everywhere outside of WormBase and DAS I've personally seen
+> uses '&'. We had to implement ';' to cope with DAS.
+>
+> ----- Original Message -----
+> From: "Richard Durbin" <rd at sanger.ac.uk>
+> To: <lstein at cshl.org>
+> Cc: <bioperl-l at bioperl.org>; <suzi at fruitfly.org>;
+> <gff-list at sanger.ac.uk>
+> Sent: Tuesday, February 11, 2003 4:54 AM
+> Subject: Re: Proposed GFF version 3
+>
+>
+> > Swap them entirely. i.e. put the attributes in column 9 and call
+> > that "attributes" and put the new hierarchical group term in
+> > column 10 and call that "group". Or perhaps it would be better to
+> > call it something else to minimise confusion, because in gff
+> > version 1 column 9 was called group. What about calling column 10
+> > "cluster"?
+> >
+> > I see you have switched to URL type format for the attributes,
+> > away from acedb. That's fine - URL format is much more universal.
+> > But is ';' a standard separator in URLS? I just looked and see
+> > that Ensembl uses '&' and WormBase uses ';' and I think I have
+> > seen '+' somewhere, so maybe there is no standard.
+> >
+> > Richard
+> >
+> > Lincoln Stein wrote:
+> > > Hi Richard,
+> > >
+> > > Do you mean that we should swap columns 9 and 10 entirely, or
+> > > just swap their names? I think you mean the former, but I want
+> > > to be sure.
+> > >
+> > > Lincoln
+> > >
+> > > On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
+> > >
+> > >>Hello all,
+> > >>
+> > >>This looks very nice to me. Not surprising perhaps because I had
+> > >>an earlier involvement as Lincoln says.
+> > >>
+> > >>I have added gff-list at sanger.ac.uk to the mailing Cc: list
+> > >>because it is the "official" GFF mailing list, although it is
+> > >>very little used.
+> > >>
+> > >>I have one major comment, that columns 9 (group) and 10
+> > >>(attributes) should be switched. Although in GFF version 1
+> > >>column 9 was called "group", in version 2, which is what has
+> > >>been current for over two years, this was renamed "attribute"
+> > >>and contains the attribute information. For consistency we
+> > >>should keep column 9 for the attributes. Also, in many cases
+> > >>there will be attributes but no group.
+> > >>
+> > >>I like ID and Target. I see the idea with hsp's for gapped
+> > >>alignments, though perhaps they could be called "match_block".
+> > >>But there is a case I think to also encode gapped alignments on
+> > >>one line, perhaps using the CIGAR encoding used by ENSEMBL (and
+> > >>BioPerl?), e.g. as
+> > >>
+> > >> Target=M1:1..1000;Align=xxxxxxx
+> > >>
+> > >>(sorry I don't know cigar format well enough to write a legal
+> > >>string).
+> > >>
+> > >>Richard
+> > >>
+> > >>Lincoln Stein wrote:
+> > >>
+> > >>>This letter is to discuss a proposed extension to GFF. It
+> > >>>arises from conversations with Richard Durbin during last
+> > >>>fall's Hinxton genome informatics meeting.
+> > >>>
+> > >>>Although there are many richer ways of representing genomic
+> > >>>features via XML, the stubborn persistence of a variety of
+> > >>>ad-hoc tab-delimited flat file formats declares the
+> > >>>bioinformatics community's need for a simple format that can be
+> > >>>modified with a text editor and processed with shell tools like
+> > >>>grep. The GFF format, although widely used, has fragmented into
+> > >>>multiple incompatible dialects. When asked why they have
+> > >>>modified the published Sanger specification, bioinformaticists
+> > >>>frequently answer that the format was insufficient for their
+> > >>>needs, and they needed to extend it. The proposed GFF3 format
+> > >>>addresses the most common extensions to GFF, while preserving
+> > >>>backward compatibility with previous formats. The new format:
+> > >>>
+> > >>> 1) adds a mechanism for representing more than one level of
+> > >>>    hierarchical grouping of features and subfeatures.
+> > >>> 2) separates the ideas of group membership and feature name/id.
+> > >>> 3) constrains the feature type field to be taken from a
+> > >>>    controlled vocabulary.
+> > >>> 4) allows a single feature, such as an exon, to belong to more
+> > >>>    than one group at a time.
+> > >>> 5) provides one level of relative addressing for subfeatures
+> > >>>    (e.g. exons can be expressed in transcript coordinates).
+> > >>> 6) defines an explicit convention for pairwise alignments.
+> > >>> 7) defines an explicit convention for features that occupy
+> > >>>    disjunct regions.
+> > >>>
+> > >>>The format consists of 10 columns, separated by spaces. The
+> > >>>following unescaped characters are allowed within fields:
+> > >>>[a-zA-Z0-9.:;=%^*$@!+_?-]. All other characters must be escaped
+> > >>>using the URL escaping conventions. Unescaped quotation marks,
+> > >>>backslashes and other ad-hoc escaping conventions that have
+> > >>>been added to the GFF format are explicitly forbidden. The =, ;
+> > >>>and % characters have reserved meanings as described below.
+> > >>>
+> > >>>Undefined fields are replaced with the "." character, as
+> > >>>described in the original GFF spec.
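
The escaping rule described above can be sketched in Python as follows. This is not part of the original message: the helper names `escape_field` and `unescape_field` are hypothetical, and the sketch assumes that the reserved characters "=", ";" and "%" should be percent-encoded whenever they occur inside a value.

```python
import re
from urllib.parse import unquote

# Characters the proposal allows unescaped, minus the reserved "=", ";", "%".
# (Assumption: reserved characters inside a value are always escaped.)
SAFE = re.compile(r"[a-zA-Z0-9.:^*$@!+_?-]")


def escape_field(text):
    """Percent-encode every character outside the safe set."""
    out = []
    for ch in text:
        if SAFE.fullmatch(ch):
            out.append(ch)
        else:
            # Encode as UTF-8 and emit one %XX triplet per byte.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


def unescape_field(text):
    """Reverse escape_field using standard URL unescaping."""
    return unquote(text)
```

For instance, `escape_field("unc 3;x=1")` gives `unc%203%3Bx%3D1`, and `unescape_field` restores the original string.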
+> > >>>
+> > >>>Column 1: "seqid"
+> > >>>
+> > >>>The ID of the landmark used to establish the coordinate system
+> > >>>for the current feature. IDs must contain alphanumeric
+> > >>>characters. Whitespace, if present, must be escaped using URL
+> > >>>escaping rules (e.g. space="%20").
+> > >>>
+> > >>>Column 2: "source"
+> > >>>
+> > >>>The source of the feature. This is unchanged from the older GFF
+> > >>>specs and is not part of a controlled vocabulary.
+> > >>>
+> > >>>Column 3: "type"
+> > >>>
+> > >>>The type of the feature (previously called the "method"). This
+> > >>>is constrained to be either: (a) a term from SOFA; or (b) a
+> > >>>SOFA accession number. The latter alternative is distinguished
+> > >>>using the syntax SOFA:000000.
+> > >>>
+> > >>>Columns 4 & 5: "start" and "end"
+> > >>>
+> > >>>The start and end of the feature, in 1-based integer
+> > >>>coordinates, relative to the landmark given in column 1. Start
+> > >>>is less than end.
+> > >>>
+> > >>>Column 6: "score"
+> > >>>
+> > >>>The score of the feature, a floating point number. As in
+> > >>>earlier versions of the format, the semantics of the score are
+> > >>>ill-defined. It is strongly recommended that E-values be used
+> > >>>for sequence similarity features, and that P-values be used for
+> > >>>ab initio gene prediction features.
+> > >>>
+> > >>>Column 7: "strand"
+> > >>>
+> > >>>The strand of the feature. + for positive strand (relative to
+> > >>>the landmark), - for minus strand, and . for features that are
+> > >>>not stranded. In addition, ? can be used for features whose
+> > >>>strandedness is relevant, but unknown.
+> > >>>
+> > >>>Column 8: "phase"
+> > >>>
+> > >>>The phase of the feature, for protein-encoding features
+> > >>>(primarily CDSs). This is an integer-valued field with the
+> > >>>values 0, 1, or 2. The integer indicates the offset from the
+> > >>>start of the feature to the first base of the first codon in
+> > >>>the reading frame. "." is used for features that do not
+> > >>>correspond to a reading frame.
+> > >>>
+> > >>>Column 9: "group"
+> > >>>
+> > >>>A list of the immediate parents of the current feature.
+> > >>>Multiple parents are allowed (example: one exon shared by
+> > >>>multiple transcripts). Multiple parents are separated by a
+> > >>>semicolon. Parentless features have a dot in this field.
+> > >>>
+> > >>>Column 10: "attributes"
+> > >>>
+> > >>>A list of feature attributes in the format tag=value. Multiple
+> > >>>tag=value pairs are separated by semicolons. URL escaping rules
+> > >>>are used for tags or values containing whitespace, "="
+> > >>>characters and semicolons.
+> > >>>
+> > >>>Two tags are special:
+> > >>>
+> > >>> ID Indicates the name of the feature. IDs must be unique
+> > >>> within the scope of the GFF file.
+> > >>>
+> > >>> Target Indicates the target of a nucleotide to nucleotide or
+> > >>> nucleotide to protein alignment. The format of the
+> > >>> value is "target_id:start..end". Start may be greater
+> > >>> than end to indicate a + strand alignment to the
+> > >>> reverse complement of a target nucleotide sequence.
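
As a rough illustration of the ten-column layout described above, the following Python sketch splits one line into named fields and decodes the group and attributes columns. It is not taken from the thread: `parse_line` and `parse_target` are hypothetical names, and the sketch assumes whitespace-separated columns with any whitespace inside a field already URL-escaped, as the proposal requires.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "group", "attributes")


def parse_line(line):
    """Split one 10-column line of the proposed format into a dict."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    rec = dict(zip(COLUMNS, fields))
    rec["start"], rec["end"] = int(rec["start"]), int(rec["end"])
    # Column 9: semicolon-separated parent IDs, "." when parentless.
    rec["group"] = ([] if rec["group"] == "."
                    else [unquote(p) for p in rec["group"].split(";")])
    # Column 10: semicolon-separated tag=value pairs, "." when undefined.
    attrs = {}
    if rec["attributes"] != ".":
        for pair in rec["attributes"].split(";"):
            tag, _, value = pair.partition("=")
            attrs[unquote(tag)] = unquote(value)
    rec["attributes"] = attrs
    return rec


def parse_target(value):
    """Decode a Target value of the form target_id:start..end."""
    target_id, _, span = value.rpartition(":")
    start, _, end = span.partition("..")
    return target_id, int(start), int(end)
```

Applied to line 5 of the example, `parse_line` yields an empty parent list and the three attributes ID, Note and GO_term; `parse_target("cb123:1001..1100")` returns `("cb123", 1001, 1100)`.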
+> > >>>
+> > >>>In the example GFF3 file given below, the first column contains
+> > >>>line numbers that I have added for the purposes of the
+> > >>>narrative. Here are some common scenarios that I have attempted
+> > >>>to illustrate:
+> > >>>
+> > >>>A) a simple feature, no public ID
+> > >>>
+> > >>>Line 2 in the example is a feature of type "repeat". It has a
+> > >>>start and an end and no ID, but it does have an attribute named
+> > >>>"Note."
+> > >>>
+> > >>>B) a simple feature with a public ID
+> > >>>
+> > >>>Line 3 is a feature of type clone. It has a start and an end.
+> > >>>Its parent is undefined (empty column 9), but it has an
+> > >>>attribute of type ID with value "cTel33B."
+> > >>>
+> > >>>C) a feature with multiple attributes
+> > >>>
+> > >>>Line 5 is a feature of type "gene." It has no parent, and has
+> > >>>attributes of type ID, Note, and GO_term.
+> > >>>
+> > >>>D) a hierarchical grouping of features
+> > >>>
+> > >>>Lines 5-13 demonstrate a hierarchical grouping. At the top
+> > >>>level is line 5, which defines the extent of a "gene" with ID
+> > >>>Y74C9A.1. Below this are two features of type mRNA (lines 6
+> > >>>and 7). Their group fields contain the ID of Y74C9A.1,
+> > >>>indicating that this feature is their immediate parent. In the
+> > >>>10th column, the mRNA features have their own IDs independent
+> > >>>of the ID of the parent gene.
+> > >>>
+> > >>>This pattern is repeated for the exons listed on lines 8-11.
+> > >>>Exons e1, e2, and e4 belong to both of the transcripts.
+> > >>>Therefore, both transcript IDs are listed in the group column,
+> > >>>separated by semicolons.
+> > >>>
+> > >>>Exon e3 belongs only to one of the transcripts, and therefore
+> > >>>only that transcript's ID is listed in the group column.
+> > >>>
+> > >>>Lines 12 and 13 indicate coding_start and coding_end features.
+> > >>>These subfeatures are hierarchically grouped underneath their
+> > >>>corresponding exons, but they do not have independent public
+> > >>>IDs.
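
The parent/child grouping described in scenario D could be reconstructed along these lines. This is a Python sketch, not code from the thread: the `FEATURES` list is a hand-copied subset of the example (the gene, both mRNAs, and exons e1 and e3), and `build_children` and `render` are hypothetical helpers.

```python
from collections import defaultdict

# (ID, type, parent IDs) for a subset of the example lines.
FEATURES = [
    ("Y74C9A.1", "gene", []),
    ("Y74C9A.1a", "mRNA", ["Y74C9A.1"]),
    ("Y74C9A.1b", "mRNA", ["Y74C9A.1"]),
    ("e1", "exon", ["Y74C9A.1a", "Y74C9A.1b"]),
    ("e3", "exon", ["Y74C9A.1b"]),
]


def build_children(features):
    """Map each parent ID to the IDs listed under it in the group column."""
    children = defaultdict(list)
    for feature_id, _type, parents in features:
        for parent in parents:
            children[parent].append(feature_id)
    return children


def render(feature_id, children, depth=0):
    """Return tab-indented lines for a feature and all its descendants."""
    lines = ["\t" * depth + feature_id]
    for child in children.get(feature_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


print("\n".join(render("Y74C9A.1", build_children(FEATURES))))
```

Because e1 names both transcripts as parents, it is rendered once under each mRNA, which is the behaviour the multiple-parent rule is meant to allow.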
+> > >>>
+> > >>>E) Disjunct coordinates
+> > >>>
+> > >>>Lines 14-16 illustrate a single feature -- the CDS
+> > >>>corresponding to mRNA Y74C9A.1a -- which occupies multiple
+> > >>>disjunct regions. The group column indicates that the CDS
+> > >>>belongs to mRNA Y74C9A.1a. However, the attribute column
+> > >>>assigns each of lines 14-16 the same ID. Because the ID is the
+> > >>>same, this is to be interpreted as a single feature that spans
+> > >>>multiple locations.
+> > >>>
+> > >>>F) Alignments
+> > >>>
+> > >>>Lines 17-19 demonstrate a gapped alignment of two sequences
+> > >>>using the reserved Target attribute. Each non-gapped segment
+> > >>>becomes a line in the GFF3 file. The segments each share the
+> > >>>same ID, thereby indicating that the segments are disjunct
+> > >>>regions of the same feature. The Target attribute indicates the
+> > >>>ID of the target sequence (which does not have to be
+> > >>>represented in the GFF3 file) and the start and end coordinates
+> > >>>of the aligned target.
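
One way to read the shared-ID rule is that lines with the same ID are collapsed into a single feature with several locations. The Python sketch below shows that idea on the three match segments from example lines 17-19. It is an illustration only: `merge_by_id` and `to_location_string` are hypothetical names, and sorting the segments by start position is an assumption of this sketch rather than something the proposal specifies.

```python
from collections import defaultdict

# (ID, start, end) for the three match segments in example lines 17-19.
SEGMENTS = [
    ("12345.s", 1, 100),
    ("12345.s", 101, 500),
    ("12345.s", 501, 1000),
]


def merge_by_id(segments):
    """Collect all (start, end) spans that share an ID into one feature."""
    spans = defaultdict(list)
    for feature_id, start, end in segments:
        spans[feature_id].append((start, end))
    # Sorting by start is a choice made here for readable output.
    return {fid: sorted(locs) for fid, locs in spans.items()}


def to_location_string(locs):
    """Format spans as start..end, wrapped in join(...) when there are several."""
    parts = [f"{start}..{end}" for start, end in locs]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"
```

Here `to_location_string(merge_by_id(SEGMENTS)["12345.s"])` produces `join(1..100,101..500,501..1000)`.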
+> > >>>
+> > >>>Unlike the GFF1 and GFF2 formats, the group field for gapped
+> > >>>alignments can be empty. However, a valid alternative
+> > >>>representation is to create a single "match" feature, and a
+> > >>>series of "hsp" features underneath it via the group field.
+> > >>>Lines 20-22 show this alternative representation.
+> > >>>
+> > >>>G) Relative coordinates
+> > >>>
+> > >>>Lines 23-26 illustrate using relative coordinate addressing in
+> > >>>feature/subfeature relationships. Line 23 defines an mRNA that
+> > >>>is positioned on sequence landmark "I" from positions 5000 to
+> > >>>6000. Its ID field indicates that it is M7.3. Lines 24-26 are
+> > >>>exon subfeatures of M7.3 as indicated by their group field.
+> > >>>However, the seqid field specifies M7.3 as the parent
+> > >>>coordinate system, thereby allowing the exons to begin at
+> > >>>position 1.
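
The arithmetic behind relative addressing can be sketched in a few lines of Python. The function name `to_absolute` is hypothetical, and the sketch assumes 1-based inclusive coordinates and a parent on the forward strand; minus-strand parents would need the mapping reversed.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark.

    Assumes the parent lies on the forward strand.
    """
    offset = parent_start - 1  # relative position 1 is the parent's first base
    return rel_start + offset, rel_end + offset


# The mRNA M7.3 starts at 5000 on landmark I; its exons are given relative to it.
for rel in [(1, 300), (301, 400), (401, 1000)]:
    print(to_absolute(5000, *rel))
```

With the parent starting at 5000, the three exons map to 5000..5299, 5300..5399 and 5400..5999, which agrees with the exon coordinates shown in the script output further down.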
+> > >>>
+> > >>>  0 ##gff-version 3
+> > >>>  1 ##sequence-region I:1..14972282
+> > >>>  2 I    wormbase repeat       5000  5100     .   . . .                     Note=ALU3
+> > >>>  3 I    wormbase clone        1     2679     .   + . .                     ID=cTel33B
+> > >>>  4 I    wormbase contig       1     14972282 .   + . .                     ID=CHROMOSOME_I
+> > >>>  5 I    wormbase gene         43733 44677    .   + . .                     ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
+> > >>>  6 I    wormbase mRNA         43733 44677    .   + . Y74C9A.1              ID=Y74C9A.1a
+> > >>>  7 I    wormbase mRNA         43733 44677    .   + . Y74C9A.1              ID=Y74C9A.1b
+> > >>>  8 I    wormbase exon         43733 43961    .   + . Y74C9A.1a;Y74C9A.1b   ID=e1
+> > >>>  9 I    wormbase exon         44030 44234    .   + . Y74C9A.1a;T:Y74C9A.1b ID=e2
+> > >>> 10 I    wormbase exon         44281 44328    .   + . Y74C9A.1b             ID=e3
+> > >>> 11 I    wormbase exon         44521 44677    .   + . Y74C9A.1a;T:Y74C9A.1b ID=e4
+> > >>> 12 I    wormbase coding_start 43740 43740    .   + . e1
+> > >>> 13 I    wormbase coding_end   44677 44677    .   + . e4
+> > >>> 14 I    wormbase cds          43740 43961    .   + 0 Y74C9A.1a
+> > >>> 15 I    wormbase cds          44030 44234    .   + 1 Y74C9A.1a
+> > >>> 16 I    wormbase cds          44521 44677    .   + 1 Y74C9A.1a
+> > >>> 17 I    wormbase match        1     100      100 . . .                     ID=12345.s;Target=cb123:1001..1100
+> > >>> 18 I    wormbase match        101   500      20  . . .                     ID=12345.s;Target=cb123:1101..1500
+> > >>> 19 I    wormbase match        501   1000     80  . . .                     ID=12345.s;Target=cb123:1501..2000
+> > >>> 20 I    wormbase match        5001  6000     100 . . .                     ID=abc;Target=M1:1..1000
+> > >>> 21 I    wormbase hsp          5001  5500     .   . . abc                   Target=M1:1..500
+> > >>> 22 I    wormbase hsp          5501  6000     .   . . abc                   Target=M1:501..100
+> > >>> 23 I    wormbase mRNA         5000  6000     +   . . .                     ID=M7.3
+> > >>> 24 M7.3 wormbase exon         1     300      +   . . M7.3                  ID=M7.3.1
+> > >>> 25 M7.3 wormbase exon         301   400      +   . . M7.3                  ID=M7.3.2
+> > >>> 26 M7.3 wormbase exon         401   1000     +   . . M7.3                  ID=M7.3.3
+> > >>>
+> > >>>=================================================================
+> > >>>
+> > >>>I have extended (in an experimental way) the Bio::Tools::GFF
+> > >>>module to accommodate this new format. Here is a test script
+> > >>>and its output when run on the above file.
+> > >>>
+> > >>> #!/usr/bin/perl -w
+> > >>> use strict;
+> > >>> use lib '.';
+> > >>>
+> > >>> use Bio::Tools::GFF;
+> > >>>
+> > >>> # Read GFF3 from STDIN and collect the top-level features.
+> > >>> my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
+> > >>> my @f = $gffio->features;
+> > >>> format_features(\@f);
+> > >>>
+> > >>> # Print one feature per line, indenting subfeatures by depth.
+> > >>> sub format_features {
+> > >>>   my $features = shift;
+> > >>>   my $tabs     = shift || 0;
+> > >>>   for my $f (@$features) {
+> > >>>     my $type = $f->primary_tag;
+> > >>>     my $id   = $f->unique_id;
+> > >>>     $id    ||= '(no id)';
+> > >>>     my $alt  = ($f->alternative_locations)[0];
+> > >>>
+> > >>>     # The eval yields an empty list when there is no alternative
+> > >>>     # location, so the target columns are simply omitted.
+> > >>>     print "\t"x$tabs,
+> > >>>           join("\t",$id,$type,$f->location->to_FTstring,
+> > >>>                eval{$alt->location->seq_id,
+> > >>>                     $alt->location->to_FTstring}),
+> > >>>           "\n";
+> > >>>     format_features([$f->sub_SeqFeature],$tabs+1);
+> > >>>   }
+> > >>> }
+> > >>>
+> > >>> 1;
+> > >>>
+> > >>>OUTPUT:
+> > >>>
+> > >>>cTel33B clone 1..2679
+> > >>>CHROMOSOME_I contig 1..14972282
+> > >>>12345.s match join(101..500,1..100,501..1000)
+> > >>>M7.3 mRNA 5000..6000
+> > >>> M7.3.1 exon 5000..5299
+> > >>> M7.3.2 exon 5300..5399
+> > >>> M7.3.3 exon 5400..5999
+> > >>>abc match 5001..6000
+> > >>> (no id) hsp 5001..5500
+> > >>> (no id) hsp 5501..6000
+> > >>>(no id) repeat 5000..5100
+> > >>>Y74C9A.1 gene 43733..44677
+> > >>> Y74C9A.1a mRNA 43733..44677
+> > >>> e1 exon 43733..43961
+> > >>> (no id) coding_start 43740
+> > >>> e2 exon 44030..44234
+> > >>> e4 exon 44521..44677
+> > >>> (no id) coding_end 44677
+> > >>> (no id) cds 43740..43961
+> > >>> (no id) cds 44030..44234
+> > >>> (no id) cds 44521..44677
+> > >>> Y74C9A.1b mRNA 43733..44677
+> > >>> e1 exon 43733..43961
+> > >>> (no id) coding_start 43740
+> > >>> e3 exon 44281..44328
+> > >>
+> > >
+> >
+> >
+>