[Bioperl-l] Re: Proposed GFF version 3
Jim Kent
jim_kent at pacbell.net
Tue Feb 11 10:40:09 EST 2003
Every site I've personally seen outside of WormBase and DAS
uses '&'. We had to implement ';' to cope with DAS.
----- Original Message -----
From: "Richard Durbin" <rd at sanger.ac.uk>
To: <lstein at cshl.org>
Cc: <bioperl-l at bioperl.org>; <suzi at fruitfly.org>; <gff-list at sanger.ac.uk>
Sent: Tuesday, February 11, 2003 4:54 AM
Subject: Re: Proposed GFF version 3
> Swap them entirely. i.e. put the attributes in column 9 and call that
> "attributes" and put the new hierarchical group term in column 10 and
> call that "group". Or perhaps it would be better to call it something
> else to minimise confusion, because in gff version 1 column 9 was called
> group. What about calling column 10 "cluster"?
>
> I see you have switched to URL type format for the attributes, away from
> acedb. That's fine - URL format is much more universal. But is ';' a
> standard separator in URLs? I just looked and see that Ensembl uses '&'
> and WormBase uses ';' and I think I have seen '+' somewhere, so maybe
> there is no standard.
>
> Richard
>
> Lincoln Stein wrote:
> > Hi Richard,
> >
> > Do you mean that we should swap columns 9 and 10 entirely, or just swap
> > their names? I think you mean the former, but I want to be sure.
> >
> > Lincoln
> >
> > On Monday 10 February 2003 11:12 am, Richard Durbin wrote:
> >
> >>Hello all,
> >>
> >>This looks very nice to me. Not surprising perhaps because I had an
> >>earlier involvement as Lincoln says.
> >>
> >>I have added gff-list at sanger.ac.uk to the mailing Cc: list because it is
> >>the "official" GFF mailing list, although it is very little used.
> >>
> >>I have one major comment, that columns 9 (group) and 10 (attributes)
> >>should be switched. Although column 9 was called "group" in GFF
> >>version 1, in version 2, which has been current for over two years, it
> >>was renamed "attribute" and contains the attribute information. For
> >>consistency we should keep column 9 for the attributes. Also, in many
> >>cases there will be attributes but no group.
> >>
> >>I like ID and Target. I see the idea with hsp's for gapped alignments,
> >>though perhaps they could be called "match_block". But there is a case
> >>I think to also encode gapped alignments on one line, perhaps using the
> >>CIGAR encoding used by ENSEMBL (and BioPerl?), e.g. as
> >>
> >> Target=M1:1..1000;Align=xxxxxxx
> >>
> >>(sorry, I don't know CIGAR format well enough to write a legal string).
> >>
> >>Richard
> >>
> >>Lincoln Stein wrote:
> >>
> >>>This letter is to discuss a proposed extension to GFF. It arises from
> >>>conversations with Richard Durbin during last fall's Hinxton genome
> >>>informatics meeting.
> >>>
> >>>Although there are many richer ways of representing genomic features
> >>>via XML, the stubborn persistence of a variety of ad-hoc tab-delimited
> >>>flat file formats declares the bioinformatics community's need for a
> >>>simple format that can be modified with a text editor and processed
> >>>with shell tools like grep. The GFF format, although widely used, has
> >>>fragmented into multiple incompatible dialects. When asked why they
> >>>have modified the published Sanger specification, bioinformaticists
> >>>frequently answer that the format was insufficient for their needs,
> >>>and they needed to extend it. The proposed GFF3 format addresses the
> >>>most common extensions to GFF, while preserving backward compatibility
> >>>with previous formats. The new format:
> >>>
> >>> 1) adds a mechanism for representing more than one level
> >>> of hierarchical grouping of features and subfeatures.
> >>> 2) separates the ideas of group membership and feature name/id
> >>> 3) constrains the feature type field to be taken from a controlled
> >>> vocabulary.
> >>> 4) allows a single feature, such as an exon, to belong to more than
> >>> one group at a time.
> >>> 5) one level of relative addressing for subfeatures (e.g. exons
> >>> can be expressed in transcript coordinates)
> >>> 6) an explicit convention for pairwise alignments
> >>> 7) an explicit convention for features that occupy disjunct regions
> >>>
> >>>The format consists of 10 columns, separated by spaces. The following
> >>>unescaped characters are allowed within fields:
> >>>[a-zA-Z0-9.:;=%^*$@!+_?-]. All other characters must be escaped
> >>>using the URL escaping conventions. Unescaped quotation marks,
> >>>backslashes and other ad-hoc escaping conventions that have been added
> >>>to the GFF format are explicitly forbidden. The =, ; and % characters
> >>>have reserved meanings as described below.
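
The escaping rule above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names (`escape_field`, `escape_value`, `unescape`) are hypothetical and not part of the proposal, and the stricter character set used for tags and values is inferred from the statement that =, ; and % are reserved.

```python
import re
from urllib.parse import unquote

# Characters the proposal allows unescaped inside a field.
_UNSAFE = re.compile(r"[^a-zA-Z0-9.:;=%^*$@!+_?-]")
# Assumption: inside a tag or value the reserved characters =, ; and %
# must be escaped as well, so they are dropped from the safe set here.
_UNSAFE_IN_VALUE = re.compile(r"[^a-zA-Z0-9.:^*$@!+_?-]")


def _percent_encode(match):
    # Encode each UTF-8 byte of the offending character as %XX.
    return "".join("%{:02X}".format(b) for b in match.group().encode("utf-8"))


def escape_field(text):
    """Escape a whole field (for example a seqid)."""
    return _UNSAFE.sub(_percent_encode, text)


def escape_value(text):
    """Escape a single attribute tag or value."""
    return _UNSAFE_IN_VALUE.sub(_percent_encode, text)


def unescape(text):
    """Undo URL escaping; '+' is left alone because it is a legal character."""
    return unquote(text)


print(escape_field("my landmark"))   # my%20landmark
print(escape_value("a=b;c"))         # a%3Db%3Bc
```

Note that `escape_field` leaves an existing "%" untouched, so it should only be applied to text that has not already been escaped.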
> >>>
> >>>Undefined fields are replaced with the "." character, as described in
> >>>the original GFF spec.
> >>>
> >>>Column 1: "seqid"
> >>>
> >>>The ID of the landmark used to establish the coordinate system for the
> >>>current feature. IDs must contain alphanumeric characters.
> >>>Whitespace, if present, must be escaped using URL escaping rules
> >>>(e.g. space="%20").
> >>>
> >>>Column 2: "source"
> >>>
> >>>The source of the feature. This is unchanged from the older GFF specs
> >>>and is not part of a controlled vocabulary.
> >>>
> >>>Column 3: "type"
> >>>
> >>>The type of the feature (previously called the "method"). This is
> >>>constrained to be either: (a) a term from SOFA; or (b) a SOFA
> >>>accession number. The latter alternative is distinguished using the
> >>>syntax SOFA:000000.
> >>>
> >>>Columns 4 & 5: "start" and "end"
> >>>
> >>>The start and end of the feature, in 1-based integer coordinates,
> >>>relative to the landmark given in column 1. Start is less than or
> >>>equal to end.
> >>>
> >>>Column 6: "score"
> >>>
> >>>The score of the feature, a floating point number. As in earlier
> >>>versions of the format, the semantics of the score are ill-defined.
> >>>It is strongly recommended that E-values be used for sequence
> >>>similarity features, and that P-values be used for ab initio gene
> >>>prediction features.
> >>>
> >>>Column 7: "strand"
> >>>
> >>>The strand of the feature. + for positive strand (relative to the
> >>>landmark), - for minus strand, and . for features that are not
> >>>stranded. In addition, ? can be used for features whose strandedness
> >>>is relevant, but unknown.
> >>>
> >>>Column 8: "phase"
> >>>
> >>>The phase of the feature, for protein-encoding features (primarily
> >>>CDSs). This is an integer-valued field with the values 0, 1, or 2.
> >>>The integer indicates the offset from the start of the feature to the
> >>>first base of the first codon in the reading frame. "." is used for
> >>>features that do not correspond to a reading frame.
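
As a small illustration of the phase definition, the sketch below locates the first base of the first complete codon of a feature. The function name is hypothetical, and the handling of minus-strand features (counting the offset back from the end coordinate) is an assumption that the text above does not spell out.

```python
def first_codon_start(start, end, strand, phase):
    """Return the coordinate of the first base of the first codon.

    phase is 0, 1 or 2, or None when the column holds "." (no reading frame).
    """
    if phase is None:
        return None
    if strand == "-":
        # Assumption: on the minus strand the feature "starts" at its end
        # coordinate, so the offset is counted backwards from there.
        return end - phase
    return start + phase


# A plus-strand CDS segment at 44030..44234 with phase 1:
print(first_codon_start(44030, 44234, "+", 1))   # 44031
```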
> >>>
> >>>Column 9: "group"
> >>>
> >>>A list of the immediate parents of the current feature. Multiple
> >>>parents are allowed (example: one exon shared by multiple
> >>>transcripts). Multiple parents are separated by a semicolon.
> >>>Parentless features have a dot in this field.
> >>>
> >>>Column 10: "attributes"
> >>>
> >>>A list of feature attributes in the format tag=value. Multiple
> >>>tag=value pairs are separated by semicolons. URL escaping rules are
> >>>used for tags or values containing whitespace, "=" characters and
> >>>semicolons.
> >>>
> >>>Two tags are special:
> >>>
> >>> ID Indicates the name of the feature. IDs must be unique
> >>> within the scope of the GFF file.
> >>>
> >>> Target Indicates the target of a nucleotide to nucleotide or
> >>> nucleotide to protein alignment. The format of the
> >>> value is "target_id:start..end" Start may be greater
> >>> than end to indicate a + strand alignment to the
> >>> reverse complement of a target nucleotide sequence.
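
A minimal sketch of reading the attributes column and the Target tag, assuming the tag=value;tag=value layout described above. The function names are hypothetical; a repeated tag simply overwrites the earlier value, which is a simplification rather than anything the proposal specifies.

```python
from urllib.parse import unquote


def parse_attributes(column):
    """Split 'tag=value;tag=value' into a dict, undoing URL escapes."""
    attrs = {}
    if column in (None, "", "."):
        return attrs
    for pair in column.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[unquote(tag)] = unquote(value)
    return attrs


def parse_target(value):
    """Split 'target_id:start..end' into (target_id, start, end).

    start may be greater than end; the caller decides how to interpret that.
    """
    target_id, _, span = value.rpartition(":")
    start, _, end = span.partition("..")
    return target_id, int(start), int(end)


attrs = parse_attributes("ID=12345.s;Target=cb123:1001..1100")
print(attrs["ID"])                      # 12345.s
print(parse_target(attrs["Target"]))    # ('cb123', 1001, 1100)
```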
> >>>
> >>>In the example GFF3 file given below, the first column contains line
> >>>numbers that I have added for the purposes of the narrative. Here are
> >>>some common scenarios that I have attempted to illustrate:
> >>>
> >>>A) a simple feature, no public ID
> >>>
> >>>Line 2 in the example is a feature of type "repeat". It has a start
> >>>and an end and no ID, but it does have an attribute named "Note."
> >>>
> >>>B) a simple feature with a public ID
> >>>
> >>>Line 3 is a feature of type clone. It has a start and an end. Its
> >>>parent is undefined (empty column 9), but it has an attribute of type
> >>>ID with value "cTel33B."
> >>>
> >>>C) a feature with multiple attributes
> >>>
> >>>Line 5 is a feature of type "gene." It has no parent, and has
> >>>attributes of type ID, Note, and GO_term.
> >>>
> >>>D) a hierarchical grouping of features
> >>>
> >>>Lines 5-13 demonstrate a hierarchical grouping. At the top level is
> >>>line 5, which defines the extent of a "gene" with ID Y74C9A.1. Below
> >>>this are two features of type mRNA (lines 6 and 7). Their group
> >>>fields contain the ID of Y74C9A.1, indicating that this feature is
> >>>their immediate parent. In the 10th column, the mRNA features have
> >>>their own IDs independent of the ID of the parent gene.
> >>>
> >>>This pattern is repeated for the exons listed on lines 8-11. Exons
> >>>e1, e2, and e4 belong to both of the transcripts. Therefore, both
> >>>transcript IDs are listed in the group column, separated by
> >>>semicolons.
> >>>
> >>>Exon e3 belongs only to one of the transcripts, and therefore only
> >>>that transcript's ID is listed in the group column.
> >>>
> >>>Lines 12 and 13 indicate coding_start and coding_end features. These
> >>>subfeatures are hierarchically grouped underneath their corresponding
> >>>exons, but they do not have independent public IDs.
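
The parent/child bookkeeping described in scenario D could look roughly like this in Python. It is a sketch only: the rows are the group and attributes columns copied from example lines 5-8 and 10, the function names are hypothetical, and features without an ID (such as coding_start) are left out because this simple version keys everything on the ID.

```python
from collections import defaultdict

# (group column, attributes column) for example lines 5, 6, 7, 8 and 10.
rows = [
    (".", "ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345"),
    ("Y74C9A.1", "ID=Y74C9A.1a"),
    ("Y74C9A.1", "ID=Y74C9A.1b"),
    ("Y74C9A.1a;Y74C9A.1b", "ID=e1"),
    ("Y74C9A.1b", "ID=e3"),
]


def feature_id(attributes):
    """Return the value of the ID tag, or None if there is none."""
    for pair in attributes.split(";"):
        tag, _, value = pair.partition("=")
        if tag == "ID":
            return value
    return None


def build_children(rows):
    """Map each parent ID to the IDs of its immediate children."""
    children = defaultdict(list)
    for group, attributes in rows:
        if group == ".":          # parentless feature
            continue
        for parent in group.split(";"):
            children[parent].append(feature_id(attributes))
    return dict(children)


def show(children, node, depth=0):
    """Return the subtree under node as indented lines."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(show(children, child, depth + 1))
    return lines


children = build_children(rows)
print("\n".join(show(children, "Y74C9A.1")))
```

Exon e1 appears under both transcripts because its group column lists both transcript IDs.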
> >>>
> >>>E) Disjunct coordinates
> >>>
> >>>Lines 14-16 illustrate a single feature -- the CDS corresponding to
> >>>mRNA Y74C9A.1a -- which occupies multiple disjunct regions. The group
> >>>column indicates that the CDS belongs to mRNA Y74C9A.1a. However, the
> >>>attribute column assigns each of lines 14-16 the same ID. Because the
> >>>ID is the same, this is to be interpreted as a single feature that
> >>>spans multiple locations.
> >>>
> >>>F) Alignments
> >>>
> >>>Lines 17-19 demonstrate a gapped alignment of two sequences using the
> >>>reserved Target attribute. Each non-gapped segment becomes a line in
> >>>the GFF3 file. The segments each share the same ID, thereby
> >>>indicating that the segments are disjunct regions of the same feature.
> >>>The Target attribute indicates the ID of the target sequence (which
> >>>does not have to be represented in the GFF3 file) and the start and
> >>>end coordinates of the aligned target.
> >>>
> >>>Unlike the GFF1 and GFF2 formats, the group field for gapped
> >>>alignments can be empty. However, a valid alternative representation
> >>>is to create a single "match" feature, and a series of "hsp" features
> >>>underneath it via the group field. Lines 20-22 show this alternative
> >>>representation.
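
For the first representation (several lines sharing one ID), a reader has to collect the lines back into a single feature. A minimal sketch, assuming the rows have already been split into start, end and attributes; the function name and the dictionary layout are hypothetical.

```python
# (start, end, attributes column) for example lines 17-19.
rows = [
    (1, 100, "ID=12345.s;Target=cb123:1001..1100"),
    (101, 500, "ID=12345.s;Target=cb123:1101..1500"),
    (501, 1000, "ID=12345.s;Target=cb123:1501..2000"),
]


def merge_by_id(rows):
    """Group lines with the same ID into one feature with many segments."""
    features = {}
    for start, end, attributes in rows:
        attrs = dict(pair.split("=", 1) for pair in attributes.split(";"))
        feature = features.setdefault(
            attrs["ID"], {"segments": [], "targets": []}
        )
        feature["segments"].append((start, end))
        feature["targets"].append(attrs["Target"])
    return features


merged = merge_by_id(rows)
print(merged["12345.s"]["segments"])   # [(1, 100), (101, 500), (501, 1000)]
```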
> >>>
> >>>G) Relative coordinates
> >>>
> >>>Lines 23-26 illustrate using relative coordinate addressing in
> >>>feature/subfeature relationships. Line 23 defines an mRNA that is
> >>>positioned on sequence landmark "I" from positions 5000 to 6000. Its
> >>>ID field indicates that it is M7.3. Lines 24-26 are exon subfeatures
> >>>of M7.3 as indicated by their group field. However, the seqid field
> >>>specifies M7.3 as the parent coordinate system, thereby allowing the
> >>>exons to begin at position 1.
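
Converting such relative coordinates back to the landmark is a simple offset. The sketch below assumes a plus-strand parent and 1-based inclusive coordinates; the function name is hypothetical.

```python
def to_absolute(parent_start, rel_start, rel_end):
    """Map 1-based coordinates relative to a parent onto the parent's landmark.

    Assumption: the parent lies on the plus strand of its landmark.
    """
    offset = parent_start - 1
    return rel_start + offset, rel_end + offset


parent_start = 5000   # M7.3 on landmark I (example line 23)
exons = [("M7.3.1", 1, 300), ("M7.3.2", 301, 400), ("M7.3.3", 401, 1000)]
for name, rel_start, rel_end in exons:
    print(name, "%d..%d" % to_absolute(parent_start, rel_start, rel_end))
```

This prints 5000..5299, 5300..5399 and 5400..5999 for the three exons.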
> >>>
> >>>  0 ##gff-version 3
> >>>  1 ##sequence-region I:1..14972282
> >>>  2 I wormbase repeat 5000 5100 . . . . Note=ALU3
> >>>  3 I wormbase clone 1 2679 . + . . ID=cTel33B
> >>>  4 I wormbase contig 1 14972282 . + . . ID=CHROMOSOME_I
> >>>  5 I wormbase gene 43733 44677 . + . . ID=Y74C9A.1;Note=unc-3;GO_term=GO:12345
> >>>  6 I wormbase mRNA 43733 44677 . + . Y74C9A.1 ID=Y74C9A.1a
> >>>  7 I wormbase mRNA 43733 44677 . + . Y74C9A.1 ID=Y74C9A.1b
> >>>  8 I wormbase exon 43733 43961 . + . Y74C9A.1a;Y74C9A.1b ID=e1
> >>>  9 I wormbase exon 44030 44234 . + . Y74C9A.1a;T:Y74C9A.1b ID=e2
> >>> 10 I wormbase exon 44281 44328 . + . Y74C9A.1b ID=e3
> >>> 11 I wormbase exon 44521 44677 . + . Y74C9A.1a;T:Y74C9A.1b ID=e4
> >>> 12 I wormbase coding_start 43740 43740 . + . e1
> >>> 13 I wormbase coding_end 44677 44677 . + . e4
> >>> 14 I wormbase cds 43740 43961 . + 0 Y74C9A.1a
> >>> 15 I wormbase cds 44030 44234 . + 1 Y74C9A.1a
> >>> 16 I wormbase cds 44521 44677 . + 1 Y74C9A.1a
> >>> 17 I wormbase match 1 100 100 . . . ID=12345.s;Target=cb123:1001..1100
> >>> 18 I wormbase match 101 500 20 . . . ID=12345.s;Target=cb123:1101..1500
> >>> 19 I wormbase match 501 1000 80 . . . ID=12345.s;Target=cb123:1501..2000
> >>> 20 I wormbase match 5001 6000 100 . . . ID=abc;Target=M1:1..1000
> >>> 21 I wormbase hsp 5001 5500 . . . abc Target=M1:1..500
> >>> 22 I wormbase hsp 5501 6000 . . . abc Target=M1:501..100
> >>> 23 I wormbase mRNA 5000 6000 + . . . ID=M7.3
> >>> 24 M7.3 wormbase exon 1 300 + . . M7.3 ID=M7.3.1
> >>> 25 M7.3 wormbase exon 301 400 + . . M7.3 ID=M7.3.2
> >>> 26 M7.3 wormbase exon 401 1000 + . . M7.3 ID=M7.3.3
> >>>
> >>>=================================================================
> >>>
> >>>I have extended the Bio::Tools::GFF module (in an experimental way)
> >>>to accommodate this new format. Here is a test script and its output
> >>>when run on the above file.
> >>>
> >>>   #!/usr/bin/perl -w
> >>>   use strict;
> >>>   use lib '.';
> >>>
> >>>   use Bio::Tools::GFF;
> >>>
> >>>   # Read GFF3 from standard input and print the feature hierarchy.
> >>>   my $gffio = Bio::Tools::GFF->new(-fh=>\*STDIN,-gff_version=>3);
> >>>   my @f = $gffio->features;
> >>>   format_features(\@f);
> >>>
> >>>   # Print one line per feature, indenting subfeatures by one tab per level.
> >>>   sub format_features {
> >>>     my $features = shift;
> >>>     my $tabs = shift || 0;
> >>>     for my $f (@$features) {
> >>>       my $type = $f->primary_tag;
> >>>       my $id = $f->unique_id;
> >>>       $id ||= '(no id)';
> >>>       my ($start,$end) = ($f->start,$f->end);
> >>>       my $alt = ($f->alternative_locations)[0];
> >>>       # Avoid "my ... if", whose behaviour is undefined when the test fails.
> >>>       my ($target,$tstart,$tend) =
> >>>         $alt ? ($alt->seq_id,$alt->start,$alt->end) : ();
> >>>
> >>>       # The eval yields an empty list when there is no alternative location.
> >>>       print "\t"x$tabs,
> >>>             join("\t",$id,$type,$f->location->to_FTstring,
> >>>                  eval{$alt->location->seq_id,$alt->location->to_FTstring}),
> >>>             "\n";
> >>>       format_features([$f->sub_SeqFeature],$tabs+1);
> >>>     }
> >>>   }
> >>>
> >>>   1;
> >>>
> >>>OUTPUT:
> >>>
> >>>cTel33B clone 1..2679
> >>>CHROMOSOME_I contig 1..14972282
> >>>12345.s match join(101..500,1..100,501..1000)
> >>>M7.3 mRNA 5000..6000
> >>>    M7.3.1 exon 5000..5299
> >>>    M7.3.2 exon 5300..5399
> >>>    M7.3.3 exon 5400..5999
> >>>abc match 5001..6000
> >>>    (no id) hsp 5001..5500
> >>>    (no id) hsp 5501..6000
> >>>(no id) repeat 5000..5100
> >>>Y74C9A.1 gene 43733..44677
> >>>    Y74C9A.1a mRNA 43733..44677
> >>>        e1 exon 43733..43961
> >>>            (no id) coding_start 43740
> >>>        e2 exon 44030..44234
> >>>        e4 exon 44521..44677
> >>>            (no id) coding_end 44677
> >>>        (no id) cds 43740..43961
> >>>        (no id) cds 44030..44234
> >>>        (no id) cds 44521..44677
> >>>    Y74C9A.1b mRNA 43733..44677
> >>>        e1 exon 43733..43961
> >>>            (no id) coding_start 43740
> >>>        e3 exon 44281..44328
> >>
> >
>
>