[Bioperl-l] validate_gff_via_ontology.pl
Chris Mungall
cjm at fruitfly.org
Fri Feb 21 01:39:48 EST 2003
I have just committed this to branch-1-2-collection
It is an ontology validator for GFF3 - it checks both feature types and
validates the feature containment (subfeatures hierarchy) as this must
correspond to the SOFA standard (in progress). The code could be easily
modified for any format that produces seqfeature containment hierarchies
(eg chado)
Currently it requires the go-dev perl modules to work - I am working on
moving the required functionality into the bioperl ontology classes
I have also made SeqFeature::Generic ontology-aware
type() is a getter/setter for a Bio::Ontology::TermI
primary_tag() has exactly the same behaviour but the underlying
implementation delegates to a Bio::Ontology::TermI object
(using the type() attribute)
type_string() is a synonym for primary_tag()
See also
geneontology.sf.net (for the go-dev perl api)
song.sf.net (for the SOFA ontology)
someone remind me to move this to the main branch when the time is right!
future plans: the next step is a transmogrifier for mapping legacy formats
into a standard ontology
Here are the pod docs:
=head1 NAME
validate_gff_via_ontology.pl
=head1 SYNOPSIS
validate_gff_via_ontology.pl -trace -m cds=coding_sequence
myfeatures.gff sofa.ontology
=head1 DESCRIPTION
This script will validate a GFF (version3 is best) file against an
ontology of sequence features. Currently the ontology must be in
GO/GOBO format, support for other formats (eg DAML+OIL, OWL) are
forthcoming.
The validation is in two parts:
first of all the types in the GFF file are checked vs the terms in the
ontology. An identifier (which will be a SO accession if the SO/SOFA
ontology is used) is assigned if it exists (case insensitive matching
is used). If the type used in the GFF file does not exist in the
ontology file the script will notify you with a message like:
BAD: locus transcrpt geme
The second part of the validation checks to see if the 'subfeature'
relationships specified in the GFF (using Parent and ID attributes in
GFF3) are valid. For instance, making 'transcript' a subfeature of
'repeat' is obviously silly, and it is banned because there is no
'part-of' restriction between transcript and repeat in the canonical
SOFA ontology. The rules are a bit more subtle than this, as we also
have to traverse the subsumption hierarchy.
If a feature/subfeature containment is not allowed, you will get a
message like:
BAD PARTOF: [coding_start PARTOF exon]
To get a full explanation of why the partof link is bad, run with the
trace switch [-trace]
=head2 REQUIREMENTS
(1) An ontology file
You can obtain the canonical SOFA ontology here:
http://sofa.sf.net/
(2) The go-dev perl objects
http://geneontology.sf.net/
Plus some files in GFF format!
=head2 RULES
a subfeature can only be part of a superfeature
if there is a direct part-of link for the relevant feature types
OR such a link can be obtained by traversing either of the
two graphs upwards.
the part-of links must be direct; we do not use the closure of the
part-of relationship. This means that you can *not* make an exon a
subfeature of gene, you need the intermediate object (of some kind of
transcripty type, eg mRNA).
=head2 EXAMPLES
(these depend on an imaginary version of a SO-style ontology):
=head3 Example 1
can I make a 'mRNA' feature a subfeature of noncoding gene ('nc_gene')?
YES! 'mRNA' is a subclass of 'processed_transcript' is a subclass of
'transcript'
'nc_gene' is a subclass of 'gene'
'mRNA' is a part of 'gene' ** RESOLVED **
=head3 Example 2
can I make 'exon' a subfeature of 'mRNA'?
NO! 'exon' is a subclass of only the root term
'mRNA' is a subclass of 'processed_transcript' is a subclass of
'transcript'
'exon' is ONLY a part of 'primary_transcript',
which is NOT in the 'mRNA' subsumption hierarchy
therefore this is invalid
=head2 SPECIFICATION
notes: R* is the reflexive transitive closure of R
notes: R+ is the transitive closure of R
(reflexive includes the relationship x R x)
WHERE ChildFeat is an object of type SeqFeature
ParentFeat is an object of type SeqFeature
IF
ChildFeat part-of ParentFeat
THEN
THIS MUST BE SATISFIED FOR SOME POSSIBLE BINDING OF THE VARIABLES BELOW
(if it starts with a capital it is an unbound variable):
ChildFeat has-feature-type ChildType
ParentFeat has-feature-type ParentType
ChildType is-a* ChildTypeAllPossible
ParentType is-a* ParentTypeAllPossible
# don't allow nonreflexive transitive closure on part-of
# we *could* allow ChildTypeAllPossible part-of+ ParentTypeAllPossible
# which would mean we could attach exons directly to genes
ChildTypeAllPossible part-of ParentTypeAllPossible
=head2 FORMAL SPECIFICATION
These rules can be implemented easily in prolog - I have resisted the
temptation to do so. The rules below are implemented imperatively in
the perl script. They are included below as a formal specification of
the rules.
;;;; RULES IN PROLOG:
;;;;
;;;; we assume an ontology (eg SOFA) that produces 'stmt' predicates of
;;;; of arity-3 [CHILD REL-TYPE PARENT]
;;;;
;;;; the bioperl feature containment hierarchy can be verified
;;;; by the predicate [SUBFEATURE 'part-of' SUPERFEATURE]
;;;; the type field of the feature is assumed to be
;;;; stmt predicates [FEATURE 'instance-of' FEATURE-TYPE]
;; ========================================
;; general rules: closure / graph traversal
;; ========================================
;; closure of R is R+
stmt(X, 'is-a+', Y):-
stmt(X, 'is-a', Y).
stmt(X, 'is-a+', Y):-
stmt(X, 'is-a', Z),
stmt(Z, 'is-a+', Y).
;; reflexive closure of R is R*
stmt(X, 'is-a*', X).
stmt(X, 'is-a*', Y):-
stmt(X, 'is-a+', Y).
;; ========================================
;; core rule - checks if a feat/subfeat
;; link is alllowed
;; ========================================
;; a subfeature can only be part of a superfeature
;; if there is a direct part-of link for the relevant feature types
;; OR such a link can be obtained by traversing either of the
;; two graphs upwards
stmt(ChildFeat, 'part-of', ParentFeat):-
stmt(ChildFeat, 'instance-of', ChildType),
stmt(ParentFeat, 'instance-of', ParentType),
stmt(ChildType, 'is-a*', ChildTypeAllPossible),
stmt(ParentType, 'is-a*', ParentTypeAllPossible),
stmt(ChildTypeAllPossible, 'is-a*', ParentTypeAllPossible).
=head1 AUTHOR - Chris Mungall
Email cjm at fruitfly.org
=cut
More information about the Bioperl-l
mailing list