[emboss-dev] Mapping feature types to Sequence Ontology (SO)
Peter Cock
p.j.a.cock at googlemail.com
Tue Aug 16 15:36:24 UTC 2011
On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> Yes, We needed an internal identifier for feature types, and picked SO
> for nucleotides - and then were able to add the protein terms when they
> became available.
> ...
Thanks!
>
> Let me know if you spot anything in need of updating.
>
I have found three protein features which have been
renamed, and one which appears to be wrong... see below.
I recently noticed that UniProt provides GFF3 files,
e.g. http://www.uniprot.org/uniprot/P99999.gff
========================================
##gff-version 3
##sequence-region P99999 1 105
P99999 UniProtKB Initiator methionine 1 1 . . . Note=Removed
P99999 UniProtKB Chain 2 105 . . . ID=PRO_0000108218;Note=Cytochrome c
P99999 UniProtKB Metal binding 19 19 . . . Note=Iron (heme axial ligand)
P99999 UniProtKB Metal binding 81 81 . . . Note=Iron (heme axial ligand)
P99999 UniProtKB Binding site 15 15 . . . Note=Heme (covalent)
P99999 UniProtKB Binding site 18 18 . . . Note=Heme (covalent)
P99999 UniProtKB Modified residue 2 2 . . . Note=N-acetylglycine
P99999 UniProtKB Modified residue 49 49 . . . Note=Phosphotyrosine;Status=By similarity
P99999 UniProtKB Modified residue 98 98 . . . Note=Phosphotyrosine;Status=By similarity
P99999 UniProtKB Natural variant 42 42 . . . ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function.
P99999 UniProtKB Natural variant 56 56 . . . ID=VAR_048850
P99999 UniProtKB Natural variant 66 66 . . . ID=VAR_002204;Note=In 10%25 of the molecules.
P99999 UniProtKB Sequence conflict 18 18 . . . .
P99999 UniProtKB Sequence conflict 41 41 . . . .
P99999 UniProtKB Helix 4 14 . . . .
P99999 UniProtKB Turn 16 18 . . . .
P99999 UniProtKB Beta strand 23 25 . . . .
P99999 UniProtKB Beta strand 28 30 . . . .
P99999 UniProtKB Turn 36 38 . . . .
P99999 UniProtKB Helix 51 56 . . . .
P99999 UniProtKB Helix 62 70 . . . .
P99999 UniProtKB Helix 72 75 . . . .
P99999 UniProtKB Helix 89 102 . . . .
========================================
However, they are not using Sequence Ontology terms
in column three and so fail the online GFF3 validator
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
listed in http://www.sequenceontology.org/gff3.shtml
(GFF3 specification currently at v1.20). Additionally,
the UniProt GFF3 uses the tag "Status" in the modified
residue features. Tags starting with an upper case letter
are reserved in GFF3, so perhaps this should be "status".
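The reserved tag problem can be checked mechanically. The following is a minimal Python sketch, not part of EMBOSS or the validator: the function name is hypothetical, and the set of predefined tags is taken from the GFF3 specification as I understand it.

```python
# Tags predefined by the GFF3 specification; any other tag starting with
# an upper-case letter is reserved for future use.
PREDEFINED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def reserved_tag_misuse(attributes):
    """Return upper-case-initial tags in column 9 that GFF3 does not define."""
    bad = []
    for field in attributes.split(";"):
        tag, sep, _value = field.partition("=")
        # Fields without "=" (such as the "." placeholder) are skipped.
        if sep and tag[:1].isupper() and tag not in PREDEFINED_TAGS:
            bad.append(tag)
    return bad


# Column 9 of a UniProt "Modified residue" line shown above.
print(reserved_tag_misuse("Note=Phosphotyrosine;Status=By similarity"))
```

For the UniProt attributes this reports "Status", while lower-case tags such as "note" and "comment" are not flagged.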
I will report this to UniProt later. However, first I thought
I would try converting one of the other files provided into
GFF3 using EMBOSS seqret for an alternative, e.g. the
plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt
I can convert this using seqret as follows:
========================================
$ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto
##gff-version 3
##sequence-region CYC_HUMAN 1 105
#!Date 2011-08-16
#!Type Protein
#!Source-version EMBOSS 6.4.0.0
CYC_HUMAN SWISSPROT cleaved_initiator_methionine 1 1 . + . ID=CYC_HUMAN.1;note=Removed
CYC_HUMAN SWISSPROT mature_protein_region 2 105 . + . ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218
CYC_HUMAN SWISSPROT metal_binding 19 19 . + . ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand
CYC_HUMAN SWISSPROT metal_binding 81 81 . + . ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand
CYC_HUMAN SWISSPROT binding_site 15 15 . + . ID=CYC_HUMAN.5;note=Heme;comment=covalent
CYC_HUMAN SWISSPROT binding_site 18 18 . + . ID=CYC_HUMAN.6;note=Heme;comment=covalent
CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 2 2 . + . ID=CYC_HUMAN.7;note=N-acetylglycine
CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 49 49 . + . ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 98 98 . + . ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN SWISSPROT natural_variant 42 42 . + . ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450
CYC_HUMAN SWISSPROT natural_variant 56 56 . + . ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850
CYC_HUMAN SWISSPROT natural_variant 66 66 . + . ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204
CYC_HUMAN SWISSPROT sequence_conflict 18 18 . + . ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130
CYC_HUMAN SWISSPROT sequence_conflict 41 41 . + . ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464
CYC_HUMAN SWISSPROT alpha_helix 4 14 . + . ID=CYC_HUMAN.15
CYC_HUMAN SWISSPROT turn 16 18 . + . ID=CYC_HUMAN.16
CYC_HUMAN SWISSPROT beta_strand 23 25 . + . ID=CYC_HUMAN.17
CYC_HUMAN SWISSPROT beta_strand 28 30 . + . ID=CYC_HUMAN.18
CYC_HUMAN SWISSPROT turn 36 38 . + . ID=CYC_HUMAN.19
CYC_HUMAN SWISSPROT alpha_helix 51 56 . + . ID=CYC_HUMAN.20
CYC_HUMAN SWISSPROT alpha_helix 62 70 . + . ID=CYC_HUMAN.21
CYC_HUMAN SWISSPROT alpha_helix 72 75 . + . ID=CYC_HUMAN.22
CYC_HUMAN SWISSPROT alpha_helix 89 102 . + . ID=CYC_HUMAN.23
##FASTA
>CYC_HUMAN P99999 Cytochrome c
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
========================================
Interestingly, EMBOSS includes the sequence at the bottom
(using the FASTA directive) and has generated unique ID tags
for each feature. It has also added more note tags.
Unfortunately this also failed the GFF3 validation. The EMBOSS
output does a lot better (e.g. "cleaved_initiator_methionine" is
valid while "Initiator methionine" in the UniProt file was not).
However, some of the terms in column 3 are apparently out of
date, although http://www.sequenceontology.org does list them
as synonyms:
* metal_binding -> polypeptide_metal_contact
* natural_variant -> natural_variant_site
* turn -> polypeptide_turn_motif
It looks like the EMBOSS sequence ontology table may need
updating for at least these three cases.
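Until the table is updated, the three renames listed above could be applied as a post-processing step. This is only a sketch of a stopgap, not how the EMBOSS table itself works: the function and dictionary names are hypothetical, and it assumes real tab-separated GFF3 (the tabs appear as spaces in this email).

```python
# Renamed terms reported in this message (old name -> current SO name).
RENAMED_SO_TERMS = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_feature_type(gff_line):
    """Rewrite column 3 of one tab-separated GFF3 feature line."""
    if gff_line.startswith("#") or gff_line.startswith(">"):
        return gff_line  # directive, comment, or FASTA header
    columns = gff_line.split("\t")
    if len(columns) != 9:
        return gff_line  # not a feature line (e.g. sequence data)
    # Terms not in the table are left unchanged.
    columns[2] = RENAMED_SO_TERMS.get(columns[2], columns[2])
    return "\t".join(columns)


line = "CYC_HUMAN\tSWISSPROT\tturn\t16\t18\t.\t+\t.\tID=CYC_HUMAN.16"
print(update_feature_type(line))
```

Lines that are not nine-column feature lines, including the directives and the FASTA section, pass through untouched.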
Finally, protein_modification_categorized_by_chemical_process
does not seem to be valid (I failed to find it in the ontology).
Additionally, the validator complained about part of the note
on line 15, probably due to the %3B escaped semi-colon,
but that may be a bug in the validator.
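As far as I read the GFF3 specification, a literal semi-colon inside a column 9 value has to be written as %3B, so the escaping itself looks legitimate. A minimal Python sketch of how a parser would be expected to handle it (the function name is hypothetical; urllib.parse.unquote is the standard library percent-decoder): split on ";" first, then decode each value.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split GFF3 column 9 into a dict, decoding percent-escapes in values."""
    attributes = {}
    for field in column9.split(";"):
        if not field:
            continue  # tolerate a trailing semi-colon
        tag, _, value = field.partition("=")
        # Decoding happens after the split, so %3B never acts as a separator.
        attributes[tag] = unquote(value)
    return attributes


attrs = parse_attributes("ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130")
print(attrs["comment"])
```

With this order of operations the comment value decodes to "in Ref. 8; AAH15130" and the feature still has exactly three tags.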
Peter C.