[emboss-dev] Mapping feature types to Sequence Ontology (SO)

Peter Cock p.j.a.cock at googlemail.com
Tue Aug 16 15:36:24 UTC 2011


On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> Yes, We needed an internal identifier for feature types, and picked SO
> for nucleotides - and then were able to add the protein terms when they
> became available.
> ...

Thanks!

>
> Let me know if you spot anything in need of updating.
>

I have found three protein features which have been
renamed, and one which appears to be wrong... see below.

I recently noticed that UniProt provides GFF3 files,
e.g. http://www.uniprot.org/uniprot/P99999.gff

========================================
##gff-version 3
##sequence-region P99999 1 105
P99999	UniProtKB	Initiator methionine	1	1	.	.	.	Note=Removed	
P99999	UniProtKB	Chain	2	105	.	.	.	ID=PRO_0000108218;Note=Cytochrome c	
P99999	UniProtKB	Metal binding	19	19	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Metal binding	81	81	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Binding site	15	15	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Binding site	18	18	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Modified residue	2	2	.	.	.	Note=N-acetylglycine	
P99999	UniProtKB	Modified residue	49	49	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Modified residue	98	98	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Natural variant	42	42	.	.	.	ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function.
P99999	UniProtKB	Natural variant	56	56	.	.	.	ID=VAR_048850	
P99999	UniProtKB	Natural variant	66	66	.	.	.	ID=VAR_002204;Note=In 10%25 of the molecules.
P99999	UniProtKB	Sequence conflict	18	18	.	.	.	.	
P99999	UniProtKB	Sequence conflict	41	41	.	.	.	.	
P99999	UniProtKB	Helix	4	14	.	.	.	.	
P99999	UniProtKB	Turn	16	18	.	.	.	.	
P99999	UniProtKB	Beta strand	23	25	.	.	.	.	
P99999	UniProtKB	Beta strand	28	30	.	.	.	.	
P99999	UniProtKB	Turn	36	38	.	.	.	.	
P99999	UniProtKB	Helix	51	56	.	.	.	.	
P99999	UniProtKB	Helix	62	70	.	.	.	.	
P99999	UniProtKB	Helix	72	75	.	.	.	.	
P99999	UniProtKB	Helix	89	102	.	.	.	.	
========================================

However, they are not using Sequence Ontology terms
in column three and so fail the online GFF3 validator
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
listed in http://www.sequenceontology.org/gff3.shtml
(GFF3 specification currently at v1.20). Additionally,
the UniProt GFF3 uses a tag starting with an upper case
letter, "Status" rather than perhaps "status", in the
modified residue features. The specification reserves
tags starting with an upper case letter for its own use.
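
The tag check can be sketched in a few lines of Python. This is only
a rough illustration, not what the validator does: the function name
is made up, and the set of defined tags is my reading of the GFF3
specification, so please check it against the current version.

```python
# Tags defined by the GFF3 specification. ASSUMPTION: this set is
# listed from my reading of the spec; verify it before relying on it.
DEFINED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def undefined_reserved_tags(gff3_line):
    """Return upper case initial tags in column 9 that the spec does not define."""
    if gff3_line.startswith("#"):
        return []  # directive or comment line
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) < 9 or columns[8] in ("", "."):
        return []  # not a feature line, or no attributes
    found = []
    for pair in columns[8].split(";"):
        tag = pair.split("=", 1)[0].strip()
        if tag and tag[0].isupper() and tag not in DEFINED_TAGS:
            found.append(tag)
    return found


example = ("P99999\tUniProtKB\tModified residue\t49\t49\t.\t.\t.\t"
           "Note=Phosphotyrosine;Status=By similarity")
print(undefined_reserved_tags(example))  # ['Status']
```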

I will report this to UniProt later. However, first I thought
I would try converting one of the other files provided into
GFF3 using EMBOSS seqret for an alternative, e.g. the
plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt

I can convert this using seqret as follows:

========================================
$ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto
##gff-version 3
##sequence-region CYC_HUMAN 1 105
#!Date 2011-08-16
#!Type Protein
#!Source-version EMBOSS 6.4.0.0
CYC_HUMAN	SWISSPROT	cleaved_initiator_methionine	1	1	.	+	.	ID=CYC_HUMAN.1;note=Removed
CYC_HUMAN	SWISSPROT	mature_protein_region	2	105	.	+	.	ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218
CYC_HUMAN	SWISSPROT	metal_binding	19	19	.	+	.	ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	metal_binding	81	81	.	+	.	ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	binding_site	15	15	.	+	.	ID=CYC_HUMAN.5;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	binding_site	18	18	.	+	.	ID=CYC_HUMAN.6;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	2	2	.	+	.	ID=CYC_HUMAN.7;note=N-acetylglycine
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	49	49	.	+	.	ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	98	98	.	+	.	ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	natural_variant	42	42	.	+	.	ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450
CYC_HUMAN	SWISSPROT	natural_variant	56	56	.	+	.	ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850
CYC_HUMAN	SWISSPROT	natural_variant	66	66	.	+	.	ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204
CYC_HUMAN	SWISSPROT	sequence_conflict	18	18	.	+	.	ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130
CYC_HUMAN	SWISSPROT	sequence_conflict	41	41	.	+	.	ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464
CYC_HUMAN	SWISSPROT	alpha_helix	4	14	.	+	.	ID=CYC_HUMAN.15
CYC_HUMAN	SWISSPROT	turn	16	18	.	+	.	ID=CYC_HUMAN.16
CYC_HUMAN	SWISSPROT	beta_strand	23	25	.	+	.	ID=CYC_HUMAN.17
CYC_HUMAN	SWISSPROT	beta_strand	28	30	.	+	.	ID=CYC_HUMAN.18
CYC_HUMAN	SWISSPROT	turn	36	38	.	+	.	ID=CYC_HUMAN.19
CYC_HUMAN	SWISSPROT	alpha_helix	51	56	.	+	.	ID=CYC_HUMAN.20
CYC_HUMAN	SWISSPROT	alpha_helix	62	70	.	+	.	ID=CYC_HUMAN.21
CYC_HUMAN	SWISSPROT	alpha_helix	72	75	.	+	.	ID=CYC_HUMAN.22
CYC_HUMAN	SWISSPROT	alpha_helix	89	102	.	+	.	ID=CYC_HUMAN.23
##FASTA
>CYC_HUMAN P99999 Cytochrome c
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
========================================

Interestingly EMBOSS includes the sequence at the bottom
(using the FASTA directive) and has generated unique ID tags
for each feature. It has also added more note tags.

Unfortunately this also failed the GFF3 validation, although
the EMBOSS output does a lot better (e.g. "cleaved_initiator_methionine"
is valid while "Initiator methionine" in the UniProt file was not).

However, some of the terms in column 3 are apparently out of
date - but http://www.sequenceontology.org does list them as
synonyms:

* metal_binding -> polypeptide_metal_contact
* natural_variant -> natural_variant_site
* turn -> polypeptide_turn_motif

It looks like the EMBOSS sequence ontology table may need
updating for at least these three cases.
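
As a stop-gap, the three renames listed above could be applied to
existing seqret output with something like the following. This is a
minimal sketch: the function and dictionary names are invented for
illustration, and it has nothing to do with how EMBOSS stores its
own mapping table.

```python
# Old column 3 term -> current Sequence Ontology name, taken from the
# three cases listed in this message only.
RENAMED_TERMS = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_feature_type(gff3_line):
    """Return the line with column 3 replaced if it uses a renamed term."""
    if gff3_line.startswith("#"):
        return gff3_line  # directives and comments pass through
    columns = gff3_line.split("\t")
    if len(columns) < 9:
        return gff3_line  # e.g. FASTA lines after the ##FASTA directive
    columns[2] = RENAMED_TERMS.get(columns[2], columns[2])
    return "\t".join(columns)


old = "CYC_HUMAN\tSWISSPROT\tturn\t16\t18\t.\t+\t.\tID=CYC_HUMAN.16"
print(update_feature_type(old))
```

Lines whose type is not in the table (alpha_helix, beta_strand and
so on) are returned unchanged.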

Finally protein_modification_categorized_by_chemical_process
does not seem to be valid (I failed to find it in the ontology).

Additionally the validator complained about part of the note
text in line 15 of the EMBOSS output, probably due to the %3B
escaped semi-colon, but that may be a bug in the validator.
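
For what it is worth, the %3B escape decodes cleanly if column 9 is
split on the literal semi-colons first and only then percent-decoded.
A minimal sketch, assuming that reading of the specification (the
function name is made up, and multi-valued comma-separated
attributes are not handled):

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attributes column into a dict of decoded tag/value pairs."""
    attributes = {}
    if column9 in ("", "."):
        return attributes
    # Split on the unescaped separators before decoding, so an escaped
    # semi-colon (%3B) stays inside its value.
    for pair in column9.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[unquote(tag)] = unquote(value)
    return attributes


parsed = parse_attributes("ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130")
print(parsed["comment"])  # in Ref. 8; AAH15130
```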

Peter C.


