[emboss-dev] Mapping feature types to Sequence Ontology (SO)

Peter Rice pmr at ebi.ac.uk
Tue Aug 16 15:26:51 UTC 2011


On 08/16/2011 04:03 PM, Peter Cock wrote:
> Dear Peter R. (et al.),
> 
> I recall from one of our chats in person that EMBOSS has some
> mapping tables to convert the various different data file formats'
> feature names into a common standard (the Sequence Ontology?),
> for the purpose of inter-converting files, e.g. converting a UniProt/
> SwissProt plain text protein file into a GenPept protein file or GFF3.
> 
> Is that a fair summary?

Yes. We needed an internal identifier for feature types and picked SO
for nucleotides. We were then able to add the protein terms when they
became available.

There are a few made-up internal names, with _text after the SO term,
that were needed in the early days of the BioSapiens Ontology and some
dodgy mapping between SO and EMBL/GenBank for immunoglobulin gene
regions, but I believe they are no longer used.

The first term in each file is defined as the default if nothing is
recognized (region or misc_feature).
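
To illustrate the "first term is the default" rule, here is a minimal
Python sketch. It is not EMBOSS code and does not parse the Efeatures
files: the table is a hypothetical in-memory list, and the function names
(build_lookup, to_internal) and the three example rows are assumptions
chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical table of (format feature name, internal SO identifier).
# These rows are illustrative only; they are NOT copied from
# Efeatures.embl or Efeatures.swiss. The first row acts as the default.
EXAMPLE_TABLE: List[Tuple[str, str]] = [
    ("misc_feature", "SO:0000001"),  # region, used as the fallback
    ("CDS", "SO:0000316"),
    ("gene", "SO:0000704"),
]


def build_lookup(table: List[Tuple[str, str]]) -> Tuple[str, Dict[str, str]]:
    """Return (default_term, mapping); the first row supplies the default."""
    if not table:
        raise ValueError("mapping table must contain at least one row")
    default_term = table[0][1]
    mapping = {name: term for name, term in table}
    return default_term, mapping


def to_internal(name: str, default_term: str, mapping: Dict[str, str]) -> str:
    """Map a feature name to its internal term, falling back to the default."""
    return mapping.get(name, default_term)


DEFAULT_TERM, MAPPING = build_lookup(EXAMPLE_TABLE)
```

With this table, to_internal("CDS", DEFAULT_TERM, MAPPING) gives the CDS
term, while an unrecognized name such as "no_such_feature" falls back to
the first row's term.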

> Can you point me at these mapping tables in the EMBOSS
> source code please?

emboss/data/Efeatures.embl
emboss/data/Efeatures.swiss

> I'm particularly interested in the SwissProt to SO mapping
> right now.

That was originally done by the BioSapiens "Network of Excellence" for
annotating ENCODE data. They developed the protein features, which were
then added to the Sequence Ontology.

You can look at SO terms in EMBOSS with:

ontoget so:0001094

or

ontoget -filter -oformat excel so:0001094

(Hmmm, should do something better for a missing namespace - it was
defined as a format for EDAM)


Let me know if you spot anything in need of updating.

We also have (especially for EMBL) equivalent Etags files listing the
available feature qualifiers.

regards,

Peter Rice
EMBOSS Team
