[Biocorba-l] Identifiers was Re: SeqFeature -> get_Primary_Seq (fwd)

Wed, 14 Feb 2001 22:37:12 +0000 (GMT Standard Time)

Hello 

On Thu, 8 Feb 2001, Matthew Pocock wrote:

> Just to give my ID crusade - they are of no meaning without knowing 
> which database to look them up in - the tuple {ID, SeqDB} is what is 
> really required if you use ID at all. Also, there is the issue of whether 
> resolving the ID gives you back a CORBA object that is equivalent or 
> identical.

Matthew, I'm really glad you brought up the subject of identifiers! 

This situation was faced in the OMG 'Biomolecular Sequence Analysis' spec.

Their solution is to define some simple rules (which I'll quote below) for
what a stringified identifier should look like for a sequence. 

For example, a sequence with accession X12345 (sequence version 2) in EMBL
release 37, would have the identifier:

	Identifier = "EMBL.37/X12345.2";

Without versioning information on DB or sequence (which is assumed to
imply latest versions):

	Identifier = "EMBL/X12345";

You may also imply a local or default database for the sequence:

	Identifier = "./X12345";

You may also specify just the database (with or without version):

	Identifier = "EMBL.37";
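
As a rough illustration (not taken from the BSA spec), the four forms above
could be pulled apart along these lines in Python. The names Component and
parse_identifier are made up for this sketch, and it assumes that no
component contains an escaped '.' or '/':

```python
from typing import List, NamedTuple, Optional


class Component(NamedTuple):
    """One name.version pair; hypothetical helper type for this sketch."""
    name: str               # database name or accession ("" = local/default)
    version: Optional[str]  # None means "no version given", i.e. latest


def parse_identifier(text: str) -> List[Component]:
    """Split 'db.ver/seq.ver' into (name, version) components.

    Assumption: '.' and '/' never appear escaped inside a component.
    """
    components = []
    for part in text.split("/"):
        # The first '.' separates the name from the version.
        name, _, version = part.partition(".")
        components.append(Component(name, version or None))
    return components


if __name__ == "__main__":
    for example in ("EMBL.37/X12345.2", "EMBL/X12345", "./X12345", "EMBL.37"):
        print(example, "->", parse_identifier(example))
```

A single component means "just the database"; an empty first name means the
local or default database.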

In terms of the IDL, the following simple 'typedef' would be added:

  // Outline of rules for naming IDs uniquely.
  typedef string Identifier;

Then wherever a unique identifier is required for a sequence, 'Identifier'
would be specified in the IDL:

  interface SeqFeature : Annotation
  {
    string primary_seq_id();
    // ...
  };

would become:

  interface SeqFeature : Annotation
  {
    Identifier primary_seq_id();
    // ...
  };

I'm aware that this will have knock-on effects for the BioEnv,
PrimarySeqDB and SeqDB interfaces. The debate over whether version should
be a 'string' or a 'long' may even go away!! :)

How does this answer your needs?

--
============================================================
Alan J. Robinson, D.Phil.             Tel:+44-(0)1223 494444
European Bioinformatics Institute     Fax:+44-(0)1223 494468
EMBL Outstation - Hinxton             Email:  alan@ebi.ac.uk
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, UK                http://industry.ebi.ac.uk/~alan/
============================================================

You may get the full BSA spec from
http://www.omg.org/cgi-bin/doc?dtc/00-11-01

For Identifiers, it speaks about CosNaming::Name and NameComponents - but
these are pretty simple and you do NOT need CosNaming to implement the
Identifiers (only the general rules have been borrowed, not the IDL).

BSA assumes that an Identifier is composed of at least one component (a
component has an id and a kind, written id.kind). We map this to
name.version for both database and sequence. Components are concatenated
using '/' as a delimiter:

  db_name.db_version/seq_name.seq_version
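
Going the other way, building the string from (name, version) pairs might
look roughly like this. format_identifier is a hypothetical name, and the
sketch again assumes that nothing needs escaping:

```python
from typing import Optional, Sequence, Tuple


def format_identifier(components: Sequence[Tuple[str, Optional[str]]]) -> str:
    """Join (name, version) pairs as name.version, separated by '/'.

    Assumption: names and versions contain no '.' or '/' that would
    need escaping.
    """
    parts = []
    for name, version in components:
        if version:
            parts.append(f"{name}.{version}")
        else:
            # An empty name with no version is written as "." (local/default).
            parts.append(name if name else ".")
    return "/".join(parts)


if __name__ == "__main__":
    print(format_identifier([("EMBL", "37"), ("X12345", "2")]))
    print(format_identifier([("", None), ("X12345", None)]))
```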

From pp. 2-19

2.1.8.1 Identifier Description

[...]

The rules are as follows:

 - Names can refer to collections of entities (such as databases), or to
entities within such collections. Names referring to collections consist
of exactly one component; names referring to entities within collections
consist of at least two components.

 - The first component represents the data source. Data sources can be
anything: transient collections, local databases, public repositories. It
is up to the implementation to document the accepted names for the data
source.

 - The empty name (".") is valid for the first component, and represents
the 'local' or 'default' collection. It is up to the implementation to
document what the semantics of 'local' or 'default' is.

 - Names that refer to entities within collections consist of two or more
components. The second component of such names represents an identifier
that is unique in the context of the data source. No empty id-fields are
allowed in this or any further components.

 - If two components are not enough to uniquely identify an entity, an
Identifier can contain more than two components, but no more than
necessary to make the identification unique. That is, an Identifier may
not be used to freely attach textual information.

 - The only characters valid in a component are "a" through "z", "0"
through "9", and "-" (hyphen), "_" (underscore), "$" and "." (period). Use
of the latter is discouraged since it has a special meaning in the
stringifying convention, and has therefore to be escaped.

 - To comply with existing practice in the field of public data
repositories, it is strongly advised that implementations do string
comparisons in a case-insensitive manner. The Naming Service standard
fails to mention whether type-case is, for identification purposes,
significant or not. Implementations that use a third-party implementation
of the Naming Service may therefore wish to restrict Identifiers to only
use one type-case. It is up to an implementation to state whether mixed
type-case is allowed, and whether type-case is significant in comparisons.
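
(Not part of the spec text.) A minimal sketch of how an implementation
might check these rules and compare Identifiers case-insensitively. It
assumes identifiers are folded to lower case before checking, that the
first '.' in a component separates id from kind, and that escaped
characters are not supported. The function names are illustrative only,
and the "no more components than necessary" rule cannot be checked
mechanically, so it is left out:

```python
import re

# Characters allowed in an id or kind field after lower-casing.
_FIELD = re.compile(r"[a-z0-9_$-]*")


def normalise(identifier: str) -> str:
    """Fold case so that comparisons are case-insensitive (assumption)."""
    return identifier.lower()


def is_valid(identifier: str) -> bool:
    """Check the syntactic rules quoted above, without escape handling."""
    for index, part in enumerate(normalise(identifier).split("/")):
        if part == "":
            return False  # the empty name must be written as "."
        ident, _, kind = part.partition(".")
        if not _FIELD.fullmatch(ident) or not _FIELD.fullmatch(kind):
            return False
        if ident == "" and index > 0:
            return False  # only the first component may have an empty id
    return True


def same_identifier(a: str, b: str) -> bool:
    """Case-insensitive string comparison of two Identifiers."""
    return normalise(a) == normalise(b)


if __name__ == "__main__":
    print(is_valid("EMBL.37/X12345.2"))
    print(same_identifier("EMBL/X12345", "embl/x12345"))
```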

The id and kind parts of the string components of Identifier are used as
follows:

 - The id-field of a component contains the principal value that makes it
unique in the scope provided by the preceding component. It may only be
empty in the case of the first component of an Identifier.

 - The kind-field of a component is used to represent information
indicating the release (for a data source) or version (for an entry) of an
entity, and can be empty. If kind is empty and entities with non-empty
kind-fields exist, an empty kind field becomes synonymous with the latest
release or version. It is up to the implementation to document the syntax
and semantics of the version information.
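
(Not part of the spec text.) The "empty kind means latest" rule could be
sketched as below. The in-memory ENTRIES table and the resolve function are
invented for illustration, and integer version numbers are an assumption,
since the spec leaves version syntax to the implementation:

```python
from typing import Dict

# Hypothetical store: lower-cased accession -> {version number: record}.
ENTRIES: Dict[str, Dict[int, str]] = {
    "x12345": {1: "sequence, version 1", 2: "sequence, version 2"},
}


def resolve(name: str, kind: str = "") -> str:
    """Return the record for name.kind; an empty kind selects the latest.

    Assumption: versions are integers, so "latest" is the largest number.
    """
    versions = ENTRIES[name.lower()]
    if kind == "":
        return versions[max(versions)]
    return versions[int(kind)]


if __name__ == "__main__":
    print(resolve("X12345"))       # no version given: latest
    print(resolve("X12345", "1"))  # explicit version
```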

The adoption of this convention has the following advantages:
 - it is simple and lightweight,
 - it has a well-defined and re-used syntax,
 - it is compatible with existing practice,
 - it is sufficiently flexible to allow for sub-ids if necessary.