[Biocorba-l] Identifiers

Thu, 15 Feb 2001 13:19:14 +0000 (GMT Standard Time)

Matthew Pocock wrote:

> Anyway, step 1 - do we all think that a formal ID containing enough info 
> to re-fetch the resource is a *good thing*, or is it potentially a cause 
> of great hassle and lots of work?

IMHO, when you're living in a 'distributed' world, it is a good thing,
but it also necessarily means *some* extra work.

In a quiet voice, I'll point out that since Identifier is actually just a
string, people could choose to ignore the conventions and return a 'naked'
accession number. But they wouldn't be playing nice and their stuff may
not work properly outside of their own environment.

> > For example, a sequence with accession X12345 (sequence version 2) in EMBL
> > release 37, would have the identifier:
> > 
> > 	Identifier = "EMBL.37/X12345.2";
>
> Looks like an improvement to me - esp if we tacked some extra text on it 
> like:
> 
> urn://seqdb/EMBL.37/X12345.2

The BSA spec covers exactly this type of issue: you may have more than
two 'components' to an Identifier. In fact you can have as many as it
requires to identify an entity uniquely. From the BSA spec:

  "If two components are not enough to uniquely identify an entity, an 
  Identifier can contain more than two components, but no more than
  necessary to make the identification unique. That is, an Identifier may
  not be used to freely attach textual information."

So your example would become: 

  EMBL.37/X12345.2/seqdb   [It could even include a seqdb version!!]

The first component is always the data source, and the second is the name
of an entity (preferably unique to the database).
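
To make the component ordering concrete, here is a minimal sketch (in
Python, purely for illustration) of how such an Identifier string could be
assembled. The function name make_identifier and the (name, version) pair
representation are my own assumptions, not anything taken from the BSA
spec; the "." and "/" separators follow the examples above.

```python
from typing import Sequence, Tuple


def make_identifier(components: Sequence[Tuple[str, str]]) -> str:
    """Join (name, version) pairs into an Identifier string.

    Hypothetical helper: the first pair is taken to be the data source,
    the second the entity, and any further pairs are extra components.
    An empty version string means "no version given".
    """
    if len(components) < 2:
        raise ValueError("need at least a data source and an entity name")
    parts = []
    for name, version in components:
        # "/" separates components, so it cannot appear inside one.
        if not name or "/" in name or "/" in version:
            raise ValueError(f"bad component: {(name, version)!r}")
        parts.append(f"{name}.{version}" if version else name)
    return "/".join(parts)


print(make_identifier([("EMBL", "37"), ("X12345", "2")]))
print(make_identifier([("EMBL", "37"), ("X12345", "2"), ("seqdb", "")]))
```

The two calls reproduce the two example Identifiers discussed above.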

However, I'm not sure if your example of a 'seqdb' is an appropriate one:

> Then, I think it becomes relatively painless to write resolvers for 
> these things - perhaps as a part of BioEnv? You can palm each layer in 
> the urn off to a different resolver - seqdb is resolved by the master 
> registry of resolvers, EMBL by the seqdb resolver and X12345 by the EMBL 
> resolver. Also, a sequence doesn't have to carry a reference to its DB 
> around.

This sounds like you're heading straight into CORBA NamingService
territory. You want to find out about available servers of biocorba seqdb
and seq objects, specifically one with a SeqDB called EMBL.

However, an Identifier resolver would be a good thing to have as part of
any biocorba implementation, and it would reduce the burden on client
programmers:

  interface XXX // This could be in either BioEnv, or Interface, or its
                // own interface. (I'd prefer BioEnv, I think).
  {

    struct IdentifierComponents
    {
      // The data source name.
      string source_name;

      // The data source version - string or long???. If string, may be
      // 'null' or '' for latest version???
      string source_version;

      // The entity name, e.g. accession number.
      string entity_name;

      // The version of the entity - string or long??? If string, may be
      // 'null' or '' for latest version???
      string entity_version;

      // Any extra components - Or all components???
      CosNaming::Name components;
    };

    // Return a struct containing the components of the Identifier.
    IdentifierComponents resolve(in Identifier id);
  }; 

// The bits of the CosNaming module that are used are trivial - A struct
// and a sequence:

module CosNaming 
{
  typedef string Istring;

  struct NameComponent 
  {
    // The 'id' maps to our biocorba name.
    Istring id;

    // The 'kind' maps to our biocorba version.
    Istring kind;
  };

  // Define a list of NameComponents.
  typedef sequence <NameComponent> Name;

  // ... other stuff we don't need ...
};
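
For what it's worth, here is a rough sketch of what resolve() might do on
the implementation side, written in Python only to keep it short. It
assumes that the last "." in each component separates the name from the
version, that a missing version is returned as '', and that 'components'
holds only the extra components beyond the first two. All three are open
questions in the IDL above, so treat them as assumptions rather than
decisions. The Python class simply mirrors the proposed struct.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IdentifierComponents:
    """Python stand-in for the proposed IDL struct (illustrative only)."""
    source_name: str
    source_version: str
    entity_name: str
    entity_version: str
    # Extra components as (id, kind) pairs, mirroring CosNaming::Name.
    components: List[Tuple[str, str]] = field(default_factory=list)


def _name_and_version(part: str) -> Tuple[str, str]:
    # Assumption: the last "." separates name from version; if there is
    # no ".", the component has no version and '' is returned for it.
    name, dot, version = part.rpartition(".")
    return (name, version) if dot else (part, "")


def resolve(identifier: str) -> IdentifierComponents:
    """Split an Identifier such as "EMBL.37/X12345.2" into its parts."""
    parts = identifier.split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"not a well-formed Identifier: {identifier!r}")
    pairs = [_name_and_version(p) for p in parts]
    (source_name, source_version), (entity_name, entity_version) = pairs[:2]
    return IdentifierComponents(
        source_name, source_version, entity_name, entity_version, pairs[2:]
    )


print(resolve("EMBL.37/X12345.2/seqdb"))
```

A 'naked' accession number such as "X12345" is rejected here with a
ValueError, which is one way of enforcing the convention. A real
implementation might prefer a CORBA user exception instead.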