[DAS] Our identifier doc and proposal
Lincoln Stein
lstein@cshl.org
Tue, 27 Nov 2001 15:49:13 -0500
Hi Brian,
I'm pleased to see that the I3C identifier is nearly identical to my
(biological class,namespace,id) triple suggestion. The difference is
the version number, which I agree with completely. So I accept it
wholeheartedly.
The part that I don't feel entirely comfortable with is that the
namespace seems to be completely under the control of the authority:
urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
I think the top-level namespace terms, e.g. "plate", should be hard-and-fast
data types. Is this envisioned by the I3C?
Lincoln
Brian Gilman writes:
> Lincoln,
>
> Please find attached an updated identifier proposal that we have been
> working on to identify objects in the web services architecture. I like it
> over the feature_class mechanism because we can uniquely identify an object
> in the "cloud".
>
> Best,
>
> -Brian
>
> -----------------------
> Brian Gilman <gilmanb@genome.wi.mit.edu>
> Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
> One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
> phone +1 617 252 1069 / fax +1 617 252 1902
>
>
> <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> <meta name="Author" content="Ted liefeld">
> <meta name="GENERATOR" content="Mozilla/4.73 [en]C-CCK-MCD BA45DSL (WinNT; U) [Netscape]">
> <title>identifiers</title>
> </head>
> <body>
>
> <h2>
> I3C Identifier Specification</h2>
>
> <h3>
> <a NAME="Abstract"></a>Abstract:</h3>
> This document describes the motivation for and specification of string
> identifiers to be used to identify objects within the life sciences domain
> by the I3C architecture. A string format for the identifiers is defined
> as <tt> urn:lsid:<authority>:<namespace>:<value>:<version>.</tt>
> <br>
> <h2>
> Index:</h2>
> <a href="#Abstract">Abstract</a>
> <br><a href="#Introduction:">Introduction</a>
> <br><a href="#Background: Existing">Background: Existing Identifiers</a>
> <blockquote><a href="#MPI">MPI Id</a>
> <br><a href="#AGAVE">AGAVE db_id</a></blockquote>
> <a href="#I3C String">I3C String Identifiers</a>
> <blockquote><a href="#Requirements">Requirements for the I3C String Identifier</a>
> <blockquote><a href="#Syntactic">Syntactic Requirements</a>
> <br><a href="#Semantic">Semantic Requirements</a></blockquote>
> </blockquote>
> <a href="#Specification">Specification of the I3C String Identifier</a>
> <blockquote><a href="#Web Centric Id: URI,">Web Centric Id: URI,
> URN</a>
> <br><a href="#I3C String Identifier">I3C String Identifier Definition</a>
> <br><a href="#Examples">Examples</a></blockquote>
> <a href="#Appendix A, URN">Appendix A, URN Reference</a>
> <br><a href="#Appendix B, Some example">Appendix B, Some example identifiers</a>
> <br><a href="#Appendix C, Additional">Appendix C, Additional Work</a>
> <h2>
> <a NAME="Introduction:"></a>Introduction:</h2>
> One of the goals of the I3C is the definition of a common architecture
> and standards to simplify interoperability between applications from different
> companies. For interoperability to occur, we need a common
> format for unique identifiers for any objects we reference that would function
> in the context of I3C services. The remainder of this document defines
> a web-centric ID definition that will allow us to create federated systems
> utilizing many databases and services in a common way.
> <p>The purpose of this identifier definition is to uniquely identify biologically
> significant objects, e.g. a sequence, a clone, a gene, a contig, etc.
> It is not meant to identify artifacts of implementation, e.g. a database
> or a server. Identifying objects such as these should be handled via
> other mechanisms such as JDBC URLs and WSDL.
> <p>In addition, using HTTP URLs as identifiers (e.g. http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+7clen1HOYrs+[taxonomy-ID:10090]+-e)
> is not adequate for organizations that need to limit or control access
> to external databases for intellectual property or security reasons.
> The identifiers defined here are deliberately location independent and
> are intended to uniquely identify a biological artifact, but not the location
> of that artifact.
> <br>
> <h3>
> <a NAME="Background: Existing"></a><b>Background: Existing Identifiers</b></h3>
> Many forms of identifiers for biological artifacts are currently
> in use within the life sciences community. These include proprietary formats
> as well as public domain formats. Some of these are discussed below.
> <br>
> <h4>
> <a NAME="MPI"></a>MPI Id</h4>
> Within Millennium Pharmaceuticals Inc., there is a suite of CORBA
> services that have been running in production for over two years.
> One of the first tasks they addressed was the identity management of objects
> that appear in more than one database. To deal with this need, a
> CORBA IDL structure called MPI Id was declared. It has since
> been reused in many subsequent CORBA services.
> <p>The MPI Id CORBA type is defined as a triple:
> <br><tt> struct Id {</tt>
> <br><tt> string value;</tt>
> <br><tt> string domain;</tt>
> <br><tt> string type;</tt>
> <br><tt> };</tt>
> <p>For example, MP PL 001 represents identifier domain 'MP', type 'Plate',
> identifier value '001'. The value uniquely identifies an object of
> a given type within the domain. Note that an object may have more than
> one ID, so that the plate known as MP PL 001 may also be known as SE PL
> 435a in the SE domain.
> <p>Some of the services that use these identifiers found this too limiting.
> For example, when retrieving clones from GenBank, you may want to use the
> accession number or the GI number. However, either of these would
> have been encoded as
> <br> GB CL gid
> <br> GB CL accession
> <br>This raised the problem of determining whether an identifier is
> a GI number or an accession number. If the namespaces of accession
> numbers and GI numbers overlap, then there is no way for a server or client
> to identify which form was intended.
> <p>Another weakness is that object type can be overloaded in the same manner;
> for example, a sequence and a contig are both sequences. Similarly,
> inclusion of a version number for an identifier would require overloading
> the value field of the MPI Id.
> <p>Therefore, limiting the unique identifier to three
> elements was found to be too restrictive. There must be provision for extension.
> <br>
> <h4>
> <a NAME="AGAVE"></a>AGAVE db_id</h4>
> DoubleTwist Inc. has found some of the same issues in their AGAVE product.
> AGAVE defines an identifier called db_id in the AGAVE DTD file, an XML
> format.
> <p>The AGAVE db_id is defined as follows:
> <blockquote><tt><!-- --></tt>
> <br><tt><!-- db_id is an identifier for an object in its source database. --></tt>
> <br><tt><!-- --></tt>
> <br><tt><!-- Attributes: --></tt>
> <br><tt><!-- --></tt>
> <br><tt><!-- id: a data identifier such as GenBank accession or PID. --></tt>
> <br><tt><!-- db_code: a code for the data source, e.g. GenBank is "gb". --></tt>
> <br><tt><!-- version: version of the associated data. --></tt>
> <br><tt><!-- --></tt>
> <br><tt><!ELEMENT db_id EMPTY></tt>
> <br><tt><!ATTLIST db_id id CDATA #REQUIRED</tt>
> <br><tt> version CDATA #IMPLIED</tt>
> <br><tt> db_code CDATA #REQUIRED ></tt></blockquote>
> In this format, the version is explicitly specified, but there is still
> insufficient scope to specify object types or variations
> in the type of ID being specified (accession vs. GI).
> <br>
> <h2>
> <a NAME="I3C String"></a>I3C String Identifiers</h2>
> For use within the I3C architecture, the existing identifier definitions
> were found to be inadequate to handle the breadth and scope of the possible
> identifiers that would be required. The following sections detail
> the requirements and specification of a new identifier format for use
> within the I3C architecture.
> <h3>
> <a NAME="Requirements"></a>Requirements for the I3C String Identifier</h3>
> The I3C architecture has the following syntactic and semantic requirements
> for its identifiers:
> <h4>
> <a NAME="Syntactic"></a>Syntactic Requirements</h4>
>
> <ol>
> <li>
> The identifier must be encodable in a string format</li>
>
> <li>
> The identifier must be extensible</li>
>
> <li>
> The identifier must uniquely identify one object</li>
>
> <li>
> The identifier must not require additional contextual information for evaluation</li>
> </ol>
> These requirements result from the need to transmit the identifier in an
> XML format to and from web-services. By requiring that it can be
> encoded as a string, it becomes possible to transmit identifiers via other
> mechanisms as well. Also, as noted in the examples given above, the
> identifier must be extensible to allow use with biological objects that
> have not yet been defined.
> <h4>
> <a NAME="Semantic"></a>Semantic Requirements</h4>
> For an Id to uniquely specify a biological object in a system, it needs
> to include the following pieces of information:
> <br>
> <ol>
> <li>
> Authority : The name of the organization that has defined an
> entity.</li>
>
> <li>
> Id Value : an alpha-numeric sequence that uniquely identifies an
> object to its authority</li>
>
> <li>
> Namespace : one or more statements constraining the scope in which
> an Id is evaluated</li>
>
> <li>
> Version : (optional) version number for an Id</li>
> </ol>
> As an example, the following uniquely identifies a sequence in GenBank:
> <p> GenBank, Sequence, Accession J01636, version 1
> <p>With all these pieces of information we can uniquely identify a sequence.
> Leaving off the version number we can get pretty close. Leaving out
> any of the other bits of information makes it impossible to find the object
> without a priori knowledge of the context.
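> <p>As a rough illustration of the four pieces of information listed above,
> the following Python sketch holds them in a small record and reports which
> required parts are missing. The names <tt>IdentifierParts</tt> and
> <tt>missing_parts</tt>, and the choice to fold "Sequence, Accession" into a
> single namespace string, are assumptions made for this example only; they
> are not part of the I3C proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IdentifierParts:
    """Hypothetical record of the semantic parts of an identifier."""
    authority: str                 # organization that defined the entity
    namespace: str                 # scope in which the value is evaluated
    value: str                     # identifies the object to its authority
    version: Optional[str] = None  # optional version of the identified object

    def missing_parts(self) -> List[str]:
        """Return the names of required parts that are empty."""
        required = {
            "authority": self.authority,
            "namespace": self.namespace,
            "value": self.value,
        }
        return [name for name, part in required.items() if not part]


# The GenBank example from the text, with the version included.
genbank_sequence = IdentifierParts("GenBank", "Sequence/Accession", "J01636", "1")

# The same value without a namespace: the context is no longer known.
ambiguous = IdentifierParts("GenBank", "", "J01636")
```

> <p>Here <tt>genbank_sequence.missing_parts()</tt> is empty, while
> <tt>ambiguous.missing_parts()</tt> names the namespace, which mirrors the
> point that only the version may be left out.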
> <br>
> <h2>
> <a NAME="Specification"></a>Specification of the I3C String Identifier</h2>
> To take advantage of existing work on unique identifiers, the I3C
> Technical Architecture working group has selected the Uniform Resource Name
> (URN), as described in the IETF and World Wide Web Consortium (W3C)
> documents referenced in Appendix A, as the basis for the
> I3C String Identifier. For additional background on URNs, please
> see Appendix A, "URN Reference", for the definition of a URN and reference
> links.
> <h4>
> <a NAME="Web Centric Id: URI,"></a><b>Web Centric Id: URI, URN</b></h4>
> To summarize the IETF and W3C documents, a URI can be written as
> having the following parts:
> <p> scheme:namespace identifier://authority/path/.../pathN/value?queryterm#fragment
> <p> where
> <br>
> scheme and namespace identifier define the semantics of everything that
> follows
> <br>
> authority defines the organization responsible for defining and managing
> the namespace
> <br>
> path/.../pathN/ defines a subset of an authority's namespace
> <br>
> value is the last element in the path
> <br>
> queryterm indicates a post-processing directive
> <br>
> fragment defines a preprocessing directive or fragment within the scope
> of the Id
> <p>The adoption of the URN format should simplify integration with other
> existing standards, such as MAGE-ML, which permit the use of URN identifiers.
> <br>
> <h3>
> <a NAME="I3C String Identifier"></a>I3C String Identifier Definition</h3>
> Given the definition of a URN above, we have defined the following syntax
> for an I3C String Identifier:
> <p><tt> urn:lsid:<authority>:<namespace>:<value>:<version></tt>
> <p>The different parts of the identifier are delimited by colons ":".
> <p>The elements of the identifier are as follows;
> <ul>
> <li>
> scheme = urn</li>
>
> <ul>
> <li>
> This specifies that the identifier is in URN format</li>
> </ul>
>
> <li>
> namespace identifier = lsid</li>
>
> <ul>
> <li>
> The I3C string identifier namespace identifier is defined as "Life Science
> Identifier", or "lsid".</li>
> </ul>
>
> <li>
> authority = <authority></li>
>
> <ul>
> <li>
> This portion uniquely identifies the organization and optionally the organizational
> unit that has defined the namespace for the remaining portions of the identifier</li>
> </ul>
>
> <li>
> namespace = <namespace></li>
>
> <ul>
> <li>
> A hierarchical namespace to scope the identifier value. The form and
> content of this section are defined and managed by the authority</li>
> </ul>
>
> <li>
> value = <value></li>
>
> <ul>
> <li>
> the unique identifier for an object within the namespace defined by an
> authority</li>
> </ul>
>
> <li>
> version = <version></li>
>
> <ul>
> <li>
> optional version information associated with the identifier value</li>
> </ul>
> </ul>
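> <p>The element list above can be sketched as a small parser. This is a
> minimal illustration, not a reference implementation: the names
> <tt>Lsid</tt> and <tt>parse_lsid</tt> are hypothetical, the
> case-insensitive match on "urn" and "lsid" follows RFC 2141, and
> accepting an empty namespace is an assumption made so that the taxonomy
> example in Appendix B can be read.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Hypothetical container for the parts of an I3C string identifier."""
    authority: str
    namespace: str
    value: str
    version: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:value[:version] on colons."""
    fields = text.split(":")
    # Five fields without a version, six fields with one.
    if len(fields) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-delimited fields: %r" % text)
    scheme, nid, authority, namespace, value = fields[:5]
    # RFC 2141 treats the "urn" prefix and the namespace identifier
    # as case-insensitive.
    if scheme.lower() != "urn" or nid.lower() != "lsid":
        raise ValueError("not a urn:lsid identifier: %r" % text)
    if not authority or not value:
        raise ValueError("authority and value must not be empty: %r" % text)
    version = fields[5] if len(fields) == 6 else None
    return Lsid(authority, namespace, value, version)


plate = parse_lsid("urn:lsid:informatics.mpi.com:plate:12345")
```

> <p>A real library would also need rules for which characters may appear
> in each part, which this proposal does not yet specify.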
>
> <h4>
> <a NAME="Examples"></a>Examples</h4>
> For example, the plate identified by Millennium as ID 12345, with
> MPI Id "MP PL 12345", would be written as
> <p> urn:lsid:informatics.mpi.com:plate:12345
> <p>Since the authority is free to define any path that it wishes (provided
> of course that it manages them), we may want to define the path section
> for plates more fully, for example:
> <p> urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
> <p>We can now use expanded path information to deal with cases that required
> type overloading in the MPI Id. For example,
> <br> (Accession) GB CL j01636 version 1
> <br> (GI) GB CL 146575
> <br>refer to the same object. These can now be encoded as
> <p> urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/accession:J01636:1
> <br> urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/gi:146575
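> <p>Going the other way, the examples above can be assembled from their
> parts. The sketch below is an assumption-laden illustration: the function
> name <tt>format_lsid</tt> is hypothetical, and rejecting colons inside a
> part is a simplification chosen here because the proposal does not define
> an escaping rule.

```python
from typing import Optional, Sequence


def format_lsid(authority: str,
                namespace_path: Sequence[str],
                value: str,
                version: Optional[str] = None) -> str:
    """Build urn:lsid:authority:namespace:value[:version] from its parts."""
    parts = [authority, value, *namespace_path]
    if version is not None:
        parts.append(version)
    # Colons delimit the identifier, so they cannot appear inside a part
    # (assumption: no escaping mechanism is defined).
    for part in parts:
        if ":" in part:
            raise ValueError("part must not contain ':': %r" % part)
    # Namespace path terms are joined with "/" as in the examples.
    fields = ["urn", "lsid", authority, "/".join(namespace_path), value]
    if version is not None:
        fields.append(version)
    return ":".join(fields)


by_accession = format_lsid("genbank.ncbi.nlm.nih.gov",
                           ["sequence", "accession"], "J01636", "1")
by_gi = format_lsid("genbank.ncbi.nlm.nih.gov", ["sequence", "gi"], "146575")
```

> <p>The two results are different strings even though they refer to the
> same object, so any mapping between them has to come from the authority.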
> <br>
> <h3>
> <a NAME="Appendix A, URN"></a>Appendix A, URN Reference</h3>
>
> <p><br>Ref: http://www.w3.org/Addressing/, http://www.ietf.org/rfc/rfc2141.txt,
> http://www.ietf.org/rfc/rfc2396.txt
> <p>In the context of the web, there is already a definition for global
> identifiers, the Uniform Resource Name. From
> <br>http://www.ietf.org/rfc/rfc2141.txt
> <blockquote>Uniform Resource Names (URNs) are intended to serve as persistent,
> <br>location-independent, resource identifiers and are designed to make
> <br>it easy to map other namespaces (which share the properties of URNs)
> <br>into URN-space. Therefore, the URN syntax provides a means to encode
> <br>character data in a form that can be sent in existing protocols,
> <br>transcribed on most keyboards, etc.</blockquote>
> URIs are the superset of URNs and URLs. URLs are familiar due to
> their use on the web. They differ from URNs in that they are scoped to
> a particular protocol (e.g. http:*, ftp:*, etc.). URNs are scoped
> simply as identifiers, urn:*.
> <p>URIs are divided into two parts,
> <br> <scheme> : <scheme specific part >
> <br>e.g. in http://www.mpi.com/index.html, <b>http</b> is the scheme,
> and <b>//www.mpi.com/index.html </b>is the scheme specific part that is interpreted
> in the context of that scheme.
> <br>
> <blockquote>The URI syntax does not require that the scheme-specific-part
> have any general structure or set of semantics which is common among
> all URI. However, a subset of URI do share a common syntax for
> representing hierarchical relationships within the namespace. This
> "generic URI" syntax consists of a sequence of four main components:
> <p> <scheme>://<authority><path>?<query>
> <p>each of which, except <scheme>, may be absent from a particular URI.
> For example, some URI schemes do not allow an <authority> component,
> and others do not use a <query> component.
> <p> absoluteURI = scheme ":" ( hier_part | opaque_part )
> <p> URI that are hierarchical in nature use the slash "/" character
> for separating hierarchical components. For some file systems,
> a "/" character (used to denote the hierarchical structure of a URI)
> is the delimiter used to construct a file name hierarchy, and thus
> the URI path will look similar to a file pathname. This does
> NOT imply that the resource is a file or that the URI maps to an actual
> filesystem pathname.
> <p>[snip]
> <p>The path component contains data, specific to the authority (or the
> scheme if there is no authority component), identifying the resource within
> the scope of that scheme and authority.
> <p>[snip]
> <p>When a URI reference is used to perform a retrieval action on the identified
> resource, the optional fragment identifier, separated from the URI by a
> crosshatch ("#") character, consists of additional reference information
> to be interpreted by the user agent after the retrieval action has been
> successfully completed. As such, it is not part of a URI, but
> is often used in conjunction with a URI.
> <p>(http://www.ietf.org/rfc/rfc2396.txt)</blockquote>
> To summarize the IETF documents, a URI can be written as having all of
> the following parts:
> <p> scheme://authority/path/path2?queryterm=something#fragment
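> <p>For readers who want to see this decomposition in practice, the
> following sketch uses <tt>urlsplit</tt> from Python's standard
> <tt>urllib.parse</tt> module. The helper name <tt>describe</tt> is
> hypothetical, and the first input is simply the template string above
> taken literally.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> dict:
    """Return the generic URI components as a plain dictionary."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urllib calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# A hierarchical URI: every component is filled in.
hierarchical = describe("scheme://authority/path/path2?queryterm=something#fragment")

# A URN has no "//authority" section, so everything after "urn:" is
# returned as an opaque path that the lsid rules must interpret.
opaque = describe("urn:lsid:informatics.mpi.com:plate:12345")
```

> <p>The second call shows why the colon-delimited layout of the I3C
> identifier needs its own parsing rules: a generic URI parser sees only
> the scheme and one undivided string.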
> <br>
> <br>
> <h3>
> <a NAME="Appendix B, Some example"></a>Appendix B, Some example identifiers</h3>
> Here are some examples of identifiers written in this format;
> <p>GenBank: the sequence of J01636 could be identified as follows:
> <p> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636
> <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636:1
> <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:K01483
> <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/gi:146575
> <p>The associated protein could be referred to as follows;
> <p> urn:lsid:genbank.ncbi.nlm.nih.gov:protein/locus:AAA24054
> <br> urn:lsid:genpept.ncbi.nlm.nih.gov:protein/accession:AAA24054.1
> <br> urn:lsid:genpept.ncbi.nlm.nih.gov:protein/pid:g146578
> <br>
> <p>Another example is the following nucleotide from EMBL
> <p> urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092
> <br> urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092:1
> <p>This includes a reference to a taxonomy term
> <p> urn:lsid:taxonomy.ebi.ac.uk::10090
> <br>
> <br>
> <h3>
> <a NAME="Appendix C, Additional"></a>Appendix C, Additional Work</h3>
> 1. More clearly define what is authority and what is path. E.g. should
> GenBank be part of the authority string, or is it a part of a path beneath
> ncbi.nlm.nih.gov?
> <p>2. Since path terms are owned by the authority, get common definitions
> for authorities/databases such as GenBank, EMBL etc. This could be
> defined by us and presented to the organization in question for ratification.
> Entities that do not make IDs publicly available are responsible for themselves
> and their customers only but would benefit from a set of guidelines and
> examples.
> <p>3. Examine use cases in proteomics and other branches of informatics.
> <p>4. Create libraries (Java, Perl) for manipulating IDs in this form.
> <br>
> <br>
> <br>
> <br>
> </body>
> </html>
--
========================================================================
Lincoln D. Stein Cold Spring Harbor Laboratory
lstein@cshl.org Cold Spring Harbor, NY
NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
PLEASE WRITE FOR DETAILS.
========================================================================