[DAS] Re: Our identifier doc and proposal
Brian Gilman
gilmanb@genome.wi.mit.edu
Tue, 27 Nov 2001 16:20:19 -0500 (EST)
Yes,
Absolutely. The question is: do we build the ontology and hope
that it suits 80% of people's needs, or do we adopt another group's? I
don't think anyone has formed a genomics ontology group, so I'd be up for
building our own with the help of Thomas/Mathew and Ewan. I think we can
learn from bioperl, biojava, and Ensembl in the way that they build their
feature hierarchies.
-B
-----------------------
Brian Gilman <gilmanb@genome.wi.mit.edu>
Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
phone +1 617 252 1069 / fax +1 617 252 1902
On Tue, 27 Nov 2001, Lincoln Stein wrote:
> Hi Brian,
>
> I'm quite sure we'll need an ontology for feature types (at least the
> top few tiers, which people can add to), so we'll be doing some
> ontology building one way or another. Would you agree?
>
> Lincoln
>
> Brian Gilman writes:
> > I think so, and I have also asked about this in the group. It becomes very
> > hard to "control" the namespace without an ontology. This is why we allow
> > the individuals to control the top level.
> >
> > -B
> >
> >
> > On Tue, 27 Nov 2001, Lincoln Stein wrote:
> >
> > > Hi Brian,
> > >
> > > I'm pleased to see that the I3C identifier is nearly identical to my
> > > (biological class,namespace,id) triple suggestion. The difference is
> > > the version number, which I agree with completely. So I accept it
> > > wholeheartedly.
> > >
> > > The part that I don't feel entirely comfortable with is that the
> > > namespace seems to be completely under the control of the authority:
> > >
> > > urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
> > >
> > > I think the top level namespace, e.g. "plate" should be hard-and-fast
> > > data types. Is this envisioned by the I3C?
> > >
> > > Lincoln
> > >
> > >
> > > Brian Gilman writes:
> > > > Lincoln,
> > > >
> > > > Please find attached an updated identifier proposal that we have
> > > > been working on to identify objects in the web services architecture.
> > > > I like it over the feature_class mechanism because we can uniquely
> > > > identify an object in the "cloud".
> > > >
> > > > Best,
> > > >
> > > > -Brian
> > > >
> > > >
> > > > <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
> > > > <html>
> > > > <head>
> > > > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> > > > <meta name="Author" content="Ted liefeld">
> > > > <meta name="GENERATOR" content="Mozilla/4.73 [en]C-CCK-MCD BA45DSL (WinNT; U) [Netscape]">
> > > > <title>identifiers</title>
> > > > </head>
> > > > <body>
> > > >
> > > > <h2>
> > > > I3C Identifier Specification</h2>
> > > >
> > > > <h3>
> > > > <a NAME="Abstract"></a>Abstract:</h3>
> > > > This document describes the motivation for and specification of string
> > > > identifiers to be used to identify objects within the life sciences domain
> > > > by the I3C architecture. A string format for the identifiers is defined
> > > > as <tt> urn:lsid:<authority>:<namespace>:<value>:<version>.</tt>
> > > > <br>
> > > > <h2>
> > > > Index:</h2>
> > > > <a href="#Abstract">Abstract</a>
> > > > <br><a href="#Introduction:">Introduction</a>
> > > > <br><a href="#Background: Existing">Background: Existing Identifiers</a>
> > > > <blockquote><a href="#MPI">MPI Id</a>
> > > > <br><a href="#AGAVE">AGAVE db_id</a></blockquote>
> > > > <a href="#I3C String">I3C String Identifiers</a>
> > > > <blockquote><a href="#Requirements">Requirements for the I3C String Identifier</a>
> > > > <blockquote><a href="#Syntactic">Syntactic Requirements</a>
> > > > <br><a href="#Semantic">Semantic Requirements</a></blockquote>
> > > > </blockquote>
> > > > <a href="#Specification">Specification of the I3C String Identifier</a>
> > > > <blockquote><a href="#Web Centric Id: URI,">Web Centric Id: URI,
> > > > URN</a>
> > > > <br><a href="#I3C String Identifier">I3C String Identifier Definition</a>
> > > > <br><a href="#Examples">Examples</a></blockquote>
> > > > <a href="#Appendix A, URN">Appendix A, URN Reference</a>
> > > > <br><a href="#Appendix B, Some example">Appendix B, Some example identifiers</a>
> > > > <br><a href="#Appendix C, Additional">Appendix C, Additional Work</a>
> > > > <h2>
> > > > <a NAME="Introduction:"></a>Introduction:</h2>
> > > > One of the goals of the I3C is the definition of a common architecture
> > > > and standards to simplify interoperability between applications from different
> > > > companies. For interoperability to occur, we need a common
> > > > format for unique identifiers for any objects we reference that would function
> > > > in the context of I3C services. The remainder of this document defines
> > > > a web-centric ID definition that will allow us to create federated systems
> > > > utilizing many databases and services in a common way.
> > > > <p>The purpose of this identifier definition is to uniquely identify biologically
> > > > significant objects, e.g. a sequence, a clone, a gene, a contig etc.
> > > > It is not meant to identify artifacts of implementation, e.g. a database
> > > > or a server. Objects such as these should be identified via
> > > > other mechanisms such as JDBC URLs and WSDL.
> > > > <p>In addition, using http URLs as identifiers (e.g. http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+7clen1HOYrs+[taxonomy-ID:10090]+-e)
> > > > is not adequate for organizations who need to limit or control access
> > > > to external databases for intellectual property or security reasons.
> > > > The identifiers defined here are deliberately location independent and
> > > > are intended to uniquely identify a biological artifact, but not the location
> > > > of that artifact.
> > > > <br>
> > > > <h3>
> > > > <a NAME="Background: Existing"></a><b>Background: Existing Identifiers</b></h3>
> > > > There are currently many existing forms of identifiers for biological artifacts
> > > > in use within the life sciences community. These include proprietary formats
> > > > as well as public domain formats. Some of these are discussed below.
> > > > <br>
> > > > <h4>
> > > > <a NAME="MPI"></a>MPI Id</h4>
> > > > Within Millennium Pharmaceuticals Inc., there is a suite of CORBA
> > > > services that have been running in production for over two years.
> > > > One of the first tasks they addressed was the identity management of objects
> > > > that appear in more than one database. To deal with this need, a
> > > > CORBA IDL structure called MPI Id was declared. It has since
> > > > been reused in many subsequent CORBA services.
> > > > <p>The MPI Id CORBA type is defined as a triple:
> > > > <br><tt> struct Id {</tt>
> > > > <br><tt> string value;</tt>
> > > > <br><tt> string domain;</tt>
> > > > <br><tt> string type;</tt>
> > > > <br><tt> };</tt>
> > > > <p>For example, MP PL 001 represents identifier domain 'MP', type 'Plate',
> > > > identifier value '001'. The value uniquely identifies an object in
> > > > the domain of a given type. Note that an object may have more than
> > > > one ID, so that the plate known as MP PL 001 may also be known as SE PL
> > > > 435a in the SE domain.
> > > > <p>Some of the services that use these identifiers found this too limiting.
> > > > For example, when retrieving clones from GenBank you may want to use either
> > > > the accession number or the GI number. However, either of these would
> > > > have been encoded as
> > > > <br> GB CL gid
> > > > <br> GB CL accession
> > > > <br>This raised the problem of differentiating whether an identifier is
> > > > a GI number or an accession number. If the namespaces of the accession
> > > > numbers and GI numbers overlap, then there is no way for a server or client
> > > > to identify which form was intended.
> > > > <p>Another weakness is that object type can be overloaded in the same manner;
> > > > for example a sequence and a contig are both sequences. Similarly,
> > > > inclusion of a version number for an identifier would require overloading
> > > > the value field of the MPI Id.
> > > > <p>Therefore it was found that limiting the unique identifier to three
> > > > elements was too restrictive. There must be provision for extension.
> > > > <br>
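To make the ambiguity described above concrete, here is a minimal Python sketch of the triple. The class name `MpiId` is an illustrative assumption; only the three string fields (value, domain, type) and the GB CL examples come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MpiId:
    """Hypothetical Python mirror of the three-field CORBA struct."""
    value: str
    domain: str
    type: str


# A GI number and an accession number are both encoded as (GB, CL, value),
# so nothing in the triple records which kind of value was meant.
by_gi = MpiId(value="146575", domain="GB", type="CL")
by_accession = MpiId(value="J01636", domain="GB", type="CL")

# Domain and type are identical; only the opaque value string differs.
same_scope = (by_gi.domain, by_gi.type) == (by_accession.domain, by_accession.type)
print(same_scope)
```

A receiver that sees only domain and type has to guess how to interpret the value, which is the limitation the proposal sets out to remove.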
> > > > <h4>
> > > > <a NAME="AGAVE"></a>AGAVE db_id</h4>
> > > > DoubleTwist Inc. has found some of the same issues in their AGAVE product.
> > > > AGAVE defines an identifier called db_id in the AGAVE DTD file, an XML
> > > > format.
> > > > <p>The AGAVE db_id is defined as follows:
> > > > <blockquote><tt><!-- --></tt>
> > > > <br><tt><!-- db_id is an identifier for an object in its source database. --></tt>
> > > > <br><tt><!-- --></tt>
> > > > <br><tt><!-- Attributes: --></tt>
> > > > <br><tt><!-- --></tt>
> > > > <br><tt><!-- id: a data identifier such as GenBank accession or PID. --></tt>
> > > > <br><tt><!-- db_code: a code for the data source, e.g. GenBank is "gb". --></tt>
> > > > <br><tt><!-- version: version of the associated data. --></tt>
> > > > <br><tt><!-- --></tt>
> > > > <br><tt><!ELEMENT db_id EMPTY></tt>
> > > > <br><tt><!ATTLIST db_id id CDATA #REQUIRED</tt>
> > > > <br><tt> version CDATA #IMPLIED</tt>
> > > > <br><tt> db_code CDATA #REQUIRED ></tt></blockquote>
> > > > In this format the version is explicitly specified, but the weakness
> > > > remains that there is insufficient scope to specify object types or the
> > > > kind of ID being given (accession vs. GI).
> > > > <br>
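As a rough illustration of what a db_id carries, the following Python sketch reads one such element with the standard library. The sample element and the function name `read_db_id` are assumptions made for this example; only the attribute names and their required/optional status come from the DTD fragment above.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; attribute names follow the ATTLIST above,
# the attribute values are chosen only for illustration.
SAMPLE = '<db_id id="J01636" version="1" db_code="gb"/>'


def read_db_id(xml_text):
    """Return (db_code, id, version) from a db_id element; version may be None."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "db_id":
        raise ValueError("expected a db_id element, got %r" % elem.tag)
    # id and db_code are #REQUIRED in the DTD. ElementTree does not validate
    # against a DTD, so check for them explicitly.
    for required in ("id", "db_code"):
        if required not in elem.attrib:
            raise ValueError("missing required attribute %r" % required)
    return elem.get("db_code"), elem.get("id"), elem.get("version")


print(read_db_id(SAMPLE))
```

Note that the result has a slot for the version but none for the object type or for the kind of id, which matches the weakness noted above.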
> > > > <h2>
> > > > <a NAME="I3C String"></a>I3C String Identifiers</h2>
> > > > For use within the I3C architecture, the existing identifier definitions
> > > > were found to be inadequate to handle the breadth and scope of the possible
> > > > identifiers that would be required. The following sections detail
> > > > the requirements and specification of a new identifier format for use
> > > > within the I3C architecture.
> > > > <h3>
> > > > <a NAME="Requirements"></a>Requirements for the I3C String Identifier</h3>
> > > > The I3C architecture has the following syntactical and semantic requirements
> > > > for its identifiers;
> > > > <h4>
> > > > <a NAME="Syntactic"></a>Syntactic Requirements</h4>
> > > >
> > > > <ol>
> > > > <li>
> > > > The identifier must be encodable in a string format</li>
> > > >
> > > > <li>
> > > > The identifier must be extensible</li>
> > > >
> > > > <li>
> > > > The identifier must uniquely identify one object</li>
> > > >
> > > > <li>
> > > > The identifier must not require additional contextual information for evaluation</li>
> > > > </ol>
> > > > These requirements result from the need to transmit the identifier in an
> > > > XML format to and from web-services. By requiring that it can be
> > > > encoded as a string, it becomes possible to transmit identifiers via other
> > > > mechanisms as well. Also, as noted in the examples given above, the
> > > > identifier must be extensible to allow use with biological objects that
> > > > have not yet been defined.
> > > > <h4>
> > > > <a NAME="Semantic"></a>Semantic Requirements</h4>
> > > > For an Id to uniquely specify a biological object in a system, it needs
> > > > to include the following pieces of information;
> > > > <br>
> > > > <ol>
> > > > <li>
> > > > Authority : The name of the organization that has defined an
> > > > entity.</li>
> > > >
> > > > <li>
> > > > Id Value : an alpha-numeric sequence that uniquely identifies an
> > > > object to its authority</li>
> > > >
> > > > <li>
> > > > Namespace : one or more statements constraining the scope in which
> > > > an Id is evaluated</li>
> > > >
> > > > <li>
> > > > Version : (optional) version number for an Id</li>
> > > > </ol>
> > > > As an example, the following uniquely identifies a sequence in GenBank:
> > > > <p> GenBank, Sequence, Accession J01636, version 1
> > > > <p>With all these pieces of information we can uniquely identify a sequence.
> > > > Leaving off the version number we can get pretty close. Leaving out
> > > > any of the other bits of information makes it impossible to find the object
> > > > without a priori knowledge of the context.
> > > > <br>
> > > > <h2>
> > > > <a NAME="Specification"></a>Specification of the I3C String Identifier</h2>
> > > > To take advantage of existing work on unique identifiers, the I3C
> > > > Technical Architecture working group has selected the World Wide Web Consortium's
> > > > (W3C) definition of a Uniform Resource Name (URN) as the basis for the
> > > > I3C String Identifier. For additional background on URNs, please
> > > > see Appendix A, "URN Reference", for the definition of a URN and reference
> > > > links.
> > > > <h4>
> > > > <a NAME="Web Centric Id: URI,"></a><b>Web Centric Id: URI, URN</b></h4>
> > > > To summarize the IETF and W3C documents, a URI can be written as
> > > > having the following parts;
> > > > <p> scheme:namespace identifier://authority/path/.../pathN/value?queryterm#fragment
> > > > <p> where
> > > > <br>
> > > > scheme and namespace identifier define the semantics of everything that
> > > > follows
> > > > <br>
> > > > authority defines the organization responsible for defining and managing
> > > > the namespace
> > > > <br>
> > > > path/.../pathN/ defines a subset of an authority's namespace
> > > > <br>
> > > > value is the last element in the path
> > > > <br>
> > > > queryterm indicates a post-processing directive
> > > > <br>
> > > > fragment defines a preprocessing directive or fragment within the scope
> > > > of the Id
> > > > <p>The adoption of the URN format should simplify integration with other
> > > > existing standards such as MAGE-ML which permit the use of URN identifiers.
> > > > <br>
> > > > <h3>
> > > > <a NAME="I3C String Identifier"></a>I3C String Identifier Definition</h3>
> > > > Given the definition of a URN above, we have defined the following syntax
> > > > for an I3C String identifier;
> > > > <p><tt> urn:lsid:<authority>:<namespace>:<value>:<version></tt>
> > > > <p>The different parts of the identifier are delimited by colons ":".
> > > > <p>The elements of the identifier are as follows;
> > > > <ul>
> > > > <li>
> > > > scheme = urn</li>
> > > >
> > > > <ul>
> > > > <li>
> > > > This specifies that the identifier is in URN format</li>
> > > > </ul>
> > > >
> > > > <li>
> > > > namespace identifier = lsid</li>
> > > >
> > > > <ul>
> > > > <li>
> > > > The I3C string identifier namespace identifier is defined as "Life Science
> > > > Identifier", or "lsid".</li>
> > > > </ul>
> > > >
> > > > <li>
> > > > authority = <authority></li>
> > > >
> > > > <ul>
> > > > <li>
> > > > This portion uniquely identifies the organization and optionally the organizational
> > > > unit that has defined the namespace for the remaining portions of the identifier</li>
> > > > </ul>
> > > >
> > > > <li>
> > > > namespace = <namespace></li>
> > > >
> > > > <ul>
> > > > <li>
> > > > a hierarchical namespace to scope the identifier value. The form and
> > > > content of this section are defined and managed by the authority</li>
> > > > </ul>
> > > >
> > > > <li>
> > > > value = <value></li>
> > > >
> > > > <ul>
> > > > <li>
> > > > the unique identifier for an object within the namespace defined by an
> > > > authority</li>
> > > > </ul>
> > > >
> > > > <li>
> > > > version = <version></li>
> > > >
> > > > <ul>
> > > > <li>
> > > > optional version information associated with the identifier value</li>
> > > > </ul>
> > > > </ul>
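The element list above can be sketched as a small parser. This is a minimal Python sketch, not a reference implementation: the names `Lsid` and `parse_lsid` are invented here, and two behaviours are assumptions on my part: an empty namespace is accepted (Appendix B uses one for the taxonomy term), and an empty trailing field is treated as "no version".

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parsed fields of an identifier; names follow the element list above."""
    authority: str
    namespace: str
    value: str
    version: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<value>[:<version>] on colons."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-delimited parts: %r" % text)
    # RFC 2141 treats the "urn" prefix and the namespace identifier as
    # case-insensitive, so compare them in lower case.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not a urn:lsid identifier: %r" % text)
    authority, namespace, value = parts[2:5]
    if not authority or not value:
        raise ValueError("authority and value must be non-empty: %r" % text)
    # Assumption: an absent or empty trailing field means "no version".
    version = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, value, version)


print(parse_lsid("urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/accession:J01636:1"))
```

Because the colon is the only delimiter, the hierarchical namespace (with its slashes) survives as a single field and can be split further by whoever owns the authority.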
> > > >
> > > > <h4>
> > > > <a NAME="Examples"></a>Examples</h4>
> > > > For example, the plate identified by Millennium as ID 12345, with
> > > > MPI Id "MP PL 12345", becomes
> > > > <p> urn:lsid:informatics.mpi.com:plate:12345
> > > > <p>Since the authority is free to define any path that it wishes (provided
> > > > of course that it manages them), we may want to define the path section
> > > > for plates more fully to something like this
> > > > <p> urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
> > > > <p>We can now use expanded path information to deal with cases that required
> > > > type overloading in the MPI Id. For example,
> > > > <br> (Accession) GB CL J01636 version 1
> > > > <br> (GI) GB CL 146575
> > > > <br>refer to the same object. These can now be encoded as
> > > > <p> urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/accession:J01636:1
> > > > <br> urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/gi:146575
> > > > <br>
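Going the other way, the examples above can be produced mechanically from an old-style triple. The sketch below is only an illustration: the lookup tables and the function names `format_lsid` and `mpi_to_lsid` are hypothetical, and the only mapping taken from the text is MP/PL to informatics.mpi.com/plate.

```python
# Hypothetical lookup tables: only the MP/PL -> informatics.mpi.com/plate pair
# is taken from the example above; a real deployment would define its own.
DOMAIN_TO_AUTHORITY = {"MP": "informatics.mpi.com"}
TYPE_TO_NAMESPACE = {"PL": "plate"}


def format_lsid(authority, namespace, value, version=None):
    """Join the fields with colons, appending the version only when given."""
    fields = [authority, namespace, str(value)]
    if version is not None:
        fields.append(str(version))
    for field in fields:
        # The colon is the delimiter, so it cannot appear inside a field.
        if ":" in field:
            raise ValueError("colon not allowed inside a field: %r" % field)
    return ":".join(["urn", "lsid"] + fields)


def mpi_to_lsid(mpi_id):
    """Convert a 'DOMAIN TYPE VALUE' string such as 'MP PL 12345'."""
    domain, type_code, value = mpi_id.split()
    return format_lsid(DOMAIN_TO_AUTHORITY[domain], TYPE_TO_NAMESPACE[type_code], value)


print(mpi_to_lsid("MP PL 12345"))
print(format_lsid("genbank.ncbi.nlm.nih.gov", "sequence/accession", "J01636", 1))
```

The accession and GI forms now differ in the namespace field rather than relying on the reader to guess from the value.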
> > > > <h3>
> > > > <a NAME="Appendix A, URN"></a>Appendix A, URN Reference</h3>
> > > >
> > > > <p><br>Ref: http://www.w3.org/Addressing/,http://www.ietf.org/rfc/rfc2141.txt,
> > > > http://www.ietf.org/rfc/rfc2396.txt
> > > > <p>In the context of the web, there is already a definition for global
> > > > identifiers, the Uniform Resource Name. From
> > > > <br>http://www.ietf.org/rfc/rfc2141.txt
> > > > <blockquote>Uniform Resource Names (URNs) are intended to serve as persistent,
> > > > <br>location-independent, resource identifiers and are designed to make
> > > > <br>it easy to map other namespaces (which share the properties of URNs)
> > > > <br>into URN-space. Therefore, the URN syntax provides a means to encode
> > > > <br>character data in a form that can be sent in existing protocols,
> > > > <br>transcribed on most keyboards, etc.</blockquote>
> > > > URIs are the superset of URNs and URLs. URLs are familiar due to
> > > > their use on the web. They differ from URNs in that they are scoped to
> > > > a particular protocol (e.g. http:*, ftp:* etc). URNs are scoped
> > > > simply as identifiers, urn:*.
> > > > <p>URIs are divided into two parts,
> > > > <br> <scheme> : <scheme specific part>
> > > > <br>e.g. in http://www.mpi.com/index.html, <b>http</b> is the scheme,
> > > > and <b>www.mpi.com/index.html </b>is the scheme specific part that is interpreted
> > > > in the context of that scheme.
> > > > <br>
> > > > <blockquote>The URI syntax does not require that the scheme-specific-part
> > > > have any general structure or set of semantics which is common among
> > > > all URI. However, a subset of URI do share a common syntax for
> > > > representing hierarchical relationships within the namespace. This
> > > > "generic URI" syntax consists of a sequence of four main components:
> > > > <p> <scheme>://<authority><path>?<query>#fragment
> > > > <p>each of which, except <scheme>, may be absent from a particular URI.
> > > > For example, some URI schemes do not allow an <authority> component,
> > > > and others do not use a <query> component.
> > > > <p> absoluteURI = scheme ":" ( hier_part
> > > > | opaque_part )
> > > > <p> URI that are hierarchical in nature use the slash "/" character
> > > > for separating hierarchical components. For some file systems,
> > > > a "/" character (used to denote the hierarchical structure of a URI)
> > > > is the delimiter used to construct a file name hierarchy, and thus
> > > > the URI path will look similar to a file pathname. This does
> > > > NOT imply that the resource is a file or that the URI maps to an actual
> > > > filesystem pathname.
> > > > <p>[snip]
> > > > <p>The path component contains data, specific to the authority (or the
> > > > scheme if there is no authority component), identifying the resource within
> > > > the scope of that scheme and authority.
> > > > <p>[snip]
> > > > <p>When a URI reference is used to perform a retrieval action on the identified
> > > > resource, the optional fragment identifier, separated from the URI by a
> > > > crosshatch ("#") character, consists of additional reference information
> > > > to be interpreted by the user agent after the retrieval action has been
> > > > successfully completed. As such, it is not part of a URI, but
> > > > is often used in conjunction with a URI.
> > > > <p>(http://www.ietf.org/rfc/rfc2396.txt)</blockquote>
> > > > So to sum up the IETF stuff, a URI can be written as having all of
> > > > the following parts;
> > > > <p> scheme://authority/path/path2?queryterm=something#fragment
> > > > <br>
> > > > <br>
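The generic breakdown summarized above can be checked with Python's standard `urllib.parse.urlsplit`. This is just an illustration: the query and fragment on the example URL are made up to match the pattern above, and the second call shows that a generic URI splitter sees everything after `urn:` as one opaque path, which is why the colon-delimited fields need their own parsing.

```python
from urllib.parse import urlsplit

# Hierarchical URI: the generic components are all recovered.
# The query and fragment here are illustrative, not from a real page.
web = urlsplit("http://www.mpi.com/index.html?queryterm=something#fragment")
print(web.scheme, web.netloc, web.path, web.query, web.fragment)

# URN: there is no "//authority" part, so the generic splitter returns an
# empty netloc and leaves the rest of the identifier in the path.
urn = urlsplit("urn:lsid:informatics.mpi.com:plate:12345")
print(urn.scheme, repr(urn.netloc), urn.path)
```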
> > > > <h3>
> > > > <a NAME="Appendix B, Some example"></a>Appendix B, Some example identifiers</h3>
> > > > Here are some examples of identifiers written in this format;
> > > > <p>GenBank: the sequence of J01636 could be identified as follows;
> > > > <p> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636
> > > > <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636:1
> > > > <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:K01483
> > > > <br> urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/gi:146575
> > > > <p>The associated protein could be referred to as follows;
> > > > <p> urn:lsid:genbank.ncbi.nlm.nih.gov:protein/locus:AAA24054
> > > > <br> urn:lsid:genpept.ncbi.nlm.nih.gov:protein/accession:AAA24054.1
> > > > <br> urn:lsid:genpept.ncbi.nlm.nih.gov:protein/pid:g146578
> > > > <br>
> > > > <p>Another example is the following nucleotide from EMBL
> > > > <p> urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092
> > > > <br> urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092:1
> > > > <p>This includes a reference to a taxonomy term
> > > > <p> urn:lsid:taxonomy.ebi.ac.uk::10090
> > > > <br>
> > > > <br>
> > > > <h3>
> > > > <a NAME="Appendix C, Additional"></a>Appendix C, Additional Work</h3>
> > > > 1. More clearly define what is authority and what is path. E.g. should
> > > > GenBank be part of the authority string, or is it part of a path beneath
> > > > ncbi.nlm.nih.gov?
> > > > <p>2. Since path terms are owned by the authority, get common definitions
> > > > for authorities/databases such as GenBank, EMBL etc. This could be
> > > > defined by us and presented to the organization in question for ratification.
> > > > Entities that do not make IDs publicly available are responsible for themselves
> > > > and their customers only but would benefit from a set of guidelines and
> > > > examples.
> > > > <p>3. Examine use cases in proteomics and other branches of informatics.
> > > > <p>4. Create libraries (Java, Perl) for manipulating IDs in this form.
> > > > <br>
> > > > <br>
> > > > <br>
> > > > <br>
> > > > </body>
> > > > </html>
> > >
> > > --
> > > ========================================================================
> > > Lincoln D. Stein Cold Spring Harbor Laboratory
> > > lstein@cshl.org Cold Spring Harbor, NY
> > >
> > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS.
> > > PLEASE WRITE FOR DETAILS.
> > > ========================================================================
> > >
>