[DAS] Re: Our identifier doc and proposal

Wed, 28 Nov 2001 11:39:36 -0500

I think we're going to find that the features form a DAG and not a
hierarchy.  Otherwise you're going to have problems classifying things
like "genes".  In the context of genetics, a gene is a type of
complementation group.  In the context of genomics, a gene is a
subclass of transcription features, translation features, and
regulatory features.

Or what do we say about transposons?  You can think of them in various
contexts as: repeats, insertions, and pseudogenes.
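
To make the DAG point concrete, here is a minimal sketch, not a proposed
schema. The type names and the PARENTS table are hypothetical, made up
only to mirror the gene and transposon examples above. The point is that
a type may list more than one parent, which a strict tree cannot express.

```python
# Hypothetical feature-type graph: each type maps to its list of parent types.
# A strict hierarchy would allow at most one parent per type; a DAG allows many.
PARENTS = {
    "feature": [],
    "complementation_group": ["feature"],
    "transcription_feature": ["feature"],
    "translation_feature": ["feature"],
    "regulatory_feature": ["feature"],
    "repeat": ["feature"],
    "insertion": ["feature"],
    "pseudogene": ["feature"],
    "gene": [
        "complementation_group",
        "transcription_feature",
        "translation_feature",
        "regulatory_feature",
    ],
    "transposon": ["repeat", "insertion", "pseudogene"],
}


def ancestors(term, parents=PARENTS):
    """Return every type reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_a(term, candidate, parents=PARENTS):
    """True if `term` is `candidate` or has `candidate` among its ancestors."""
    return term == candidate or candidate in ancestors(term, parents)
```

With this layout, is_a("transposon", "repeat") and
is_a("transposon", "pseudogene") both hold, so a client asking for
either class would see transposons.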

Lincoln

Ewan Birney writes:
 > On Tue, 27 Nov 2001, Brian Gilman wrote:
 > 
 > > Yes,
 > > 
 > > 	Absolutely, the question is: do we build the ontology and hope
 > > that it suits 80% of people's needs or do we adopt another group's? I
 > > don't think anyone has formed a genomics ontology group? So I'd be up for
 > > building our own with the help of Thomas/Mathew and Ewan. I think we can
 > > learn from bioperl, biojava, and Ensembl in the way that they build their
 > > feature hierarchies. 
 > 
 > <giggle>
 > 
 > We have an ontology? Inside Ensembl?
 > 
 > </giggle>
 > 
 > 
 > But - point taken - we actually now have quite an understanding of the
 > different feature types you would want to display - we'd be happy to
 > contribute to this. 
 > 
 > It is not really an ontology - it is a hierarchy. I think it will piss off
 > the professional ontologists if we called it an ontology (mind
 > you... maybe that would be fun...)
 > 
 > 
 > 
 > > 
 > > 			-B
 > > 
 > > -----------------------
 > > Brian Gilman <gilmanb@genome.wi.mit.edu>
 > > Sr. Software Engineer MIT/Whitehead Inst. Center for Genome Research
 > > One Kendall Square, Bldg. 300 / Cambridge, MA 02139-1561 USA
 > > phone +1 617  252 1069 / fax +1 617 252 1902
 > > 
 > > 
 > > On Tue, 27 Nov 2001, Lincoln Stein wrote:
 > > 
 > > > Hi Brian,
 > > > 
 > > > I'm quite sure we'll need an ontology for feature types (at least the
 > > > top few tiers, which people can add to), so we'll be doing some
 > > > ontology building one way or another.  Would you agree?
 > > > 
 > > > Lincoln
 > > > 
 > > > Brian Gilman writes:
 > > >  > I think so and I have also asked about this in the group. It becomes very
 > > >  > hard to "control" the namespace without an ontology. This is why we allow
 > > >  > the individuals to control the top level. 
 > > >  > 
 > > >  > 			-B
 > > >  > 
 > > >  > 
 > > >  > 
 > > >  > On Tue, 27 Nov 2001, Lincoln Stein wrote:
 > > >  > 
 > > >  > > Hi Brian,
 > > >  > > 
 > > >  > > I'm pleased to see that the I3C identifier is nearly identical to my
 > > >  > > (biological class,namespace,id) triple suggestion.  The difference is
 > > >  > > the version number, which I agree with completely.  So I accept it
 > > >  > > wholeheartedly.
 > > >  > > 
 > > >  > > The part that I don't feel entirely comfortable with is that the
 > > >  > > namespace seems to be completely under the control of the authority:
 > > >  > > 
 > > >  > >    urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
 > > >  > > 
 > > >  > > I think the top level namespace, e.g. "plate" should be hard-and-fast
 > > >  > > data types.  Is this envisioned by the I3C?
 > > >  > > 
 > > >  > > Lincoln
 > > >  > > 
 > > >  > > 
 > > >  > > Brian Gilman writes:
 > > >  > >  > Lincoln,
 > > >  > >  > 
 > > >  > >  > 	Please find attached an updated identifier proposal that we have
 > > >  > >  > been working
 > > >  > >  > on to identify objects in the web services architecture. I like it over
 > > >  > >  > the feature_class mechanism because we can uniquely identify an object in
 > > >  > >  > the "cloud".
 > > >  > >  > 
 > > >  > >  > 		Best, 
 > > >  > >  > 
 > > >  > >  > 			-Brian
 > > >  > >  > 
 > > >  > >  > 
 > > >  > >  > 
 > > >  > >  > <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
 > > >  > >  > <html>
 > > >  > >  > <head>
 > > >  > >  >    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
 > > >  > >  >    <meta name="Author" content="Ted liefeld">
 > > >  > >  >    <meta name="GENERATOR" content="Mozilla/4.73 [en]C-CCK-MCD BA45DSL  (WinNT; U) [Netscape]">
 > > >  > >  >    <title>identifiers</title>
 > > >  > >  > </head>
 > > >  > >  > <body>
 > > >  > >  > 
 > > >  > >  > <h2>
 > > >  > >  > I3C Identifier Specification</h2>
 > > >  > >  > 
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Abstract"></a>Abstract:</h3>
 > > >  > >  > This document describes the motivation for and specification of string
 > > >  > >  > identifiers to be used to identify objects within the life sciences domain
 > > >  > >  > by the I3C architecture.&nbsp; A string format for the identifiers is defined
 > > >  > >  > as&nbsp;<tt>&nbsp; urn:lsid:&lt;authority>:&lt;namespace>:&lt;value>:&lt;version>.</tt>
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h2>
 > > >  > >  > Index:</h2>
 > > >  > >  > <a href="#Abstract">Abstract</a>
 > > >  > >  > <br><a href="#Introduction:">Introduction</a>
 > > >  > >  > <br><a href="#Background: Existing">Background: Existing Identifiers</a>
 > > >  > >  > <blockquote><a href="#MPI">MPI Id</a>
 > > >  > >  > <br><a href="#AGAVE">AGAVE db_id</a></blockquote>
 > > >  > >  > <a href="#I3C String">I3C String Identifiers</a>
 > > >  > >  > <blockquote><a href="#Requirements">Requirements for the I3C String Identifier</a>
 > > >  > >  > <blockquote><a href="#Syntactic">Syntactic Requirements</a>
 > > >  > >  > <br><a href="#Semantic">Semantic Requirements</a></blockquote>
 > > >  > >  > </blockquote>
 > > >  > >  > <a href="#Specification">Specification of the I3C String Identifier</a>
 > > >  > >  > <blockquote><a href="#Web Centric Id:  URI,">Web Centric Id:&nbsp; URI,
 > > >  > >  > URN</a>
 > > >  > >  > <br><a href="#I3C String Identifier">I3C String Identifier Definition</a>
 > > >  > >  > <br><a href="#Examples">Examples</a></blockquote>
 > > >  > >  > <a href="#Appendix A, URN">Appendix A, URN Reference</a>
 > > >  > >  > <br><a href="#Appendix B, Some example">Appendix B, Some example identifiers</a>
 > > >  > >  > <br><a href="#Appendix C, Additional">Appendix C, Additional Work</a>
 > > >  > >  > <h2>
 > > >  > >  > <a NAME="Introduction:"></a>Introduction:</h2>
 > > >  > >  > One of the goals of the I3C is the definition of a common architecture
 > > >  > >  > and standards to simplify interoperability between applications from different
 > > >  > >  > companies.&nbsp; For interoperability to occur,&nbsp; we need a common
 > > >  > >  > format for unique identifiers for any objects we reference that would function
 > > >  > >  > in the context of I3C services.&nbsp; The remainder of this document defines
 > > >  > >  > a web-centric ID definition that will allow us to create federated systems
 > > >  > >  > utilizing many databases and services in a common way.
 > > >  > >  > <p>The purpose of this identifier definition is to uniquely identify biologically
 > > >  > >  > significant objects,&nbsp; e.g. a sequence, a clone, a gene, a contig etc.
 > > >  > >  > It is not meant to identify artifacts of implementation, e.g. a database,
 > > >  > >  > a server.&nbsp; Identifying objects such as these should be handled via
 > > >  > >  > other mechanisms such as JDBC URLs and WSDL.
 > > >  > >  > <p>In addition, using http URLs as identifiers (e.g. http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-id+7clen1HOYrs+[taxonomy-ID:10090]+-e)
 > > >  > >  > is not adequate for&nbsp; organizations who need to limit or control access
 > > >  > >  > to external databases for intellectual property or security reasons.&nbsp;
 > > >  > >  > The identifiers defined here are deliberately location independent and
 > > >  > >  > are intended to uniquely identify a biological artifact, but not the location
 > > >  > >  > of that artifact.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Background: Existing"></a><b>Background: Existing Identifiers</b></h3>
 > > >  > >  > There are currently many existing forms of identifiers for biological artifacts
 > > >  > >  > in use within the life sciences community. These include proprietary formats
 > > >  > >  > as well as public domain formats.&nbsp; Some of these are discussed below.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="MPI"></a>MPI Id</h4>
 > > >  > >  > Within Millennium Pharmaceuticals Inc.,&nbsp; there is a suite of CORBA
 > > >  > >  > services that have been running in production for over two years.&nbsp;
 > > >  > >  > One of the first tasks they addressed was the identity management of objects
 > > >  > >  > that appear in more than one database.&nbsp; To deal with this need, a
 > > >  > >  > CORBA IDL structure called&nbsp; MPI Id was declared.&nbsp; It has since
 > > >  > >  > been reused in many subsequent CORBA services.
 > > >  > >  > <p>The MPI Id CORBA type is defined as a triple
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp; struct Id {</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string value;</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string domain;</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; string type;</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp; };</tt>
 > > >  > >  > <p>for example, MP PL 001 represents identifier domain 'MP', type 'Plate',
 > > >  > >  > identifier value '001'.&nbsp; The value uniquely identifies an object in
 > > >  > >  > the domain of a given type.&nbsp; Note that an object may have more than
 > > >  > >  > one ID, so that the plate known as MP PL 001 may also be known as SE PL
 > > >  > >  > 435a in the SE domain.
 > > >  > >  > <p>Some of the services that use these identifiers found this too limiting.&nbsp;
 > > >  > >  > For example,&nbsp; when retrieving clones from GenBank you may want to use the
 > > >  > >  > accession number or the GI number.&nbsp; However either of these would
 > > >  > >  > have been encoded as
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GB CL gid
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GB CL accession
 > > >  > >  > <br>This raised the problem of differentiating whether an identifier is
 > > >  > >  > a GI number or an accession number.&nbsp; If the namespaces of the accession
 > > >  > >  > number and gi numbers overlap, then there is no way for a server or client
 > > >  > >  > to identify which form was intended.
 > > >  > >  > <p>Another weakness is that object type can be overloaded in the same manner;&nbsp;
 > > >  > >  > for example a sequence and a contig are both sequences.&nbsp; Similarly,
 > > >  > >  > inclusion of a version number for an identifier would require overloading
 > > >  > >  > the value field of the MPI Id.
 > > >  > >  > <p>Therefore it was found that limiting the unique identifier to three
 > > >  > >  > elements was too restrictive.&nbsp; There must be provision for extension.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="AGAVE"></a>AGAVE db_id</h4>
 > > >  > >  > DoubleTwist Inc. has found some of the same issues in their AGAVE product.&nbsp;
 > > >  > >  > AGAVE defines an identifier called db_id in the AGAVE DTD file, an XML
 > > >  > >  > format.
 > > >  > >  > <p>The AGAVE db_id is defined as follows;
 > > >  > >  > <blockquote><tt>&lt;!--&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!-- db_id is an identifier for an object in its source database.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!--&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!-- Attributes:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!--&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!-- id:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a data identifier
 > > >  > >  > such as GenBank accession or PID.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!-- db_code:&nbsp; a code for the data source, e.g. GenBank
 > > >  > >  > is "gb".&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --></tt>
 > > >  > >  > <br><tt>&lt;!-- version:&nbsp; version of the associated data.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!--&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > --></tt>
 > > >  > >  > <br><tt>&lt;!ELEMENT db_id EMPTY></tt>
 > > >  > >  > <br><tt>&lt;!ATTLIST db_id&nbsp; id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > CDATA&nbsp; #REQUIRED</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > version&nbsp; CDATA&nbsp; #IMPLIED</tt>
 > > >  > >  > <br><tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > db_code&nbsp; CDATA&nbsp; #REQUIRED ></tt></blockquote>
 > > >  > >  > In this format, the version is explicitly specified, but the weaknesses
 > > >  > >  > remain of having insufficient scope to specify object types or variations
 > > >  > >  > in the type of ID being specified (accession vs gi).
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h2>
 > > >  > >  > <a NAME="I3C String"></a>I3C String Identifiers</h2>
 > > >  > >  > For use within the I3C architecture, the existing identifier definitions
 > > >  > >  > were found to be inadequate to handle the breadth and scope of the possible
 > > >  > >  > identifiers that would be required.&nbsp; The following sections detail
 > > >  > >  > the requirements and specification of a new identifier format for use
 > > >  > >  > within the I3C architecture.
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Requirements"></a>Requirements for the I3C String Identifier</h3>
 > > >  > >  > The I3C architecture has the following syntactical and semantic requirements
 > > >  > >  > for its identifiers;
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="Syntactic"></a>Syntactic Requirements</h4>
 > > >  > >  > 
 > > >  > >  > <ol>
 > > >  > >  > <li>
 > > >  > >  > The identifier must be encodable in a string format</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > The identifier must be extensible</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > The identifier must uniquely identify one object</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > The identifier must not require additional contextual information for evaluation</li>
 > > >  > >  > </ol>
 > > >  > >  > These requirements result from the need to transmit the identifier in an
 > > >  > >  > XML format to and from web-services.&nbsp; By requiring that it can be
 > > >  > >  > encoded as a string, it becomes possible to transmit identifiers via other
 > > >  > >  > mechanisms as well.&nbsp; Also, as noted in the examples given above, the
 > > >  > >  > identifier must be extensible to allow use with biological objects that
 > > >  > >  > have not yet been defined.
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="Semantic"></a>Semantic Requirements</h4>
 > > >  > >  > For an Id to uniquely specify a biological object in a system, it needs
 > > >  > >  > to include the following pieces of information;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <ol>
 > > >  > >  > <li>
 > > >  > >  > &nbsp;Authority :&nbsp; The name of the organization that has defined an
 > > >  > >  > entity.</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > &nbsp;Id Value : an alpha-numeric sequence that uniquely identifies an
 > > >  > >  > object to its authority</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > &nbsp;Namespace : one or more statements constraining the scope in which
 > > >  > >  > an Id is evaluated</li>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > &nbsp;Version&nbsp; : (optional) version number for an Id</li>
 > > >  > >  > </ol>
 > > >  > >  > As an example, the following uniquely identifies a sequence in Genbank,
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp; GenBank, Sequence, Accession J01636,&nbsp; version
 > > >  > >  > 1
 > > >  > >  > <p>With all these pieces of information we can uniquely identify a sequence.&nbsp;
 > > >  > >  > Leaving off the version number we can get pretty close.&nbsp; Leaving out
 > > >  > >  > any of the other bits of information makes it impossible to find the object
 > > >  > >  > without a priori knowledge of the context.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h2>
 > > >  > >  > <a NAME="Specification"></a>Specification of the I3C String Identifier</h2>
 > > >  > >  > To take advantage of existing work on unique identifiers,&nbsp; the I3C
 > > >  > >  > technical Architecture working group has selected the Internet Engineering Task
 > > >  > >  > Force's (IETF) definition of a Uniform Resource Name (URN) as the basis for the
 > > >  > >  > I3C String Identifier.&nbsp; For additional background on URNs, please
 > > >  > >  > see Appendix A, "URN Reference", for the definition of a URN and reference
 > > >  > >  > links.
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="Web Centric Id:  URI,"></a><b>Web Centric Id:&nbsp; URI, URN</b></h4>
 > > >  > >  > To summarize the IETF and W3C documents,&nbsp; a URI can be written as
 > > >  > >  > having the following parts;
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; scheme:namespace identifier://authority/path/.../pathN/value?queryterm#fragment
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > scheme and namespace identifier define the semantics of everything that
 > > >  > >  > follows
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > authority defines the organization responsible for defining and managing
 > > >  > >  > the namespace
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > path/.../pathN/ defines a subset of an authority's namespace
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > value is the last element in the path
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > queryterm indicates a post-processing directive
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > fragment defines a preprocessing directive or fragment within the scope
 > > >  > >  > of the Id
 > > >  > >  > <p>The adoption of the URN format should simplify integration with other
 > > >  > >  > existing standards such as MAGE-ML which permit the use of URN identifiers.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="I3C String Identifier"></a>I3C String Identifier Definition</h3>
 > > >  > >  > Given the definition of a URN above, we have defined the following syntax
 > > >  > >  > for an I3C String identifier;
 > > >  > >  > <p><tt>&nbsp;&nbsp;&nbsp; urn:lsid:&lt;authority>:&lt;namespace>:&lt;value>:&lt;version></tt>
 > > >  > >  > <p>The different parts of the identifier are delimited by colons ":".
 > > >  > >  > <p>The elements of the identifier are as follows;
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > scheme = urn</li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > This specifies that the identifier is in URN format</li>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > namespace identifier = lsid</li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > The I3C string identifier namespace identifier is defined as "Life Science
 > > >  > >  > Identifier", or "lsid".</li>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > authority = &lt;authority></li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > This portion uniquely identifies the organization and optionally the organizational
 > > >  > >  > unit that has defined the namespace for the remaining portions of the identifier</li>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > namespace = &lt;namespace></li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > a hierarchical namespace to scope the identifier value.&nbsp; The form and
 > > >  > >  > content of this section is defined and managed by the authority</li>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > value = &lt;value></li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > the unique identifier for an object within the namespace defined by an
 > > >  > >  > authority</li>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <li>
 > > >  > >  > version = &lt;version></li>
 > > >  > >  > 
 > > >  > >  > <ul>
 > > >  > >  > <li>
 > > >  > >  > optional version information associated with the identifier value</li>
 > > >  > >  > </ul>
 > > >  > >  > </ul>
 > > >  > >  > 
 > > >  > >  > <h4>
 > > >  > >  > <a NAME="Examples"></a>Examples</h4>
 > > >  > >  > So for example, for the plate, identified by Millennium as ID 12345 with
 > > >  > >  > MPI ID&nbsp;&nbsp;&nbsp; "MP PL 12345"
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp; urn:lsid:informatics.mpi.com:plate:12345
 > > >  > >  > <p>Since the authority is free to define any path that it wishes (provided
 > > >  > >  > of course that it manages them),&nbsp; we may want to define the path section
 > > >  > >  > for plates more fully to something like this
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp; urn:lsid:informatics.mpi.com:plate/glycerol/freeze:12345
 > > >  > >  > <p>We can now use expanded path information to deal with cases that required
 > > >  > >  > type overloading in the MPI ID.&nbsp; For example
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp; (Accession)&nbsp;&nbsp;&nbsp; GB CL j01636 version
 > > >  > >  > 1
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp; (GI)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 > > >  > >  > GB CL 146575
 > > >  > >  > <br>refer to the same object.&nbsp; These can now be encoded as
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/accession:J01636:1
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:sequence/gi:146575
 > > >  > >  > <br>&nbsp;
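
As an illustration of the syntax above, here is a rough sketch of splitting
and rebuilding such identifiers. This is not part of the I3C proposal: the
names Lsid, parse_lsid and format_lsid are hypothetical, and the sketch
assumes that none of the fields themselves contain a colon and that the
version is the only optional field.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Hypothetical container for the fields of urn:lsid:... identifiers."""
    authority: str
    namespace: str            # may hold a "/"-separated path, e.g. "sequence/accession"
    value: str
    version: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:<authority>:<namespace>:<value>[:<version>] into fields."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated fields: %r" % text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not a urn:lsid identifier: %r" % text)
    authority, namespace, value = parts[2:5]
    version = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, value, version)


def format_lsid(lsid: Lsid) -> str:
    """Rebuild the string form; the version is appended only when present."""
    fields = ["urn", "lsid", lsid.authority, lsid.namespace, lsid.value]
    if lsid.version is not None:
        fields.append(lsid.version)
    return ":".join(fields)
```

For the GenBank accession example, parse_lsid gives the namespace
"sequence/accession", which a client could split on "/" to inspect the
top-level term ("sequence") separately from the authority-defined rest.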
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Appendix A, URN"></a>Appendix A, URN Reference</h3>
 > > >  > >  > 
 > > >  > >  > <p><br>Ref: http://www.w3.org/Addressing/, http://www.ietf.org/rfc/rfc2141.txt,
 > > >  > >  > http://www.ietf.org/rfc/rfc2396.txt
 > > >  > >  > <p>In the context of the web, there is already a definition for global
 > > >  > >  > identifiers,&nbsp; the Uniform Resource Name.&nbsp; From
 > > >  > >  > <br>http://www.ietf.org/rfc/rfc2141.txt
 > > >  > >  > <blockquote>Uniform Resource Names (URNs) are intended to serve as persistent,
 > > >  > >  > <br>location-independent, resource identifiers and are designed to make
 > > >  > >  > <br>it easy to map other namespaces (which share the properties of URNs)
 > > >  > >  > <br>into URN-space. Therefore, the URN syntax provides a means to encode
 > > >  > >  > <br>character data in a form that can be sent in existing protocols,
 > > >  > >  > <br>transcribed on most keyboards, etc.</blockquote>
 > > >  > >  > URIs are the superset of URNs and URLs.&nbsp; URLs are familiar due to
 > > >  > >  > their use on the web. They differ from URNs in that they are scoped to
 > > >  > >  > a particular protocol (e.g. http:*, ftp:* etc).&nbsp; URNs are scoped
 > > >  > >  > simply as identifiers urn:*.
 > > >  > >  > <p>URIs are divided into two parts,
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp; &lt;scheme> : &lt;scheme specific part >
 > > >  > >  > <br>e.g. in http://www.mpi.com/index.html,&nbsp; <b>http</b> is the scheme,&nbsp;
 > > >  > >  > and <b>www.mpi.com/index.html </b>is the scheme specific part that is interpreted
 > > >  > >  > in the context of that scheme.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <blockquote>The URI syntax does not require that the scheme-specific-part
 > > >  > >  > have&nbsp; any general structure or set of semantics which is common among
 > > >  > >  > all URI.&nbsp; However, a subset of URI do share a common syntax for&nbsp;
 > > >  > >  > representing hierarchical relationships within the namespace.&nbsp; This
 > > >  > >  > "generic URI" syntax consists of a sequence of four main components:
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;scheme>://&lt;authority>&lt;path>?&lt;query>#fragment
 > > >  > >  > <p>each of which, except &lt;scheme>, may be absent from a particular URI.&nbsp;&nbsp;
 > > >  > >  > For example, some URI schemes do not allow an &lt;authority> component,&nbsp;
 > > >  > >  > and others do not use a &lt;query> component.
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp; absoluteURI&nbsp;&nbsp; = scheme ":" ( hier_part
 > > >  > >  > | opaque_part )
 > > >  > >  > <p>&nbsp; URI that are hierarchical in nature use the slash "/" character
 > > >  > >  > for&nbsp; separating hierarchical components.&nbsp; For some file systems,
 > > >  > >  > a "/"&nbsp; character (used to denote the hierarchical structure of a URI)
 > > >  > >  > is the&nbsp; delimiter used to construct a file name hierarchy, and thus
 > > >  > >  > the URI&nbsp; path will look similar to a file pathname.&nbsp; This does
 > > >  > >  > NOT imply that the resource is a file or that the URI maps to an actual
 > > >  > >  > filesystem pathname.
 > > >  > >  > <p>[snip]
 > > >  > >  > <p>The path component contains data, specific to the authority (or the
 > > >  > >  > scheme if there is no authority component), identifying the resource within
 > > >  > >  > the scope of that scheme and authority.
 > > >  > >  > <p>[snip]
 > > >  > >  > <p>When a URI reference is used to perform a retrieval action on the identified
 > > >  > >  > resource, the optional fragment identifier, separated from the URI by a
 > > >  > >  > crosshatch ("#") character, consists of additional reference information
 > > >  > >  > to be interpreted by the user agent after the retrieval action has been
 > > >  > >  > successfully completed.&nbsp; As such, it is not&nbsp; part of a URI, but
 > > >  > >  > is often used in conjunction with a URI.
 > > >  > >  > <p>(http://www.ietf.org/rfc/rfc2396.txt)</blockquote>
 > > >  > >  > So to sum up the IETF stuff,&nbsp; a URI can be written as having all of
 > > >  > >  > the following parts;
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; scheme://authority/path/path2?queryterm=something#fragment
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Appendix B, Some example"></a>Appendix B, Some example identifiers</h3>
 > > >  > >  > Here are some examples of identifiers written in this format;
 > > >  > >  > <p>GenBank:&nbsp; the sequence for J01636 could be identified as follows;
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:J01636:1
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/accession:K01483
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:nucleotide/gi:146575
 > > >  > >  > <p>The associated protein could be referred to as follows;
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genbank.ncbi.nlm.nih.gov:protein/locus:AAA24054
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genpept.ncbi.nlm.nih.gov:protein/accession:AAA24054.1
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:genpept.ncbi.nlm.nih.gov:protein/pid:g146578
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <p>Another example is the following nucleotide from EMBL
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092
 > > >  > >  > <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:embl.ebi.ac.uk:nucleotide:AB056092:1
 > > >  > >  > <p>This includes a reference to a taxonomy term
 > > >  > >  > <p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urn:lsid:taxonomy.ebi.ac.uk::10090
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <h3>
 > > >  > >  > <a NAME="Appendix C, Additional"></a>Appendix C, Additional Work</h3>
 > > >  > >  > 1. More clearly define what is authority and what is path.&nbsp; E.g. should
 > > >  > >  > GenBank be part of the authority string or is it a part of a path beneath
 > > >  > >  > ncbi.nlm.nih.gov.
 > > >  > >  > <p>2. Since path terms are owned by the authority, get common definitions
 > > >  > >  > for authorities/databases such as GenBank, EMBL etc.&nbsp; This could be
 > > >  > >  > defined by us and presented to the organization in question for ratification.&nbsp;
 > > >  > >  > Entities that do not make IDs publicly available are responsible for themselves
 > > >  > >  > and their customers only but would benefit from a set of guidelines and
 > > >  > >  > examples.
 > > >  > >  > <p>3. Examine use cases in proteomics and other branches of informatics.
 > > >  > >  > <p>4. Create libraries (java, perl) for manipulating IDs in this form.
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > <br>&nbsp;
 > > >  > >  > </body>
 > > >  > >  > </html>
 > > >  > > 
 > > >  > > -- 
 > > >  > > ========================================================================
 > > >  > > Lincoln D. Stein                           Cold Spring Harbor Laboratory
 > > >  > > lstein@cshl.org			                  Cold Spring Harbor, NY
 > > >  > > 
 > > >  > > NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
 > > >  > > PLEASE WRITE FOR DETAILS.
 > > >  > > ========================================================================
 > > >  > > 
 > > > 
 > > > 
 > > 
 > > _______________________________________________
 > > DAS mailing list
 > > DAS@biodas.org
 > > http://biodas.org/mailman/listinfo/das
 > > 
 > 
 > -----------------------------------------------------------------
 > Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
 > <birney@ebi.ac.uk>. 
 > -----------------------------------------------------------------

-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org			                  Cold Spring Harbor, NY

NOW HIRING BIOINFORMATICS POSTDOCTORAL FELLOWS AND PROGRAMMERS. 
PLEASE WRITE FOR DETAILS.
========================================================================