[MOBY-l] Hmmm... ID objects or SpecificID objects??

Alan Robinson alan at ebi.ac.uk
Fri May 31 09:51:04 UTC 2002


Btw, here's my $0.02 on the identifiers issue... Unfortunately, it
may not help answer Mark's question.

Is this a "metadata" issue? E.g. a service method takes an identifier and
a numeric value, but the "metadata" for the method specifies that these
input parameters must be ["NCBI_gi" or "EMBL"] and [0 < int < 10]. This
"metadata" is held on MOBY-Central using, for example, RDFS.

Alternatively, does MOBY-Central simply hold the fact that a service
uses an identifier object (or its preferred identifier), and it's then up
to the client & service to negotiate whether they have the right type of
identifier? I.e. the metadata is on the MOBY-Server.
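
To make the first option concrete, here is a minimal sketch of checking a
call against centrally held constraints before it is dispatched. The
MethodConstraints class and check_inputs function are hypothetical names
made up for illustration; they are not part of any MOBY-Central API, and a
real registry might well express the same constraints in RDFS instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodConstraints:
    """Hypothetical record of the "metadata" a registry might hold for one method."""
    allowed_namespaces: frozenset
    min_exclusive: int
    max_exclusive: int


def check_inputs(constraints, namespace, value):
    """Return a list of violations; an empty list means the inputs are acceptable."""
    problems = []
    if namespace not in constraints.allowed_namespaces:
        problems.append(f"namespace {namespace!r} is not accepted")
    if not (constraints.min_exclusive < value < constraints.max_exclusive):
        problems.append(
            f"value {value} is outside "
            f"({constraints.min_exclusive}, {constraints.max_exclusive})"
        )
    return problems


# The example from the text: ["NCBI_gi" or "EMBL"] and [0 < int < 10].
EXAMPLE = MethodConstraints(frozenset({"NCBI_gi", "EMBL"}), 0, 10)
```

Under the second option the same check would simply live in the service
itself, and the registry would know only that an identifier object is
expected.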

If MOBY needs a vocabulary for database names & identifiers, I suggest
that it adopts the same scheme as GO:

 http://www.geneontology.org/doc/GO.xrf_abbs

	abbreviation: NCBI_NM
	database: NCBI RefSeq.
	object: mRNA identifier.
	example: NCBI_NM:123456

	abbreviation: NCBI_gi
	database: NCBI databases.
	object: Identifier.
	example: NCBI_gi:10727410

	abbreviation: EMBL
	database: EMBL-EBI International Nucleotide Sequence Data
			Library/DDBJ/GenBank.
	object: Sequence accession number.
	example: EMBL:AA816246
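
As a rough illustration of how a client might consume identifiers written
in this 'abbreviation:identifier' form, the sketch below splits a string
at the first colon and checks the abbreviation against a small table. The
table holds only the three abbreviations quoted above, and parse_xref is a
hypothetical helper, not something defined by GO or MOBY.

```python
# Only the three abbreviations quoted above; a real table would be
# loaded from the full GO.xrf_abbs file.
KNOWN_ABBREVIATIONS = {
    "NCBI_NM": "NCBI RefSeq mRNA identifier",
    "NCBI_gi": "NCBI databases identifier",
    "EMBL": "EMBL/DDBJ/GenBank sequence accession number",
}


def parse_xref(text):
    """Split 'abbreviation:identifier' into its two parts, validating the abbreviation."""
    abbreviation, separator, local_id = text.partition(":")
    if not separator or not abbreviation or not local_id:
        raise ValueError(f"expected 'abbreviation:identifier', got {text!r}")
    if abbreviation not in KNOWN_ABBREVIATIONS:
        raise ValueError(f"unknown database abbreviation {abbreviation!r}")
    return abbreviation, local_id
```

For example, parse_xref("EMBL:AA816246") yields the pair
("EMBL", "AA816246"), while a bare "AA816246" is rejected because it
carries no database abbreviation.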

All database entries should have a primary key. They may also have other
keys (cf. the ID, SV & AC lines in EMBL - I'll get back to this).

Any database handed an identifier should assume that it's for the primary
key. I'll see if I get shot down on this, but having two (or more) sets of
identifiers that use the same names for different entries sounds like a
design flaw (though it can happen... especially with numeric values).

I believe there is only one case in the 15+ million EMBL records where the
ID of one entry is the same as the AC of another entry. Even so, and for
good reason, we assign AC the highest priority as an identifier.


You may ask, should I use the ID, AC or SV field in an EMBL record?  
Or the locus, accession, version or gi fields of GenBank?

For the record, the first SV/VERSION [syntax 'accession.version'] is *THE*
primary identifier that should be used for referencing DDBJ/EMBL/GenBank.
The accession is unique to an entry and stable across all database
versions (unlike ID/locus); the version increments by one each time the
sequence entry is changed. (If others are present, they are secondary
identifiers).
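
A small sketch of handling the 'accession.version' syntax: it splits the
identifier at the last dot so that the stable accession and the
incrementing version can be compared separately. The function name is
hypothetical, and the version number in the usage note is invented for
illustration.

```python
def split_sequence_version(identifier):
    """Split 'accession.version' into (accession, version).

    A bare accession with no dot is returned with version None.
    """
    accession, separator, version = identifier.rpartition(".")
    if not separator:
        return identifier, None
    if not accession or not version.isdecimal():
        raise ValueError(f"expected 'accession.version', got {identifier!r}")
    return accession, int(version)
```

For instance, split_sequence_version("AA816246.1") gives ("AA816246", 1).
Two identifiers refer to the same entry when their accessions match, and
to the same sequence only when their versions match as well.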

ID/locus is not guaranteed to be stable over database versions - so beware!
It's there primarily because people want human-understandable identifiers
that are loaded with semantics (if semantics change => name changes!)

IMO, 'gi' numbers are evil: unlike locus/ID, accessions and versions,
they are not part of the nucleotide collaborative data exchange agreement
between NCBI, EBI and DDBJ.


--
============================================================
Alan J. Robinson, D.Phil.             Tel:+44-(0)1223 494444
European Bioinformatics Institute     Fax:+44-(0)1223 494468
EMBL Outstation - Hinxton             Email:  alan at ebi.ac.uk
Wellcome Trust Genome Campus
Hinxton, Cambridge
CB10 1SD, UK                http://industry.ebi.ac.uk/~alan/
============================================================

On Thu, 30 May 2002, Mark Wilkinson wrote:

> Many services (eg. sequence retrieval services) will simply take an ID
> number as input.  The problem is that ID numbers may be of many types...
> GenbankGI, GenbankAcc, EMBLID, TIGR_Gene_ID, and so on and so on and so
> on.  In principle, we could define an object like this:
> 
>         <ID  namespace="GenbankGI" id="1223647"/>
> 
> 
> But since services register only the type of *object* that they deal
> with, not the namespace that they accept, most services that claim to
> accept ID numbers will not necessarily handle *all* types of ID's.
> 
> The alternative is to have separate objects for each type of ID:
> 
>         <GenbankGI  namespace="GenbankGI"  id="1223647"/>
> 
> But this seems like a nightmare scenario...or?
> 
> Hmmmmm....  do we change the registry, or do we make more objects?... or
> do we just let the server decide if it is competent to handle that type
> of ID number?
> 
> Hmmmmmmm.....
> 
> 
> Mark
> 
> 
> 
> --
> --------------------------------
> "Speed is subsittute fo accurancy."
> ________________________________
> 
> Dr. Mark Wilkinson
> Bioinformatics Group
> National Research Council of Canada
> Plant Biotechnology Institute
> 110 Gymnasium Place
> Saskatoon, SK
> Canada
> 
> 
> 
> _______________________________________________
> moby-l mailing list
> moby-l at biomoby.org
> http://biomoby.org/mailman/listinfo/moby-l
> 




