[MOBY-dev] Discussion of the recent integration of LSID's into MOBY

Thu May 15 13:15:09 UTC 2003

Hi MOBY-dev'ers, 

Just a quick note - there are a few of you who have recently joined this
-dev list.  It has been traditional that newcomers introduce themselves
to the rest of the list, as the -dev list, although open, is often used
to invite people to teleconferences and such, and is primarily directed
to hard-core developers... so we like to know who the audience is when
we post these telephone numbers :-)

Okay, on to LSID and what was decided at the hackathon last week:

The background to this story is that there appeared to be a great deal
of similarity between the MOBY Duple of [namespace, id] and the LSID,
which also has namespace and id components.  In fact, the similarity
seemed to go further, in that the "namespace" component of the MOBY
Duple was hierarchically structured, with an "authority" component, and
a "namespace" component; thus we appeared to have all of the meaningful
components of an LSID, just in a different format.  Format being
completely irrelevant to us, we attempted to drop our Duple in favour of
an LSID.
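
To make the comparison concrete, here is a rough sketch of the two
formats side by side.  It assumes the usual LSID URN layout of
urn:lsid:<authority>:<namespace>:<object>[:<revision>], and it assumes
the MOBY namespace is written as "authority/namespace" (as in
Genbank/gi).  The names parse_lsid and duple_to_lsid are made up for
illustration and are not part of any MOBY or LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its components (illustrative parser only)."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def duple_to_lsid(namespace: str, identifier: str) -> str:
    """Rewrite a MOBY Duple as an LSID string.

    Assumption: the MOBY namespace has the form "authority/namespace".
    """
    authority, sep, local_namespace = namespace.partition("/")
    if not sep or not authority or not local_namespace:
        raise ValueError(f"expected 'authority/namespace', got {namespace!r}")
    return f"urn:lsid:{authority}:{local_namespace}:{identifier}"


if __name__ == "__main__":
    lsid_text = duple_to_lsid("Genbank/gi", "12345")
    print(lsid_text)
    print(parse_lsid(lsid_text))
```

The point is only that the same three pieces of information appear in
both notations; nothing here says what the namespace *means*.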

On closer examination, it became apparent that we (MOBY) were using
namespace to mean something more than is meant by the namespace
component of the LSID - that in MOBY it was supposed to restrict the
identifier to a certain data *type* (e.g. a Genbank record, or an EMBL
record), while the namespace component of the LSID is completely
opaque... as is the LSID itself.

On closer examination still, it became apparent that we (MOBY) were
misguided in our use of namespace in this way, and that it may well have
broken down in the future.  Namespace, as we were using it, does not
actually represent a data type, though in practice this is usually the
case.  Although data providers generally publicize their data such that
similar types of data have similar identifiers, they are not *forced* to
do this, and in fact Genbank could start using 'gi' numbers to refer to
microarray data, taxon IDs, or anything else.  We had based our concept
of namespace on a typical behaviour of service providers, rather than a
concrete definition, and this was a perfect example of a Ewan Birney
"Bad Thing".

The concrete example that I noticed just before the hackathon was that
Genbank records could represent either nucleotide sequences or amino
acid sequences... yet they lie in the same MOBY namespace of Genbank/gi.

Anyway <<clip a day's worth of headaches and discussion out here>> we
decided that the correct way to solve this problem, and integrate LSIDs
into MOBY, was to make our Duple a pair of LSIDs - an LSID representing
a data type, and an LSID representing the actual data entity.
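
As a minimal sketch of what that pair looks like, the snippet below
models the new Duple as (type LSID, entity LSID) and shows two records
that share the old Genbank/gi namespace yet can now be told apart by
type.  All of the LSID strings and the helper of_type are invented for
the example; no real identifiers have been assigned.

```python
from typing import List, NamedTuple


class Duple(NamedTuple):
    type_lsid: str    # LSID naming the data type
    entity_lsid: str  # LSID naming the actual data entity


# Hypothetical data-type LSIDs, for illustration only.
GENBANK_NUC = "urn:lsid:biomoby.org:datatype:genbank_nucleotide"
GENBANK_PROT = "urn:lsid:biomoby.org:datatype:genbank_protein"

# Two hypothetical records that would both have been "Genbank/gi" before.
RECORDS = [
    Duple(GENBANK_NUC, "urn:lsid:ncbi.example.org:gi:1001"),
    Duple(GENBANK_PROT, "urn:lsid:ncbi.example.org:gi:1002"),
]


def of_type(records: List[Duple], type_lsid: str) -> List[Duple]:
    """Keep only the records whose type LSID matches exactly."""
    return [record for record in records if record.type_lsid == type_lsid]


if __name__ == "__main__":
    for record in of_type(RECORDS, GENBANK_PROT):
        print(record.entity_lsid)
```

The entity LSID stays opaque; all of the typing information lives in the
first member of the pair.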

For the foreseeable future, BioMOBY will host the ontology of data type
LSIDs, though this does not restrict people from using other data
typing systems/ontologies in MOBY if they wish.  Simply use a different
LSID!  Similarly, service providers register the namespace that they can
deal with by registering the LSID of the ontology-node that most
generally (not most specifically) describes the datatypes that they can
handle.  For example, a service provider may know something about any
type of Genbank record, so they would register the LSID representing
Genbank sequence records.  Alternatively, they may only know things
about protein sequences, so they would register the LSID representing
Genbank protein sequence records.
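
One way the registry lookup could work is sketched below: a provider
registers a single ontology node, and a query for some data type matches
every provider whose registered node is that type or one of its
ancestors.  This is only an illustration of the idea, not how MOBY
Central is implemented; the LSIDs, the PARENT table, the service names,
and the function names are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical data-type LSIDs, for illustration only.
GENBANK_SEQ = "urn:lsid:biomoby.org:datatype:genbank_sequence"
GENBANK_NUC = "urn:lsid:biomoby.org:datatype:genbank_nucleotide"
GENBANK_PROT = "urn:lsid:biomoby.org:datatype:genbank_protein"

# Toy ontology: each node maps to its more general parent (None at the root).
PARENT: Dict[str, Optional[str]] = {
    GENBANK_SEQ: None,
    GENBANK_NUC: GENBANK_SEQ,
    GENBANK_PROT: GENBANK_SEQ,
}


def ancestors_and_self(node: str, parent: Dict[str, Optional[str]]) -> Iterator[str]:
    """Yield the node itself, then each more general node up to the root."""
    current: Optional[str] = node
    while current is not None:
        yield current
        current = parent.get(current)


def services_for(data_type: str, registrations: Dict[str, str]) -> List[str]:
    """Return the services registered at data_type or at any ancestor of it."""
    lineage = set(ancestors_and_self(data_type, PARENT))
    return sorted(name for name, node in registrations.items() if node in lineage)


if __name__ == "__main__":
    registrations = {
        "any_genbank_service": GENBANK_SEQ,    # registered at the general node
        "protein_only_service": GENBANK_PROT,  # registered at a specific node
    }
    print(services_for(GENBANK_NUC, registrations))
    print(services_for(GENBANK_PROT, registrations))
```

Registering at the most general node is what lets the first service be
found for both nucleotide and protein records without listing each
subtype separately.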

We are also going to use an LSID to represent the MOBY Data Class, thus
the Triple [Class, Namespace, ID] will now be [Class-LSID,
Namespace-LSID, ID-LSID].

This actually gives us an *enormous* amount of flexibility!  For example,
MOBY Central will resolve and be able to decompose MOBY-compliant objects
based on the ontology; however, service providers may register
themselves as working with data Classes from non-MOBY ontologies (e.g. caBIO)...
and that is now just fine.  There is no longer a need to wrap the
foreign data object in a MOBY Triple (though there may still be
advantages to doing so...) nor are we dictatorial in any way as to what
a service provider has to do in order to have their service discovered. 
We could even have translation services, consuming objects from one
ontology and spitting out object types from another ontology (e.g. MOBY
Sequence to caBIO Sequence).
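
A translation service of that kind might be keyed on a (source Class
LSID, target Class LSID) pair, roughly as sketched here.  The Triple
layout follows the [Class-LSID, Namespace-LSID, ID-LSID] description
above, but every LSID string, the TRANSLATORS table, and the function
names are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Triple:
    class_lsid: str
    namespace_lsid: str
    id_lsid: str


# Hypothetical Class LSIDs, for illustration only.
MOBY_SEQUENCE = "urn:lsid:biomoby.org:objectclass:Sequence"
CABIO_SEQUENCE = "urn:lsid:cabio.example.org:objectclass:Sequence"

Translator = Callable[[Triple], Triple]
TRANSLATORS: Dict[Tuple[str, str], Translator] = {}


def register_translator(source: str, target: str) -> Callable[[Translator], Translator]:
    """Decorator that files a translator under its (source, target) classes."""
    def decorator(func: Translator) -> Translator:
        TRANSLATORS[(source, target)] = func
        return func
    return decorator


@register_translator(MOBY_SEQUENCE, CABIO_SEQUENCE)
def moby_to_cabio(obj: Triple) -> Triple:
    # A real service would also convert the payload; here only the class changes.
    return Triple(CABIO_SEQUENCE, obj.namespace_lsid, obj.id_lsid)


def translate(obj: Triple, target: str) -> Triple:
    """Look up and apply the translator for obj's class and the target class."""
    key = (obj.class_lsid, target)
    if key not in TRANSLATORS:
        raise LookupError(f"no translator registered for {key}")
    return TRANSLATORS[key](obj)


if __name__ == "__main__":
    original = Triple(
        MOBY_SEQUENCE,
        "urn:lsid:biomoby.org:datatype:genbank_protein",
        "urn:lsid:ncbi.example.org:gi:1002",
    )
    print(translate(original, CABIO_SEQUENCE))
```

Note that the Namespace-LSID and ID-LSID pass through untouched, so the
translated object still points at the same underlying entity.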

I think this was a big win!  

...of course, this is easy to say, since we haven't actually tried it in
practice :-)

I promise to spend every spare moment in the next week or two finishing
up the coding of the new MOBY Central so that we can start coding
against it.  It *was* finished, until we made the last-minute change of
describing the object/service relationship ontology as an LSID also...
and that broke everything again :-)

Anyway, I'll post a general announcement when it is ready to test.  I
think it will be a good idea to set up two MOBY Centrals this time - one
for people to test against, and the other as a "production" registry.

Comments welcome!

Mark

-- 
=======================================================================
                                    |--==\
Mark Wilkinson, Ph.D.                \==-|       
Bioinformatics Consultant             \=/        0010010010100101110010
Illuminae Media                       /-\        
727 6th Ave. N.                      /-==|       0010100100111101010010
Saskatoon, SK, Canada               |==-/        
S7K 2S8                              \=/         0100100100010010010101
+1 (306) 373 3841                     /\         
markw at illuminae.com                  /=-\        1101001010100101010101
                                    |--==\
=======================================================================