[Open-bio-l] LSIDs

Fri Mar 28 10:47:05 EST 2003

Hi,

I've got mailing list fatigue. Which one should I be
posting to about LSIDs for file formats?

Anyway, I was about to add a whole load of common
formats to biojava and hit a snag. For your
convenience, I've pasted in part of the spec below.

I know I'm risking being a radical pedant by even
bringing this up, but presumably these ids are meant
to be used by more than one individual.

The spec says that we should use things like:

URN:LSID:open-bio.org:<format>/<alphabet>

This is bad for several reasons. The first one is that
file formats and sequence databases can become
trivially confused. Does URN:LSID:open-bio.org:embl
refer to the embl database, or to the embl format with
default alphabet?

Secondly, what do we do with non-sequence formats?
For example, Unigene and Enzyme don't fit into this
world very well.

Thirdly, (and bless them for doing this) there are
some ambiguities about format names unless scoped
properly. An example is the Enzyme db's enzyme.dat
file, which is similar to embl in structure, and the
ligand enzyme file, which is shaped like genbank. They
both tell you things about EC numbers, but are
definitely not the same format.

I propose that we carve up the fourth field more
sanely. We can firstly prefix the format name with the
constant string "format/", leaving room in the future
for namespaces like "database" or "application".
Secondly, the format name should (optionally) be
compound. Thirdly, variables (like alphabet) should be
encoded using an agreed upon URL query scheme.

URN:LSID:open-bio.org:format/enzyme

URN:LSID:open-bio.org:format/ligand/enzyme
URN:LSID:open-bio.org:format/ligand/compound
URN:LSID:open-bio.org:format/ligand/ligand

URN:LSID:open-bio.org:format/embl?alphabet=DNA
URN:LSID:open-bio.org:format/genbank
URN:LSID:open-bio.org:format/genbank?alphabet=PROTEIN
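
To make the proposed shape concrete, here is a minimal
Python sketch of how such an identifier could be taken
apart. The function name parse_obda_id and the returned
dictionary layout are hypothetical, and reading the part
after "?" as an ordinary URL query string is an assumption,
since the query scheme has not been agreed yet.

```python
from urllib.parse import parse_qs


def parse_obda_id(urn):
    """Split a proposed-style identifier into its parts.

    Hypothetical helper: assumes the four-field shape
    URN:LSID:<authority>:<namespace>/<name>[/<name>...][?key=value]
    """
    body, _, query = urn.partition("?")
    fields = body.split(":")
    if (len(fields) != 4
            or fields[0].upper() != "URN"
            or fields[1].upper() != "LSID"):
        raise ValueError(f"not a four-field URN:LSID identifier: {urn!r}")

    # The first path segment is the namespace ("format", "database", ...);
    # whatever follows is the (possibly compound) name.
    namespace, _, name = fields[3].partition("/")
    if not name:
        raise ValueError(f"missing namespace prefix in {urn!r}")

    return {
        "authority": fields[2],
        "namespace": namespace,
        "name": name.split("/"),
        # Assumption: variables are plain key=value query pairs.
        "params": {k: v[0] for k, v in parse_qs(query).items()},
    }
```

With this reading, an unprefixed identifier such as
URN:LSID:open-bio.org:embl is rejected rather than guessed
at, which is the ambiguity the prefix is meant to remove.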

This leaves us room for URNs for other things that
perhaps people haven't named yet:

URN:LSID:open-bio.org:database/embl
URN:LSID:open-bio.org:database/swissprot
URN:LSID:open-bio.org:database/ligand

URN:LSID:open-bio.org:application/blast/n:2.2.5
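
A rough Python sketch of how a consumer might tell these
namespaces apart. The name split_namespace is hypothetical,
and treating the part after the last colon ("2.2.5" above)
as an optional version field is an assumption about what the
example is meant to convey.

```python
def split_namespace(urn):
    """Return (namespace, name, version) for a proposed-style identifier.

    Hypothetical helper. Assumption: an optional fifth
    colon-separated field carries a version.
    """
    fields = urn.split(":")
    if (len(fields) not in (4, 5)
            or fields[0].upper() != "URN"
            or fields[1].upper() != "LSID"):
        raise ValueError(f"unexpected identifier shape: {urn!r}")

    version = fields[4] if len(fields) == 5 else None
    # First path segment is the namespace, the rest is the name.
    namespace, _, name = fields[3].partition("/")
    return namespace, name, version
```

Under these assumptions the database and format identifiers
for embl differ in the first element of the result, so the
two can no longer be confused.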

Can I amend the OBDA documentation to match this, or
is this the wrong way to go?

Matthew

(from
http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/obda-specs/registry/lsid_for_dbformats.txt?rev=1.1&cvsroot=obf-common&content-type=text/vnd.viewcvs-markup)

All flat file formats are identified using this
format:

URN:LSID:open-bio.org:<format>/<alphabet>

where <format> is one of:
      embl
      genbank
      fasta
      swiss
      pdb

and <alphabet> is one of:
    dna
    rna
    protein
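
For comparison, the rule quoted above can be sketched in
Python as follows. The function name spec_lsid is
hypothetical; the allowed values are copied from the
excerpt.

```python
# Allowed values as listed in the quoted spec excerpt.
FORMATS = ("embl", "genbank", "fasta", "swiss", "pdb")
ALPHABETS = ("dna", "rna", "protein")


def spec_lsid(fmt, alphabet):
    """Build an identifier of the shape the quoted spec describes.

    Hypothetical helper; rejects anything outside the two lists.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    if alphabet not in ALPHABETS:
        raise ValueError(f"unknown alphabet: {alphabet!r}")
    return f"URN:LSID:open-bio.org:{fmt}/{alphabet}"
```

Anything without a sequence alphabet, such as an enzyme
file, has no valid spelling under this rule.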
