[Open-bio-l] LSIDs

Fri Mar 28 09:43:03 EST 2003

Hi Matthew,
On Friday, March 28, 2003, at 05:47 AM, Matthew Pocock wrote:

> Hi,
>
> I've got mailing list fatigue. Which one should I be
> posting to about LSIDs for file formats?
>

The list with the most traffic about LSIDs at the moment is the
i3c-techarch committee mailing list.  I am forwarding your mail to
them in my response to you.

An excellent commentary on current thinking about implementing
LSIDs can be found in a document from Joshua Phillips of the
National Cancer Institute:

ftp://ftp1.nci.nih.gov/pub/cacore/caBIO/lsid/lsid_memo.doc

Currently there is an open issue about using a DNS name as part of the
LSID, arising from the American HIPAA regulation on patient privacy.
See:

http://answers.hhs.gov/cgi-bin/hhs.cfg/php/enduser/std_alp.php

My understanding is that a medical record used in research must have
protected health information (PHI) removed, and that removal includes
any DNS entry or IP address.  An LSID containing a DNS name or IP
address could therefore be stripped from the record under this rule.
Research done for eventual release to the FDA requires that the complete
PHI be available to the FDA, so the LSID would be kept in that case.
For research that is not destined for the FDA but uses American medical
records, either the LSID may be removed as part of the standard removal
of PHI, or a statistician must justify, to each Institutional Review
Board (IRB) whose medical records use the LSID-indexed information,
that the LSID could not be used to identify a patient.  If the DNS entry
is removed, LSID resolvers cannot use standard DNS directly, which
complicates resolution and might require the current spec to be
modified:

http://www.i3c.org/workgroups/technical_architecture/resources/lsid/docs/LSIDSyntax9-20-02.htm
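
To make the concern above concrete, here is a minimal sketch that splits
an LSID-style URN into its colon-separated fields and checks whether the
authority field looks like a DNS name or an IPv4 address.  The function
names and the regular expressions are assumptions made for illustration;
they are not taken from the I3C spec or from any HIPAA tooling, and a
real de-identification process would apply its own rules.

```python
import re

# Assumed, deliberately simple patterns; not taken from any spec.
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
_DNS = re.compile(
    r"^(?=.{1,253}$)"
    r"([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"
    r"[A-Za-z]{2,}$"
)


def split_lsid(lsid):
    """Return (authority, remaining_fields) for an LSID-style URN.

    Only the leading 'URN:LSID:' prefix and the presence of at least
    one field after the authority are checked.
    """
    parts = lsid.split(":")
    if (len(parts) < 4
            or parts[0].upper() != "URN"
            or parts[1].upper() != "LSID"):
        raise ValueError("not an LSID-style URN: %r" % lsid)
    return parts[2], parts[3:]


def authority_looks_like_host(authority):
    """True if the authority resembles a DNS name or an IPv4 address."""
    return bool(_IPV4.match(authority) or _DNS.match(authority))


if __name__ == "__main__":
    authority, rest = split_lsid("URN:LSID:open-bio.org:embl/dna")
    print(authority, rest, authority_looks_like_host(authority))
```

An identifier whose authority passes this check is the kind that might
be stripped along with other host names; an opaque authority string
would not be, but then it could not be resolved through DNS directly.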

I am sending this along for further comment.

Warmest Regards,

Jim Freeman

> Anyway, I was about to add a whole load of common
> formats to biojava and hit a snag. For your
> convenience, I've pasted in part of the spec below.
>
> I know I'm risking being a radical pedant by even
> bringing this up, but presumably these ids are meant
> to be used by more than one individual.
>
> The spec says that we should use things like:
>
> URN:LSID:open-bio.org:<format>/<alphabet>
>
> This is bad for several reasons. The first one is that
> file formats and sequence databases can become
> trivially confused. Does URN:LSID:open-bio.org:embl
> refer to the embl database, or to the embl format with
> default alphabet?
>
> Secondly, what do we do with non-sequence formats?
> For example, Unigene and Enzyme don't fit into this
> world very well.
>
> Thirdly, (and bless them for doing this) there are
> some ambiguities about format names unless scoped
> properly. An example is the Enzyme db's enzyme.dat
> file which is similar to embl in structure, and the
> ligand enzyme file which is shaped like genbank. They
> both tell you things about ec numbers, but are
> definitely not the same format.
>
> I propose that we carve up the fourth field more
> sanely. We can firstly prefix the format name with the
> constant string "format/", leaving room in the future
> for namespaces like "database" or "application".
> Secondly, the format name should (optionally) be
> compound. Thirdly, variables (like alphabet) should be
> encoded using an agreed upon URL query scheme.
>
> URN:LSID:open-bio.org:format/enzyme
>
> URN:LSID:open-bio.org:format/ligand/enzyme
> URN:LSID:open-bio.org:format/ligand/compound
> URN:LSID:open-bio.org:format/ligand/ligand
>
> URN:LSID:open-bio.org:format/embl?alphabet=DNA
> URN:LSID:open-bio.org:format/genbank
> URN:LSID:open-bio.org:format/genbank?alphabet=PROTEIN
>
> This leaves us room for URNs for other things that
> perhaps people haven't named yet:
>
> URN:LSID:open-bio.org:database/embl
> URN:LSID:open-bio.org:database/swissprot
> URN:LSID:open-bio.org:database/ligand
>
> URN:LSID:open-bio.org:application/blast/n:2.2.5
>
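
The layout proposed above for the fourth field (a kind prefix such as
"format/", an optionally compound name, and variables in a URL-style
query) could be parsed roughly as follows.  This is only a sketch of one
reading of the proposal: the function name, the returned keys, and the
treatment of any trailing colon-separated field (such as "2.2.5" in the
blast example) as "extra" are assumptions, not part of the OBDA spec.

```python
from urllib.parse import parse_qsl


def parse_proposed_lsid(lsid):
    """Parse URN:LSID:<authority>:<kind>/<name>[/<name>...][?k=v...]."""
    parts = lsid.split(":")
    if len(parts) < 4 or [p.upper() for p in parts[:2]] != ["URN", "LSID"]:
        raise ValueError("not an LSID-style URN: %r" % lsid)

    # Separate the query variables from the path part of the fourth field.
    path, _, query = parts[3].partition("?")
    # The first path segment is the kind: format, database, application.
    kind, _, name = path.partition("/")
    if not name:
        raise ValueError("expected <kind>/<name> in %r" % parts[3])

    return {
        "authority": parts[2],
        "kind": kind,
        "name": name.split("/"),             # compound names become a list
        "variables": dict(parse_qsl(query)),  # e.g. {"alphabet": "DNA"}
        "extra": parts[4:],                   # assumed: version-like fields
    }


if __name__ == "__main__":
    for urn in (
        "URN:LSID:open-bio.org:format/ligand/enzyme",
        "URN:LSID:open-bio.org:format/embl?alphabet=DNA",
        "URN:LSID:open-bio.org:application/blast/n:2.2.5",
    ):
        print(parse_proposed_lsid(urn))
```

With the kind made explicit, "format/embl" and "database/embl" parse to
different values, which is the disambiguation the proposal is after.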
> Can I amend the OBDA documentation to match this, or
> is this the wrong way to go?
>
> Matthew
>
> (from
> http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/obda-specs/registry/lsid_for_dbformats.txt?rev=1.1&cvsroot=obf-common&content-type=text/vnd.viewcvs-markup)
>
> All flat file formats are identified using this
> format:
>
> URN:LSID:open-bio.org:<format>/<alphabet>
>
> where <format> is one of:
>       embl
>       genbank
>       fasta
>       swiss
>       pdb
>
> and <alphabet> is one of:
>     dna
>     rna
>     protein
>
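
The quoted spec fragment amounts to two small enumerations, so building
and checking identifiers in the existing <format>/<alphabet> form can be
sketched as below.  The function names are hypothetical, and the
case-insensitive comparison is an assumption; the fragment quoted above
does not say how case should be handled.

```python
PREFIX = "URN:LSID:open-bio.org:"
FORMATS = {"embl", "genbank", "fasta", "swiss", "pdb"}
ALPHABETS = {"dna", "rna", "protein"}


def flatfile_lsid(fmt, alphabet):
    """Build URN:LSID:open-bio.org:<format>/<alphabet> from listed values."""
    fmt, alphabet = fmt.lower(), alphabet.lower()
    if fmt not in FORMATS:
        raise ValueError("unknown format: %r" % fmt)
    if alphabet not in ALPHABETS:
        raise ValueError("unknown alphabet: %r" % alphabet)
    return "%s%s/%s" % (PREFIX, fmt, alphabet)


def is_flatfile_lsid(lsid):
    """True only for identifiers of the exact <format>/<alphabet> form."""
    if not lsid.lower().startswith(PREFIX.lower()):
        return False
    fmt, sep, alphabet = lsid[len(PREFIX):].lower().partition("/")
    return sep == "/" and fmt in FORMATS and alphabet in ALPHABETS


if __name__ == "__main__":
    print(flatfile_lsid("embl", "dna"))
    print(is_flatfile_lsid("URN:LSID:open-bio.org:embl"))
```

Under this reading, a bare URN:LSID:open-bio.org:embl is not a valid
flat file identifier at all, and formats outside the list (such as the
Enzyme and ligand files mentioned earlier) cannot be named.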
>