[MOBY-l] discussion of GO abbs's and Re: [MOBY] Constructing MOBY objects

Mark Wilkinson markw at illuminae.com
Tue Jul 8 15:42:34 UTC 2003


On Thu, 2003-07-03 at 05:37, Beatrice Schildknecht wrote:

> As each object has a unique stock number and may not have an EMBL Acc, I 
> would like to register the namespace, Stock No. (or something 
> similar...).

I believe you are going ahead with this already, and have already
registered this namespace with GO... right?


>  At NASC, codes always start with N followed by the number. 

This is something that has come up as an issue in my mind over the past
few days.  We need to quickly set a convention for ID's or things will
become chaotic.  We have decided to use the GO abbreviations as our
namespaces, but the more I look at that document the more concern I
have.  For example, in the Gene Ontology abbreviations document they
discuss "namespaces" (in the MOBY meaning of the word), and imply that
these namespaces are prefixes for an ID number.  So an NCBI taxon id is
written:

	taxon:123

**BUT**, from my interpretation of the document, it isn't consistent
from one identifier to the next (GO people, please comment on this if I
am misunderstanding).  For example, an E.coli genetic stock center gene
name, abbreviated "ECOGENE_G" is designated as "ECOGENE_G:deoC", but a
Compugen GO Gene Accession, abbreviated "CGEN" does not use the prefix,
and is written "PrID131022" (moreover, in the GO database itself, even
the PrID part of the identifier is apparently stripped off, and you get
just the integer portion of the id).  Am I misreading the GO_xref_abbs
document, or is it accidentally inconsistent, or is it purposely
inconsistent?  Midori?

So anyway... what should we do in MOBY?  

?	<Object namespace='taxon' id='taxon:123'> 
?	<Object namespace='taxon' id='123'> 

...or do we make it flexible, where the client/server must check for
themselves if the id portion is prefixed?  Up to now, we have not used
prefixes because they would be redundant, but it might make us more
compatible with other systems if we do.  Comments anyone?  I quite
favour the removal of the prefix, but it makes no functional difference,
so... I'm easy :-)
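
To make the "flexible" option concrete, here is a minimal Python sketch
of what a client or server would have to do if prefixes were optional.
It is only an illustration, not MOBY code: the helper name normalize_id
is hypothetical, and it assumes that a prefix, when present, is exactly
the namespace followed by a colon.

```python
def normalize_id(namespace: str, raw_id: str) -> str:
    """Return raw_id without a redundant 'namespace:' prefix, if present.

    Assumption: the prefix is the namespace string plus ':' and the match
    is case-sensitive. Ids without the prefix are returned unchanged.
    """
    prefix = namespace + ":"
    if raw_id.startswith(prefix):
        return raw_id[len(prefix):]
    return raw_id


# Both spellings from the taxon example reduce to the same bare id.
print(normalize_id("taxon", "taxon:123"))
print(normalize_id("taxon", "123"))
print(normalize_id("ECOGENE_G", "ECOGENE_G:deoC"))
```

Storing the bare id and adding the prefix only when talking to an
outside system would keep this check in one place.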


> This is open for discussion, (whether a separate NASC and ABRC code 
> namespace, or integrate them somehow?).

Since separate namespaces are the reality (regardless of the underlying
data being identical), we should make separate namespaces.  A similar
phenomenon happens between GenBank and EMBL - identical records with
different identifiers in each namespace.  But since a service may only
recognize one or the other namespace, we must keep them separate and
"join" them as cross-references.
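
As a rough sketch of that "join", the Python below builds one Object in
one namespace and hangs the other namespace's identifier off it as a
cross-reference.  The namespace names (NASC_code, ABRC_code) and both id
values are made up for illustration; nothing here is a registered
namespace or a real stock number, and object_with_xrefs is a
hypothetical helper.

```python
import xml.etree.ElementTree as ET


def object_with_xrefs(namespace, ident, xrefs):
    """Build an <Object> element with an optional <CrossReference> block.

    xrefs is a list of (namespace, id) pairs naming the same record in
    other namespaces.
    """
    obj = ET.Element("Object", {"namespace": namespace, "id": ident})
    if xrefs:
        block = ET.SubElement(obj, "CrossReference")
        for xref_namespace, xref_id in xrefs:
            ET.SubElement(block, "Object",
                          {"namespace": xref_namespace, "id": xref_id})
    return obj


# Hypothetical namespaces and placeholder ids, for illustration only.
stock = object_with_xrefs("NASC_code", "N0000", [("ABRC_code", "X0000")])
print(ET.tostring(stock, encoding="unicode"))
```

A service that only understands one of the two namespaces can then pick
out the identifier it recognizes and ignore the other.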

> Mutants have a:
> 
> --  Stock No.-- (Each stock has a unique code)
> 
> --  EMBL Acc -- (does not exist if donor has not submitted sequence or 
> stock is an ecotype for example)
> --  Locus /AGI-code--
> --  Phenotype -- (Information on plant's phenotype
> -- Allele symbol --
> -- Donor last name -- (The name of the person who donated the line)
> -- Donor number -- (Unique number given by donor for each stock donated)
> -- Background -- (Name of line used to generate mutant)

Okay, here's my first stab at interpreting this object, keeping in mind
that we are trying to generate objects that are *excruciatingly*
lightweight and generic:

<PhenotypeDescription namespace='StockNumber' id="blahblah">
	<CrossReference>
		<Object namespace="EMBL" id="7234676"/>
		<Object namespace="AGI_Locus" id="AP3"/>
		<Object namespace="AGI_Author" id="Wilkinson, MD"/>
		<Object namespace="AGI_Donor" id="98765"/>
		<Object namespace="AGI_Ecotype" id="Ler"/>
	</CrossReference>
	<String namespace='' id='' articleName='Phenotype'>
		phenotypic description fits in here
	</String>
</PhenotypeDescription>


Since most of your information is simply a reference to another ID
number, they can all fit into the CrossReference block.  The only part
of the object that actually carries any structured information is the
phenotype (which, as it turns out is just a string), so that is the only
part of the object that needs to be explicitly defined.  Moreover, this
allows us to re-use the object for **every** phenotypic description,
since cross-references are *not* defined as part of the object
definition; they are optional and arbitrary.

So, your object definition is:

PhenotypeDescription:
	ISA Object
	HASA String articleName=Phenotype

...and that's all!

This is just a suggestion off the top of my head - hopefully this will
initiate some discussion of object construction.  Do you think this is
sufficiently descriptive for your needs?  The beauty of this object is
that it can (without decomposition) trigger services that act on
StockNumbers, EMBL Accessions, AGI_Loci, Author names, donor ID's and
Ecotypes, since these are all in the cross-reference block.
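
To illustrate, the Python sketch below reads the example object and
lists every (namespace, id) pair a client could use to look up
services: the object's own identifier plus everything in the
cross-reference block.  It uses only the standard library XML parser;
the function name identifiers is hypothetical, and this is not how any
MOBY client is actually implemented.

```python
import xml.etree.ElementTree as ET

# The example object from this message, with consistent quoting.
DOC = """
<PhenotypeDescription namespace="StockNumber" id="blahblah">
  <CrossReference>
    <Object namespace="EMBL" id="7234676"/>
    <Object namespace="AGI_Locus" id="AP3"/>
    <Object namespace="AGI_Author" id="Wilkinson, MD"/>
    <Object namespace="AGI_Donor" id="98765"/>
    <Object namespace="AGI_Ecotype" id="Ler"/>
  </CrossReference>
  <String namespace="" id="" articleName="Phenotype">
    phenotypic description fits in here
  </String>
</PhenotypeDescription>
"""


def identifiers(xml_text):
    """Return (namespace, id) pairs: the object itself, then its cross-references."""
    root = ET.fromstring(xml_text)
    pairs = [(root.get("namespace"), root.get("id"))]
    for ref in root.findall("./CrossReference/Object"):
        pairs.append((ref.get("namespace"), ref.get("id")))
    return pairs


for namespace, ident in identifiers(DOC):
    print(namespace, ident)
```

Note that the String element is never touched: the identifiers are all
available without looking inside the object's own content.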


> In addition each ecotype has a:
> -- Habitat --
> --  Location --
> -- Altitude --
> -- Daily temp --
> -- Long/latitude --

This one gets a bit more difficult.  I think most of those fields are
'real' data, rather than pointers.  Perhaps we should toss this out to
the various plant genome db's to see if there are additional fields that
should be included in an 'ecotype' object?

Mark

-- 
Mark Wilkinson <markw at illuminae.com>
Illuminae



