[MOBY-l] ideas from the CRIB

Thu Feb 27 19:25:54 UTC 2003

Dear friends,
I have been regretfully silent on this list, not through lack of 
interest, but through massive personal overload by my teaching duties 
and all the other things I tend to consider as my business. But since 
Mark is now trying to remind me to do things in posts to lists where I 
am not even subscribed, maybe it's time to resurface.

For those who don't know me:
Heiko Schoof, assistant prof. at the Technische Universitaet Munich, 
tasted the freshness of Emma Lake waters at the original MOBY-DIC, thus 
MOBY-aware from the first hour. Teaching: Bioinformatics. Research: 
whole-genome correlative analyses in plants. Duty: Keep MAtDB (MIPS 
Arabidopsis thaliana DataBase, mips.gsf.de/proj/thal/db) administered, 
up, up to date, and useful while supervising the setup of rice and maize dbs, 
and an EU project to network European plant databases towards a federated 
genome db (www.eu-plant-genome.net). OK, I'm busy.

My contributions to MOBY have been rather indirect, but I've been 
spreading the idea and I guess I did my part in getting MOBY talked 
about. I've been promising to adopt MOBY for MAtDB, and I still have 
that plan, only now it looks like something is actually happening:

- For the PlaNet project, we have a position here which we found hard to 
fill, but in December Rebecca Ernst joined us so finally I have someone 
who can do the work and I can continue just making promises ;-)
- We convinced our partners within PlaNet that MOBY may be the thing to 
look at instead of our own CORBA-based implementation, which was the 
original idea back in 2001.
- Thus, we (read "Rebecca") have the mission to implement some MOBY 
services to test their usefulness for PlaNet; other PlaNet partners will follow.
- At MIPS, a group led by Volker Stuempflen is revolutionizing our 
idea of infrastructure by implementing things like model-view-controller 
or business delegate design, XML databases etc., and Volker supports the 
idea of webservices as opposed to CORBA and is willing to help us give 
MOBY/SOAP a shot. At the same time, we're using XML for internal data 
transport, and need all the schemas that we'll also need for MOBY.

So much for the general update. Now to the details Mark wanted me to 
comment on. Well, I have been mulling MOBY over in my mind, and through 
explaining it again and again to others I'm starting to get some ideas. 
I do hope I don't repeat too much of what you have already discussed, I must 
admit I have not been following moby-l (and I'm not on moby-dev, seeing 
I'm not contributing codewise, as much as I'd love to: Please rely on 
Rebecca to keep communication fluid).

Key point for me has always been bringing together distributed data. 
This boils down to knowing what is identical, equal, or related. Hence 
the id discussion, and MOBY triplets (excuse me, but to my knowledge of 
English the noun is triplet... and triple just doesn't have the vibes 
for me: i3c? MOBY to the power of 3? No, no, please...). Class, 
Namespace, Id still makes most sense to me.
One thing that has been worrying me is the CRIB, or cross-reference 
block. What exactly do we mean by a cross-reference? I've seen 
different ideas used.

One idea that was already discussed at Emma Lake was that the CRIB 
should be the jumping platform to get from one MOBY object to others. 
E.g. a NASequence:EMBL:AC000123 object could contain in its CRIB the 
triplets for literature references, alternative accessions in e.g. 
GenBank, or the protein translation. This makes it similar to database 
reference blocks in e.g. EMBL or SWISSPROT entries.
In this context I remember discussions that e.g. a gene object retrieved 
from a genome database could contain in the CRIB a lot of the annotation 
data. This would include database references, literature, sequences, GO 
terms and so on. All extremely useful to navigate on from that gene.
But at that point I really think we need to tackle the problem of 
identity, equality, or relatedness.

Identity: The triplet in the CRIB is just another descriptor for the 
same thing, that is a synonym, fully exchangeable. E.g.: 
NASequence:EMBLacc:AC000123 is, by definition of the International 
Nucleotide Sequence Database Collaboration, identical to 
NASequence:GBacc:AC000123. Or something like 
PrimitiveSequence:EMBLacc:AC000123 and EnhancedSequence:EMBLacc:AC000123.
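
A minimal sketch of the Class, Namespace, Id triplet and of identity as full exchangeability, assuming the colon-separated text form used in the examples above. The names `Triplet` and `SynonymIndex` are made up for illustration and are not part of any MOBY specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """A MOBY-style identifier: Class, Namespace, Id."""
    object_class: str
    namespace: str
    id: str

    @classmethod
    def parse(cls, text):
        # Assumption: the three parts are separated by colons.
        parts = text.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected Class:Namespace:Id, got {text!r}")
        return cls(*parts)

    def __str__(self):
        return f"{self.object_class}:{self.namespace}:{self.id}"


class SynonymIndex:
    """Groups triplets that were declared identical (fully exchangeable)."""

    def __init__(self):
        self._groups = {}  # triplet -> set of all its synonyms, itself included

    def declare_identical(self, a, b):
        # Merge the two synonym groups so that identity stays transitive.
        merged = self._groups.get(a, {a}) | self._groups.get(b, {b})
        for triplet in merged:
            self._groups[triplet] = merged

    def identical(self, a, b):
        return a == b or b in self._groups.get(a, ())


embl = Triplet.parse("NASequence:EMBLacc:AC000123")
genbank = Triplet.parse("NASequence:GBacc:AC000123")

index = SynonymIndex()
index.declare_identical(embl, genbank)
print(index.identical(embl, genbank))
```

Identity is modelled here as membership in one shared group, so any synonym can stand in for any other. Equality and relatedness would need something different, which is the point of the distinction.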

Equality: Two sequences may be the same, though they are not the same 
thing: One is from human, the other from gorilla... Well, I find this 
concept harder to put into words, because I don't really see 
applications within a CRIB. Basically what I mean is indistinguishable, 
but not exchangeable. But I don't think this would ever make sense in a 
CRIB.

Relatedness: The classical cross-reference is a literature reference to 
a sequence. Certainly they are not identical, as there may be a 
many-to-many relationship. Still, it's generally accepted that it makes 
a lot of sense to reference this kind of thing. It's also a quite fixed 
relationship. But how about GO terms? Their assignment can be anything 
from reliable to esoteric. And it is certainly subject to change. And 
when I find a GO term in a CRIB, I will want to know how reliable it is. 
The GO approach is evidence codes.
But there is another issue. Working on flexible data models for our data 
munging here, I've come to realize how essential it is to distinguish 
the relationship types. isa? partof? inheritsfrom? translatedfrom? 
assignedto? referencedby? And I came to wondering about the CRIB.

I think that it is very important for MOBY. It is certainly important 
for our work here. Most of the stuff we're looking at and for is 
relations, interactions, connections between biological data. Most of my 
new-generation genome database is about storing relations. I'm using 
what is called a relational database model (the pun drives me crazy 
sometimes), and I'm finding that I spend hardly any time on working out 
how to handle primary data (sequences, names, coordinates and such) but 
spend lots of brain time on getting the relationships worked out.

I believe that besides MOBY triplets, which are an essential component in 
that they allow us to identify stuff, it may be worthwhile to think 
about getting all this relatedness business sorted out by introducing a 
new, special class: Relations. There'd have to be an ontology of 
relationship types, in which we can again identify each by a MOBY 
triplet. Relationships would then be represented as triplets of triplets 
(that's so poetic it has to be good): MOBYobject-Relation-MOBYobject.
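
A minimal sketch of such a triplet of triplets, assuming the relationship type is itself identified by a triplet. The class names, the `MOBYrel` namespace, and the SWISSPROT accession are hypothetical placeholders, not existing MOBY definitions.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    object_class: str
    namespace: str
    id: str


class Statement(NamedTuple):
    subject: Triplet
    relation: Triplet  # the relationship type, itself identified by a triplet
    object: Triplet


# Hypothetical entry from an ontology of relationship types.
TRANSLATES_INTO = Triplet("Relation", "MOBYrel", "TranslatesInto")

statement = Statement(
    subject=Triplet("NASequence", "EMBL", "AC000123"),
    relation=TRANSLATES_INTO,
    object=Triplet("ProteinSequence", "SWISSPROT", "P00000"),  # placeholder id
)

# Render as MOBYobject - Relation - MOBYobject.
print(" - ".join(":".join(part) for part in statement))
```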

That would sort out the CRIB. In the CRIB, it wouldn't have to be a 
triplet, as MOBYobject one would always be self, i.e. the object that 
contains the CRIB. But then I could get very happy and excited by 
writing stuff like:

SynonymousTo:EMBLSequenceAccession
ReferencedBy:PubMedCitation
TranslatesInto:ProteinSequence
InheritsFrom:BasicCodingSequence
AssignedTo:GOTerm
Contains:SequenceMotif

It admittedly gets a little complicated if relation-triplets are allowed 
to have attributes, e.g. reliability or evidence code. In XML 
representation, that doesn't really frighten me, though.
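
To illustrate, here is a rough sketch of a CRIB in XML where each entry names only the relation and the target, and relation attributes such as an evidence code ride along as XML attributes. The element names `CRIB` and `Xref`, the attribute names, and the PubMed and GO identifiers are all assumptions for illustration; none of them come from a MOBY schema. IEA and TAS are GO evidence codes.

```python
import xml.etree.ElementTree as ET


def build_crib(entries):
    """Build a CRIB element from (relation, target, attributes) tuples."""
    crib = ET.Element("CRIB")
    for relation, target, attrs in entries:
        xref = ET.SubElement(crib, "Xref", relation=relation, **attrs)
        xref.text = target
    return crib


def expand(self_id, crib):
    """Yield full (subject, relation, object, attributes) statements.

    The subject is always the object that contains the CRIB.
    """
    for xref in crib.findall("Xref"):
        attrs = dict(xref.attrib)
        relation = attrs.pop("relation")
        yield (self_id, relation, xref.text, attrs)


crib = build_crib([
    ("ReferencedBy", "Citation:PubMed:00000001", {}),        # placeholder id
    ("AssignedTo", "GOTerm:GO:0000000", {"evidence": "IEA"}),  # placeholder id
])

print(ET.tostring(crib, encoding="unicode"))
for statement in expand("NASequence:EMBL:AC000123", crib):
    print(statement)
```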

Anyway, I do think this is important. I'm sorry if I'm bringing a whole 
new dimension into MOBY.

For me, MOBY has always had three vital components:
1 The technology to access distributed data, i.e. brokerage provided by 
MOBY Central and SOAP transport of XML
2 The definition of the payload, that is XML schemas that allow me to 
recognize a sequence and its name and description when I get one
3 The interrelation of data, reflected by the CRIB

Of course, 1 and 2 are already valuable and beyond what is available so 
far, so I'm willing to postpone point 3. But maybe it needs to be considered 
from the start. Navigation or transformation from one object to another 
can be done by services, even without a CRIB. I can set up a 
ReferencedBy service that for EMBL accessions returns Citation objects. 
But as biology in my view is about relationships, that's the data that I 
want to share and access. And I think it will be immensely useful to 
package and transport that data within MOBY.
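
A rough sketch of what such a ReferencedBy service could look like, reduced to a plain function. There is no MOBY Central registration or SOAP transport here, and the lookup table, the function name, and the PubMed identifier are hypothetical.

```python
# Hypothetical lookup table: EMBL accession -> citation triplets.
CITATIONS = {
    "AC000123": ["Citation:PubMed:00000001"],  # placeholder id
}


def referenced_by(triplet):
    """Return citation triplets for an EMBL accession given as Class:Namespace:Id."""
    parts = triplet.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected Class:Namespace:Id, got {triplet!r}")
    object_class, namespace, accession = parts
    if namespace != "EMBL":
        raise ValueError(f"this service only handles EMBL accessions, got {namespace!r}")
    # Return a copy so callers cannot modify the lookup table.
    return list(CITATIONS.get(accession, []))


print(referenced_by("NASequence:EMBL:AC000123"))
```

The service answers one kind of relation per call, whereas a CRIB would ship all known relations together with the object itself.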

Well, while I intended to write a quick note, I've been at it for almost 
half an hour now and it's almost got the length required to be called a 
Document. I hope it doesn't take you quite as long to figure out what I 
mean. Mark, if I've completely confused MOBY or myself or you or both of 
us, put it down to my state of mind and send me a note, then we'll make a 
date for a phone call. Easier to sort things out there. I do tend to 
make things complicated. Sorry I didn't get on the teleconference, I read 
the emails way too late...

What's up with the Brisbane/ISMB plan? I may be there, will I meet you 
guys?

Best regards to all, Heiko

------------------------------------
Dr. Heiko Schoof
Technische Universitaet Muenchen
-Genome Oriented Bioinformatics-
Wissenschaftszentrum Weihenstephan
85350 Freising
Germany

Tel. +49 8161 71 5632
Fax +49 8161 71 5629
h.schoof at wzw.tum.de
http://binfo.bio.wzw.tum.de