[MOBY-l] ideas from the CRIB
Heiko Schoof
h.schoof at gsf.de
Thu Feb 27 19:25:54 UTC 2003
Dear friends,
I have been regretfully silent on this list, not through lack of
interest, but through massive personal overload by my teaching duties
and all the other things I tend to consider as my business. But since
Mark is now trying to remind me to do things in posts to lists where I
am not even subscribed, maybe it's time to resurface.
For those who don't know me:
Heiko Schoof, assistant prof. at the Technische Universitaet Munich,
tasted the freshness of Emma Lake waters at the original MOBY-DIC, thus
MOBY-aware from the first hour. Teaching: Bioinformatics. Research:
whole-genome correlative analyses in plants. Duty: Keep MAtDB (MIPS
Arabidopsis thaliana DataBase, mips.gsf.de/proj/thal/db) administered,
up, up to date, useful while supervising the setup of rice and maize dbs,
and an EU project to network European plant databases towards a federated
genome db (www.eu-plant-genome.net). OK, I'm busy.
My contributions to MOBY have been rather indirect, but I've been
spreading the idea and I guess I did my part in getting MOBY talked
about. I've been promising to adopt MOBY for MAtDB, and I still have
that plan, only now it looks like something is actually happening:
- For the PlaNet project, we have a position here which we found hard to
fill, but in December Rebecca Ernst joined us so finally I have someone
who can do the work and I can continue just making promises ;-)
- We convinced our partners within PlaNet that MOBY may be the thing to
look at instead of our own CORBA-based implementation, which was the
original idea back in 2001.
- Thus, we (read "Rebecca") have the mission to implement some MOBY
services to test their usefulness for PlaNet; other PlaNet partners will follow.
- At MIPS, a group led by Volker Stuempflen is revolutionizing our
idea of infrastructure by implementing things like model-view-controller
or business delegate design, XML databases etc., and Volker supports the
idea of webservices as opposed to CORBA and is willing to help us give
MOBY/SOAP a shot. At the same time, we're using XML for internal data
transport, and need all the schemas that we'll also need for MOBY.
So much for the general update. Now to the details Mark wanted me to
comment on. Well, I have been milling MOBY around my mind and through
explaining it again and again to others I'm starting to get some ideas.
I do hope I don't repeat too much you have already discussed, I must
admit I have not been following moby-l (and I'm not on moby-dev, seeing
I'm not contributing codewise, as much as I'd love to: Please rely on
Rebecca to keep communication fluid).
Key point for me has always been bringing together distributed data.
This boils down to knowing what is identical, equal, or related. Hence
the id discussion, and MOBY triplets (excuse me, but to my knowledge of
English the noun is triplet... and triple just doesn't have the vibes
for me: i3c? MOBY to the power of 3? No, no, please...). Class,
Namespace, Id still makes most sense to me.
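To make the Class, Namespace, Id triplet concrete, here is a minimal Python
sketch. The name MobyTriplet and the colon-separated text form are assumptions
for illustration, taken from the way triplets are written in this message, not
from any MOBY specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MobyTriplet:
    """Hypothetical holder for a Class:Namespace:Id triplet."""
    object_class: str
    namespace: str
    identifier: str

    @classmethod
    def parse(cls, text: str) -> "MobyTriplet":
        # Split on the first two colons only, so an id may itself contain ':'.
        parts = text.split(":", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected Class:Namespace:Id, got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.object_class}:{self.namespace}:{self.identifier}"


triplet = MobyTriplet.parse("NASequence:EMBL:AC000123")
print(triplet.namespace)  # EMBL
```

Because the dataclass is frozen, two triplets with the same three parts compare
equal and can be used as dictionary keys, which is handy when asking whether
two records name the same thing.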
One thing that has been worrying me is the CRIB, or cross-reference
block. What exactly do we mean by a cross-reference? I've seen
different ideas used.
One idea that was already discussed at Emma Lake was that the CRIB
should be the jumping platform to get from one MOBY object to others.
E.g. a NASequence:EMBL:AC000123 object could contain in its CRIB the
triplets for literature references, alternative accessions in e.g.
GenBank, or the protein translation. This makes it similar to database
reference blocks in e.g. EMBL or SWISSPROT entries.
In this context I remember discussions that e.g. a gene object retrieved
from a genome database could contain in the CRIB a lot of the annotation
data. This would include database references, literature, sequences, GO
terms and so on. All extremely useful to navigate on from that gene.
But at that point I really think we need to tackle the problem of
identity, equality, or relatedness.
Identity: The triplet in the CRIB is just another descriptor for the
same thing, that is a synonym, fully exchangeable. E.g.:
NASequence:EMBLacc:AC000123 is, by definition of the International
Nucleotide Sequence Database Collaboration, identical to
NASequence:GBacc:AC000123. Or something like
PrimitiveSequence:EMBLacc:AC000123 and EnhancedSequence:EMBLacc:AC000123.
Equality: Two sequences may be the same, though they are not the same
thing: One is from human, the other from gorilla... Well, I find this
concept harder to put into words, because I don't really see
applications within a CRIB. Basically what I mean is indistinguishable,
but not exchangeable. But I don't think this would ever make sense in a
CRIB.
Relatedness: The classical cross-reference is a literature reference to
a sequence. Certainly they are not identical, as there may be a
many-to-many relationship. Still, it's generally accepted that it makes
a lot of sense to reference this kind of thing. It's also a quite fixed
relationship. But how about GO terms? Their assignment can be anything
from reliable to esoteric. And it is certainly subject to change. And
when I find a GO term in a CRIB, I will want to know how reliable it is.
The GO approach is evidence codes.
But there is another issue. Working on flexible data models for our data
munging here, I've come to realize how essential it is to distinguish
the relationship types. isa? partof? inheritsfrom? translatedfrom?
assignedto? referencedby? And I came to wondering about the CRIB.
I think that it is very important for MOBY. It is certainly important
for our work here. Most of the stuff we're looking at and for is
relations, interactions, connections between biological data. Most of my
new-generation genome database is about storing relations. I'm using
what is called a relational database model (the pun drives me crazy
sometimes), and I'm finding that I spend hardly any time on working out
how to handle primary data (sequences, names, coordinates and such) but
spend lots of brain time on getting the relationships worked out.
I believe that besides MOBY triplets, which are an essential component in
that they allow us to identify stuff, it may be worthwhile to think
about getting all this relatedness business sorted out by introducing a
new, special class: Relations. There'd have to be an ontology of
relationship types, in which we can again identify each by a MOBY
triplet. Relationships would then be represented as triplets of triplets
(that's so poetic it has to be good): MOBYobject-Relation-MOBYobject.
That would sort out the CRIB. In the CRIB, it wouldn't have to be a
triplet, as MOBYobject one would always be self, i.e. the object that
contains the CRIB. But then I could get very happy and excited by
writing stuff like:
SynonymousTo:EMBLSequenceAccession
ReferencedBy:PubMedCitation
TranslatesInto:ProteinSequence
InheritsFrom:BasicCodingSequence
AssignedTo:GOTerm
Contains:SequenceMotif
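As a rough sketch of the idea above, a CRIB could hold (relation, target)
pairs, with the subject implied by the owning object. Everything named here
(CribEntry, MobyObject, the "Citation"/"PubMed" labels, the placeholder id) is
a hypothetical illustration; the relation type is kept as a plain string for
brevity rather than as a triplet from a relationship ontology.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (Class, Namespace, Id)


@dataclass(frozen=True)
class CribEntry:
    relation: str    # relation type, e.g. "ReferencedBy" (plain string here)
    target: Triplet  # the related MOBY object


@dataclass
class MobyObject:
    triplet: Triplet
    crib: List[CribEntry] = field(default_factory=list)

    def statements(self) -> List[Tuple[Triplet, str, Triplet]]:
        # Expand each CRIB entry into a full subject-relation-object statement;
        # the subject is always the object that owns the CRIB.
        return [(self.triplet, e.relation, e.target) for e in self.crib]

    def related(self, relation: str) -> List[Triplet]:
        # All targets reachable through one relation type.
        return [e.target for e in self.crib if e.relation == relation]


seq = MobyObject(
    ("NASequence", "EMBLacc", "AC000123"),
    [
        CribEntry("SynonymousTo", ("NASequence", "GBacc", "AC000123")),
        # The citation id below is a made-up placeholder, not real data.
        CribEntry("ReferencedBy", ("Citation", "PubMed", "PLACEHOLDER-ID")),
    ],
)
print(seq.related("SynonymousTo"))
```

Calling seq.statements() gives back the full triplet-of-triplets form, so the
compact CRIB notation loses nothing compared with spelling out the subject
each time.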
It admittedly gets a little complicated if relation-triplets are allowed
to have attributes, e.g. reliability or evidence code. In XML
representation, that doesn't really frighten me, though.
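For illustration only, here is one way such attributes might look in XML,
built with Python's standard xml.etree.ElementTree. The element and attribute
names (CRIB, Relation, Object, type, evidence) are invented for this sketch and
are not an agreed MOBY schema; the GO id is a placeholder, and "IEA" is used
merely as an example of a GO evidence code.

```python
import xml.etree.ElementTree as ET


def crib_entry_xml(relation, obj_class, namespace, identifier, **attributes):
    """Build one hypothetical <Relation> element with optional attributes."""
    # Extra keyword arguments (e.g. evidence="IEA") become XML attributes
    # on the relation itself, not on the target object.
    entry = ET.Element("Relation", {"type": relation, **attributes})
    ET.SubElement(
        entry,
        "Object",
        {"class": obj_class, "namespace": namespace, "id": identifier},
    )
    return entry


crib = ET.Element("CRIB")
# Placeholder GO id; the evidence code qualifies the assignment.
crib.append(crib_entry_xml("AssignedTo", "GOTerm", "GO", "GO:0000000", evidence="IEA"))
crib.append(crib_entry_xml("SynonymousTo", "NASequence", "GBacc", "AC000123"))

xml_text = ET.tostring(crib, encoding="unicode")
print(xml_text)
```

Putting the attributes on the relation element keeps the target a plain
triplet, so a consumer that ignores reliability information can still follow
the link.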
Anyway, I do think this is important. I'm sorry if I'm bringing a whole
new dimension into MOBY.
For me, MOBY has always had three vital components:
1 The technology to access distributed data, i.e. brokerage provided by
MOBY Central and SOAP transport of XML
2 The definition of the payload, that is XML schemas that allow me to
recognize a sequence and its name and description when I get one
3 The interrelation of data, reflected by the CRIB
Of course, 1 and 2 are already valuable and beyond what is available so
far, so I'm willing to postpone. But maybe it needs to be considered
from the start. Navigation or transformation from one object to another
can be done by services, even without a CRIB. I can set up a
ReferencedBy service that for EMBL accessions returns Citation objects.
But as biology in my view is about relationships, that's the data that I
want to share and access. And I think it will be immensely useful to
package and transport that data within MOBY.
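The service-based alternative mentioned above could be sketched as follows.
This is only a toy stand-in: the function name referenced_by, the in-memory
lookup table, the "Citation"/"PubMed" labels, and the placeholder ids are all
assumptions, and a real service would sit behind MOBY Central and SOAP rather
than a local function call.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (Class, Namespace, Id)

# Hypothetical lookup table standing in for a provider's database.
# The citation ids are placeholders, not real references.
_CITATIONS: Dict[str, List[str]] = {
    "AC000123": ["PLACEHOLDER-1", "PLACEHOLDER-2"],
}


def referenced_by(accession: Triplet) -> List[Triplet]:
    """Return Citation triplets for an EMBL accession triplet."""
    _obj_class, namespace, identifier = accession
    if namespace != "EMBLacc":
        raise ValueError(f"unsupported namespace: {namespace!r}")
    # Unknown accessions simply yield no citations.
    return [("Citation", "PubMed", cid) for cid in _CITATIONS.get(identifier, [])]


citations = referenced_by(("NASequence", "EMBLacc", "AC000123"))
print(len(citations))  # 2
```

The relation here lives in the service rather than in the object, which is
exactly the trade-off discussed: it works without a CRIB, but the relationship
data itself is not packaged and transported with the object.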
Well, while I intended to write a quick note, I've been at it for almost
half an hour now and it's almost got the length required to be called a
Document. I hope it doesn't take you quite as long to figure out what I
mean. Mark, if I've completely confused MOBY or myself or you or both of
us, put it down to my state of mind and send me a note, then we'll make a
date for a phone call. Easier to sort things out there. I do tend to
make things complicated. Sorry I didn't get on the teleconference, I read
the emails way too late...
What's up with the Brisbane/ISMB plan? I may be there, will I meet you
guys?
Best regards to all, Heiko
------------------------------------
Dr. Heiko Schoof
Technische Universitaet Muenchen
-Genome Oriented Bioinformatics-
Wissenschaftszentrum Weihenstephan
85350 Freising
Germany
Tel. +49 8161 71 5632
Fax +49 8161 71 5629
h.schoof at wzw.tum.de
http://binfo.bio.wzw.tum.de