[moby] [MOBY-dev] Re: Problems with Biomoby services in Taverna 1.2
Heiko Schoof
schoof at mpiz-koeln.mpg.de
Thu Jul 21 13:42:32 UTC 2005
Hi all again, I'll respond to Martin's query here:
<snip>
My understanding is that (talking about one mobyData object):
a) Any Moby service can have more outputs. If so, all of them must be
registered. The number of such outputs must be fixed.
b) Any of these outputs can be of type either Simple or Collection.
c) If it is a Collection, this output can have one or more Simples in
that Collection. Such Simples (and their number) are not individually
registered.
</snip>
I think the issue is point a: the number of such outputs must be fixed.
Or, as the API says:
"each of these articles will appear EXACTLY ONCE in the output from the
service". I request to change this to "each of these articles will
appear AT LEAST ONCE in the output from the service".
Why is this necessary? Currently, services that return potentially many
results are registered as outputting a Collection. But within
workflows, the iteration strategy or how often the next service is
called should distinguish between an output made up of many independent
items versus one or many groups of connected items. In the first case,
the following service should be called once for each item, in the
second case, only once for each group. See below for full example.
What I am suggesting is to separate the cardinality from the
Simple/Collection issue; meaning, that a service that performs e.g. a
database lookup and returns 1 or many outputs (or none, but in the Moby
world this means it returns 1 empty output) will be registered as
returning Simple, not Collection, if the outputs are otherwise
semantically unrelated (aside from the fact that they arose from the
same query). And reserve the Collection article for grouping of outputs
that need to be seen as a single entity. E.g. a service that outputs
ortholog pairs given as input a pair of organisms: Each ortholog pair
could be represented as a pair of GenericSequence objects in a
Collection, with the service outputting 1 or many of these Collections.
The same service, given as input three organisms, could still output
many Collections, each then containing three sequences. This prevents
an ugly explosion of specialized BioMoby objects like
"MultipleSequences", "HAS GenericSequence(s)"... that would otherwise
be needed to wrap such groups.
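A minimal sketch of the proposed registration, to make the difference
concrete. The Simple and Collection classes, plan_downstream_calls and
the sequence names are made up for illustration and are not part of
the BioMoby API; the only assumption is that each top-level output
article triggers one call of the following service.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Simple:
    """Hypothetical stand-in for one Simple article."""
    object_class: str
    value: str


@dataclass(frozen=True)
class Collection:
    """Hypothetical stand-in for a Collection: Simples that form one entity."""
    members: Tuple[Simple, ...]


Article = Union[Simple, Collection]


def plan_downstream_calls(outputs: List[Article]) -> List[Article]:
    """One downstream call per top-level article; a Collection stays whole."""
    return list(outputs)


# A database lookup returning three unrelated hits: registered as Simple.
lookup_hits = [Simple("GenericSequence", f"seq{i}") for i in range(3)]

# An ortholog service for two organisms: each pair is one Collection.
ortholog_pairs = [
    Collection((Simple("GenericSequence", "orgA_gene1"),
                Simple("GenericSequence", "orgB_gene1"))),
    Collection((Simple("GenericSequence", "orgA_gene2"),
                Simple("GenericSequence", "orgB_gene2"))),
]

print(len(plan_downstream_calls(lookup_hits)))     # 3 calls, one per hit
print(len(plan_downstream_calls(ortholog_pairs)))  # 2 calls, one per pair
```

With three organisms the ortholog service would still emit one
Collection per group, only with three members each, so no new object
class is needed.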
For service discovery, this should not make a difference. Services
would still be required to return every Class of object that they
register, as you state in a: all output object *Classes* must be
registered, and the number of *Classes* fixed. I.e., a service
registered as returning GenericSequence and AnnotatedJpeg objects must
always return at least one GenericSequence plus at least one
AnnotatedJpeg. I can't recall a service that actually does that... I
can only think of this being meaningful if the AnnotatedJpeg is
semantically connected to a specific GenericSequence, and in that case
both should be connected through putting them in a Collection imho.
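The requested wording change can be sketched as a check on the object
classes found in one service response. This is only an illustration:
the function names are hypothetical, and a response is reduced here to
a flat list of class names.

```python
from collections import Counter
from typing import Iterable


def satisfies_exactly_once(registered: Iterable[str], returned: Iterable[str]) -> bool:
    """Current API wording: every registered class appears exactly once."""
    counts = Counter(returned)
    return all(counts[cls] == 1 for cls in registered)


def satisfies_at_least_once(registered: Iterable[str], returned: Iterable[str]) -> bool:
    """Requested wording: every registered class appears at least once."""
    counts = Counter(returned)
    return all(counts[cls] >= 1 for cls in registered)


registered = ["GenericSequence", "AnnotatedJpeg"]
response = ["GenericSequence", "GenericSequence", "AnnotatedJpeg"]

print(satisfies_exactly_once(registered, response))   # False
print(satisfies_at_least_once(registered, response))  # True
```

A response lacking any AnnotatedJpeg would fail both checks, so the
set of registered classes stays fixed for discovery.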
For inputs, I'm not so worried; if multiple inputs are intended to go
into a single service call, they will probably be connected and could
go into a Collection. Example above: Input is a Collection of
Organisms. Basically, I see that as the only way to register services
that require AT LEAST two equal inputs.
No problem with b, but with c: to my understanding, when registering a
Collection, the classes of the objects in the Simples it contains must
also be registered; otherwise, no discovery. However, as the API says,
"A collection may contain zero or more Objects of each of the Classes",
so not all of these classes must actually be included.
So far, I do not see the need to distinguish between services that
return EXACTLY ONE output and those that return one or more. Taverna
seems to make that distinction, and bases iteration strategies on that,
but I would want to do that dynamically, and it may be that that's what
Taverna does. By default I'd assume that there will be multiple outputs
and iterate over them, but make it possible, if the workflow designer
so wishes, to combine all the outputs (using e.g. a local processor)
into a Collection that can be used as the single input to a following
service.
That choice between iterating and combining is necessary: a use case
would be a service returning a number of sequences that in one scenario
(iterate) should each be run through a BLAST service individually and,
in a different scenario (bag or Collection), should all be input
together into a single call of a multiple alignment service.
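The two scenarios can be sketched as follows. The service functions
are placeholders that only format strings; they are not real BLAST or
alignment calls, and iterate/bag are hypothetical helper names.

```python
from typing import Callable, List


def fake_blast(sequence: str) -> str:
    """Placeholder for a service that consumes a single sequence."""
    return f"hits({sequence})"


def fake_multiple_alignment(sequences: List[str]) -> str:
    """Placeholder for a service that consumes a whole set of sequences."""
    return "aligned(" + ",".join(sequences) + ")"


def iterate(service: Callable[[str], str], items: List[str]) -> List[str]:
    """Call the service once per item."""
    return [service(item) for item in items]


def bag(service: Callable[[List[str]], str], items: List[str]) -> str:
    """Combine all items and call the service exactly once."""
    return service(list(items))


sequences = ["seqA", "seqB", "seqC"]

print(iterate(fake_blast, sequences))
# ['hits(seqA)', 'hits(seqB)', 'hits(seqC)']
print(bag(fake_multiple_alignment, sequences))
# aligned(seqA,seqB,seqC)
```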
The current problem arises because Taverna now, in what is for me the
semantically correct interpretation, inputs a Collection it receives as
output from a service into a single execution of the following service
if that service consumes Collection. In version 1.1 and before,
Collections were decomposed by Taverna and iterated over.
For the workflows being used, that was the wanted behavior, as e.g.
keyword queries returning a set of sequences should be linked to
services that act on each individual sequence. The mistake is imho on
the BioMoby side, where we use Collection to wrap multiple outputs even
if these are individual results that should be processed individually,
and then, in order to be able to pipeline, register services that
actually act on single inputs as consuming Collections. Consider Tom
Oinn's comment 080705:
<snip>
1) Consumer declares it consumes singles, Producer emits a collection.
In this context Taverna iteratively calls the Consumer with each item
from the collection. This is probably what you'd expect to happen, the
result is that the Consumer effectively emits a collection of whatever
it would emit normally.
2) Consumer declares it consumes a collection, Producer emits a
collection. In this case Taverna will indeed split the output
collection (because we always do) but it will be magically reassembled
before being given to the Consumer.
3) Consumer declares it consumes a collection, Producer emits a single
item. Taverna wraps the single item in a single element collection and
gives it to the Consumer.
</snip>
This is the same logic that we'd need to implement in BioMoby to
allow meaningful links between Collection producing services and Simple
consuming services! And we should NOT register services as consuming or
producing Collections if all they do is mimic this behaviour internally
by iterating over the Collection items.
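The three cases in Tom's comment can be sketched as a small dispatch
function. This is not Taverna's or BioMoby's code: the function name
is hypothetical, a Python list stands in for a Collection, and the
single-to-single branch is an assumption that the quoted comment does
not cover.

```python
from typing import Any, Callable


def link(consumer: Callable[[Any], Any], consumes_collection: bool, produced: Any) -> Any:
    """Pass a producer's output to a consumer following the three quoted cases."""
    produced_is_collection = isinstance(produced, list)
    if consumes_collection:
        # Case 2: collection goes in as a whole.
        # Case 3: a single item is wrapped in a one-element collection.
        items = produced if produced_is_collection else [produced]
        return consumer(items)
    if produced_is_collection:
        # Case 1: iterate, yielding a collection of the consumer's results.
        return [consumer(item) for item in produced]
    # Assumed: single item to single consumer is one plain call.
    return consumer(produced)


print(link(str.upper, False, ["a", "b"]))  # case 1: ['A', 'B']
print(link(len, True, ["a", "b"]))         # case 2: 2
print(link(len, True, "a"))                # case 3: 1
```

With this logic in the linking layer, a service that works on one
sequence can be registered as consuming Simple and still be chained
after a Collection producer.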
Many words and I'm not sure this is making anything any clearer. But I
try ;-)
Best, Heiko
On 18 Jul 2005, at 14:39, Martin Senger wrote:
Hi all,
Catching up on my email piles, I wonder if someone can summarize whether
the discussion about collections in this thread brought any (planned)
changes in the API (I am not talking now about how it is, or should be,
implemented in Taverna; that's, imho, another story).
My understanding is that (talking about one mobyData object):
a) Any Moby service can have more outputs. If so, all of them must be
registered. The number of such outputs must be fixed.
b) Any of these outputs can be of type either Simple or Collection.
c) If it is a Collection, this output can have one or more Simples in
that Collection. Such Simples (and their number) are not individually
registered.
Has this vision been changed?
Thanks,
Martin
--
Martin Senger
EMBL Outstation - Hinxton Senger at EBI.ac.uk
European Bioinformatics Institute Phone: (+44) 1223 494636
Wellcome Trust Genome Campus (Switchboard: 494444)
Hinxton Fax : (+44) 1223 494468
Cambridge CB10 1SD
United Kingdom
http://industry.ebi.ac.uk/~senger
_______________________________________________
MOBY-dev mailing list
MOBY-dev at biomoby.org
http://www.biomoby.org/mailman/listinfo/moby-dev