[moby] [MOBY-dev] Re: Problems with Biomoby services in Taverna 1.2

Heiko Schoof schoof at mpiz-koeln.mpg.de
Thu Jul 21 13:42:32 UTC 2005


Hi all again, I'll respond to Martin's query here:

<snip>
  My understanding is that (talking about one mobyData object):
    a) Any Moby service can have more outputs. If so, all of them must be
registered. The number of such outputs must be fixed.
    b) Any of these outputs can be of type either Simple or Collection.
    c) If it is a Collection, this output can have one or more Simples in
that Collection. Such Simples (and their number) are not individually
registered.
</snip>

I think the issue is point a: that the number of such outputs must be fixed. 
Or, as the API says,
"each of these articles will appear EXACTLY ONCE in the output from the 
service". I propose changing this to "each of these articles will 
appear AT LEAST ONCE in the output from the service".

Why is this necessary? Currently, services that return potentially many 
results are registered as outputting a Collection. But within 
workflows, the iteration strategy or how often the next service is 
called should distinguish between an output made up of many independent 
items versus one or many groups of connected items. In the first case, 
the following service should be called once for each item, in the 
second case, only once for each group. See below for full example.

What I am suggesting is to separate the cardinality from the 
Simple/Collection issue; meaning, that a service that performs e.g. a 
database lookup and returns 1 or many outputs (or none, but in the Moby 
world this means it returns 1 empty output) will be registered as 
returning Simple, not Collection, if the outputs are otherwise 
semantically unrelated (aside from the fact that they arose from the 
same query). And reserve the Collection article for grouping of outputs 
that need to be seen as a single entity. E.g. a service that outputs 
ortholog pairs given as input a pair of organisms: Each ortholog pair 
could be represented as a pair of GenericSequence objects in a 
Collection, with the service outputting 1 or many of these Collections. 
The same service, given as input three organisms, could still output 
many Collections, then containing three sequences each. This prevents 
ugly explosion of specialized BioMoby objects like "MultipleSequences", 
"HAS GenericSequence(s)"... that would otherwise be needed to wrap 
this.
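
To make the ortholog example concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: Simple, Collection and ortholog_service are hypothetical names, not part of the BioMoby API, and the gene families are made-up data standing in for a database lookup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Simple:
    """One Moby object, e.g. a GenericSequence (class name kept as a string)."""
    moby_class: str
    value: str


@dataclass
class Collection:
    """A group of Simples that needs to be seen as a single entity."""
    members: List[Simple] = field(default_factory=list)


def ortholog_service(organisms: List[str]) -> List[Collection]:
    """Hypothetical service: returns one Collection per ortholog group.

    Each Collection holds one GenericSequence per input organism, so the
    group size follows the input while the number of groups may vary.
    """
    gene_families = ["famA", "famB"]  # made-up data, not a real lookup
    return [
        Collection([Simple("GenericSequence", f"{fam}:{org}") for org in organisms])
        for fam in gene_families
    ]


pairs = ortholog_service(["A. thaliana", "O. sativa"])
triples = ortholog_service(["A. thaliana", "O. sativa", "Z. mays"])
print(len(pairs), len(pairs[0].members))      # 2 2
print(len(triples), len(triples[0].members))  # 2 3
```

The point of the sketch is that the same output registration (Collection of GenericSequence) covers both pairs and triples, without a specialized wrapper object.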

For service discovery, this should not make a difference. Services 
would still be required to return every Class of object that they 
register, as you state in a: all output object *Classes* must be 
registered, and the number of *Classes* fixed. I.e., a service 
registered as returning GenericSequence and AnnotatedJpeg objects must 
always return at least one GenericSequence plus at least one 
AnnotatedJpeg. I can't recall a service that actually does that... I 
can only think of this being meaningful if the AnnotatedJpeg is 
semantically connected to a specific GenericSequence, and in that case 
both should be connected through putting them in a Collection imho.
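
The difference between the current wording and the proposed one can be sketched as a small check. This is an assumption-laden illustration: check_output and the rule names are hypothetical, and outputs are reduced to a flat list of class names.

```python
from collections import Counter
from typing import List


def check_output(registered: List[str], output_classes: List[str],
                 rule: str = "at_least_once") -> bool:
    """Check a service's output against its registered output classes.

    rule="exactly_once"  : the current API wording.
    rule="at_least_once" : the wording proposed in this mail.
    """
    counts = Counter(output_classes)
    # All output classes must be registered, under either rule.
    if any(cls not in registered for cls in counts):
        return False
    if rule == "exactly_once":
        return all(counts[cls] == 1 for cls in registered)
    if rule == "at_least_once":
        return all(counts[cls] >= 1 for cls in registered)
    raise ValueError(f"unknown rule: {rule}")


registered = ["GenericSequence", "AnnotatedJpeg"]
output = ["GenericSequence", "GenericSequence", "AnnotatedJpeg"]
print(check_output(registered, output, "at_least_once"))  # True
print(check_output(registered, output, "exactly_once"))   # False
```

Under both rules a missing registered class fails the check, which is why discovery should not be affected.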

For inputs, I'm not so worried; if multiple inputs are intended to go 
into a single service call, they will probably be connected and could 
go into a Collection. Example above: Input is a Collection of 
Organisms. Basically, I see that as the only way to register services 
that require AT LEAST two equal inputs.

No problem with b, but with c: To my understanding, when registering a 
Collection, the classes of the objects in the Simples it contains 
must also be registered. Otherwise, no discovery. However, per the API ("A 
collection may contain zero or more Objects of each of the Classes"), 
not all of these classes must actually be included.

So far, I do not see the need to distinguish between services that 
return EXACTLY ONE output and those that return one or more. Taverna 
seems to make that distinction, and bases iteration strategies on that, 
but I would want to do that dynamically, and it may be that that's what 
Taverna does. I'd by default assume that there will be multiple outputs 
and iterate over them, but if the workflow designer so wishes make it 
possible to (using e.g. a local processor) combine all the outputs into 
a Collection that can be used as the single input to a following 
service.

The distinction between iterating and combining is necessary: Use case examples would be a service 
returning a number of sequences, that in one scenario (iterate) should 
each be run through a BLAST service individually and in a different 
scenario (bag or Collection) should be all together input into a single 
call of a multiple alignment service.
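
The two scenarios can be sketched as follows. The function names are hypothetical, and fake_blast and fake_multiple_alignment are stand-ins for real services, not actual BioMoby or Taverna calls.

```python
from typing import Callable, List


def run_iterate(items: List[str], service: Callable[[str], str]) -> List[str]:
    """Iterate: call the following service once for each item."""
    return [service(item) for item in items]


def run_bag(items: List[str], service: Callable[[List[str]], str]) -> str:
    """Bag: combine all items into one group and call the service once."""
    return service(list(items))


def fake_blast(sequence: str) -> str:
    # Stand-in for a BLAST service that acts on a single sequence.
    return f"hits({sequence})"


def fake_multiple_alignment(sequences: List[str]) -> str:
    # Stand-in for a multiple alignment service that needs all sequences.
    return "aligned(" + ",".join(sequences) + ")"


sequences = ["seq1", "seq2", "seq3"]
print(run_iterate(sequences, fake_blast))
# ['hits(seq1)', 'hits(seq2)', 'hits(seq3)']
print(run_bag(sequences, fake_multiple_alignment))
# aligned(seq1,seq2,seq3)
```

The producing service is the same in both cases; only the workflow designer's choice of strategy differs.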

The current problem arises because Taverna now, in what is for me the 
semantically correct interpretation, takes a Collection it receives as 
output from a service and inputs it into a single execution of the 
following service if that service consumes Collection. In version 1.1 
and before, Collections were decomposed by Taverna and iterated over. 
For the workflows being used, that was the desired behavior, as e.g. 
keyword queries returning a set of sequences should be linked to 
services that act on each individual sequence. The mistake is imho on 
the BioMoby side, where we use Collection to wrap multiple outputs even 
if these are individual results that should be processed individually, 
and then, in order to be able to pipeline, register services that 
actually act on single inputs as consuming Collections. Consider Tom 
Oinn's comment 080705:
<snip>
1) Consumer declares it consumes singles, Producer emits a collection. 
In this context Taverna iteratively calls the Consumer with each item 
from the collection. This is probably what you'd expect to happen, the 
result is that the Consumer effectively emits a collection of whatever 
it would emit normally.

2) Consumer declares it consumes a collection, Producer emits a 
collection. In this case Taverna will indeed split the output 
collection (because we always do) but it will be magically reassembled 
before being given to the Consumer.

3) Consumer declares it consumes a collection, Producer emits a single 
item. Taverna wraps the single item in a single element collection and 
gives it to the Consumer.
</snip>

This is the same logic that we'd need to implement in BioMoby to 
allow meaningful links between Collection-producing services and 
Simple-consuming services! And we should NOT register services as 
consuming or producing Collections if all they do is mimic this 
behaviour internally by iterating over the Collection items.
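
Tom's three cases could be sketched roughly like this. It is not Taverna's actual code: link is a hypothetical helper, a collection is modelled as a Python list, and the single-to-single case is my own addition since it is not in the quoted list.

```python
from typing import Any, Callable


def link(producer_output: Any, consumer: Callable[[Any], Any],
         consumes_collection: bool) -> Any:
    """Feed a producer's output to a consumer, following the three cases."""
    is_collection = isinstance(producer_output, list)

    if consumes_collection:
        # Case 2: collection in, collection expected. Whatever splitting
        # happens internally, the consumer sees the whole collection.
        # Case 3: single item in, so wrap it in a one-element collection.
        batch = producer_output if is_collection else [producer_output]
        return consumer(batch)

    if is_collection:
        # Case 1: consumer takes singles, so call it once per item; the
        # result is effectively a collection of the consumer's outputs.
        return [consumer(item) for item in producer_output]

    # Not in the quoted list (assumption): single in, single expected.
    return consumer(producer_output)


print(link(["a", "b"], str.upper, consumes_collection=False))  # ['A', 'B']
print(link(["a", "b"], len, consumes_collection=True))         # 2
print(link("a", len, consumes_collection=True))                # 1
```

With this kind of linking done by the framework, a service that acts on one sequence only needs to be registered as consuming a Simple.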

Many words and I'm not sure this is making anything any clearer. But I 
try ;-)

Best, Heiko

On 18 Jul 2005, at 14:39, Martin Senger wrote:

Hi all,
    Catching up on my email piles, I wonder if someone can summarize whether the
discussion about collections in this thread brought any (planned) changes
in the API (I am not talking now about how it is, or should be,
implemented in Taverna; that's, imho, another story).
    My understanding is that (talking about one mobyData object):
    a) Any Moby service can have more outputs. If so, all of them must be
registered. The number of such outputs must be fixed.
    b) Any of these outputs can be of type either Simple or Collection.
    c) If it is a Collection, this output can have one or more Simples in
that Collection. Such Simples (and their number) are not individually
registered.

    Has this vision been changed?

    Thanks,
    Martin

-- 
Martin Senger

EMBL Outstation - Hinxton                Senger at EBI.ac.uk
European Bioinformatics Institute        Phone: (+44) 1223 494636
Wellcome Trust Genome Campus             (Switchboard:     494444)
Hinxton                                  Fax  : (+44) 1223 494468
Cambridge CB10 1SD
United Kingdom                           
http://industry.ebi.ac.uk/~senger




