[MOBY-dev] data by reference - a request for comments

Fri Jul 25 10:50:12 UTC 2008

Hi Martin et al.,

On 25 Jul 2008, at 2:06 AM, Martin Senger wrote:

> Thank you for all your comments. We need to move on, however. I am, almost
> impatiently, waiting for a conclusion because I need to implement the
> "data-by-reference" rather soon. Are there other people preparing/building
> their comments - so we should wait for them, or can I summarize what was
> said so far - and create an official "RFC" from it?
>
> My personal opinion is still on the side to have references only for
> primitives. I am not convinced by the Pieter's vision of a huge collection
> that is too big even when it uses references. I am simply trying to find a
> solution that will be good for the most cases - and then deal with the
> marginal cases separately (like by designing the service differently).

Well, especially if you want a solution that works in most cases, I
think we should have pass-by-reference at *any* level in mobyData. How
often will the data of a single primitive be too large for inclusion in
the SOAP body? I can imagine wrapping legacy data formats like a
complete EMBL/Genbank/DDBJ record in a String object, or sending images
as base64-encoded text in a String object. Maybe even sending a bulky
sequence like a complete chromosome inside a String object, but in most
cases the raw data carried by a primitive will be smaller than your
average URL! And even if you do send some bulky stuff inside a String
object, as long as you send only one of those at a time it'll still
work in most cases. But once you start to send tens, hundreds or more
of those, you have a problem. If you then replace each of them with its
own URL, the client still has to fetch tens, hundreds or more URLs
individually, causing massive overhead. The reason to implement
pass-by-reference is to make BioMoby scale much better for big jobs,
and in such cases fetching all the data from a single reference is much
more efficient.

Take for example position information of alignments on a genome:

  <moby:HitPosition moby:id='' moby:namespace='transcriptome' moby:articleName='hit_position'>
    <moby:String moby:id='' moby:namespace='' moby:articleName='seq_id'> 13 </moby:String>
    <moby:Integer moby:id='' moby:namespace='' moby:articleName='start'> 97419218 </moby:Integer>
    <moby:Integer moby:id='' moby:namespace='' moby:articleName='stop'> 97419282 </moby:Integer>
    <moby:Object moby:id='-' moby:namespace='' moby:articleName='strand'/>
    <moby:String moby:id='' moby:namespace='' moby:articleName='cll'> 65M </moby:String>
  </moby:HitPosition>

With pass-by-reference only for primitives this would become something  
like this:

  <moby:HitPosition moby:id='' moby:namespace='transcriptome' moby:articleName='hit_position'>
    <moby:String moby:id='' moby:namespace='' moby:articleName='seq_id'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_seq_id.xml'/>
    <moby:Integer moby:id='' moby:namespace='' moby:articleName='start'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_start.xml'/>
    <moby:Integer moby:id='' moby:namespace='' moby:articleName='stop'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_stop.xml'/>
    <moby:Object moby:id='-' moby:namespace='' moby:articleName='strand'/>
    <moby:String moby:id='' moby:namespace='' moby:articleName='cll'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_cll.xml'/>
  </moby:HitPosition>

That ain't much of an improvement!

I have on average 30,000 features on a microarray and for each oligo
on average 3 hits on a reference assembly. That makes for an average
total of 90,000 HitPosition objects, and that is just a small part of
the annotation for my oligos....
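
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes that each of the four referenced primitives in the HitPosition example (seq_id, start, stop, cll) costs one HTTP fetch, and that a reference at a higher level covers the whole collection with a single fetch. The function names are made up for illustration only.

```python
# Rough count of HTTP fetches for the microarray example.
FEATURES = 30_000          # oligos on the array (figure from the text)
HITS_PER_FEATURE = 3       # average hits per oligo (figure from the text)
REFERENCED_PRIMITIVES = 4  # seq_id, start, stop, cll in the HitPosition example


def fetches_primitives_only(features, hits_per_feature, primitives_per_hit):
    """Assumption: one URL per referenced primitive, so one fetch per primitive."""
    return features * hits_per_feature * primitives_per_hit


def fetches_any_level():
    """Assumption: one URL for the whole collection, so a single fetch."""
    return 1


hit_positions = FEATURES * HITS_PER_FEATURE
print(hit_positions)  # 90000
print(fetches_primitives_only(FEATURES, HITS_PER_FEATURE, REFERENCED_PRIMITIVES))  # 360000
print(fetches_any_level())  # 1
```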

Just to stress that these are not hypothetical cases: the stuff above
is just a single example of what I have been using for more than two
years. Of course I needed a pass-by-reference workaround, because the
data is too big for the SOAP body. So I registered a URL object and
send those around. These URL objects point to chunks of BioMoby XML,
which is a complete mobyData block in most cases. Although the URL
object is fully compatible with the current BioMoby standard, it's an
ugly solution for two reasons:
1. It's not a standard way to do pass-by-reference.
2. It defeats the entire purpose of having the BioMoby object ontology
to improve automatic service discovery. You can only discover that I
provide several services which consume or produce URL objects, but you
cannot discover automatically what those URLs point to, so most of
the URL object producers and consumers will be incompatible! (Ok, I
can use namespace restriction to limit the problem of incompatible
services a bit, but still you would have no idea what the URL points
to based on the data in BioMoby Central.)

So I would love to see standardised pass-by-reference as part of the  
BioMoby specs and I think it doesn't require rocket science to do this  
at any level in the structure of BioMoby objects. Why don't we simply  
do the following:

Current situation:

A BioMoby object is a BioMoby triple with an optional articleName
attribute and, for primitives, optional raw character data. The
triple is the XML element name, an id attribute and a namespace
attribute.

New situation:
The above, plus: the id and namespace attributes of a BioMoby triple can
be replaced with an xlink attribute, resulting in a "BioMoby double".
In that case the element containing the xlink attribute and all its
children are available from the link specified by the xlink attribute.

Example old:

<ComplexObject id='accession_number123875' namespace='' articleName='MyFavoriteObject'>
	<String id='' namespace='' articleName='MyPrimitiveString'>
		ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
	</String>
	<Integer id='' namespace='' articleName='MyPrimitiveInt'>
		4569343
	</Integer>
</ComplexObject>

Example with pass-by-reference:

<ComplexObject xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml' 
  articleName='MyFavoriteObject' />

which points to:

<ComplexObject id='accession_number123875' namespace='' articleName='MyFavoriteObject'>
	<String id='' namespace='' articleName='MyPrimitiveString'>
		ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
	</String>
	<Integer id='' namespace='' articleName='MyPrimitiveInt'>
		4569343
	</Integer>
</ComplexObject>
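
A minimal sketch of how a client could resolve such a "BioMoby double", written in Python with the standard xml.etree.ElementTree module. The names resolve, fetch and REMOTE are hypothetical. The fetch function is a stand-in for an HTTP GET that serves the example document from a dictionary so the sketch runs offline. The unprefixed xlink attribute is taken literally from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the remote server: URL -> XML document.
REMOTE = {
    "http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml": (
        "<ComplexObject id='accession_number123875' namespace='' "
        "articleName='MyFavoriteObject'>"
        "<String id='' namespace='' articleName='MyPrimitiveString'>"
        "ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT</String>"
        "<Integer id='' namespace='' articleName='MyPrimitiveInt'>4569343</Integer>"
        "</ComplexObject>"
    ),
}


def fetch(url):
    """Assumption: a real client would do an HTTP GET here."""
    return REMOTE[url]


def resolve(element, fetch=fetch):
    """Return the element with xlink references replaced by the linked elements."""
    url = element.get("xlink")
    if url is not None:
        # The link points at the element itself plus all of its children.
        element = ET.fromstring(fetch(url))
    # References may occur at any level, so resolve the children as well.
    for index, child in enumerate(list(element)):
        element[index] = resolve(child, fetch)
    return element


by_reference = ET.fromstring(
    "<ComplexObject "
    "xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml' "
    "articleName='MyFavoriteObject' />"
)
obj = resolve(by_reference)
print(obj.get("id"))             # accession_number123875
print(obj.find("Integer").text)  # 4569343
```

The sketch does not guard against a fetched element that itself carries an xlink attribute, or against cyclic links; a real client would need to handle both.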

Note that the link points to a ComplexObject and not just to
its children. The latter would also be an option, but then you would
get:

<ComplexObject xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject_content.xml'
  id='accession_number123875' namespace='' articleName='MyFavoriteObject' />

which points to:

	<String id='' namespace='' articleName='MyPrimitiveString'>
		ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
	</String>
	<Integer id='' namespace='' articleName='MyPrimitiveInt'>
		4569343
	</Integer>

If you have more than one child element, as in the example above, some
XML parsers might have problems with such a chunk. Although it is well
balanced, it doesn't have a (pseudo-)root element. So for practical
reasons I suggest having the links point to the element containing the
xlink attribute plus its children. This should be really easy to parse.
You either have an id + namespace attribute or you have an xlink
attribute. If id and namespace are absent, their values are simply
NULL. This doesn't require parsing dozens of WSRF tags in a header. It
doesn't even require a service to tell the client, using the
serviceNotes or something similar, that it did pass-by-reference of
some kind, nor does it require the client to specify that it can
understand certain references. Of course it would be handy to have
some extension to BioMoby Central to prevent discovering services
providing references which are incompatible with your client.
Secondly, if a service can provide multiple types of references and
your client doesn't understand them all, it would also be nice if a
client could specify a preference for a certain type of reference. But
neither would be required for a first quick implementation of
pass-by-reference.
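
The parsing argument can be illustrated with a small Python sketch using the standard xml.etree.ElementTree parser, chosen here only as one example of a strict XML parser. The helper names parses and describe are hypothetical. The first helper shows that the chunk without a root element is rejected, the second applies the "either id + namespace or xlink" rule, with missing attributes coming back as None (the NULL case).

```python
import xml.etree.ElementTree as ET

# The children-only chunk: well balanced, but without a root element.
ROOTLESS = (
    "<String id='' namespace='' articleName='MyPrimitiveString'>"
    "ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT</String>"
    "<Integer id='' namespace='' articleName='MyPrimitiveInt'>4569343</Integer>"
)
# The same children wrapped in the element that carried the xlink attribute.
ROOTED = (
    "<ComplexObject id='accession_number123875' namespace='' "
    "articleName='MyFavoriteObject'>" + ROOTLESS + "</ComplexObject>"
)


def parses(xml_text):
    """Return True if the text is accepted as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def describe(element):
    """Classify an element as passed by reference or by value."""
    url = element.get("xlink")
    if url is not None:
        return ("reference", url)
    # Absent id or namespace attributes simply come back as None.
    return ("value", element.get("id"), element.get("namespace"))


print(parses(ROOTLESS))                  # False
print(parses(ROOTED))                    # True
print(describe(ET.fromstring(ROOTED)))   # ('value', 'accession_number123875', '')
```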

> I would like to hear also Eddie's voice - because he knows how easy or hard
> it would be to make the way we decide to do the "data-by-reference" in
> Taverna (and I am still talking about the T1 which I expect to be supported
> for some time).

I agree we need Eddie's feedback on Taverna compatibility!  
Compatibility with Taverna 1 would be great, but it would be extremely  
lame if our BioMoby references turn out to be incompatible with the  
new pass-by-reference feature of Taverna 2.

I hope you see the potential for improving scalability of BioMoby  
services with pass-by-reference at any level of mobyData!

Cheers,

Pi

> Martin
>
> -- 
> Martin Senger
> email: martin.senger at gmail.com,m.senger at cgiar.org
> skype: martinsenger

-------------------------------------------------------------
Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034

Dreijenlaan 3
6703 HA Wageningen
The Netherlands

phone:  +31 (0)317-483 060
mobile: +31 (0)6-143 66 783
e-mail: pieter.neerincx at gmail.com
skype:  pieter.online
------------------------------------------------------------