[MOBY-dev] data by reference - a request for comments

Fri Jul 25 11:15:52 UTC 2008

Hi,

I'm not sure I fully get the point of the discussion here, but I'd 
like to jump in nevertheless.

What I don't understand is why we need references only for primitives. 
What is a reference to a primitive good for? I would like to have 
references for any type, if possible. Currently I am working on sending 
image information around, and some images are up to several MB. It 
would be far more convenient to send a reference to the HTTP URL of the 
image instead of encoding/decoding it and transferring it across 
networks.

If any type of object could be a reference, then we could stay with the 
current way of registering and discovering services....

It might also just be the case that I don't get the point of what Martin said ;-)

Cheers
andreas

Pieter Neerincx wrote:
> Hi Martin et al.,
>
> On 25 Jul 2008, at 2:06 AM, Martin Senger wrote:
>
>> Thank you for all your comments. We need to move on, however. I am,
>> almost impatiently, waiting for a conclusion because I need to
>> implement "data-by-reference" rather soon. Are there other people
>> still preparing their comments, so that we should wait for them, or
>> can I summarize what has been said so far and create an official
>> "RFC" from it?
>>
>> My personal opinion is still on the side of having references only
>> for primitives. I am not convinced by Pieter's vision of a huge
>> collection that is too big even when it uses references. I am simply
>> trying to find a solution that will be good for most cases, and then
>> deal with the marginal cases separately (for example by designing the
>> service differently).
>
> Well, especially if you want to have a solution that works in most 
> cases, I think we should have pass-by-reference at *any* level in 
> mobyData. How often will the data of a single primitive be too large 
> for inclusion in the SOAP body? I can imagine wrapping legacy data 
> formats like a complete EMBL/Genbank/DDBJ record in a String object or 
> sending images as base64-encoded pieces of text in a String object. 
> Maybe even sending a bulky sequence like a complete chromosome inside 
> a String object, but in most cases the raw data carried by a primitive 
> will be smaller than your average URL! And even if you do send some 
> bulky stuff inside a String object, as long as you send only one of 
> those at a time it'll still work in most cases. But once you start to 
> send tens, hundreds or more of those you have a problem. If you still 
> send around tens, hundreds or more URLs, you'd have to fetch each one 
> individually causing massive overhead. The reason to implement 
> pass-by-reference is to make BioMoby scale much better for big jobs 
> and in such cases fetching data from a single reference is much more 
> efficient.
>
> Take for example position information of alignments on a genome:
>
>   <moby:HitPosition moby:id='' moby:namespace='transcriptome' moby:articleName='hit_position'>
>     <moby:String moby:id='' moby:namespace='' moby:articleName='seq_id'> 13 </moby:String>
>     <moby:Integer moby:id='' moby:namespace='' moby:articleName='start'> 97419218 </moby:Integer>
>     <moby:Integer moby:id='' moby:namespace='' moby:articleName='stop'> 97419282 </moby:Integer>
>     <moby:Object moby:id='-' moby:namespace='' moby:articleName='strand'/>
>     <moby:String moby:id='' moby:namespace='' moby:articleName='cll'> 65M </moby:String>
>   </moby:HitPosition>
>
> With pass-by-reference only for primitives this would become something 
> like this:
>
>   <moby:HitPosition moby:id='' moby:namespace='transcriptome' moby:articleName='hit_position'>
>     <moby:String moby:id='' moby:namespace='' moby:articleName='seq_id'
>       xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_seq_id.xml'/>
>     <moby:Integer moby:id='' moby:namespace='' moby:articleName='start'
>       xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_start.xml'/>
>     <moby:Integer moby:id='' moby:namespace='' moby:articleName='stop'
>       xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_stop.xml'/>
>     <moby:Object moby:id='-' moby:namespace='' moby:articleName='strand'/>
>     <moby:String moby:id='' moby:namespace='' moby:articleName='cll'
>       xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_cll.xml'/>
>   </moby:HitPosition>
>
> That ain't much of an improvement!
>
> I have on average 30,000 features on a microarray and, for each oligo, 
> on average 3 hits on a reference assembly. That makes for an average 
> total of 90,000 HitPosition objects, and that is just a small part of 
> the annotation for my oligos....
>
> Just to stress that these are not hypothetical cases: the stuff above 
> is just a single example of what I've been using for more than two 
> years. Of course I needed a pass-by-reference workaround, because the 
> data is too big for the SOAP body. So I registered a URL object and 
> send those around. These URL objects point to chunks of BioMoby XML, 
> which in most cases are a complete mobyData block. Although the URL 
> object is fully compatible with the current BioMoby standard, it's an 
> ugly solution for two reasons:
> 1. It's not a standard way to do pass-by-reference.
> 2. It defeats the entire purpose of having the BioMoby object ontology 
> to improve automatic service discovery. You can only discover that I 
> provide several services which consume or produce URL objects, but you 
> cannot discover automatically what those URLs point to, so most of 
> the URL object producers and consumers will be incompatible! (OK, I 
> can use namespace restriction to limit the problem of incompatible 
> services a bit, but you would still have no idea what the URL points 
> to based on the data in BioMoby Central.)
>
> So I would love to see standardised pass-by-reference as part of the 
> BioMoby specs and I think it doesn't require rocket science to do this 
> at any level in the structure of BioMoby objects. Why don't we simply 
> do the following:
>
> Current situation:
>
> A BioMoby object is a BioMoby triple with an optional articleName 
> attribute and optionally raw character data for primitives. The 
> triple is the XML element name, an id attribute and a namespace 
> attribute.
>
> New situation:
> The above + the id and namespace attributes of a BioMoby triple can 
> be replaced with an xlink attribute, resulting in a "BioMoby double". 
> If the latter is the case, the element containing the xlink attribute 
> and all its children are available from the link specified by the 
> xlink attribute.
>
> Example old:
>
> <ComplexObject id='accession_number123875' namespace='' 
> articleName='MyFavoriteObject'>
>     <String id='' namespace='' articleName='MyPrimitiveString'>
>         ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
>     </String>
>     <Integer id='' namespace='' articleName='MyPrimitiveInt'>
>         4569343
>     </Integer>
> </ComplexObject>
>
> Example with pass-by-reference:
>
> <ComplexObject
>     xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml'
>     articleName='MyFavoriteObject' />
>
> which points to:
>
> <ComplexObject id='accession_number123875' namespace='' 
> articleName='MyFavoriteObject'>
>     <String id='' namespace='' articleName='MyPrimitiveString'>
>         ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
>     </String>
>     <Integer id='' namespace='' articleName='MyPrimitiveInt'>
>         4569343
>     </Integer>
> </ComplexObject>
>
> Note that the link points to a ComplexObject and not just to its 
> children. The latter would also be an option, but then you would get:
>
> <ComplexObject
>     xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject_content.xml'
>     id='accession_number123875' namespace='' articleName='MyFavoriteObject' />
>
> which points to:
>
>     <String id='' namespace='' articleName='MyPrimitiveString'>
>         ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
>     </String>
>     <Integer id='' namespace='' articleName='MyPrimitiveInt'>
>         4569343
>     </Integer>
>
> If you have more than one child element, as in the example above, 
> some XML parsers might have problems with such a chunk. Although it is 
> well balanced, it doesn't have a (pseudo-)root element. So for 
> practical reasons I suggest having the links point to the element 
> containing the xlink attribute and its children. This should be 
> really easy to parse. You either have id + namespace attributes or 
> you have an xlink attribute. If neither is present, the values are 
> NULL. This doesn't require parsing dozens of WSRF tags in a header. It 
> doesn't even require a service to tell the client, using the 
> serviceNotes or something similar, that it did pass-by-reference of 
> some kind, nor does it require the client to specify that it can 
> understand certain references. Of course it would be handy to 
> have some extension to BioMoby Central to prevent discovering services 
> providing references which are incompatible with your client. 
> Secondly, if a service can provide multiple types of references and 
> your client doesn't understand them all, it would also be nice if a 
> client could specify a preference for a certain type of reference. But 
> neither would be required for a first quick implementation of 
> pass-by-reference.
>
>> I would also like to hear Eddie's voice, because he knows how easy
>> or hard it would be to make whatever way we decide to do
>> "data-by-reference" work in Taverna (and I am still talking about T1,
>> which I expect to be supported for some time).
>
> I agree we need Eddie's feedback on Taverna compatibility! 
> Compatibility with Taverna 1 would be great, but it would be extremely 
> lame if our BioMoby references turn out to be incompatible with the 
> new pass-by-reference feature of Taverna 2.
>
> I hope you see the potential for improving scalability of BioMoby 
> services with pass-by-reference at any level of mobyData!
>
> Cheers,
>
> Pi
>
>> Martin
>>
>> -- 
>> Martin Senger
>> email: martin.senger at gmail.com,m.senger at cgiar.org
>> skype: martinsenger
>> _______________________________________________
>> MOBY-dev mailing list
>> MOBY-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/moby-dev
>
> -------------------------------------------------------------
> Wageningen University and Research centre (WUR)
> Laboratory of Bioinformatics
> Transitorium (building 312) room 1034
>
> Dreijenlaan 3
> 6703 HA Wageningen
> The Netherlands
>
> phone:  +31 (0)317-483 060
> mobile: +31 (0)6-143 66 783
> e-mail: pieter.neerincx at gmail.com
> skype:  pieter.online
> ------------------------------------------------------------
>
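
To make the parsing rule in Pieter's proposal above concrete (an element carries either id + namespace or an xlink attribute, and the link points to the element itself plus its children), here is a minimal Python sketch of a client-side resolver. It is only an illustration under some assumptions: the xlink attribute is read as a plain, unprefixed attribute as in the ComplexObject example, the `resolve` and `fake_fetch` names are made up, and a dictionary stands in for the web server so the sketch runs without a network.

```python
import xml.etree.ElementTree as ET

URL = "http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml"

# Stand-in for the web server: maps a URL to the XML chunk it would serve.
STORE = {
    URL: """<ComplexObject id='accession_number123875' namespace=''
        articleName='MyFavoriteObject'>
    <String id='' namespace='' articleName='MyPrimitiveString'>
        ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
    </String>
    <Integer id='' namespace='' articleName='MyPrimitiveInt'>
        4569343
    </Integer>
</ComplexObject>"""
}

fetched = []  # records every URL that had to be retrieved


def fake_fetch(url):
    """Hypothetical fetcher; a real client would do an HTTP GET here."""
    fetched.append(url)
    return STORE[url]


def resolve(element, fetch):
    """Return a copy of `element` with xlink references replaced by their targets."""
    url = element.get("xlink")
    if url is not None:
        # The link points to the element itself plus all its children,
        # so the fetched chunk has a single root and parses cleanly.
        return resolve(ET.fromstring(fetch(url)), fetch)
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    for child in element:
        # References may occur at any level, so children are resolved too.
        copy.append(resolve(child, fetch))
    return copy


reference = ET.fromstring(
    "<ComplexObject xlink='" + URL + "' articleName='MyFavoriteObject'/>"
)
obj = resolve(reference, fake_fetch)

print(obj.get("id"))                      # accession_number123875
print(obj.find("Integer").text.strip())   # 4569343
print(len(fetched))                       # 1
```

The whole object arrives with a single fetch. With references on primitives only, the same resolver would need one fetch per referenced primitive. In a real client the fetcher could be something like `urllib.request.urlopen(url).read()`; a guard against links that point back to themselves is left out here.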

-- 
=====================================================
  Dipl. Bioinf. Andreas Groscurth
  Bioinformatics Software Developer
  Plant Computational Biology group
  Max-Planck Institute for plant breeding research
  Carl-von-Linne Weg 10
  50829 Cologne
  Germany
  +49(0) 221 5062449
=====================================================