[MOBY-dev] data by reference - a request for comments
Pieter Neerincx
pieter.neerincx at gmail.com
Fri Jul 25 10:50:12 UTC 2008
Hi Martin et al.,
On 25 Jul 2008, at 2:06 AM, Martin Senger wrote:
> Thank you for all your comments. We need to move on, however. I am,
> almost impatiently, waiting for a conclusion because I need to
> implement the "data-by-reference" rather soon. Are there other people
> preparing/building their comments - so we should wait for them, or
> can I summarize what was said so far - and create an official "RFC"
> from it?
>
> My personal opinion is still on the side to have references only for
> primitives. I am not convinced by the Pieter's vision of a huge
> collection that is too big even when it uses references. I am simply
> trying to find a solution that will be good for the most cases - and
> then deal with the marginal cases separately (like by designing the
> service differently).
Well, especially if you want a solution that works in most cases, I
think we should have pass-by-reference at *any* level in mobyData.
How often will the data of a single primitive be too large for
inclusion in the SOAP body? I can imagine wrapping legacy data formats
like a complete EMBL/GenBank/DDBJ record in a String object, or
sending images as base64-encoded text in a String object, maybe even
sending a bulky sequence like a complete chromosome inside a String
object. But in most cases the raw data carried by a primitive will be
smaller than your average URL! And even if you do send some bulky
stuff inside a String object, as long as you send only one of those at
a time it'll still work in most cases. Once you start to send tens,
hundreds or more of those, you have a problem. If you send tens,
hundreds or more URLs instead, you have to fetch each one
individually, causing massive overhead. The reason to implement
pass-by-reference is to make BioMoby scale much better for big jobs,
and in such cases fetching the data from a single reference is much
more efficient.
Take for example position information of alignments on a genome:
<moby:HitPosition moby:id=''
    moby:namespace='transcriptome' moby:articleName='hit_position'>
  <moby:String moby:id='' moby:namespace=''
      moby:articleName='seq_id'> 13 </moby:String>
  <moby:Integer moby:id='' moby:namespace=''
      moby:articleName='start'> 97419218 </moby:Integer>
  <moby:Integer moby:id='' moby:namespace=''
      moby:articleName='stop'> 97419282 </moby:Integer>
  <moby:Object moby:id='-' moby:namespace=''
      moby:articleName='strand'/>
  <moby:String moby:id='' moby:namespace=''
      moby:articleName='cll'> 65M </moby:String>
</moby:HitPosition>
With pass-by-reference only for primitives this would become something
like this:
<moby:HitPosition moby:id=''
    moby:namespace='transcriptome' moby:articleName='hit_position'>
  <moby:String moby:id='' moby:namespace=''
      moby:articleName='seq_id'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_seq_id.xml'/>
  <moby:Integer moby:id='' moby:namespace=''
      moby:articleName='start'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_start.xml'/>
  <moby:Integer moby:id='' moby:namespace=''
      moby:articleName='stop'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/Integer_stop.xml'/>
  <moby:Object moby:id='-' moby:namespace=''
      moby:articleName='strand'/>
  <moby:String moby:id='' moby:namespace=''
      moby:articleName='cll'
      xlink='http://www.mydomain.org/biomoby/tmp/job1239573/String_cll.xml'/>
</moby:HitPosition>
That ain't much of an improvement!
I have on average 30,000 features on a microarray and for each oligo
on average 3 hits on a reference assembly. That makes for an average
total of 90,000 HitPosition objects, and that is just a small part of
the annotation for my oligos....
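To put rough numbers on this, here is a small Python sketch. The feature
and hit counts are the averages quoted above; the figure of four
references per HitPosition is read off the primitives-only example
(seq_id, start, stop and cll carry an xlink, strand stays inline). That
the whole collection could sit behind a single link is an assumption made
only for the comparison.

```python
# Averages quoted in this message.
features_per_array = 30_000
hits_per_oligo = 3
hit_positions = features_per_array * hits_per_oligo

# From the primitives-only HitPosition example: seq_id, start, stop, cll.
referenced_primitives_per_hit = 4


def fetches_primitives_only(n_objects: int, refs_per_object: int) -> int:
    """Each referenced primitive needs its own HTTP fetch."""
    return n_objects * refs_per_object


def fetches_single_reference() -> int:
    """Assumption: the enclosing element carries one link for everything."""
    return 1


print(hit_positions)
print(fetches_primitives_only(hit_positions, referenced_primitives_per_hit))
print(fetches_single_reference())
```

This counts requests only; it says nothing about payload size or latency,
which would have to be measured on a real service.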
Just to stress that these are not hypothetical cases: the HitPosition
data above is just a single example of what I have been using for more
than two years. Of course I needed a pass-by-reference workaround,
because the data is too big for the SOAP body. So I registered a URL
object and send those around. These URL objects point to chunks of
BioMoby XML, which in most cases is a complete mobyData block. Although
the URL object is fully compatible with the current BioMoby standard,
it's an ugly solution for two reasons:
1. It's not a standard way to do pass-by-reference.
2. It defeats the entire purpose of having the BioMoby object ontology
to improve automatic service discovery. You can only discover that I
provide several services which consume or produce URL objects, but you
cannot discover automatically what those URLs point to, so most of
the URL object producers and consumers will be incompatible! (OK, I
can use namespace restriction to limit the problem of incompatible
services a bit, but you would still have no idea what the URL points
to based on the data in BioMoby Central.)
So I would love to see standardised pass-by-reference as part of the
BioMoby specs and I think it doesn't require rocket science to do this
at any level in the structure of BioMoby objects. Why don't we simply
do the following:
Current situation:
A BioMoby object is a BioMoby triple with an optional articleName
attribute and, for primitives, optional raw character data. The
triple is the XML element name, an id attribute and a namespace
attribute.
New situation:
The above, plus: the id and namespace attributes of a BioMoby triple
can be replaced with an xlink attribute, resulting in a "BioMoby
double". In that case the element containing the xlink attribute
and all its children are available from the link specified by the
xlink attribute.
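A client-side resolver for this rule could look roughly like the Python
sketch below. It is not an existing BioMoby API: the function name
resolve_references, the injected fetch callable and the in-memory STORE
are all hypothetical, and the un-namespaced xlink attribute is taken
literally from this proposal. The stored document is the ComplexObject
used in the example that follows.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def resolve_references(element: ET.Element,
                       fetch: Callable[[str], str]) -> ET.Element:
    """Return a copy of `element` with xlink references replaced by content.

    `fetch` (hypothetical) maps a URL to the XML text stored there. No
    guard against circular links is included in this sketch.
    """
    link = element.get("xlink")
    if link is not None:
        target = ET.fromstring(fetch(link))
        # The link must point to the element itself, not only its children.
        if target.tag != element.tag:
            raise ValueError(f"{link} returned <{target.tag}>, "
                             f"expected <{element.tag}>")
        resolved = resolve_references(target, fetch)
        resolved.tail = element.tail
        return resolved
    # Inline element: copy it and resolve its children recursively.
    resolved = ET.Element(element.tag, dict(element.attrib))
    resolved.text = element.text
    resolved.tail = element.tail
    for child in element:
        resolved.append(resolve_references(child, fetch))
    return resolved


# Stand-in for an HTTP server, so the sketch runs without network access.
URL = "http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml"
STORE = {
    URL: "<ComplexObject id='accession_number123875' namespace='' "
         "articleName='MyFavoriteObject'>"
         "<String id='' namespace='' articleName='MyPrimitiveString'>"
         "ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT</String>"
         "<Integer id='' namespace='' articleName='MyPrimitiveInt'>"
         "4569343</Integer>"
         "</ComplexObject>"
}

message = ET.fromstring(
    f"<mobyData><ComplexObject xlink='{URL}' "
    "articleName='MyFavoriteObject'/></mobyData>"
)
resolved = resolve_references(message, STORE.__getitem__)
print(ET.tostring(resolved, encoding="unicode"))
```

Because the same function handles every element, a reference can sit on a
primitive, on a complex object or on a whole collection without any
special cases.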
Example old:
<ComplexObject id='accession_number123875' namespace=''
    articleName='MyFavoriteObject'>
  <String id='' namespace='' articleName='MyPrimitiveString'>
    ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
  </String>
  <Integer id='' namespace='' articleName='MyPrimitiveInt'>
    4569343
  </Integer>
</ComplexObject>
Example with pass-by-reference:
<ComplexObject xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject.xml'
    articleName='MyFavoriteObject' />
which points to:
<ComplexObject id='accession_number123875' namespace=''
    articleName='MyFavoriteObject'>
  <String id='' namespace='' articleName='MyPrimitiveString'>
    ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
  </String>
  <Integer id='' namespace='' articleName='MyPrimitiveInt'>
    4569343
  </Integer>
</ComplexObject>
Note that the link points to a ComplexObject and not just to
its children. The latter would also be an option, but then you would
get:
<ComplexObject xlink='http://www.mydomain.org/biomoby/tmp/job1239573/ComplexObject_content.xml'
    id='accession_number123875' namespace=''
    articleName='MyFavoriteObject' />
which points to:
<String id='' namespace='' articleName='MyPrimitiveString'>
  ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT
</String>
<Integer id='' namespace='' articleName='MyPrimitiveInt'>
  4569343
</Integer>
If you have more than one child element, as in the example above,
some XML parsers might have problems with such a chunk. Although it is
well-balanced, it doesn't have a (pseudo-)root element. So for
practical reasons I suggest having the links point to the element
containing the xlink attribute and its children. This should be
really easy to parse: you either have an id + namespace attribute or
you have an xlink attribute. If both are absent, their values are
NULL. This doesn't require parsing dozens of WSRF tags in a header. It
doesn't even require a service to tell the client, using the
serviceNotes or something similar, that it did pass-by-reference of
some kind, nor does it require the client to specify that it can
understand certain references. Of course it would be handy to
have some extension to BioMoby Central to prevent discovering services
providing references which are incompatible with your client.
Secondly, if a service can provide multiple types of references and
your client doesn't understand them all, it would also be nice if a
client could specify a preference for a certain type of reference. But
neither would be required for a first quick implementation of
pass-by-reference.
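Both points, the rootless chunk and the simple parsing rule, can be
illustrated with Python's standard xml.etree.ElementTree. The helper
names describe and parses_as_document are made up for this sketch, and
other parsers may treat rootless chunks differently.

```python
import xml.etree.ElementTree as ET

# Link target holding only the children: two top-level elements, no root.
ROOTLESS = (
    "<String id='' namespace='' articleName='MyPrimitiveString'>"
    "ATTGCGCGCTAGAGTGCGGGTGTGCAAACCGGTGT</String>"
    "<Integer id='' namespace='' articleName='MyPrimitiveInt'>"
    "4569343</Integer>"
)

# Link target holding the element itself together with its children.
ROOTED = (
    "<ComplexObject id='accession_number123875' namespace='' "
    "articleName='MyFavoriteObject'>" + ROOTLESS + "</ComplexObject>"
)


def parses_as_document(text: str) -> bool:
    """True if the text is accepted as one well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def describe(element: ET.Element):
    """Apply the rule: either an xlink, or id + namespace (None if absent)."""
    link = element.get("xlink")
    if link is not None:
        return ("reference", link)
    return ("inline", (element.get("id"), element.get("namespace")))


print(parses_as_document(ROOTLESS))
print(parses_as_document(ROOTED))
print(describe(ET.fromstring("<ComplexObject xlink='http://www.mydomain.org"
                             "/biomoby/tmp/job1239573/ComplexObject.xml' "
                             "articleName='MyFavoriteObject'/>")))
print(describe(ET.fromstring(ROOTED)))
```

A client could work around a rootless chunk by wrapping it in a dummy
element before parsing, but pointing the link at the element itself
avoids the need for that.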
> I would like to hear also Eddie's voice - because he knows how easy
> or hard it would be to make the way we decide to do the
> "data-by-reference" in Taverna (and I am still talking about the T1
> which I expect to be supported for some time).
I agree we need Eddie's feedback on Taverna compatibility!
Compatibility with Taverna 1 would be great, but it would be extremely
lame if our BioMoby references turn out to be incompatible with the
new pass-by-reference feature of Taverna 2.
I hope you see the potential for improving scalability of BioMoby
services with pass-by-reference at any level of mobyData!
Cheers,
Pi
> Martin
>
> --
> Martin Senger
> email: martin.senger at gmail.com,m.senger at cgiar.org
> skype: martinsenger
> _______________________________________________
> MOBY-dev mailing list
> MOBY-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/moby-dev
-------------------------------------------------------------
Wageningen University and Research centre (WUR)
Laboratory of Bioinformatics
Transitorium (building 312) room 1034
Dreijenlaan 3
6703 HA Wageningen
The Netherlands
phone: +31 (0)317-483 060
mobile: +31 (0)6-143 66 783
e-mail: pieter.neerincx at gmail.com
skype: pieter.online
------------------------------------------------------------