[MOBY-dev] RFC - Synchronization of Biomoby secondary repositories

Heiko Schoof schoof at mpiz-koeln.mpg.de
Thu Nov 30 09:39:46 UTC 2006


Hi Mark, Eddie,

We already use the RDF agent. From the RSS we intend to pull mainly
the signature URLs; we then propose to use the RDF agent to get all
the data.
---quote---
The RSS contains the signature URL where the secondary picks up
the service RDF to retrieve all details required for the
registration using the existing RDF agent.
---/quote---

The advantages of the RSS over the API call that retrieves ALL
signature URLs are:
-scalability: if there are 1000s of signature URLs, with the RSS we
only retrieve the changes
-filtering: we can filter on the data already in the RSS, with no
need to actually retrieve the service RDF; this should improve
filtering performance, as it takes one request instead of potentially
hundreds and avoids parsing all those RDF documents.
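
To make the two points concrete, here is a minimal Python sketch of the
secondary's side. The feed layout is an assumption for illustration
only (signature URL in <link>, authority in dc:publisher, change time
in dc:date); the real Biomoby feed may use different fields, and
signature_urls_to_fetch is a hypothetical helper, not an existing API.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

DC = "{http://purl.org/dc/elements/1.1/}"

# Assumed feed layout, for illustration only: <link> holds the signature
# URL, dc:publisher the authority, dc:date the time of the change.
SAMPLE_FEED = """\
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>OldService</title>
      <link>http://example.org/rdf/OldService</link>
      <dc:publisher>example.org</dc:publisher>
      <dc:date>2006-11-20T08:00:00+00:00</dc:date>
    </item>
    <item>
      <title>NewService</title>
      <link>http://example.org/rdf/NewService</link>
      <dc:publisher>example.org</dc:publisher>
      <dc:date>2006-11-29T12:00:00+00:00</dc:date>
    </item>
    <item>
      <title>TestService</title>
      <link>http://test.example.org/rdf/TestService</link>
      <dc:publisher>test.example.org</dc:publisher>
      <dc:date>2006-11-29T13:00:00+00:00</dc:date>
    </item>
  </channel>
</rss>
"""


def signature_urls_to_fetch(feed_xml, last_sync, keep_authority):
    """Return signature URLs of items changed after last_sync that pass the filter."""
    urls = []
    for item in ET.fromstring(feed_xml).iter("item"):
        changed = datetime.fromisoformat(item.findtext(DC + "date"))
        authority = item.findtext(DC + "publisher")
        # Scalability: skip everything we already saw at the last sync.
        # Filtering: decide from feed data alone, before fetching any RDF.
        if changed > last_sync and keep_authority(authority):
            urls.append(item.findtext("link"))
    return urls


last_sync = datetime(2006, 11, 28, tzinfo=timezone.utc)
urls = signature_urls_to_fetch(
    SAMPLE_FEED, last_sync, lambda authority: not authority.startswith("test.")
)
print(urls)
```

Only the URLs returned here would then be handed to the RDF agent, so
the number of RDF retrievals tracks the number of relevant changes
rather than the size of the registry.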

However, for initialization from scratch, the method you describe
indeed makes the most sense; we'll modify the RFC accordingly. Is
there a Biomoby wiki where we can post it?

Do you intend to come "work" at the MPIZ next week? If yes, when? I'm  
free Thursday afternoon and most of Friday.

Best, Heiko

On 29 Nov 2006, at 18:31, Mark Wilkinson wrote:

> Hi Andreas!
>
> Thanks for taking the time to put this document together.  Using the
> RSS feed is an interesting idea.  My first instinct is that it might
> not be "robust" enough, but I suppose if we spent more time thinking
> about what information is passed on that RSS feed it might work quite
> well!
>
> Have you considered taking advantage of the recent move towards
> distributed service signatures?  The RDF Agent is capable of
> consuming a list of URLs, recovering the RDF signatures from those
> URLs, and rebuilding the entire registry from those RDF documents.
> It is also a simple API call to MOBY Central that generates the list
> of URLs representing all of the service signatures.  As such, a full
> mirroring operation should require nothing more than a single call
> to the primary MOBY Central, and passing the result of that call to
> the RDF agent of the mirror site and letting it run... Eddie, correct
> me if that isn't true...
>
> I'm going to be at your institute this time next week, so let's talk
> about it more in person :-)
>
> Best wishes!
>
> Mark
>
>
>
> On Wed, 2006-11-29 at 13:02 +0100, Andreas Groscurth wrote:
>> The following text describes the procedure of the synchronization of
>> Biomoby secondary repositories.
>>
>> Aim: Replicate BioMoby central
>> -to create mirrors
>> -to have redundancy in case of failure
>> -to create private sets of services, either filtered from the global
>> set (fewer services) or added to the global set (more services)
>>
>> Problems:
>> -synchronizing repositories
>> -cascading service/object registration requests
>> -populating a Moby central from scratch
>>
>> Solutions:
>> -The existing RSS feed is used to notify secondaries of changes
>> (register service/delete service/update service) to the master
>> -A complete RSS document is created by a new dump method for
>> initialization of Moby centrals from scratch
>> -Registrations are handled by the client and NOT cascaded
>>
>> 1. Synchronizing repositories
>> =============================
>>
>> We propose that secondaries check the Biomoby RSS feed to be
>> notified of changes to registrations. Currently the RSS feed is
>> updated once a day; for more rapid synchronization this would have
>> to be changed.
>> The changes include registration, modification or deletion of a
>> service/object. Any change applied to the Biomoby Central registry
>> is then applied to the secondary.
>> The RSS contains the signature URL where the secondary picks up
>> the service RDF to retrieve all details required for the
>> registration using the existing RDF agent.
>>
>> i) Problems/changes required:
>>
>> The main question here is whether unregistered services are deleted
>> completely from the central database or only marked as inactive.
>> This matters because the feed would also need to contain the
>> information about a deleted service, so that the secondaries can
>> retrieve it. So Moby central will have to keep a full transaction
>> log that includes deletions.
>>
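
A rough Python sketch of how a secondary might apply such a
transaction log, to show why deletions have to appear in the feed.
The event fields and the fetch_signature callable (a stand-in for the
RDF agent) are assumptions made for this example, not part of any
existing Biomoby API.

```python
def apply_changes(registry, events, fetch_signature):
    """Apply feed events to a secondary's registry (a plain dict here)."""
    for event in events:
        key = (event["authority"], event["name"])
        if event["action"] == "delete":
            # Only possible if the master still reports the deleted service.
            registry.pop(key, None)
        elif event["action"] in ("register", "update"):
            # Hypothetical hook: the RDF agent resolves the signature URL.
            registry[key] = fetch_signature(event["signature_url"])
        else:
            raise ValueError(f"unknown action: {event['action']!r}")
    return registry


def fake_rdf_agent(url):
    """Stand-in for the RDF agent: returns a fake record, fetches nothing."""
    return {"signature_url": url}


registry = {
    ("example.org", "OldService"): {
        "signature_url": "http://example.org/rdf/OldService"
    }
}
events = [
    {"action": "delete", "authority": "example.org",
     "name": "OldService", "signature_url": None},
    {"action": "register", "authority": "example.org",
     "name": "NewService",
     "signature_url": "http://example.org/rdf/NewService"},
]
apply_changes(registry, events, fake_rdf_agent)
print(sorted(registry))
```

Without the "delete" event the secondary would keep OldService
indefinitely, which is the motivation for the full transaction log.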
>> 2. Filtering
>> ============
>>
>> We propose that any secondary can apply filters to the RSS feed and
>> thus only include a subset of all services/objects. This can be
>> useful to make finding services in lists easier, to tune workflows
>> to well-performing services, to use only local services or to
>> exclude test services. Information relevant to filtering, like
>> authority and description, is in the RSS. If more turns out to be
>> relevant, filtering may need to happen at the level of the service
>> RDF.
>>
>> 3. Private services
>> ===================
>>
>> We propose that any client can register services with a Moby central
>> secondary, these will then be available only to clients querying the
>> secondary. If the secondary is in a local network, this allows easy
>> access control to local services. Any secondary synchronizing to that
>> repository will of course inherit all those additional services,
>> allowing simple creation of local production Moby centrals and local
>> test Moby centrals.
>>
>> 4. Registration
>> ===============
>>
>> We propose to NOT cascade registration requests, i.e. pass them on
>> from secondary to master. That means that the client has control over
>> where a registration is done but also means the client has to make
>> that choice. Registration clients must thus add an implementation
>> that allows a user to choose the Moby central where a service/object
>> should be registered. Registration always happens at the topmost Moby
>> central node where the service should be visible, all secondaries of
>> this Moby central will pick that service up by synchronization.
>>
>> Why? Cascading registration is cumbersome, as name duplications etc.
>> can only be resolved once a registration request has reached the
>> topmost node, and the result must then be passed back to the client.
>>
>> Name conflicts can still occur with locally registered services.
>> E.g., Adam registers a private service AnalyseThis on a private
>> secondary. Later, Beth registers AnalyseThis with the same authority
>> on the Moby central master. The private secondary picks this up from
>> the RSS and runs into a name duplication. Proposed solution: Local
>> registrations MUST ALWAYS use a local authority. E.g., Adam registers
>> AnalyseThis with authority InternalIP, and Beth registers AnalyseThis
>> with authority paul_vitti.com. Beyond that, we assume whoever
>> registers a service at a more global Moby central knows what they're
>> doing and give synchronization precedence over local registrations.
>> E.g., a test registry is a secondary of Moby central. Chris registers
>> AnalyseThat with authority paul_vitti.com in the test registry. Once
>> he's happy with testing, he registers AnalyseThat with authority
>> paul_vitti.com in Moby central. The test registry retrieves this from
>> the RSS, discards the local registration and overwrites it with the
>> registration picked up through the RSS.
>>
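
The precedence rule can be sketched in a few lines of Python. This is
only an illustration under assumptions: registries are modelled as
dicts keyed by (authority, name), and synchronize is a hypothetical
helper, not an existing function.

```python
def synchronize(local_registry, upstream_registry):
    """Merge upstream registrations into a secondary; upstream wins on conflict."""
    # Keys present on both sides are the local registrations that get discarded.
    overwritten = sorted(key for key in upstream_registry if key in local_registry)
    # Later entries win in a dict merge, so upstream takes precedence.
    merged = {**local_registry, **upstream_registry}
    return merged, overwritten


local = {
    ("InternalIP", "AnalyseThis"): "Adam's private service",
    ("paul_vitti.com", "AnalyseThat"): "Chris's test registration",
}
upstream = {
    ("paul_vitti.com", "AnalyseThis"): "Beth's public service",
    ("paul_vitti.com", "AnalyseThat"): "Chris's public registration",
}
merged, overwritten = synchronize(local, upstream)
print(overwritten)
print(len(merged))
```

Adam's service survives because its local authority keeps its key
distinct from Beth's; Chris's test registration is the only entry
replaced by the synchronized one.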
>> 5. Moby central failure
>> =======================
>>
>> If a master Moby central fails, the secondaries continue normal
>> operation with no effect on service discovery for all clients keyed
>> to a secondary. However, registration is no longer possible at the
>> master node. Once the master node comes back up, all secondaries must
>> resync.
>>
>> 6. Adaptations to the RSS
>> =========================
>>
>> For this procedure the current RSS feed has to be changed slightly,
>> on the one hand to enable correct notification of the secondaries,
>> on the other hand to ensure that normal RSS readers still work the
>> usual way. The current RSS feed mainly uses Dublin Core metadata
>> to provide the information, so adding information to the feed only
>> requires adding more Dublin Core metadata.
>>
>> Primarily the feed has to state whether the service is new,
>> modified or deleted. Additionally the service RDF has to be linked
>> in the feed to enable the local RDF agent to apply the changes
>> described in the service RDF to the local secondary.
>> Whether further information should be added to the feed to provide
>> more possibilities for filtering the services can be discussed.
>>
>> 7. Resync
>> =========
>>
>> Another main aspect is what happens if a repository is out of sync
>> (e.g. due to a temporary failure of master or secondary). The RSS
>> feed has a limited length, which means it contains a limited number
>> of transactions. Possibly it will not contain all transactions
>> since the last sync of a secondary.
>>
>>
>> i) Solution
>> We propose that each repository stores a time stamp of the last
>> synchronization. If, in the next synchronization process, the
>> oldest changes in the feed are newer than the stored sync time
>> stamp of the repository, we run the risk of not receiving all
>> information about service changes. In this case the secondary should
>> be able to ask the primary to create an RSS feed with all changes
>> that have happened since the stored time stamp of the secondary.
>>
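
The resync decision can be sketched in Python as follows. The feed no
longer reaches back to the last sync when its oldest entry is more
recent than the stored time stamp; only then is the catch-up request
needed. The function names and the two plan labels are assumptions
made up for this example.

```python
from datetime import datetime, timezone


def feed_covers_gap(last_sync, feed_timestamps):
    """True if the feed reaches back at least to the last synchronization."""
    # An empty feed is treated conservatively as "does not cover the gap".
    return bool(feed_timestamps) and min(feed_timestamps) <= last_sync


def plan_sync(last_sync, feed_timestamps):
    """Choose between the regular feed and a catch-up request to the primary."""
    if feed_covers_gap(last_sync, feed_timestamps):
        return "incremental"
    return "request-changes-since"


feed = [
    datetime(2006, 11, 27, tzinfo=timezone.utc),
    datetime(2006, 11, 29, tzinfo=timezone.utc),
]
print(plan_sync(datetime(2006, 11, 28, tzinfo=timezone.utc), feed))
print(plan_sync(datetime(2006, 11, 25, tzinfo=timezone.utc), feed))
```

In the second call the secondary last synced on 25 Nov but the feed
starts on 27 Nov, so transactions in between may have dropped off the
feed and the secondary should ask for all changes since its time stamp.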
>> 8. Initial load
>> ===============
>>
>> When populating a new secondary from scratch, all registered
>> services/objects need to be received from the master Moby central.
>> We propose a new method in Moby central to request all registered
>> services/objects as RSS. Then, the initialization proceeds exactly
>> like a synchronization.
>>
>>
>>
>> So to kick off the discussion here are some of our questions:
>>
>> 1. Is it reasonable to use the existing RSS feed for this procedure?
>> It sounds very handy and avoids creating a similar, completely new
>> structure.
>>
>> 2. Does any structure keep track of deleted services?
>>
>> 3. Resync: Is it reasonable to timestamp all transactions in Moby
>> central? Or should we solve the resync issue by enforcing a full
>> drop/emptying of the secondary and reloading all data as in the
>> initial load?
>>
>>
>> Thanks
>> Heiko & Andreas
>>
>> -- 
>> Andreas Groscurth
>> Diplom Bioinformatik - PhD Student
>> Max Planck Institute for Plant Breeding Research
>> Carl-von-Linné-Weg 10
>> 50829 Cologne
>> Germany
>> E-mail:    groscurt at mpiz-koeln.mpg.de
>> Phone:    +49(0)221-5062-447
>>
>> _______________________________________________
>> MOBY-dev mailing list
>> MOBY-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/moby-dev
> -- 
> Mark Wilkinson
> Asst. Professor, Dept. of Medical Genetics
> University of British Columbia
> PI in Bioinformatics, iCAPTURE Centre
> St. Paul's Hospital, Rm. 166, 1081 Burrard St.
> Vancouver, BC, V6Z 1Y6
> tel: 604 682 2344 x62129
> fax: 604 806 9274
>
> "Scientists would rather share their toothbrush than their data"
>                                         - Carole Goble
>
>




