Replication Notes ----------------- See: http://mule1.dataone.org/ArchitectureDocs-current/design/UseCases/09_uc.html Replication of objects should be queued at the time the event occurs (CREATE, UPDATE, DELETE), and also on a scheduled basisin case event-based replication misses an event. However, this hinges on how frequently the CN becomes aware of an event. * Will MNs notify a CN of an event? * MN can inform a CN, but not implemented yet * Will CNs poll MNs for logged events only? If so, what frequency? * Runs in a crontab-like function using quartz, every 2 minutes currently * Frequency is configurable by a member node (see the element in the node class) Replication Authorization: NOTE: all references to "token" below should really be "session" B = receiving member node (where the content is to be copied to) A = providing member node (where the content is currently located) * How will replication be authorized? Node-level certificate/SSL handshake? * CN makes replication request * Receiving node needs to know where the it originates from * CN must request both ends of the replication request * Create a group with all of the Tier 4 member nodes * need to add MNReplication.allowReplicate() on the replication node, called by the source node * pass in pid for object, target Subject * * XXX0. CN calls A.allowReplicate(token, subject, pid) * * 1. CN calls B.replicate(CNtoken, A, pid) * 2. B calls A.getReplica(Btoken, pid) * 3. A calls CNReplication.isReplicationAuthorized(Atoken, subjectFromBtoken, pid, Permission.execute) * other possible names: isSubjectAuthorized(), isReplicationTarget(), isReplicationAuthorized()... * 4. if (yes) A returns data to B * 5. B calls CN.setReplicationStatus * add in 'REPLICATE' permission (actually just use Permission.execute) * Replication get() logging should be logged differently than a regular get() * Should evaluate the Subject from the certificate. To evaluate, ReplicationService needs to track Subjects for each ReplicationTask it queues. In IRC, we decided Types.Node needs to be modified to add a Node.Subject field, which is the value used to compare to. * We'll need two Hazelcast replication structures, and will also rely on a shared Synchronization/Replication Object Lock: * hzReplicationTaskQueue (this will be polled) * hzPendingReplicationTaskMap (this contains replication tasks that have been initiated, but not completed, and will be the source for finding ReplicationTasks that match a given Node.Subject) * hzObjectLock (a shared structure used to lock pids during SystemMetadata updates) * Possibly pass in a key (random string/subject string) that repl nodes verify Event-based replication for a given object: On create() or update(), check the SystemMetadata's ReplicationPolicy for: * numberOfReplicas < count(CN.listNodes()) * Nodes listed in preferred list exist Rank the target replication nodes based on: * SystemMetadata for the given object with it's: * ReplicationPolicy * blocked list * preferred list * required list (is this defined?) * number of replicas as a minimum * replication allowed * (also, how to handle replicas that are externally maintained; set replication allowed to false, and min replicas to 0) * * Decision: use only preferred and blocked lists; try to put a replica on all preferred nodes, plus any additional nodes needed to hit the min replicas * preferred > minimum : hit all preferred nodes, even if greater than minimum. * preferred < minimum : meet preferred and fill with more * preferred == minimum : stop at preferred & minimum * preferred - down < minimum : meet all available preferred and fill with extra * Replica list (location, status) * Node rejection of a replication event * Once rejected, always rejected? Depends on the exception * Initiate replication from the top of the list until the minimum number of replicas is created * Need an internal timeout after the replication is initiated for a given node * Create a ReplicationFailed exception * messages: * Replication denied * Replication timed out * ... Node-based polling: * If a node drops, use it's object list to evaluate whether or not each object's replication policy is fulfilled. If not, queue the object for replication * This requires a short list of objects on a given node. This may require a filter on the CN listObjects() method. We would need to pass in a NodeReference to get a list of objects on the given node. Question: Can we add a filter parameter to CNRead.listObjects() that filters by node id so we can produce that list efficiently? * Query solr index via mercury * Query metacat systemmetadata table Question about ReplicationPolicy implementation: * Should we replicate to numberOfReplicas only, or numberOfReplicas + 2 (or 3)? But not more? - Rationale: since the more replicas there are, if a node drops, there is significant overhead on the CNs to update all of the references to the replica node * initially stick with numberOfReplicas * How long is too long for a node being down to initiate replication of all of the objects that were stored on that node? One week of downtime? * We need a notification step (email to the node manager) * Downtime for maintenance, a few hours, a few days * resolve() need to be updated to consider the status of each node