/20110811-replication-notes

Replication Notes
-----------------
See: http://mule1.dataone.org/ArchitectureDocs-current/design/UseCases/09_uc.html

Replication of objects should be queued at the time the event occurs (CREATE, UPDATE, DELETE), and also on a scheduled basis in case event-based replication misses an event. However, this hinges on how frequently the CN becomes aware of an event.

Will MNs notify a CN of an event?
- MN can inform a CN, but not implemented yet
Will CNs poll MNs for logged events only? If so, what frequency?
- Runs in a crontab-like function using quartz, every 2 minutes currently
- Frequency is configurable by a member node (see the <synchronization> element in the node class)

Replication Authorization:
NOTE: all references to "token" below should really be "session"
B = receiving member node (where the content is to be copied to)
A = providing member node (where the content is currently located)

How will replication be authorized? Node-level certificate/SSL handshake?
- CN makes replication request
- Receiving node needs to know where the request originates from
- CN must request both ends of the replication request
- Create a group with all of the Tier 4 member nodes
- need to add MNReplication.allowReplicate() on the replication node, called by the source node
  - pass in pid for object, target Subject
  - ~~XXX0. CN calls A.allowReplicate(token, subject, pid)~~
  - 1. CN calls B.replicate(CNtoken, A, pid)
  - 2. B calls A.getReplica(Btoken, pid)
    - 3. A calls CNReplication.isReplicationAuthorized(Atoken, subjectFromBtoken, pid, Permission.execute)
      - other possible names: isSubjectAuthorized(), isReplicationTarget(), isReplicationAuthorized()...
    - 4. if (yes) A returns data to B
    - 5. B calls CN.setReplicationStatus
    - ~~add in 'REPLICATE' permission~~ (actually just use Permission.execute)
- Replication get() calls should be logged differently than a regular get()
- The Subject should be evaluated from the certificate. To do this, ReplicationService needs to track the Subject for each ReplicationTask it queues. In IRC, we decided Types.Node needs to be modified to add a Node.Subject field, which is the value to compare against.
- We'll need two Hazelcast replication structures, and will also rely on a shared Synchronization/Replication Object Lock:
  - hzReplicationTaskQueue (this will be polled)
  - hzPendingReplicationTaskMap (this contains replication tasks that have been initiated, but not completed, and will be the source for finding ReplicationTasks that match a given Node.Subject)
  - hzObjectLock (a shared structure used to lock pids during SystemMetadata updates)
- ~~Possibly pass in a key (random string/subject string) that repl nodes verify~~
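
The five-step exchange above can be sketched as a small in-memory simulation. This is only an illustration under assumptions: the class names, the Python method names, and the "completed" status string are hypothetical stand-ins for the CN/MN API calls listed above; tokens/sessions are omitted, and a plain set stands in for hzPendingReplicationTaskMap.

```python
from dataclasses import dataclass, field


@dataclass
class CoordinatingNode:
    # (pid, target subject) pairs for tasks initiated but not completed.
    # Hypothetical stand-in for hzPendingReplicationTaskMap.
    pending: set = field(default_factory=set)
    status: dict = field(default_factory=dict)

    def is_replication_authorized(self, target_subject, pid):
        # Step 3: authorized only if a pending task matches this subject and pid.
        return (pid, target_subject) in self.pending

    def set_replication_status(self, target_subject, pid, status):
        # Step 5: record the outcome; drop the task once it is done.
        self.status[(pid, target_subject)] = status
        if status == "completed":  # assumed status value
            self.pending.discard((pid, target_subject))


@dataclass
class MemberNode:
    subject: str
    cn: CoordinatingNode
    objects: dict = field(default_factory=dict)

    def get_replica(self, requester_subject, pid):
        # Step 3: A asks the CN whether the requesting subject may replicate pid.
        if not self.cn.is_replication_authorized(requester_subject, pid):
            raise PermissionError(f"{requester_subject} may not replicate {pid}")
        # Step 4: A returns the data to B.
        return self.objects[pid]

    def replicate(self, source, pid):
        # Step 2: B calls A.getReplica, identifying itself by its subject.
        self.objects[pid] = source.get_replica(self.subject, pid)
        # Step 5: B reports back to the CN.
        self.cn.set_replication_status(self.subject, pid, "completed")


cn = CoordinatingNode()
a = MemberNode("CN=nodeA", cn, {"pid1": b"data"})
b = MemberNode("CN=nodeB", cn)
cn.pending.add(("pid1", b.subject))  # CN queues the task for target B
b.replicate(a, "pid1")               # Step 1: CN calls B.replicate(A, pid)
```

In this sketch a node that has no pending task for the pid is refused at step 3, which is the behavior the Node.Subject comparison is meant to provide.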

Event-based replication for a given object:

On create() or update(), check the SystemMetadata's ReplicationPolicy for:

numberOfReplicas < count(CN.listNodes())
Nodes listed in preferred list exist
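
A minimal sketch of these two checks, assuming the policy fields and the result of CN.listNodes() are already available as plain Python values (the function and parameter names are hypothetical):

```python
def policy_is_satisfiable(number_of_replicas, preferred, registered_nodes):
    """Return True if both checks from the notes pass."""
    registered = set(registered_nodes)
    # Check 1: numberOfReplicas < count(CN.listNodes())
    enough_nodes = number_of_replicas < len(registered)
    # Check 2: every node on the preferred list is a registered node
    preferred_exist = all(node in registered for node in preferred)
    return enough_nodes and preferred_exist
```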

Rank the target replication nodes based on:

SystemMetadata for the given object with its:
- ReplicationPolicy
  - blocked list
  - preferred list
  - ~~required list (is this defined?)~~
  - number of replicas as a minimum
  - replication allowed
  - (also, how to handle replicas that are externally maintained; set replication allowed to false, and min replicas to 0)
  - Decision: use only preferred and blocked lists; try to put a replica on all preferred nodes, plus any additional nodes needed to hit the min replicas
    - preferred > minimum : hit all preferred nodes, even if greater than minimum.
    - preferred < minimum : meet preferred and fill with more
    - preferred == minimum : stop at preferred & minimum
    - preferred - down < minimum : meet all available preferred and fill with extra
- Replica list (location, status)
Node rejection of a replication event
- Once rejected, always rejected? Depends on the exception
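
The preferred/blocked decision above could look roughly like the following. It is a sketch, not the agreed implementation: the function name and arguments are hypothetical, existing replicas are ignored, and it assumes a node that appears on both lists is treated as blocked.

```python
def select_targets(preferred, blocked, min_replicas, available):
    """Pick target nodes: all reachable preferred nodes, then fill to the minimum.

    `available` lists the nodes that are currently up, in rank order.
    """
    blocked_set = set(blocked)
    candidates = [n for n in available if n not in blocked_set]
    # Hit every preferred node that is up, even if that exceeds the minimum.
    targets = [n for n in preferred if n in candidates]
    # Fill with additional nodes only while below the minimum.
    for node in candidates:
        if len(targets) >= min_replicas:
            break
        if node not in targets:
            targets.append(node)
    return targets
```

With four nodes up, three preferred nodes and a minimum of 2 yield all three preferred nodes; one preferred node and a minimum of 3 yield that node plus two fill-in nodes.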

Initiate replication from the top of the list until the minimum number of replicas is created
Need an internal timeout after the replication is initiated for a given node
Create a ReplicationFailed exception
- messages:
  - Replication denied
  - Replication timed out
  - ...
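
One way to sketch the top-of-list loop with a per-node timeout is shown below. The exception messages come from the list above; everything else (function names, the `attempt` callback, the `timeout_s` parameter) is a hypothetical illustration, and a thread pool is used here only as a simple way to bound the wait.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class ReplicationFailed(Exception):
    DENIED = "Replication denied"
    TIMED_OUT = "Replication timed out"


def call_with_timeout(fn, timeout_s, *args):
    # Run fn in a worker thread and stop waiting after timeout_s seconds.
    # The worker itself is not cancelled; it just is no longer awaited.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise ReplicationFailed(ReplicationFailed.TIMED_OUT) from None
    finally:
        pool.shutdown(wait=False)


def replicate_to_minimum(pid, ranked_nodes, min_replicas, attempt, timeout_s=30.0):
    """Try nodes from the top of the ranked list until min_replicas succeed.

    `attempt(node, pid)` returns on success and raises ReplicationFailed
    (for example with the DENIED message) when the node rejects the request.
    """
    succeeded, failed = [], {}
    for node in ranked_nodes:
        if len(succeeded) >= min_replicas:
            break
        try:
            call_with_timeout(attempt, timeout_s, node, pid)
            succeeded.append(node)
        except ReplicationFailed as exc:
            failed[node] = str(exc)
    return succeeded, failed
```

Returning the failure messages per node leaves room for the "once rejected, always rejected?" decision to depend on which message was recorded.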

Node-based polling:

If a node drops, use its object list to evaluate whether each object's replication policy is still fulfilled. If not, queue the object for replication.
This requires a short list of objects on a given node. This may require a filter on the CN listObjects() method. We would need to pass in a NodeReference to get a list of objects on the given node. Question: Can we add a filter parameter to CNRead.listObjects() that filters by node id so we can produce that list efficiently?
- Query solr index via mercury
- Query metacat systemmetadata table
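
Assuming the per-node object list can be obtained by one of the routes above, the policy re-check for a dropped node might be sketched as follows. The function name and the two dictionaries (replica locations and minimum replica counts, keyed by pid) are hypothetical inputs, not existing API results.

```python
def objects_needing_replication(down_node, replicas_by_pid, min_replicas_by_pid):
    """Return the pids whose policy is no longer fulfilled once down_node is gone."""
    to_queue = []
    for pid, nodes in replicas_by_pid.items():
        if down_node not in nodes:
            continue  # object was never stored on the dropped node
        remaining = [n for n in nodes if n != down_node]
        if len(remaining) < min_replicas_by_pid[pid]:
            to_queue.append(pid)
    return to_queue
```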

Question about ReplicationPolicy implementation:

Should we replicate to numberOfReplicas only, or to numberOfReplicas + 2 (or 3), but not more? Rationale for a cap: the more replicas there are, the more overhead the CNs incur when a node drops, since every reference to the replica node must be updated.
- initially stick with numberOfReplicas
How long is too long for a node being down to initiate replication of all of the objects that were stored on that node? One week of downtime?
- We need a notification step (email to the node manager)
- Downtime for maintenance, a few hours, a few days
- resolve() needs to be updated to consider the status of each node