.. meta::

   :keywords: sprint, standup, ccit
   

Sprint-2012-46-Block-6-3 Notes
==============================

Roger
-----
Chris B.
--------
Rob
---
Client Data Format Support:

Dave
----
Chris
-----

Ben
---
Skye
----
Robert
------
Bruce
-----
Discussion Items
----------------

  https://docs.google.com/spreadsheet/ccc?key=0Ai3ryhJR2IgZdEwwTDhnai01UXN1RlRoUWtkOFNyZVE#gid=0

20121128
    
    
  Types of communication that may be split:
    - Hazelcast
    - LDAP
    - Metacat replication

  Also consider cases where one or more CNs have internal service failures, e.g. postgres failure, disk full, disk failure, hardware fault, etc.

  Approaches to dealing with network partitioning:
  
  1. Don't use multi-master.
  2. Detect the split and stop actions until it is resolved.
  3. Never update; only create new records, each with a timestamp. After the split is resolved, merge the content created during the split (may not be possible in all cases, e.g. permission conflicts).
  4. Use WAN replication mode (3 CNs = Active-Active-Active mode) in Hazelcast. This requires implementing conflict resolution mechanisms (merge-policy).
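
Approach 3 above can be sketched roughly as follows. This is a minimal illustration, not an agreed design: the ``Record`` fields, the ``merge_after_split`` name, and the rule that a conflict is "both sides ended the split with different latest values for the same item" are all assumptions made for the example.

```python
from typing import NamedTuple


class Record(NamedTuple):
    pid: str          # identifier of the object the record describes
    timestamp: float  # creation time of the record
    origin: str       # CN that created the record (hypothetical label)
    field: str        # what the record sets, e.g. "permission"
    value: str


def latest(records):
    """Latest value per (pid, field) on one side of the split."""
    out = {}
    for rec in sorted(records, key=lambda r: r.timestamp):
        out[(rec.pid, rec.field)] = rec.value
    return out


def merge_after_split(side_a, side_b):
    """Union the append-only records of both sides and list conflicting keys."""
    merged = sorted(set(side_a) | set(side_b),
                    key=lambda r: (r.timestamp, r.origin))
    la, lb = latest(side_a), latest(side_b)
    # Assumption: a conflict is a key written on both sides with different results.
    conflicts = sorted(k for k in la.keys() & lb.keys() if la[k] != lb[k])
    return merged, conflicts
```

Records are never overwritten, so the merge itself is a set union; only the keys reported in ``conflicts`` (for example a permission changed differently on each side) would need a manual or policy-driven decision.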

Note: since we have 3 HZ clusters, we need to determine what a 'split-brain' state is:

- hzStorage
- hzProcess
- hzSession
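
One possible way to frame the question above is to classify each of the three clusters separately from the point of view of a single CN. The sketch below is only a starting point for discussion: the state names and the majority rule are assumptions, not something decided in these notes.

```python
CLUSTERS = ("hzStorage", "hzProcess", "hzSession")


def cluster_state(expected, visible):
    """Classify one cluster as seen from a single CN."""
    expected = set(expected)
    seen = set(visible) & expected
    if seen == expected:
        return "healthy"
    # Assumption: seeing a strict majority of expected members is a milder
    # condition than being on the minority side of a split.
    if len(seen) * 2 > len(expected):
        return "degraded-majority"
    return "split-minority"


def classify(expected, visible_by_cluster):
    """Return the state of every HZ cluster, keyed by cluster name."""
    return {name: cluster_state(expected, visible_by_cluster.get(name, ()))
            for name in CLUSTERS}
```

With three CNs, a node that still sees one peer in a cluster would be reported as ``degraded-majority`` for that cluster, and a node that sees only itself as ``split-minority``.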



Path forward:

1. Implement a monitor of cluster membership. What gets notified when a member joins / leaves?
   http://www.hazelcast.com/docs/2.4/manual/single_html/#ClusterInterface

2. Implement mechanisms for responding to network partition events.
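
The two steps above could fit together roughly as in the sketch below. It does not use Hazelcast; the ``MembershipMonitor`` class and its ``member_added`` / ``member_removed`` methods are hypothetical and only imitate the shape of a membership listener, and the rule "any expected member missing means partition" is an assumption.

```python
class MembershipMonitor:
    """Tracks which expected CNs are present and fires callbacks on change."""

    def __init__(self, expected_members, on_partition, on_recovered):
        self.expected = set(expected_members)
        self.present = set(expected_members)
        self.on_partition = on_partition    # called with the missing members
        self.on_recovered = on_recovered    # called when all members are back

    def _complete(self):
        return self.present >= self.expected

    def member_removed(self, member):
        was_complete = self._complete()
        self.present.discard(member)
        # Only react to the transition from complete to incomplete.
        if was_complete and not self._complete():
            self.on_partition(sorted(self.expected - self.present))

    def member_added(self, member):
        was_complete = self._complete()
        self.present.add(member)
        # Only react when the last missing member has returned.
        if not was_complete and self._complete():
            self.on_recovered()
```

In a real deployment the two methods would be driven by the cluster's membership events, and the callbacks would trigger whichever of the strategies below is chosen.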

Strategy 1 - revert to a read-only system
  - What is the level of disconnect? (one node, all nodes?)
  - What time frame equates to a disconnect (seconds, minutes?)
  - What services need to be stopped? All write activity should stop.

    - What happens to outstanding replication requests, for example?

  - What should administrators do? Update DNS? Check and restart services?
  - How to override the default behavior?
  
On disconnect:
  - Set status to read-only
  - The notification mechanism should rely on a local service, not on hzClient
  - Notify admins
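
A minimal sketch of the read-only behaviour described above, under several assumptions: the ``ReadOnlyGuard`` name, the ``grace_seconds`` threshold (standing in for the open "seconds, minutes?" question), the automatic return to read-write on reconnect, and the ``override`` switch for administrators are all hypothetical.

```python
class ReadOnlyGuard:
    """Local service that decides whether this CN may accept writes."""

    def __init__(self, grace_seconds, notify_admins):
        self.grace_seconds = grace_seconds
        self.notify_admins = notify_admins  # local callable, not an hzClient
        self.override = None                # None = automatic, True/False = forced
        self.disconnected_since = None
        self.read_only = False

    def observe(self, connected, now):
        """Feed in the current connection status and a timestamp in seconds."""
        if connected:
            # Assumption: writes resume automatically once the split is resolved.
            self.disconnected_since = None
            self.read_only = False
            return
        if self.disconnected_since is None:
            self.disconnected_since = now
        elapsed = now - self.disconnected_since
        if not self.read_only and elapsed >= self.grace_seconds:
            self.read_only = True
            self.notify_admins("CN switched to read-only")

    def writes_allowed(self):
        if self.override is not None:
            return self.override
        return not self.read_only
```

Every write path would call ``writes_allowed()`` before touching the store; administrators get exactly one notification per disconnect.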

Strategy 2 - Queue all writes

- Writes are recorded as provisional.
  Provisional = written in the interval between when the split brain is detected and when it is resolved.

On disconnect:
  - DNS change: immediately switch to two nodes or one node
  - Set status to queue changes (what does this mean? Disconnect the backing store?)
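
One reading of "queue changes" is sketched below: writes arriving during the split are held as provisional and only applied to the backing store once the split is resolved. The ``ProvisionalWriteQueue`` class and its method names are hypothetical, and the backing store is a plain dict for illustration.

```python
class ProvisionalWriteQueue:
    """Holds writes as provisional while a split brain is in effect."""

    def __init__(self, store):
        self.store = store    # backing store, a plain dict in this sketch
        self.split = False
        self.pending = []     # provisional writes, in arrival order

    def split_detected(self):
        self.split = True

    def write(self, key, value):
        if self.split:
            self.pending.append((key, value))
            return "provisional"
        self.store[key] = value
        return "committed"

    def split_resolved(self):
        """Apply queued writes in order and return how many were applied."""
        applied = len(self.pending)
        for key, value in self.pending:
            self.store[key] = value
        self.pending.clear()
        self.split = False
        return applied
```

This sketch applies queued writes blindly; reconciling them with writes queued on the other side of the split is the merge problem covered by Strategy 3.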

Strategy 3 - Merge data stores
  - Keep a journal log; either merge all entries or last write wins.
    (From discussion of rollout procedures: we may wish to consider keeping a journalling system of posted reservations (independent of LDAP) on a public-facing CN during upgrades, which would create a replayable log of reserveIdentifier actions.)
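
The "last write wins" option could look roughly like the sketch below. The journal entry fields (``timestamp``, ``node``, ``pid``, ``subject``) are assumptions chosen to resemble a log of reserveIdentifier actions; they are not an existing format.

```python
def replay_last_write_wins(*journals):
    """Replay journal entries from several CNs; the newest entry per pid wins."""
    entries = sorted(
        (entry for journal in journals for entry in journal),
        # Ties on timestamp are broken by node name so the result is deterministic.
        key=lambda e: (e["timestamp"], e["node"]),
    )
    state = {}
    for entry in entries:
        state[entry["pid"]] = entry["subject"]
    return state
```

For reservations, "first write wins" may be the more natural policy; switching to it only requires keeping the existing value when a pid is already present in ``state``.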