.. meta::

   :keywords: sprint, standup, ccit, 
   

Sprint-2012-46-Block-6-3 Notes
==============================

Roger
-----
 * ONEDrive package management
   * trying to get to a stable space, not feature-complete
   * Looking at OAI-ORE serializations
     * will be using RDF/XML for now (formatID = http://www.openarchives.org/ore/terms )
   * Had to deal with handling solr index not in sync on the CNs
     * gracefull fallback if pids in the index are not resolvable


Chris B.
--------
 * Reworking the ansible playbooks to flatten the dependencies to restrict dependency tree to two level.
   * Done
 * Working with the Ansible developers on sub-playbooks
 * Playbook generator is complete
   * ncurses-based input is ok
 * Looking into the dataone packages to see what configuration files are produced and the required information that the ansible playbooks should ask.


Rob
---
Client Data Format Support:
 * Getting up to speed on R data frames and existing tools for data formats.
 * Looking at ncdf package (for Unidata's netCDF)
 * Working on delimited file parsing (CSV) in a general manner
 * Worked on an EML parser in the R client
   * needs a reasonable amount of eml-physical metadata


Dave
----
 * 20121130
   * preparations for EAB meeting
   * issue with nagios / check_mk on cn-stage-unm-1
   * reviewing replication process

 * 20121128
   * (automating) check through versions in svn tags, branch, dev and what is in poms
   * preparing for EAB meeting
   * Updated nagios configs on stage CNs for monitor.dataone.org

 * 20121126
   * Updating docs, dev schedule, component list (see actions below)
   * Added statsd datagram to websocket gateway
   * Network outage at UNM office Tuesday, 8-12MT (epad, redmine, hudson affected)


Chris
-----

 * 20121126
   * d1_replication testing with Skye, revealed a Metacat bug
   * Looking at switching Metacat indexing on/off to deal with this CN sync bug


Ben
---
 * Morpho API work (save, open, etc)
   * Using user-specified certificate location for authentication. Not exactly ready for end user, but useful for development and demonstrates the plumbing is in place for certificate-based auth.
   * [experimental] Working on the ability to assign ids if generated identifiers are not desirable
   * tracking bugzilla entry http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5736
 * Working with Jing on bugs WRT archived objects
   * Metacat not looking at the archived flag
     * fixed in trunk -- will be in 2.0.5 release
   * Metacat.delete() was not marking SM.archived=true (https://redmine.dataone.org/issues/3406)
     * fixed in trunk -- 2.0.5 release
 * CLO/AKN member node -- what was the decision on that?
   * Chris will approve this in production
 * Jing will be working on Morpho again
 * Adding MN.generateIdentifier() in Metacat
 * Changed Metacat to use SQL-based sorting/limiting, not Java-based
   * max log entry return count is 7000

Skye
----
 * Replication testing
   * Final run on wednesday looks to have missed 20 docs
   * No errors in replication log...
   * lots of errors in catalina.out from indexing tasks?
 * DONE: Perform run with replication running on all three CN?
   * Result: replication running on 2 or 3 CN's worked well:
     * Increased performance/throughput of the replication process
     * Full replication runs with 2 and 3 CN for small test runs of 1K objects
 * TODO: Minor buildout error in cn-index buildout on purge/install
 * Added second CN into an already replicating cluster
   * saw a member drop outof the cluster, either network or client issue
   * this is an outstanding issue (split-brain)
   * need to create a strategy that handles inconsistencies between nodes
 * Most recent run stalled erly, looking into it - may be a Metacat upgrade issue?
 * will work on a standalone membership monitor
   * Hz LifeCycleEventListener
 * sync/repl run this morning, noticing repl pauses - need to figure out why it does this (locking?)

Robert
------
 * Submitted pull request to hazelcast
 * built hazelcast sucessfully on localhost machine
 * deleted and re-forked hazelcast/hazelcast repo into rwaltz/hazelcast
   * rebuilt on hudson
 * fixed some errors in synch NullPointer issue
 * investigated hazelcast executor exception. Not enough info.
 * want to investigate the hz client connection issue that causes EOF warnings on cluster members

Bruce
-----
 * Moving additional server into ORC cluster to add capacity
 * Resolving "sand in the gears" issues with UTK OIT
 * Documentation and testing of bibliographic tools (Mendeley, Zotero, Papers, EndNote)
 * Start framing for USA-NPN member node (likely MetaCat implementation, in January)

Discussion Items
----------------

 * Component status, responsibilities, versions spreadsheet:

  https://docs.google.com/spreadsheet/ccc?key=0Ai3ryhJR2IgZdEwwTDhnai01UXN1RlRoUWtkOFNyZVE#gid=0

 * Client data format support - addressing EML parsing in d1_client_r

 * Hazelcast non-snapshot release
   * A 1.1 release will require a full solr index rebuild, so timing of a 1.1.1 release may interfere
   * Robert will contact the Hazelcast group re: 2.4.1 tag
   * We'll test d1_replication against the 2.4.1 snapshot this week
   * Waiting on pull request re: logging in hazelcast
   * Should consider maintaining a D1 HZ repo if we use SSL (non-pay)

 * Metacat indexing DONE
   * Chris will test metacat with the indexing flag turned off
   * We'll look at keeping indexing running, but remove docids from the IndexerQueue on archive(), noting that there may be a race condition involved there

20121128
 * Scimeta parsing for clients (R)
   * science metadata is often less than perfect WRT physical file syntax
   * converting from a byte array to delimited text using read.csv

 * Network partitioning (split-brain) issues in repl and sync
   * Listeners can notify when members leave/join the cluster.
   * May be able to create ourown Merge Policy within Hazelcast
     * need to look at timestamps of object records in postgres
   * Look at WAN replication in Hazelcast
    
    
  Types of communiction that may be split:
    - Hazelcast
    - LDAP
    - Metacat replication

  Also issues where one or more CNs have internal service failures - e.g. postgres failure, disk full, disk failure, hardware fault, etc

  Approaches to dealing with network partitioning:
  
  1- Don't use multi-master
  2- Detect and stop actions until split is resolved
  3- Never update, only create new records each with timestamp. After split merge content during time of split (may not be possible in all cases - e.g. permission conflict)
  4- Use WAN replication mode (3CNs = Active-Active-Active mode) in Hazelcast, requires implementation of conflict resolution mechanisms (merge-policy)

Note: since we have 3 HZ clusters, we need to determine what a 'split-brain' state is
    hzStorage
    hzProcess
    hzSession


Path forward:
1- Implement monitor of cluster membership. What gets notified when a member joins / leaves?
   http://www.hazelcast.com/docs/2.4/manual/single_html/#ClusterInterface

2- Implement mechanisms for responding to network partition events

Strategy 1 - revert to a read-only system
  - What is the level of disconnect? (one node, all nodes?) 
  - What time frame equates to a disconnect (seconds, minutes?)
  - what services need to be stopped? All write activity should stop.
    - what happens to outstanding replication requests for example?
  - What should administrators do? DNS update?, check and restart services?
  - How to override the default behavior?
  
on disconnect:
  - set status to read only
  - Notification mechanism should not rely on hzClient, but a local service
  - notify admins

Strategy 2 - Queue all writes

- writes are recorded as provisional
  provisional = interval between when split brain detected to when resolved

on disconnect:
  - DNS change - immediately switch to 2 nodes or one node
  - set status to queue changes (what does this mean? disconnect the backing store?)

Strategy 3 - Merge data stores
  - keep a Journal log, either merge all entries or last write wins
    (From discussion of Rollout procedures: We  may wish to consider that we keep a journalling system of posted  reservations (independent of LDAP) on a pubic facing CN during upgrades  that will create a replayable log of reserveIdentifier actions.)