
.. meta::
   :keywords: sprint, standup, ccit

Sprint-2012-46-Block-6-3 Notes
==============================

Roger
-----

ONEDrive package management
- trying to get to a stable space, not feature-complete
- Looking at OAI-ORE serializations
  - will be using RDF/XML for now (formatID = http://www.openarchives.org/ore/terms )
- Had to handle the solr index being out of sync on the CNs
  - graceful fallback if pids in the index are not resolvable
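
A minimal sketch of the fallback described above. The ``resolve`` callable is hypothetical (not taken from the ONEDrive code) and is assumed to raise ``LookupError`` when a pid from the index cannot be resolved.

```python
def resolvable_entries(index_pids, resolve):
    """Split pids from a search index into resolved entries and skipped pids.

    `resolve` is a hypothetical callable: it returns a location for a pid
    or raises LookupError when the resolver does not know the pid.
    """
    resolved = {}
    skipped = []
    for pid in index_pids:
        try:
            resolved[pid] = resolve(pid)
        except LookupError:
            # Stale index entry: remember it and carry on instead of failing.
            skipped.append(pid)
    return resolved, skipped
```

Returning the skipped pids lets the caller log them or hide them from the listing, rather than aborting on the first stale entry.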

Chris B.
--------

Reworking the Ansible playbooks to flatten the dependencies, restricting the dependency tree to two levels.
- Done
Working with the Ansible developers on sub-playbooks
Playbook generator is complete
- ncurses-based input is ok
Looking into the dataone packages to see what configuration files are produced and what information the Ansible playbooks should ask for.

Rob
---
Client Data Format Support:

Getting up to speed on R data frames and existing tools for data formats.
Looking at ncdf package (for Unidata's netCDF)
Working on delimited file parsing (CSV) in a general manner
Worked on an EML parser in the R client
- needs a reasonable amount of eml-physical metadata

Dave
----

20121130
- preparations for EAB meeting
- issue with nagios / check_mk on cn-stage-unm-1
- reviewing replication process
20121128
- (automating) check through versions in svn tags, branch, dev and what is in poms
- preparing for EAB meeting
- Updated nagios configs on stage CNs for monitor.dataone.org
20121126
- Updating docs, dev schedule, component list (see actions below)
- Added statsd datagram to websocket gateway
- Network outage at UNM office Tuesday, 8-12MT (epad, redmine, hudson affected)
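
The automated version check noted under 20121128 could look roughly like the sketch below. It assumes each svn location has already been checked out and its ``pom.xml`` read into a string. The function names and the shape of the ``checks`` mapping are illustrative, not the actual script.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven 4.0.0 POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def pom_version(pom_xml):
    """Return the project version, falling back to the parent's version."""
    root = ET.fromstring(pom_xml)
    for path in ("m:version", "m:parent/m:version"):
        node = root.find(path, POM_NS)
        if node is not None and node.text:
            return node.text.strip()
    return None


def find_mismatches(checks):
    """checks maps a label (e.g. an svn path) to (expected_version, pom_xml).

    Returns a list of (label, expected, found) for every disagreement.
    """
    problems = []
    for label in sorted(checks):
        expected, pom_xml = checks[label]
        found = pom_version(pom_xml)
        if found != expected:
            problems.append((label, expected, found))
    return problems
```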

Chris
-----

20121126
- d1_replication testing with Skye, revealed a Metacat bug
- Looking at switching Metacat indexing on/off to deal with this CN sync bug

Ben
---

Morpho API work (save, open, etc)
- Using user-specified certificate location for authentication. Not exactly ready for end user, but useful for development and demonstrates the plumbing is in place for certificate-based auth.
- [experimental] Working on the ability to assign ids if generated identifiers are not desirable
- tracking bugzilla entry http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5736
Working with Jing on bugs WRT archived objects
- Metacat not looking at the archived flag
  - fixed in trunk -- will be in 2.0.5 release
- Metacat.delete() was not marking SM.archived=true (https://redmine.dataone.org/issues/3406)
  - fixed in trunk -- 2.0.5 release
CLO/AKN member node -- what was the decision on that?
- Chris will approve this in production
Jing will be working on Morpho again
Adding MN.generateIdentifier() in Metacat
Changed Metacat to use SQL-based sorting/limiting, not Java-based
- max log entry return count is 7000
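
A sketch of database-side sorting and limiting, using SQLite only so that the example is self-contained. The table and column names are hypothetical and are not Metacat's schema; the 7000 cap is the figure from the note above.

```python
import sqlite3

MAX_LOG_ENTRIES = 7000  # cap mentioned in the notes


def fetch_log_entries(conn, start=0, count=MAX_LOG_ENTRIES):
    """Let the database sort and page the log instead of doing it in memory."""
    count = max(0, min(count, MAX_LOG_ENTRIES))
    cursor = conn.execute(
        "SELECT entry_id, date_logged, event FROM access_log "
        "ORDER BY date_logged, entry_id LIMIT ? OFFSET ?",
        (count, start),
    )
    return cursor.fetchall()
```

Only the requested page crosses the database connection, so memory use no longer grows with the size of the whole log.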

Skye
----

Replication testing
- Final run on Wednesday looks to have missed 20 docs
- No errors in replication log...
- lots of errors in catalina.out from indexing tasks?
DONE: Perform run with replication running on all three CNs?
- Result: replication running on 2 or 3 CNs worked well:
  - Increased performance/throughput of the replication process
  - Full replication runs with 2 and 3 CN for small test runs of 1K objects
TODO: Minor buildout error in cn-index buildout on purge/install
Added second CN into an already replicating cluster
- saw a member drop out of the cluster, either network or client issue
- this is an outstanding issue (split-brain)
- need to create a strategy that handles inconsistencies between nodes
Most recent run stalled early, looking into it - may be a Metacat upgrade issue?
will work on a standalone membership monitor
- Hz LifeCycleEventListener
sync/repl run this morning, noticing repl pauses - need to figure out why it does this (locking?)

Robert
------

Submitted pull request to hazelcast
built hazelcast successfully on local machine
deleted and re-forked hazelcast/hazelcast repo into rwaltz/hazelcast
- rebuilt on hudson
fixed some errors related to the NullPointer issue in synchronization
investigated hazelcast executor exception. Not enough info.
want to investigate the hz client connection issue that causes EOF warnings on cluster members

Bruce
-----

Moving additional server into ORC cluster to add capacity
Resolving "sand in the gears" issues with UTK OIT
Documentation and testing of bibliographic tools (Mendeley, Zotero, Papers, EndNote)
Start framing for USA-NPN member node (likely Metacat implementation, in January)

Discussion Items
----------------

Component status, responsibilities, versions spreadsheet:

https://docs.google.com/spreadsheet/ccc?key=0Ai3ryhJR2IgZdEwwTDhnai01UXN1RlRoUWtkOFNyZVE#gid=0

Client data format support - addressing EML parsing in d1_client_r
Hazelcast non-snapshot release
- A 1.1 release will require a full solr index rebuild, so timing of a 1.1.1 release may interfere
- Robert will contact the Hazelcast group re: 2.4.1 tag
- We'll test d1_replication against the 2.4.1 snapshot this week
- Waiting on pull request re: logging in hazelcast
- Should consider maintaining a D1 HZ repo if we use SSL (non-pay)
Metacat indexing DONE
- Chris will test metacat with the indexing flag turned off
- We'll look at keeping indexing running, but remove docids from the IndexerQueue on archive(), noting that there may be a race condition involved there

20121128

Scimeta parsing for clients (R)
- science metadata is often less than perfect WRT physical file syntax
- converting from a byte array to delimited text using read.csv
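
The R client does this conversion with ``read.csv``. The sketch below shows the same byte-array-to-table step in Python for illustration only. The ``delimiter`` and ``header_lines`` parameters are assumed to come from the science metadata (for example eml-physical's field delimiter and header line count), with defaults for when the metadata is missing.

```python
import csv
import io


def parse_delimited(raw, delimiter=",", header_lines=1, encoding="utf-8"):
    """Decode a byte array and split it into a header row and data rows.

    All values are returned as strings; type conversion is left to the caller.
    """
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # drop blank lines
    header = None
    if 0 < header_lines <= len(rows):
        # Treat the last header line as the column names.
        header = rows[header_lines - 1]
    return header, rows[header_lines:]
```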
Network partitioning (split-brain) issues in repl and sync
- Listeners can notify when members leave/join the cluster.
- May be able to create our own Merge Policy within Hazelcast
  - need to look at timestamps of object records in postgres
- Look at WAN replication in Hazelcast

Types of communication that may be split:
    - Hazelcast
    - LDAP
    - Metacat replication

Also issues where one or more CNs have internal service failures - e.g. postgres failure, disk full, disk failure, hardware fault, etc

Approaches to dealing with network partitioning:

1- Don't use multi-master
2- Detect and stop actions until split is resolved
3- Never update, only create new records each with timestamp. After split merge content during time of split (may not be possible in all cases - e.g. permission conflict)
4- Use WAN replication mode (3 CNs = Active-Active-Active mode) in Hazelcast, requires implementation of conflict resolution mechanisms (merge-policy)

Note: since we have 3 HZ clusters, we need to determine what a 'split-brain' state is
    hzStorage
    hzProcess
    hzSession

Path forward:
1- Implement monitor of cluster membership. What gets notified when a member joins / leaves?
   http://www.hazelcast.com/docs/2.4/manual/single_html/#ClusterInterface

2- Implement mechanisms for responding to network partition events

Strategy 1 - revert to a read-only system
- What is the level of disconnect? (one node, all nodes?)
- What time frame equates to a disconnect (seconds, minutes?)
- what services need to be stopped? All write activity should stop.
    - what happens to outstanding replication requests for example?
- What should administrators do? DNS update?, check and restart services?
- How to override the default behavior?

on disconnect:
- set status to read only
- Notification mechanism should not rely on hzClient, but a local service
- notify admins
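
A sketch of the on-disconnect steps for Strategy 1. It assumes a local service feeds the guard the current member list and that ``notify`` is any callable that reaches the administrators. The class and method names are hypothetical. The guard stays read-only until an administrator clears it, which is one possible answer to the override question above, not a decision.

```python
class ReadOnlyGuard:
    """Switch to read-only when an expected cluster member is missing."""

    def __init__(self, expected_members, notify):
        self.expected = frozenset(expected_members)
        self.notify = notify
        self.read_only = False

    def on_membership(self, current_members):
        """Call with the member list whenever cluster membership changes."""
        missing = self.expected - set(current_members)
        if missing and not self.read_only:
            self.read_only = True
            self.notify("read-only: lost contact with " + ", ".join(sorted(missing)))

    def check_write(self):
        """Call before any write; raises while the node is read-only."""
        if self.read_only:
            raise RuntimeError("node is read-only after a cluster disconnect")

    def clear(self):
        """Administrator override: resume writes once the split is resolved."""
        self.read_only = False
```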

Strategy 2 - Queue all writes

- writes are recorded as provisional
provisional = the interval from when the split brain is detected until it is resolved

on disconnect:
- DNS change - immediately switch to 2 nodes or one node
- set status to queue changes (what does this mean? disconnect the backing store?)

Strategy 3 - Merge data stores
- keep a Journal log, either merge all entries or last write wins
    (From discussion of Rollout procedures: We may wish to consider that we keep a journalling system of posted reservations (independent of LDAP) on a public-facing CN during upgrades that will create a replayable log of reserveIdentifier actions.)
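
The two journal options for Strategy 3 can be sketched as below, assuming journal entries are ``(timestamp, key, value)`` tuples collected from all nodes. This entry format is an assumption for illustration. Entries with equal timestamps keep their input order, so the later one in the input wins.

```python
def last_write_wins(entries):
    """Keep only the newest value per key."""
    state = {}
    for _timestamp, key, value in sorted(entries, key=lambda e: e[0]):
        state[key] = value  # later timestamps overwrite earlier ones
    return state


def replay_all(entries, apply):
    """Merge everything: hand each entry to `apply` in timestamp order."""
    for entry in sorted(entries, key=lambda e: e[0]):
        apply(entry)
```

Last write wins is simple but silently discards the older write. Replaying every entry keeps the full history and leaves conflict handling to ``apply``.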