.. meta::
:keywords: sprint, standup, ccit,
Sprint-2012-46-Block-6-3 Notes
==============================
Roger
-----
- ONEDrive package management
- trying to get to a stable space, not feature-complete
- Looking at OAI-ORE serializations
- Had to deal with handling solr index not in sync on the CNs
- gracefull fallback if pids in the index are not resolvable
Chris B.
--------
- Reworking the ansible playbooks to flatten the dependencies to restrict dependency tree to two level.
- Working with the Ansible developers on sub-playbooks
- Playbook generator is complete
- ncurses-based input is ok
- Looking into the dataone packages to see what configuration files are produced and the required information that the ansible playbooks should ask.
Rob
---
Client Data Format Support:
- Getting up to speed on R data frames and existing tools for data formats.
- Looking at ncdf package (for Unidata's netCDF)
- Working on delimited file parsing (CSV) in a general manner
- Worked on an EML parser in the R client
- needs a reasonable amount of eml-physical metadata
Dave
----
- 20121130
- preparations for EAB meeting
- issue with nagios / check_mk on cn-stage-unm-1
- reviewing replication process
- 20121128
- (automating) check through versions in svn tags, branch, dev and what is in poms
- preparing for EAB meeting
- Updated nagios configs on stage CNs for monitor.dataone.org
- 20121126
- Updating docs, dev schedule, component list (see actions below)
- Added statsd datagram to websocket gateway
- Network outage at UNM office Tuesday, 8-12MT (epad, redmine, hudson affected)
Chris
-----
- 20121126
- d1_replication testing with Skye, revealed a Metacat bug
- Looking at switching Metacat indexing on/off to deal with this CN sync bug
Ben
---
- Morpho API work (save, open, etc)
- Using user-specified certificate location for authentication. Not exactly ready for end user, but useful for development and demonstrates the plumbing is in place for certificate-based auth.
- [experimental] Working on the ability to assign ids if generated identifiers are not desirable
- tracking bugzilla entry http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5736
- Working with Jing on bugs WRT archived objects
- Metacat not looking at the archived flag
- fixed in trunk -- will be in 2.0.5 release
- Metacat.delete() was not marking SM.archived=true (https://redmine.dataone.org/issues/3406)
- fixed in trunk -- 2.0.5 release
- CLO/AKN member node -- what was the decision on that?
- Chris will approve this in production
- Jing will be working on Morpho again
- Adding MN.generateIdentifier() in Metacat
- Changed Metacat to use SQL-based sorting/limiting, not Java-based
- max log entry return count is 7000
Skye
----
- Replication testing
- Final run on wednesday looks to have missed 20 docs
- No errors in replication log...
- lots of errors in catalina.out from indexing tasks?
- DONE: Perform run with replication running on all three CN?
- Result: replication running on 2 or 3 CN's worked well:
- Increased performance/throughput of the replication process
- Full replication runs with 2 and 3 CN for small test runs of 1K objects
- TODO: Minor buildout error in cn-index buildout on purge/install
- Added second CN into an already replicating cluster
- saw a member drop outof the cluster, either network or client issue
- this is an outstanding issue (split-brain)
- need to create a strategy that handles inconsistencies between nodes
- Most recent run stalled erly, looking into it - may be a Metacat upgrade issue?
- will work on a standalone membership monitor
- Hz LifeCycleEventListener
- sync/repl run this morning, noticing repl pauses - need to figure out why it does this (locking?)
Robert
------
- Submitted pull request to hazelcast
- built hazelcast sucessfully on localhost machine
- deleted and re-forked hazelcast/hazelcast repo into rwaltz/hazelcast
- fixed some errors in synch NullPointer issue
- investigated hazelcast executor exception. Not enough info.
- want to investigate the hz client connection issue that causes EOF warnings on cluster members
Bruce
-----
- Moving additional server into ORC cluster to add capacity
- Resolving "sand in the gears" issues with UTK OIT
- Documentation and testing of bibliographic tools (Mendeley, Zotero, Papers, EndNote)
- Start framing for USA-NPN member node (likely MetaCat implementation, in January)
Discussion Items
----------------
- Component status, responsibilities, versions spreadsheet:
https://docs.google.com/spreadsheet/ccc?key=0Ai3ryhJR2IgZdEwwTDhnai01UXN1RlRoUWtkOFNyZVE#gid=0
- Client data format support - addressing EML parsing in d1_client_r
- Hazelcast non-snapshot release
- A 1.1 release will require a full solr index rebuild, so timing of a 1.1.1 release may interfere
- Robert will contact the Hazelcast group re: 2.4.1 tag
- We'll test d1_replication against the 2.4.1 snapshot this week
- Waiting on pull request re: logging in hazelcast
- Should consider maintaining a D1 HZ repo if we use SSL (non-pay)
- Metacat indexing DONE
- Chris will test metacat with the indexing flag turned off
- We'll look at keeping indexing running, but remove docids from the IndexerQueue on archive(), noting that there may be a race condition involved there
20121128
- Scimeta parsing for clients (R)
- science metadata is often less than perfect WRT physical file syntax
- converting from a byte array to delimited text using read.csv
- Network partitioning (split-brain) issues in repl and sync
- Listeners can notify when members leave/join the cluster.
- May be able to create ourown Merge Policy within Hazelcast
- need to look at timestamps of object records in postgres
- Look at WAN replication in Hazelcast
Types of communiction that may be split:
- Hazelcast
- LDAP
- Metacat replication
Also issues where one or more CNs have internal service failures - e.g. postgres failure, disk full, disk failure, hardware fault, etc
Approaches to dealing with network partitioning:
1- Don't use multi-master
2- Detect and stop actions until split is resolved
3- Never update, only create new records each with timestamp. After split merge content during time of split (may not be possible in all cases - e.g. permission conflict)
4- Use WAN replication mode (3CNs = Active-Active-Active mode) in Hazelcast, requires implementation of conflict resolution mechanisms (merge-policy)
Note: since we have 3 HZ clusters, we need to determine what a 'split-brain' state is
hzStorage
hzProcess
hzSession
Path forward:
1- Implement monitor of cluster membership. What gets notified when a member joins / leaves?
http://www.hazelcast.com/docs/2.4/manual/single_html/#ClusterInterface
2- Implement mechanisms for responding to network partition events
Strategy 1 - revert to a read-only system
- What is the level of disconnect? (one node, all nodes?)
- What time frame equates to a disconnect (seconds, minutes?)
- what services need to be stopped? All write activity should stop.
- what happens to outstanding replication requests for example?
- What should administrators do? DNS update?, check and restart services?
- How to override the default behavior?
on disconnect:
- set status to read only
- Notification mechanism should not rely on hzClient, but a local service
- notify admins
Strategy 2 - Queue all writes
- writes are recorded as provisional
provisional = interval between when split brain detected to when resolved
on disconnect:
- DNS change - immediately switch to 2 nodes or one node
- set status to queue changes (what does this mean? disconnect the backing store?)
Strategy 3 - Merge data stores
- keep a Journal log, either merge all entries or last write wins
(From discussion of Rollout procedures: We may wish to consider that we keep a journalling system of posted reservations (independent of LDAP) on a pubic facing CN during upgrades that will create a replayable log of reserveIdentifier actions.)