Notes for Development Block 2.3
===============================
Previous epad notes: http://epad.dataone.org/2014-12-Block-2-2
G+ URL: https://plus.google.com/hangouts/_/event/cqpnckqbr0s8o40kpk3r20ko8s0
Sprint Planning
~~~~~~~~~~~~~~~
Skye
------
- 20140331
- Dashboard UI refinements, review today
- SOLID Retro Friday
- 20140407
- Dashboard UI refinements
- Working on section below node list for display of:
- upcoming nodes
- replication nodes
- will review with Amber
- Will start working on 1.2.6 release issues for indexing, mercury
- 20140409
- Dashboard UI refinements
- Reviewed with Amber, Rebecca
- Planning review in MN Wrangler meeting Friday
- Looking at index refresh performance
- When items must be loaded into the map pid-by-pid, it seems to take just under 1 sec to create each index job
- When items are already in the map, can create 10-20 jobs per sec.
- 20140411
- SOLID RETRO TODAY AT 4ET
- Observed UCSB crash after running out of disk space
- UCSB going down immediately caused throughput on hz system metadata to double
- Began work on an index job blaster with parallelized job generation
- Concerned that reads trigger n-1 system metadata writes across the system, where n is the number of CNs.
- Many reads beget many more writes...
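
Sketch of the parallelized job generation idea mentioned above (Python, illustrative only, not the actual blaster code). It splits the pid list into batches and runs the slow per-pid lookup on a thread pool. ``fetch_sysmeta``, ``make_job``, the batch size and the worker count are all hypothetical stand-ins, not names from the CN code base.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(pids, size):
    """Yield successive lists of at most `size` pids."""
    it = iter(pids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_sysmeta(pid):
    # Hypothetical stand-in for the slow per-pid system metadata lookup.
    return {"pid": pid}


def make_job(pid):
    # Hypothetical: one index job per pid, built from its system metadata.
    return ("index", pid, fetch_sysmeta(pid))


def generate_jobs(pids, batch_size=1000, workers=8):
    """Create index jobs batch by batch, running the lookups in parallel."""
    jobs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(pids, batch_size):
            # pool.map preserves the input order within each batch
            jobs.extend(pool.map(make_job, batch))
    return jobs
```

Threads only help here if the per-pid cost is dominated by waiting on I/O (map loads, database reads). If the time goes to CPU work, this layout would not be expected to raise throughput.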
Roger
-----
- 20140404
- We had a productive meeting on the deployment tickets.
- Laura is doing major changes that we decided on at the meeting.
- We determined that the Member Node Deployment Checklist and the Deployment Tickets are two sides of the same coin. As a result of that, we're bringing notes from the existing tickets into the Checklist and then refactoring all the Deployment Tickets to have the same structure and contents as the Checklist.
- Ongoing work on ONEDrive.
- 20140402
- Ongoing work on ONEDrive.
- Must go back and look at dependencies in the Python stack. Fabio Trabucchi found another mismatch.
- 20140331
- Returning to ONEDrive dev.
Rob
----
- 20140411
- out sick for a couple of days
- 20140407
- Operating System Upgrades: jenkins: copied over all jobs from Hudson. Imported "trunk" jobs. A couple problems:
- d1_auditing_java failing, trying to compile with java v1.3?
- may be a problem with the maven installation on jenkins-1. It seems to be using an older maven-compiler-plugin 2.1.2
- CCI 1.2.6 Release
- DateTimeMarshaller timezone changes on hold in testing.
- reviewed the independent review of d1_common_java, assigned some tasks, fixed a couple bugs.
- 20140404
- set up apache-archiva on jenkins-1. It's a local maven repository with UI.
- d1_common_java tagging - blocked by testing of DateTimeMarshaller time-zone changes.
- 20140402
- Jenkins:
- troubleshooting / fixing maven-antrun issues
- have successful d1_common_java, and d1_libclient_java jobs
- still need to set up IRC notification, maven repo (?), email notification on jenkins-1
- d1_common_java - review of independent review, assigned a couple tickets
- CCI 1.2.6: somewhat blocked by date-time marshalling testing
- 20140331
- Finished working on some Tidy "dump" tests
- EDAC
David
-----
- 20140411
- RAID controller and NVSRAM update complete on ORC SAN2
- Migrations of non-prod and non-critical datastores in progress
- Splunk released a patched version to take care of Heartbleed vulnerability last night
- updated Splunk instances across all machines
- need to build new certs for Splunk and deploy
- 20140409
- ORC SAN issues
- After tweaking vDisk load on SAN2, SAN began reporting optimal conditions again
- "Well, I guess it's working now" isn't good enough, so going to do some more work on it, starting with updating RAID controller firmware
- Live update w/o shutting down servers connected to SAN2 vDisks risky, so migrating SAN2 load to SAN1 beforehand
- Splunk hazelcast logging/alerts
- built a crude alert to ping on explicit close messages, will talk to Bruce about how to make it more robust later today
- probably won't happen until after tidy/consistency fixes
- 20140407
- Splunk config cleanup finished
- need to build out datastore for archived data and configure that
- need to build out deployment server to streamline updates and config changes in the future
- SAN error at ORC persisting
- might need a firmware update, might be a symptom of upcoming RAID controller failure - OIT says contact Dell, gathering support data and contact info now
- want to make a backup of cn-orc-1 to the other SAN as soon as it's feasible
- Building a dummy CN VM at ORC to test ansible on 12.04
- 20140402
- Remaining Splunk forwarders in place
- Need to clean up some obsolete configs
- SAN error at ORC - had to reset some settings on our datastore manager
- 20140331
- Splunk forwarders built into MNs
- still have a couple that I need sudo access for - will discuss in standup
- Wrote up documentation for how Splunk now handles incoming data and how to build a new forwarder into Splunk
- Splunk indexer was complaining about low disk space last night, needed a fix
- Cloned VM to larger datastore, expanded the vDisk, then expanded the suffering LVM partition in Linux
- Talked to Bruce re: OS updates on non-prod boxes @ORC - he wants security patches in place soon after they go live; for non-security patches, get w/whoever manages that at UNM/UCSB and mimic what they're doing
- TO DO:
- Get sudo access for remaining forwarder builds, build out remaining forwarders
- Clean up obsolete Splunk configs
- Clean up Splunk docs
- Get to work building Hazelcast alerts
Matt
----
- 20140331
- Working on DataONE proposal to build metadata quality tools
Chris
-----
- 20140411
- New certs for: Cornell, Dryad, TFRI
- Check UCSB cert
- Revoke client certs
- Move UCSB into the RR, take ORC out
- Write ticket for Metacat write on every put()
- 20140409
- 20140407
- Continued MN support: GLEON, USANPN, DRYAD
- Metacat installation for view service work
- VM upgrades in production MNs
- 20140404
- Continued MN support for GLEON, USANPN
- MN Forum discussions
- MN deployment tickets meeting/planning with Laura, Roger, Rob
- 20140402
- MN support for USANPN, EDORA, GLEON
- Dashboard meetings
- Working on VM for Isis, Chris
- 20140331
- Troubleshooting GLEON resource maps
- VM spin up work
Peter
-----
- 20140402
- built CN local env on Ubuntu 10.04.4 and have cn-process-daemon.jar ready to deploy on cn-sandbox-ucsb-1
- drafted outline for log aggregation documentation
- 20140331
- building a CN local development environment in order to build jars with the update to log aggregation code - this will be installed and tested on cn-sandbox-ucsb-1
Dave
----
- 20140331
- Working through upgrade of DataONE documentation system to replace plone
- Working through code review of d1_common_java
- 20140402
- Administration work
- Continuing with alfresco document server config
- DSpace and OPeNDAP moving into Tier1 MN implementation work, will need to setup independent stage test environments
- 20140407
- Significant new workload related to grant renewal process.
Robert
------
- 20140331
- Completed Testing and started a Run of d1_tidy on cn-unm-1
Jing
------
- 20140407
- Set up a replication test between my local machine and mn-demo-11, but the test failed. I am troubleshooting it
- The registry has an issue - can't log in. Working on it right now
- 20140402
- Got HTTPS connections working with Apache on Ubuntu 12.04
- Got Morpho working with Metacat on Ubuntu 12.04
- Most of the Metacat JUnit tests pass on Ubuntu 12.04
- Upgraded mn-demo-11 to Ubuntu 12.04
- Connect the owners of the docids which have white spaces
Tim Robertson (GBIF)
--------------------
trobertson@gbif.org
Exploring a new Java implementation of a tier 4 Member Node that sits on Hadoop and HBase
- 20140411
- Early stages of exploration only so far
- Current progress:
- Skeleton project set up
- Maven, Dropwizard, Hadoop CDH4
- Core D1 java libraries reviewed
- Decision made not to use the core Java Interface
- Does not fit too nicely with Jersey JSR-311 annotations
- Keeps code much cleaner
- Decision made to use custom HBase persistence layer.
- Gora was used first, but discounted because it does not have Hive support
- DataNucleus is an option that could be explored, but requirements look very basic, so probably not needed
- SSL and Certificate based Auth "working" (e.g. Principal can be read from the certificate) - now trying to understand the actual D1 authentication
- Multipart Form data wired up - can do /object POST and /object/{pid} through API using CURL, where data is stored on the Hadoop DFS
- Next steps:
- Design HBase tables for the metadata
- Design Hadoop DFS for the file system storage
- Decision to be made: Should all files go to HDFS or small ones to HBase, larger to HDFS?
- Tidy code, get ready for a DataONE code review. Excite people by showing eBird SQL in under 120 seconds for generic queries, get a buddy to commit to it (Kyle at GBIF? + others?).
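
Sketch of the hybrid option from the open storage decision above (small objects to HBase, larger ones to HDFS). Python, illustrative only: the 10 MiB cut-off and the function name are assumptions made up for this example. The notes do not pick a threshold or say the hybrid option will be chosen.

```python
# Hypothetical cut-off; the notes leave the actual value (and the decision) open.
SMALL_FILE_LIMIT = 10 * 1024 * 1024  # 10 MiB


def choose_store(size_bytes, limit=SMALL_FILE_LIMIT):
    """Return which backend would hold an object of the given size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Objects strictly below the limit go to HBase, everything else to HDFS.
    return "hbase" if size_bytes < limit else "hdfs"
```

For example, ``choose_store(1024)`` picks HBase and ``choose_store(SMALL_FILE_LIMIT)`` picks HDFS. The "everything to HDFS" option is the same function with ``limit=0``.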
Topics
~~~~~~
- 20140411
- UNM Power outage
- Chris/Jing CN|MN Webtester discussion
- Jing/Skye discussion on indexing fields (fileId)
- Robert/Peter discussion on log aggregation issues
- Discussion on libclient comparators
- retro @4 ET: http://epad.dataone.org/DevRetroTopics
- 20140409
- Reindexing performance
- loop over hzIdentifiers
- create jobs in batches of 1000 (13-16 minutes per batch, i.e. about 1 job per second)
- 10-20 per second if sysmeta is loaded into hzSysMeta
- Run a multi-threaded call to getSystemMetadata(pid) (Dave)
- Determine where the bottleneck is:
- check on Metacat's systemmetadata table (Chris)
- check on timing of SQL select from within Java
- check on timing of hzSystemMetadata.get(pid) call (which pulls from backing store)
- check timing of hazelcast caching of backup copies on each CN
- evaluate disk i/o issues with debug logging on/off (Robert)
- evaluate log files for SQL calls (INSERT, UPDATE, DELETE) (Robert)
- check to be sure Metacat indexing is turned off in metacat.properties (Chris)
- change Metacat logging to millisecond precision in DEV (Robert)
- LTER MN switch
- Geohash use in log aggregation
- will schedule a mtg today
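
Sketch for the reindexing bottleneck checks listed under 20140409 (Python, illustrative only). The idea is to wrap each suspect stage (SQL select, hzSystemMetadata.get(pid), and so on) in a timer and compare per-stage call counts, means and maxima. The stage names and helper names below are hypothetical, and the real checks would be done in the Java code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> list of elapsed wall-clock seconds, one entry per call
timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed calls are counted too.
        timings[stage].append(time.perf_counter() - start)


def summarize():
    """Return {stage: (calls, mean_seconds, max_seconds)}."""
    return {
        stage: (len(vals), sum(vals) / len(vals), max(vals))
        for stage, vals in timings.items()
    }
```

Usage would look like ``with timed("hz_get"): ...`` around each call, then printing ``summarize()`` after a batch to see which stage accounts for the roughly 1 second per job.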