Notes for Development Block 2.3
===============================

Previous epad notes: http://epad.dataone.org/2014-12-Block-2-2

G+ URL: https://plus.google.com/hangouts/_/event/cqpnckqbr0s8o40kpk3r20ko8s0

Sprint Planning
~~~~~~~~~~~~~~~

* CN Consistency wrap-up (running)
  * http://epad.dataone.org/cn-audit-systemmetadata-design
* Finish Dashboard v1
  * Plan dashboard release (tags)
* Operating System Upgrades ( https://redmine.dataone.org/issues/4466 )
  * Jenkins is running 12.04, OpenJDK 7 (http://jenkins-1.dataone.org:8080)
  * Upgrade development boxes
* CCI 1.2.6 Release ( https://redmine.dataone.org/issues/4461 )
  * d1_common_java release
    * https://redmine.dataone.org/issues/4474
  * d1_libclient_java release
    * https://redmine.dataone.org/issues/4475

Skye
------

* 20140331
  * Dashboard UI refinements, review today
    * https://redmine.dataone.org/issues/3875
  * SOLID Retro Friday
* 20140407
  * Dashboard UI refinements
    * Working on section below node list for display of:
      * upcoming nodes
      * replication nodes
    * will review with Amber
  * Will start working on 1.2.6 release issues for indexing, Mercury
* 20140409
  * Dashboard UI refinements
    * Reviewed with Amber, Rebecca
    * Planning review in MN Wrangler meeting Friday
  * Looking at index refresh performance
    * When loading items into the map pid-by-pid, creating index jobs seems to take just under 1 sec per job
    * When items are already in the map, can create 10-20 jobs per sec.
* 20140411
  * SOLID RETRO TODAY AT 4ET
    * resources and links: http://epad.dataone.org/DevRetroTopics
  * Observed UCSB crash after running out of disk space
    * UCSB going down immediately caused throughput on hz system metadata to double
  * Began work on an index job blaster with parallelized job generation
  * Concerned that reads trigger n-1 system metadata writes across the system, where n is the number of CNs.
    * Many reads beget many more writes...

Roger
-----

* 20140404
  * We had a productive meeting on the deployment tickets.
  * Laura is doing major changes that we decided on at the meeting.
  * We determined that the Member Node Deployment Checklist and the Deployment Tickets are two sides of the same coin. As a result, we're bringing notes from the existing tickets into the Checklist and then refactoring all the Deployment Tickets to have the same structure and contents as the Checklist.
  * Ongoing work on ONEDrive.
* 20140402
  * Ongoing work on ONEDrive.
  * Must go back and look at dependencies in the Python stack. Fabio Trabucchi found another mismatch.
* 20140331
  * Returning to ONEDrive dev.

Rob
----

* 20140411
  * out sick for a couple of days
* 20140407
  * Operating System Upgrades: jenkins: copied over all jobs from Hudson. Imported "trunk" jobs. A couple of problems:
    * d1_auditing_java failing, trying to compile with java v1.3?
    * may be a problem with the maven installation on jenkins-1. It seems to be using an older maven-compiler-plugin 2.1.2
  * CCI 1.2.6 Release
    * DateTimeMarshaller timezone changes on hold in testing.
    * reviewed the independent review of d1_common_java, assigned some tasks, fixed a couple of bugs.
* 20140404
  * set up apache-archiva on jenkins-1. It's a local maven repository with UI.
    * maven.dataone.org
  * d1_common_java tagging - blocked by testing of DateTimeMarshaller time-zone changes.
* 20140402
  * Jenkins:
    * troubleshooting / fixing maven-antrun issues
    * have successful d1_common_java and d1_libclient_java jobs
    * still need to set up IRC notification, maven repo (?), email notification on jenkins-1
  * d1_common_java - review of independent review, assigned a couple of tickets
  * CCI 1.2.6: somewhat blocked by date-time marshalling testing
* 20140331
  * Finished working on some Tidy "dump" tests
  * EDAC

David
-----

* 20140411
  * RAID controller and NVSRAM update complete on ORC SAN2
    * Migrations of non-prod and non-critical datastores in progress
  * Splunk released a patched version last night to address the Heartbleed vulnerability
    * updated Splunk instances across all machines
    * need to build new certs for Splunk and deploy
* 20140409
  * ORC SAN issues
    * After tweaking vDisk load on SAN2, SAN began reporting optimal conditions again
    * "Well, I guess it's working now" isn't good enough, so going to do some more work on it, starting with updating RAID controller firmware
    * Live update w/o shutting down servers connected to SAN2 vDisks is risky, so migrating SAN2 load to SAN1 beforehand
  * Splunk hazelcast logging/alerts
    * built a crude alert to ping on explicit close messages, will talk to Bruce about how to make it more robust later today
    * probably won't happen until after tidy/consistency fixes
* 20140407
  * Splunk config cleanup finished
    * need to build out datastore for archived data and configure that
    * need to build out deployment server to streamline updates and config changes in the future
  * SAN error at ORC persisting
    * might need a firmware update, might be a symptom of upcoming RAID controller failure - OIT says contact Dell, gathering support data and contact info now
    * want to make a backup of cn-orc-1 to the other SAN as soon as it's feasible
  * Building a dummy CN VM at ORC to test ansible on 12.04
* 20140402
  * Remaining Splunk forwarders in place
    * Need to clean up some obsolete configs
  * SAN error at ORC - had to reset some settings on our
    datastore manager
* 20140331
  * Splunk forwarders built into MNs
    * still have a couple that I need sudo access for - will discuss in standup
  * Wrote up documentation for how Splunk now handles incoming data and how to build a new forwarder into Splunk
  * Splunk indexer was complaining about low disk space last night, needed a fix
    * Cloned VM to larger datastore, expanded the vDisk, then expanded the suffering LVM partition in Linux
  * Talked to Bruce re: OS updates on non-prod boxes @ORC - he wants security patches in place soon after they go live; for non-security patches, get w/whoever manages that at UNM/UCSB and mimic what they're doing
  * TO DO:
    * Get sudo access for remaining forwarder builds, build out remaining forwarders
    * Clean up obsolete Splunk configs
    * Clean up Splunk docs
    * Get to work building Hazelcast alerts

Matt
----

* 20140331
  * Working on DataONE proposal to build metadata quality tools

Chris
-----

* 20140411
  * New certs for: Cornell, Dryad, TFRI
  * Check UCSB cert
  * Revoke client certs
  * Move UCSB into the RR, take ORC out
  * Write ticket for Metacat write on every put()
* 20140409
  * Finished Metacat install
* 20140407
  * Continued MN support: GLEON, USANPN, DRYAD
  * Metacat installation for view service work
  * VM upgrades in production MNs
* 20140404
  * Continued MN support for GLEON, USANPN
  * MN Forum discussions
  * MN deployment tickets meeting/planning with Laura, Roger, Rob
* 20140402
  * MN support for USANPN, EDORA, GLEON
  * Dashboard meetings
  * Working on VM for Isis, Chris
* 20140331
  * Troubleshooting GLEON resource maps
  * VM spin up work

Peter
-----

* 20140402
  * built CN local env on Ubuntu 10.04.4 and have cn-process-daemon.jar ready to deploy on cn-sandbox-ucsb-1
  * drafted outline for log aggregation documentation
* 20140331
  * building a CN local development environment in order to build jars with the update to log aggregation code - this will be installed and tested on cn-sandbox-ucsb-1

Dave
----

* 20140331
  * Working through upgrade of
    DataONE documentation system to replace plone
  * Working through code review of d1_common_java
* 20140402
  * Administration work
  * Continuing with alfresco document server config
  * DSpace and OPeNDAP moving into Tier1 MN implementation work, will need to set up independent stage test environments
* 20140407
  * Significant new workload related to grant renewal process.

Robert
------

* 20140331
  * Completed testing and started a run of d1_tidy on cn-unm-1

Jing
------

* 20140407
  * Set up a replication test between my local machine and mn-demo-11, but the test failed. I am troubleshooting it
  * The registry has an issue - can't log in. Working on it right now
* 20140402
  * Made HTTPS connections work with Apache on Ubuntu 12.04
  * Made Morpho work with the Metacat on Ubuntu 12.04
  * Most of the Metacat JUnit tests passed on Ubuntu 12.04
  * Upgrade mn-demo-11 to Ubuntu 12.04
  * Connect the owners of the docids which have white spaces

Tim Robertson (GBIF): trobertson@gbif.org
-----------------------------------------

Exploring a new Java implementation of a tier 4 Member Node that sits on Hadoop and HBase

* 20140411
  * Early stages of exploration only so far
  * Current progress:
    * Skeleton project set up
      * Maven, Dropwizard, Hadoop CDH4
    * Core D1 java libraries reviewed
      * Decision made not to use the core Java Interface
        * Does not fit too nicely with Jersey JSR-311 annotations
        * Keeps code much cleaner
    * Decision made to use custom HBase persistence layer.
      * Gora was used first, but discounted because it does not have Hive support
      * DataNucleus is an option that could be explored, but requirements look very basic, so probably not needed
    * SSL and Certificate based Auth "working" (e.g.
      Principal can be read from the certificate) - now trying to understand the actual D1 authentication
    * Multipart Form data wired up - can do /object POST and /object/{pid} through API using curl, where data is stored on the Hadoop DFS
  * Next steps:
    * Design HBase tables for the metadata
    * Design Hadoop DFS for the file system storage
      * Decision to be made: Should all files go to HDFS or small ones to HBase, larger to HDFS?
    * Tidy code, get ready for a DataONE code review. Excite people by showing eBird SQL in sub-120 seconds for generic queries, get a buddy to commit to it (Kyle at GBIF? + others?).

Topics
~~~~~~

* 20140411
  * UNM Power outage
  * Chris/Jing CN|MN Webtester discussion
  * Jing/Skye discussion on indexing fields (fileId)
  * Robert/Peter discussion on log aggregation issues
  * Discussion on libclient comparators
  * retro @4 ET: http://epad.dataone.org/DevRetroTopics
* 20140409
  * Reindexing performance
    * loop over hzIdentifiers
    * get(pid) from hzSysMeta
    * create jobs in batches of 1000 (13-16 minutes) (1 per second)
    * 10-20 per second if sysmeta is loaded into hzSysMeta
    * Run a multi-threaded call to getSystemMetadata(pid) (Dave)
    * Determine where the bottleneck is:
      * check on Metacat's systemmetadata table (Chris)
      * check on timing of SQL select from within Java
      * check on timing of hzSystemMetadata.get(pid) call (which pulls from backing store)
      * check timing of hazelcast caching of backup copies on each CN
      * evaluate disk i/o issues with debug logging on/off (Robert)
      * evaluate log files for SQL calls (INSERT, UPDATE, DELETE) (Robert)
      * check to be sure Metacat indexing is turned off in metacat.properties (Chris)
      * change Metacat logging to millisecond precision in DEV (Robert)
      * indexed guid column
  * LTER MN switch
  * Geohash use in log aggregation
    * will schedule a mtg today
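
For the reindexing performance topic (20140409), a rough sketch of the kind of timing harness the multi-threaded getSystemMetadata(pid) experiment might use. Nothing here comes from the CN code base: `get_system_metadata`, `batches`, and `timed_fetch` are hypothetical names, the 10 ms sleep merely imitates a slow backing-store read, and Python is used only to keep the sketch short. It slices the pid list into batches and times the lookups at different thread counts, which is one way to see whether throughput is limited by per-call latency or by something shared.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_system_metadata(pid):
    """Hypothetical stand-in for the real lookup; sleeps to imitate a slow read."""
    time.sleep(0.01)  # assumed latency, not a measured value
    return {"pid": pid}


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def timed_fetch(pids, workers):
    """Fetch sysmeta for every pid using `workers` threads.

    Returns (results in input order, elapsed seconds).
    """
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(get_system_metadata, pids))
    return results, time.perf_counter() - started


if __name__ == "__main__":
    pids = [f"pid-{i}" for i in range(40)]  # made-up identifiers
    for batch in batches(pids, 20):
        for workers in (1, 4):
            _, elapsed = timed_fetch(batch, workers)
            rate = len(batch) / elapsed
            print(f"batch of {len(batch)}, {workers} worker(s): {rate:.1f} lookups/sec")
```

If the rate scales with the worker count, the per-call latency is the likely limit; if it stays flat, the bottleneck is probably a shared resource such as the database or disk, which matches the checks listed under "Determine where the bottleneck is".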