Monitoring discussion ===================== Participants: Chris, Dave, Matt, Nick, Robert, Skye, David, Bruce (late) Notes ----- 1) Nagios is running on monitor.dataone.org (kansas IP) * Status: running * check_mk interface: https://monitor.dataone.org/check_mk/ * all machines in test and production montored * low frequency checks (per minute) * role: good for service availabilty, ports * Nick: email alerts are super important * pulls at a default once per minute * setup is easy 2) Statsd * Running on statsd.dataone.org * https://statsd.dataone.org/ * Status: running * Listens on UDP port * currently streaming events for development CN hazelcast * role: realtime watch, troubleshooting * conclusion: it could be turned off for most normal monitoring, but is useful for development, high frequency testing 3) Splunk * realtime monitoring: tie an event to a log message * log forwarding running over TCP, sent to the splunk box at ORC * https://splunk.dataone.org:8000 * messages get to splunk within a second or two * alerts can be set up at any frequency, can be dialed back * if logging stops: write an alert to look at rate of log content drop * Accounts are separate from DataONE LDAP. See Bruce, David Doyle, Dave Vieglais to get an account set up on this. 4) Amazon AWS monitoring * can monitor both internal AWS services and external services * Alerting * Can configure HA and load-balancing to respond to these events DECISION: We can hold off on installing splunk on monitor.dataone.org Monitoring levels: * harware monitoring * service monitoring * In phase II of DataONE: * named responsibility for monitoring * passed baton Do we want a publicly visible monitoring synopsis? * e.g., https://status.github.com/graphs/past_month * e.g., http://www.google.com/appsstatus#hl=en&v=status * The DataONE dashboard should serve this function ACTION: Chris will write up tickets for hazelcast cluster monitoring ACTION: David will work with Bruce to create alerts for Hazelcast cluster membership BEW -- can someone clarify what kinds of log event we're looking for to generate the alert? Will be in the tickets ACTION: Dave will configure Nagios email alerts ACTION: Define a monitoring responsibility plan (Dave) ACTION: Robert, Bruce, and David D. will put together a cross training session on splunk monitoring ACTION: David Doyle to set up accounts in Splunk for Chris Jones, Matt Jones, Dave Vieglais, Skye Roseboom, Nick Brand, Robert Waltz + others on coredev (See below) Current coredev group is: * uniqueMember: uid=vieglais,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=jones,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=cjones,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=waltz,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=dahl,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=rnahf,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=leinfelder,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=sroseboo,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=palanisamy,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=tao,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=cbrumgard,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=brand,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=bwilson,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=ddoyle,ou=Account,dc=ecoinformatics,dc=org * uniqueMember: uid=slaughter,ou=Account,dc=ecoinformatics,dc=org