New Room: DataONE Upgrade Process in the EVO list

http://evo.caltech.edu/evoNext/koala.jnlp?meeting=vsvivIese9IlIuanaMItas

(take a 5 minute break) 


High Level goals:
1. ONEMercury always up
    - read access w/ authorization:  'GET' calls
2. CNCore & CNRead always up


Goals

1) One goal was to always have two CNs up and running
   a) not certain of the benefit given the way we have DNS round-robin (RR) running right now; it is not a failover solution, but a load-balancing solution.
   
   
2) Always have a single CN up and running (no downtime)
   a) Always have read + write services up (?). When we say 'no downtime', do we always need a fully writable system? To avoid interfering with Member Nodes, at a minimum we should be able to call 'reserveIdentifier' as a write function.

3) Do not have different versions of products running on CNs communicating with one another
4) Do not allow a situation in which a user experiences data retrieval inconsistency

Issues

Proposed Upgrade Procedure starting with 1.0.4

0. Turn off d1-processing and indexing
1. Point DNS only to CN1
2. Turn off metacat replication on all three CNs (toggle the cn_rest read-only switch - not available for 1.0.4! - reserveIdentifier must stay up)
3. Keep CN1 up
4. Shutdown CN2 and CN3
5. Upgrade CN2 and CN3
   5.1 Upgrade procedure on a CN
      a) back up the database and the metacat filesystem subdirectories /var/metacat/data and /var/metacat/documents (copy with permissions preserved); see the backup sketch after this procedure
         1) turn off the d1-processing/d1-indexing/tomcat/slapd/postgres daemons
         2) nohup lvm-snapshot.sh > lvm-snapshot.out 2> lvm-snapshot.err < /dev/null &
         3) back up the /boot filesystem; it is not believed to be under lvm
         4) create backup directories and change into them
         5) nohup tar -C /var/metacat -zcpf metacat-bak.tgz . > /tmp/metacatArchive.out 2> /tmp/metacatArchive.err < /dev/null &
         6) root@cn-orc-1:/var# mkdir postgres-bak
         7) root@cn-orc-1:/var# cd postgres-bak
         8) nohup su postgres -c "/usr/bin/pg_dump -Fc metacat" > metacatDB.dump 2> metacatDB.err < /dev/null &
         9) sftp metacatDB.dump and metacat-bak.tgz to remote FS
     b) move old catalina.out files
         1) mv /var/log/tomcat6/catalina.out  /var/log/tomcat6/catalina.yyyy-mm-dd.out 
     c) stop/restart ldap service?
     d) run apt-get update/apt-get upgrade
      e) remember to configure metacat if metacat has been upgraded
        1) set any web app log levels to debug or another desired setting
        2) restart tomcat to activate changes to metacat
6. Start CN2 (maybe rename hz groups and change ldap password)
7. Point DNS to CN2 (only)
8. Shutdown CN1
9. Start CN3 (will take a long time, but no clients will be using it yet)
10. Upgrade CN1
11. Start CN1 (wait for long start up)
12. Add CN3 and CN1 to DNS RR when they are back online.
13. Turn on d1-processing and indexing
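
The backup commands in step 5.1.a can be collected into a single script. A minimal sketch, assuming the artifacts are staged under /var/postgres-bak and copied to a placeholder remote host; the staging path, remote destination, and use of scp rather than an interactive sftp session are assumptions, not part of the agreed procedure:
------------------------------------------------------
#!/bin/bash
# Sketch of the backup in step 5.1.a.  The tar and pg_dump invocations are the
# ones from the procedure; the staging and remote locations are assumptions.
set -e

STAGE=/var/postgres-bak                 # hypothetical staging directory
REMOTE=backup-host:/srv/cn-backups      # hypothetical remote filesystem

mkdir -p "$STAGE"
cd "$STAGE"

# Archive /var/metacat (data and documents included), preserving permissions (-p)
tar -C /var/metacat -zcpf metacat-bak.tgz . \
    > /tmp/metacatArchive.out 2> /tmp/metacatArchive.err

# Dump the metacat postgres database in custom format (-Fc)
su postgres -c "/usr/bin/pg_dump -Fc metacat" > metacatDB.dump 2> metacatDB.err

# Copy both artifacts off the CN
scp metacat-bak.tgz metacatDB.dump "$REMOTE"
------------------------------------------------------
The procedure above runs the tar and pg_dump steps under nohup ... & so they survive a dropped shell; the sketch runs them sequentially for simplicity.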

This script will re-index a CN. Place it in a directory, name it reIndexCn.sh, make it executable, and call it with: nohup ./reIndexCn.sh > reIndexCn.out 2> reIndexCn.err < /dev/null &
------------------------------------------------------
#!/bin/bash

# Bounce the index task generator (stop, then start)
/etc/init.d/d1-index-task-generator stop
/etc/init.d/d1-index-task-generator start

# Hold index task processing while the index is rebuilt
/etc/init.d/d1-index-task-processor stop

# Run the index build tool (-a) to re-index the CN
java -jar /usr/share/dataone-cn-index/d1_index_build_tool.jar -a

# Resume index task processing
/etc/init.d/d1-index-task-processor start
------------------------------------------------------

To restart d1-processing
/etc/init.d/d1-processing start
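
The two indexing daemons used in the re-index script can be bounced the same way (stop, then start, as in the script above)
/etc/init.d/d1-index-task-generator stop
/etc/init.d/d1-index-task-generator start
/etc/init.d/d1-index-task-processor stop
/etc/init.d/d1-index-task-processor start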

1.0.2 Process
0. Turn off processing and indexing daemons (and log aggregation, which is not running now)
1. Take CN1 out of DNS, and shut down CN1
2. Upgrade CN1 (which turns on tomcat)
3. Wait
4. Point DNS only to CN1
5. Turn off tomcat on CN2 and CN3
6. Upgrade CN2 and CN3 (which restarts them)
7. Wait
8. Point DNS to all CNs
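
Before moving DNS between CNs in either procedure, the current round-robin entries can be checked from any host. A minimal sketch, assuming the shared round-robin name is cn.dataone.org (the hostname and URL here are illustrative):
------------------------------------------------------
# List the A records currently served for the CN round-robin name (assumed name)
dig +short cn.dataone.org

# Spot-check that whatever answers for that name responds over HTTPS
curl -sk -o /dev/null -w "%{http_code}\n" https://cn.dataone.org/
------------------------------------------------------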


   
Stories for 1.0.3
    1) Toggle switch in node.properties that will set the CN REST service to Read ONLY
    2) Pull the HZ client configuration out into /etc/dataone for d1_processing, cn-rest-service, and indexing
    3) Modify the Debian postinst script to change group names in the HZ configuration files in /etc/dataone, perhaps appending the version # to a Hazelcast group base string (a sketch follows this list)
    4) Create a manual process to communicate new LDAP passwords
    5) Research methods to migrate from one data structure to another in Hazelcast and LDAP (switch a 1.0.x-compliant systemMetadata structure to a 1.1.x structure without shutting down services)
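
For story 3, one possible shape of the postinst change is a substitution over the HZ configuration files that story 2 moves into /etc/dataone. A minimal sketch; the group base string "DataONE", the file glob, and the way the version is supplied are all assumptions, not decisions from this meeting:
------------------------------------------------------
#!/bin/bash
# Hypothetical postinst fragment: append the release version to a Hazelcast
# group base string in the configs under /etc/dataone.
# "DataONE" as the base string and the *hazelcast*.xml glob are assumptions.
VERSION="1.0.4"    # would normally come from the package being installed
for cfg in /etc/dataone/*hazelcast*.xml; do
    [ -e "$cfg" ] || continue    # skip if the glob matched nothing
    sed -i "s|<name>DataONE</name>|<name>DataONE-${VERSION}</name>|" "$cfg"
done
------------------------------------------------------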


Stories for 1.1.0
    1) Migrate the solr index.
        a. Install should detect a migration upgrade and echo to the console that a migration is needed.
        b. Start the index task generator daemon.
        c. Migrate the solr search index schema: /usr/share/dataone-cn-index/scripts/migrate-search-index.sh (usage sketch below).
            i.)  Requires sudo permission.
            ii.) Handles creating the new solr core, running the index tool, and starting the index processor when complete.
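
A likely invocation for the migration script in 1.c, given the sudo requirement (whether it takes additional arguments is not covered in these notes):
sudo /usr/share/dataone-cn-index/scripts/migrate-search-index.sh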