/20111215-knb-upgrade-process

#persist

Metacat upgrade strategy
===================

Items that need to be upgraded
------------------------------------------
1. Identifier upgrade
2. Generate access policy
3. Generate replication policy
4. Generate system metadata
5. Generate ORE maps

Identifier upgrade
-------------------------
In general, identifiers are upgraded automatically: for existing content, each new identifier is the docid joined with its rev string. This guarantees uniqueness and gives a direct correspondence to existing local identifiers. However, some content may also get DOIs. How, and to whom, do we assign DOIs?

Existing content
    a. Local content (home server = 0)
    b. Replicated content
        b1. LTER
        b2. SanParks
        b3. PISCO
        b4. Brazil
        b5. GBIF
        b6. Palmyra
        b7. ESA
    c. Non-replicating, external metacat servers
        c1. Taiwan
        c2. etc.

New content
    a. DataONE API
    b. Metacat API
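
As a minimal sketch of the automatic identifier upgrade described above, the function below joins a docid and a rev into one string. The `.` separator, the function name, and the sample docid `knb.123` are assumptions for illustration; the notes only say that the two strings are joined.

```python
def upgraded_identifier(docid: str, rev: int, separator: str = ".") -> str:
    """Join a local docid and its revision into a single identifier.

    The separator is an assumption; the notes only state that the
    docid and rev strings are joined.
    """
    if not docid:
        raise ValueError("docid must be a non-empty string")
    return f"{docid}{separator}{rev}"


# Hypothetical docid, for illustration only.
print(upgraded_identifier("knb.123", 2))  # knb.123.2
```

Because every (docid, rev) pair is already unique locally, the joined string inherits that uniqueness and can be mapped straight back to the local identifier.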

ORE Generation
------------------------
We want to generate package descriptions for EML-described data so that the existing EML-sense of a package is maintained in DataONE. The strategy for this will differ along two axes -- whether the EML is local or replicated, and whether the data is stored in metacat or just referenced via URI. We also need a strategy for how to handle new content after the upgrade occurs. See discussion here:
   http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5522

When to do this upgrade? 1) At upgrade to 2.0.0, or 2) When D1 MN status is turned on?
   -- Decision: (2) create flag in metacat config, only do ORE gen when metacat is a MN
   -- as a CN: never generate anything (need a flag for acting as a CN? existing node type field?) (need a CN-side check that new nodes don't register as new CNs)

Data stored in Metacat (ecogrid URI)
   -- yes
Data referenced via other URL
   -- Yes, download & save object if 1) resolvable, 2) matches type
   -- Otherwise, just include URL in ORE map
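
The rules above for data entities can be sketched as a small decision function. This is only an illustration: the `Action` names and the `resolvable` / `type_matches` flags are hypothetical, and the actual resolvability and type checks would have to be supplied by the caller.

```python
from enum import Enum


class Action(Enum):
    USE_STORED_OBJECT = "data already stored in Metacat"
    DOWNLOAD_AND_SAVE = "download and save the object"
    REFERENCE_URL_ONLY = "just include the URL in the ORE map"


def plan_for_data_url(url: str, resolvable: bool, type_matches: bool) -> Action:
    """Decide how a data entity referenced from EML is handled."""
    # Data referenced by an ecogrid URI is already stored in Metacat.
    if url.startswith("ecogrid://"):
        return Action.USE_STORED_OBJECT
    # Other URLs are archived only when both conditions hold.
    if resolvable and type_matches:
        return Action.DOWNLOAD_AND_SAVE
    # Otherwise the URL is only referenced from the ORE map.
    return Action.REFERENCE_URL_ONLY
```

For example, a URL that resolves but returns content of the wrong type yields `Action.REFERENCE_URL_ONLY`.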

Existing content
    a. Local content (home server = 0)
        -- Decision: generate ORE (assuming conditions above)
    b. Replicated content
        b1. LTER (to be MN)
        b2. SanParks (to be MN)
        b3. PISCO
        b4. Brazil
        b5. GBIF
            metacatdev.gbif.org/knb/servlet/replication, vs. OAIPMH
        b6. Palmyra
        b7. ESA
        b8. iEcolab (Spain)
    c. Non-replicating, external metacat servers
        c1. Taiwan
        c2. etc.

New content
    a. DataONE API
        -- Decision: do nothing
    b. Metacat API
        -- Generate ORE, iff the metacat instance has D1 turned on
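
The decisions recorded so far can be collected into one predicate. This is a sketch under assumptions: the origin labels and the `is_member_node` flag are hypothetical names standing in for the proposed Metacat config flag. Replicated and external content raise an error because the notes record no decision for them yet.

```python
def should_generate_ore(origin: str, is_member_node: bool) -> bool:
    """Return True when an ORE map should be generated for an object.

    origin is one of the hypothetical labels "local_existing",
    "dataone_api", or "metacat_api".
    """
    # ORE generation only happens when Metacat acts as a D1 Member Node;
    # a CN (or a plain Metacat) never generates anything.
    if not is_member_node:
        return False
    # Content arriving through the DataONE API: do nothing.
    if origin == "dataone_api":
        return False
    # Existing local content and new Metacat API content: generate ORE.
    if origin in ("local_existing", "metacat_api"):
        return True
    # No decision is recorded for other categories (e.g. replicated content).
    raise ValueError(f"no decision recorded for origin {origin!r}")
```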

~~Sync seq~~
   ~~1. KNB home content~~
   ~~2. LTER home content~~
   ~~3. SanParks Home content~~
   ~~4. KNB replicated content~~
       ~~(any OREs that already exist don't get generated for existing MNs)~~
       ~~(for other KNB rep nodes, generate ORE maps with KNB as authoritative)~~
   ~~5. LTER replicated content~~

Converting a node from KNB to D1
--------------------------------------------------
0. Generate ORE (for all content that doesn't already have it) and sync all KNB content
1. Turn off LTER KNB replication
2. Register LTER as MN, with sync off
    2a. (LTER avoids generating ORE for any content whose graph already exists)
3. On KNB, change Auth MN to LTER for all LTER replicas and ORE maps
4. Turn on LTER sync
    -- when CN discovers that an object is a replica, CN triggers sysmetaChanged event at LTER
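
Step 3 of the procedure above, reassigning the authoritative MN on KNB, might look roughly like the sketch below. The record layout (plain dicts with `pid`, `home`, and `authoritativeMemberNode` keys) and the node identifiers are assumptions for illustration, not Metacat's actual storage model.

```python
def reassign_authoritative_mn(records, home_node, new_auth_mn):
    """Set new_auth_mn as authoritative for every record homed at home_node.

    records is a list of dicts (hypothetical layout). Returns the pids that
    were changed, so each can be followed up (e.g. a sysmeta change event).
    """
    changed = []
    for rec in records:
        if rec["home"] == home_node and rec["authoritativeMemberNode"] != new_auth_mn:
            rec["authoritativeMemberNode"] = new_auth_mn
            changed.append(rec["pid"])
    return changed


# Illustrative data: two LTER replicas held on KNB and one KNB-local object.
records = [
    {"pid": "lter.1.1", "home": "LTER", "authoritativeMemberNode": "urn:node:KNB"},
    {"pid": "lter.2.1", "home": "LTER", "authoritativeMemberNode": "urn:node:KNB"},
    {"pid": "knb.9.1", "home": "KNB", "authoritativeMemberNode": "urn:node:KNB"},
]
changed = reassign_authoritative_mn(records, "LTER", "urn:node:LTER")
```

Running the function a second time changes nothing, so the step can be repeated safely before sync is turned back on.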

Generating Replication policies
---------------------------------------------

Updating access policies
------------------------------------
Regarding RightsHolder and AuthoritativeMN, see:
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5523

Checking if data downloaded is legit
-----------------------------------------------------
    switch (eml-type):
      case text/plain:
        if (isText() && !isHTML()) then archive()
      case text/csv: (or other delimited types)
        if (isText() && isValidCSV()) then archive()
      case text/html:
        if (isHTML()) then archive()
      case application/excel or application/msaccess:
        if (isBinary()) then archive()
      case image/*:
        if (isValidImageFormat()) then archive() (maybe just isBinary()?)
      ...
      case netCDF:
        if (isNetCDF()) then archive()
      case text/pdf:

      ...
      default:
        break;

    alternate:
      if (text/html && isHTML()) then archive()
      if (!text/html && !isHTML()) then archive()
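
A few of the checks above can be sketched in Python as follows. The heuristics are assumptions chosen for illustration (UTF-8 decoding for text, a doctype or `<html` tag for HTML, equal field counts for CSV, magic bytes for netCDF); the notes do not specify how each test would be implemented, and the remaining cases are left out rather than guessed.

```python
import csv
import io


def is_text(data: bytes) -> bool:
    """Heuristic: no NUL bytes and the content decodes as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_html(data: bytes) -> bool:
    """Heuristic: the first kilobyte holds an HTML doctype or <html tag."""
    head = data[:1024].lower()
    return b"<!doctype html" in head or b"<html" in head


def is_valid_csv(data: bytes) -> bool:
    """Heuristic: at least two fields, same field count on every row."""
    rows = [r for r in csv.reader(io.StringIO(data.decode("utf-8"))) if r]
    return bool(rows) and len(rows[0]) > 1 and all(len(r) == len(rows[0]) for r in rows)


def is_netcdf(data: bytes) -> bool:
    """Classic netCDF starts with CDF\\x01 or CDF\\x02; netCDF-4 is HDF5."""
    return data[:4] in (b"CDF\x01", b"CDF\x02") or data[:8] == b"\x89HDF\r\n\x1a\n"


def should_archive(declared_type: str, data: bytes) -> bool:
    """Switch-style check: archive only if content fits the declared type."""
    if declared_type == "text/plain":
        return is_text(data) and not is_html(data)
    if declared_type == "text/csv":
        return is_text(data) and is_valid_csv(data)
    if declared_type == "text/html":
        return is_html(data)
    if declared_type == "netCDF":
        return is_netcdf(data)
    return False  # default: do not archive


def should_archive_alternate(declared_type: str, data: bytes) -> bool:
    """Alternate rule: archive when 'declared as HTML' agrees with 'is HTML'."""
    return (declared_type == "text/html") == is_html(data)
```

The alternate rule is much cheaper: it only rejects the case where a URL declared as data returns an HTML page (or the reverse), and accepts everything else.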