.. meta::
  :keywords: DataONE, CCIT, 20110202, VTC

DataONE Developer Call - 2011-02-02
===================================

:Attendees: Bob Sandusky, Chris Jones, Jeff Horsburgh, Rebecca Koskela, Paul Allen, Mark Servilla, Rob Nahf, John Kunze, Line Pouchard, Ryan Scherle, Matt Jones


Agenda and Notes
----------------

1. Summary of the semantics working group meeting held Jan 24-26 at Stanford (Jeff / Line)

Major deliverables:

- roadmap of technologies that may be of use to DataONE
- identify a scientific scenario to help describe use cases, defining semantics for DataONE
- Scenario: Representation of hydro data or hydro science in climate models

  - how to make observational and climate data work together
  - https://sonet.ecoinformatics.org/observational-data-use-cases

- Talked a lot about ontologies and ontology repositories
- Needed more specific scenarios and use cases
- Perhaps co-locate next meeting with CCIT?


2. Selection of July CCIT meeting location.

- one of Santa Barbara, New Mexico, NESCent, UTK
- Dates: 19-21 July, travel on 18 and 22 July.
- Select between NM and Santa Barbara

3. Scheduling for next activities on authentication and authorization

- Lots of institutional / project investment in existing infrastructures
- Work through design and prototype from existing feedback from workshop

4. Progress towards review preparedness

- Send out CI section for review by CCIT late Wednesday / Thursday morning.

5. Preservation Strategy draft for NSF review -- feedback?

From Jan 31 email:

Dear CCIT,

A draft report from the Preservation Workshop is attached.  It weighs in
at a little over 6 pages, and I understand that we still need to get it
down to 3-5 pages for the NSF review.

I know this is an especially busy time for everyone, but if you have a few
minutes, I left in a number of comments to highlight some issues that the
group should be aware of and comment on as to plausibility/feasibility.

I'll summarize the 6 salient points below.  Please skim them and holler
if you see something we shouldn't put in the NSF report.

   1.  A fair amount of security is assumed at the MNs, which the report
       tries to make more explicit by reference to widespread PCI and ITIL
       standards for physical security and electronic perimeter controls
       (firewalls) that we all take for granted in higher ed and federal
       agencies.  I'm asking the DUG if perhaps this should be more of a
       requirement of MNs than just a recommendation.

   2.  The workshop consensus was on making 2 replicas of each dataset
       (2 copies plus MN copy making 3 instances).  One issue brought up
       by the EAB is that a uniform policy like this could cause problems
       when registering especially large datasets.  Do we need an answer
       for that right now better than "don't register REALLY big data"?

   3.  The report claims that access control rules will be honored for
       replicated data, but doesn't say where those rules will be fetched
       from, what happens if they change since the replica was created,
       etc.  I'm not very comfortable with it, but can we live with this
       claim for now?

   4.  For fixity/integrity checking, the workshop proposed the "pop
       quiz" approach for random subsets of data holdings:  request the
       data and recompute the digest.  Exhaustive checking isn't feasible
       for the amount of data we'll have and re-computing digests is
       needed to prevent easy spoofing (e.g., when asked for the digest,
       just return what you stored originally).

   5.  Two claims regarding formats:

       a. "DataONE will encourage use of data formats that are open,
          transparent, widely used, and non-encrypted".

       b. "Use of automated characterization tools, such as DROID and
          JHOVE2 will be strongly recommended of data providers."

       If we're ok with these, we'll have to document them somewhere, e.g.,
       MN guidelines.

   6.  Is it ok to make this claim regarding migration:  "The versioning
       of managed content that results from migration will be reflected
       in that content’s system metadata.  All migrated content will be
       subject to “before” and “after” characterization to ensure the
       semantic invariance of the transformation."
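
The "pop quiz" fixity check described in point 4 of the email could look roughly like the sketch below. This is only an illustration, not a DataONE API: the names ``compute_digest``, ``pop_quiz``, and ``fetch`` are hypothetical, SHA-256 is an assumed digest algorithm, and ``holdings`` is assumed to map each identifier to the digest recorded when the object was registered. The key point is that the checker requests the data itself and recomputes the digest, rather than asking the holder to report a stored digest.

```python
import hashlib
import random


def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Recompute a digest from the object's bytes (algorithm is an assumption)."""
    return hashlib.new(algorithm, data).hexdigest()


def pop_quiz(holdings, fetch, sample_size, rng=None):
    """Check a random subset of holdings by re-fetching and re-hashing.

    holdings    -- dict mapping identifier -> digest recorded at registration
    fetch       -- hypothetical callable returning the object's bytes for an identifier
    sample_size -- how many objects to quiz in this round
    Returns (chosen identifiers, identifiers whose digest no longer matches).
    """
    rng = rng or random.Random()
    ids = sorted(holdings)
    # Exhaustive checking is assumed infeasible, so only a random sample is quizzed.
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    failures = []
    for pid in chosen:
        # Request the actual data and recompute; never trust a reported digest.
        if compute_digest(fetch(pid)) != holdings[pid]:
            failures.append(pid)
    return chosen, failures
```

In practice ``fetch`` would retrieve the object from the node holding the replica; here it is left as a parameter so the sketch stays independent of any particular service interface.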