28 January 2013
Santa Fe, NM

AI:

Attendees:
Rebecca Koskela
Bill Michener
Bob Cook
Susie Allard
Bruce Wilson
Amber Budden
Dave Vieglais
John Cobb
Matt Jones
Mike Frame

Meeting Agenda at 
<https://docs.dataone.org/member-area/documents/management/nsf-reviews/nsf-reverse-site-visit-february-2013/folder-for-rsv-planning-meeting-santa-fe-1-28-1-30/Draft%20AGENDA%20RSVPlanning%20_23Jan2013.pdf/at_download/file>
Schedule is loose.

Bill Kickoff:

Think through what to do for the next 18 months of the award

Do some brainstorming about the next proposal - DataONE Two

Chadduck offered to review alternative ideas about DataONE Two.
A sustainability component is expected.
We want to do some new stuff
We can't just keep doing what we are doing right now.


Spend some time profiling member nodes
Talk about collaboration and DataONE strategy
Talk about what we are putting in place in terms of support
Talk about prioritization and selection
Talk about future MN goals
We hope to have EDAC, Dryad, and Brazilian MNs if we can get them


Talk about story-telling

Lessons from Science: Communicating the Message

Leave the audience with a key message

Have supporting points around that.
It could be several things:
Here is a challenge
Here is the vision for implementing the solution
Here is the result

People remember a key message and up to 3 points.

Last site-visit: DataONE was supporting the data lifecycle

What are the key messages?

Enable new science about data science and life on earth and the environment that sustains it.

Exercise:

Matt: DataONE enables critical science via an open and sustainable data framework and the tools researchers need.
Matt: It is missing CE (community engagement)

Allard: DataONE provides leadership, relationships, research, education, and technology solutions to the scientific data community to enable new science.

Bob Cook: Preservation and delivery of data to enable science

John Cobb: DataONE uniquely helps science data users, archives, and producers by creating collaborative infrastructures across multiple collections. (Not said, but continuing: that allows easier and more comprehensive data access and methods. In the process it augments the use and re-use of science data and enables new, previously impossible research from data aggregation and synthesis.)

Amber Budden: The Data Lifecycle message.
Harnessing the value of community-wide resources through interoperable solutions for preservation and delivery of data

Bruce Wilson: DataONE is a unique blend of community engagement and cyberinfrastructure that enables new science and meets NSF needs for long-tailed earth science data.

Dave Vieglais: For a community wrestling with high-volume earth science data, DataONE enables efficient and reliable access to heterogeneous data resources through a combination of an open, extensible data management infrastructure, a suite of software tools, education, networking, and best practices.


Combined candidate effort:
Enabling science through community building and sustainable data discovery (? services) and interoperability solutions.

Three points:
1. Engaging and Building Communities (Amber, Suzie, John)
2. Sustainable and Open CI Framework (Dave, Rebecca, Mike)
3. User-focused tools and services (Matt, Bob, Bruce)

What evidence do we have that each of these supports our message?

Notes from #1 breakout
First what are the items that are in this area:
        Best Practice development
        Best Practices training and outreach
        Education events
                Data Manager's workshops
                Metadata workshops at USGS
                ESA activities/ conferences
        DUG
        MN recruitment
        Participation in other community activities (EarthCube, RDA, ...)
        Assessments
                Libraries
                Librarians
                Scientists
                Data managers
        Reciprocal research relationships
                ESA working group
                ESA 
                Other Datanets
                        Co-authored posters
                        SC11 tutorial with SEAD
                        panel sessions
                                panel on data curation profiles
                                Purdue 2-3 day workshop and report out
                                AAG
                                
                        Booth colocation
                Data requirements collaborations
        Co-location of DUG with ESIP
        NEON 
                shared booth
                training events/sessions
        collaboration on grants and proposals
                DCRC - DataONE and Illinois; came out of Data Conservancy and DataONE collaboration
                DMPTOOL
                DMPTOOL2 (Sloan grant)
        Development collaborations
                DMPTool
                DataUP
        Social Media

Break out more detailed discussion of DUG:

Also think about how to order the presentations to make this point.

Back to

Detailed description of DUG.

- what is a users group
- what is this users group
- how to participate (show the blooming DUG network analysis)
- What does the DUG do (main focus)
        - requirements definition
        - engage relevant communities
        - messaging
        - prioritization process
                - MN process
                - tool evaluation
                - evaluation activities
                        DMPtool
                        usability
                        Promotional materials
                        examining the redesigned website
- DUG changes looking forward
        Allard:
                the revolution is pulling people together
                the evolution is changing the relationship so those groups contribute to the whole
        DUG interests expressed in evaluations
                Open Access & Data sharing
                Data Management
                Request DataONE information packet

Allard: Note: remember the interoperability thrust.
Human interoperability is a form of interoperability.

MN breakout:

 (2) Mention Metrics 
        How many
        overview of nodes - summary
        Size of holdings
        growth over time
        
(2) Diversity
        Platforms
        Curation levels
        Data types
        Geography
        present as logo grouping

(3) Process definition (workflow communication)
        communication includes the office hours

(3) recruitment (including  prioritization and outreach)

(4) projected growth ( stick with Project management plan goals)
        How do we get to 50?
        gap analysis

(1) MN operational structure
        CN/MN communication
        Replication
        Global replicated metadata catalogue
        recap the

(2) MN operations:
        support
        outage and recovery issues
        

The collection of member nodes enables science (through interoperability)


1 About MN's
2 Current MN status
3 Process
4 future - where we are heading

Lunch

Reconvene as a group.

Think about structuring along the big 5 questions - how, why, what, ...

Key message at the center of a triangle, surrounded by three supporting points

Core of triangle: Enabling science by building community and providing sustainable data discovery and interoperability solutions

Three supporting points
* Building Community:
1. DataONE User group
2. Member Nodes
3. Education
4. Assessments (Bill thinks this is a top item)
5. Collaborations
6. Research (RDA, etc.) (Bill thinks this is a top item)
    a. Personas
    b. WG activities
7. Communication
    a. internal (proj mgmnt section)
    b. external

* Providing sustainable data discovery and interoperability solutions:
1. Common set of open interfaces that support broad sharing and preservation of data
    a. Architectural documentation
    b. MN & CN
    c. Exposing DataONE services to community tools
    d. Identifiers
    e. Replication
    f. Access Control/identity
2. Leveraging community tools (R, Kepler, Excel, etc.) and infrastructure (MN stack)
3. Development of metadata cross-walks for discovery

Discussion: we might have some difficulty in showing evidence of interoperability.

Evidence is functioning APIs and documentation.
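As a concrete illustration of that kind of evidence, a minimal sketch of exercising the public CN REST endpoints is below (Python with requests; the paths follow the v1 API as we understand the architecture documentation, and the identifier is a placeholder, not a real PID):

    # Illustrative only: list a page of objects from a Coordinating Node and
    # resolve a placeholder identifier. Endpoint paths assume the v1 REST API.
    import requests

    CN_BASE = "https://cn.dataone.org/cn/v1"

    # CNRead.listObjects - returns an ObjectList XML document
    objects = requests.get(CN_BASE + "/object", params={"start": 0, "count": 5})
    objects.raise_for_status()
    print(objects.text[:400])

    # CNRead.resolve - returns the Member Node locations holding an object
    pid = "example_pid"  # placeholder identifier
    locations = requests.get(CN_BASE + "/resolve/" + pid, allow_redirects=False)
    print(locations.status_code)  # 303 + ObjectLocationList when the PID exists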

Also talk about the MN and CN stacks

Also talk about the interoperability of client tools

* Enabling Science via user-focused tools and services
1. EVA Science stories drive priorities
    a. Personas, baseline and DUG assessments, lifecycle
    b. EVA-eBird and EVA-Terrestrial Biosphere Model
2. Tools for Whole Lifecycle
    a. DMP, DataUp, Morpho, R, ONEMercury, Zotero, VT, Kepler
3. Community-Driven Development (this is the model we need to nurture)
    a. DMP tool, DataUP, VT, Kepler, R-provenance
4. Future Work
    a. Streaming data, subsetting, integration, semantics
    b. planned for evolution
    
Should we include the idea of a storage Member Node? Not at this time, but we could if we got it implemented.

Things that came out of eBird
    data staging
    proximity to HPC
    Data subsetting

iLAMB lessons
    provenance aware workflows
    data integration with processing
    better tool support for NetCDF

For Powerpoint follow this same model. Have a central thought and then build three points around that.
i.e. challenges, solutions, future solutions

Example: we built community.
We listen - assessments, DUG, ...
We engage the community.
We make an active effort to communicate with the community and keep them involved.

Tuesday:

Review of slide outlines. 

Overview: Bill

Engaging and building community: Suzie, Amber, John

Slide 5 
fix EarthCube typo
rephrase NSF requirements change

Slide 16: Should we retain education summit bullet?

Slide 23:
We are talking about MN's before we have presented the overall structure. That needs to have happened.
One slide talking about 

Perhaps regroup MN presentation elsewhere later.

define a real name for "Office hours" 

Listen, engage, communicate
but separate out 
user researchers and then MN's


************************************
Wednesday, January 30, 2013

Agenda:

Kunze online in afternoon
Next 18 months


Discussion of next 18 months
======================
We need to be specific with respect to the next 18 months.

Need to justify change in metrics
   -- utilize change control 

Current metrics
----------------------
Start with PMP metrics

Number of Member Nodes:
  End of Yr 3: 10
  Currently operational (10): CDL/Merritt, KNB, LTER, ORNL-DAAC, SanParks, USGS-CSC, ESA, PISCO, ONEshare, CLO/AKN/eBird
   https://redmine.dataone.org/rb/master_backlog/mns
    yr 4: 20
High Prob: (12)   
    EDAC, (200,000 datasets and ~ 8TB) (Dave: I'm fairly confident)
    Dryad, 
    SEAD, 
    PELD, 
    Taiwan, 
    REPUNM, 
    REPUCSB, 
    REPORC, 
    KU, 
    AOOS?, 
    USA-NPN, (probably June-July timeframe; ~10 datasets and 10's of MB)
    USGS Topo
Lower Prob: (2)  
    DFC, 
    iPlant, 
        
    yr 5: 40   
        -- requires self-building network; 
        -- new software stacks; iRODS, DSpace
        -- Member Node in a box as a turnkey installation 
   
   Post discussion, expect some re-ordering. Look at https://redmine.dataone.org/rb/master_backlog/mns
   
   
Other discussion: Chicago, UTK libraries
    
Other candidate member nodes mentioned in Redmine but not included above (25, counting DSpace as 1):
NODC
GMN production MN's at UNM and ORNL (2)
Replication MN's at CN locations (3)
USGS Climate Science Centers (8)
    NKN-Gollberg
    Berrien Moore
    6(?) others
Prairie Research Institute
Arctos
Globe.gov
DSpace: 200 implementations have datasets per the DSpace site - 25 in US. These include Woods Hole, NASA Langley. Universities include (but I don't know if holdings are environmental) Delaware, George Mason, UIUC, Maryland, Michigan (Deep Blue), Oregon State, Missouri, etc.
GLEON
IABIN
CitSci.org
iEcolab
ALA Atlas of Living Australia
AZGS - Arizona Geological Survey 
USGS Topo
iLTER - Israel


Mike question: What is the atomic unit of an MN? Is USGS one MN or several MNs?
The same can be said for LTER, KNB, 
Perhaps we need a new vocabulary for Member Nodes: MN Connects (those that actually interact with DataONE directly) and MN Penetration (those that are reached through DataONE regardless of their direct connection to DataONE).
Member Node Countries
                                                                                 Y1         Y2      Y3        Y4       Y5
Data Volume (TB)                                                       0.4       1           5         40       60
    -- change to "capacity", focus on long-tail
    
Number of Metadata Records                                       5K    25K      100K     400K    1M
     -- includes packages without data, and prior versions
     - we also have metadata for previous versions
     -- EDAC will add 200K
     
Number of Data Sets at Member Nodes                         5K     22K     90K    180K    360K
     -- Add in USGS quads to CSAS clearinghouse
     
Number of Tools in the ITK                                             1        4        8          10        12
  - python lib, java lib, (BASH client), (Developer Console), CLI, ONEMercury, DataUp, Morpho, R, ONEDrive, Kepler, VisTrails, 
Number of Metadata Schemes Supported                        2        5         8           8          8
  -- Have 5 now, easily can add 8
  
 Number of Member Nodes                                             3          6      10        20       40          
 
 Total storage capacity at   Member Nodes (TB)              0.5       3      15       200      2K         
    -- need to harmonize this with Data Volume stats, esp. in yr 5
    
Geographic coverage of   Member Nodes – countries        1       1        3          5        10          
Now:
    US
    South Africa
Future
    Taiwan
    Israel (iLTER)
    Brazil
    others from iLTER's
Number of Coordinating Nodes                                         3         3        3         4         5   
    -- revise to 3 for end of project
    -- new model in phase 2 for international CNs
           
Total storage capacity at each Coordinating Node (TB)      0.5      2       6         8         10  
   -- already hit 10+
            
Geographic coverage of Coordinating Nodes – countries     1       1       1         1           2 
    -- will not do.
        latency issues require some attention and it may not be prioritized higher than other opportunities
    
DataONE Usage Statistics    
Number of web interactions on DataONE portal (per/yr)          -        40K    200K    500K     1M

What are web interactions?
    web visits?
        Web visits of at least a minute?
    CN contacts?
    Mercury searches?
    -- Amber will shepherd the process
        - we can count up the number of searches by unique users and also whether that user clicked on the "data download" link or the "documentation" link within the metadata record
        

Number of DataONE users                                                     -          -         5K        10K     20K

How to count users?
Depositors:
 Rights holders?
 We should also count authentication for get/create calls
 LDAP entries? 
Accessors:
  as measured by weblogs
  2800 unique users (from google analytics - using supercookies to identify computers)
  public web-site has 8000 users across 2 months
Issues:
    IP NAT and multi-user sites cause erroneous aggregation and multiplication
    Indirect use by members of our federation (e.g., LTER users hiding behind 1 account)
-- Amber will coordinate compilation

Number of Tool Downloads (per/yr)                                    -         -          1K         5K      10K
   -- Amber will work with tool devs to try to compile these stats
Example project metrics: Nanohub: <http://nanohub.org/usage/> and <http://www.youtube.com/watch?v=PK2GztIfJY4>


Number of Metadata Catalog Search Records(per/yr)        -         10K    80K    200K    400K       
Do we have a measurement harness for this? need to check.
ONEMercury searches and API searches
        - we can count up the number of searches by unique users and also whether that user clicked on the "data download" link or the "documentation" link within the metadata record. Data downloads from these Mercury searches would be counted from the MN logs.
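A rough sketch of that counting approach (the log format here is hypothetical; the real harness would parse ONEMercury/CN access logs instead):

    # Hypothetical sketch: count searches per unique user and whether the user
    # followed up with a "data download" or "documentation" click.
    # Assumes a simplified CSV event log: timestamp,user_id,event
    import csv
    from collections import defaultdict

    searches = defaultdict(int)
    followups = defaultdict(set)

    with open("portal_events.csv") as f:   # hypothetical export, not a real file
        for ts, user, event in csv.reader(f):
            if event == "search":
                searches[user] += 1
            elif event in ("data_download", "documentation"):
                followups[user].add(event)

    print("unique searching users:", len(searches))
    print("total searches:", sum(searches.values()))
    print("searchers who clicked download/documentation:",
          sum(1 for u in searches if followups[u]))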
        
Number of Data Set Downloads (per/yr)                                       -          1K     10K     20K     50K
? may be short 
-- will get one of the core devs to produce the logs from MNs

Reliability & System Performance
Uptime (availability) of Coordinating Nodes (%)                   n/a        50     99     99.9     99.9        
-- currently at 100%

Uptime (availability) of Member Nodes (% )                         n/a        50     85     98        99           

-- better metric is resiliency of access to data, even in the face of MN outages (planned and unplanned)
-- maybe just availability of replicated data objects, which is what we can actually control
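Back-of-envelope for why replicated-object availability is the more controllable number (assuming independent MN outages, which is optimistic, and the MN uptime targets from the table above):

    # Probability that at least one replica of an object is reachable:
    # 1 - (per-node downtime) ** number_of_replicas
    def replicated_availability(node_uptime, n_replicas):
        return 1.0 - (1.0 - node_uptime) ** n_replicas

    for uptime in (0.85, 0.98, 0.99):      # MN uptime targets above
        for n in (1, 2, 3):
            print(uptime, n, round(replicated_availability(uptime, n), 6))
    # e.g. 3 replicas on 98%-uptime nodes -> ~0.999992 object availability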

Server response time for average user interactions (sec)          n/a       8       3       1         1    
-- meaningless because too broad
-- should be more specific, such as search response time

Response time of user interface for avg user interactions (sec)   n/a      8       5       1         1
-- meaningless because too broad
-- should be more specific, such as search UI time 

Community Engagement
Number of Baseline assessment of scientists completed               -      1       -         -           -
-- done

Number of repeat assessments of scientists completed                 -      -        -        1           -
-- scheduled for March 2013

Number of Baseline assessment of other stakeholders completed   -      -        -        1          2 
-- done
Number of repeat assessments of other stakeholders completed     -      -        -        1         2 
-- educators will go out with this year's repeat scientist assessment

Education and Outreach
Number of edu modules developed and/or accessible through D1     0      4       6       8     10 
-- met 5 yr goal
   
Number of times education modules are downloaded (per/yr)            0     50    100     200  400     
-- 70 in last two months, so on track
-- Rebecca will check if Apache logs were archived

Number of BP guides developed and/or accessible through D1        15    40      75     100  110
-- in some ways this is counter-productive as a metric -- fewer may be better
-- we have ~100

Number of times best practices are viewed/downloaded (per/yr)       0    200    400    800  1200
-- 611 page views of BP in 2 months
-- 54 downloads of primer in last 2 months

Number of training sessions or workshops offered                           0        3        3      4       4            
-- far exceeded (6 for ESA this year alone)

Number of workshop participants (per/yr)                                        0      75    100    100   140  
-- not well counted, but are exceeded
-- Rebecca will collate #s

Number of people in DataONE Users Group                                  25      50    100    150   200
-- we have 100 now
-- we have many more 'involved' in DataONE, we should report that as well
    -- Rebecca will compile this
-- look at community@dataone.org mailing list
-- we really shouldn't be trying to get numbers for the sake of numbers
-- need to redefine what DUG membership means, how formal it is
-- Amber & Andrew will draft a better definition of membership
     -- reconsider 2-year membership terms, concept of formal charter
     -- could consider a 10-12 person advisory committee instead of 2 yr terms, everyone is a member

Socio-cultural
Number of proposals submitted to  support DataONE research activities    -    1      2     3     4
-- hitting metrics, Rebecca can put in #s
Number of students involved in  supporting DataONE research activities    -    3      5      5    5
-- hitting metrics, Rebecca can put in #s
 
Publications              
Number of DataONE project publications                -       5   10     15    25
   -- 20 on web site, underreported
   -- DataONE grants included in acknowledgements
   -- exceeding these numbers
   
Number of publications citing DataONE                - there aren't any numbers listed in PMP -
-- need to post how to acknowledge DataONE, and how to cite DataONE

Sustainability
       
Amt of in-kind support generated annually to support D1 (FTE/yr)  2       3       4        5        5         
  -- compiled yesterday, Bill compiling these #s
  
Amount of funding generated annually to support DataONE          -       -      250K  500K  750K          
  -- compiled yesterday, Bill compiling these #s

Diversity of funding streams (including in-kind support)                2       2        3        3         4           
  -- compiled yesterday, Bill compiling these #s

Number of projects and partners collaborating with D1 or  
 leveraging DataONE infrastructure                                             -        2        4        6        8
-- check


-- For phase 2 proposal, redesign and consolidate all of these to a more meaningful set


List of MNs
ESA MN will become part of Dryad

link for Redmine area on MN's: https://redmine.dataone.org/projects/mns/issues

iPlant is enthusiastic if we will make iRODS available as an interface technology

EDAC


Mike: In next 18 months we need to ask who is/are the real focus of recruitment effort.

Matt: Getting to 20 at the end of year 4 is one thing. Getting to 40 at the end of year five is another.

Dave: We need to target keystone Member Nodes that help diffusion within the community.

Mike: our biggest challenge is MN recruitment.
We need to reach a tipping point where the member nodes see value.

and then have enough tools that allow people to use data.

We have Metacat as a stable MN interface
Bill has been making inroads with the DSpace community.
on the DSpace site 200 implementations specifically note that they have datasets in their collection

Remember the cost to implement an MN. The biggest cost is implementation. We have to minimize the number of SW stacks that we support.

The way to go is to target MN SW stacks and target communities

Install these are 

Rebecca: we also have a commitment for a 4th CN in year 4.



Mike: We need a different strategy.

Matt: we need a self-building strategy.

John: We need a "MN in a box" tutorial and provisioning.

Dave: We have had contact with the PRAGMA group, which has experience in SW deployment strategies with things like ROCKS.
The target will be a GMN. Get it so it is dead easy to install and get running.
This would serve small, laboratory-scale MNs.
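A minimal smoke test we might run against such a freshly installed turnkey node (the node URL is a placeholder; endpoint paths follow the Tier 1 MN API as we understand it, i.e. ping and getCapabilities):

    # Hypothetical check that an "MN in a box" install is up and answering
    # at the API level. Base URL is illustrative, not a real node.
    import requests

    MN_BASE = "https://mn.example.org/mn/v1"

    ping = requests.get(MN_BASE + "/monitor/ping")
    print("ping:", ping.status_code)        # expect 200 when the node is healthy

    caps = requests.get(MN_BASE + "/node")  # Node capabilities document (XML)
    print("getCapabilities:", caps.status_code)
    print(caps.text[:300])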

For successful networks, one must build resiliency into the network (BitTorrent and other file-sharing networks).

lots of little nodes and then target larger institution-wide nodes.

Dave: Any NSF project that gets funded should just be able to install a GMN and meet NSF requirements.
Mike: incorporate this into the Data Management plan for NSF grants.

Matt: so we are talking about a lab-oriented plan.
Mike: Put this in a proposal for phase 2.

Matt: we need to focus not just on MN count but also on building excitement through institutional interest and important data sets.

Users want a tool, not a project.
Dave: many people think of DataONE as a dropbox