.. meta::
   :keywords: DataONE, CCIT, 20110316, VTC

DataONE Developer Call - 2011-03-16
===================================

:Attendees: Ryan Scherle, Nick Dexter, Bob Sandusky, Mark Servilla, Rebecca Koskela, Bruce Wilson, Chris Jones, Roger Dahl, Paul Allen, John Cobb, Matt Jones, Giri Palanisamy, Jeff Horsburgh

[3:11 PM] Bruce: in tier 2, is mn_base different from mn_core in Tier 1?
[3:11 PM] Bruce: replication restrictions could be more than format.
[3:12 PM] Matt: bruce: MN_base has been renamed MN_read -- but wasn't updated throughout
[3:15 PM] Paul: The NBII MN will be a proxy for these other data centers?
[3:17 PM] Bruce: Interesting details to sort out, particularly since the ORNL DAAC is also an NBII node.
[3:17 PM] Matt: as is the KNB
[3:17 PM] Bruce: I think that what we talked about first for NBII was focussing on the data that was actually held by NBII, and then deal with how to handle the partner node data.
[3:18 PM] Bruce: @matt -- agreed. Duplication is an interesting problem to make sure we resolve. That's a good thing to sort out, since that's a problem in a lot of different places.
[3:19 PM] John: Right. I think we need a design that works where DataONE is a first order data integrator and where DataONE is a second or higher order integrator and where we may also have the same collection presented to DataONE through multiple aggregation paths.
[3:20 PM] Bruce: Identifiers are key, for the metadata and the data.
[3:20 PM] Bruce: Nothing wrong with a MN having copies of metadata that are relevant to their community of practice.
[3:22 PM] Bruce: 2 down, 78 to go.
[3:22 PM] Matt:
[3:23 PM] Bruce: Right, for the smaller nodes, NBII can be the member node to represent their data.
[3:23 PM] John: I had this same issue with TG. I proposed all interested TG RP sites to become D1-MN's and the TG Forum replied why not start with (or stay at) a single Dataone MN gateway for all of TG. Hmm.
[3:25 PM] Marco: that same pattern would hold for LTER too
[3:26 PM] John: We need to both solve the problem and to persuade potential MN's to have confidence that we have solved the tracking/audit issue
[3:39 PM] John: have a "private" parameter flag options that could be incorporated into the AA Api's
[3:39 PM] John: A provider may choose whether to allow or not allow anonymous access and a user may choose whether to be known or anonymous
[3:42 PM] Jeff Horsburgh joined.
[3:53 PM] Rebecca: @Matt: do you want it on this week's LT agenda?
[3:54 PM] Matt: yeah, that sounds good
[3:55 PM] Matt: I'll try to write it this afternoon and send it off
[4:03 PM] John: An out would be to have an alternate "bulk update" function that could ameliorate large updates or new node ingest
[4:04 PM] Robert: yep
[4:05 PM] Robert: on the other hand, if we change core api features that require upgrade to systemMeta, we need to be able to scale

Agenda and Notes
----------------

1. Different categories of Member Node

~~Core: Functionality common to all nodes, including optional methods~~
~~Old API: MN_health, MN_crud~~
~~New API: MN_core~~
~~Methods: ping, getCapabilities, getStatus, [getObjectStatistics], [getOperationStatistics], [getLogRecords]~~

Tier 1: Public read, no Authn/Authz
Old APIs: MN_crud, MN_replication, MN_health
New APIs: MN_core, MN_read
Methods: MN_core: ping, getCapabilities, getStatus, [getObjectStatistics], [getOperationStatistics], [getLogRecords]
MN_read: get, getSystemMetadata, listObjects, describe, getChecksum, synchronizationFailed

Tier 2: Read/Resolve with Authn/Authz
Old API: MN_authorization, MN_authentication
New API: MN_auth
Methods: MN_read, MN_core + login(*), logout(*), isAuthorized(*), setAccess(*)

Tier 3: Write (create, update, delete), possibly limited support for data types
Old API: MN_crud
New API: MN_storage
Methods: MN_auth + create, update, delete

Tier 4: Limited Replication target (specified data types)
Old API: MN_replication
New API: MN_replication
Methods: MN_storage + replicate

Tier 5: Replication target, any data types
Old API:
New API:
Methods: MN_replication (no additional methods)
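
Because each tier adds methods on top of the tier below it, the tier list can be read as a cumulative lookup. The following is only a sketch: the method names come from the list above, but the `TIER_ADDITIONS` table and the two helper functions are hypothetical and not part of any DataONE API. The bracketed MN_core methods are assumed to be optional and are left out.

```python
# Methods each tier adds on top of the tier below it.
# Names are taken from the notes; the structure itself is illustrative only.
TIER_ADDITIONS = {
    1: ["ping", "getCapabilities", "getStatus",            # MN_core (required)
        "get", "getSystemMetadata", "listObjects",         # MN_read
        "describe", "getChecksum", "synchronizationFailed"],
    2: ["login", "logout", "isAuthorized", "setAccess"],   # MN_auth
    3: ["create", "update", "delete"],                     # MN_storage
    4: ["replicate"],                                      # MN_replication
    5: [],  # no additional methods; differs from Tier 4 only in accepted data types
}


def methods_for_tier(tier):
    """Return every method a node at the given tier is expected to expose."""
    if tier not in TIER_ADDITIONS:
        raise ValueError(f"unknown tier: {tier}")
    methods = []
    for level in range(1, tier + 1):
        methods.extend(TIER_ADDITIONS[level])
    return methods


def minimum_tier(method):
    """Return the lowest tier at which a method becomes available."""
    for level in sorted(TIER_ADDITIONS):
        if method in TIER_ADDITIONS[level]:
            return level
    raise KeyError(method)


if __name__ == "__main__":
    print(minimum_tier("replicate"))
    print(methods_for_tier(2))
```

Keeping only the additions per tier makes the "tier N = tier N-1 + ..." relationship explicit and avoids repeating the MN_core and MN_read methods at every level.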

NBII may be a member node made up of other member nodes: many different organizations contribute systems, many with different authn / authz, so there will not necessarily be a clear demarcation of the functionality available across the participating collections. Per John, TG is in a similar situation.

Would NBII be used as a write target?

Dealing with aggregators is a general problem that needs to be addressed. A simplistic approach is to restrict access to particular collections that are already exposed through other MNs. It is important to try to work with existing information and copies of data that are already present in the system (efficiency).

Policy of data providers - dictates the tier of functionality that can be provided.

Start with simple case - the data sets that are already archived by the NBII.

KNB nodes that restrict access to data seem to do so primarily because of the issue of tracking data use. So dealing with the data access logging issue may be effective at eliminating much of the (read) access control restriction put in place by various repositories.

BEW: On replication -- there can be some restrictions based on data source, data type (format), or data domains. A MN may be unable to accept data for replication outside of their mission area.

Privacy concerns about logging information. Many libraries are dumping logs to avoid privacy concerns. If logging is required at Tier 1 then all nodes need to record information identifying users - so could potentially exclude a lot of participants.

If privacy is an issue - then support for logging information needs to be optional at tiers. There are some requirements for truly anonymous access support.
- Logging is desired by data providers, but data users don't want to be tracked.
- DataONE is a research project - stf OK to record logging information for purposes of the project. But forcing requirement for logging and changing later would not be trivial.

- important to get some involvement from the CE side of the project.
- Some collections may require access logging - stf no anonymous access allowed. There is an interesting tension on this point from a Federal perspective, with OMB pushing for more identification of who is accessing data for what purpose and OSTP pushing for more open access to data. As has previously been discussed, there are also some international issues that may arise, with conflicting laws that require logging in some countries and prohibit it in others. User logging has interesting consequences under the recent EU laws (that got pushed back). It is also complicated by the use of InCommon for authentication: a user doesn't specifically sign up for a DataONE account, so we have less opportunity to present conditions of use that make users aware of what data is logged about their access.
- Logging as a public feature? LTER - nope - log access is restricted to principals that have ownership / admin rights to the object. Dryad does provide summary stats - good feedback. Might be a popular feature to see summary statistics associated with particular objects through the Mercury interface (for example) or summaries of data downloads.

- Need a flag on content that indicates whether access must be logged, and similarly a flag for the user indicating that only information whose access is not logged should be accessible to them.

- Matt will write up some notes for distribution.

- Also need a clear statement of what data is tracked for the public release, so that it's up on the web site and readily available.
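
One possible reading of the two flags discussed above (a per-object "access must be logged" flag and a per-user "only unlogged access" preference) is sketched below. This is an assumption-laden illustration, not an agreed design: the function name, parameter names, and return values are all hypothetical, and the sketch assumes that access to objects which do not require logging is simply not logged.

```python
def access_decision(object_requires_logging: bool, user_accepts_logging: bool) -> str:
    """Combine the provider's flag and the user's preference into one outcome.

    Hypothetical return values:
      "deny"           - the object must be logged but the user wants anonymity
      "allow_logged"   - the object must be logged and the user accepts that
      "allow_unlogged" - the object does not require logging (assumed: no log kept)
    """
    if object_requires_logging and not user_accepts_logging:
        return "deny"
    if object_requires_logging:
        return "allow_logged"
    return "allow_unlogged"


if __name__ == "__main__":
    # Enumerate all four combinations of the two flags.
    for obj_flag in (True, False):
        for user_flag in (True, False):
            print(obj_flag, user_flag, access_decision(obj_flag, user_flag))
```

Under this reading, a user who opts for anonymous access only ever sees the subset of content whose providers do not require logging.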

2. Schedule for 2011

The calendar year is divided into six blocks of development activity, each
approximately eight weeks long. Each block is comprised of six weeks of
feature development, followed by two weeks of bug squashing, and ending with
product releases.

Block 1
2011-01-02 - 2011-02-26
Preparations and completion of 18 month project review

Block 2
2011-02-27 - 2011-04-23
MN-MN replication, authorization, CN service restructure, stabilize MN API
definitions, outline development for new MNs, deploy MN testing service

Block 3
2011-04-24 - 2011-06-18
Progression on authentication, authorization. Progress on CN service
restructure. ITK component development outline and initiation for Morpho,
FUSE, VisTrails, Mercury. Development started on Merritt, AKN, NBII, CUAHSI,
(SAEON) MNs.
Also deploying MN SW stack on TG VMs and TG storage, contingent on a positive allocation response from the TeraGrid allocation committee (expected within days).

Block 4
2011-06-19 - 2011-08-20
Beta version of ITK components available for evaluation. Beta versions of MNs
deployed in testing network. Ongoing CN implementation and tuning. Deploy
additional hardware at CN locations.

Block 5
2011-08-21 - 2011-10-15
Release candidates for DataONE cyberinfrastructure including ITK, MN, and CN
components. Evaluation of release candidates.

Block 6
2011-10-16 - 2011-12-17
Public release of the DataONE infrastructure

3. Rough Estimate of CN Transaction Rates

nCN = # of coordinating nodes
nD = # of data objects
nM = # of science metadata objects
nY = # of system metadata objects
nr = # of replicas of each data object
n0 = total number of objects before synchronization or replication
n1 = total number of objects after synchronization
n2 = total number of objects after replication
D = difference in object count between start and steady state

nY = nM + nD

n0 = nY + nM + nD

n1 = nY*nCN + nM*nCN + n0

n2 = nY + nr * nD + n1

D = n2 - n0

So, if nD = nM = 1, with nCN = 3 and nr = 3 (the values implied by the totals that follow):

nY = 2, n0 = 4, n1 = 13, n2 = 18, D = 14

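If nD = nM = 100,000, D = 1.4e6 (D scales linearly, at 14 objects per data object under the assumptions above). The approximate (actually minimum) transaction rate t, in transactions per second, needed to reach steady state d days after this number of new objects is added is t = D / (86,400 * d):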

d = 1     t = 16.2
d = 7     t = 2.3
d = 30    t = 0.54
d = 365   t = 0.04

nD = 1,000,000

d = 1     t = 162
d = 7     t = 23
d = 30    t = 5.4
d = 365   t = 0.44

nD = 1e9

d = 1     t = 162000
d = 7     t = 23000
d = 30    t = 5400
d = 365   t = 443
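
The estimates above can be reproduced with a short script. This is a sketch under the same assumptions as the notes: nM = nD, plus nCN = 3 and nr = 3, which are not stated explicitly but are the values that yield n1 = 13 and n2 = 18 in the worked example. The function and parameter names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def object_delta(n_data, n_meta=None, n_cn=3, n_replicas=3):
    """Return D, the growth in object count between start and steady state."""
    if n_meta is None:
        n_meta = n_data                              # assume one science metadata object per data object
    n_sys = n_meta + n_data                          # nY = nM + nD
    n0 = n_sys + n_meta + n_data                     # before synchronization or replication
    n1 = n_sys * n_cn + n_meta * n_cn + n0           # after synchronization to the CNs
    n2 = n_sys + n_replicas * n_data + n1            # after replication
    return n2 - n0


def min_transaction_rate(n_data, days, **kwargs):
    """Minimum transactions per second to reach steady state within `days` days."""
    return object_delta(n_data, **kwargs) / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    for n_data in (100_000, 1_000_000, 10**9):
        print(f"nD = {n_data:,}  D = {object_delta(n_data):,}")
        for days in (1, 7, 30, 365):
            rate = min_transaction_rate(n_data, days)
            print(f"  d = {days:<4} t = {rate:.3g}")
```

The printed values are rounded to three significant figures, so they can differ in the last digit from the rounded figures in the tables above.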

4. Any other business
None