.. meta::
   :keywords: DataONE, CCIT, 20110316, VTC

DataONE Developer Call - 2011-03-16
===================================

:Attendees: Ryan Scherle, Nick Dexter, Bob Sandusky, Mark Servilla, Rebecca Koskela, Bruce Wilson, Chris Jones, Roger Dahl, Paul Allen, John Cobb, Matt Jones, Giri Palinisamy, Jeff Horsburgh

[3:11 PM] Bruce: in tier 2, is mn_base different from mn_core in Tier 1?
[3:11 PM] Bruce: replication restrictions could be more than format.
[3:12 PM] Matt: bruce: MN_base has been renamed MN_read -- but wasn't updated throughout
[3:15 PM] Paul: The NBII MN will be a proxy for these other data centers?
[3:17 PM] Bruce: Interesting details to sort out, particularly since the ORNL DAAC is also an NBII node.
[3:17 PM] Matt: as is the KNB
[3:17 PM] Bruce: I think that what we talked about first for NBII was focussing on the data that was actually held by NBII, and then deal with how to handle the partner node data.
[3:18 PM] Bruce: @matt -- agreed. Duplication is an interesting problem to make sure we resolve. That's a good thing to sort out, since that's a problem in a lot of different places.
[3:19 PM] John: Right. I think we need a design that works where DataONE is a first order data integrator and where DataONE is a second or higher order integrator and where we may also have the same collection presented to DataONE through multiple aggregation paths.
[3:20 PM] Bruce: Identifiers are key, for the metadata and the data.
[3:20 PM] Bruce: Nothing wrong with a MN having copies of metadata that are relevant to their community of practice.
[3:22 PM] Bruce: 2 down, 78 to go.
[3:22 PM] Matt:
[3:23 PM] Bruce: Right, for the smaller nodes, NBII can be the member node to represent their data.
[3:23 PM] John: I had this same issue with TG. I proposed all interested TG RP sites to become D1-MN's and the TG Forum replied why not start with (or stay at) a single DataONE MN gateway for all of TG. Hmm.
[3:25 PM] Marco: that same pattern would hold for LTER too
[3:26 PM] John: We need to both solve the problem and to persuade potential MN's to have confidence that we have solved the tracking/audit issue
[3:39 PM] John: have a "private" parameter flag option that could be incorporated into the AA APIs
[3:39 PM] John: A provider may choose whether to allow or not allow anonymous access and a user may choose whether to be known or anonymous
[3:42 PM] Jeff Horsburgh joined.
[3:53 PM] Rebecca: @Matt: do you want it on this week's LT agenda?
[3:54 PM] Matt: yeah, that sounds good
[3:55 PM] Matt: I'll try to write it this afternoon and send it off
[4:03 PM] John: An out would be to have an alternate "bulk update" function that could ameliorate large updates or new node ingest
[4:04 PM] Robert: yep
[4:05 PM] Robert: on the other hand, if we change core api features that require upgrade to systemMeta, we need to be able to scale

Agenda and Notes
----------------

1. Different categories of Member Node

   Core: Functionality common to all nodes, including optional methods

   - Old API: MN_health, MN_crud
   - New API: MN_core
   - Methods: ping, getCapabilities, getStatus, [getObjectStatistics], [getOperationStatistics], [getLogRecords]

   Tier 1: Public read, no Authn/Authz

   - Old APIs: MN_crud, MN_replication, MN_health
   - New APIs: MN_core, MN_read
   - Methods:

     - MN_core: ping, getCapabilities, getStatus, [getObjectStatistics], [getOperationStatistics], [getLogRecords]
     - MN_read: get, getSystemMetadata, listObjects, describe, getChecksum, synchronizationFailed

   Tier 2: Read/Resolve with Authn/Authz

   - Old API: MN_authorization, MN_authentication
   - New API: MN_auth
   - Methods: MN_read, MN_core + login(*), logout(*), isAuthorized(*), setAccess(*)

   Tier 3: Write (create, update, delete), possibly limited support for data types

   - Old API: MN_crud
   - New API: MN_storage
   - Methods: MN_auth + create, update, delete

   Tier 4: Limited replication target (specified data types)

   - Old API: MN_replication
   - New API: MN_replication
   - Methods: MN_storage + replicate

   Tier 5: Replication target, any data types

   - Old API:
   - New API:
   - Methods: MN_replication (no additional methods)

   - NBII may be a member node within member nodes - there are many different organizations contributing systems, many with different authn/authz, so there will not necessarily be a clear demarcation of the functionality that may be available across the participant collections. Per John, TG is in a similar situation. Would NBII be used for writing to?
   - Dealing with aggregators is a general problem that needs to be addressed - a simplistic approach is to restrict access to particular collections that are already exposed through other MNs. Important to try to work with existing information and copies of data that are already present in the system (efficiency).
   - The policy of data providers dictates the tier of functionality that can be provided.
   - Start with the simple case - the data sets that are already archived by the NBII.
   - KNB nodes that restrict access to data seem to do so primarily because of the issue of tracking data use. So dealing with the data access logging issue may be effective at eliminating much of the (read) access control restriction in place at various repositories.
   - BEW: On replication -- there can be some restrictions based on data source, data type (format), or data domain. A MN may be unable to accept data for replication outside of its mission area.
   - Privacy concerns about logging information. Many libraries are dumping logs to avoid privacy concerns. If logging is required at Tier 1, then all nodes need to record information identifying users, which could potentially exclude a lot of participants. If privacy is an issue, then support for logging information needs to be optional at all tiers. There are some requirements for truly anonymous access support.

     - Logging is desired by data providers, but data users don't want to be tracked.
     - DataONE is a research project, so it is OK to record logging information for purposes of the project. But forcing a requirement for logging and changing it later would not be trivial.
     - Important to get some involvement from the CE side of the project.
     - Some collections may require access logging, so no anonymous access would be allowed.

   - An interesting tension on this point from a Federal perspective, with OMB pushing for more identification of who is accessing data for what purpose and OSTP pushing for more open access to data.
   - As has previously been discussed, there are also some international issues that may arise, with conflicting laws that require logging in some countries and prohibit it in others. User logging has interesting consequences for the recent EU laws (that got pushed back).
   - It is also complicated by the use of InCommon for authentication: a user doesn't specifically sign up for a DataONE account, so we have less opportunity to present conditions of use that make users aware of what data is logged about their access.
   - Logging as a public feature? LTER: no - log access is restricted to principals that have ownership of, or admin rights to, the object. Dryad does provide summary stats - good feedback. It might be a popular feature to see summary statistics associated with particular objects through the Mercury interface (for example), or summaries of data downloads.
   - Need a flag that indicates whether access to an object must be logged, and similarly for the user - a flag indicating that the user will only access information that is not logged.
   - Matt will write up some notes for distribution.
   - Also need a clear statement of what data is tracked for the public release, so that it's up on the web site and readily available.

2. Schedule for 2011

   The calendar year is divided into six blocks of development activity, each approximately eight weeks long. Each block consists of six weeks of feature development, followed by two weeks of bug squashing, and ends with product releases.

   Block 1 (2011-01-02 - 2011-02-26)
     Preparations and completion of the 18 month project review.

   Block 2 (2011-02-27 - 2011-04-23)
     MN-MN replication, authorization, CN service restructure, stabilize MN API definitions, outline development for new MNs, deploy MN testing service.

   Block 3 (2011-04-24 - 2011-06-18)
     Progress on authentication and authorization. Progress on CN service restructure. ITK component development outline and initiation for Morpho, FUSE, VisTrails, Mercury. Development started on Merritt, AKN, NBII, CUAHSI, (SAEON) MNs. Also deploying the MN software stack on TG VMs and TG storage, contingent on a positive response from the TeraGrid allocation committee (expected within days).

   Block 4 (2011-06-19 - 2011-08-20)
     Beta versions of ITK components available for evaluation. Beta versions of MNs deployed in the testing network. Ongoing CN implementation and tuning. Deploy additional hardware at CN locations.

   Block 5 (2011-08-21 - 2011-10-15)
     Release candidates for the DataONE cyberinfrastructure, including ITK, MN, and CN components. Evaluation of release candidates.

   Block 6 (2011-10-16 - 2011-12-17)
     Public release of the DataONE infrastructure.

3. Rough Estimate of CN Transaction Rates

   Definitions:

   - nCN = number of coordinating nodes
   - nD = number of data objects
   - nM = number of science metadata objects
   - nY = number of system metadata objects
   - nr = number of replicas of each data object
   - n0 = total number of objects before synchronization or replication
   - n1 = total number of objects after synchronization
   - n2 = total number of objects after replication
   - D = difference in object count between start and steady state

   Formulas::

     nY = nM + nD
     n0 = nY + nM + nD
     n1 = nY*nCN + nM*nCN + n0
     n2 = nY + nr*nD + n1
     D  = n2 - n0

   So, taking nCN = 3 and nr = 3: if nD = nM = 1, then n0 = 4, n1 = 13, n2 = 18, and D = 14. If nD = nM = 100,000, then D = 1.4e6.
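
   A minimal Python sketch of this bookkeeping, under the assumptions that reproduce the worked figures above (nCN = 3, nr = 3, and one science metadata object per data object); the rates in the table that follows correspond to spreading D over d days, i.e. t = D / (d * 86400) transactions per second.

   .. code-block:: python

      # Sketch of the CN object-count and transaction-rate estimate above.
      # Assumed (not fixed in the notes): nCN = 3 coordinating nodes and
      # nr = 3 replicas per data object, which reproduce the worked example.

      def steady_state_delta(nD, nM, nCN=3, nr=3):
          """Return (n0, n1, n2, D) for the given object counts."""
          nY = nM + nD                   # one system metadata object per object
          n0 = nY + nM + nD              # before synchronization or replication
          n1 = nY * nCN + nM * nCN + n0  # after synchronization to the CNs
          n2 = nY + nr * nD + n1         # after replication of the data objects
          return n0, n1, n2, n2 - n0

      def min_transaction_rate(D, days):
          """Minimum transactions per second to absorb D new objects in `days` days."""
          return D / (days * 86400.0)

      # Worked example: nD = nM = 1 gives (4, 13, 18, 14).
      print(steady_state_delta(1, 1))

      # nD = nM = 100,000 gives D = 1.4e6; approximately reproduces the rate table below.
      D = steady_state_delta(100000, 100000)[3]
      for d in (1, 7, 30, 365):
          print(d, round(min_transaction_rate(D, d), 2))
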
   The approximate (actually minimum) transaction rate t, in transactions per second (t = D / (d * 86400 s)), needed to reach steady state after d days for this number of new objects:

   nD = 100,000 (D = 1.4e6):

   - d = 1:   t = 16.2
   - d = 7:   t = 2.3
   - d = 30:  t = 0.54
   - d = 365: t = 0.04

   nD = 1,000,000 (D = 1.4e7):

   - d = 1:   t = 162
   - d = 7:   t = 23
   - d = 30:  t = 5.4
   - d = 365: t = 0.44

   nD = 1e9 (D = 1.4e10):

   - d = 1:   t = 162000
   - d = 7:   t = 23000
   - d = 30:  t = 5400
   - d = 365: t = 443

4. Any other business

   None.