2010-11-02
CCIT Breakout
Block 2: Tuesday, 10:15-12:00

MN Progress
Current:
- MetaCat
- Generic Member Node; an i/f to other repositories, used for Dryad and DAAC
Coming:
- Native Dryad
- Fedora
- Google App Engine

Authn & Authz big topic / target for this meeting

MN documentation for DUG
- draft docs on what it means to be a MN; some implementation details to give to potential MN managers / organizations

CN Status: 
to do: A&A (authn & authz) and MN replication

ITK Progress
- Java and Python libraries
- simple client tools in each lang
- tools for infrastructure testing
- Web I/F by Mercury
- Tools: e.g., R
Future foci:
- Data packaging abstraction
- consistency in libraries
- documentation for DUG

Nov 2-4 AHM goals
- DUG prep
- NSF review prep
- MN implementations
- A&A
- Data packaging
- ITK design, components
- CN implementation (incl. technology selections)
- Review WG topics / structure
- Public release process (software release management)
- Security plan

Feb. 2011 DataNet review
5 page exec summary
15 page document
PMP, architecture, etc as appendices
Presentations with strong CI emphasis, demos

Timeline for Feb. 2011 DataNet review preparations
Nov. 4: document TOC
Jan 17: drafts for EAB review
Feb 7: final copies of docs to NSF
Feb 23-24: DataONE presentation and QA

Questions and comments on the NSF document outline:
C. Matt: tasks w/in CI need to be parsed out to other WGs
Q. Line: What does "work this into the dialog" mean? Activities and contribs of WGs will be woven into other sections, not as standalone WG sections
C. Randy: emphasize that DataONE is leveraging existing technologies and projects and not reinventing the wheel in terms of many CI functions
C. Dave: creating this document will serve our project well in addition to serving NSF's need, particularly in advance of the public release
Q. John C: how can the rest of us help? Dave, Matt and Bruce can coordinate volunteers and delegated tasks.

Discussion of the agenda:
Matt: rearrange parallel sessions; move A&A discussion sooner in the agenda so it doesn't get skipped. The revised agenda follows:

=============

Review Document Outline

* CI - Dave, Bruce, Matt (Note: As John notes, CI is an overloaded term that
  means many things to different people. Need to make sure that this section
  covers that entire waterfront so that reviewers will see their definition of
  CI)
  * Design, reference architecture diagrams (Bruce - diagrams, following ESDIS)
  * Usage scenarios that refer back to the architecture diagrams

(We have system diagrams + software diagrams.
Perhaps useful to have something in line with OAIS.)

Link to OAIS reference architecture: http://public.ccsds.org/publications/archive/650x0b1.pdf
By the way, a point that may be worth emphasizing for the review is the extent to which we adhere to the archive and curation micro-service idea: using small, modular services in order to be able to scale further. This is in comparison to a large, all-encompassing single system that risks being rigid in architectural composition and/or scalability.
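
To make the micro-service point concrete for less technical readers, here is a minimal sketch of composing small, independent services instead of one monolithic system. Everything in it is a hypothetical illustration, not DataONE code: the service names (`assign_identifier`, `compute_checksum`), the `example:` identifier prefix, and the record layout are all assumptions.

```python
import hashlib
from typing import Any, Callable, Dict, List

# A "record" is a plain dict handed from one small service to the next.
Record = Dict[str, Any]
Service = Callable[[Record], Record]


def assign_identifier(record: Record) -> Record:
    """Hypothetical identifier service: derives an ID from the content bytes."""
    digest = hashlib.sha1(record["content"]).hexdigest()
    return {**record, "pid": f"example:{digest[:8]}"}


def compute_checksum(record: Record) -> Record:
    """Hypothetical fixity service: attaches a SHA-256 checksum."""
    checksum = hashlib.sha256(record["content"]).hexdigest()
    return {**record, "checksum": checksum}


def run_pipeline(record: Record, services: List[Service]) -> Record:
    """Apply each service in turn; services can be added, removed, or swapped."""
    for service in services:
        record = service(record)
    return record


result = run_pipeline(
    {"content": b"sample data"},
    [assign_identifier, compute_checksum],
)
```

Each service knows nothing about the others, so one could be replaced or scaled separately without touching the rest, which is the contrast with a single all-encompassing system drawn above.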

Also, just for completeness, the DataONE architecture noodled out by the CCIT can be found by drilling down in the repo starting at
https://repository.dataone.org/documents/Projects/cicore/architecture/



  * HW, SW development and implementation timeline (Gantt chart, key deliverables, products)

Randy suggests that the document outline be organized along the functionality provided by the infrastructure, and then the crosscutting services provided by functionality. Talk about the problems we're working on, not the technology we're building.

Here's a restated document outline just before lunch:

3. Functions implemented by DataONE architecture
3.1 curation and preservation
    - (many of the design choices fall under this category)
    - LOCKSS approach
    - Persistent identifier framework
3.2 Search and discovery
    - improved access and discoverability of data / resources
3.4 Privacy and access control on a national/global scale
3.5 Integration with investigator tools
    - Supporting the full data life cycle

4. Collaborations
4.1 DataONE and Data Conservancy
4.2 TeraGrid
4.3 Filtered Push

NSF: here's what the reviewers will look at during the review:
Vision
Milestones
Delivery and timeframes
- science / community engagement
- technical detail
- prototype demo

--> Dave V., Bruce W, and Matt J will develop the full TOC
--> CCIT will review the TOC at a minimum, and is welcome to contribute more

  * Security
  * Technical sustainability
  * Deployment
  * Curation and Preservation
  * Policies, SLAs
  * Metrics, assessment/feedback
  * Collaborations
  * DataONE-Data Conservancy coordination
  * TeraGrid
  * Filtered Push
  * Investigator Toolkit
  * Working Group Activities - work this into the dialog

  * Functionality implemented by the architecture - table form, perhaps in appendix
  Capture the mapping between products, design goals, a
    - CNs
    - MNs
    - ITK
  




Use existing system diagrams and architecture documents. Should also ensure we reference and present DataONE in terms such as those used by OAIS.
--> High level reference architecture diagram
--> Bruce W will do diagrams, following ESDIS 


NSF comments: it's difficult to find knowledgeable reviewers, so care in the phrasing of our project is important to make it easier for less-knowledgeable reviewers to provide a fair review.
NSF Perspective - this document must speak to an audience that may not have the technical expertise that we have.  Half of our review panel may not have expertise, so we need to phrase this document so that it can be read and understood by reviewers who don't have the technical expertise (highly qualified but in different subject areas).

Must hook the reviewers with the 5 pages of the executive summary.


=============

Agenda for 2010-11-02 AHM (CCIT)
================================


:Last-Modified: 2010-11-02


Tuesday, 2010-11-02
-------------------

:Block 1:

* (Vieglais) Plenary session. Overview of CI status and high level plans (about 20 min)


:Block 2: (CCIT + Developers + WGs)  Tuesday 10:30 - noon

* CI Development plans and timeline for next 12+ months

* Main topics for the February review

* Review of agenda for meeting

* Goals: Assignments for the February review; shared understanding of the
  project trajectory and important aspects that need to be addressed in the
  short, medium and long term. To be revisited and adjusted at the end of the
  meeting.


:Block 3: (two streams) Tuesday 1:00 - 3:00

A) MN technical implementations (Fedora Commons, Dryad, Metacat, high
   capacity). 

   Goals: Documentation of issues, next steps for additional MN technologies
   to be integrated.
   Location: Large room
   Moderator: 
   Attendees: Cobb
  http://epad.dataone.org/2010-11-02-CCIT-T3-Member-Node 

B) Data modeling and packaging. Note - some overlap with ITK design

   Outcome: plan for mapping between or working with data packaging in the
   DataONE architecture.
   Location: Large room
   Moderator: Matt Jones
   Attendees: Bruce



:Block 4: (two streams) 45min Tuesday 3:15 - 4:00

A) MN Compute nodes, integration with TeraGrid, ESG.

   Goals: Develop plan for implementing "MN-like" functionality that can
   leverage computing resources such as available through TG, ESG and perhaps
   cloud computing environments.
   Moderator: 
   Attendees: Cobb, Wilson

B) Authentication
  Initial discussion of authentication methodologies
   Moderator: 
   Attendees: 


:BOF: Member Nodes, including Sustainability and Governance WG participants  Tuesday 4:00 - 5:00

   MN Policy, API documentation, participation documentation. Documentation
   required for public web.

   Goals: Outline of user / administrator / implementor oriented
   documentation for Member Nodes. The final document should include
   everything needed to build and deploy a Member Node


Wednesday, 2010-11-03
---------------------

:Block 6: Wednesday 8:00 - 10:00 

A) Investigator Toolkit. Applications to be targeted (esp. for February
   review). Web interface UI design.

   Goals: Define a couple of demonstrations that can be targeted for the Feb
   review; identify applications that should be targeted for the public
   release time frame
   
   Attendees

B) MN - MN replication. Trust, policy, identity of content. 

      Goals: Major issues that need to be addressed for MN-MN replication.
      Design of interactions.
      
      Attendees: 

C) Provenance and Workflows Working Group 

* Update on Data-Tree-of-Life (DToL) Summer Project

  - Includes DToL use case: collaborative provenance (https://sites.google.com/site/datatolproject/)

* Overview  of provenance models/capabilities of workflow systems
  - Lightning talks: 10min each: Kepler, Pegasus, Taverna, Vistrails

Attendees: 

:Block 7: (three streams) Wednesday 10:30 - noon

A) Working with embedded identifiers
Attendees: 

B) Data modeling and packaging continued.
Attendees: 

C) Provenance and Workflows Working Group

* Discuss DataONE-related use cases, starting from DToL/collab. provenance
  (see above)

* How to provenance-enable the DataONE architecture, and why (towards an
  architecture for DataONE + Provenance Infrastructure)
  
  Attendees


:Block 8: (three streams) Wednesday 1:00 - 3:00 

A) Authentication technology implementation - CILogon, InCommon
Attendees: 

B) Project operations and administration. Web site, servers, doc management.
Attendees: 

C) Provenance and Workflows Working Group

* Detailed planning of next WG activities (in view of reporting to the plenary
  session on Thu)

Attendees: 

:Block 9: Wednesday 3:15 - 4:00 

A) Authorization and access control
  
* Technical implementation for scalability
* Access control rule compatibility across systems
* Consistency of role definition

Attendees: 

B) Aspects of Coordinating Node design and implementation

* Technology choices
* Review of services

Attendees: 

C) Provenance and Workflows Working Group

* Detailed planning of next WG activities (in view of reporting to the plenary
  session on Thu)
  
  Attendees: 


:BOF: Hardware budget planning  Wednesday 4:00 - 5:00 

* Storage for high capacity member nodes
* CN computational and storage capacity
* Supporting infrastructure (network, backup, power)


Thursday, 2010-11-04
--------------------

:Block 11: Thursday 8:00 - 10:00 

* Evaluate the existing working group structure, and reconfigure as necessary
  to align with major research topics / unknowns identified during previous
  two days. 

  Outcome: List of "mini" working groups, more tightly focussed than previous
  WG definitions.

(All CI)

:Block 12: Thursday 10:30 - noon

* Develop WG mini-charters from Block 11. Paragraph or so describing what needs
  to be done, deliverables, major milestones, participants. 

  Outcome: set of charters (at least late draft stage) for guiding WG
  activities and composition for next 12-36 months.

(All CI)

:Block 13: Thursday 1:00 - 3:15 

* Revisiting schedule and detailed dev plans, tie in WG charters

* Calendaring for meetings (CCIT, developers, technical WGs)

(All CI)

:Block 14: Thursday 3:30 - 5:00 

* Closing, plenary. Summaries from meeting.


---------- 

Notes past here


* Implementing authentication, authorization, and identity management (two
  blocks), focus on prototype impl.  Focus on authorization implementation aspects.

* Integrating output from Bertram's group (two blocks)

  - reports from each technical working group
  - helping other working groups get started
  - synopsis from joint wg on semantics (Matt)

* Data modeling and packaging - how to map this into the DataONE data model or adapt / change DataONE to work with more granular models
  of data collections

* General update on related topics / projects (perhaps over dinner)

* Investigator Toolkit.  Need to invest time in developing some nicer tools for compelling demonstrations (esp for Feb target)
  - define roadmap for prioritizing apps / tools for integration

* User interface for data discovery, search and delivery
  - semantic search topics 

* (related to delivery above) Integration / working nicely with existing systems.  For example, EML URLs point to some data set - how to take those URLs and make them
  behave as identifiers in the DataONE context

[I moved a big block of text up to about line 100.... - Bob S.]