NOTE: These notes have been captured to subversion at: https://repository.dataone.org/documents/Committees/CCIT/20110719_CCIT_SantaBarbara/raw_notes_from_etherpad.pdf
Further edits can be made to this document but please notify Dave of any significant changes.

CCIT Face to Face Meeting Notes
NCEAS, Santa Barbara

July 19, 2011: Session 1 (8:30)

Attendees: Amber B., Roger D., Mark S., Dave V., Ryan S., John K., Jeff H., Bob S., Giri P., Chris J., Bruce W., Paul A., Robert W., Matt J., Ben L., Ryan K., Nick D.

See Dave's presentation (url: ????)

Goals for 2011
Afternoon Session
---------------------------

ReserveIdentifier:
- two separate methods
  - generateIdentifier(scheme, [fragment]) returns doi ark
    scheme = type of identifier "DOI" or "ARK"
  - reserveIdentifier(string) returns string or fail
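A minimal sketch of how the two methods above could behave. The DOI/ARK prefix values and the in-memory reservation store are illustrative assumptions for the example, not the DataONE implementation:

```python
import uuid

def generateIdentifier(scheme, fragment=None):
    """Mint a new identifier in the given scheme ("DOI" or "ARK").

    The prefix values below are placeholders, not real DataONE prefixes.
    """
    suffix = fragment or uuid.uuid4().hex[:8]
    if scheme == "DOI":
        return "doi:10.5072/" + suffix
    elif scheme == "ARK":
        return "ark:/99999/" + suffix
    raise ValueError("unknown scheme: " + scheme)

# Stand-in for the CN's reservation store.
_reserved = set()

def reserveIdentifier(identifier):
    """Reserve a caller-supplied identifier string; fail if already taken."""
    if identifier in _reserved:
        raise ValueError("identifier already reserved: " + identifier)
    _reserved.add(identifier)
    return identifier
```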

July 20, 2011 Discussions
How can we ensure that running member nodes can continue to run, allowing the staged movement of CNs and MNs forward to new versions?  Assume that there is a staging environment, where the new version of the CN is deployed and staging is then flipped to become the production environment. Consider a case of significant changes to system metadata.  How would that process move forward?  A likely scenario is that we may have to support 1.0 and 2.0 clients at the same time (months).  Might be able to take 1.0 out of production when we release 3.0.  

If new field is added in system metadata, would have to add the field in the database, then modify the parsers.  

Discussion:  should store only the latest version of a given type.  The version 1 of a call has to be able to read the new version of the object and translate to the old version (get operation from a client).  For a client put, the v1 method has to accept the old version and transform to the new version to write this to the database.  Need to ensure that the serialization of a given type includes the version of that type.  This is common for most XML serializations, in the schema definition version.  
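The translation pattern described above could be sketched as follows. The field names, the "serialVersion" default, and the schemaVersion tag are hypothetical; the point is that the v1 handler converts to and from the single stored (latest) version:

```python
def v2_to_v1(sysmeta_v2):
    """Down-convert a stored v2 record for a v1 client GET,
    dropping fields the v1 schema does not know about."""
    v1_fields = {"identifier", "size", "checksum"}  # hypothetical v1 field set
    return {k: v for k, v in sysmeta_v2.items() if k in v1_fields}

def v1_to_v2(sysmeta_v1):
    """Up-convert a v1 client PUT to the v2 storage form,
    supplying defaults for fields v1 clients cannot send."""
    record = dict(sysmeta_v1)
    record.setdefault("serialVersion", 1)  # hypothetical new v2 field
    record["schemaVersion"] = "v2"         # serialization carries its version
    return record
```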

??? Can the member node stacks be auto-updated, in the same way that Chrome autoupdates?  Could be done for client tools, as a maintenance method.  For services like member nodes, this could be a problem.  

??? Does this change if we look at changes in a more incremental, continuous beta fashion, rather than as sets of relatively rare changes?  

Issue: how does the client signal which version of a method it's calling?  Is this in the URL or in the arguments that it sends to that REST URL?  One way is to change the REST URL when the method signature changes.  The other option is to put it in the header as a required attribute, and the dispatcher has to figure out what version is in the header and call the appropriate method.  Could put it in the message request header, but that makes it difficult to use a web browser, because a browser wouldn't put that into the message header.  Could also make it more difficult for the dispatcher, as the processor has to actually read part of the message to determine where the message should go.  Putting it in the URL also makes it possible to grep for what's happening.  
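The URL-based option could look roughly like this in a dispatcher. The route table and handler names are invented for illustration; the URL pattern follows the `/mn/v1/meta/{pid}` form listed in the options below:

```python
import re

# Maps (version, operation) to a handler; handler names are invented.
ROUTES = {
    ("v1", "meta"): lambda pid: "v1 meta for " + pid,
    ("v2", "meta"): lambda pid: "v2 meta for " + pid,
}

# Matches paths of the form /mn/v1/meta/{pid}
URL_PATTERN = re.compile(r"^/mn/(v\d+)/(\w+)/(.+)$")

def dispatch(path):
    """Route a request like /mn/v1/meta/{pid} to the right versioned handler."""
    m = URL_PATTERN.match(path)
    if m is None:
        raise ValueError("unrecognized path: " + path)
    version, operation, pid = m.groups()
    handler = ROUTES.get((version, operation))
    if handler is None:
        raise ValueError("unsupported: %s %s" % (version, operation))
    return handler(pid)
```

Note how the dispatcher never has to read the message body; the version and operation are recoverable from the URL alone (and greppable in server logs).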

Question: Do all versions of the methods move together?  Would we put the version number in all of the method signatures?  If we did a service pack change and created a new method, for example, would we then change all other method signatures?  CLO does the versioning at the REST URL level.  Changes happen infrequently.  Can add new stuff to an existing level, but semantic changes are a major release and all method signatures update.  

If we add a method to a tier, then either that tier has to be versioned, or we have to add a new tier to the system.  If we add, for example, a getpackage method and it becomes allowed in Tier 1, is this now a new tier, or are we creating a new version of that tier?  Similar issue with things like data subsetting operations. 

Paul: perspective from the experience of CLO is that making it easy for people to tell what version they're on, and what needs to be done to move to the next version, helps with engagement in groups and simplifies the member node developers' jobs.  

Rob: Do XSD schemas support optional elements, such as <version>?  It may allow us to avoid schema validation problems and the need for the Java code base to support multiple schema versions and Jibx generated datatype classes.
see: "W3C XML Schema Design Patterns: Dealing With Change"  - http://msdn.microsoft.com/en-us/library/aa468563.aspx

MN version signalling:
A. URL encoding of version:
/mn/meta/{pid}

This is the one:  /mn/0.6.2/meta/{pid}

/mn/v1/meta/{pid} -- Similar to CLO method, all versioned together
/mn/meta/v1/{pid} -- could conflict with {pid}.  Makes it more confusing to read.
/mn/meta/{pid}?v=1

B. Node registry information

C. HTTP request headers - version information must appear in request / response
Problem - client must send version info in the request

D. Message namespaces
- requires reading messages to figure out what it is

Versions in URLs indicate consistency in the interface definitions. Any change to the interface requires an update to the version tag in the URL. The version tag in the URL should be simple (i.e. not tied to software version), like "v1".
Proposed granularity of the Interface version is only major numbers, v1, v2, v3, v4, etc.

from Dave, the dynamics:
change in API          ==== requires ====>   change in code  (client/cn/mn impl, integration tests)
change in schema    ==== requires ====>   change in code
change in schema    ==== requires ====>   change in API
YET, a change in API does not require change in datatypes

so, datatypes within d1_common_(java) correspond to schema version
For example:
==========
    starting at:
        api version: v3
        schema version: 0.6.1
        code release: x.0.0

1) release of D1_SCHEMA_0_6_2 results in new API version and code version, and new datatypes
    api version: v4 for example
    new code version: x.1.0
    and new package in d1_common_java:  org.dataone.service.types.0_6_2

2) Further code implementation to existing schemas and service API
    new code version x.1.1 (is x.2.0 possible in this case?)

3) Following that: release of D1_SCHEMA_0_6_3 triggers
    new api version: v5
    new code version x.2.0
    and new package in d1_common_java: org.dataone.service.types.0_6_3
    
4) Following that API update:
    new api version: v6
    new code version x.3.0
    
** because api versions will represent the "dataone" version, we should consider using "major.minor" format
or else we will be at a high version number very quickly.


Example d1_common structure for the v1 release of d1_common_1.0.0:
    org.dataone.service.cn.v1.CNAuthorization
    org.dataone.mn.tier1.v1.MNCore
    org.dataone.mn.tier1.v1.MNRead
    org.dataone.mn.tier2.v1.MNAuthorization
    org.dataone.service.types.v1.SystemMetadata

Example d1_common structure for the v2 release of d1_common_2.0.0
    org.dataone.service.cn.v1.CNAuthorization
    org.dataone.mn.tier1.v1.MNCore
    org.dataone.mn.tier1.v1.MNRead
    org.dataone.mn.tier2.v1.MNAuthorization
    org.dataone.service.types.v1.SystemMetadata
    org.dataone.mn.tier1.v2.MNCore
    org.dataone.mn.tier1.v2.MNRead
    org.dataone.mn.tier5.v2.MNSubset
    The problem with multiple api versions within the d1_common package is that
    the client will have to implement all of these methods again, and the client will
    be talking many versions.  Is there a need for a client to switch which version 
    of the API it's talking?  (CN-as-client, MN-as-client?) or can the user include 
    multiple libclient jars in the pom instead?

Actions:
0. Multiple versions MUST be supported

1. Change REST urls to include infrastructure version 
- versioning at the Tier level. All REST URLs for a tier have the same version tag

2. Base URL of CNs and MNs MUST return node registry information for the node
- cn.dataone.org -> Human interface
- cn.dataone.org/cn -> node registry doc

3. Define a process for interface versioning independent of the software tags

4. Acceptable deprecation period for member nodes = at least one year

5. Support multiple versions of types - need to name packages appropriately with version information

6. Review other package / apps  for versioning strategies. e.g. Oracle, TLS, Google Maps, p2p networks? Amazon web service
http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/index.html?APIVersioning.html
http://stackoverflow.com/questions/389169/best-practices-for-api-versioning
    - good point.  REST implies URL persistence, so we definitely need to support an unversioned URL, especially for gets on unversioned objects 
http://blog.apigee.com/detail/api_restful_design_question_verisioning_number_in_the_url/



Collating contact information for MN administrators and other stakeholders.
- Need a list of contacts to notify for:
  - version changes / system updates
  - system outages
  - other infrastructure related notifications

Contains:
- all member nodes: technical contact, administrative contact
- all "official" ITK components

DataONE needs one person who is responsible for notifications
primary contact as well as backup contact,  user lead ("power user" perhaps)

Actions: 1. review MN registration doc (http://bit.ly/oDt9mz)
2. Setup mailing list with search capability
3. Identify someone responsible for sending notifications.

Public Release Design
Component Interface appearances
Use agreements and licenses

DataONE Terms & Conditions
- Review by CCIT - comments to Suzie within a couple of weeks
  - Bruce, Bob Sandusky
  In Google Docs https://docs.google.com/document/d/1TeVPzhsP53W-FwzFm3hl2J7EC5xjiNiSErd33Pyptno/edit?hl=en_US  Send a note to Bruce Wilson (refbruce@gmail.com, bruce.wilson@utk.edu, or wilsonbe@ornl.gov) to get access.  The intent is that Bruce and Bob will direct questions to CI and CE folks as appropriate.

- How to present use agreements?
  - No requirement to click through an agreement
  - Present links to use agreements where available (e.g. in mercury interface)
  - Goal is to provide information to users for "proper" use of the information - attribution, citation, redistribution, ... not as a legally binding agreement.

  - Member node agreements can contain a link to the Member Node data use agreement (presumably not conflicting with the DataONE operation)
  - Generally the best location for data use agreement / guidelines is within the science metadata for the dataset

Actions:
- generate a landing page that presents the use agreements
- Need a brief, general agreement for DataONE's perspective
    -- Should say something like: "DataONE expects adherence to scientific principles on ethical data sharing, redistribution, and attribution.  In addition, DataONE is a federation of data provider organizations, each with their own usage policies and procedures, and we expect users of data and information gathered from DataONE to respect the usage, redistribution, and attribution policies of the individual Member Nodes and contributors.  It is incumbent upon data users to find the usage agreement information that is pertinent for all data downloaded. We provide the following links to the policies of individual Member Nodes to facilitate this process." 
- Need links to the various MN specific links
- Need pointers to indicate that science metadata may contain more specific restrictions on use policies
- Who can do this?

What will be the face of DataONE at public release?

- e.g. using the GBIF site as a guiding structure. But that site is confusing to folks that want the data right away.

Front page:
- Need data search
- Other ways to access DataONE resources 
- Links to how to participate
- Links to other aspects of the project
- Feedback - use a plugin feedback tool (commercial)

What are we calling the main page for the web interface to the DataONE CI?

Actions:
- Change the L&F of the docs.dataone.org site
- The header / menu bar (top section of page) should be designed such that it can be reused across multiple sites - drupal, Mercury, CILogon branded
- Design for the web site L&F elements needs to be at least in draft form for UI changes to Mercury
- Aim for consistency of presentation across all web interfaces for DataONE, perhaps with less detailed menus on search pages
- Giri will lead the web ui piece (search interface elements and design)
- Need a list of all Science Metadata formats 
- For each format we need a transform to HTML


Implementation note: would it be prudent to add a visibility field to ObjectList (or ObjectInfo) to support marking individual results returned from search as "private" for example?

====
Searchable Elements

SearchMetadata.author
(String)
- Desirable if we can get "sort friendly" results -> recommend LAST, FIRST name
- Mercury UI

SearchMetadata.keyWord
(String)
- Mercury enhances the kw list with e.g. gcmd keyword list
- Mercury UI


SearchMetadata.keyConcept
- Controlled lists of terms available for several metadata standards
- A topic being addressed by the semantics WG
- Unlikely to be available in 2011
- Giri has some keyword enhancement and mapping script (magic)
- Mercury UI

SearchMetadata.spatialFeature
(SpatialFeature)
- Bounding box with contains or overlaps 
- can be supported by SOLR and Metacat searches
- Mercury UI

SearchMetadata.namedLocation
(String)
- free text search field
- provide recommendations for representation (see Wieczorek's georeferencing document)
- Mercury UI

SearchMetadata.temporalCoverageStart
(DateTime)
- temporal extent
- "jurassic" vs "date collected for jurassic specimen"
- not publication date, not creation date
- Applicability date (as opposed to collection date or publication date -- though searching on collection date is a secondary search).  Multiple dates possible:  Collection/observation date, coverage date, analysis date, publication date, metadata modification date (e.g. peat bog samples relevant to 50,000 - 10,000 BCE, collected in July 1980, re-analyzed in July 2008, published in January 2010, and metadata revised in June 2010).  In this example, earliestDate applies to 50,000 BCE.  
- Mercury UI

SearchMetadata.temporalCoverageEnd
(DateTime)
- same notes as above apply
- Mercury UI

SearchMetadata.any
(String)
- Mercury UI

Desirable Elements

SearchMetadata.title
(string)
- Mercury UI

SearchMetadata.objectFormat
(String)
- search the system metadata value, but need to parse system metadata of related objects to determine the objectFormat(s) of the data.
- Potential for mapping the values appearing in the science metadata to the controlled terms available in the object format registry
- label should be "content type"

SearchMetadata.variableName
(String)
- not in Dryad, optional in most/all
- Mercury UI

SearchMetadata.dataDomain
(String)

SearchMetadata.scientificName
(String)
- lots of opportunities for cleanup services from ITIS, EOL, etc
- Mercury UI

SearchMetadata.publication
(String)
- hard to define how to use this
- concept of the publication associated with generation of the data

Some Others

SearchMetadata.submitter
(String) 
- for either data or metadata
- Principal that added the content to the MN
- Needs to be translated from the subject to the individual's name

SearchMetadata.relatedObject
(String)
- given a PID find everything that refers to it (deprecated, describes, derivedFrom...)
- This should probably be specific to relation type - find all stuff derivedFrom PID

SearchMetadata.quality
(String, controlled vocabulary)

SearchMetadata.relatedOrganizations
(String)
- low priority

SearchMetadata.size
(Integer, long)
- needed for drive, display, not necessary for UI search

SearchMetadata.replicaCount
(Integer)

SearchMetadata.replicaLocation
(String)

SearchMetadata.dimensions
(Integer or perhaps float?)

SearchMetadata.measurementUnits
(String)

SearchMetadata.identifier
(Types.IdentifierType)
- Need to be able to find metadata records describing a dataset identified by PID
- Find any science metadata that is or describes PID
- Mercury UI

SearchMetadata.datePublished
- drawn from science metadata
- is this the same as publication date concept in datacite - Yes
- Mercury UI

SearchMetadata.dateAddedToRepository
- first submission to Member Node (drawn from science metadata if available, otherwise system metadata)

SearchMetadata.dateSysMetadataModified
(DateTime)

SearchMetadata.readPrincipal
- element in permission index, used for shard query

SearchMetadata.writePrincipal
(Types.PrincipalType)
- element in permission index, used for shard query

July 21, 2011 joint meeting with Semantics Working Group and CCIT

Damian -- asked about Git.  Brief discussion of the distributed storage working group.  

Deborah -- rationalization of metadata?  Key area of connection with the semantics working group.  

Comment that the R demo is similar to examples with Maven.  

Some thoughts on areas where need for input from Semantics WG:
What is the definition of data?  Do we have this well defined in the architecture documentation? This would be a good element for what's in the reference architecture that still needs to be done.  

The semantics and interoperability working group is keeping notes at
https://docs.google.com/document/d/1Vv6ekKh91oXtBWlRbxEjev2UurJuNnh1ygdqaPsTaRg/edit?hl=en_US

i (deborah) am also about to share this etherpad link with that page.

Sun use case (persona):  Several questions.  What is colocated data?  What is colocated in the context of this problem (tortoise food web).  Where is there some definition of tortoise food web.  What met data is relevant (would have to take into account prevailing met patterns)?   

Note the fundamental difference between semantics that can be captured at the point of data generation and semantics to be added to existing metadata.  Begs for tools and best practices for semantics and markup at the point of data collection/generation. 

Working Group Coordination:  AHM and charters are a key issue.  AHM meeting October 18-20, in Albuquerque.  Will start Tuesday AM; finish late Thursday.  About 45 minutes from ABQ.  LT and CCIT should also have some role in cross-WG communication.  WG's should feel free to communicate (telecon, webinar) for subgroups.  

Mark Shildhauer -- Semantic Schmear or Semantic juju :-).  for working on the keyword list.  

The Questions from the I&S Working Group
•Is the keyword list still an issue that CCIT want the WG to address?  Where is CCIT with this?
•Will D1 recommend a particular metadata schema for member nodes to use when they want to contribute their data to D1?
•Are Mercury and the metadata model a constraint or an early implementation of a low-hanging fruit?
•How much of this group is analysis and recommendation and how much should we actually be doing work?
•The degree to which D1 resources are accessible as Web resources from the REST interface and/or via wget?


The keyword problem
- ability to move from arbitrary list to an automated mechanism to annotate metadata (augment the search index) with key concepts may open access to a bunch of useful technologies

- context is key to leveraging the power of what's available in the semantics of science metadata
- context = "additional associated metadata"
- how to bootstrap semantic annotation of datasets beyond what is available in the science metadata
- often need to open the data file to discover additional information about the data set (variables, content range, ...)
- need to move from "Tair" -> "air temperature measured x m above ground using y instrument"
- There is reasonable structure in the metadata documents, though many of the element values are uncontrolled (e.g. Tair, air_temp, temperature, ...)
- metadata = EML, iso19115, dryad application profile, FGDC bio profile
- The process of extracting the metadata elements that are indexed for search is one place where we could enhance what we are doing now
- need to document use cases for discovery and use these to help drive the process of the Semantics WG
- There are existing use cases - need to capture these in one location

EVA group has two "science scenarios" being worked through:
a) the bird migration simulations. Lots of semantics issues wrt integration
b) climate change.  Existing carbon models (input to IPCC) -> comparisons of predictions of the outputs to each other and to real observations to evaluate prediction of past events
- sonet use cases, personas, ...

Need a trajectory for getting semantic technologies to work across the system - CCIT can help to get it into the infrastructure

Browse hierarchy and faceted search terms that can be used in search interface
extractKeyConcepts(metadata):
  - return a list of key concepts given a science metadata document
  1. identify context
  2. identify relevant resources (ontologies, ...) given context
  3. extract keywords from appropriate location in metadata document
  4. for each keyword
     a) match keyword to key concept drawn from ontology / thesaurus / controlled source
  5. return list of concepts
  
An alternative approach is to say, given this list of key concepts, which are relevant to this science metadata document?  
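The numbered steps above could be sketched roughly as follows. The context detection is omitted and the term-to-concept mapping table is a made-up stand-in for a real ontology / thesaurus lookup:

```python
# Toy term -> concept mapping standing in for an ontology/thesaurus lookup.
CONCEPT_MAP = {
    "tair": "air temperature",
    "air_temp": "air temperature",
    "precip": "precipitation",
}

def extract_keywords(metadata):
    """Stub for step 3: pull keywords from the science metadata document."""
    return metadata.get("keywords", [])

def extractKeyConcepts(metadata):
    """Return a list of key concepts given a science metadata document."""
    concepts = []
    for keyword in extract_keywords(metadata):
        # Step 4a: match keyword to a concept from the controlled source.
        concept = CONCEPT_MAP.get(keyword.lower())
        if concept and concept not in concepts:
            concepts.append(concept)
    return concepts
```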
  
Need concise descriptions of what is required to implement key capabilities - then the CCIT team can allocate resources to get it done.


Possible Steps:
Uncontrolled keywords (from metadata records) -> map to controlled keywords -> expand keywords using ontology lookup
 
Uncontrolled keywords: generic keywords, theme keywords etc.. from the Science metadata records
 
Controlled keywords: using CF variables list, GCMD hierarchical keywords, NBII Thesaurus etc..
 
Ontology lookup using : SWEET, OBOE, DOLCE lite, ESG (model data..) Etc..

Ontology portals: BioPortal, HIVE...
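A toy sketch of the pipeline above (uncontrolled -> controlled -> ontology-expanded). The mapping tables are invented examples standing in for real CF/GCMD lists and SWEET-style ontology lookups:

```python
# Invented controlled-vocabulary mapping (stand-in for CF / GCMD / NBII lists).
CONTROLLED = {"tair": "air temperature", "rain": "precipitation"}

# Invented ontology: concept -> broader/related concepts (stand-in for SWEET etc.).
ONTOLOGY = {
    "air temperature": ["temperature", "atmospheric science"],
    "precipitation": ["hydrology"],
}

def expand(uncontrolled_keywords):
    """Map uncontrolled keywords to controlled concepts, then expand
    each concept with its broader/related terms from the ontology."""
    result = set()
    for kw in uncontrolled_keywords:
        concept = CONTROLLED.get(kw.lower())
        if concept:
            result.add(concept)
            result.update(ONTOLOGY.get(concept, []))
    return result
```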

--

For ranking objects found in keyword searches, we could use a system like Google’s PageRank. 

We know about object relationships:

- obsoletes, obsoletedBy -> object
- describes, describedBy -> object
- created by -> subject
- accessible by -> subject
- originating member node

Objects and subjects can be followed recursively to discover their objects and subjects. The discovered information can be used for adjusting the ranking of a given object for a given keyword. For instance, if an object has the keyword “water” in its science metadata, and is shared with subjects which have themselves created objects with the keyword “water”, the initial object would be given higher ranking for that keyword.
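A toy illustration of that ranking adjustment (the object graph, the one-hop traversal, and the scoring weights are invented for the example; a real system would follow the relationships recursively with damping, PageRank-style):

```python
def rank_boost(obj, keyword, graph, metadata):
    """Score an object for a keyword, boosted when related objects
    (reached via obsoletes/describes/creator links) share that keyword.

    graph: object -> list of related objects
    metadata: object -> set of keywords from its science metadata
    The base score and 0.5 boost per related match are arbitrary weights.
    """
    score = 1.0 if keyword in metadata.get(obj, set()) else 0.0
    for related in graph.get(obj, []):
        if keyword in metadata.get(related, set()):
            score += 0.5  # arbitrary boost per related match
    return score
```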

Idea: Putting additional metadata into System Metadata that allows subjects to point to objects that were of interest to them when they were doing research related to “this” object (to further help with keyword searches).

CN still needs Logging aggregation added.

OAI-ORE needs implementation on CNs, and /assertRelation service endpoint needs development

Debian packages need an update mechanism; configuration mechanism

Create a debian package for both Metacat and the GMN to install on a MemberNode instance

 Later, we should distribute generic MemberNode VirtualMachine image for KVM and VMWare

---------


V1r1 = OAI-ORE add, sys meta changes, version support in package names, functionality updated to work with sys meta changes

vacation time:
Chris:  Week of August 2
Robert:  August 4-9
Rob:  August 3-12
Roger:  October
Matt:  July 25 week + 1 week
Nick D: week of August 15