.. meta::
   :keywords: provenance, CCIT, infrastructure, resource map

.. sectnum::

Tracking Provenance in DataONE
==============================

:Document Source: http://epad.dataone.org/201301-provenance-package
:Document Status: Initial draft for comment and edit
:About: Scratch pad for developing the mechanism for linking provenance
  information to a data package.
:DataONE Prov WG Notes: (requires auth) https://docs.google.com/document/d/1Nxz0bV7gI3kRbeGjXo3Drs0E2ri-aasujI2Ju3px7gQ/edit

.. contents::
   :depth: 2


Problem
-------

Provenance information, where available, should be associated with the
relevant objects.


Solution
--------

Storage
~~~~~~~

Provenance documents will be stored as objects in the DataONE system in a
manner similar to science metadata and science data objects. Each provenance
document will have a unique identifier (PID) and will be assigned a formatId
of "http://www.w3.org/TR/prov-xml/" (TO BE CONFIRMED).

Like science metadata objects and resource maps, provenance documents will be
harvested and indexed (see Indexing, below) by the coordinating nodes.


Cross Referencing
~~~~~~~~~~~~~~~~~

A provenance document provides information about one or more objects that
participate in a data package described by an OAI-ORE document. References to
provenance information for an object will be recorded in the ORE document that
describes the data package containing the object.

The provenance document is itself an object stored in the DataONE system, so a
reference to it can be created from its PID.

A provenance document will be associated with an object in a data package
using the dcterms:provenance predicate, whose object will be the DataONE REST
endpoint that resolves the provenance document. For example, if the provenance
document has the PID "provenance_id", the REST URL will be
"https://cn.dataone.org/cn/v1/resolve/provenance_id".


Example
.......

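As a small illustration of the cross-referencing rule described above, the
following Python sketch builds the resolve URL for a PID and formats a single
``dcterms:provenance`` statement. It is a sketch under stated assumptions, not
a definitive implementation: the helper names ``resolve_url`` and
``provenance_statement`` are hypothetical, percent-encoding of the PID is
assumed to be required, and using the resolve URL of the data object as the
subject of the statement is also an assumption.

```python
from urllib.parse import quote

# Resolve endpoint taken from the example URL in the text above.
CN_RESOLVE_BASE = "https://cn.dataone.org/cn/v1/resolve/"


def resolve_url(pid: str) -> str:
    """Return the resolve URL for a PID (hypothetical helper).

    Assumption: reserved characters in the PID are percent-encoded.
    """
    return CN_RESOLVE_BASE + quote(pid, safe="")


def provenance_statement(data_pid: str, prov_pid: str) -> str:
    """Format one N3 statement linking a data object to its provenance document.

    Assumption: the data object is identified by its own resolve URL.
    """
    return "<{}> dcterms:provenance <{}> .".format(
        resolve_url(data_pid), resolve_url(prov_pid)
    )


if __name__ == "__main__":
    print(provenance_statement("scidata_id", "provenance_id"))
```

Passing ``safe=""`` to ``quote`` also encodes ``/`` and ``:``, so a PID such as
a DOI would not be split into extra path segments.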
A data package with three components: A science metadata document
*scimeta_id* that describes a science data object *scidata_id*, and a
provenance document, *provenance_id*, associated with the data object.

In N3 notation::

  @prefix cito: .
  @prefix dc: .
  @prefix dcterms: .
  @prefix foaf: .
  @prefix ore: .
  @prefix rdfs1: .

   a ;
      dc:format "text/rdf+n3";
      dcterms:created "2011-08-12T12:57:03Z";
      dcterms:creator ;
      dcterms:identifier "resource_map_id";
      dcterms:modified "2011-08-12T12:57:03Z";
      ore:describes .

   a ;
      dcterms:title "Simple aggregation of science metadata and data";
      ore:aggregates , , .

   foaf:mbox "foresite@googlegroups.com";
      foaf:name "Foresite Toolkit (Python)" .

   rdfs1:isDefinedBy ore:;
      rdfs1:label "Aggregation" .

   rdfs1:isDefinedBy ore:;
      rdfs1:label "ResourceMap" .

   dcterms:description "A reference to a science data object using a DataONE identifier";
      dcterms:identifier "scidata_id";
      dcterms:provenance ;
      cito:isDocumentedBy .

   dcterms:description "A reference to a science metadata document using a DataONE identifier.";
      dcterms:identifier "scimeta_id";
      cito:documents .

   dcterms:description "Provenance information for scidata_id";
      dcterms:identifier "provenance_id".

In RDF XML::

  text/rdf+n3
  2011-08-12T12:57:03Z
  resource_map_id
  2011-08-12T12:57:03Z
  Simple aggregation of science metadata and data
  A reference to a science data object using a DataONE identifier
  scidata_id
  A reference to a science metadata document using a DataONE identifier.
  scimeta_id
  Aggregation
  ResourceMap

As a graph rendering:

.. image:: https://dl.dropbox.com/u/231460/eg_provgraph.png


Indexing
~~~~~~~~

Provenance information will be indexed in a manner similar to science metadata
and resource maps, where the values of selected terms are extracted,
transformed / normalized as necessary, and added to the SOLR index on a
coordinating node.

Some examples of searches that should be supported include:

Return all provenance traces that::

  contain data that was attributed to "Yaxing"    --> field: wasAttributedTo
  include the "reGrid" activity                   --> field: activity
  contain data used by the reGrid activity        --> field: used/activity
  contain data generated by the reGrid activity   --> field: wasGeneratedBy/activity
  contain data about 'climate'                    --> field: entity/D1:about
  contain data attributed to a climate scientist  --> field: entity/prov:role

"Yaxing" and "climate scientist" are Subjects; see
http://mule1.dataone.org/ArchitectureDocs-current/apis/Types.html#Types.Subject

Note that mapping subjects between the values used in tools that generate
provenance documents and the subjects used by DataONE may be a challenge.


Discussion
----------

Some issues to consider:

- Consistent representation of subjects in generated documents, and mapping
  those to DataONE subjects. The same problem is likely to arise for other
  terms that require consistent representation, such as the various
  activities that may be employed. For example, is "reGrid" equivalent to
  "regrid"?

- Richness of query support. Using a flat index reduces the relationships that
  can be expressed. Is it sufficient? Do we need to add something like a
  SPARQL endpoint?

- Efficiency of indexing needs to be considered, especially since it cannot be
  assumed that a provenance document will be available on the CN when a
  resource map that references it is being indexed, and vice versa.

- The format of the provenance trace: should it be W3C PROV-XML?
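
For the first issue above, one possible approach is to normalize terms before
they are indexed or compared. The following Python sketch is only an
illustration: ``normalize_term``, ``to_dataone_subject``, and the lookup table
are hypothetical, the subject string is made up, and whether case-insensitive
matching is actually appropriate for activities and subjects remains an open
question.

```python
import unicodedata

# Hypothetical lookup from normalized tool-side names to DataONE subjects.
# The subject string below is invented for illustration.
SUBJECT_MAP = {
    "yaxing": "CN=Yaxing Example,DC=example,DC=org",
}


def normalize_term(value: str) -> str:
    """Return a comparison key: Unicode-normalized, whitespace-collapsed, case-folded."""
    text = unicodedata.normalize("NFKC", value)
    return " ".join(text.split()).casefold()


def to_dataone_subject(tool_subject: str):
    """Map a subject string from a provenance tool to a DataONE subject.

    Returns None when no mapping is known.
    """
    return SUBJECT_MAP.get(normalize_term(tool_subject))


if __name__ == "__main__":
    # Under this normalization "reGrid" and "regrid" are treated as equivalent.
    print(normalize_term("reGrid") == normalize_term("regrid"))
    print(to_dataone_subject("Yaxing"))
```

A table-driven mapping like this keeps the equivalence decisions explicit, at
the cost of having to maintain the table for every tool that generates
provenance documents.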