DataONE Provenance Working Group Meeting, June 7-8
Genome Center (GBSF Building), University of California, Davis

Meeting Overview and Venue: 
DRAFT AGENDA
TUESDAY AM (ProvWG Session I)
TUESDAY PM (ProvWG Session II)
WEDNESDAY AM (ProvWG Session III & Summer Project I)
WEDNESDAY PM (Summer Project II)
NOTES 

TUESDAY AM (ProvWG Session I)
Participants:
Bertram overview of DataONE ProvWG (slides, charter)
Questions/Comments:
Terminology discussions (Paolo)
Provenance Uses, Scoping
- Audit, reproduce/repeat, debug
- reuse, reproduce
- rerun with modification (data, parameter, WF, platform)
- project support
- archival, preservation, discovery
- fitness for use
- Is Y ok for B given that X <-- T <-- Y
- Improved understanding (optimization potential, extracting parallelism)
- Verification, Validation
- Cost & Performance Analysis
- Model Comparison ("method swap" and analysis)


Scope of D-OPM:
(1) PIL (Provenance Interchange Language)
(2) Data Structure
        (a) nested list
        (b) access path
(3) Process Structure [WF land: WF, actor, channels, ports]
        (a) VisTrails allows WF evolution
        (b) Pegasus allows WF execution plan (the executable WF may be different from the abstract WF and it may not be possible to map back[?])
(4) Context
        (a) who
        (b) when
        (c) where

Streaming Data: any special use case?

A WF could be run multiple times
(a) with different parameters
(b) with different input data items
(c) with modifications to an actor or to the WF design

Run vs Invocation:
- Run is linked with an agent
- Invocation is not

Data Binding could be at various stages:
(a) Data Type (conceptual): netCDF
(b) Logical: N101576
(c) Physical: //../../N101576

Pegasus:
1. WF
2. Sub WF (unit of planning) 
3. Job (unit of scheduling)
4. Tasks

Pegasus captures the trace of the execution to develop the execution plan. [Abstract WF -- (reduce)--> WF1 -- (add StageIn)--> WF2].

The following notions may be needed:
(1) isInstanceOf
(2) isReplacementOf
        (i) isReductionOf
        (ii) addedTo

There was an idea that this trace could be captured using the VisTrails evolution model. There was another thought that OPM itself could capture the WF definition/evolution (?).

06/07/2011

Do we need to capture “port”?
 
WF: 
[A:x] --- (data) --- [u:B]
[A:y] --- (data) --- [v:B]
 
WF Run:
[a1:x] --- (d1) --- [u:b1]
[a1:y] --- (d2) --- [v:b1]

Approaches:
(1)   use OPM and  add the port information as “role” (e.g. “role: port x”)
(2)   use “port” as the first level entity
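
A minimal Python sketch of the two approaches, applied to the WF Run above. The class names (Used, WasGeneratedBy, Port, PortEvent) and the "in"/"out" direction field are assumptions made for illustration; they are not a D-OPM definition agreed at the meeting.

```python
from dataclasses import dataclass


# Approach (1): plain OPM-style edges, with the port carried in the role string.
@dataclass(frozen=True)
class Used:
    process: str
    artifact: str
    role: str


@dataclass(frozen=True)
class WasGeneratedBy:
    artifact: str
    process: str
    role: str


# Approach (2): the port is an entity of its own (hypothetical structure).
@dataclass(frozen=True)
class Port:
    actor: str
    name: str
    direction: str  # "in" or "out"


@dataclass(frozen=True)
class PortEvent:
    invocation: str
    port: Port
    data: str


# The WF Run from the notes: a1 writes d1 on x and d2 on y; b1 reads them on u and v.
role_based = [
    WasGeneratedBy("d1", "a1", "port x"),
    Used("b1", "d1", "port u"),
    WasGeneratedBy("d2", "a1", "port y"),
    Used("b1", "d2", "port v"),
]

port_based = [
    PortEvent("a1", Port("A", "x", "out"), "d1"),
    PortEvent("b1", Port("B", "u", "in"), "d1"),
    PortEvent("a1", Port("A", "y", "out"), "d2"),
    PortEvent("b1", Port("B", "v", "in"), "d2"),
]


def triples_from_roles(edges):
    # Recover (invocation, port name, data) by parsing the role string.
    return {(e.process, e.role.split()[-1], e.artifact) for e in edges}


def triples_from_ports(events):
    # The same triples, read directly from the port entity.
    return {(e.invocation, e.port.name, e.data) for e in events}
```

Both encodings carry the same (invocation, port, data) triples for this run. In approach (1) the port direction is only implied by the edge type, while in approach (2) it can be stated on the port itself.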
 
We need to understand whether a port is an “input” or an “output” port.
 
Do we need to capture “channel”?
 
Channels are used for
(1)   control flow
(2)   data flow
 
Through a data store (BPMN: gateway): [A:x] ---> (data store) ---> [u:B]
Data pass through: [A:y] ---(data)--->[v:B]
Control pass: [A:y] ------>[v:B]
 
In D-OPM we would call this “Data Connector (DC)”.
 
An output port could be connected to one or more DCs and one or more DCs could be connected to an input port.
 
In D-OPM we would use the following relations:
Send: (Actor/Process Invocation, Port, Data Connector, Data)
Receive: (Actor/Process Invocation, Port, Data Connector, Data)
 
sent ~ gen_by
received ~ used
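
A rough Python sketch of how the 4-ary Send/Receive relations could be projected onto OPM edges, following the correspondence sent ~ gen_by and received ~ used. The tuple field names, the connector name "dc1" and the function to_opm are hypothetical; the projection simply drops the port and the Data Connector.

```python
from typing import NamedTuple


class Send(NamedTuple):
    invocation: str
    port: str
    connector: str
    data: str


class Receive(NamedTuple):
    invocation: str
    port: str
    connector: str
    data: str


def to_opm(events):
    """Project Send/Receive tuples onto OPM-style edges (port and DC are dropped)."""
    edges = []
    for ev in events:
        if isinstance(ev, Send):
            # sent ~ gen_by: the data was generated by the sending invocation
            edges.append(("wasGeneratedBy", ev.data, ev.invocation))
        elif isinstance(ev, Receive):
            # received ~ used: the receiving invocation used the data
            edges.append(("used", ev.invocation, ev.data))
    return edges


# a1 sends d1 from port x over connector dc1; b1 receives it on port u.
trace = [
    Send("a1", "x", "dc1", "d1"),
    Receive("b1", "u", "dc1", "d1"),
]
opm_edges = to_opm(trace)
```

The projection is lossy in one direction only: every Send/Receive tuple yields an OPM edge, but the port and Data Connector cannot be recovered from the OPM edges alone.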
 
We need to clearly state “inside” and “outside” languages. 
 
Data Collection:
 
A generic vocabulary is needed to understand and capture the provenance of a collection: the operations used with a collection are “insert”, “delete” and “select”. In D-OPM we need to capture finer-grained dependencies for a collection. We also need backward compatibility with OPM.
 
One way: we use OPM to capture the dependency on the collection and then introduce another relation (wasDerivedFrom) to map the respective elements of the collection. Paolo showed an example.
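
A small Python sketch of this idea for a “select” operation. It is not the example Paolo showed (which is not recorded in these notes). The collection names C1 and C2, the process name select1 and the function traced_select are hypothetical.

```python
def traced_select(items, keep, src="C1", dst="C2", proc="select1"):
    """Filter items and record coarse (collection) and fine (element) provenance."""
    # Collection-level dependency in plain OPM: backward compatible.
    coarse = [("used", proc, src), ("wasGeneratedBy", dst, proc)]

    # Element-level mapping: each output element points at its source element.
    fine = []
    result = []
    for i, item in enumerate(items):
        if keep(item):
            fine.append(("wasDerivedFrom", f"{dst}[{len(result)}]", f"{src}[{i}]"))
            result.append(item)
    return result, coarse, fine


result, coarse, fine = traced_select([3, 8, 5, 12], lambda v: v > 4)
```

A consumer that only understands OPM can read the coarse edges and ignore the rest, while the wasDerivedFrom edges give the finer-grained dependencies between individual elements.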