/0613-workflow-minutes

Meeting for 06-13 Workflows
Present: Bill, Rebecca, Richard

Richard - going over the Excel sheet from before
    - would be useful to go over the differences between knowledge retrieval WF
- Kepler seems to be much more used in the literature for things that wouldn't be in myExp, since myExp is populated by developers and bioinformaticians alone

Bill:
Hypotheses:
1)    Workflows are becoming more complex and powerful over time:
a.      as demonstrated by increases in numbers of components and dataflow links
b.     as demonstrated by increased branching
c.      as demonstrated by increased numbers of sub-workflows (embedded workflows)
d.      as demonstrated by the proportion of workflows that perform simple data acquisition vs. those that perform numerous processing steps
2)   Most workflows perform simple, but repetitive data acquisition tasks as opposed to complex operations
3)   Workflows become more complex as one gains more experience (i.e., number of previous workflows created by that individual)
4)   Workflow re-use (downloads) is proportional to the complexity of tasks performed by the workflow.
5)   Workflow re-use is proportional to the sufficiency (comprehensiveness) of the documentation (i.e., metadata)
6)
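
One possible way to check hypotheses 4 and 5 once the spreadsheet is filled in is a rank correlation between a complexity (or documentation) score and the download count. The sketch below is only an illustration: the function names are hypothetical, the numbers are made up rather than taken from myExperiment, and Spearman's rho tests for a monotonic relationship, which is weaker than strict proportionality.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes both inputs vary; a constant column would divide by zero.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Illustrative values only (not real data).
components = [3, 12, 7, 25, 5]
downloads = [10, 40, 22, 90, 15]
rho = spearman(components, downloads)
```

A rho near +1 would be consistent with the hypothesis, a rho near 0 with the null hypothesis of no relationship.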

Understanding workflows:
1)    numbers of data sources within a workflow
2)    incorporation of QA/QC
3)   number of components
4)    number of workflow links
5)   date created
6)    number of downloads
7)   number of sub-workflows
8)    use of specific data sources
9)   use of specific models embedded in workflow
10) number of workflows created by an individual
11)   discipline covered by workflow
12) number of workflows created by individual users (what is the shape of the curve? Long tail?)
13)
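
Several of the measures listed above (components, links, sub-workflows, data sources, workflows per individual) could be counted mechanically once a workflow is reduced to nodes and links. A rough sketch, assuming a hypothetical in-memory representation; the `Workflow` class and the node kinds are invented for illustration and are not the myExperiment, Taverna or Kepler formats. Branching is approximated here as the number of nodes with more than one outgoing link.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Workflow:
    author: str
    nodes: dict  # node name -> kind, e.g. "data_source", "processor", "sub_workflow"
    links: list = field(default_factory=list)  # (source node, target node) pairs

def structural_metrics(wf):
    """Count the simple structural measures for one workflow."""
    out_degree = Counter(src for src, _ in wf.links)
    kinds = Counter(wf.nodes.values())
    return {
        "components": len(wf.nodes),
        "links": len(wf.links),
        "branching_nodes": sum(1 for d in out_degree.values() if d > 1),
        "sub_workflows": kinds["sub_workflow"],
        "data_sources": kinds["data_source"],
    }

def workflows_per_author(workflows):
    """Return (author, count) pairs, most prolific author first."""
    return Counter(wf.author for wf in workflows).most_common()

# Made-up example workflow.
example = Workflow(
    author="user_a",
    nodes={
        "fetch": "data_source",
        "clean": "processor",
        "model": "sub_workflow",
        "plot": "processor",
    },
    links=[("fetch", "clean"), ("clean", "model"), ("clean", "plot")],
)
metrics = structural_metrics(example)
```

Plotting the counts from `workflows_per_author` in descending order would show whether the curve has a long tail.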

Richard:

   Organise in terms of 2-3 tiers of information
      1: high level - author, when, how many, nodes, workflow links, goal
      2: number of nodes, % of data acquisition, % of data sources, type of nodes, models?
      3: sufficiency of the metadata (semantic and natural language description), plays into workflow reuse
Time ten random workflows
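
If the ten workflows are to be drawn at random from a list of workflow IDs, a fixed seed keeps the selection repeatable for the methods write-up. A minimal sketch; the ID range and the seed are placeholders, not the real size of the repository.

```python
import random

def pick_sample(workflow_ids, k=10, seed=613):
    """Pick k distinct IDs; the fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(workflow_ids), k))

# Placeholder ID range, for illustration only.
sample = pick_sample(range(1, 501))
```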

The goal is to verify/look into the hypotheses above (and then go on to more).

additional hypotheses?
   - ways to automate? (for the email to David)
      - also, suggestions for other ways to analyse these

      1. revise spreadsheet to be organised by tiers
      2. Get to grips with SPARQL as much as possible
      3. write up the approach (methods, tier 1/2/3, selection processes, null hypothesis, etc.)
      4. use as a basis to contact David, and as a mini-proposal for what to do for the summer. Send out when done. Significant time constraints on what we can do - justification for myExperiment, using tiered approach to do that. (2-5 pages probably)
      5. Keep track of a list of bioinformatics/confusing ones

      London time 1600, Mountain Time 900, Pacific Time 800, Tuesday