Understanding Workflows, 08/06/2011
Present: Bill Michener, Rebecca Koskela, Karthik Ram, Richard Littauer

Richard: myExperiment
* Some issues with Kepler: 17 uploads, by 2 people. One is a developer, the other was a one-time contributor.
* Taverna 1, 2, & RapidMiner are fine.
* Triana is around, but there's nothing on myExperiment.
* Bill: More than enough just to use Taverna 1, Taverna 2, RapidMiner, and Kepler.

Worth waiting until we email the Kepler lists to see if anyone uses it.
- Email Ecolog, the Kepler lists, etc.
- Worth saying that we're not going to post our results to the list (confidential).
- Direct users to the workflows website.

Bill: Will send out an email about contacting Science Pipes.
Run through one or two of the Taverna workflows from myExperiment.

Categorisation: http://epad.dataone.org/xw4frGxC80
- Do we categorise based on the information flowing through the workflow, on the output, or on the input?
- How do we characterise them? End up with an Excel sheet with toggles (a rough column layout is sketched at the end of these notes):
  * QA/QC steps?
  * Can you visualise it? Throughout the process, or only at the end? Can you edit it in the middle?
  * Embedded processes? Are there any? How many? What sort of processes are embedded?
  * How many databases?
  * Does it convert file types?
  * Does it run stats? Is it producing results itself?

Many of these are dependent on the system: some might not have visualisations in the middle, only at the end. How do you tell the workflow from the system in that case?

Bill: We should be looking for commonalities between Kepler, Taverna, etc.
* Are there characteristics we can use to characterise the complexity or the types of processes within these?
* Are there generic ways to categorise them? If there aren't, then these might just be additional criteria for each system.

Would be worth reading the Groth paper:
- Talks about natural language descriptions of workflows.
- Looked at what the tags were and what the example text was: information about the steps, about the statements, how they were organised, and advice.
- Ideally, we'd like to codify this information: not just workflow management, but how we would describe workflows to other scientists.

Wf4Ever (Workflow Forever): http://www.wf4ever-project.org/
- New project, looking at workflow decay.
- Looking at reuse, drop-off, and how repositories are helping.
- Specifically, how can we build a cyberinfrastructure to keep workflows together?
- We, by contrast, are dealing with natural language descriptions in order to ease the sharing and use of workflows.

We need to strike a balance between Groth and Wf4Ever.

Karthik: Did Bertram get Richard in touch with Paulo? (Not yet.)

Re-email the Groth paper and the other paper around.

Next step is to take some particular workflows and see how we can use the natural language categories to characterise them. Going through half a dozen would be a good test of whether the methodology holds.

Karthik: Don't download more for now; just go through these and see how it goes before we get more data.

Workflow 761: Bindata for KEGG, http://www.myexperiment.org/workflows/761.html
- Image: http://www.myexperiment.org/workflow/version/image/2182/databinswithkeggid.png
- KEGG: the Kyoto Encyclopedia of Genes and Genomes.
- Goes along, mines information from various databases, and works on that.
- Input is a 6-8 character KEGG ID.
- Richard: haven't been able to get an output.
- Have everyone install Taverna 2 by next week and try to run some workflows.
- A random selection (and the example one) all failed.
- Bill: Ideally we wouldn't need to run the workflows.
- Richard: some inputs we don't have.
- Bill: some you might need database access for.
* So, this is a complicated workflow; it doesn't just convert a gene ID.
* It accesses several databases with different information in them.
* Multi- vs. uni-purpose workflow.
* Run it with the example input.
* Update Skype client.
* Three main areas:
  * Far right: IDs for the genes; goes to PDB and looks those up.
  * Far left: looks for SNPs and Swiss probes in another database.
  * Also uses Moby, a whole set of programs coming out of Mass.
  * Middle: PubMed, data publication.
    * Doesn't just find information, but also publications.
  * Also translates IDs.
  * Also provides visualisations (PDB 3D images).
    * Is this a visualisation not just of the workflow but of the output? Worth checking.
* Queries multiple sources, but no processing?

Workflow 161: XKCD, http://www.myexperiment.org/workflows/161.html
* Gets the URL, fetches the current page, extracts the image link, and downloads the image (a minimal script version is sketched at the end of these notes).
* Simple workflow, but:
  * Different inputs: human-modified input vs. hardcoded parts of the workflow itself.
  * How often are there variables you need to define for each workflow? 33% user-made workflows? Possible. Think about terminology.
  * Branching structure: this one is linear, the other is much more complex.
  * What flows through is changed here: it goes from text to an image file.
  * Might be worth taking into account the information being passed along: just text? An image? A 3D image? Text with a lot of provenance data?
    * This isn't shown in the workflow, but it could be.
  * This is essentially just a retrieve; there's no new processing.

So:
* One query
* Multiple queries
* One processing job
* Grid processing job
* etc.

Start identifying ways to categorise these. Going ahead and documenting this workflow, and others, would be a good way to progress.

Look at workflows on Thursday and Friday. At the next meeting, go over results and decide where to go from there. Should have Kepler results by then.

Next meeting: Monday 13th, 7 Pacific, 8 Mountain, 15 GMT.

Take the images for the workflows, look at them, describe them, then build on those descriptions. This will be an iterative process, but once the basic structure is there we should be good to go. Likely lots of help from Rebecca, from the bioinformatics way of looking at things.

Email Rebecca about flights to Berlin and internet access for after June 26th.

End of call.
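
Sketch 1: Workflow 161 is a useful minimal case, a strictly linear retrieve with no new processing. A rough Python sketch of the same four steps follows (hardcoded URL, fetch the current page, extract the image link, download the image). The xkcd.com markup assumed by the regular expression, and the helper names, are illustrative assumptions, not taken from the myExperiment workflow itself.

    # Minimal sketch of the linear "retrieve" pattern in workflow 161:
    # hardcoded URL -> fetch current page -> extract img link -> download image.
    # The regex for xkcd's page markup is an assumption, not part of the original workflow.
    import re
    import urllib.request

    PAGE_URL = "https://xkcd.com/"  # hardcoded part of the workflow

    def fetch_page(url: str) -> str:
        """Step 1: get the current page as text."""
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def extract_img_link(html: str) -> str:
        """Step 2: pull the comic image URL out of the page (markup assumption)."""
        match = re.search(r'<div id="comic">\s*<img src="([^"]+)"', html)
        if match is None:
            raise ValueError("no comic image found")
        src = match.group(1)
        return "https:" + src if src.startswith("//") else src

    def download_img(img_url: str, out_path: str = "comic.png") -> str:
        """Step 3: save the image; the data flowing through changes from text to an image file here."""
        with urllib.request.urlopen(img_url) as resp, open(out_path, "wb") as fh:
            fh.write(resp.read())
        return out_path

    if __name__ == "__main__":
        html = fetch_page(PAGE_URL)
        print(download_img(extract_img_link(html)))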
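
Sketch 2: for the "Excel sheet with toggles" idea, a minimal sketch of what one row per workflow could look like, written as CSV output in Python. The column names are placeholders drawn from the criteria in these notes (QA/QC, visualisation, embedded processes, databases, file-type conversion, stats) plus the query/processing axes; the example values for workflows 761 and 161 are rough guesses from the discussion above, not agreed categorisations.

    # Sketch of the categorisation sheet: one row per workflow, boolean/count toggles.
    # Column names are illustrative placeholders, not an agreed-upon schema;
    # the example values are rough guesses filled in from these notes.
    import csv

    COLUMNS = [
        "workflow_id", "system", "qa_qc_steps", "visualisation",
        "visualisation_point",       # e.g. "end only", "throughout"
        "embedded_processes", "n_databases", "converts_file_types",
        "runs_stats", "query_type",  # e.g. "single query", "multiple queries"
        "processing_type",           # e.g. "none", "single job", "grid job"
        "data_passed",               # e.g. "text -> image"
    ]

    rows = [
        {
            "workflow_id": 761, "system": "Taverna", "qa_qc_steps": False,
            "visualisation": True, "visualisation_point": "end (PDB 3D images)",
            "embedded_processes": True, "n_databases": 4, "converts_file_types": True,
            "runs_stats": False, "query_type": "multiple queries",
            "processing_type": "none?", "data_passed": "KEGG ID -> records, images",
        },
        {
            "workflow_id": 161, "system": "Taverna", "qa_qc_steps": False,
            "visualisation": False, "visualisation_point": "",
            "embedded_processes": False, "n_databases": 0, "converts_file_types": False,
            "runs_stats": False, "query_type": "single query",
            "processing_type": "none (retrieve only)", "data_passed": "text -> image",
        },
    ]

    with open("workflow_categories.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)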