Coordination Meeting for DataONE, TeraGrid MN, STEM TeraGrid Analyses

Agenda
* brief summary of successful TG computation allocation - Daniel F
* brief summary of planned/underway TG MN work - John C, Nicholas D
* DataONE linkages/objectives - Dave V, Matt J
* issues to solve?
* action items?

Attendees: Dave Vieglais, Matt Jones, Nick Dexter, John Cobb, Paul Allen, Kevin Webb, Steve Kelling, Daniel Fink

Notes
-------
* brief summary of successful TG computation allocation - Daniel F
    * awarded time, produced species distribution models
    * State of the Birds to come out on Tuesday
    * in January, submitted a research-track allocation request to extend the modeling for State of the Birds
        * more species, more detail, a lot more computational resources
        * got the allocation in April
        * part of the request was to put together a MN on TG as part of the allocation
    * results of modeling will include:
        * for each species, a distribution model, plus data products and a series of predictions
        * keep the models, visualizations of predicted distributions, statistics describing the impact of predictors in the model, and some results on the predictive performance of the model
        * main results are species distribution estimates: 1) data files as rasters/matrices, and 2) visualizations as PNG files and animations
        * approximately 10-15 TB final output size (2-4 TB/yr, with about 7 years of data)
        * possibly some larger data sets at intermediate stages, but not all need to be kept
* brief summary of planned/underway TG MN work - John C, Nicholas D
    * TG'11 is the same week as the CCIT meeting; Dan Fink invited to give an invited talk; probably will go
    * this allocation award is the only NSF Bio award for TG allocations
    * requested two VMs on Quarry, 5 TB of shared file system, plus 5 TB on the Data Capacitor
    * discussions with TG to have all TG nodes deploy a MN at their site; this is a pilot for that
    * last year we said EVA could use more D1 infrastructure directly
    * plan is to write runs to Quarry and Albedo, insert directly into DataONE, and replicate to storage at MNs co-located with CNs
        * would show the full cycle end to end: from consumption to model to data-product storage to sharing
    * Paul: over the summer the AKN will bring up a MN, so there may be options there for storage too
    * need to discuss metadata generation for data inserted into MNs
    * Nick Dexter has brought up a MN at TG using Metacat
    * can request ASTA support from TG if needed
    * what web services are available? the basic CRUD operations from DataONE
    * Matt can give an overview of the R client for DataONE access to anyone who's interested
    * Matt: biggest issue might be timing of EVA production runs against DataONE production status
    * Daniel: initial production runs will start in the summer, producing about a TB of data initially
    * Dave: probably reasonable to send the data to the MNs (TG, AKN); can trigger a manual replication to staging nodes, and manually push content back to the AKN or another location
    * Steve: the AKN has up to now focused on delivering un-analyzed data products (e.g., the eBird data set); it doesn't yet have a way to organize and structure second-generation data sets; there are ideas on visualization tools, etc., not developed yet; could use feedback on ways to ease the transfer of these second-gen data sets (animations, images, and the data behind the animations)
    * also have time on the Nautilus visualization machine at SDSC; want to support large R applications; we might want to create additional visualizations; could be good for John and Daniel to discuss these opportunities
* DataONE linkages/objectives - Dave V, Matt J
    * Dave: the main need is a flow diagram showing how the whole system will work for the EVA experiment
        * will help nail down issues such as timing of replication, and avoid blocking each other through the year
    * Dave: Nick has a good handle on getting the MN up and operational; would be good to push data in; Metacat is up and running, as is the GMN; Nick has detailed notes on what the deployment took; wants to extend the implementation notes and OS install instructions on mule1
        * http://gw59.quarry.iu.teragrid.org:8080/knb/ - Metacat
        * http://gw59.quarry.iu.teragrid.org/mn - GMN (Python)
    * would be good to get feedback on the MN deployment; positive that it worked well on other OSes; needs testing on multiple OSes
    * Matt: how would a TG node be maintained over the longer term?
        * John: TG is to be replaced by XD
        * John: TG has a standard software stack, which should include a MN component on every TG node; got some pushback from HPC centers asking why 'these' services and not others; need to demonstrate use cases; unlikely to get TG FTEs to support it; TG is mostly focused on storage management; the exception is Reagan Moore's collaboration with iRODS for many users
    * Matt: could really use input on the design of the R client and the data formats, metadata formats, etc. that it supports
    * John: also need to examine the performance profile of all D1 components
* issues to solve
    * timing: the production STEM analysis will begin in the summer, but the D1 infrastructure may not yet be ready
        * Matt: let's create a staging system now to give the EVA people a more stable environment to work against
* action items
    * (Kevin - lead, Daniel, Matt, Paul, Vivek): [June 30] come up with a real metadata template for STEM results; use Matt's first draft created for the Feb NSF demo
    * (Daniel - lead, Dave, Nick, Kevin): [May 31] draft a document describing the information flow and the major components, systems, services, and protocols involved in the interactions
    * (Matt - lead, Kevin, Dave?, Paul, Nick): [June 15; needs draft of the document above] review the R package design and refactor as necessary to support this experiment
        * evaluate whether the STEM analysis would use the R package at all; might be command line instead
        * design packaging/folder structures, etc., for pushing content back into D1, including relationships between content elements
    * (Dave): [June 15] design and implement a staging deployment of the D1 infrastructure
    * (Paul): schedule a regular (monthly) check-in of this group
        * June - first/second week
        * July - during the CCIT meeting
        * August - TBD
* Targets:
    * July 15 - STEM analysis products being deposited dynamically, as analyses complete
    * discuss and resolve any blockers at the CCIT meeting in July; add a check-up call to the CCIT agenda

John Cobb's talking point reminders:
- PR coordination with TG for May 2? Elizabeth Leake
- get Nick on TG Project # TG-DEB110008 (AI: Dan as TRAC PI)
- TG'11 invited talk prospect (Dan may be contacted by Amit or Sudhakar)
- MN implementation on Quarry (Nick update)
- can move from Lonestar Albedo to Quarry Albedo to MN to Cornell as a workflow
- what metadata to insert into runs
- do we store all data for all time, or are there "short-term" collections?
- may need to submit a supplemental data request on DC-WAN or Albedo to support the upcoming year of work
- Nautilus help: we have an allocation and consulting help, including R support
- think about taking advantage of this effort to also instrument and measure the performance of the initial data implementation, and the load and performance of metadata-management transaction traffic, ...
- 140,830 files/species/yr: what is more effective at metadata management, LS or a tagged FS interface?
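The last reminder above (140,830 files per species per year) is worth putting next to the other figures in the notes (about 7 years of data, 10-15 TB total output). A back-of-envelope sketch in Python makes the implied file counts and average file sizes concrete; the species count below is a placeholder assumption, not a number from the meeting:

```python
# Back-of-envelope load estimate for the STEM archive, using figures from
# the notes: ~140,830 files per species per year, ~7 years of data, and
# ~15 TB total output. N_SPECIES is an illustrative assumption only.

FILES_PER_SPECIES_YEAR = 140_830
YEARS = 7
N_SPECIES = 50          # assumption for illustration; not stated in the notes
TOTAL_TB = 15

total_files = FILES_PER_SPECIES_YEAR * YEARS * N_SPECIES
avg_file_mb = TOTAL_TB * 1024 * 1024 / total_files  # TB -> MB

print(f"{total_files:,} files total")
print(f"~{avg_file_mb:.2f} MB average file size")
```

Under these assumptions the archive runs to tens of millions of sub-megabyte files, which is exactly the regime where plain directory listings (`ls`) degrade and a tagged/indexed metadata interface tends to pay off; the sketch is only meant to frame that question, not answer it.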
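The notes name "the basic CRUD operations from DataONE" as the web services available on the Member Node, with the GMN endpoint listed above. As a minimal sketch of what calling those operations could look like, the helpers below build REST URLs in the GET/POST style DataONE's MN interface follows; the exact path layout (`/object/{pid}`, no version segment) and the example PID are assumptions for illustration, so check them against the deployed MN before use:

```python
# Sketch of URL construction for DataONE-style CRUD calls against a Member
# Node. Paths and the example PID are illustrative assumptions, not a
# definitive rendering of the MN service interface.
from urllib.parse import quote

MN_BASE = "http://gw59.quarry.iu.teragrid.org/mn"  # GMN endpoint from the notes


def object_url(base: str, pid: str) -> str:
    """URL for read/update/delete of one object, addressed by its PID.

    The PID is percent-encoded so identifiers with spaces or slashes
    remain a single path segment.
    """
    return f"{base}/object/{quote(pid, safe='')}"


def create_url(base: str) -> str:
    """URL for creating (POSTing) a new object on the Member Node."""
    return f"{base}/object"


# Hypothetical PID for one STEM raster product (naming is illustrative).
pid = "stem.example_species.2010.raster"
print(object_url(MN_BASE, pid))
```

A GET on `object_url(...)` would retrieve the object, a POST to `create_url(...)` would insert one, and PUT/DELETE on the object URL would cover update and delete, which is the CRUD set the notes refer to.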