/20110429-D1-TG-STEM

Coordination Meeting for DataONE, TeraGrid MN, STEM TeraGrid Analyses

Agenda

brief summary of successful TG computation allocation - Daniel F
brief summary of planned/underway TG MN work - John C, Nicholas D
DataONE linkages/objectives - Dave V, Matt J
issues to solve?
action items?

Attendees:
Dave Vieglais, Matt Jones, Nick Dexter, John Cobb, Paul Allen, Kevin Webb, Steve Kelling, Daniel Fink

Notes
-------

brief summary of successful TG computation allocation - Daniel F
- awarded time, produced spp distribution models
- State of the Birds to come out on Tuesday
- in January submitted research track allocation; extend modeling for State of the Birds
  - More species, more detail, a lot more computational resources
  - Got the allocation in April
- part of the request was to put together a MN on TG as part of the allocation
  - results of modeling will include:
    - for each species, produce distribution model, plus data products, and a series of predictions
      - Keep models, visualizations of predicted distributions, and statistics that describe impact of predictors in the model, some results on predictive performance of the model
      - Main results are species distribution estimates: 1) data files as rasters/matrices, and 2) visualizations as png files & animations
      - Approximately 10-15TB final output size (2-4 TB/yr, and have about 7 years of data)
      - Possibly some larger data sets at intermediate stages, but not all need to be kept
brief summary of planned/underway TG MN work - John C, Nicholas D
- TG'11 same week as CCIT meeting; Dan Fink invited to give a talk; probably will go
- This allocation award is the only NSF Bio award for TG allocations
- Requested two VMs on Quarry; 5TB of shared file system; plus 5TB on data capacitor
- Discussions with TG to have all TG Nodes deploy MN at their site; this is a pilot for that
- Last year we said EVA could use more D1 infrastructure directly
- Plan is to write runs to Quarry, Albedo, and directly insert into DataONE, and replicate to storage at co-located MNs at CNs
  - would show end-to-end from consumption to model to data products storage to sharing in full cycle
- Paul: Over the summer the AKN will bring up a MN; so might be options there for storage too
- need to discuss metadata generation for data insert into MNs
- Nick Dexter has brought up a MN at TG using Metacat
- Can request ASTA support from TG if needed
- What web services are available? The basic CRUD operations from DataONE.
- Matt can give an overview of the R-client for DataONE access to anyone who's interested
- Matt: biggest issue might be timing of EVA production runs with DataONE production status
- Daniel: initial production runs will start in the summer, producing about a TB of data initially
  - Dave: probably reasonable to send the data to the MNs (TG, AKN); can trigger a manual replication to staging nodes, and manually push content back to AKN or other location
  - Steve: AKN up to now focused on delivering un-analyzed data products (e.g., eBird data set); don't have a way to organize and structure second generation data sets; have ideas on visualization tools, etc., not developed yet; could use feedback on ways to make it easier to transfer these second gen data sets (animations, images, and data behind the animation)
- also have time on Nautilus visualization machine at SDSC; want to support large R applications; we might want to create additional visualizations; could be good for John and Daniel to discuss these opportunities
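
A minimal sketch of what the "basic CRUD operations" noted above could look like as HTTP requests against the GMN on Quarry. The `/v1/object/{pid}` layout follows the later published DataONE REST API and is an assumption here; the service deployed at the time of this meeting may have differed. The function name and the identifier `stem/run 1` are hypothetical. The code only builds method/URL pairs and sends nothing over the network.

```python
from urllib.parse import quote


def crud_requests(base_url, pid, version="v1"):
    """Return the HTTP method and URL for each CRUD operation on one object.

    Assumption: the Member Node exposes objects under <base>/<version>/object.
    """
    root = f"{base_url.rstrip('/')}/{version}"
    # Identifiers may contain slashes or spaces, so encode every reserved character.
    encoded = quote(pid, safe="")
    return {
        "create": ("POST", f"{root}/object"),
        "read": ("GET", f"{root}/object/{encoded}"),
        "update": ("PUT", f"{root}/object/{encoded}"),
        "delete": ("DELETE", f"{root}/object/{encoded}"),
    }


if __name__ == "__main__":
    # Base URL taken from the GMN entry in these notes; the pid is made up.
    requests_by_op = crud_requests("http://gw59.quarry.iu.teragrid.org/mn", "stem/run 1")
    for op, (method, url) in requests_by_op.items():
        print(op, method, url)
```

Keeping the version prefix as a parameter makes it easy to point the same sketch at a staging deployment that uses a different path layout.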
DataONE linkages/objectives - Dave V, Matt J
- Dave: main thing is the need for a flow diagram to show how the whole system will work for the EVA experiment
  - will help nail down issues of timing of replication, etc., avoid blocking each other through the year
- Dave: Nick has a good handle on getting the MN up and operational; would be good to push data in; Metacat is up and running, as is the GMN; Nick has detailed notes on what it took to do the deployment; wants to extend implementation notes and OS install instructions on mule1
- http://gw59.quarry.iu.teragrid.org:8080/knb/ - Metacat
- http://gw59.quarry.iu.teragrid.org/mn - GMN (Python)
- Would be good to get feedback on MN deployment; positive that it worked well on other OSes; need testing on multiple OSes
- Matt: how would a TG node be maintained over the longer term?
  - John: TG to be replaced by XD
  - John: TG has standard sw stack, which should include a MN component on every TG node; got some pushback from HPC centers that ask why 'these' services and not others; need to demonstrate use cases; unlikely to get TG fte's to support it; TG is mostly focused on storage management; exception is Reagan Moore's collaboration with iRODS for many users
- Matt: could really use input on design of R-client and what data formats, metadata formats, etc that it supports
- John: also need to examine the performance profile for all D1 components
issues to solve
- Timing: production STEM analysis will begin in summer, but D1 infrastructure may not yet be ready
  - Matt: let's create a staging system now to give EVA people a more stable env to work against
action items
- (Kevin - lead, Daniel, Matt, Paul, Vivek): [June 30] come up with real metadata template for STEM results; use Matt's first draft created for the Feb NSF demo
- (Daniel - lead, Dave, Nick, Kevin): [May 31] Draft a document describing information flow and major components, systems, services and protocols involved in the interactions
- (Matt - lead, Kevin, Dave?, Paul, Nick): [June 15; needs draft of document above] Review the R package design and refactor as necessary for supporting this experiment
  - Evaluate whether STEM analysis would use R package at all; might be commandline instead
  - Design packaging/ folder structures etc for pushing content back into D1. Relationships between content elements.
- (Dave): [June 15] Design and set up a staging implementation of D1 infrastructure
- (Paul): schedule regular (monthly) checkin of this group
  - June - first/second week
  - July - during CCIT meeting
  - August - TBD
Targets:
- July 15 - STEM analysis products being deposited dynamically, as analyses are complete
- discuss and resolve any blockers at CCIT meeting in July; add check up call to CCIT agenda

John Cobb's talking point reminders:
- PR coordination with TG for May 2? Elizabeth Leake

- Get Nick on TG Project # TG-DEB110008 (AI: Dan as TRAC PI)

- TG'11 invited talk prospect (Dan may be contacted by Amit or Sudhakar)

- MN implementation on Quarry (Nick update)

- Can move from Lonestar Albedo to Quarry Albedo to MN to Cornell as a workflow

- what metadata to insert into runs

- Do we store all data for all time or are there "short-term" collections

- May need to get a supplemental data request on DC-WAN or Albedo to support upcoming year of work

- Nautilus help: we have an allocation and consulting help, including R support

- Think about taking advantage of this effort to also instrument and measure performance of initial data implementation, load and performance of metadata management transaction traffic, ...

140,830 files/species/yr: What is more effective at metadata management? LS or tagged FS interface?
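
To get a feel for the scale behind this question, here is a rough sketch that multiplies the per-species file count above by the "about 7 years of data" mentioned earlier in the notes. The number of species is not stated in these notes, so `n_species` is left as a parameter and any value passed to it is a placeholder, not a project figure.

```python
FILES_PER_SPECIES_PER_YEAR = 140_830  # figure from the talking point above
YEARS_OF_DATA = 7                     # "about 7 years of data" from the notes


def total_files(n_species, years=YEARS_OF_DATA):
    """Estimate how many files a metadata catalog would need to track."""
    return FILES_PER_SPECIES_PER_YEAR * n_species * years


if __name__ == "__main__":
    # One species over all years; larger species counts scale linearly.
    print(total_files(1))
```

Even a single species comes close to a million files over seven years, which is why the choice between listing a directory tree and querying a tagged interface matters.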