NEON DSWG Meeting 3

April 13, 2011
1 pm Eastern

Participants: M Jones, DJ Spiess, P Griffith, T Erickson, D Tarboton, L Gardiner, B Kao, C Lagoze, E Boose, I San Gil, B Peet, M Parsons, M Burek, B Domenico, Brian Wee, D Greenlee

Agenda and Notes:

1) Becky Kao: overview of NEON data activities in biology areas
  -- see outline: http://wiki.neoninc.org:8080/download/attachments/5079065/DSWG_FSU20110412.docx
  -- Types of data on their team
  -- Sampling modules (FSU) for terrestrial field sampling
    -- person on ground collecting data on plants, animals, microbes, etc
    -- above and below ground biomass, LAI, ..., (disease components)
    -- field component and lab component for each
    -- plus many samples shipped to external facilities for: genetics, chemical, disease, etc.
    -- bioarchive (collections): samples in one or more museum collections
    -- want compatibility with other communities that want the data
      -- taxonomic and phylogenetic; compatible with Darwin Core and Specify and museum collections; trying
      -- microbial diversity and function; GSC MIMARKS; compare Darwin Core with MIMARKS
    Questions --
      -- Peet: taxonomic concepts versus Latin binomials; Specify not really there yet; what thought has NEON given to these standards?
        -- Becky: three sources for a name for a thing, and they often don't agree; need to decide what to push out to the community
          -- field crews
          -- some sent for morphological expert ID
          -- DNA barcoding (often disagrees with expert)
        -- Peet: anything that is determined should have ability to have multiple determinations
          -- one of these might be tagged as preferred for reports, etc
          -- any time a Latin name is used, recommend that the name plus reference be cited
        -- Matt: is there an authoritative list of concepts (versus names)?
        -- Peet: NEON sites would be covered in standard places; could get ITIS to link to concept references; MoBot making run at comprehensive list of concepts
        -- doesn't GBIF have a naming authority?
          -- yes, Global Names Architecture (GNA), but it is name not concept based
        -- MIMARKS already selected as important for microbial work; are there others?
        -- Becky: would like clarification on the recommendation for what the minimal fields are (beyond the required fields from the standards)
          -- NEON additional constraints/required fields for each metadata standard
          -- in short: want a profile for NEON
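
Illustrative sketch (not from the meeting): Peet's suggestion that anything determined should be able to carry multiple determinations, with one tagged as preferred and each name cited with its reference, could be modeled roughly as below. The class names, field names, and sample values are hypothetical and are not part of Darwin Core, Specify, or any NEON schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Determination:
    """One identification of a specimen (all field names are hypothetical)."""
    name: str          # Latin name as reported
    reference: str     # reference that defines the concept behind the name
    source: str        # e.g. "field crew", "expert ID", "DNA barcode"
    preferred: bool = False


@dataclass
class Specimen:
    specimen_id: str
    determinations: List[Determination] = field(default_factory=list)

    def add_determination(self, det: Determination) -> None:
        # Keep at most one preferred determination per specimen.
        if det.preferred:
            for existing in self.determinations:
                existing.preferred = False
        self.determinations.append(det)

    def preferred_determination(self) -> Optional[Determination]:
        # Return the determination tagged for reports, if any.
        for det in self.determinations:
            if det.preferred:
                return det
        return None

    def cited_name(self) -> Optional[str]:
        # "Name plus reference" form recommended for any use of a Latin name.
        det = self.preferred_determination()
        if det is None:
            return None
        return f"{det.name} sec. {det.reference}"


# Placeholder values only; these are not real taxa or references.
specimen = Specimen("S-0001")
specimen.add_determination(Determination("Genus alpha", "Reference A", "field crew"))
specimen.add_determination(Determination("Genus beta", "Reference B", "expert ID", preferred=True))
specimen.add_determination(Determination("Genus gamma", "Reference C", "DNA barcode"))
```

All three determinations are retained even when they disagree; only the tag decides which one is used in reports.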
        
2) Review of survey results
     See http://wiki.neoninc.org:8080/display/DSWG/Metadata+standards
     
     See http://wiki.neoninc.org:8080/display/DSWG/Data+standards
     
3) Open discussion on data format standards
    -- Parsons: ASCII is common, but need to make recommendations about how to format it
    -- Best practices: ORNL DAAC has best practices for data preparation
    -- Parsons: consistently describe how your ASCII is formatted; could be covered in the metadata standards selected; how lat/lon are developed; would need to look a lot more closely at the data in question
    -- Is there a link for the DataONE best practices work? Matt will find and send a link.
    -- Parsons: NEON should provide data in multiple formats (as appropriate)
        -- e.g., ASCII, NetCDF, HDF
    -- Brian: if downloaded in pure ASCII, how do you get schema information (e.g., metadata)?
        -- for data mashups and repurposing, want to expose schema to end user
    -- good idea;
    -- Matt: existing metadata standards provide schema details
    -- use OAIS 'Dissemination Information Package' (DIP) to distribute metadata and data together
    -- Matt: Nobody listed RDF/LOD as a format;
        -- Carl: LOD/RDF makes a lot of sense, but imposes a lot
    -- Carl: OData also provides schema
    -- LOD: we should be on top of it, but probably too early to define detailed standards
    -- Lagoze: OPeNDAP, Google's new data markup language
        -- Matt: Dataset Publishing Language (DSPL) [http://code.google.com/apis/publicdata/]
    -- Parsons: starting to make more use of OpenSearch; ESIP is also making more use of it
    -- Domenico: not sure the OGC protocols were mentioned; WCS/WFS/SOS, plus CSW (Catalogue Service for the Web); need to also list these access protocols, in addition to the metadata standards and data standards
        -- see http://wiki.neoninc.org:8080/display/DSWG/DSWG+Deliverables
    -- Matt: don't forget about genetics data formats
    -- Brian: raw data from sequencers is in proprietary, vendor-specific formats; analyzed data (peaks called into ACTG) would be represented in a standard format for GenBank etc.; can't remember name of the Center
    -- Inigo: community databases accept GenBank format; FASTA format (formatted ASCII); also recently accept MIMARKS and BCOL extensions; is NEON going to upload sequences to central repositories?
      -- Brian: not sure, but he thinks they are planning on doing so
      -- Brian: in addition to sequence data, have chip-based genetic screening data (microarray?); raw data may be in a different format
      -- Matt: is there a need to store and publish the raw data from the sequencers/sensors?
      -- Brian: for sequencing data, should keep the calibration data also
      -- Inigo: MIMARKS is a concept list, can be represented in GCDML (Genomic Contextual Data Markup Language)
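
Illustrative sketch (not from the meeting): FASTA, mentioned above as formatted ASCII, uses a header line beginning with ">" followed by one or more lines of sequence. A minimal reader might look like the following. The function name and the sample records are hypothetical, and the sketch ignores the many conventions real repositories apply to header lines.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            chunks[current] = []
        elif current is not None:
            # Sequence data may wrap across several lines.
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Made-up records for illustration only.
example = """>seq1 hypothetical sample
ACTG
GGCA
>seq2
TTAA
""".splitlines()

records = parse_fasta(example)
```

Note that the header text after the identifier is discarded here; contextual fields such as those MIMARKS lists would need to travel in accompanying metadata.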
      
4) Discussion of metadata and data standards assessment for NEON Data products
    See http://wiki.neoninc.org:8080/download/attachments/4653063/DataProducts.L1thru4.2010.06.pdf
    -- Matt: will want to make specific recommendations about the individual data products
    -- Brian: will provide another level of categorization of products soon (in about two weeks, four-level hierarchy)
    -- Inigo: great to stay at individual basis of measurement
    -- Matt: will work with Steve to segment data products into related groups that can be treated fairly uniformly from a publication/documentation/representation perspective; next call to be focused on making specific recommendations for these product groups
    
Action items:
  -- Matt: Provide link to DataONE best practices work
  -- Matt: provide URL for Google's data markup language
      -- Matt: Dataset Publishing Language (DSPL) [http://code.google.com/apis/publicdata/]
  -- Brian: Send link to new data product classification when available
  -- Matt and Steve: break products into functional groups for recommendations
  -- Matt: Doodle for next meeting in about 3 weeks
  -- All: review the data products document ahead of the detailed discussion of publishing recommendations for the product groups