NEON DSWG Meeting 3
April 13, 2011, 1 pm Eastern

Participants: M Jones, DJ Spiess, P Griffith, T Erickson, D Tarboton, L Gardiner, B Kao, C Lagoze, E Boose, I San Gil, B Peet, M Parsons, M Burek, B Domenico, Brian Wee, D Greenlee

Connection details
* Dial-in number: 866.212.0875
* Attendee password: 504782#

Agenda and Notes:

1) Becky Kao: overview of NEON data activities in biology areas
-- see outline http://wiki.neoninc.org:8080/download/attachments/5079065/DSWG_FSU20110412.docx
-- Types of data on their team
  -- Sampling modules (FSU) for terrestrial field sampling
    -- person on ground collecting data on plants, animals, microbes, etc.
    -- above and below ground biomass, LAI, ..., (disease components)
    -- field component and lab component for each
    -- plus many samples shipped to external facilities for: genetics, chemical, disease, etc.
  -- bioarchive (collections): samples in one or more museum collections
-- want compatibility with other communities that want the data
  -- taxonomic and phylogenetic; compatible with Darwin Core and Specify and museum collections; trying
  -- microbial diversity and function; GSC MIMARKS; compare Darwin Core with MIMARKS

Questions
-- Peet: Taxonomic concepts versus Latin binomials; Specify not really there yet; what thought has NEON given to these standards?
-- Becky: three sources for a name for a thing, and they often don't agree; need to decide on what to push out there to the community
  -- field crews
  -- some sent for morphological expert ID
  -- DNA barcoding (often disagrees with expert)
-- Peet: anything that is determined should have the ability to have multiple determinations
  -- one of these might be tagged as preferred for reports, etc.
  -- anytime a Latin name is used, recommend that the name plus reference be cited
-- Matt: is there an authoritative list of concepts (versus names)?
-- Peet: NEON sites would be covered in standard places; could get ITIS to link to concept references; MoBot is making a run at a comprehensive list of concepts
-- Doesn't GBIF have a naming authority?
  -- yes, Global Names Architecture (GNA), but it is name-based, not concept-based
-- MIMARKS already selected as important for microbial work; are there others?
-- Becky: would like clarification on the recommendation for what the minimal fields are (beyond the required fields from the standards)
  -- NEON additional constraints/required fields for each metadata standard
  -- short name: want a profile for NEON

2) Review of survey results
See http://wiki.neoninc.org:8080/display/DSWG/Metadata+standards
See http://wiki.neoninc.org:8080/display/DSWG/Data+standards

3) Open discussion on data format standards
-- Parsons: ASCII is common, but need to make recommendations about how to format it
-- Best practices: ORNL DAAC has best practices for data preparation
-- Parsons: Consistently describe how your ASCII is formatted; could be covered in the metadata standards selected; e.g., how lat/lon are developed; would need to look a lot more closely at the data in question
-- Is there a link for the DataONE best practices work? Matt will find and send a link.
-- Parsons: NEON should provide data in multiple formats (as appropriate)
  -- e.g., ASCII, NetCDF, HDF
-- Brian: if downloaded in pure ASCII, how do you get schema information (e.g., metadata)?
  -- for data mashups and repurposing, want to expose schema to end user
  -- good idea
  -- Matt: existing metadata standards provide schema details
  -- use OAIS 'Dissemination Information Package' (DIP) to distribute metadata and data together
-- Matt: Nobody listed RDF/LOD as a format
  -- Carl: LOD/RDF makes a lot of sense, but imposes a lot
  -- Carl: OData also provides schema
  -- LOD: we should be on top of it, but probably too early to define detailed standards
-- Lagoze: OPeNDAP, Google's new data markup language
  -- Matt: Dataset Publishing Language (DSPL) [http://code.google.com/apis/publicdata/]
-- Parsons: Starting to make more use of OpenSearch; ESIP is making more use of this
-- Domenico: not sure the OGC protocols were mentioned; WCS/WFS/SOS, plus CSW (Catalogue Service for the Web); need to also list these access protocols, in addition to the metadata standards and data standards; http://wiki.neoninc.org:8080/display/DSWG/DSWG+Deliverables
-- Matt: don't forget about genetics data formats
  -- Brian: raw data from sequencers is in proprietary, vendor-specific formats; analyzed data (peaks converted into ACTG base calls) would be represented in the standard format for GenBank, etc.; can't remember name of the Center
  -- Inigo: community databases accept GenBank format; FASTA format (formatted ASCII); also recently accept MIMARKS and BCOL extensions; is NEON going to upload sequences to central repositories?
  -- Brian: not sure, but he thinks they are planning on doing so
  -- Brian: in addition to sequence data, have chip-based genetic screening data (microarray?); raw data may be in a different format
-- Matt: is there a need to store and publish the raw data from the sequencers/sensors?
-- Brian: for sequencing data, should keep the calibration data also
-- Inigo: MIMARKS is a concept list, can be represented in GCDML (Genomic Contextual Data Markup Language)

4) Discussion of metadata and data standards assessment for NEON Data products
See http://wiki.neoninc.org:8080/download/attachments/4653063/DataProducts.L1thru4.2010.06.pdf
-- Matt: will want to make specific recommendations about the individual data products
-- Brian: will provide another level of categorization of products soon (in about two weeks, four-level hierarchy)
-- Inigo: great to stay at individual basis of measurement
-- Matt: will work with Steve to segment data products into related groups that can be treated fairly uniformly from a publication/documentation/representation perspective; next call to be focused on making specific recommendations for these product groups

Action items:
-- Matt: Provide link to DataONE best practices work
-- Matt: Provide URL for Google's data markup language
  -- Matt: Dataset Publishing Language (DSPL) [http://code.google.com/apis/publicdata/]
-- Brian: Send link to new data product classification when available
-- Matt and Steve: Break products into functional groups for recommendations
-- Matt: Doodle for next meeting in about 3 weeks
-- All: Review data products document for detailed discussion of publishing for product groups