Discussion on DataONE support for Dublin Core metadata
======================================================
Attending
---------
Christopher Jones, Skye Roseboom, Dave Vieglais, Matt Jones, Jane Greenberg, Laura Moyers, James Long, Ben Leinfelder, Robert Waltz, Bruce Wilson, Angela Murillo
Conference URL: https://plus.google.com/hangouts/_/nceas.ucsb.edu/dc-discussion
Background
----------
DataONE has upcoming Member Nodes that would like to use Dublin Core metadata fields as harvested science metadata describing their objects. Should DataONE host a custom schema to accommodate this, or should research communities use the fields in their own hosted schemas? Should DataONE parse *any* science metadata document and index any DC elements it encounters as a fallback? We need a short-term solution (1-2 weeks), which will be the focus of the discussion.
The indexed content for DataONE is described in the Content Discovery section of the Architecture documents:
https://releases.dataone.org/online/api-documentation-v1.2.0/design/SearchMetadata.html
The implementation of this uses a Solr search engine, and you can see the list of indexed fields at the CNRead.getQueryEngineDescription() REST API endpoint:
https://cn.dataone.org/cn/v1/query/solr
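For anyone who wants to check which fields are indexed before proposing a DC mapping, the following is a minimal Python sketch that lists the field names from the query engine description. It assumes the endpoint returns an XML document containing ``queryField`` elements, each with a ``name`` child; the sample document, the helper names (``list_query_fields``, ``fetch_query_fields``), and the field values shown are illustrative assumptions, not actual output from the Coordinating Node.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

QUERY_ENGINE_URL = "https://cn.dataone.org/cn/v1/query/solr"

# Illustrative sample only; the real response layout is an assumption here.
SAMPLE = """<d1:queryEngineDescription xmlns:d1="http://ns.dataone.org/service/types/v1.1">
  <name>solr</name>
  <queryField><name>title</name><type>text_general</type><searchable>true</searchable></queryField>
  <queryField><name>scientificName</name><type>string</type><searchable>true</searchable></queryField>
</d1:queryEngineDescription>"""


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def list_query_fields(xml_text):
    """Return one dict per queryField element, keyed by child element name."""
    root = ET.fromstring(xml_text)
    fields = []
    for element in root.iter():
        if local_name(element.tag) != "queryField":
            continue
        props = {local_name(child.tag): (child.text or "").strip() for child in element}
        if "name" in props:
            fields.append(props)
    return fields


def fetch_query_fields(url=QUERY_ENGINE_URL):
    """Fetch the description over HTTP and parse it (requires network access)."""
    with urlopen(url) as response:
        return list_query_fields(response.read())


if __name__ == "__main__":
    for field in list_query_fields(SAMPLE):
        print(field["name"], field.get("type", ""))
```

Calling ``fetch_query_fields()`` instead of parsing ``SAMPLE`` would run the same logic against the live endpoint, provided the response matches the assumed layout.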
Agenda
------
1. Quick summary of email discussion thus far (Chris)
2. Creating a custom DataONE schema
- What DC namespace would it target? (full /terms or limited /elements/1.1)
- Pros
- A minor change from the existing Dryad schema should be easy to parse.
- Cons
- We'd have to maintain it, possibly for decades (which is hard based on our EML experience)
- Not really community-driven
- Ball, A. (2009, June 3). "Scientific Data Application Profile Scoping Study Report". Retrieved from http://www.ukoln.ac.uk/projects/sdapss/papers/ball2009sda-v11.pdf
- Page 20, Section 3.6.3 describes the ambiguity of Simple DC and why it is not suited to describing scientific metadata
- Conclusion states that it would take a year of effort to produce an Application Profile
- Page 39, Section 7.2 "From work on previous Dublin Core Application Profiles, it is estimated that the development effort required to produce a Scientific Data Application Profile, along with supporting documentation, will be twelve months full time equivalent."
3. Parsing DC fields from any valid XML/XHTML
- Pros
- doesn't require a specific schema
- provides a good fallback
- Cons
- it would be difficult to interpret the context of the DC fields if they are found 'anywhere' in the XML document; it's quite ambiguous in many scenarios
- Would take considerable effort and time to produce the solution
4. Using the Dryad DC Application Profile schema
- Pros
- Already Supported by DataONE
- Cons
5. Using an existing DC schema
- See http://dublincore.org/documents/dc-xml-guidelines/
- Can we use or extend one of:
- Pros
- Already exist; don't need to be extended for base elements
- An existing community already maintains these schemas, and they are well-known
- Both simpledc and qualifieddc are already registered format types
- Cons
- May not provide all of the elements needed for good indexing in DataONE for MPC
- But we don't really have minimum requirements now
- The defined simple version does not have a spatial field and does not have a place for scientific names.
6. Support DDI directly, which is what MPC really seems to use
7. Register the DataCite Metadata Kernel schema as an accepted type
8. Short term decision for supporting the Minnesota Population Center
DECISIONS:
- Recommend that MPC use http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd as a short-term fix
- Option #2 is not desirable from a long-term maintenance POV, and shouldn't be pursued
- Option #3 is a very broad effort in which field context is difficult to interpret, and shouldn't be pursued at the moment
- For option #5, we should support the containers for all of these existing schemas as science metadata for the long term
- We should also support #6 and #7 in the long term; we could register them now, but not create an index parser until someone starts adding content of that type
- TODO: Bruce will write up recommendation for MPC
- BEW 2014-08-24: To convert an XML file that used the dataone_dc proposed schema to the qualifieddc schema:
- Change the xsi:schemaLocation attribute to
- xsi:schemaLocation="qualifieddc.xsd http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"
- Change the containing element from <dataone_dc> to <qualifieddc>
- Change the xmlns for dc to xmlns:dc="http://purl.org/dc/terms/"
- Change all instances of dcterms: to dc:
- Note: the Dryad spec includes dwc:scientificName and we do index that. It appears possible, using the qualifieddc schema, to include that by adding xmlns:dwc="http://rs.tdwg.org/dwc/terms/" to the namespace declaration for the metadata and then including terms such as
- <dwc:scientificName>Cercis canadensis</dwc:scientificName>
- MPC requested timeline:
- by August 22 - DataONE indexing system updated to parse DC metadata (based on the mapping Wendy provided yesterday)
- by August 22 - Wendy generates DC metadata objects compatible with the updated indexing system
- by August 25 - We ingest the metadata into our member node, and it is ingested and indexed in DataONE's test environment
- Fabio will be in Italy August 27 - September 12 (Our lead TerraPop developer, Alex Jokela, will be handling issues that require our attention during this time.)
- By September 22 - Testing complete; any glitches in DC indexing ironed out, metadata vetted and certified for production
- By September 29 - Member node moved into production with indexed metadata
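
The conversion steps in the BEW 2014-08-24 note under DECISIONS could be scripted rather than applied by hand. The following is a minimal Python sketch that applies the four textual changes in order. It assumes the source file uses an unprefixed ``<dataone_dc>`` root element and double-quoted attributes; the function name ``dataone_dc_to_qualifieddc`` and the sample document (including its original ``xsi:schemaLocation`` value) are hypothetical, since the proposed dataone_dc schema is not reproduced in these notes.

```python
import re
import xml.etree.ElementTree as ET

QDC_SCHEMA_LOCATION = (
    "qualifieddc.xsd "
    "http://dublincore.org/schemas/xmls/qdc/2008/02/11/qualifieddc.xsd"
)
DC_TERMS_NS = "http://purl.org/dc/terms/"

# Hypothetical input; the real dataone_dc layout is an assumption.
SAMPLE = """<dataone_dc xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/dataone_dc dataone_dc.xsd">
  <dc:title>Example title</dc:title>
  <dcterms:spatial>Minnesota</dcterms:spatial>
</dataone_dc>"""


def dataone_dc_to_qualifieddc(xml_text):
    """Apply the four conversion steps from the notes as text substitutions."""
    # 1. Point xsi:schemaLocation at the qualifieddc schema.
    xml_text = re.sub(
        r'xsi:schemaLocation="[^"]*"',
        'xsi:schemaLocation="' + QDC_SCHEMA_LOCATION + '"',
        xml_text,
    )
    # 2. Rename the containing element in both its opening and closing tags.
    xml_text = re.sub(r"(</?)dataone_dc\b", r"\1qualifieddc", xml_text)
    # 3. Rebind the dc prefix to the DC terms namespace.
    xml_text = re.sub(
        r'xmlns:dc="[^"]*"', 'xmlns:dc="' + DC_TERMS_NS + '"', xml_text
    )
    # 4. Rewrite every dcterms: prefix to dc:. The xmlns:dcterms declaration
    #    itself is left in place; it is unused but harmless.
    return xml_text.replace("dcterms:", "dc:")


if __name__ == "__main__":
    converted = dataone_dc_to_qualifieddc(SAMPLE)
    ET.fromstring(converted)  # raises ParseError if the result is not well-formed
    print(converted)
```

Because the substitutions are purely textual, the sketch only checks that the result is well-formed XML; validating the output against qualifieddc.xsd would still be a separate step.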