Notes for Standup 2013.20-Block.3.2 (5/12 to 5/25)
=================================================
Chris B.
--------
- 20130531
- Worked with David on ORC node
- Robert and Chris started Ansible upgrade in dev
- flags are getting reset, looking into it
- initial installation works fine, but upgrades are problematic
- Working on the ORE maps in indexing
- using a separate branch
- using an interface to translate to ResourceMapFactory
- 20130524
- Finished getting David set up at UTK
- public key is in the sysadmin folder
- worked with Robert on dataone-cn-os-core install scripts
- still failing this morning (blanking out the CONTEXT label field)
- Meeting with Skye on indexing today
- Created the UTK DSpace VM for Joe Agrippa
- OIT will be upgrading nodes at UTK (migrating VMs around, no outage expected)
- 20130522
- dataone-cn-os-core package is finished being tested (Robert) - fails installation during preconfigure
- Meeting with Joseph Agrippa (sp?) re: DSpace at ORC
- Keep SEAD and Dryad impls in mind if UTK is planning a DSpace MN, and consolidate the effort
- Working with David
- 20130520
- Wrote Ansible playbook for LDAP, covers initial setup with LDAP auth in VMs
- Working on dataone-cn-os-core package: getting an error installing this package
- perhaps a certificate issue?
- will debug this with robert
- 20130517
- Ready for an Ansible demo
- 20130515
- ansible loading templates in deb config db
- teaching David some DataONE
- will be ready for ansible demo on Friday
- 20130513
- Finished ansible module
- using it to populate templates (stating each question, associated flags)
- some default values
- known IP addresses may get 1 to 2 questions
- asks for LDAP and Java passwords
- Shooting to finish playbooks this week (demo on Friday)
Robert
------
- 20130524
- Working with Chris on dataone-cn-* packages
- figured out some full-clean LDAP installation problems
- Debugging scripts now - ensuring properties are being set correctly
- need to take one of the cn-dev machines out of the round robin
- LogStats API work
- 20130522
- dataone-cn-os-core package is in testing phase, troubleshooting bugs
- 20130520
- release of revised dataone packages for Chris
- working on XML download of config file
- 20130517
- 20130515
- restructuring postinst along procedural lines
- writing up document
- 20130513
- got the git repo set up on github, now cloned on dev-testing (dataone-cn-os-core)
- using ubuntu-rpw channel at the moment, usable by Chris B.
- working on using set -e in the scripts, and pulling values from XML file
- see http://epad.dataone.org/d1DebConfigXml for continuing implementation notes and decisions
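A minimal sketch of the "pull values from an XML file and fail fast" idea, written in Python for illustration rather than in the shell used by the actual postinst scripts. The `<config>`/`<property>` layout, the property names, and the helper names are all assumptions; the real schema is whatever the epad notes settle on.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout, for illustration only.
SAMPLE_CONFIG = """
<config>
  <property name="cn.hostname">cn-dev.example.org</property>
  <property name="ldap.port">389</property>
</config>
"""


def load_properties(xml_text):
    """Return a {name: value} dict built from <property name="..."> elements."""
    root = ET.fromstring(xml_text)
    props = {}
    for elem in root.iter("property"):
        name = elem.get("name")
        if not name:
            raise ValueError("property element is missing its name attribute")
        props[name] = (elem.text or "").strip()
    return props


def require(props, name):
    """Stop immediately when a value is missing or empty (similar in spirit to set -e)."""
    value = props.get(name)
    if not value:
        raise KeyError(f"required property not set: {name}")
    return value


if __name__ == "__main__":
    properties = load_properties(SAMPLE_CONFIG)
    print(require(properties, "cn.hostname"))
```

Raising on the first missing value keeps a half-configured install from continuing silently, which is the behaviour set -e is meant to give the shell scripts.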
David Doyle
-----------
- 20130524
- Worked with Chris on the DSpace VM
- Subversion training and access
- Continued work on D1 documentation
- 20130522
- Working on documentation
- Working with Chris
- 20130517
- Vsphere access now
- worked with Chris on building VMs
- Now has Sysadmin access in subversion
- 20130513
- Getting access to systems (VPN, Sharepoint, VSphere, etc.)
- Will be getting together with Chris this week
- Working on documentation
Roger Dahl
----------
- 20130524
- Initial versions of controlled hierarchies in ONEDrive are working
- still working on the ability to click through down to the science data object
- using 'workspace' resolvers
- testing against production
- 20130522
- Working on ONEDrive controlled hierarchy
- 20130520
- Adding in the filtered D1 objects (implementing matt's mockup)
- taxa
- time periods
- authors
- regions
- 20130517
- Finished initial conversion of ONEDrive to a Workspace based approach.
- 20130515
- Working on ONEDrive. Still hoping to complete by EOW.
- 20130513
- Continued ONEDrive work - shooting for the end of the week
Rob
---
- 20130524
- In touch with Ryan re: dev environment, haven't heard back yet
- Needs to register in /portal and at docs.dataone.org for the certificates
- 20130522
- Testing Dryad - found some issues
- Will register dev.datadryad.org in dev
- Packaging up libclient
- added new maven assembly goal to produce artifacts with /libs on dev-testing
- will manually copy to releases.dataone.org
- release names need to follow tag names precisely
- 20130520
- Looking at the libclient jar issue (releasing a jar + libs/ tgz)
- R client includes everything except jibx*
- jibx is only needed for the codegen phase of the buildout
- maven dependency list still has them listed, trying to figure out how to exclude them
- 20130517
- Performance testing wrapped up for very large resource maps
- Upshot: creating is expensive with foresite, deserializing isn't
- Perl script takes 10 seconds to serialize
- Big impact is the high degree of cardinality for the documents field
- Suggest review of the writeup documentation
- Libclient discussion
- Use maven to create a release with dependencies
- release.dataone.org
- JibX tools and its dependencies on Eclipse plugin arch are not needed for the execution of the jibx generated classes.
- Will work on using hudson to create a recipe for producing product releases for common and libclient
- 20130515
- finishing Very Large Data Packages arch document (http://mule1.dataone.org/ArchitectureDocs-current/design/VeryLargeDataPackage.html)
- memory needed to build the model from the serialized form is the main limitation, and the reasoning model I use takes 3-6 times the memory, so this may need to inform our data package requirements
- need tools for estimating resource map size prior to deserializing into a model, where the memory hit is incurred.
- use of a solr query (q=resourceMap:{ReM}) gets the number of DataONE objects in the resource map
- number of objects in the map is a good indicator under normal circumstances.
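A rough sketch of the size check described above: ask Solr only for the hit count of `resourceMap:{ReM}` before deciding whether to deserialize the map. The base URL, the function names, and the `max_objects` threshold are assumptions for illustration; `rows=0`, `wt=json`, and `numFound` are standard Solr request and response fields.

```python
import json
from urllib.parse import urlencode


def count_query_url(solr_base, resource_map_pid):
    """Build a Solr select URL that returns only the hit count (rows=0)."""
    # Quote the pid so characters such as ':' are not parsed as query syntax.
    escaped = resource_map_pid.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'resourceMap:"{escaped}"', "rows": 0, "wt": "json"}
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


def member_count(response_text):
    """Read numFound from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


def safe_to_deserialize(count, max_objects=10000):
    """Hypothetical guard: the threshold would come from memory testing."""
    return count <= max_objects


if __name__ == "__main__":
    # solr_base is a placeholder, not a real DataONE endpoint.
    print(count_query_url("https://cn.example.org/solr", "resourceMap_example.1"))
```

The response body would be fetched with any HTTP client and passed to `member_count`; the fetch is left out so the sketch runs without network access.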
- 20130513
- Experimenting with nested, large resource maps
- Generation is quicker with certain techniques
- Looking at the memory footprint - using VisualVM
- DataPackage holds a resource map - but isn't used. Refactoring that out.
- Will document as well
Skye
----
- 20130524
- Working on EML support in indexing for taxonomic coverage
- ONEDrive: 8 taxonomic ranks? Scientific name
- Should use the genus and species binomial for the 'scientific name' field
- Adding in entity-attribute information from EML
- eml-attribute description field
- parameterDescription
- parameterUnits
- Will be out next week
- Moving into Solr 4 installation work
- Will discuss indexing with Chris today
- 20130522
- Rolled out replication auditor project
- new hudson build
- integrated into process daemon
- has a standalone tool
- Worked with Dave on common solr code
- Classes will be added to d1_cn_common
- Will be using Solr 4 Cloud service
- cloud config provides multimaster replication
- Be sure these Solr VMs are added to the operational procedures, layout is sane in datacenters, etc.
- Change build tool to use database access layer
- 20130517
- MN forum - generated some questions re: libclient
- Mike F. using Matlab and needs some Java/JRE assistance
- Proposal for exception logging index and moving to Solr 4.x
- Using Jetty may increase performance
- Auditing is now split out into a single project
- 20130513
- Upgrading production ORC and UNM machines
- some trouble in the Solr index, ended up re-building the indices
- cleaned out cn-specific names (like cn-ucsb-1)
- Solr is read-dominated, updates are expensive
- Rebuild finished Sunday - pushing through some lingering OREs
- Some trouble with ORC - hzIdentifiers is short (200K, vs 331K)
- Will be entering tasks on the exception logging
- Would want to upgrade to Solr 4.x - big changes - new required field
- 20130515
- Proposed idea of a common location to place dataone CN utilities
- Common place for finding indexing, auditing tools, utilities
- /usr/share/dataone/(bin/tools) seem to be popular in su.
- in /usr/share "application-specific, architecture-independent directories be placed here" (Filesystem Hierarchy Standard). So as long as the scripts are in perl/java/shell/python then it seems like it would be the place.
- Splitting replication auditing into stand alone project.
- Working on stand alone replication auditing tool.
- Writing up proposal for new event, exception logging index service
- solr 4 with cloud replication on jetty server...
- separate JVM, no additional load on the tc
- no additional load on metacat, indexes, cn services
- no need for immediate migration of existing solr indexes to begin creating solr 4 install/config etc.
Dave
----
- 20130524
- Working with eBird, helping them set up to publish the annual dataset
- OS issues - rebuilding desktop machine
- Opportunity for working through builds of all products with empty machine
- considering adding support for "query" method in CLI
- 20130522
- Will prune Metacat builds on Hudson for 7 days
- Will delete the Metacat build, Metacat_unstable will save 10 builds
- Catching up on the backlog
- working on dependency diagrams
- streamlining releases
- could use a redmine URL for release notes
- 20130522
- Working through backlog of issues
- Completing product dependencies and setting up release pages / process
- 20130520
- LT meeting
- Increasing MNs and content
- Improving discoverability and usability
- Catchup after travel, grant reporting, etc
- 20130517
- LT meeting all week on strategies for next iteration of DataONE project
- General consensus that we need to seriously consider the "slender node" concept, or a Tier0 API for MNs. Approach would be to leverage existing services or capabilities (e.g. site maps, OAI-PMH, WCS, ...) available at a repository, and have a DataONE adapter (perhaps run by DataONE) that performs translation sufficient for a repository to appear with Tier 1 capability.
- Also Discovery is the other hot topic that is seen as essential for DataONE's future.
- Met with UNM IT about issues with the storage being offered up. Basically we have a large amount of space available (currently 16TB, but growable to about 1PB), but due to implementation and operational issues it is not really very useful in a production environment, except perhaps for storing backups or as scratch space.
Matt
----
- 20130520
- At the LT meeting last week, working on the new proposal
- Confirmation that size is approximately 1/2
- Dev of Core CI was key, hopeful that CI size will remain
- CE working group - education modules, training, etc.
Ben
---
- 20130524
- Troubleshooting some minor logging issues with CILogon
- slightly holding up the release
- 20130522
- Finished CILogon upgrade
- deployed on cn-dev
- servers need to be pre-registered. So, Ben registered all of the CN envs (in dataone-cn-portal)
- minor change in d1_solr_extensions
- for production deployments:
- will need to upgrade dataone-cn-portal and dataone-cn-solr
- 20130520
- Upgraded the CILogon code to use their new API
- configuration is no longer RDF/XML - now plain XML
- 20130517
- PPBIO node will be coming online (Brazilian node)
Chris
-----
- 20130522
- TODO: email Bob re: DSpace
- Deprecated mn-orc-2.dataone.org, mn-unm-2.dataone.org from the replica target MNs
- some troubleshooting of LDAP syncrepl - updates not making it to cn-unm-1.dataone.org
- ended up restarting slapd on unm
- Ben - LTER upgrade and invalid pid removal with Mark S.?
- Working on hz updater code for ORNLDAAC pids
- testing in the dev env - can't use stage because ORNLDAAC stage MN == prod MN
- using pids on mn-demo-5
- Metacat-specific work
- Put some thought into the Settings/Workspace API - need to discuss this
- Minor coordination on Dryad MN spinup with Rob
- Are we ready to register in stage?
- Are we planning for a demo by Friday?
- Will be getting back to hzIdentifiers issue, but restarted d1-processing on cn-orc-1
- 20130520
- Ansible walk through with Chris B.
- discussion with Dave and Roger re: ONEDrive and a Workspace API
- Investigating the ORC hzIdentifiers set offset after our CN upgrade last week
- 20130517
- Have been out this week - catching up on emails
- 20130513
- Metacat-specific work
- Working with Skye on upgrading production CNs
- Working on ORNL system metadata updates
- Sensor best practices workshop writing
ONEDrive Discussion
-------------------
- Workspace Type
- XML document that supports
- python library can serialize and deserialize this type - likely will be rolled into libclient
- Populated by logging in, and calling an API to get the workspace instance for the user
- Initial implementation:
- We'll use a wildcard query
- Will implement the mockup view that matt put together (by year, by author, etc). These will be the first pass filters
User Settings API notes
- API name: This could be the Workspace API, or Settings API, or other
- Association with CNs/MNs
- This could be a stand alone service and not tied to a CN or MN per se
- Potential REST endpoints and API calls
- The settings or workspace could be seen as a series of parameters, one being a per-subject 'collection' or 'folder' of 1) static pid lists, and 2) saved search queries
- e.g.
- Settings.listCollections() OR Workspace.listFolders()
- /collections/{subject} OR
- /folders/{subject}:
- returns the settings collections for a given user. Should be able to pass in a collection name parameter to limit the collections returned. The return format would be either XML (see the schema Roger worked on), or a standardized JSON object of the same information. Look at Badgerfish for transforming XML to JSON consistently.
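A simplified sketch of what a Badgerfish-style XML to JSON transform could look like for a collections response. It covers the core convention only (attributes become `@name`, element text goes in `$`, repeated child elements become arrays) and ignores namespaces and mixed content. The sample XML is invented for illustration and is not the schema Roger worked on.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response document, not the real workspace schema.
SAMPLE = (
    '<collections subject="CN=test">'
    '<collection name="birds"><pid>doc.1</pid><pid>doc.2</pid></collection>'
    "</collections>"
)


def badgerfish(elem):
    """Convert an Element to a BadgerFish-style dict (namespaces ignored)."""
    node = {}
    for name, value in elem.attrib.items():
        node["@" + name] = value
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    for child in elem:
        converted = badgerfish(child)
        if child.tag in node:
            existing = node[child.tag]
            if isinstance(existing, list):
                existing.append(converted)
            else:
                # Second occurrence of this tag: promote to an array.
                node[child.tag] = [existing, converted]
        else:
            node[child.tag] = converted
    return node


def xml_to_json(xml_text):
    """Serialize the whole document, keyed by the root element name."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: badgerfish(root)}, sort_keys=True)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE))
```

Because the mapping is mechanical, the XML and JSON forms carry the same information, which is the consistency the note is after. A single child stays an object while repeated children become an array, so clients need to handle both shapes.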
- How do we model taxonomic hierarchies?
- We have about 8 solr fields
- No hierarchies are wrong/right
- Just follow an example such as
- EOL - may have a REST service to view taxonomy
- GBIF
- ITIS - has REST API
- OR, browse by rank <-- good first stab
- kingdom
- phylum
- class
- order
- family
- genus
- species
- etc
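A small sketch of the "browse by rank" first stab: no hierarchy is imposed, each rank is a top-level folder holding the distinct values seen for that rank, and each value lists the matching objects. The record keys (the rank names and `pid`) are assumptions; the actual Solr field names are not given in these notes.

```python
from collections import defaultdict

# Ranks as listed above; the real index may carry more (the notes say about 8 fields).
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def browse_by_rank(records):
    """Return {rank: {taxon value: [pids]}} for every rank that has values."""
    index = {rank: defaultdict(list) for rank in RANKS}
    for record in records:
        for rank in RANKS:
            value = record.get(rank)
            if value:
                index[rank][value].append(record["pid"])
    # Drop ranks with no values and convert to plain dicts for easy display.
    return {rank: dict(values) for rank, values in index.items() if values}


if __name__ == "__main__":
    sample = [
        {"pid": "doc.1", "kingdom": "Animalia", "genus": "Puma"},
        {"pid": "doc.2", "kingdom": "Animalia", "genus": "Lynx"},
    ]
    for rank, values in browse_by_rank(sample).items():
        print(rank, values)
```

Records with gaps in their taxonomy still show up under whichever ranks they do carry, which avoids having to decide whose hierarchy is right.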
Talk with Skye on Indexing Plans
--------------------------------
topics:
1. (P1) Foresite for parsing resource maps - redmine 3723
- Chris B has this on his schedule after Ansible
2. Generalized pattern for index processing
- pluggable pre-processing
- SOLRField class supports this
- Issue is picking up the new code in a running process - simplest is to simply add and restart.
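A minimal sketch of the pluggable pre-processing pattern, in Python for illustration; it is not the SOLRField API. Steps are registered per field and applied in registration order before a document is indexed. The field name `scientificName` and the two example steps are assumptions.

```python
# Registry of pre-processing steps, keyed by index field name.
PREPROCESSORS = {}


def preprocessor(field_name):
    """Decorator that registers a step to run on one field before indexing."""
    def register(func):
        PREPROCESSORS.setdefault(field_name, []).append(func)
        return func
    return register


@preprocessor("scientificName")
def strip_whitespace(value):
    return value.strip()


@preprocessor("scientificName")
def genus_species_only(value):
    # Keep only the first two words, i.e. the genus and species binomial.
    return " ".join(value.split()[:2])


def apply_preprocessors(doc):
    """Return a copy of doc with each field passed through its registered steps."""
    processed = {}
    for field, value in doc.items():
        for step in PREPROCESSORS.get(field, []):
            value = step(value)
        processed[field] = value
    return processed


if __name__ == "__main__":
    print(apply_preprocessors({"scientificName": "  Puma concolor couguar "}))
```

In this sketch the registry is filled at import time, so new steps are only seen after a restart, matching the "simply add and restart" option noted above.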
2. a. (P2) Performance of indexing - redmine 3766
- multiple instances of SOLR can work with the same index. e.g. search instance handling public requests, then start up one or more jetty instances that can write to the index.
- Replace the hazelcast iterator with a postgres query.
- Improve the commit strategy refresh/rebuild
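One possible reading of "improve the commit strategy", sketched under the assumption that the cost comes from committing too often: add documents one at a time but commit once per batch, with a final commit for any remainder. The `add` and `commit` callables and the default batch size are placeholders, not the indexer's real interface.

```python
def index_in_batches(docs, add, commit, batch_size=500):
    """Call add() per document and commit() once per batch; return the commit count."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending = 0
    commits = 0
    for doc in docs:
        add(doc)
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        # Flush whatever is left after the last full batch.
        commit()
        commits += 1
    return commits


if __name__ == "__main__":
    added = []
    total = index_in_batches(range(5), added.append, lambda: None, batch_size=2)
    print(len(added), "documents,", total, "commits")
```

A rebuild could use a large batch size while incremental refresh uses a small one, so that new content becomes searchable quickly without paying a commit per document.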
3. (Depends somewhat on #4) Upgrade to SOLR 4 and related upgrades - subtask of redmine 3764
- Cloud design - write to a single virtual instance that is manifest across the CNs.
- would significantly reduce the hazelcast traffic since sysmeta would only be pulled by one CN
- Would require rebuilding index, perhaps some schema redesign, some new required fields (e.g. version)
- Provides a new upgrade path for solr for dataONE. Debian packaging. (perhaps start from https://github.com/zeraholladay/solr4-tomcat-debian )
4. (P3) Refactoring the solr index schema, especially for dealing with packages - redmine 3726
- Need to denormalize data as much as possible, but the tradeoff is flexibility.
- Add support for data set variables (e.g. column name and units)
- http://siren.sindice.com/documentation.html (seems a bit dated)
- http://jena.apache.org/documentation/larq/
5. OpenSearch (parallel activity - lots of design required) - redmine 3608
- Issues with state management - query, page, etc.
- Scalability of view generation using SOLR as a velocity template engine
- Example templates were fairly broken
- Seems best to have a separate java app to implement the view rendering and perhaps handle some/most of the state information