Notes for Search API Alterations - 2012-07-25
=============================================

Currently, the search API consists of a single method:

CNRead.search(session, queryType, query) → ObjectList

* queryType (string) – Indicates which search engine will be used to handle the query. Currently supported search engines include: “SOLR”. Transmitted as part of the URL path and must be escaped accordingly.

* query (string) – The remainder of the URL is passed verbatim to the respective search engine implementation. Hence it may contain additional path elements and query elements as determined by the functionality of the search engine. The caller is responsible for providing a ‘?’ to indicate the start of the query string portion of the URL, as well as proper URL escaping. Transmitted as part of the URL path and must be escaped accordingly.

Limitations:

1. No way to discover the fields that can be used when constructing queries.

2. Inflexible response structure (no returnfields).

3. No field introspection (e.g. what are the terms and their frequency in field_x?).

   Is this a true limitation? Don't the examples below show that we can specify faceted searches through the query string? This limitation is actually the same as #2: when a faceted search is provided, the API does not support returning the facet info. From my usage with Mercury, it does appear that facet values/counts have the access control portion of the query filter applied. We don't get facet counts larger than the result size, which we would see if the facets ignored access control (right now just isPublic:true).

4. Search is implemented on CNs only, meaning that clients interacting with a MN through the DataONE APIs must be able to connect with the CNs to do anything involving discovery.

5. No support for more complex queries, such as spatial queries on polygons.

6. Datapackage-based search: would like to search for keywords (sci meta), but find ORE document matches.

7. Performance: certain user interfaces may want to exclude fields from the ObjectInfo response to reduce the time taken to serialize the response (encoded in XML, JSON, etc.), but required fields prevent this. For instance, checksum may not be desired, but it will still be serialized across the wire, affecting performance in AJAX-type UIs. This is largely an ObjectInfo schema cardinality issue.

8. No way to discover the queryTypes supported at a node.

Possible Solutions:

1. Add an API method that returns the list of fields, their type, and number of unique values. This is already available under SOLR through the Luke interface ( http://wiki.apache.org/solr/LukeRequestHandler ) and so would be fairly easy to implement. The response from Luke should be cached / updated by the indexer process. Perhaps:

   CNRead.listSearchFields() -> SearchFieldList

2. a) Add ability to include additional elements in the ObjectList response:

      CNRead.search(session, queryType, query, fields) -> ExtendedObjectList

   b) Enable access to raw SOLR response (including response writers other than XML, e.g. JSON)

   c) Add in another standard protocol (SOLR is something of a de facto standard, though is only implemented by SOLR)

   d) ?

3. Add API methods to support introspection on a field. This is very easy to do for public content, but may be complicated for access controlled content. SOLR supports this functionality out of the box through faceted query ( http://wiki.apache.org/solr/SimpleFacetParameters ), so for example one can obtain a list of keywords, and their frequency of occurrence, that appear in the result set obtained by applying some query. Access control can be achieved by adding an access control restriction to the query (with consideration of the SOLR equivalent of SQL injection). The question is whether facet values from content that the user does not have read access to will appear in the response. The response structure from a faceted select is quite different from a regular search response.
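The access control restriction and the injection concern raised in solution 3 could be handled along the following lines. This is only a minimal sketch, not the CN implementation: the helper names `escape_term` and `facet_query_url` are hypothetical, the restriction is assumed to be the `isPublic:true` filter mentioned above passed as a SOLR `fq` filter query, and the user text is sent as a quoted phrase.

```python
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/SOLR query syntax.
_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(text: str) -> str:
    """Backslash-escape query syntax characters in user-supplied text."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in text)


def facet_query_url(base_url, field_name, user_text, facet_field,
                    limit=10, rows=0):
    """Build a faceted select URL with an access control filter attached.

    Hypothetical helper: the filter is hard-coded to isPublic:true, which
    is the only restriction the notes mention.
    """
    params = [
        # User text goes in as an escaped phrase so it cannot alter the query.
        ("q", f'{field_name}:"{escape_term(user_text)}"'),
        # Applied server side to both the result set and the facet counts.
        ("fq", "isPublic:true"),
        ("rows", rows),
        ("facet", "true"),
        ("facet.limit", limit),
        ("facet.field", facet_field),
    ]
    return base_url + "?" + urlencode(params)


url = facet_query_url(
    "http://localhost:8080/solr/d1-cn-index/select",
    "abstract", "soil carbon", "author")
print(url)
```

Because the filter is added by the service rather than the caller, a client cannot drop it by editing the query string, which is the property the access control discussion above depends on.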
Facet Query Examples:

In the responses shown below only the text values of the XML output remain.

eg 1. Show author facet for all content::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=0&q=*:*&facet=true&facet.limit=10&facet.field=author" | xml fo

    0 5 true 10 0 *:* author
    7664 3480 3181 1616 513 384 354 280 253 253

eg 2. Show author facet for content with abstract containing "soil carbon"::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=0&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author" | xml fo

    0 3 true 10 0 abstract:soil carbon author
    42 41 26 22 17 14 14 12 10 8

eg 3. Same as for #2 but include one record in the response output::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author" | xml fo

    0 2 true 10 1 abstract:soil carbon author
    The BOREAS TGB-12 team made measurements of soil carbon inventories, carbon concentration in soil
    gases, and rates of soil respiration at several sites to estimate the rates of carbon accumulation
    and turnover in each of the major vegetation types. This data set contains information on the
    carbon isotopic content of carbon dioxide sampled from soils.
    TRUMBORE, S.E. urn:node:ORNLDAAC 1993-11-14T00:00:00Z e2a368f6e6a0434dc337bc3c6740b3ba MD5
    The Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC)
    https://cn-ucsb-1.dataone.org/cn/v1/resolve/scimeta_399.xml urn:node:ORNLDAAC
    2012-07-13T20:58:09.05Z 2012-07-12T20:58:07Z BOREAS_TGB12CI_399.zip -98.29 1996-10-10T00:00:00Z
    https://cn-ucsb-1.dataone.org/cn/v1/resolve/scimeta_399.xml FGDC-STD-001.1-1999 metadata
    scimeta_399.xml scimeta_399.xml TRUMBORE, S.E. true 55.93 TRUMBORE, S.E. BOREAL FORESTS/CANADA
    metadata public urn:node:ORNLDAAC urn:node:CN 2012-07-13T00:00:00Z 2012-07-13T00:00:00Z false
    resourceMap_399.xml CN=ornldaac,DC=cilogon,DC=org 8065 scimeta_399.xml 55.88
    CN=ornldaac,DC=cilogon,DC=org TRUMBORE, S.E.
    BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399 http://daac.ornl.gov/mercury_harvest/399.xml
    http://daac.ornl.gov/mercury_harvest/399.xml http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html
    metadata The BOREAS TGB-12 team made measurements of soil carbon inventories, carbon concentration
    in soil gases, and rates of soil respiration at several sites to estimate the rates of carbon
    accumulation and turnover in each of the major vegetation types. This data set contains
    information on the carbon isotopic content of carbon dioxide sampled from soils.
    19931114 19961010 Complete As appropriate -98.62 -98.29 55.93 55.88
    Parameter_Sensor_Source SOIL GAS/AIR|MASS SPECTROMETER|LABORATORY Parameter SOIL GAS/AIR
    Source LABORATORY Sensor MASS SPECTROMETER Place Keywords BOREAL FORESTS/CANADA TRUMBORE, S.E.
    6.1.10.8 Contact Electronic Mail: INTERNET > The Oak Ridge National Laboratory (ORNL) Distributed
    Active Archive Center (DAAC) ORNL DAAC User Services Office Oak Ridge National Laboratory
    Oak Ridge, Tennessee 37831 USA FAX: +1(865)574-4665 +1(865)241-3952
    6.1.10.8 Contact Electronic Mail: INTERNET > ornldaac@ornl.gov http://daac.ornl.gov/ PUBLIC
    Trumbore, S. E., E. T. Sundquist, and G. C. Winston. 1998. BOREAS TGB-12 Carbon Dioxide Isotopic
    Content Data over the NSA. Data set. Available on-line [http://www.daac.ornl.gov] from Oak Ridge
    National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.
    doi:10.3334/ORNLDAAC/399 19990130 ORNL DAAC Staff ORNL DAAC Staff +1(865)241-3952
    ornldaac@ornl.gov FGDC Content Standard for Digital Geospatial Metadata
    Created 2012 05 22 21 30 13 1331736613 by 160.91.11.44 19931114 19961010 BOREAL FORESTS/CANADA
    -98.62 -98.29 55.93 55.88 CO2 14C FLUX TRACE GAS CARBON CYCLE CARBON DIOXIDE CARBON ISOTOPES
    SOIL GAS/AIR MASS SPECTROMETER LABORATORY SOILS LAND SURFACE BOREAS
    LABORATORY|MASS SPECTROMETER|SOIL GAS/AIR TRUMBORE, S.E.
    http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399
    BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html
    BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    /home/web/mercury/write_mercury_xml.pl mercury21.dtd ornldaac@ornl.gov
    ORNL DAAC User Services Office Oak Ridge National Laboratory Oak Ridge, Tennessee 37831 USA
    FAX: +1(865)574-4665 +1(865)241-3952 399 BOREAS_TGB12CI 0.016499999999999
    http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html
    BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA scimeta_399.xml
    BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA 2012-07-12T20:58:07Z
    http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399 http://daac.ornl.gov/mercury_harvest/399.xml
    http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html -98.62
    42 41 26 22 17 14 14 12 10 8

eg 4. As for #3, but with a limited selection of fields (author, title, beginDate)::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author&fl=author,title,beginDate" | xml fo

    0 1 true 10 1 author,title,beginDate abstract:soil carbon author
    TRUMBORE, S.E. 1993-11-14T00:00:00Z BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    42 41 26 22 17 14 14 12 10 8

eg 5. Find documents that match "Soil Organic Carbon" and are referenced by a resource map. Return the resource map in the response::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil+carbon+AND+resourceMap%3A%5B*+TO+*%5D&facet=true&facet.limit=10&facet.field=author&fl=author,title,beginDate,resourceMap" | xml fo

    0 36 true 10 1 author,title,beginDate,resourceMap abstract:soil carbon AND resourceMap:[* TO *] author
    TRUMBORE, S.E. 1993-11-14T00:00:00Z resourceMap_399.xml BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    5 4 3 3 3 3 3 3 3 3

eg 6. As for #5, but without faceting and with up to ten records in the response::

    curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=10&q=abstract%3Asoil+carbon+AND+resourceMap%3A%5B*+TO+*%5D&fl=author,title,beginDate,resourceMap" | xml fo

    0 1 10 author,title,beginDate,resourceMap abstract:soil carbon AND resourceMap:[* TO *]
    TRUMBORE, S.E. 1993-11-14T00:00:00Z resourceMap_399.xml BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA
    ANDERSON, DARWIN 1993-01-01T00:00:00Z resourceMap_530.xml BOREAS TE-01 SSA SOIL LAB DATA
    EMANUEL, W.R. 1940-01-01T00:00:00Z resourceMap_638.xml SAFARI 2000 ORGANIC SOIL CARBON AND NITROGEN DATA (ZINKE ET AL.)
    DAVIDSON, E.A. 1937-01-01T00:00:00Z resourceMap_517.xml BOREAS TGB-12 SOIL CARBON AND FLUX DATA OF NSA-MSA IN RASTER FORMAT
    BATJES, N.H. 1950-01-01T00:00:00Z resourceMap_634.xml SAFARI 2000 DERIVED SOIL PROPERTIES, 0.5-DEG (ISRIC-WISE)
    NORMAN, J.M. 1989-07-24T00:00:00Z resourceMap_105.xml SOIL CO2 FLUX DATA (FIFE)
    HARDEN, J.W. 1993-08-01T00:00:00Z resourceMap_402.xml BOREAS TGB-12 SOIL CARBON DATA: NSA
    ALENCAR, A.C. 1973-01-01T00:00:00Z resourceMap_941.xml PRE-LBA RADAMBRASIL PROJECT DATA
    PEREZ, T 2000-07-08T00:00:00Z resourceMap_1013.xml LBA-ECO TG-09 SOIL ISOTOPIC C, N, H2O, AND N2O DATA, TAPAJOS NATIONAL FOREST, BRAZIL
    EHLERINGER, J.R. 1994-05-26T00:00:00Z resourceMap_325.xml BOREAS TE-05 SOIL RESPIRATION DATA

4. Restricting search to the CNs limits the ITK to online use (i.e. connected to the internet) and makes it reliant on rapid indexing. E.g. in Morpho, metadata that a user has just added does not show up in search until a few minutes afterwards.

Possibly Relevant / Interesting Libraries, Standards, Protocols
---------------------------------------------------------------

SOLR: Indexer service built on Lucene. Fast, scalable, widely used. http://lucene.apache.org/solr/

OpenSearch - A rather loose specification for describing search interfaces.
Implementations generally return results in Atom.
http://www.opensearch.org

SRW/SRU: A protocol promoted by the Library of Congress; uses CQL (Contextual Query Language). http://www.loc.gov/standards/sru/

SIREn: Efficient semi-structured Information Retrieval for Lucene. http://siren.sindice.com/index.html

SPARQL: http://www.w3.org/standards/semanticweb/query Requires decomposition of content into triples (actually or virtually).

Other Useful Projects
---------------------

Apache Tika: Content and metadata extraction. http://tika.apache.org/

Notes from 20120925
-------------------

1) Defining the REST endpoint for the new search ("query")::

    GET /query/                       --> List of queryEngine   CNRead.listQueryEngines()
    GET /query/{queryEngine}          --> Types.QueryEngineDescription
    GET /query/{queryEngine}/{query}  --> Types.OctetStream
    GET /search                       --> 404

2) Defining the return type from CNRead.getQueryEngineDescription (formerly drafted as list search fields)

   * CNRead.listSearchFields() -> SearchFieldList
   * CNRead.getQueryEngineDescription() --> QueryEngineDescription
   * Similarities between the search field list responses of different search engines might be consolidated
   * Versioning of responses from a single engine is an issue

   /search/{queryEngine} e.g. /search/solr

   QueryEngineDescription

   * queryEngineVersion (string): version of the search engine implementation (e.g. "SOLR-3.6.1")
   * querySchemaVersion (string): version of the schema being used for defining the search capabilities (e.g. "1.0.1")
   * name (string): human readable, e.g. "solr"
   * additionalInfo (URL, repeatable): e.g. "http://wiki.apache.org/solr/SolrQuerySyntax". Link to a description of the search engine and query syntax
   * queryFieldList (optional, QueryFieldList):

     * fieldName (string)
     * fieldDescription (string)
     * fieldType (string)
     * returnable (boolean)
     * searchable (boolean)
     * sortable (boolean)
     * multivalued (boolean)

   dataONEtypes.xsd will become version 1.1; the namespace will remain /v1.

3) Versioning the search schema for DataONE

   * ONEMercury is using a property file stating the schema version
   * Will the search schema always be backwards compatible? Probably not - hard to maintain, although the context of a field may change whereas the name of the field may not