Notes for Search API Alterations - 2012-07-25
=============================================

Currently, the search API consists of a single method:

  CNRead.search(session, queryType, query) → ObjectList

Limitations:

1. No way to discover the fields that can be used when constructing queries
2. Inflexible response structure (no returnfields)
3. No field introspection (e.g. what are the terms and their frequency in field_x?)
Is this a true limitation? The examples below show that faceted searches can be specified through the query string. This limitation is really the same as #2: when a faceted search is requested, the API does not support returning the facet information.

From my usage with Mercury, facet values and counts do appear to have the access control portion of the query filter applied. We don't get facet counts larger than the result size, which we would see if the facets ignored access control (right now just isPublic:true).

4. Search is implemented on CNs only, meaning that clients interacting with an MN through the DataONE APIs must be able to connect to the CNs to do anything involving discovery
5. No support for more complex queries, such as spatial search on a polygon
6. No data package-based search: we would like to search on keywords (from science metadata) but get the matching ORE documents (resource maps) back.
7. Performance: some user interfaces may want to exclude fields from the ObjectInfo response to reduce the size and transfer time of the serialized response (XML, JSON, etc.), but required fields prevent this. For instance, checksum may not be wanted, yet it is still serialized across the wire, which affects performance in AJAX-type UIs. This is largely an ObjectInfo schema cardinality issue.
8. No way to discover the queryTypes supported at a node.


Possible Solutions:

1. Add an API method that returns the list of fields, their type, and number of unique values. This is already available under SOLR through the Luke interface ( http://wiki.apache.org/solr/LukeRequestHandler ) and so would be fairly easy to implement. The response from Luke should be cached / updated by the indexer process.

Perhaps:
  
  CNRead.listSearchFields() -> SearchFieldList
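
As a rough illustration of how a cached Luke response could be reduced to the proposed field list, here is a minimal Python sketch. It assumes the JSON form of the Luke response, with a top-level "fields" mapping whose entries carry "type" and, when term statistics are available, "distinct". The SearchField class and search_fields_from_luke function are hypothetical names, not part of any DataONE API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchField:
    """Hypothetical entry of the proposed SearchFieldList."""
    name: str
    field_type: str
    distinct_values: Optional[int]  # None when no count was reported


def search_fields_from_luke(luke: dict) -> list:
    """Reduce a parsed Luke JSON response to a name-sorted list of SearchField."""
    fields = []
    for name, info in luke.get("fields", {}).items():
        fields.append(
            SearchField(
                name=name,
                field_type=info.get("type", "unknown"),
                distinct_values=info.get("distinct"),
            )
        )
    return sorted(fields, key=lambda f: f.name)


# Hand-written sample in the assumed shape; not real index output.
sample = {
    "fields": {
        "title": {"type": "text", "distinct": 1200},
        "author": {"type": "string", "distinct": 350},
        "beginDate": {"type": "date"},
    }
}

for entry in search_fields_from_luke(sample):
    print(entry.name, entry.field_type, entry.distinct_values)
```

Because the function works on an already parsed response, the indexer process could call it whenever it refreshes its cached copy, independently of how that copy was fetched.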


2. a) Add ability to include additional elements in the ObjectList response

     CNRead.search(session, queryType, query, fields) -> ExtendedObjectList

   b) Enable access to raw SOLR response (including response writers other than XML, e.g. JSON)
   c) Add in another standard protocol (SOLR is something of a de facto standard, though it is implemented only by SOLR)
   d) ?


3. Add API methods to support introspection on a field. This is very easy to do for public content, but may be complicated for access controlled content. SOLR supports this functionality out of the box through faceted query ( http://wiki.apache.org/solr/SimpleFacetParameters ), so for example one can obtain a list of keywords and their frequency of occurrence that appear in the result set obtained by applying some query.

Access control can be achieved by adding an access control restriction to the query (with consideration of the SOLR equivalent of SQL injection). The open question is whether facet values from content that the user cannot read will appear in the response.
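
One way to approach the injection concern is to escape user-supplied terms and to add the access restriction as a separate filter built by the server. The Python sketch below is only an illustration under assumptions: the function names are hypothetical, the filter relies on the multivalued readPermission field seen in the example record in these notes, and operator words such as AND, OR and NOT are not handled by escape_term.

```python
import re

# Characters with special meaning in the Lucene/SOLR query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_term(value: str) -> str:
    """Backslash-escape query syntax characters in a user-supplied bare term."""
    return _SPECIAL.sub(r"\\\1", value)


def quote_phrase(value: str) -> str:
    """Wrap a value in double quotes, escaping embedded backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def access_filter(subjects: list) -> str:
    """Build a filter limiting results to records readable by the given subjects.

    Assumption: "public" is always allowed, and readPermission holds the
    subjects that may read a record.
    """
    allowed = ["public"] + list(subjects)
    return "readPermission:(%s)" % " OR ".join(quote_phrase(s) for s in allowed)


print(escape_term("title:* OR isPublic:false"))
print(access_filter(["CN=ornldaac,DC=cilogon,DC=org"]))
```

The resulting string could be passed as an fq parameter alongside the user's q, so the restriction is never concatenated into text the user controls.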

The response structure from a faceted select is quite different to a regular search response.

Facet Query Examples:

eg 1. Show author facet for all content:

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=0&q=*:*&facet=true&facet.limit=10&facet.field=author" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="facet.limit">10</str>
      <str name="rows">0</str>
      <str name="q">*:*</str>
      <str name="facet.field">author</str>
    </lst>
  </lst>
  <result name="response" numFound="74163" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author">
        <int name="Bruce Menge">7664</int>
        <int name="Margaret McManus">3480</int>
        <int name="Libe Washburn">3181</int>
        <int name="Jack Barth">1616</int>
        <int name="Mary Sue Brancato">513</int>
        <int name="Anne Giblin">384</int>
        <int name="Pete Raimondi">354</int>
        <int name="John Vande Castle">280</int>
        <int name="Alexander Buyantuyev">253</int>
        <int name="UVM Spatial Analysis Lab, Grove, J.M, O'Neil-Dunne, J.">253</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
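
Since the facet section differs from a regular result list, a client needs a small amount of extra parsing to use it. The following Python sketch pulls the (value, count) pairs for one facet field out of a response shaped like the one above. The function name facet_counts is hypothetical and the sample is an abbreviated copy of that response.

```python
import xml.etree.ElementTree as ET


def facet_counts(response_xml: str, field: str) -> list:
    """Return (value, count) pairs for one facet field, in document order."""
    root = ET.fromstring(response_xml)
    path = (
        "./lst[@name='facet_counts']"
        "/lst[@name='facet_fields']"
        "/lst[@name='%s']/int" % field
    )
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]


# Abbreviated copy of the response shown above.
sample = """<response>
  <result name="response" numFound="74163" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="author">
        <int name="Bruce Menge">7664</int>
        <int name="Margaret McManus">3480</int>
      </lst>
    </lst>
  </lst>
</response>"""

for value, count in facet_counts(sample, "author"):
    print(count, value)
```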



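eg 2. Show author facet for content with abstract containing "soil carbon". Note that with the standard Lucene query parser, q=abstract:soil carbon applies the abstract: prefix to "soil" only, and "carbon" is matched against the default search field; abstract:(soil carbon) or abstract:"soil carbon" would restrict both terms to the abstract.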

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=0&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="facet.limit">10</str>
      <str name="rows">0</str>
      <str name="q">abstract:soil carbon</str>
      <str name="facet.field">author</str>
    </lst>
  </lst>
  <result name="response" numFound="701" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author">
        <int name="Peter Groffman ,     Email: groffmanp@ecostudies.org">42</int>
        <int name="David Tilman">41</int>
        <int name="Robert P. Griffiths">26</int>
        <int name="George Kling">22</int>
        <int name="Daniel Childers">17</int>
        <int name="Jill Johnstone">14</int>
        <int name="Roger W. Ruess">14</int>
        <int name="Tim Seastedt">12</int>
        <int name="Isla Myers-Smith">10</int>
        <int name="Eric S. Kasischke">8</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>


eg 3. Same as for #2 but include one record in the response output:

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="facet.limit">10</str>
      <str name="rows">1</str>
      <str name="q">abstract:soil carbon</str>
      <str name="facet.field">author</str>
    </lst>
  </lst>
  <result name="response" numFound="701" start="0">
    <doc>
      <str name="abstract">The BOREAS TGB-12 team made measurements of soil carbon inventories, carbon concentration in soil gases, and rates of soil respiration at several sites to estimate the rates of carbon accumulation and turnover in each of the major vegetation types. This data set contains information on the carbon isotopic content of carbon dioxide sampled from soils.</str>
      <str name="author">TRUMBORE, S.E.</str>
      <str name="authoritativeMN">urn:node:ORNLDAAC</str>
      <date name="beginDate">1993-11-14T00:00:00Z</date>
      <str name="checksum">e2a368f6e6a0434dc337bc3c6740b3ba</str>
      <str name="checksumAlgorithm">MD5</str>
      <arr name="contactOrganization">
        <str>The Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC)</str>
      </arr>
      <str name="dataUrl">https://cn-ucsb-1.dataone.org/cn/v1/resolve/scimeta_399.xml</str>
      <str name="datasource">urn:node:ORNLDAAC</str>
      <date name="dateModified">2012-07-13T20:58:09.05Z</date>
      <date name="dateUploaded">2012-07-12T20:58:07Z</date>
      <arr name="documents">
        <str>BOREAS_TGB12CI_399.zip</str>
      </arr>
      <float name="eastBoundCoord">-98.29</float>
      <date name="endDate">1996-10-10T00:00:00Z</date>
      <str name="fileID">https://cn-ucsb-1.dataone.org/cn/v1/resolve/scimeta_399.xml</str>
      <str name="formatId">FGDC-STD-001.1-1999</str>
      <str name="geoform">metadata</str>
      <str name="id">scimeta_399.xml</str>
      <str name="identifier">scimeta_399.xml</str>
      <arr name="investigator">
        <str>TRUMBORE, S.E.</str>
      </arr>
      <bool name="isPublic">true</bool>
      <float name="northBoundCoord">55.93</float>
      <arr name="origin">
        <str>TRUMBORE, S.E.</str>
      </arr>
      <arr name="placeKey">
        <str>BOREAL FORESTS/CANADA</str>
      </arr>
      <str name="presentationCat">metadata</str>
      <arr name="readPermission">
        <str>public</str>
      </arr>
      <arr name="replicaMN">
        <str>urn:node:ORNLDAAC</str>
        <str>urn:node:CN</str>
      </arr>
      <arr name="replicaVerifiedDate">
        <date>2012-07-13T00:00:00Z</date>
        <date>2012-07-13T00:00:00Z</date>
      </arr>
      <bool name="replicationAllowed">false</bool>
      <arr name="resourceMap">
        <str>resourceMap_399.xml</str>
      </arr>
      <str name="rightsHolder">CN=ornldaac,DC=cilogon,DC=org</str>
      <long name="size">8065</long>
      <str name="sku">scimeta_399.xml</str>
      <float name="southBoundCoord">55.88</float>
      <str name="submitter">CN=ornldaac,DC=cilogon,DC=org</str>
      <str name="text">TRUMBORE, S.E.  BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA  http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399  http://daac.ornl.gov/mercury_harvest/399.xml  http://daac.ornl.gov/mercury_harvest/399.xml  http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html  metadata     The BOREAS TGB-12 team made measurements of soil carbon inventories, carbon concentration in soil gases, and rates of soil respiration at several sites to estimate the rates of carbon accumulation and turnover in each of the major vegetation types. This data set contains information on the carbon isotopic content of carbon dioxide sampled from soils.       19931114  19961010      Complete  As appropriate     -98.62  -98.29  55.93  55.88      Parameter_Sensor_Source  SOIL GAS/AIR|MASS SPECTROMETER|LABORATORY    Parameter  SOIL GAS/AIR    Source  LABORATORY    Sensor  MASS SPECTROMETER    Place Keywords  BOREAL FORESTS/CANADA       TRUMBORE, S.E.   6.1.10.8 Contact Electronic Mail: INTERNET &amp;gt;         The Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC)  ORNL DAAC User Services Office
        Oak Ridge National Laboratory
        Oak Ridge, Tennessee 37831 USA     
        FAX: +1(865)574-4665   +1(865)241-3952  6.1.10.8 Contact Electronic Mail: INTERNET &amp;gt;  ornldaac@ornl.gov        http://daac.ornl.gov/  PUBLIC      Trumbore, S. E., E. T. Sundquist, and G. C. Winston. 1998. BOREAS TGB-12 Carbon Dioxide Isotopic Content Data over the NSA. Data set. Available on-line [http://www.daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/399    19990130     ORNL DAAC Staff    ORNL DAAC Staff   +1(865)241-3952  ornldaac@ornl.gov    FGDC Content Standard for Digital Geospatial Metadata   Created 2012 05 22 21 30 13 1331736613 by 160.91.11.44   19931114  19961010   BOREAL FORESTS/CANADA  -98.62  -98.29  55.93  55.88   CO2  14C  FLUX  TRACE GAS  CARBON CYCLE  CARBON DIOXIDE  CARBON ISOTOPES   SOIL GAS/AIR  MASS SPECTROMETER  LABORATORY  SOILS  LAND SURFACE   BOREAS  LABORATORY|MASS SPECTROMETER|SOIL GAS/AIR   TRUMBORE, S.E.     http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399  BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA     http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html  BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA    /home/web/mercury/write_mercury_xml.pl  mercury21.dtd   ornldaac@ornl.gov  ORNL DAAC User Services Office
        Oak Ridge National Laboratory
        Oak Ridge, Tennessee 37831 USA     
        FAX: +1(865)574-4665  +1(865)241-3952     399  BOREAS_TGB12CI  0.016499999999999   http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html  BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA scimeta_399.xml</str>
      <str name="title">BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA</str>
      <date name="updateDate">2012-07-12T20:58:07Z</date>
      <arr name="webUrl">
        <str>http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=399</str>
        <str>http://daac.ornl.gov/mercury_harvest/399.xml</str>
        <str>http://daac.ornl.gov//BOREAS/guides/TGB12_Iso_CO2.html</str>
      </arr>
      <float name="westBoundCoord">-98.62</float>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author">
        <int name="Peter Groffman ,     Email: groffmanp@ecostudies.org">42</int>
        <int name="David Tilman">41</int>
        <int name="Robert P. Griffiths">26</int>
        <int name="George Kling">22</int>
        <int name="Daniel Childers">17</int>
        <int name="Jill Johnstone">14</int>
        <int name="Roger W. Ruess">14</int>
        <int name="Tim Seastedt">12</int>
        <int name="Isla Myers-Smith">10</int>
        <int name="Eric S. Kasischke">8</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>



eg 4. As for #3, but with a limited selection of fields (Author, Title, beginDate):

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil%20carbon&facet=true&facet.limit=10&facet.field=author&fl=author,title,beginDate" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="facet.limit">10</str>
      <str name="rows">1</str>
      <str name="fl">author,title,beginDate</str>
      <str name="q">abstract:soil carbon</str>
      <str name="facet.field">author</str>
    </lst>
  </lst>
  <result name="response" numFound="701" start="0">
    <doc>
      <str name="author">TRUMBORE, S.E.</str>
      <date name="beginDate">1993-11-14T00:00:00Z</date>
      <str name="title">BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA</str>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author">
        <int name="Peter Groffman ,     Email: groffmanp@ecostudies.org">42</int>
        <int name="David Tilman">41</int>
        <int name="Robert P. Griffiths">26</int>
        <int name="George Kling">22</int>
        <int name="Daniel Childers">17</int>
        <int name="Jill Johnstone">14</int>
        <int name="Roger W. Ruess">14</int>
        <int name="Tim Seastedt">12</int>
        <int name="Isla Myers-Smith">10</int>
        <int name="Eric S. Kasischke">8</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
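
The requests in these examples differ only in a handful of parameters, which suggests how a search(session, queryType, query, fields) call might map onto SOLR. The Python sketch below assembles such a URL. The function name and its defaults are hypothetical, and the endpoint is the local one used by the curl commands in these notes.

```python
from urllib.parse import urlencode

# Local endpoint copied from the curl examples in these notes.
SELECT_URL = "http://localhost:8080/solr/d1-cn-index/select"


def build_select_url(query, fields=None, facet_field=None, rows=10, facet_limit=10):
    """Assemble a SOLR select URL with an optional field list and facet."""
    params = [("q", query), ("rows", rows)]
    if fields:
        # fl restricts the fields returned for each matching record.
        params.append(("fl", ",".join(fields)))
    if facet_field:
        params += [
            ("facet", "true"),
            ("facet.field", facet_field),
            ("facet.limit", facet_limit),
        ]
    return SELECT_URL + "?" + urlencode(params)


print(
    build_select_url(
        "abstract:soil carbon",
        fields=["author", "title", "beginDate"],
        facet_field="author",
        rows=1,
    )
)
```

urlencode handles the percent-encoding, so the caller passes the query text unencoded.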



eg 5. Find documents that match "soil carbon" and are referenced by a resource map. Return the resource map identifier in the response.

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=1&q=abstract%3Asoil+carbon+AND+resourceMap%3A%5B*+TO+*%5D&facet=true&facet.limit=10&facet.field=author&fl=author,title,beginDate,resourceMap" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">36</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="facet.limit">10</str>
      <str name="rows">1</str>
      <str name="fl">author,title,beginDate,resourceMap</str>
      <str name="q">abstract:soil carbon AND resourceMap:[* TO *]</str>
      <str name="facet.field">author</str>
    </lst>
  </lst>
  <result name="response" numFound="233" start="0">
    <doc>
      <str name="author">TRUMBORE, S.E.</str>
      <date name="beginDate">1993-11-14T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_399.xml</str>
      </arr>
      <str name="title">BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA</str>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="author">
        <int name="DAVIDSON, E.A.">5</int>
        <int name="ANDERSON, DARWIN">4</int>
        <int name="Anne Cross">3</int>
        <int name="Arnold van der Valk">3</int>
        <int name="BATJES, N.H.">3</int>
        <int name="BLACK, T.A.">3</int>
        <int name="CRILL, P.M.">3</int>
        <int name="EMANUEL, W.R.">3</int>
        <int name="GOWER, S.T.">3</int>
        <int name="HARDEN, J.W.">3</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>


eg 6. As for #5, but without the facet and returning the first ten records:

curl -s "http://localhost:8080/solr/d1-cn-index/select?rows=10&q=abstract%3Asoil+carbon+AND+resourceMap%3A%5B*+TO+*%5D&fl=author,title,beginDate,resourceMap" | xml fo
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="fl">author,title,beginDate,resourceMap</str>
      <str name="q">abstract:soil carbon AND resourceMap:[* TO *]</str>
    </lst>
  </lst>
  <result name="response" numFound="233" start="0">
    <doc>
      <str name="author">TRUMBORE, S.E.</str>
      <date name="beginDate">1993-11-14T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_399.xml</str>
      </arr>
      <str name="title">BOREAS TGB-12 CARBON DIOXIDE ISOTOPIC CONTENT DATA OVER THE NSA</str>
    </doc>
    <doc>
      <str name="author">ANDERSON, DARWIN</str>
      <date name="beginDate">1993-01-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_530.xml</str>
      </arr>
      <str name="title">BOREAS TE-01 SSA SOIL LAB DATA</str>
    </doc>
    <doc>
      <str name="author">EMANUEL, W.R.</str>
      <date name="beginDate">1940-01-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_638.xml</str>
      </arr>
      <str name="title">SAFARI 2000 ORGANIC SOIL CARBON AND NITROGEN DATA (ZINKE ET AL.)</str>
    </doc>
    <doc>
      <str name="author">DAVIDSON, E.A.</str>
      <date name="beginDate">1937-01-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_517.xml</str>
      </arr>
      <str name="title">BOREAS TGB-12 SOIL CARBON AND FLUX DATA OF NSA-MSA IN RASTER FORMAT</str>
    </doc>
    <doc>
      <str name="author">BATJES, N.H.</str>
      <date name="beginDate">1950-01-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_634.xml</str>
      </arr>
      <str name="title">SAFARI 2000 DERIVED SOIL PROPERTIES, 0.5-DEG (ISRIC-WISE)</str>
    </doc>
    <doc>
      <str name="author">NORMAN, J.M.</str>
      <date name="beginDate">1989-07-24T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_105.xml</str>
      </arr>
      <str name="title">SOIL CO2 FLUX DATA (FIFE)</str>
    </doc>
    <doc>
      <str name="author">HARDEN, J.W.</str>
      <date name="beginDate">1993-08-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_402.xml</str>
      </arr>
      <str name="title">BOREAS TGB-12 SOIL CARBON DATA: NSA</str>
    </doc>
    <doc>
      <str name="author">ALENCAR, A.C.</str>
      <date name="beginDate">1973-01-01T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_941.xml</str>
      </arr>
      <str name="title">PRE-LBA RADAMBRASIL PROJECT DATA</str>
    </doc>
    <doc>
      <str name="author">PEREZ, T</str>
      <date name="beginDate">2000-07-08T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_1013.xml</str>
      </arr>
      <str name="title">LBA-ECO TG-09 SOIL ISOTOPIC C, N, H2O, AND N2O DATA, TAPAJOS NATIONAL FOREST, BRAZIL</str>
    </doc>
    <doc>
      <str name="author">EHLERINGER, J.R.</str>
      <date name="beginDate">1994-05-26T00:00:00Z</date>
      <arr name="resourceMap">
        <str>resourceMap_325.xml</str>
      </arr>
      <str name="title">BOREAS TE-05 SOIL RESPIRATION DATA</str>
    </doc>
  </result>
</response>
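
For data package-based search, a client would typically want the distinct resource maps behind the matching science metadata records. The Python sketch below parses a response shaped like the one above into dictionaries and collects the resource map identifiers. Both function names are hypothetical and the sample is an abbreviated copy of that response.

```python
import xml.etree.ElementTree as ET


def parse_docs(response_xml: str) -> list:
    """Turn each <doc> into a dict; <arr> fields become lists of strings."""
    docs = []
    for doc in ET.fromstring(response_xml).findall("./result/doc"):
        record = {}
        for element in doc:
            if element.tag == "arr":
                record[element.get("name")] = [item.text for item in element]
            else:
                record[element.get("name")] = element.text
        docs.append(record)
    return docs


def resource_maps(docs: list) -> list:
    """Collect unique resource map identifiers, preserving first-seen order."""
    seen = []
    for record in docs:
        for pid in record.get("resourceMap", []):
            if pid not in seen:
                seen.append(pid)
    return seen


# Abbreviated copy of the response shown above.
sample = """<response>
  <result name="response" numFound="233" start="0">
    <doc>
      <str name="author">TRUMBORE, S.E.</str>
      <arr name="resourceMap">
        <str>resourceMap_399.xml</str>
      </arr>
    </doc>
    <doc>
      <str name="author">ANDERSON, DARWIN</str>
      <arr name="resourceMap">
        <str>resourceMap_530.xml</str>
      </arr>
    </doc>
  </result>
</response>"""

print(resource_maps(parse_docs(sample)))
```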





4. Search being available on CNs only limits the ITK to online use (i.e. connected to the internet) and makes clients reliant on rapid indexing. For example, in Morpho, metadata that a user has just added does not show up in search until a few minutes afterwards.





Possibly Relevant / Interesting Libraries, Standards, Protocols
---------------------------------------------------------------

SOLR: Indexer service built on Lucene. Fast, scalable, widely used.
http://lucene.apache.org/solr/


OpenSearch: A rather loose specification for describing search interfaces. Implementations generally return results in Atom.
http://www.opensearch.org


SRW/SRU: A protocol promoted by the Library of Congress; uses CQL (Contextual Query Language)
http://www.loc.gov/standards/sru/


SIREn: Efficient semi-structured Information Retrieval for Lucene
http://siren.sindice.com/index.html


SPARQL: Requires decomposition of content into triples (actually or virtually)
http://www.w3.org/standards/semanticweb/query



Other Useful Projects
---------------------

Apache Tika: Content and metadata extraction
http://tika.apache.org/



Notes from 2012-09-25
---------------------
1) Defining the REST endpoint for the new search ("query")
GET /query/                       --> list of queryEngine (CNRead.listQueryEngines())
GET /query/{queryEngine}          --> Types.QueryEngineDescription
GET /query/{queryEngine}/{query}  --> Types.OctetStream
GET /search                       --> 404
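
To make the three /query forms concrete, here is a small Python sketch that builds the corresponding URLs. The base URL is a placeholder, the function name is hypothetical, and treating {query} as a single percent-encoded path segment is only one possible reading of the notes.

```python
from urllib.parse import quote

# Placeholder base URL; substitute the address of a real Coordinating Node.
BASE_URL = "https://cn.example.org/cn/v1"


def query_url(engine=None, query=None):
    """Map the three GET /query forms listed above onto concrete URLs."""
    if engine is None:
        # List the available query engines.
        return BASE_URL + "/query/"
    url = BASE_URL + "/query/" + quote(engine, safe="")
    if query is None:
        # Describe one query engine.
        return url
    # Assumption: the whole query is one path segment, so "/" and ":" are encoded.
    return url + "/" + quote(query, safe="")


print(query_url())
print(query_url("solr"))
print(query_url("solr", "abstract:soil carbon"))
```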


2) Defining the return type from CNRead.getQueryEngineDescription (formerly drafted as listSearchFields)
/search/{queryEngine}
e.g. /search/solr

QueryEngineDescription
  queryEngineVersion (string): version of the search engine implementation, e.g. "SOLR-3.6.1"
  querySchemaVersion (string): version of the schema used to define the search capabilities, e.g. "1.0.1"
  name (string): human-readable name, e.g. "solr"
  additionalInfo (URL, repeatable): link to a description of the search engine and query syntax, e.g. "http://wiki.apache.org/solr/SolrQuerySyntax"
  queryFieldList (optional, QueryFieldList):
    fieldName (string)
    fieldDescription (string)
    fieldType (string)
    returnable (boolean)
    searchable (boolean)
    sortable (boolean)
    multivalued (boolean)
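
The structure above could be mirrored on the client side roughly as follows. This Python sketch is illustrative only: the class layout follows the element names in these notes, but the QueryField class name, the helper function, and the flag values in the example are assumptions rather than the agreed schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryField:
    """One entry of queryFieldList (class name is hypothetical)."""
    fieldName: str
    fieldDescription: str
    fieldType: str
    returnable: bool
    searchable: bool
    sortable: bool
    multivalued: bool


@dataclass
class QueryEngineDescription:
    queryEngineVersion: str
    querySchemaVersion: str
    name: str
    additionalInfo: List[str] = field(default_factory=list)         # repeatable
    queryFieldList: List[QueryField] = field(default_factory=list)  # optional


def searchable_fields(description: QueryEngineDescription) -> list:
    """Names of the fields a client may use when constructing a query."""
    return [f.fieldName for f in description.queryFieldList if f.searchable]


# Example values taken from the notes; the per-field flags are made up.
example = QueryEngineDescription(
    queryEngineVersion="SOLR-3.6.1",
    querySchemaVersion="1.0.1",
    name="solr",
    additionalInfo=["http://wiki.apache.org/solr/SolrQuerySyntax"],
    queryFieldList=[
        QueryField("author", "Primary author", "string", True, True, True, False),
        QueryField("checksum", "Object checksum", "string", True, False, False, False),
    ],
)

print(searchable_fields(example))
```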

dataONEtypes.xsd will become version 1.1.
The namespace will remain /v1.

3) Versioning the search schema for DataONE