OpenCms 8.5 integrates Apache Solr, not only for full-text search but also as a powerful query engine.
Imagine you want to show a list of "all resources of type news that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common combinations of these criteria, there is no single API call. This means that when you write a collector, you often end up filtering the results of the initial API query in code.
In this session, Rüdiger will show how Apache Solr has been integrated in OpenCms 8.5. He will explain how to create improved front-end full-text search functions with advanced options like faceting and spell check suggestions. He will also explain how to use Solr to read resources directly from the OpenCms VFS, allowing queries that combine resource attributes, properties and content in a powerful new way.
3. Agenda
1. What is Solr?
2. Benefits
3. Searching
4. Indexing
5. Configuration
4. Retrieving data fast
● Apache Solr is hopefully not able to answer this question!
● BUT it will return the results in less than a second
5. What is Apache Solr?
● Solr is an enterprise search platform from the
Apache Lucene project
● Solr is highly scalable, providing distributed
search and index replication
● Solr powers the search and navigation features
of many of the world's largest internet sites
● Major features include
● Powerful full-text search
● Hit highlighting
● Faceted search
● Rich document (e.g., Word, PDF) handling
6. What is faceted search?
● Faceted search is the dynamic clustering of items
or search results into categories that let users
drill into search results (or even skip searching
entirely)
● Each facet displayed typically shows the number of
hits that match that category
● Users can then “drill down” by applying specific
constraints to the search results
● Faceted search is also called faceted browsing,
faceted navigation, guided navigation and
sometimes parametric search
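As a rough illustration of the idea (not of how Solr implements it; Solr computes facet counts inside the index), the following sketch counts hits per facet value and applies one value as a constraint. The class name, method names and sample paths are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of facet counts and drill-down on a tiny result list.
class FacetCountSketch {

    // Each hit is {path, resourceType}. Returns the number of hits per resource type.
    static Map<String, Integer> countByType(List<String[]> hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] hit : hits) {
            counts.merge(hit[1], 1, Integer::sum);
        }
        return counts;
    }

    // "Drilling down": keep only the hits that match the chosen facet value.
    static List<String[]> drillDown(List<String[]> hits, String type) {
        List<String[]> result = new ArrayList<>();
        for (String[] hit : hits) {
            if (hit[1].equals(type)) {
                result.add(hit);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data, invented for the example.
        List<String[]> hits = Arrays.asList(
            new String[] {"/index.html", "containerpage"},
            new String[] {"/flowers/rose.html", "v8flower"},
            new String[] {"/flowers/tulip.html", "v8flower"},
            new String[] {"/about.html", "v8textblock"});
        System.out.println(countByType(hits));
        System.out.println(drillDown(hits, "v8flower").size());
    }
}
```

The counts are what a facet list would display next to each value; the drill-down result is what remains after the user clicks one of them.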
7. What is Faceted Search?
● The breadcrumb trail shows what constraints have
already been applied and allows their removal
● “Resource types” is a facet, a way of categorizing
the results
● Regular search results
● containerpage, v8flower, v8textblock, … are
constraints, or facet values
● The facet count shows how many results match
each value
● The tag bar shows other facet values of the found
document that can be applied
9. Database as bottleneck
● DBs are proprietary
● Require elaborate infrastructures
● SQL queries are hard to formulate
● SQL on DB is slower than search queries
● A large number of SQL statements turns the DB into
a bottleneck
● Even lower-traffic sites slow down when too many
statements are executed on the DB layer
● Overall performance starts to degrade
10. Content retrieval so far
● OpenCms stores the content in an RDBMS
● To access values of an XML content you have to
perform the following steps:
1. Read the resource: Resource (dates, refs, attr)
2. Read binary content: Content (blob)
3. Un-marshal content: Marshaled XML
4. Access with getters: Java Access Bean
11. The new way of content retrieval
● “Read” the whole resource content with a single query
● Simpler data structures by storing documents
● New flexibility by using the power of Solr’s query
syntax
● Best performance based on an optimized index
● HTTP interface for external applications
● Secure, scalable and cost-effective access
● Reduced DB traffic and increased performance
14. Search with Solr in OpenCms
● Querying OpenCms content using
the power of Solr’s query syntax
1. Send an HTTP request to the Solr request handler
2. Use the new Solr Collector
3. Call the Java API search method
15. OpenCms Solr handler
● The REST-like interface of Solr lets you access
indexed documents over HTTP without any
knowledge of CMS-specific syntax
● A permission check is performed by OpenCms,
making sure no protected documents are returned
● Solr-based UI frameworks like “Ajax Solr” can be used
on your website without additional development costs
● Provides an open interface for external
applications, e.g. mobile applications
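A minimal sketch of what such an HTTP request could look like from a Java client. The handler URL (host, port, context path and the handleSolrSelect name) is an assumption about a local default installation and should be checked against your own setup; q and rows are standard Solr request parameters. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a select URL for the OpenCms Solr handler.
class SolrHandlerUrlSketch {

    // ASSUMPTION: handler location on a local default installation; adjust as needed.
    static final String HANDLER = "http://localhost:8080/opencms/opencms/handleSolrSelect";

    // Appends the URL-encoded query (q) and the maximum number of results (rows).
    static String selectUrl(String handlerUrl, String query, int rows) {
        try {
            return handlerUrl + "?q=" + URLEncoder.encode(query, "UTF-8") + "&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // Prints the URL an external application could request over HTTP.
        System.out.println(selectUrl(HANDLER, "type:containerpage", 10));
    }
}
```

Because the permission check happens inside OpenCms, the client only needs Solr query syntax and plain HTTP.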
19. Indexed data
● Data indexed by default (hard coded)
● Field configuration (opencms-search.xml)
● XSD field mapping (Content definition)
● Implement a custom field configuration (Java)
20. Solr schema
● The schema file contains all of the details
about which fields your documents can
contain
● OpenCms uses an adjusted version of the
schema.xml that is contained in the
Apache Solr standard distribution:
WEB-INF/solr/conf/schema.xml
● If you want to add a new custom field or
field type for documents you can modify
this file
21. Advantages of field types
● Types are checked during the index
process
● It enables easy range queries, even for
dates, which makes a developer's life
much easier
● Custom types can be added, e.g.
key/value tuple or some special JSON
fields
22. Default indexed data
● id - Structure id used as unique identifier for a document (the structure id of the resource)
● path - Full root path (the root path of the resource, e.g. /sites/default/flower_en/.content/article.html)
● path_hierarchy - The full path as a path-tokenized field (field type: text_path)
● parent-folders - Parent folders (multi-valued field containing an entry for each parent path)
● type - Type name (the resource type name)
● res_locales - Existing locale nodes for XML content and all available locales in case of binary files
● created - The creation date (the date when the resource itself was created)
● lastmodified - The date last modified (the last modification date of the resource itself)
● contentdate - The content date (the date when the resource's content was modified)
● released - The release and expiration date of the resource
● content - A general content field that holds all extracted resource data (all languages, type text_general)
● contentblob - The serialized extraction result, to improve the extraction performance while indexing
● category - All categories as general text
● category_exact - All categories as exact strings, for faceting
● text_<locale> - Extracted textual content optimized for language-specific search
● timestamp - The time when the document was last indexed
● *_prop - All properties of a resource as searchable and stored text (<Property_Definition_Name>_prop)
● *_exact - All properties of a resource as exact, non-stored strings (<Property_Definition_Name>_exact)
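With the type, lastmodified and *_exact fields listed above, a request like "all resources of type news that have changed since yesterday, where property X has the value Y" can be written as a single Solr query. The sketch below only assembles the query string; the class and method names are hypothetical, and NOW-1DAY is standard Solr date math.

```java
// Hypothetical helper that assembles a Solr query from the default index fields.
class NewsQuerySketch {

    // Builds: type:<type> AND lastmodified:[<since> TO NOW] AND <property>_exact:"<value>"
    static String changedSince(String type, String since, String property, String value) {
        // Escape backslashes and double quotes so the value stays one quoted phrase.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "type:" + type
            + " AND lastmodified:[" + since + " TO NOW]"
            + " AND " + property + "_exact:\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // X and Y stand for a property definition name and its expected value.
        System.out.println(changedSince("news", "NOW-1DAY", "X", "Y"));
    }
}
```

The resulting string could be passed to any of the three search entry points (request handler, Solr Collector, Java API); which parameter carries it depends on the entry point.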
23. XSD field mapping
● Additional field mappings for XML contents can
now be configured directly within the XSD schema
● No need to modify opencms-search.xml
● No restart of the servlet container required
<searchsetting element="DisplayDate" searchcontent="false">
    <solrfield targetfield="myDisplayDateField" sourcefield="*_dt" />
</searchsetting>
<searchsetting element="Teaser">
    <solrfield targetfield="ateaser">
        <mapping type="item" default="Homepage n.a.">Homepage</mapping>
        <mapping type="property-search">search.special</mapping>
        <mapping type="dynamic" class="my.DynamicMapping">special</mapping>
    </solrfield>
</searchsetting>
25. Enable Solr in OpenCms
● On a fresh installation of OpenCms 8.5, Solr is enabled by
default; after updating an existing system to OpenCms 8.5,
Solr is disabled
● To enable Solr after an update, you must create a Solr home
directory in the WEB-INF folder of your OpenCms application
● Copy the solr/ folder from the OpenCms standard distribution
as a starting point for your configuration
● All search configuration is done as usual in
opencms-search.xml below WEB-INF/config
● Adding the following lines will enable the embedded server:
<opencms><search>
    <solr enabled="true"/> […]
</search></opencms>
26. Search index configuration
● You can add a custom Solr index with the known
OpenCms search configuration syntax
● NOTE: class attributes are needed for the index and its
field configuration
<index class="org.opencms.search.solr.CmsSolrIndex">
    <name>Solr Online</name>
    <rebuild>auto</rebuild>
    <project>Online</project>
    <locale>all</locale>
    <configuration>solr_fields</configuration>
    <sources>
        <source>solr_source</source>
    </sources>
</index>
27. Create field configuration (1/3)
● To convert an existing field configuration:
1. Copy a <fieldconfiguration> node
2. Change / set the class attribute
3. Optionally add a type attribute to fields
<fieldconfiguration
    class="org.opencms.search.solr.CmsSolrFieldConfiguration">
    <name>example</name>
    <description>Converted Lucene Index</description>
    <fields>
        <field name="meta" store="false" index="true" type="en">
            <mapping type="property">Title</mapping>
            <mapping type="property">Description</mapping>
        </field>
    </fields>
</fieldconfiguration>
28. Create field configuration (2/3)
● As the value of the type attribute of a field
definition inside opencms-search.xml
you can use the name of any dynamic field
defined in the schema.xml
● For example:
i - type="int"
dt - type="date"
txt - type="text_general"
en - type="text_en"
es - type="text_es"
fr - type="text_fr"
29. Create field configuration (3/3)
● As mentioned before, field names are defined in
Solr's schema.xml (<solr_name>); additional fields
are defined inside opencms-search.xml
(<opencms_name>)
● How does that work?
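String fieldName = <opencms_name> + "_txt";      // fallback: general text field
if (existsInSolrSchema(<opencms_name>)) {
    fieldName = <opencms_name>;                  // explicit field in schema.xml
} else if (isTypeAttributeSet()) {
    fieldName = <opencms_name> + "_" + <type>;   // dynamic field for the given type
}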
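A runnable sketch of this name resolution, under the assumption that the schema check refers to the plain OpenCms field name (an explicit field in schema.xml) and that the type attribute is used as a dynamic-field suffix. The class name, method name and the set of schema fields are hypothetical; the real OpenCms implementation may perform additional checks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified version of the field name resolution shown on the slide.
class FieldNameSketch {

    // explicitSchemaFields stands in for the fields declared explicitly in schema.xml.
    static String resolve(String opencmsName, String type, Set<String> explicitSchemaFields) {
        if (explicitSchemaFields.contains(opencmsName)) {
            return opencmsName;                 // field exists as-is in the Solr schema
        }
        if (type != null && !type.isEmpty()) {
            return opencmsName + "_" + type;    // dynamic field, e.g. meta_en
        }
        return opencmsName + "_txt";            // fallback: general text field
    }

    public static void main(String[] args) {
        Set<String> schema = new HashSet<>(Arrays.asList("content", "path", "type"));
        System.out.println(resolve("content", null, schema));
        System.out.println(resolve("meta", "en", schema));
        System.out.println(resolve("meta", null, schema));
    }
}
```

For the "meta" field with type="en" from the field configuration example, this yields meta_en, which matches a dynamic *_en field of type text_en.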
31. Future steps with IKS and Stanbol
● Having Solr and VIE integrated into OpenCms,
we are well prepared to start using Apache
Stanbol
● Stanbol is a top level Apache project
● Stanbol guarantees a quality standard
● Stanbol opens the perspective of sustainability
● We are looking to integrate Stanbol into
OpenCms 9
33. Integration Conclusion
● Permission checked search (secure)
● Solr Request handler (accessible)
● Solr Collector (integrated)
● Result highlighting (user-friendly)
● Configuration opportunities (flexible)
● Search field mapping (sensitive)
● Type based field schema (type-safe)
● Lucene conversion (compatible)
34. Thank you very much for your attention!
Rüdiger Kurz
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
http://www.iks-project.eu
http://stanbol.apache.org