OpenCms Days 2014 - Using the SOLR collector

Sören Schneider, Alkacon Software
WORKSHOP TRACK Using the SOLR Collector
27.11.2014

1. Brief Introduction to Solr
2. Common Mistakes Using OpenCms & Solr
3. Using the Solr Collector (DEMO)
4. Spellchecking in OpenCms Using Solr
Agenda

●Solr is a very versatile and powerful search engine that supports a wide range of features
●This functionality comes at the price of increased complexity in handling Solr
●Many customizations available
●All fields composing a single document are typed
Brief Solr Introduction

●The data structures of Solr's documents are defined in the file schema.xml
●Changes to this file require reindexing
●Dynamic fields work around that limitation
●They can be used without being explicitly defined in the schema, because they are matched by wildcards
Defining Solr's Data Structure
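
The wildcard idea behind dynamic fields can be sketched in a few lines of Python. This is only an illustration, not Solr's actual resolution logic: the patterns in DYNAMIC_FIELDS and the helper resolve_type are assumptions made up for this example, and the sketch simply takes the first matching pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field patterns, in the spirit of wildcard entries in schema.xml.
DYNAMIC_FIELDS = [
    ("*_s", "string"),
    ("*_dt", "date"),
    ("*_txt", "text"),
]


def resolve_type(field_name, explicit_fields):
    """Return the type for a field name, or None if nothing matches.

    Explicitly defined fields win; otherwise the first matching wildcard
    pattern decides (a simplification chosen for this sketch).
    """
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


explicit = {"id": "string"}
print(resolve_type("id", explicit))           # explicitly defined field
print(resolve_type("city_s", explicit))       # matched by the "*_s" pattern
print(resolve_type("released_dt", explicit))  # matched by the "*_dt" pattern
print(resolve_type("unknown", explicit))      # no definition at all
```

A new field such as city_s can therefore be indexed without touching the schema, as long as its name fits one of the patterns.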

Solr: Indexing Content
a: date
b: text
c: string
Solr processing (through analyzers, filters and tokenizers)
a: date
b: string
c: string

●"Direct" usage of OpenCms & Solr requires a basic understanding of Solr
●Use data types that fit the individual use case, and learn what the filters do
●Know the query syntax (for the data types involved)
●Most common mistakes of OpenCms users result from insufficient knowledge of Solr basics
OpenCms & Solr
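
As a rough illustration of what "knowing the query syntax" means in practice, the following Python sketch assembles the parameters of a Solr select request. The parameters q, fq, rows and wt are standard Solr request parameters; the field names content_en and city_s and the helper build_select_params are assumptions chosen for this example, not names confirmed by the slides.

```python
from urllib.parse import urlencode


def build_select_params(free_text, city, rows=10):
    """Build a URL-encoded parameter string for a Solr select request.

    The field names are hypothetical: "content_en" stands for a tokenized
    text field, "city_s" for an untokenized string field.
    """
    # Escape backslashes and double quotes so the phrase stays intact.
    escaped_city = city.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f"content_en:({free_text})",   # full-text search on a text field
        "fq": f'city_s:"{escaped_city}"',   # exact-value filter on a string field
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)


print(build_select_params("solr collector", "New York"))
```

The string field is queried with a quoted phrase because its value is stored as a single exact term, while the text field accepts individual words.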

1. Using improper types
●"text" vs "string"
●Formulating correct queries
2. Issues regarding the mapping OpenCms <-> Solr
3.(Encoding Problems)
Common Mistakes Using Solr & OpenCms

●String
●Stores its content as exact string
●No tokenization / processing is being performed
●Useful when searching for exact value
●Text
●Tokenization and processing is performed
●Useful when a part of the content is searched for
"text" vs "string"
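
The difference can be illustrated with a deliberately simplified Python sketch. It does not reproduce Solr's analyzers; tokenize here just lowercases and splits on whitespace, which is an assumption made for the example.

```python
def matches_string_field(stored, query):
    """A "string" field keeps the value as one exact term."""
    return stored == query


def tokenize(value):
    """Very rough stand-in for an analyzer chain: lowercase, split on whitespace."""
    return [token.lower() for token in value.split()]


def matches_text_field(stored, query):
    """A "text" field matches when every query token occurs among the stored tokens."""
    stored_tokens = set(tokenize(stored))
    return all(token in stored_tokens for token in tokenize(query))


value = "OpenCms Days 2014"
print(matches_string_field(value, "OpenCms Days 2014"))  # True: exact value
print(matches_string_field(value, "days"))               # False: only a part of the value
print(matches_text_field(value, "days"))                 # True: one of the tokens
```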

●OpenCms copies the entire XML content into a single(!) locale-aware Solr field of type "text" for each locale
●Specific pieces of information about a resource are made searchable in OpenCms using two approaches
●Automatic mapping of properties to Solr fields
●Manual definition of mappings
Making Your Content Searchable

Indexing Content w/o Searchsettings
x: text
a: date
b: string
c: string

Indexing Content with Searchsettings
a: date
b: text
c: string
a: date
b: string
c: string

●Mapping happens in the schema (XSD) of the respective resource type
●Excerpt
Solr – OpenCms Interaction: Mapping
<xsd:schema
…
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting> …
(element="City": the element name within the resource type)

Element Mapping Attributes
Attribute Name   Effect on the Solr Field
targetfield*     The resulting name
locale           Write content only for a specific locale
sourcefield      Defines the resulting type
copyfields       Copies the value to a different field
default          Sets a default value
boost            Sets a boost for the field
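
To make the structure of such a mapping tangible, here is a small Python sketch that reads a searchsettings fragment and lists the attributes from the table above. The fragment is a completed version of the excerpt (closing tags added for the sake of parsing), and list_mappings is a helper invented for this example; it says nothing about how OpenCms itself evaluates the schema.

```python
import xml.etree.ElementTree as ET

# Self-contained fragment modelled on the excerpt; completed only so it parses.
FRAGMENT = """
<searchsettings>
  <searchsetting element="City" searchcontent="true">
    <solrfield targetfield="city" sourcefield="_s" />
  </searchsetting>
</searchsettings>
"""


def list_mappings(xml_text):
    """Collect one dict per solrfield, keyed by the attribute names from the table."""
    root = ET.fromstring(xml_text)
    mappings = []
    for setting in root.findall("searchsetting"):
        for field in setting.findall("solrfield"):
            mappings.append({
                "element": setting.get("element"),
                "targetfield": field.get("targetfield"),
                "sourcefield": field.get("sourcefield"),
                "locale": field.get("locale"),    # None when the attribute is absent
                "boost": field.get("boost"),      # None when the attribute is absent
            })
    return mappings


for mapping in list_mappings(FRAGMENT):
    print(mapping)
```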

●Users complain about problems with certain characters, mostly German umlauts, in Solr results
●In nearly all cases the sole cause is that the servlet container hosting Solr is not configured to use UTF-8
●Extra note for Tomcat users: Please check whether you added the required attributes to all relevant "<Connector>" elements ;-)
Using UTF-8 in Solr
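
The typical symptom is easy to reproduce in Python: UTF-8 bytes that are decoded with a single-byte charset turn each umlaut into two wrong characters. The sample name is made up. For Tomcat, URIEncoding="UTF-8" on the Connector is one attribute commonly set for this purpose; the slides do not name the attributes, so treat that as an assumption.

```python
original = "Müller"

# The client sends the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A container that wrongly assumes ISO-8859-1 decodes each byte on its own.
garbled = utf8_bytes.decode("latin-1")
print(garbled)   # MÃ¼ller

# Decoding with the correct charset restores the text.
repaired = utf8_bytes.decode("utf-8")
print(repaired)  # Müller
```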

●Live Demo
Live Demo

●The spellchecker is implemented using Solr
●Solr already provides a flexible component named "SpellCheckComponent"
●This component supports inline spellchecking of Solr queries
●The source for suggestions can be Solr fields or text files
WYSIWYG Spellchecker

●The "SpellCheckComponent" is widely used to implement the "Did you mean?" feature known from popular search engines
●The component is
●Reliable and mature
●Fast
●Plus, Solr is already available in OpenCms
Why Use Solr as a Spellchecker?

●If both use cases rely on the same component, how do the implementations actually differ?
●"Did you mean?" builds its source of suggested words from the entire data set the search runs on. Usually only a single hit is returned.
●The WYSIWYG spellchecker builds its source of suggestions from data that contains only the dictionary for a single language
Implementation Differences Between the Use Cases

●Spellchecking is implemented using a separate Solr core that resides in WEB-INF/spellcheck
●As the only purpose of this core is to hold spellcheck information, its schema.xml file is as simple as it gets
●Why use another Solr core instead of the default core that is used by OpenCms?
●Dictionaries are stored as one Solr index per language
How to Model This Scenario Using Solr?

●Sadly, the spellchecking interfaces of tinyMCE and Solr are incompatible
Problems regarding tinyMCE and Solr
Solr
tinyMCE

Comparison Spellcheck Responses
tinyMCE:
{
  "id":"c0",
  "result":{"hsoue":["house", "has"]}
}
Solr:
"spellcheck":{
  "suggestions":[
    "hsoue",{
      "numFound":5,
      "startOffset":0,
      "endOffset":4,
      "origFreq":0,
      "suggestion":[{"word":"house","freq":53}, {"word":"has","freq":271}, … ]},
    "correctlySpelled",false,
    "collation","hsue" ]},

●A new component had to be implemented in OpenCms that basically
●Accepts spellcheck requests from tinyMCE
●Handles the communication between tinyMCE and Solr, including message conversion
●Checks and (re-)builds the spellcheck indices
●The corresponding code is found in org.opencms.search.solr.spellcheck
Gluing the Pieces Together
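
The message conversion step can be sketched in Python using the two response formats compared earlier. This is not the code from org.opencms.search.solr.spellcheck; the function solr_to_tinymce and the shortened sample data are assumptions made for illustration only.

```python
# Shortened sample of the "spellcheck" part of a Solr response, taken from the comparison.
SOLR_SPELLCHECK = {
    "suggestions": [
        "hsoue", {
            "numFound": 5,
            "startOffset": 0,
            "endOffset": 4,
            "origFreq": 0,
            "suggestion": [
                {"word": "house", "freq": 53},
                {"word": "has", "freq": 271},
            ],
        },
        "correctlySpelled", False,
        "collation", "hsue",
    ]
}


def solr_to_tinymce(solr_spellcheck, request_id):
    """Convert Solr's flat name/value list into the id/result shape tinyMCE expects."""
    entries = solr_spellcheck["suggestions"]
    result = {}
    # The list alternates between a name and its value, so walk it in pairs.
    for name, value in zip(entries[0::2], entries[1::2]):
        # Only misspelled words carry a dict with a "suggestion" list;
        # "correctlySpelled" and "collation" are skipped.
        if isinstance(value, dict) and "suggestion" in value:
            result[name] = [item["word"] for item in value["suggestion"]]
    return {"id": request_id, "result": result}


print(solr_to_tinymce(SOLR_SPELLCHECK, "c0"))
```

The frequency information and the collation are dropped on purpose, since the tinyMCE format in the comparison has no place for them.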

●Dictionaries can be edited easily in OpenCms
●The indices are populated automatically from flat text files, one word per line
●Support for multiple languages
●To access the dictionaries, have a look at the directory org.opencms.workplace.spellcheck/resources/
Spellchecker in OpenCms
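
The dictionary file format is simple enough to show in a few lines of Python. The helper parse_dictionary and the sample words are assumptions for this sketch; it only demonstrates the "one word per line" layout, not how OpenCms feeds the words into the index.

```python
def parse_dictionary(text):
    """Return the words of a flat dictionary file, one word per line.

    Surrounding whitespace is stripped and empty lines are ignored.
    """
    words = []
    for line in text.splitlines():
        word = line.strip()
        if word:
            words.append(word)
    return words


sample = "house\nhas\n\n  hose  \n"
print(parse_dictionary(sample))  # ['house', 'has', 'hose']
```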

●Adding a new language
1.Create new Solr field in schema.xml
2.Create new dictionary file inside VFS
3.Restart OpenCms
●Adding words to the custom dictionary
Extending the Spellchecker

●Any Questions?

Sören Schneider
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
Thank you very much for your attention!
