This presentation introduces Lucene and Solr. It also gives an overview of Solr's search features and the different types of queries it supports.
It is useful for getting oriented during the early phases of search application development.
2. SOLR BASED SEARCH
CONTENTS
▸ Introduction to Lucene
▸ Introduction to Solr
▸ Terminologies
▸ Steps
▸ Document / Query Analysis
▸ Solr Search Features
▸ Solr Search - Query types
▸ Search Interfaces
▸ Search Challenges and solution
3. SOLR BASED SEARCH
WHAT IS LUCENE ?
▸ Open-source full-text search (information retrieval) library / API
▸ Written in Java by Doug Cutting
▸ Major Components
▸ Indexing (Inverted Index : keyword -> page) : IndexWriter (index is roughly 20-30% of the data size)
▸ Search Algorithm : IndexSearcher
▸ No notion of schema
▸ Example Usage : Atlassian Jira / Confluence, Salesforce, Oracle Text Search
▸ Lucene is very powerful but difficult to use directly
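The inverted index idea (keyword -> page) can be illustrated with a minimal Python sketch. This is not Lucene's implementation; the page texts and function names are hypothetical and serve only to show the mapping.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each keyword to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, keyword):
    """Look up a single keyword; unknown keywords give an empty list."""
    return index.get(keyword.lower(), [])


# Hypothetical pages, used only for illustration.
pages = {
    1: "Lucene is a search library",
    2: "Solr is a search server built on Lucene",
}
index = build_inverted_index(pages)
print(search(index, "lucene"))  # [1, 2]
print(search(index, "server"))  # [2]
```

A lookup is a single dictionary access, which is why searching an inverted index avoids scanning every page.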
4. SOLR BASED SEARCH
WHAT IS SOLR ?
▸ A full text Enterprise Search Server
▸ Caching
▸ Replication
▸ Easy administration
▸ Web Service layer on top of Lucene
▸ Non-relational data storage and processing
▸ Loose schema to define field types and fields
▸ Better recall and precision through various configuration options
▸ Easy to use
5. SOLR BASED SEARCH
TERMINOLOGIES
▸ Document : Unit of indexing and search
▸ Format : XML, JSON, CSV
▸ Fields : Name-value pairs; a type is associated with each field
▸ Search :
▸ Query : QueryParser creates the query -> IndexSearcher -> returns hits
6. SOLR BASED SEARCH
STEPS
▸ Create Indexes
▸ Build Document
▸ Analyse Document
▸ Index Document
▸ Search
▸ Input Query
▸ Analyse Query
▸ Render Result
[Diagram. Create indexes: Get contents, Build Solr doc, Analyse doc, Index doc. Search document: Search UI, Build query, Search query string, Analyse query, Render result.]
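The indexing half of these steps might look like the following Python sketch, which builds a document and prepares a JSON request for Solr's update handler. The host, port, collection name `films`, and the `id` and `name` fields are assumptions; analysis and indexing happen inside Solr once the request is sent, so the sketch stops before sending.

```python
import json
import urllib.request

# Assumed values: adjust to the actual Solr installation.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "films"


def build_solr_doc(doc_id, name, genres, director):
    """Build Document: a dict of field name -> value."""
    return {"id": doc_id, "name": name, "genre": genres, "directed_by": director}


def build_update_request(docs, commit=True):
    """Prepare (but do not send) a JSON update request for the given docs."""
    url = f"{SOLR_URL}/{COLLECTION}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


doc = build_solr_doc("film-1", "Example Film", ["Action", "Thriller"], "Gary Bose")
request = build_update_request([doc])
print(request.full_url)
# Sending is left out on purpose; with a running Solr it would be:
# urllib.request.urlopen(request)
```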
7. SOLR BASED SEARCH
SEARCH STRING / DOCUMENT ANALYSIS
▸ Analysis = Analyzer + Tokenizer + Filter
▸ The analyzers used for indexing and for searching may or may not be the same
▸ E.g. <fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer" />
</fieldType>
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonymsfile.txt" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
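To show what such a chain does to text, here is a rough Python imitation of the index and query analyzers configured above. It is only a sketch: the real Solr factories behave differently in detail, and the keep-word and synonym contents are hypothetical stand-ins for keepwords.txt and synonymsfile.txt.

```python
import re

# Hypothetical stand-ins for keepwords.txt and synonymsfile.txt.
KEEP_WORDS = {"action", "thriller", "drama"}
SYNONYMS = {"thriller": ["suspense"]}


def standard_tokenize(text):
    """Rough approximation of a standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def lower_case(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def keep_word(tokens):
    """Drop every token that is not on the keep list."""
    return [t for t in tokens if t in KEEP_WORDS]


def synonym(tokens):
    """Emit each token followed by its configured synonyms."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text, filters):
    """Tokenize, then apply the filters in order."""
    tokens = standard_tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


INDEX_CHAIN = [lower_case, keep_word, synonym]
QUERY_CHAIN = [lower_case]

print(analyze("Action, Thriller & Comedy", INDEX_CHAIN))
# ['action', 'thriller', 'suspense']
print(analyze("Suspense", QUERY_CHAIN))
# ['suspense']
```

Because synonyms are expanded at index time in this sketch, the query "Suspense" can match a document that only said "Thriller" even though the query chain has no synonym filter.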
8. SOLR BASED SEARCH
SOLR SEARCH FEATURES
▸ Ranked Search : Highest-scoring documents appear first; the score is available as a field of each hit
▸ Field Searching
▸ Custom Sort by Field
▸ Boosting Result
▸ Multiword synonyms (Solr 6.5 onwards)
▸ Stemming
▸ Hit highlight
▸ Autocomplete
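Several of these features are switched on through request parameters. The sketch below assembles a parameter set for field searching, term boosting, sorting, and hit highlighting using common Solr parameters (`q`, `fl`, `sort`, `hl`, `hl.fl`). The field names follow the query examples in this deck, except `id`, which is an assumption.

```python
from urllib.parse import parse_qs, urlencode

params = {
    # Field search with a boost: thriller matches count twice as much.
    "q": "genre:thriller^2 OR genre:action",
    # Return the relevance score alongside the stored fields.
    "fl": "id,genre,directed_by,score",
    # Ranked search first, then a custom tie-breaker by field.
    "sort": "score desc, initial_release_date desc",
    # Hit highlighting on the genre field.
    "hl": "true",
    "hl.fl": "genre",
}

query_string = urlencode(params)
print(query_string)

# Decoding the string again recovers the original values.
decoded = parse_qs(query_string)
print(decoded["sort"])
```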
9. SOLR BASED SEARCH
SOLR SEARCH FEATURES
▸ Faceting
▸ Term Frequency
▸ Document age consideration
▸ Spellchecks
▸ Typo tolerant
▸ Phonetic match
▸ OpenNLP / UIMA integration
▸ Pagination
▸ Functions for computation
▸ And so on …
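Pagination in Solr is driven by the `start` and `rows` parameters. A small helper, sketched here with hypothetical function names, converts a page number into those values and derives the page count from the `numFound` value of a response.

```python
def page_params(page, rows=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    return {"start": (page - 1) * rows, "rows": rows}


def page_count(num_found, rows=10):
    """Number of pages needed to show num_found hits (ceiling division)."""
    return -(-num_found // rows)


print(page_params(3))   # {'start': 20, 'rows': 10}
print(page_count(95))   # 10
```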
10. SOLR BASED SEARCH
SOLR SEARCH - VARIOUS TYPES OF QUERIES
▸ Simple text Search :
▸ Find films where genre contains the word "Action" (q=genre:Action)
▸ Find films where genre contains the word "Thriller" (q=genre:Thriller)
▸ Find films where genre contains the words Action and Thriller (q=*:*&fq=genre:Action&fq=genre:Thriller)
▸ Find films directed by Gary Bose (q=directed_by:(Gary Bose))
▸ Strict term presence search :
▸ Find films where genre contains the word "action" as well as "thriller" (q=*:*&fq=genre:(+action +thriller))
▸ Find films directed by a person whose name contains the words "Gary" as well as "Bose" (q=*:*&fq=+directed_by:Bose&fq=+directed_by:Gary)
▸ Proximity Search
▸ Find films where genre contains the words action and adventure within 20 positions of each other (q=*:*&fq=genre:"action adventure"~20)
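When these queries are sent over HTTP, characters such as `+`, `"` and spaces need URL encoding; a literal `+` in a query string would otherwise be decoded as a space. The following sketch uses Python's standard library to build the query string with repeated `fq` parameters. The helper name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def select_query(q="*:*", filters=()):
    """Encode a main query plus any number of fq filter queries."""
    pairs = [("q", q)] + [("fq", f) for f in filters]
    return urlencode(pairs)


strict = select_query(filters=["genre:(+action +thriller)"])
proximity = select_query(filters=['genre:"action adventure"~20'])
print(strict)
print(proximity)

# Round trip: decoding gives back exactly what was passed in.
print(parse_qsl(strict))
```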
11. SOLR BASED SEARCH
SOLR SEARCH - VARIOUS TYPES OF QUERIES
▸ Phrase Search
▸ Find films with genre "Action Thriller" (q=genre:"Action Thriller"), or as a filter query (q=*:*&fq=genre:"action thriller")
▸ Faceted Search
▸ Movies released between 2005-10-27 and 2006-11-30, with a count for each director (fl=initial_release_date&fq=initial_release_date:[2005-10-27T00:00:00Z TO 2006-11-30T00:00:00Z]&q=*:*&facet=true&facet.field=directed_by)
▸ Fuzzy Search (~)
▸ Genre contains a word close to the misspelled "sychologikal", such as "psychological" (q=*:*&fq=genre:sychologikal~)
▸ Negative Search
▸ Genre contains Action but not Thriller (q=*:*&fq=genre:action&fq=-genre:thriller)
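In Solr's default JSON output, field facet counts come back as a flat list that alternates value and count. The sketch below turns that list into a dictionary; the sample response fragment and the second director name are made up for illustration.

```python
def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical fragment of a response to a facet.field=directed_by request.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "directed_by": ["Gary Bose", 3, "Jane Doe", 1],
        }
    }
}

print(facet_counts(sample_response, "directed_by"))
# {'Gary Bose': 3, 'Jane Doe': 1}
```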
12. TEXT
▸ Wildcard Search
▸ Genre contains a word ending in "ction" (q=*:*&fq=genre:*ction)
▸ Conditional Logic in search
▸ Genre contains Psychological AND Thriller (q=*:*&fq=genre:(psychological AND Thriller))
▸ Genre contains Psychological OR Thriller (q=*:*&fq=genre:(psychological OR Thriller))
▸ Genre contains Psychological but not Thriller (q=*:*&fq=genre:(psychological NOT Thriller))
▸ Range Search
▸ Movies released between 2005-10-27 and 2006-11-30 (fl=initial_release_date&fq=initial_release_date:[2005-10-27T00:00:00Z TO 2006-11-30T00:00:00Z]&q=*:*)
▸ And so on …
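Range filters on date fields need timestamps in the UTC format shown above. A small sketch for generating such a filter from Python datetime values follows; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as a UTC timestamp like 2005-10-27T00:00:00Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def date_range_filter(field, start, end):
    """Build an inclusive range filter: field:[start TO end]."""
    return f"{field}:[{solr_date(start)} TO {solr_date(end)}]"


start = datetime(2005, 10, 27, tzinfo=timezone.utc)
end = datetime(2006, 11, 30, tzinfo=timezone.utc)
print(date_range_filter("initial_release_date", start, end))
# initial_release_date:[2005-10-27T00:00:00Z TO 2006-11-30T00:00:00Z]
```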
13. TEXT
EXAMPLES OF SEARCH INTERFACE
▸ REST API
▸ http://<host>:<port>/solr/<collection>/query?
▸ APIs such as SolrJ
▸ Solr Admin UI
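A call to the REST endpoint above could be wrapped as in the following sketch. The host, port, and collection name are assumptions, and the fetch function is injectable so the example runs without a Solr server; the fake response mimics the usual `response.docs` layout of Solr's JSON output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def solr_query(base_url, collection, params, fetch=http_get):
    """Call the collection's /query endpoint and return the matching documents."""
    url = f"{base_url}/solr/{collection}/query?{urlencode(params)}"
    payload = json.loads(fetch(url))
    return payload["response"]["docs"]


def fake_fetch(url):
    """Stand-in for a live server, returning a made-up response."""
    return json.dumps(
        {"response": {"numFound": 1, "start": 0, "docs": [{"id": "film-1"}]}}
    )


docs = solr_query(
    "http://localhost:8983", "films", {"q": "genre:Action"}, fetch=fake_fetch
)
print(docs)  # [{'id': 'film-1'}]
```

Dropping the `fetch=fake_fetch` argument makes the same call go to a real server.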
14. SOLR BASED SEARCH
PRECISION AND RECALL WITH SOLR
▸ WHAT : Results, Relevant Results, More Hits, More Relevant Results at High Rank
▸ HOW :
▸ Solr Synonyms
▸ Fuzzy Search
▸ Proximity Search
▸ Phrase Search
▸ Negative Search
▸ Strict Term Presence
▸ Doc Boosting
▸ Index Binary Docs
▸ Multiline Search
▸ Index Many Fields
▸ Search String Limit
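For reference, precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that were returned. A minimal sketch of both measures, with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three documents actually relevant, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(precision, recall)
```

Techniques that add hits, such as synonyms and fuzzy search, tend to raise recall, while stricter matching, such as phrase or negative search, tends to raise precision.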
15. SOLR BASED SEARCH
CHALLENGES
▸ Transforming domain-specific knowledge into config files such as synonyms, protwords, etc.
▸ Proper Solr Collection Configuration
▸ Slight change in query string words changes search results considerably
▸ Use stemming
▸ High recall
▸ Limit search by score
▸ Spaces
▸ Custom tokenizer
▸ Spelling mistakes in query string
▸ Fuzzy Search and/or spelling checkers
▸ Document Field names get indexed
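Fuzzy search handles spelling mistakes by matching terms within a small edit distance of the query term. The sketch below computes plain Levenshtein distance for the misspelling used in the earlier fuzzy query example. It is an illustration only; Lucene's own matching also counts transpositions and is implemented differently.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("sychologikal", "psychological"))  # 2
```

One insertion (the missing "p") and one substitution ("k" to "c") separate the two words, which is within the two edits that a bare `~` allows.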