The open source Apache Solr search engine provides powerful, versatile search application development technology so you can take full control of your search needs. Solr’s rich interfaces, its convenient server packaging of the underlying Apache Lucene search libraries behind web service interfaces, and its near-limitless customizability let you tailor search to your application. From e-commerce to content management and the many variations in between, Solr helps you turn an ever-growing volume and variety of data and documents to the advantage of your business.
http://www.lucidimagination.com/blog/2009/12/01/webinar-an-introduction-to-basics-of-search-and-relevancy-with-apache-solr/
An Introduction to Basics of Search and Relevancy with Apache Solr
1. Introduction to Basics of Search and Relevancy with Apache Solr
FEATURING:
Mark Bennett, CTO
2. Agenda
• Prerequisites: Browser Tricks
• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q&A
• Resources / About NIE
12/2/2009 Lucid Imagination, Inc. 2
3. Prerequisite: Some Browser Tricks
4. Browsers Matter – install them all!
Firefox:
• Default XML rendering (also some versions of IE)
• Lots of plugins
IE and Safari:
• Better “Explain” copy & paste: maintains line breaks
• Better table copy and paste
5. Larger Firefox “Command Line”
Customize the Firefox URL box as a command line in 3 easy steps
1. Toolbar: Right Click
2. Customize… Add New Toolbar
3. URL bar -> click and drag
6. Turn off Solr HTTP Caching
• Change in solrconfig.xml
• Disable HTTP 304 (Not Modified) responses in the httpCaching section (for example, never304="true")
• Turn it back on before you deploy!
8. The “Web Command Line”
CLI CONCEPT                  SOLR EQUIVALENT
• Command prompt             URL bar
• -o or --foo bar            ? or & and =
• (spaces)                   +
• some punctuation           %nn
• output                     XML or HTML
• Command line “adapter”     Curl
   • Script files can call URLs
   • Not built into Windows; try cygwin
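The CLI-to-URL mapping above can be tried out in a few lines of Python. This is only a sketch using the standard library; the helper name solr_url is made up for illustration, and the base URL is the localhost example used elsewhere in these slides.

```python
from urllib.parse import quote_plus, urlencode

# Base URL of the example Solr server used in these slides.
SOLR_SELECT = "http://localhost:8983/solr/select"


def solr_url(params, base=SOLR_SELECT):
    """Build a Solr request URL (hypothetical helper, not part of Solr).

    urlencode joins arguments with '&', separates names from values
    with '=', turns spaces into '+', and turns other punctuation
    into %nn escapes.
    """
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    # "Option flags" become ?name=value&name=value
    print(solr_url({"q": "solr", "debugQuery": "true"}))
    # Spaces become '+', and ':' and '"' become %nn escapes
    print(quote_plus('title:"apache solr"'))
```

The same URLs can be pasted into the browser URL bar or passed to curl from a script.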
10. Example: search for “solr”
http://localhost:8983/solr/select?q=solr&debugQuery=true
With Firefox you get XML output you can expand and collapse.
With MSIE* and Safari, not so much.
* Some versions
12. A look at the DisMax query parser
13. Solr DisMax: Defined
• What is it?
• Dis-junction: the query text is searched across multiple fields
• Max-imum: the best-matching field sets the score
• How do you get it?
• Configured in:
• solrconfig.xml and schema.xml
• Called with:
• qt=dismax
• Adjusted with:
• mm, bf, qf, pf, qs, ps, tie
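As a rough illustration of how these parameters travel on the URL, here is a sketch in Python. The field names title and body and all of the values are assumptions chosen for the example, not settings from the talk, and qt=dismax only works if solrconfig.xml defines a handler with that name.

```python
from urllib.parse import urlencode


def dismax_url(user_query, base="http://localhost:8983/solr/select"):
    """Build a DisMax request URL (hypothetical helper, illustrative values)."""
    params = {
        "qt": "dismax",          # select the DisMax handler
        "q": user_query,         # raw user query text
        "qf": "title^2.0 body",  # query fields with boosts (assumed field names)
        "pf": "title^1.5",       # phrase fields: boost when words appear together
        "mm": "2",               # minimum number of optional clauses that must match
        "tie": "0.1",            # tie breaker for non-maximum field scores
        "debugQuery": "true",    # include the explain output
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(dismax_url("apache solr"))
```

Anything set on the URL this way overrides the defaults configured for the handler, which makes it convenient for experimenting before committing values to solrconfig.xml.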
14. Solr DisMax: Pros and Cons
General Benefits
• Multiple Fields
• Multiple Relevancy Rules
• Great for Freshness / Popularity
Issues to be Aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm!
15. About the “dis” and the “max”
Distributed across multiple fields
• Break up the query into words
• Each word becomes a clause for every field
• Like an OR but with extra credit
Takes the Maximum of each set
• Word 1 had highest score in Title
• Word 2 very dense in the doc body
• Adds in Tie breaker if in multiple fields
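The arithmetic behind "max plus tie breaker" can be sketched in Python. This is a simplified model, assuming the per-field scores are already known; real Lucene scoring also applies factors such as query normalization and coord, which are left out here, and the function names and sample scores are made up for illustration.

```python
def dismax_word_score(field_scores, tie=0.0):
    """Score one query word across fields: the maximum field score,
    plus tie times the sum of the other (non-maximum) field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def dismax_query_score(scores_by_word, tie=0.0):
    """Add up the per-word DisMax scores (simplified: no coord or queryNorm)."""
    return sum(dismax_word_score(s, tie) for s in scores_by_word.values())


if __name__ == "__main__":
    # Hypothetical scores per word, listed as [title, body]
    scores = {
        "apache": [0.8, 0.2],  # highest score in the title
        "solr": [0.0, 0.5],    # only found in the body
    }
    print(dismax_query_score(scores, tie=0.0))  # pure maximum per word
    print(dismax_query_score(scores, tie=0.1))  # small credit for extra fields
```

With tie=0 only the best field counts for each word; with tie=1 the calculation degenerates into a plain sum across fields.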
16. Coming soon: Extended DisMax
Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation & -> &, etc.
• Improved Proximity Boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments
How to get it
• Post 1.4 patch, planned for 1.5
• Details + Patch in JIRA: SOLR-1553
http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax )
18. Boost Functions in DisMax
High Level Feature
• Numeric functions for scoring
• sum(), product(), sqrt(), log(), etc.
• Boost on recent dates, user popularity
Good Combination: Reverse-Ordinal & Reciprocal
• Position in index : ord(), reverse is: rord()
• Larger y for smaller x: recip()
How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery
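To see why reverse-ordinal and reciprocal combine well for freshness, here is a small Python model. It assumes recip(x, m, a, b) computes a / (m*x + b), as described on the FunctionQuery wiki, and it approximates rord() by ranking distinct values within a short list; in Solr the ordinal is taken over the indexed terms of the field. The sample years and the constants 1, 1000, 1000 are illustrative.

```python
def recip(x, m, a, b):
    """Reciprocal function: a / (m*x + b). Larger result for smaller x."""
    return a / (m * x + b)


def rord(values):
    """Simplified reverse ordinal: the largest distinct value gets 1,
    the next largest gets 2, and so on (an approximation of Solr's rord)."""
    distinct = sorted(set(values), reverse=True)
    rank = {value: position + 1 for position, value in enumerate(distinct)}
    return [rank[value] for value in values]


if __name__ == "__main__":
    years = [2007, 2009, 2008]  # hypothetical document dates
    for year, r in zip(years, rord(years)):
        # Models a boost such as bf=recip(rord(date),1,1000,1000)
        print(year, r, recip(r, 1, 1000, 1000))
```

The newest document gets the smallest reverse ordinal and therefore the largest boost, and the constants control how quickly the boost falls off for older documents.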
25. Another way to view Explain data
• Solr 1.4 has Solritas
• Various features, including toggle explain display
• “Some assembly required…”
http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/
27. Checking what got Indexed
Bad Index = Bad Search
• Check Upper / lower case and Punctuation
• Bad Fields / Meta Data = Bad Facets, Filters, Sorting
Use built-in Schema Browser:
• Check each field
• Common words =
• IDF “Inverse Document Frequency”
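For a feel of how document frequency turns into IDF, here is a short Python sketch. It assumes the classic Lucene DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)); the index size and document counts are invented for the example.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency, classic Lucene style:
    1 + ln(num_docs / (doc_freq + 1)). Rare terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    num_docs = 1000  # hypothetical index size
    for term, doc_freq in [("the", 999), ("search", 499), ("solritas", 9)]:
        print(term, doc_freq, round(idf(doc_freq, num_docs), 3))
```

A word that appears in nearly every document contributes very little to the score, which is why the term counts shown in the Schema Browser are worth checking field by field.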
28. Check IDF w/ the Schema Browser
Start at the Admin Screen:
http://localhost:8983/solr/admin
Schema Browser
• select a field
• change # to see more
29. About NIE
New Idea Engineering
30. NIE Resources
Newsletter & Whitepapers:
www.ideaeng.com/current
Search Dev Newsgroup:
www.SearchDev.org
Blogs:
EnterpriseSearchBlog.com
SearchComponentsOnline.com
31. Finish Line / Q & A
Review & Questions
Mark Bennett mbennett@ideaeng.com
main 408-446-3460
cell 408-829-6513
32. Q&A
These slides and a recorded presentation are available at
bit.ly/SolrRelevancy