SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Boosting Documents in Solr by
Recency, Popularity, and User
         Preferences
               Timothy Potter
     thelabdude@gmail.com, May 25, 2011
What I Will Cover
 Recency Boost
 Popularity Boost
 Filtering based on user preferences




                                        2
My Background
 Timothy Potter
 Large scale distributed systems engineer
  specializing in Web and enterprise search,
  machine learning, and big data analytics.
 5 years Lucene
  • Search solution for learning management sys
 2+ years Solr
  • Mobile app for magazine content
      Solr + Mahout + Hadoop
  • FAST to Solr Migration for a Real Estate Portal
  • VinWiki: Wine search and recommendation engine
                                                      3
Boost documents by age

 Just do a descending sort by
  age = done?
 Boost more recent documents
  and penalize older documents
  just for being old
 Useful for news, business docs,
  and local search




                                    4
Solr: Indexing
   In schema.xml:

     <fieldType name="tdate"
                class="solr.TrieDateField"
                omitNorms="true"
                precisionStep="6"
                positionIncrementGap="0"/>

      <field name="pubdate"
             type="tdate"
             indexed="true"
             stored="true"
             required="true" />

Date published =
   DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);
                                                               5
FunctionQuery Basics
 FunctionQuery: Computes a value for each
  document
   • Ranking
   • Sorting
constant       pow      recip
literal        abs      max
fieldvalue     log      min
ord            sqrt     ms
rord           map      sqedist - Squared Euclidean Dist
sum            scale    hsin, ghhsin - Haversine Formula
sub            query    geohash - Convert to geohash
product        linear   strdist

                                                           6
Solr: Query Time Boost
 Use the recip function with the ms function:
q={!boost b=$recency v=$qq}&
 recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&
 qq=wine



 Use edismax vs. dismax if possible:
 q=wine&
 boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)



 Recip is a highly tunable function
   • recip(x,m,a,b) implementing a / (m*x + b)
   • m = 3.16E-11 a= 0.08 b=0.05 x = Document Age


                                                           7
Tune Solr recip function




                           8
Tips and Tricks
 Boost should be a multiplier on the relevancy score

 {!boost b=} syntax confuses the spell checker so
  you need to use spellcheck.q to be explicit
  q={!boost b=$recency v=$qq}&spellcheck.q=wine


 Bottom out the old age penalty using min:
  • min(recip(…), 0.20)


 Not a one-size fits all solution – academic research
  focused on when to apply it
                                                         9
Boost by Popularity




 Score based on number of unique views
 Not known at indexing time
 View count should be broken into time slots

                                                10
Popularity Illustrated




                         11
Solr: ExternalFileField
In schema.xml:

<fieldType name="externalPopularityScore"
           keyField="id"
           defVal="1"
           stored="false" indexed="false"
           class=” solr.ExternalFileField "
           valType="pfloat"/>

<field name="popularity"
       type="externalPopularityScore" />




                                              12
Popularity Boost: Nuts & Bolts
User activity
  logged
                            solr-home/data/
                            external_popularity

                                a=1.114
                                b=1.05               Solr Server
                                                     Solr Server
   Logs
    Logs                        c=1.111
                                …




                                                  commit
                View Counting
                View Counting
                     Job
                      Job



                                                                   13
Popularity Tips & Tricks
 For big, high traffic sites, use log analysis
   • Perfect problem for MapReduce
   • Take a look at Hive for analyzing large volumes
     of log data


 Minimum popularity score is 1 (not zero) … up
  to 2 or more
   • 1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)


 Watch out for spell checker “buildOnCommit”

                                                         14
Filtering By User Preferences
 Easy approach is to build basic preference
  fields in to the index:
  • Content types of interest – content_type
  • High-level categories of interest - category
  • Source of interest – source


 We had too many categories and sources that
  a user could enable / disable to use basic
  filtering
  • Custom SearchComponent with a connection to
    a JDBC DataSource

                                                   15
Preferences Component
 Connects to a database
 Caches DocIdSet in a Solr FastLRUCache
 Cached values marked as dirty using a simple
  timestamp passed in the request

Declared in solrconfig.xml:
 <searchComponent
     class=“demo.solr.PreferencesComponent"
     name=”pref">
   <str name="jdbcJndi">jdbc/solr</str>
 </searchComponent>

                                                 16
Preferences Filter
 Parameters passed in the query string:
  • pref.id = primary key in db
  • pref.mod = preferences modified on timestamp
      So the Solr side knows the database has been updated
 Use simple SQL queries to compute a list of
  disabled categories, feeds, and types
  • Lucene FieldCaches for category, source, type
 Custom SearchComponent included in the list
  of components for edismax search handler
      <arr name="last-components">
         <str>pref</str>
       </arr>
                                                              17
Preferences Filter in Action


      Update                   Query with
      Preferences              pref.id=123 and
                               pref.mod = TS
                                                        Solr Server
                                                        Solr Server
   User
    User
Preferences
Preferences
    Db
     Db

                        Preferences
                        Preferences    pref.id & pref.mod
                        Component
                         Component
  SQL to compute
  excluded categories
  sources and types                                    LRU
                                                        LRU
                                                      Cache
                                                      Cache
                          If cached mod == pref.mod
                          read from cache
                                                                      18
Wrap Up
 Use recip & ms functions to boost recent
  documents

 Use ExternalFileField to load popularity scores
  calculated outside the index

 Use a custom SearchComponent with a Solr
  FastLRUCache to filter documents using
  complex user preferences


                                                    19
Contact
 Timothy Potter
  • thelabdude@gmail.com
  • http://thelabdude.blogspot.com
  • http://www.linkedin.com/in/thelabdude




                                            20

Weitere ähnliche Inhalte

Was ist angesagt?

Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologyLucidworks
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Search is the UI
Search is the UI Search is the UI
Search is the UI danielbeach
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Lucidworks
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksLucidworks
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseLucidworks
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptLucidworks
 
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon ConsultingSolr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon ConsultingLucidworks
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and TricksErik Hatcher
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Spark Summit
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
 Click-through relevance ranking in solr &  lucid works enterprise - By Andrz... Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...lucenerevolution
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature PreviewYonik Seeley
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 

Was ist angesagt? (20)

Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis TechnologySimple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
Simple Fuzzy Name Matching in Solr: Presented by Chris Mack, Basis Technology
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Search is the UI
Search is the UI Search is the UI
Search is the UI
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScript
 
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon ConsultingSolr JDBC: Presented by Kevin Risden, Avalon Consulting
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engine
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
 Click-through relevance ranking in solr &  lucid works enterprise - By Andrz... Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
Click-through relevance ranking in solr &  lucid works enterprise - By Andrz...
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Solr 6 Feature Preview
Solr 6 Feature PreviewSolr 6 Feature Preview
Solr 6 Feature Preview
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 

Ähnlich wie Boost Solr Results by Recency, Popularity, and User Preferences

Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...Lucidworks
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" DataArt
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampKais Hassan, PhD
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull lucenerevolution
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than EverApache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than EverLucidworks (Archived)
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101MongoDB
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conferenceErik Hatcher
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 

Ähnlich wie Boost Solr Results by Recency, Popularity, and User Preferences (20)

Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...  Automated Cluster Management and Recovery  for Large Scale Multi-Tenant Sea...
Automated Cluster Management and Recovery for Large Scale Multi-Tenant Sea...
 
Apache Solr for begginers
Apache Solr for begginersApache Solr for begginers
Apache Solr for begginers
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys" IT talk SPb "Full text search for lazy guys"
IT talk SPb "Full text search for lazy guys"
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Solr 101
Solr 101Solr 101
Solr 101
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
 
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull Just the Job: Employing Solr for Recruitment Search -Charlie Hull
Just the Job: Employing Solr for Recruitment Search -Charlie Hull
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than EverApache Solr 1.4 – Faster, Easier, and More Versatile than Ever
Apache Solr 1.4 – Faster, Easier, and More Versatile than Ever
 
Solr5
Solr5Solr5
Solr5
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101
 
Data Science
Data ScienceData Science
Data Science
 
Solr Black Belt Pre-conference
Solr Black Belt Pre-conferenceSolr Black Belt Pre-conference
Solr Black Belt Pre-conference
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 

Mehr von thelabdude

Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxiothelabdude
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solrthelabdude
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integrationthelabdude
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkitthelabdude
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4thelabdude
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...thelabdude
 
Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202thelabdude
 

Mehr von thelabdude (9)

Running Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with AlluxioRunning Solr in the Cloud at Memory Speed with Alluxio
Running Solr in the Cloud at Memory Speed with Alluxio
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
ApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr IntegrationApacheCon NA 2015 Spark / Solr Integration
ApacheCon NA 2015 Spark / Solr Integration
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202Dachis Group Pig Hackday: Pig 202
Dachis Group Pig Hackday: Pig 202
 

Kürzlich hochgeladen

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Kürzlich hochgeladen (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Boost Solr Results by Recency, Popularity, and User Preferences

  • 1. Boosting Documents in Solr by Recency, Popularity, and User Preferences Timothy Potter thelabdude@gmail.com, May 25, 2011
  • 2. What I Will Cover  Recency Boost  Popularity Boost  Filtering based on user preferences 2
  • 3. My Background  Timothy Potter  Large scale distributed systems engineer specializing in Web and enterprise search, machine learning, and big data analytics.  5 years Lucene • Search solution for learning management sys  2+ years Solr • Mobile app for magazine content  Solr + Mahout + Hadoop • FAST to Solr Migration for a Real Estate Portal • VinWiki: Wine search and recommendation engine 3
  • 4. Boost documents by age  Just do a descending sort by age = done?  Boost more recent documents and penalize older documents just for being old  Useful for news, business docs, and local search 4
  • 5. Solr: Indexing In schema.xml: <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/> <field name="pubdate" type="tdate" indexed="true" stored="true" required="true" /> Date published = DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR); 5
  • 6. FunctionQuery Basics  FunctionQuery: Computes a value for each document • Ranking • Sorting constant pow recip literal abs max fieldvalue log min ord sqrt ms rord map sqedist - Squared Euclidean Dist sum scale hsin, ghhsin - Haversine Formula sub query geohash - Convert to geohash product linear strdist 6
  • 7. Solr: Query Time Boost  Use the recip function with the ms function: q={!boost b=$recency v=$qq}& recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)& qq=wine  Use edismax vs. dismax if possible: q=wine& boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)  Recip is a highly tunable function • recip(x,m,a,b) implementing a / (m*x + b) • m = 3.16E-11 a= 0.08 b=0.05 x = Document Age 7
  • 8. Tune Solr recip function 8
  • 9. Tips and Tricks  Boost should be a multiplier on the relevancy score  {!boost b=} syntax confuses the spell checker so you need to use spellcheck.q to be explicit q={!boost b=$recency v=$qq}&spellcheck.q=wine  Bottom out the old age penalty using min: • min(recip(…), 0.20)  Not a one-size fits all solution – academic research focused on when to apply it 9
  • 10. Boost by Popularity  Score based on number of unique views  Not known at indexing time  View count should be broken into time slots 10
  • 12. Solr: ExternalFileField In schema.xml: <fieldType name="externalPopularityScore" keyField="id" defVal="1" stored="false" indexed="false" class=” solr.ExternalFileField " valType="pfloat"/> <field name="popularity" type="externalPopularityScore" /> 12
  • 13. Popularity Boost: Nuts & Bolts User activity logged solr-home/data/ external_popularity a=1.114 b=1.05 Solr Server Solr Server Logs Logs c=1.111 … commit View Counting View Counting Job Job 13
  • 14. Popularity Tips & Tricks  For big, high traffic sites, use log analysis • Perfect problem for MapReduce • Take a look at Hive for analyzing large volumes of log data  Minimum popularity score is 1 (not zero) … up to 2 or more • 1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)  Watch out for spell checker “buildOnCommit” 14
  • 15. Filtering By User Preferences  Easy approach is to build basic preference fields in to the index: • Content types of interest – content_type • High-level categories of interest - category • Source of interest – source  We had too many categories and sources that a user could enable / disable to use basic filtering • Custom SearchComponent with a connection to a JDBC DataSource 15
  • 16. Preferences Component  Connects to a database  Caches DocIdSet in a Solr FastLRUCache  Cached values marked as dirty using a simple timestamp passed in the request Declared in solrconfig.xml: <searchComponent class=“demo.solr.PreferencesComponent" name=”pref"> <str name="jdbcJndi">jdbc/solr</str> </searchComponent> 16
  • 17. Preferences Filter  Parameters passed in the query string: • pref.id = primary key in db • pref.mod = preferences modified on timestamp  So the Solr side knows the database has been updated  Use simple SQL queries to compute a list of disabled categories, feeds, and types • Lucene FieldCaches for category, source, type  Custom SearchComponent included in the list of components for edismax search handler <arr name="last-components"> <str>pref</str> </arr> 17
  • 18. Preferences Filter in Action Update Query with Preferences pref.id=123 and pref.mod = TS Solr Server Solr Server User User Preferences Preferences Db Db Preferences Preferences pref.id & pref.mod Component Component SQL to compute excluded categories sources and types LRU LRU Cache Cache If cached mod == pref.mod read from cache 18
  • 19. Wrap Up  Use recip & ms functions to boost recent documents  Use ExternalFileField to load popularity scores calculated outside the index  Use a custom SearchComponent with a Solr FastLRUCache to filter documents using complex user preferences 19
  • 20. Contact  Timothy Potter • thelabdude@gmail.com • http://thelabdude.blogspot.com • http://www.linkedin.com/in/thelabdude 20

Hinweis der Redaktion

  1. Attendees with come away from this presentation with a good understanding and access to source code for boosting and/or filtering documents by recency, popularity, and personal preferences. My solution improves upon the common &quot;recip&quot; based solution for boosting by document age. The framework also supports boosting documents by a popularity score, which is calculated and managed outside the index. I will present a few different ways to calculate popularity in a scalable manner. Lastly, my solution supports the concept of a personal document collection, where each user is only interested in a subset of the total number of documents in the index. My presentation will provide a good example of how to filter and/or boost results based on user preferences, which is a very common requirement of many Web applications.
  2. The one thing I’d like you to come away with today is confidence that Solr has powerful boosting capabilities built-in, but they require some fine-tuning and experimentation. Some simple recipes for complementing core Solr functionality to do: I. Boost documents by age (recency / freshness boost) II. Boost documents by popularity III. Filter results based on User Preferences (Personalized collection)
  3. Currently working at the National Renewable Energy Laboratory on building an infrastructure for storing and analyzing large volumes of smart grid related energy data using Hadoop technologies. Been doing search work for the past 5 years including a Lucene based search solution of eLearning content, Solr based solution for online magazine content and a FAST to Solr migration for a real estate portal. My other area of interest is in Mahout; I&apos;ve contributed a few bug fixes and several pages on the wiki including working with Grant Ingersoll on benchmarking Mahout&apos;s distributed clustering algorithms in the Amazon cloud. Technical Blog: http://thelabdude.blogspot.com/ Currently working on JSF2 components for Solr.
  4. All other things being equal, more recent documents are better What’s not covered is how to determine if you should apply the boost. That’s a more in-depth topic that is the focus of academic research, especially in relation to Web search. News and most magazine articles Business documents – perhaps a less aggressive boost function identification of recency sensitive queries before ranking. see: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  5. Careful! TrieFields make it more efficient to do range searches on numeric fields indexed at full precision, but it doesn&apos;t actually do anything to round the fields for people who genuinely want their stored and index values to only have second/minute/hour/day precision regardless of what the initial raw data looks like. Currently, Solr doesn&apos;t have anything built-in to round a date down to a different precision, such as minute / hour. Thus, you may need to do this yourself prior to indexing a document. see SOLR-741 // from commons DateUtils Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);
  6. Solr 1.4+ the recommended approach is to use the recip function with the ms function: There are approximately 3.16e10 milliseconds in a year, so one can scale dates to fractions of a year with the inverse, or 3.16e-11 recip(ms(NOW/HOUR,pubdate),3.16e-11,1,1) For standard query parser, you could do: q={!boost b=recip(ms(NOW/HOUR,pubdate),3.16e-11,1,1)}wine This uses the built-in boost function query. This uses a Lucene FieldCache under the covers on the pubdate field (stored in the index as long). The ms(NOW/HOUR) uses less precise measure of document age (rounding clause), which helps reduce memory consumption. Lessons: 1 - {!boost b=} syntax breaks spell-checking so you need to use spellcheck.q to be explicit 2 - Use edismax because it multiplies the boost whereas dismax adds &quot;bf&quot; 3 - Use a tdate field when indexing 4 - Use ms(NOW/HOUR) and less precision when indexing 5 - Use max(boost,0.20) - to bottom out the age penalty
  7. A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). m,a,b are constants, x is any numeric field or arbitrarily complex function. When a and b are equal, and x&gt;=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve. These properties can make this an ideal function for boosting more recent documents – see http://wiki.apache.org/solr/FunctionQuery
  8. identification of recency sensitive queries before ranking. see: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  9. Score made of number of unique views in a time slot + avg rating / # of comments, etc. Must be computed outside of the index; refreshed periodically Probably don’t want to mix this with age boost as an older document might be really popular for some weird reason; think of old videos that become popular on YouTube Age – probably not as an old doc might get popular identification of recency sensitive queries before ranking. see: http://technicallypossible.wordpress.com/2011/03/13/identifying-queries-which-demand-recency-sensitive-results-in-web-search/
  10. Bar chart illustrates time slots Popularity score favors more recent content Document A is most popular; B was popular but is now on the decline and C has enjoyed consistent interest for a longer period but scores a little lower than A because of the recent interest in A
  11. Most likely use case would be to use log-file analysis &gt; Ideal problem for MapReduce Question the audience – who has heard of MapReduce?