



                                 Usage of Solr at Trovit
                        A Search Engine For Classified Ads



                                                             Marc Sturlese
                                                                    Trovit

                                                           marc@trovit.com
                           Apache Lucene Eurocon 2010, Prague, 20 May 2010


Apache Lucene EuroCon                                                   4 May 2010
Agenda

              ● Trovit, a Solr use case
              ● Types of index
              ● Architecture overview
              ● Relevance tuning
              ● Out of the box features
              ● Custom features
              ● Sharding
              ● Future directions
              ● Questions


Apache Lucene EuroCon   05/16/10
What is Trovit? A Search Engine For Classified Ads




Types of index

              There are 3 different types of index:
              ● Organic ads index
              ● Sponsored ads index
              ● Recommended searches index


              There is one index per country and per business category for
              each type, which means a total of 180 indexes.
              Some of them are sharded. All of them have replicas.




Types of index

                        Screenshot showing the 3 types of index




Architecture overview

              [Diagram, back end: crawling / parsing → warehouse → indexing → Solr indexer → replication → Solr slaves]
              [Diagram, front end: request → load balancer → front-end servers → load balancer → Solr slaves]



Architecture overview

              Masters - Indexing
              ● 4 servers, continuously updating the indexes sequentially
                        ● 1 server to index organic ads for all countries/categories
                        ● 1 server to index powered ads for all countries/categories
                        ● 1 server to index recommended searches for all countries/categories


              Slaves – Serving search requests
              ● Indexes with high traffic have 4 replicas
              ● Indexes with less traffic have 3 replicas




Architecture overview

       ● Indexes are replicated using modified collection
           distribution scripts to allow multi-core
       ● Snapshooter and snappuller are executed sequentially
       ● Snapinstaller is executed at the same time on each slave
           so that all slaves serve exactly the same content at all times
       ● Started load balancing with Perlbal. It was producing
           high CPU loads




Life of a user search request

           For every user search:
           ● A request is sent to the organic and sponsored indexes
           ● For each result of the organic search, a request is sent to the
                  recommended searches index


           ● 13 Solr requests per user search! And once this is done...
                  The user search request is batch processed to decide
                  whether it must be indexed in the similar user searches index




Life of a user search request




Relevance tuning

           ● Basic searches use the dismax query type (qt), built on top of Lucene's
                        DisjunctionMaxQuery
           ● Boosting queries make the latest ads more relevant
           ● Some ads are boosted at document level at indexing time to
                        make them more important than others
           ● Fields are boosted at query time to make a match
                        more important in some fields than in others
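
As a rough illustration of the query-time part, the sketch below builds request parameters for a dismax search. The field names (title, description, date) and all boost values are assumptions made up for this example; the slides do not show Trovit's actual schema or weights. Only standard dismax parameters are used: qf for per-field boosts and bq for a boosting query that favours recent ads.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query):
    """Return Solr request parameters for a dismax search.

    Field names and boost values are hypothetical, chosen for illustration.
    """
    return {
        "q": user_query,
        "defType": "dismax",
        # Query-time field boosts: a match in the title counts more
        # than a match in the description.
        "qf": "title^3.0 description^1.0",
        # Boosting query: ads dated within the last 7 days get extra weight.
        "bq": "date:[NOW-7DAY TO NOW]^2.0",
    }


# Encoded query string that could be appended to a /select URL.
query_string = urlencode(build_dismax_params("home tennessee"))
```

Document-level boosts applied at indexing time are not visible in the request; they are already baked into the index when the query runs.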




Relevance tuning

         User search: home tennessee
         ● Higher quality ad




         ● Lower quality ad




Out of the box Solr features

          ● Synonyms for USA states
          ● Per-country and per-business-category stopwords
          ● MoreLikeThis request handler
          ● TrieFields to index housing latitude and longitude
          ● Facet fields, queries and dates
          ● Warming queries loaded from a specific file using an EventListener
                  (issue SOLR-784)
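
Trie fields mainly pay off on numeric range queries, so one plausible use of the latitude and longitude fields is a bounding-box filter. The sketch below builds two range filter queries in standard Solr syntax; the field names and the box size are assumptions for illustration, not details taken from the slides.

```python
def bounding_box_filters(lat, lon, delta, lat_field="latitude", lon_field="longitude"):
    """Build Solr range filter queries for a square box around (lat, lon).

    Field names are hypothetical; delta is the half-width of the box in degrees.
    """
    return [
        f"{lat_field}:[{lat - delta} TO {lat + delta}]",
        f"{lon_field}:[{lon - delta} TO {lon + delta}]",
    ]


# Each string would be sent as a separate fq parameter.
filters = bounding_box_filters(41.5, 2.0, 0.25)
```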




Out of the box Solr features: MoreLikeThis




Out of the box Solr features: Usage of TrieFields




Custom features

          ● Duplicate detection
                  ● Coming from the same source: indexing time
                  ● Coming from different sources: indexing and search
                         time
          ● Pseudo field collapsing
          ● Custom ranking for sponsored ads
          ● Custom Data Import Handler for full indexing and updates




Custom features – Near-duplicate detection


          ● Ads coming from the same source
                 ● The last one to arrive is the one kept in the index
                 ● Deduplication using SignatureUpdateProcessor
                 ● Small hack to customize TextProfileSignature


          ● Ads coming from different sources
                 ● Give the user the chance to decide which source to visit
                 ● Based on the field collapsing issue (SOLR-236) and the
                   SignatureUpdateProcessor used in deduplication
                 ● Done in 2 steps, one at index time and one at search time
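
The same-source case ("the last one to arrive is kept") can be pictured with a small sketch. It uses an exact MD5 signature over normalized text as a simple stand-in; TextProfileSignature is a fuzzy signature and behaves differently, and the field names here are assumptions for illustration only.

```python
import hashlib


def signature(ad, fields=("title", "description")):
    """Exact-match signature over normalized fields (a stand-in, not TextProfileSignature)."""
    normalized = " ".join(
        " ".join(str(ad.get(field, "")).lower().split()) for field in fields
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def index_ads(ads):
    """Keep one ad per signature; a later ad overwrites an earlier one."""
    index = {}
    for ad in ads:
        index[signature(ad)] = ad
    return index


ads = [
    {"id": "a1", "title": "Home in Tennessee", "description": "3 bedrooms"},
    {"id": "a2", "title": "home in  tennessee", "description": "3 Bedrooms"},
    {"id": "a3", "title": "Flat in Prague", "description": "2 bedrooms"},
]
index = index_ads(ads)
kept_ids = sorted(ad["id"] for ad in index.values())
```

Here a1 and a2 normalize to the same text, so a2 (the later one) replaces a1 and only a2 and a3 remain.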
Near-duplicate detection
          Ads coming from different sources




Custom features – Near-duplicate detection
      Ads coming from different sources


      ● Why calculate them at index time?
              ● To avoid loading the FieldCache of a “big field” at search time,
                        which is very memory consuming!




Custom features – Pseudo field collapsing


      ● We don't want the first result pages to show only ads from the
                        same sources
      ● “Bad” results will be sent to the later pages
      ● SOLR-236 makes a double trip, which is not so good in performance
                        terms
      ● Core hack to avoid the double trip... SOLR-1311
      ● Does not support proper distributed search at the moment
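
The idea of pushing repeated sources to later pages in a single pass can be sketched as follows. This is only an illustration of the behaviour described above, not the SOLR-236 or SOLR-1311 code; the per-source limit and the "source" key are assumptions.

```python
from collections import Counter


def pseudo_collapse(results, max_per_source=2):
    """Single pass over relevance-sorted results.

    Keeps up to max_per_source ads per source at the front and defers the
    rest to the end, preserving relevance order inside each group.
    """
    seen = Counter()
    front, deferred = [], []
    for ad in results:
        seen[ad["source"]] += 1
        if seen[ad["source"]] <= max_per_source:
            front.append(ad)
        else:
            deferred.append(ad)
    return front + deferred


ranked = [
    {"id": 1, "source": "A"}, {"id": 2, "source": "A"}, {"id": 3, "source": "A"},
    {"id": 4, "source": "B"}, {"id": 5, "source": "A"}, {"id": 6, "source": "C"},
]
reordered = [ad["id"] for ad in pseudo_collapse(ranked)]
```

With a limit of 2 per source, ads 3 and 5 from source A move behind the ads from sources B and C.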




Custom features – Special ranking for Sponsored Ads
          ● Relevance is not the only thing that matters. External factors are
          important too.
          ● Implemented using a Solr SearchComponent
          ● External factors are loaded from a resource and used
                        in a Lucene FieldComparatorSource to alter the
                        score of the documents
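
A minimal sketch of the ranking idea, assuming the external factor is a per-ad multiplier applied to the relevance score. The slides do not say what the factors are or how they are combined, so both the multiplication and the sample values below are assumptions.

```python
def rank_sponsored(hits, external_factors, default_factor=1.0):
    """Sort hits by relevance score multiplied by a per-ad external factor.

    external_factors maps ad id -> multiplier (hypothetical, loaded elsewhere).
    """
    def adjusted(hit):
        return hit["score"] * external_factors.get(hit["id"], default_factor)

    return sorted(hits, key=adjusted, reverse=True)


hits = [
    {"id": "ad1", "score": 2.0},
    {"id": "ad2", "score": 1.5},
    {"id": "ad3", "score": 1.0},
]
factors = {"ad2": 2.0, "ad3": 0.5}
order = [hit["id"] for hit in rank_sponsored(hits, factors)]
```

In this example ad2 overtakes ad1 even though its plain relevance score is lower.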




Custom features – Hacked DataImportHandler
      ● DIH is a tool to index data into Solr from different sources
      (XML, text, databases...)
      ● Extended transformers to alter data before it is indexed
      ● Delta imports are not meant for updating huge
      amounts of rows. Doing that can end up causing memory
      problems
      ● If something crashes we have to reindex, which can sometimes
      take a long time. We want to resume from the last indexed
      doc
      ● Hacks to allow us to use it as a distributed indexer
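
Resuming from the last indexed doc can be pictured as a checkpoint that is advanced after every successful batch. The sketch below is not DIH code; the function names, the in-memory checkpoint and the assumption that rows arrive ordered by a numeric id are all hypothetical.

```python
def index_with_checkpoint(rows, send_batch, checkpoint, batch_size=2):
    """Index rows (ordered by id), skipping everything up to checkpoint['last_id']."""
    last_id = checkpoint.get("last_id", -1)
    pending = [row for row in rows if row["id"] > last_id]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        send_batch(batch)
        # Progress is recorded only after the batch was sent successfully.
        checkpoint["last_id"] = batch[-1]["id"]
    return checkpoint


rows = [{"id": i} for i in range(1, 6)]
indexed, checkpoint, calls = [], {}, {"n": 0}


def flaky_send(batch):
    """Stand-in for the real indexer; fails on its second call to simulate a crash."""
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated crash")
    indexed.extend(row["id"] for row in batch)


try:
    index_with_checkpoint(rows, flaky_send, checkpoint)
except RuntimeError:
    pass  # the first run dies after ids 1 and 2 were indexed

index_with_checkpoint(rows, flaky_send, checkpoint)  # resumes at id 3
```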

Sharding

          First strategy
          ● No distributed IDF at the moment, so it is better to choose
          the shard where a doc is indexed randomly:
                  SolrDocUniqueField.hashCode % NumberOfShards = ShardNumber

          ● Once we started keeping track of near duplicates among
          ads from different sources, this was no longer good enough.
                  Why? The dups system is based on SOLR-236: duplicated
                        documents must be indexed on the same shard to
                        be detected!


Sharding

       Second strategy
       ● The hashCode of the signature field decides the shard number
       ● This forces the signature field to be calculated in the
         warehouse, so that we already have it when the indexing
         process starts
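
The difference between the two strategies can be sketched in a few lines. The sketch uses CRC32 as a stable stand-in for Java's hashCode, and the ids and signature values are made up; the point is only that hashing the signature sends near duplicates to the same shard, while hashing the unique id gives no such guarantee.

```python
import zlib


def shard_for(key, num_shards):
    """Map a string key to a shard number in [0, num_shards) using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


ads = [
    {"id": "site-a/123", "signature": "sig-42"},
    {"id": "site-b/987", "signature": "sig-42"},  # near duplicate from another source
]

# First strategy: shard chosen from the unique id (duplicates may be split up).
by_id = [shard_for(ad["id"], 4) for ad in ads]
# Second strategy: shard chosen from the signature (duplicates always colocated).
by_signature = [shard_for(ad["signature"], 4) for ad in ads]
```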


       Third and future strategy
       ● Calculate duplicates in the warehouse
       ● There will be no need for the dups to be in the same shard
                        anymore
Future directions
         Proper distributed IDF
         ● Allows absolute relevance across shards,
                giving more accurate results
         ● Issue SOLR-1632
         ● Still some bugs, especially when using boosting functions
         ● Allows better sharding strategies. No need to choose the
                shard number randomly anymore.
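
Why per-shard IDF hurts relevance can be seen with a tiny calculation. The sketch uses the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are invented for illustration and do not come from the slides.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# (docFreq of the term, numDocs) for each shard; numbers are made up.
shards = [(9, 1000), (499, 1000)]

# Without distributed IDF each shard scores the same term differently.
local_idfs = [idf(doc_freq, num_docs) for doc_freq, num_docs in shards]

# With distributed IDF the statistics are summed across shards first.
global_idf = idf(
    sum(doc_freq for doc_freq, _ in shards),
    sum(num_docs for _, num_docs in shards),
)
```

The term looks rare on the first shard and common on the second, so scores from the two shards are not comparable until a single global value is used. This is also why a random, even distribution of docs was the safer choice without distributed IDF.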




Future directions
      Load balance with ZooKeeper (SolrCloud)
      ● Use SolrCloud to manage sharding
      ● Currently being committed to trunk
      ● Replace the load balancer with ZooKeeper
      ● Let ZooKeeper handle distributed configuration




?
Thank you
                                    for your attention

                                                          Marc Sturlese
                                                                 Trovit

                                                        marc@trovit.com
                        Apache Lucene Eurocon 2010, Prague, 20 May 2010

Apache Lucene EuroCon    05/16/10

Weitere ähnliche Inhalte

Ähnlich wie Use of-solr-at-trovit-classified-ads marc-sturlese

Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018Andrey Zagrebin
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveAndrea Gazzarini
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This WorksSease
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk
 
Splunk Ninja: New Features, Pivot and Search Dojo
 Splunk Ninja: New Features, Pivot and Search Dojo Splunk Ninja: New Features, Pivot and Search Dojo
Splunk Ninja: New Features, Pivot and Search DojoSplunk
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk
 
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not know
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not knowOWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not know
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not knowOWASP
 
How OPNFV Uses OpenStack & How It's Useful
How OPNFV Uses OpenStack & How It's UsefulHow OPNFV Uses OpenStack & How It's Useful
How OPNFV Uses OpenStack & How It's UsefulOPNFV
 
Developing SDN apps in Ryu
Developing SDN apps in RyuDeveloping SDN apps in Ryu
Developing SDN apps in RyuChe Wei Lin
 
From Generator to Fiber the Road to Coroutine in PHP
From Generator to Fiber the Road to Coroutine in PHPFrom Generator to Fiber the Road to Coroutine in PHP
From Generator to Fiber the Road to Coroutine in PHPAlbert Chen
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to SolrCominvent AS
 
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring Insight
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring InsightSplunk conf2014 - Using Selenium and Splunk for Transaction Monitoring Insight
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring InsightSplunk
 
NBN:URN Generator and Resolver
NBN:URN Generator and ResolverNBN:URN Generator and Resolver
NBN:URN Generator and Resolverhorvadam
 
Deploying Splunk. Arquitetura e dimensionamento do Splunk
Deploying Splunk. Arquitetura e dimensionamento do SplunkDeploying Splunk. Arquitetura e dimensionamento do Splunk
Deploying Splunk. Arquitetura e dimensionamento do SplunkSplunk
 
BKK16-106 ODP Project Update
BKK16-106 ODP Project UpdateBKK16-106 ODP Project Update
BKK16-106 ODP Project UpdateLinaro
 
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...Linaro
 

Ähnlich wie Use of-solr-at-trovit-classified-ads marc-sturlese (20)

Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018
 
Erlangfactory
ErlangfactoryErlangfactory
Erlangfactory
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This Works
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search Dojo
 
Splunk Ninja: New Features, Pivot and Search Dojo
 Splunk Ninja: New Features, Pivot and Search Dojo Splunk Ninja: New Features, Pivot and Search Dojo
Splunk Ninja: New Features, Pivot and Search Dojo
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search Dojo
 
Splunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search DojoSplunk Ninjas: New Features, Pivot and Search Dojo
Splunk Ninjas: New Features, Pivot and Search Dojo
 
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not know
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not knowOWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not know
OWASP Poland Day 2018 - Jakub Botwicz - AFL that you do not know
 
How OPNFV Uses OpenStack & How It's Useful
How OPNFV Uses OpenStack & How It's UsefulHow OPNFV Uses OpenStack & How It's Useful
How OPNFV Uses OpenStack & How It's Useful
 
Developing SDN apps in Ryu
Developing SDN apps in RyuDeveloping SDN apps in Ryu
Developing SDN apps in Ryu
 
From Generator to Fiber the Road to Coroutine in PHP
From Generator to Fiber the Road to Coroutine in PHPFrom Generator to Fiber the Road to Coroutine in PHP
From Generator to Fiber the Road to Coroutine in PHP
 
معماری Splunk
معماری Splunkمعماری Splunk
معماری Splunk
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to Solr
 
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring Insight
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring InsightSplunk conf2014 - Using Selenium and Splunk for Transaction Monitoring Insight
Splunk conf2014 - Using Selenium and Splunk for Transaction Monitoring Insight
 
NBN:URN Generator and Resolver
NBN:URN Generator and ResolverNBN:URN Generator and Resolver
NBN:URN Generator and Resolver
 
Deploying Splunk. Arquitetura e dimensionamento do Splunk
Deploying Splunk. Arquitetura e dimensionamento do SplunkDeploying Splunk. Arquitetura e dimensionamento do Splunk
Deploying Splunk. Arquitetura e dimensionamento do Splunk
 
BKK16-106 ODP Project Update
BKK16-106 ODP Project UpdateBKK16-106 ODP Project Update
BKK16-106 ODP Project Update
 
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...
TSC Sponsored BoF: Can Linux and Automotive Functional Safety Mix ? Take 2: T...
 

Kürzlich hochgeladen

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 

Kürzlich hochgeladen (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 

Use of-solr-at-trovit-classified-ads marc-sturlese

  • 1. 1 U s a g e of S olr a t T r ov it A Search Engine For Classified Ads Marc Sturlese Trovit marc@trovit.com Apache Lucene Eurocon 2010, Prague, 20 May 2010 Apache Lucene EuroCon 4 May 2010
  • 2. Agenda ● Trovit, a Solr use case ● Types of index ● Architecture overview ● Relevance tuning ● Out of the box features ● Custom features ● Sharding ● Future directions ● Questions Apache Lucene EuroCon 05/16/10
  • 3. W h a t is T r o v it? A S e a r c h E n g in e F o r C la s s ifie d A d s Apache Lucene EuroCon 05/16/10
  • 4. T y pe s o f in de x There are 3 different types of index ● Organic ads index ● Sponsored ads index ● Recommended searches index There is an index per country and per business category for every type... what means a total of 180 index Some of them are sharded. All of them have replicas. Apache Lucene EuroCon 05/16/10
  • 5. T y pe s o f in de x Captura donde se vean los 3 tipos de índice Apache Lucene EuroCon 05/16/10
  • 6. A r qu ite ctu r e o v e r v ie w crawling / parsing wharehouse indexing Solr indexer back end replication Solr slaves load balancer frontal load balancing load balancer front end request Apache Lucene EuroCon 05/16/10 6
  • 7. A r ch ite ctu r e o v e r v ie w M a s te r s - I n de x in g ● 4 servers. Continuously updating index sequentially ● 1 server to index organic ads for all countries/categories ● 1 server to index powered ads for all countries/categories ● 1 server to index recommended searches for all countries/categories S la v e s – S e r v in g s e a r c h r e q u e s ts ● Index with high traffic have 4 replicas ● Indexs with less traffic have 3 replicas Apache Lucene EuroCon 05/16/10
  • 8. A r qu ite ctu r e o v e r v ir e w ● Index are replicated using modified c o l l e c t i o n d i s t r i b u t i o n scripts to allow multi core ● Snapshooter and snappuller are sequentially executed ● Snapinstaller is executed at the same time on each slave to preserve exactly the same content all the time ● Started load balancing with P e r l b a l . It was producing high CPU loads Apache Lucene EuroCon 05/16/10
  • 9. L ife o f a u s e r s e a r ch r e qu e s t For every user search: ● A request is done to the organic and sponsored index ● Per each result of the organic search, a request to the recommended searches ads is done ● 13 Solr request per user search! And once this is done... The user search request is going to be batch processed to decide if it must be indexed in the similar user searches index Apache Lucene EuroCon 05/16/10
  • 10. L ife o f a u s e r s e a r ch r e qu e s t Apache Lucene EuroCon 05/16/10
  • 11. R e le v a n c e tu n in g ● Basic searches use dismax qt. Build on top of Lucenes DisjunctionMaxQuery ● Boosting queries to make latest ads more relevant ● Boost some ads at document level at indexing time to make them more important than others ● Boost ads at field level at query time to make the match more important in some fields than in others Apache Lucene EuroCon 05/16/10
  • 12. R e le v a n c e tu n in g Us er s ea r ch: hom e tennes s ee ● Higher quality ad ● Lower quality ad Apache Lucene EuroCon 05/16/10
  • 13. O u t o f th e bo x S o lr fe a tu r e s ● Synonyms for USA states ● Per country and per business category stopwords ● MoreLikeThis request handler ● TrieFields to index housing latitude and longitude ● Facet fields, queries and dates. ● Warming queries from a specific file using an EventListener. Issue SOLR-784 Apache Lucene EuroCon 05/16/10
  • 14. O u t o f th e bo x S o lr fe a tu r e s : M o r e L ik e T h is Apache Lucene EuroCon 05/16/10
  • 15. O u t o f th e bo x S o lr fe a tu r e s : U s a g e o f T r ie F ie ld s Apache Lucene EuroCon 05/16/10
  • 16. Cus tom fe a tu r e s ● Duplicates detection ● Coming from the same source: Indexing time ● Coming from different sources: Indexing and search time ● Pseudo field collapsing ● Custom ranking for sponsored ads ● Custom Data Import Handler for full indexing and updates Apache Lucene EuroCon 05/16/10
  • 17. Custom features: Near duplicates detection
      ● Ads coming from the same source
          ● The last one to arrive is the one kept in the index
          ● Deduplication method using SignatureUpdateProcessor
          ● Small hack to customize the TextProfileSignature
      ● Ads coming from different sources
          ● Give the user the chance to decide which source to visit
          ● Based on the field collapsing issue (SOLR-236) and the SignatureUpdateProcessor used in deduplication
          ● Done in 2 steps, one at index time and one at search time
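The same-source case above (signature-based deduplication where the last ad wins) can be sketched as follows. This is a simplified stand-in written for illustration, not Solr's TextProfileSignature and not Trovit's customised version: the tokenisation, the `top_n` cutoff and the `text` field are all assumptions.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, top_n=10):
    """Hash a small term-frequency profile so that near-identical texts collide."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Most frequent tokens first; ties broken alphabetically for determinism.
    profile = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    canonical = " ".join(f"{token}:{n}" for token, n in profile)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def index_with_overwrite(ads):
    """Keep one ad per signature; a later ad replaces an earlier duplicate."""
    index = {}
    for ad in ads:
        index[fuzzy_signature(ad["text"])] = ad
    return index


ads = [
    {"id": 1, "text": "Sunny 2 bedroom flat in Barcelona!"},
    {"id": 2, "text": "sunny 2 bedroom flat, in barcelona"},
    {"id": 3, "text": "Studio in Madrid"},
]
index = index_with_overwrite(ads)
```

Ads 1 and 2 differ only in case and punctuation, so they share a signature and only ad 2 survives.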
  • 18. Near duplicates detection: Ads coming from different sources
  • 19. Custom features: Near duplicates detection
      Ads coming from different sources
      ● Why calculate them at index time?
          ● To avoid loading the FieldCache of a “big field” at search time. Very memory consuming!
  • 20. Custom features: Pseudo field collapsing
      ● We don't want the first results pages to show only ads from the same sources
      ● “Bad” results will be sent to the later pages
      ● SOLR-236 makes a double trip, which is not so good in performance terms
      ● Core hack to avoid the double trip... SOLR-1311
      ● Does not support proper distributed search at the moment
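The effect described above, pushing ads from over-represented sources to later pages, can be illustrated with a single pass over an already ranked result list. This is only a sketch of the idea under an assumed limit of two ads per source; it is not the SOLR-236 or SOLR-1311 code, and the `source` field name is hypothetical.

```python
def demote_repeated_sources(results, max_per_source=2):
    """Move results beyond max_per_source for a source to the end, keeping rank order."""
    seen = {}
    kept, demoted = [], []
    for doc in results:
        count = seen.get(doc["source"], 0)
        seen[doc["source"]] = count + 1
        if count < max_per_source:
            kept.append(doc)
        else:
            demoted.append(doc)
    return kept + demoted


ranked = [
    {"id": 1, "source": "a"},
    {"id": 2, "source": "a"},
    {"id": 3, "source": "a"},
    {"id": 4, "source": "b"},
    {"id": 5, "source": "a"},
    {"id": 6, "source": "c"},
]
reordered_ids = [doc["id"] for doc in demote_repeated_sources(ranked)]
```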
  • 21. Custom features: Special ranking for sponsored ads
      ● Not just relevance is important. External factors are important too
      ● Implemented using a Solr SearchComponent
      ● External factors are loaded from a resource and used in a Lucene FieldComparatorSource to alter the score of the documents
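The ranking idea above can be sketched outside Solr as a sort that combines the relevance score with an external factor per ad. The slides do not say how the two are combined, so the multiplication used here, the default factor and the ad ids are assumptions chosen only to make the example concrete.

```python
def rank_sponsored(docs, external_factors, default_factor=1.0):
    """Sort docs by relevance score adjusted with a per-ad external factor."""
    def adjusted_score(doc):
        # Assumption: the factor multiplies the score; ads without a factor are unchanged.
        return doc["score"] * external_factors.get(doc["id"], default_factor)
    return sorted(docs, key=adjusted_score, reverse=True)


docs = [
    {"id": "ad-A", "score": 2.0},
    {"id": "ad-B", "score": 1.5},
]
factors = {"ad-B": 2.0}  # hypothetical external factor loaded from a resource
ranked_ids = [doc["id"] for doc in rank_sponsored(docs, factors)]
```

In Solr the equivalent logic would live in a custom comparator rather than a post-sort, so that it applies while the top documents are being collected.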
  • 22. Custom features: Hacked DataImportHandler
      ● DIH is a tool to index data into Solr from different sources (XML, text, databases...)
      ● Extended transformers to alter data before it is indexed
      ● Delta imports are not meant for updating huge amounts of rows. Doing that can end up with memory problems
      ● If something crashes we have to reindex, which can sometimes take a long time. We want to keep going from the last indexed doc
      ● Hacks to allow us to use it as a distributed indexer
  • 23. Sharding
      First strategy
      ● No distributed IDFs at the moment. Better to choose randomly the shard where a doc is indexed:
        SolrDocUniqueField.hashCode % NumberOfShards = ShardNumber
      ● Once we started keeping track of near duplicates among ads from different sources, this was not good anymore. Why?
        The dups system is based on SOLR-236: duplicated documents must be indexed on the same shard to be detected!
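The first strategy can be sketched as follows. The hash function reproduces Java's `String.hashCode` for characters in the Basic Multilingual Plane, and the remainder is taken so that it is never negative. The document ids are made up, and reading the slide's formula as a modulo operation is an interpretation.

```python
def java_string_hash(s):
    """Java's String.hashCode as a signed 32-bit integer (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_for(unique_id, num_shards):
    """Pick a shard from the unique field; Python's % is non-negative for num_shards > 0."""
    return java_string_hash(unique_id) % num_shards


doc_ids = ["ad-1001", "ad-1002", "ad-1003", "ad-1004"]  # hypothetical ids
assignment = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
```

Because the shard depends only on the unique id, two near-duplicate ads with different ids can land on different shards, which is the problem the slide describes.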
  • 24. Sharding
      Second strategy
      ● The hash code of the signature field decides the shard number
      ● This forces the signature field to be calculated in the warehouse, so that it is already available when the indexing process starts
      Third and future strategy
      ● Calculate duplicates in the warehouse
      ● There will be no need for the dups to be in the same shard anymore
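The second strategy routes by signature instead of by id, so ads that share a signature always meet on the same shard. In this sketch `zlib.crc32` stands in for whatever hash is actually used, and the signature values are placeholders rather than real computed signatures.

```python
import zlib


def shard_for_signature(signature, num_shards):
    """Route a document by the hash of its precomputed signature field."""
    return zlib.crc32(signature.encode("utf-8")) % num_shards


# Two ads from different sources that the warehouse marked as duplicates,
# plus one unrelated ad. Ids and signatures are hypothetical.
ads = [
    {"id": "source1-17", "signature": "sig-flat-barcelona"},
    {"id": "source2-93", "signature": "sig-flat-barcelona"},
    {"id": "source1-18", "signature": "sig-studio-madrid"},
]
placement = {ad["id"]: shard_for_signature(ad["signature"], 4) for ad in ads}
```

The price is that the signature must exist before indexing starts, which is why the slide moves its calculation into the warehouse.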
  • 25. Future directions
      Proper distributed IDFs
      ● Allows absolute relevance among shards. More accurate results
      ● Issue SOLR-1632
      ● Still some bugs, especially when using boosting functions
      ● Allows better sharding strategies. No need to choose the shard number randomly anymore
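A small worked example shows why per-shard IDF makes scores hard to compare across shards. It uses the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are invented for illustration.

```python
import math


def lucene_classic_idf(doc_freq, num_docs):
    """Classic Lucene inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# The same term is rare on shard A and common on shard B (made-up numbers).
shard_a = {"num_docs": 1000, "doc_freq": 9}
shard_b = {"num_docs": 1000, "doc_freq": 499}

idf_a = lucene_classic_idf(shard_a["doc_freq"], shard_a["num_docs"])
idf_b = lucene_classic_idf(shard_b["doc_freq"], shard_b["num_docs"])

# A distributed IDF would use collection-wide statistics instead.
idf_global = lucene_classic_idf(
    shard_a["doc_freq"] + shard_b["doc_freq"],
    shard_a["num_docs"] + shard_b["num_docs"],
)
```

With local statistics the term weighs far more on shard A than on shard B, so equally good matches get different scores depending on where they were indexed. Spreading documents randomly keeps the per-shard statistics similar, which is why random assignment was preferred while distributed IDF was unavailable.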
  • 26. Future directions
      Load balancing with ZooKeeper (SolrCloud)
      ● Use SolrCloud to manage sharding
      ● Currently being committed to trunk
      ● Replace the load balancer with ZooKeeper
      ● Let ZooKeeper handle distributed configuration
  • 28. Thank you for your attention
      Marc Sturlese, Trovit
      marc@trovit.com
      Apache Lucene Eurocon 2010, Prague, 20 May 2010