SlideShare ist ein Scribd-Unternehmen logo
1 von 41
Downloaden Sie, um offline zu lesen
Grouping & Joining
                Martijn van Groningen
                martijn.vangroningen@searchworkings.com
                Lucene Committer & PMC Member




Thursday, May 17, 2012
Grouping & Joining

       Overview
      ‣ Background



      ‣ Joining



      ‣ Result grouping



      ‣ Conclusion




                          Searchworkings.org - The online search community   2
Thursday, May 17, 2012
Background

       Lucene’s model
      ‣ Lucene is document based.



      ‣ Lucene doesn’t store information about relations between documents.



      ‣ Data often holds relations.



      ‣ Good free text search over relational data.




                          Searchworkings.org - The online search community    3
Thursday, May 17, 2012
Background

       Example
         ‣ Product

               ‣ Name

               ‣ Description

         ‣ Product-item

               ‣ Color

               ‣ Size

               ‣ Price



         ‣ Goal: Show the most applicable product based on product-item criteria.
                               Searchworkings.org - The online search community   4
Thursday, May 17, 2012
Background

       Common Lucene solutions
      ‣ Compound documents.

            ‣ May result in documents with many fields.

      ‣ Subsequent searches.

            ‣ May cause a lot network overhead.




      ‣ Non Lucene based approach:

            ‣ If free text search isn’t very important use a relational database.


                             Searchworkings.org - The online search community       5
Thursday, May 17, 2012
Background

       Example domain
      ‣ Compound Product & Product-items document.

      ‣ Each product-item has its own field prefix.




                          Searchworkings.org - The online search community   6
Thursday, May 17, 2012
Background

       Different solutions
      ‣ Lucene offers solutions to have a 'relational' like search.

            ‣ Parent child



      ‣ Grouping & joining aren't naturally supported.

            ‣ All the solutions do increase the search time.



      ‣ Some scenarios grouping and joining isn't the right solution.




                             Searchworkings.org - The online search community   7
Thursday, May 17, 2012
Joining
                Modelling relations




Thursday, May 17, 2012
Joining

       Introduction
      ‣ Support for parent child like search from Lucene 3.4

            ‣ Not a SQL join.



      ‣ The parent and each children are stored as documents.



      ‣ Two types:

            ‣ Index time join

            ‣ Query time join


                                Searchworkings.org - The online search community   9
Thursday, May 17, 2012
Joining

       Index time join
      ‣ Two block join queries:

            ‣ ToParentBlockJoinQuery

            ‣ ToChildBlockJoinQuery



      ‣ One Lucene collector:

            ‣ ToParentBlockJoinCollector



      ‣ Index time join requires block indexing.


                            Searchworkings.org - The online search community   10
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Atomically adding documents.

            ‣ A block of documents.



      ‣ Each document gets sequentially assigned Lucene document id.



      ‣ IndexWriter#addDocuments(docs);




                            Searchworkings.org - The online search community   11
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Index doesn't record blocks.

            ‣ Segment merging doesn’t re-order documents in a segment.



      ‣ App is responsible for identifying block documents.

            ‣ Marking the last document in a block.



      ‣ Adding a document to a block requires you to reindex the whole block.

      ‣ Removing a document from a block doesn’t requires reindexing a block.


                            Searchworkings.org - The online search community    12
Thursday, May 17, 2012
Joining

       Example domain
      ‣ Parent is the last document in a block.




                          Searchworkings.org - The online search community   13
Thursday, May 17, 2012
Joining

       Block indexing

                             Marking parent documents




                         Searchworkings.org - The online search community   14
Thursday, May 17, 2012
Joining

       Block indexing




                                                                            Add block




                                                                             Add block



                         Searchworkings.org - The online search community                15
Thursday, May 17, 2012
Joining

       ToParentBlockJoinQuery
      ‣ Parent filter marks the parent documents.



      ‣ Child query is executed in the parent space.




      ‣ ToChildBlockJoinQuery works in the opposite direction.
                         Searchworkings.org - The online search community   16
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ Query time joining is executed in two phases and is field based:

            ‣ fromField

            ‣ toField



      ‣ Doesn’t require block indexing.




                          Searchworkings.org - The online search community   17
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ First phase collects all the terms in the fromField for the documents
        that match with the original query.

            ‣ Currently doesn’t take the score from original query into account.



      ‣ The second phase returns the documents that match with the collected
        terms from the previous phase in the toField.



      ‣ Two different implementations:

            ‣ JoinUtil - Lucene (≥ 3.6)

            ‣ Join query parser - Solr (trunk)
                             Searchworkings.org - The online search community      18
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                           Referrer the product id.
                         Searchworkings.org - The online search community   19
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                         Searchworkings.org - The online search community   20
Thursday, May 17, 2012
Joining

       Query time joining




      ‣ Result will contain one product.

      ‣ Possible to join over two indices.




                          Searchworkings.org - The online search community   21
Thursday, May 17, 2012
Joining

       Final thoughts
      ‣ Joining module has good solutions to model parent child relations.



      ‣ Use block join if you care about scoring.

            ‣ Frequent updates can be problematic.

      ‣ Use query time join for parent child filtering.

            ‣ Query time join is slower than index time join.



      ‣ Mostly a Lucene feature only.

      ‣ All code is annotated as experimental.
                             Searchworkings.org - The online search community   22
Thursday, May 17, 2012
Result grouping
                Previously known as Field Collapsing.




Thursday, May 17, 2012
Result grouping

       Introduction
      ‣ Group matching documents that share a common property.



      ‣ Search hit represents a group.

            ‣ Facet counts & total hit count represent groups.



      ‣ Per group collect information

            ‣ Most relevant document.

            ‣ Top three documents.

            ‣ Aggregated counts
                             Searchworkings.org - The online search community   24
Thursday, May 17, 2012
Result grouping

       Usages
      ‣ Group documents by a shared property

            ‣ Product-item by product id (Parent child)



      ‣ Collapse similar looking documents

            ‣ E.g. all results from the Wikipedia domains.



      ‣ Remove duplicates from the search result.

            ‣ Based on a field that contains a hash


                             Searchworkings.org - The online search community   25
Thursday, May 17, 2012
Result grouping

       Example domain




      ‣ Each Product-item is a document, but includes the product data.


                         Searchworkings.org - The online search community   26
Thursday, May 17, 2012
Result grouping

       Implementation
      ‣ Result grouping implemented with Lucene collectors.

            ‣ Module in trunk and a contrib in 3.x versions.



      ‣ Two pass result grouping.

            ‣ Grouping by indexed field, function or doc values.



      ‣ Single pass result grouping.

            ‣ Requires block indexing.


                             Searchworkings.org - The online search community   27
Thursday, May 17, 2012
Result grouping

       Two pass implementation
      ‣ First pass collects the top N groups.

            ‣ Per group: group value + sort value



      ‣ Second pass collects data for each top group.

            ‣ The top N documents per group.

            ‣ Possible other aggregated information.



      ‣ Second pass search ignores all documents outside topN groups.


                            Searchworkings.org - The online search community   28
Thursday, May 17, 2012
Result grouping

       Result grouping - Indexing




                         Searchworkings.org - The online search community   29
Thursday, May 17, 2012
Result grouping

       Result grouping - Searching




                         Searchworkings.org - The online search community   30
Thursday, May 17, 2012
Result grouping

       Result grouping made easier
     ‣ GroupingSearch




     ‣ Solr

         ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

     ‣ Many more options:

         ‣ http://wiki.apache.org/solr/FieldCollapsing
                            Searchworkings.org - The online search community     31
Thursday, May 17, 2012
Result grouping

       Parent child result
      ‣ TopGroups - Equivalent to TopDocs.

            ‣ Hit count

            ‣ Group count

            ‣ Groups

                  ‣ Top documents



      ‣ Facet and total count can represent groups instead of documents.

            ‣ But requires more query time.


                              Searchworkings.org - The online search community   32
Thursday, May 17, 2012
Conclusion
                Compare...




Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Result grouping

            ‣ + Distributed support & Parent child relation as hit.

            ‣ - Parent data duplication

            ‣ - Impact on query time



      ‣ Joining

            ‣ + Fast & no data duplication

            ‣ - Index time join not optimal for updates

            ‣ - Query time join is limited.
                              Searchworkings.org - The online search community   34
Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Compound documents.

            ‣ + Fast and works out-of-the box with all features.

            ‣ - Not flexible when it comes to updates.

            ‣ - Document granularity is set in stone.




                             Searchworkings.org - The online search community   35
Thursday, May 17, 2012
Any questions?




                                          36

Thursday, May 17, 2012
Extra slides
                We have time left!




Thursday, May 17, 2012
Conclusion

       Future work
       ‣ Higher level parent-child API.

             ‣ Needs to cover search & indexing.



       ‣ Joining

             ‣ Distributed support.

             ‣ Represent a hit as a parent child relation in the search result.



       ‣ Result grouping

             ‣ Aggregated grouped information like: sum, avg, min, max etc.
                              Searchworkings.org - The online search community    38
Thursday, May 17, 2012
Joining

       ToParentBlockJoinCollector




         ‣ TopGroups contains a group per top N parent document.

         ‣ Each group contains a parent and child documents.

                          Searchworkings.org - The online search community   39
Thursday, May 17, 2012
Result grouping

       Groups & facet counts
      ‣ Faceting and result grouping are different features.

            ‣ But are often used together!



      ‣ Facet counts can be based on:

            ‣ Found documents.

            ‣ Found groups.

            ‣ Combination of facet value and group.



      ‣ All options are supported in Solr.
                              Searchworkings.org - The online search community   40
Thursday, May 17, 2012
Result grouping

       Doc values
      ‣ Doc values / Column Stride values



      ‣ Prevents the creation of expensive data structures in FieldCache.



      ‣ Inverted index is meant for free text search.



      ‣ All grouping collectors have doc values based implementations!




                          Searchworkings.org - The online search community   41
Thursday, May 17, 2012

Weitere ähnliche Inhalte

Was ist angesagt?

A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략Jin wook
 
RxJS + Redux + React = Amazing
RxJS + Redux + React = AmazingRxJS + Redux + React = Amazing
RxJS + Redux + React = AmazingJay Phelps
 
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스NAVER D2
 
JSON and the Oracle Database
JSON and the Oracle DatabaseJSON and the Oracle Database
JSON and the Oracle DatabaseMaria Colgan
 
Using Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query PerformanceUsing Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query Performanceoysteing
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexingYoshinori Matsunobu
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrAnshum Gupta
 
Functional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldFunctional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldJorge Vásquez
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PGConf APAC
 
PostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performancePostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performanceVladimir Sitnikov
 
Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019Dieter Adriaenssens
 
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)Jens Hadlich
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Redux Sagas - React Alicante
Redux Sagas - React AlicanteRedux Sagas - React Alicante
Redux Sagas - React AlicanteIgnacio Martín
 
Introducing Spirit - Online Schema Change
Introducing Spirit - Online Schema ChangeIntroducing Spirit - Online Schema Change
Introducing Spirit - Online Schema ChangeMorgan Tocker
 

Was ist angesagt? (20)

A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략
 
RxJS + Redux + React = Amazing
RxJS + Redux + React = AmazingRxJS + Redux + React = Amazing
RxJS + Redux + React = Amazing
 
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
백억개의 로그를 모아 검색하고 분석하고 학습도 시켜보자 : 로기스
 
JSON and the Oracle Database
JSON and the Oracle DatabaseJSON and the Oracle Database
JSON and the Oracle Database
 
Using Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query PerformanceUsing Optimizer Hints to Improve MySQL Query Performance
Using Optimizer Hints to Improve MySQL Query Performance
 
MongoDB Tick Data Presentation
MongoDB Tick Data PresentationMongoDB Tick Data Presentation
MongoDB Tick Data Presentation
 
Redis
RedisRedis
Redis
 
More mastering the art of indexing
More mastering the art of indexingMore mastering the art of indexing
More mastering the art of indexing
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
Functional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldFunctional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorld
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
PostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performancePostgreSQL and JDBC: striving for high performance
PostgreSQL and JDBC: striving for high performance
 
Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019Errant GTIDs breaking replication @ Percona Live 2019
Errant GTIDs breaking replication @ Percona Live 2019
 
MySQL Router REST API
MySQL Router REST APIMySQL Router REST API
MySQL Router REST API
 
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Redux Sagas - React Alicante
Redux Sagas - React AlicanteRedux Sagas - React Alicante
Redux Sagas - React Alicante
 
JSON in Solr: from top to bottom
JSON in Solr: from top to bottomJSON in Solr: from top to bottom
JSON in Solr: from top to bottom
 
Introducing Spirit - Online Schema Change
Introducing Spirit - Online Schema ChangeIntroducing Spirit - Online Schema Change
Introducing Spirit - Online Schema Change
 

Ähnlich wie Grouping and Joining in Lucene/Solr

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDBPablo Godel
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesPrateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy colodaKBTNHKU
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystemChris Mills
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionGudmundur Thorisson
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Pablo Godel
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing VU University Amsterdam
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsOlafGoerlitz
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentMartin Kaltenböck
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentialsreeep
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataPrateek Jain
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS projectRafael C. Jimenez
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...Cheryl McKinnon
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedB. Hamilton
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
 

Ähnlich wie Grouping and Joining in Lucene/Solr (20)

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDB
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
 
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and QueryingPrateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
 
PhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek JainPhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy coloda
 
NoTube: Models & Semantics
NoTube: Models & SemanticsNoTube: Models & Semantics
NoTube: Models & Semantics
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystem
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout session
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable development
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentials
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS project
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar Updated
 
No More SQL
No More SQLNo More SQL
No More SQL
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge Graphs
 

Mehr von lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Mehr von lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Kürzlich hochgeladen

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Grouping and Joining in Lucene/Solr

  • 1. Grouping & Joining Martijn van Groningen martijn.vangroningen@searchworkings.com Lucene Committer & PMC Member Thursday, May 17, 2012
  • 2. Grouping & Joining Overview ‣ Background ‣ Joining ‣ Result grouping ‣ Conclusion Searchworkings.org - The online search community 2 Thursday, May 17, 2012
  • 3. Background Lucene’s model ‣ Lucene is document based. ‣ Lucene doesn’t store information about relations between documents. ‣ Data often holds relations. ‣ Good free text search over relational data. Searchworkings.org - The online search community 3 Thursday, May 17, 2012
  • 4. Background Example ‣ Product ‣ Name ‣ Description ‣ Product-item ‣ Color ‣ Size ‣ Price ‣ Goal: Show the most applicable product based on product-item criteria. Searchworkings.org - The online search community 4 Thursday, May 17, 2012
  • 5. Background Common Lucene solutions ‣ Compound documents. ‣ May result in documents with many fields. ‣ Subsequent searches. ‣ May cause a lot network overhead. ‣ Non Lucene based approach: ‣ If free text search isn’t very important use a relational database. Searchworkings.org - The online search community 5 Thursday, May 17, 2012
  • 6. Background Example domain ‣ Compound Product & Product-items document. ‣ Each product-item has its own field prefix. Searchworkings.org - The online search community 6 Thursday, May 17, 2012
  • 7. Background Different solutions ‣ Lucene offers solutions to have a 'relational' like search. ‣ Parent child ‣ Grouping & joining aren't naturally supported. ‣ All the solutions do increase the search time. ‣ Some scenarios grouping and joining isn't the right solution. Searchworkings.org - The online search community 7 Thursday, May 17, 2012
  • 8. Joining Modelling relations Thursday, May 17, 2012
  • 9. Joining Introduction ‣ Support for parent child like search from Lucene 3.4 ‣ Not a SQL join. ‣ The parent and each children are stored as documents. ‣ Two types: ‣ Index time join ‣ Query time join Searchworkings.org - The online search community 9 Thursday, May 17, 2012
  • 10. Joining Index time join ‣ Two block join queries: ‣ ToParentBlockJoinQuery ‣ ToChildBlockJoinQuery ‣ One Lucene collector: ‣ ToParentBlockJoinCollector ‣ Index time join requires block indexing. Searchworkings.org - The online search community 10 Thursday, May 17, 2012
  • 11. Joining Block indexing ‣ Atomically adding documents. ‣ A block of documents. ‣ Each document gets sequentially assigned Lucene document id. ‣ IndexWriter#addDocuments(docs); Searchworkings.org - The online search community 11 Thursday, May 17, 2012
  • 12. Joining Block indexing ‣ Index doesn't record blocks. ‣ Segment merging doesn’t re-order documents in a segment. ‣ App is responsible for identifying block documents. ‣ Marking the last document in a block. ‣ Adding a document to a block requires you to reindex the whole block. ‣ Removing a document from a block doesn’t requires reindexing a block. Searchworkings.org - The online search community 12 Thursday, May 17, 2012
  • 13. Joining Example domain ‣ Parent is the last document in a block. Searchworkings.org - The online search community 13 Thursday, May 17, 2012
  • 14. Joining Block indexing Marking parent documents Searchworkings.org - The online search community 14 Thursday, May 17, 2012
  • 15. Joining Block indexing Add block Add block Searchworkings.org - The online search community 15 Thursday, May 17, 2012
  • 16. Joining ToParentBlockJoinQuery ‣ Parent filter marks the parent documents. ‣ Child query is executed in the parent space. ‣ ToChildBlockJoinQuery works in the opposite direction. Searchworkings.org - The online search community 16 Thursday, May 17, 2012
  • 17. Joining Query time joining ‣ Query time joining is executed in two phases and is field based: ‣ fromField ‣ toField ‣ Doesn’t require block indexing. Searchworkings.org - The online search community 17 Thursday, May 17, 2012
  • 18. Joining Query time joining ‣ First phase collects all the terms in the fromField for the documents that match with the original query. ‣ Currently doesn’t take the score from original query into account. ‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField. ‣ Two different implementations: ‣ JoinUtil - Lucene (≥ 3.6) ‣ Join query parser - Solr (trunk) Searchworkings.org - The online search community 18 Thursday, May 17, 2012
  • 19. Joining Query time joining - Indexing Referrer the product id. Searchworkings.org - The online search community 19 Thursday, May 17, 2012
  • 20. Joining Query time joining - Indexing Searchworkings.org - The online search community 20 Thursday, May 17, 2012
  • 21. Joining Query time joining ‣ Result will contain one product. ‣ Possible to join over two indices. Searchworkings.org - The online search community 21 Thursday, May 17, 2012
  • 22. Joining Final thoughts ‣ Joining module has good solutions to model parent child relations. ‣ Use block join if you care about scoring. ‣ Frequent updates can be problematic. ‣ Use query time join for parent child filtering. ‣ Query time join is slower than index time join. ‣ Mostly a Lucene feature only. ‣ All code is annotated as experimental. Searchworkings.org - The online search community 22 Thursday, May 17, 2012
  • 23. Result grouping Previously known as Field Collapsing. Thursday, May 17, 2012
  • 24. Result grouping Introduction ‣ Group matching documents that share a common property. ‣ Search hit represents a group. ‣ Facet counts & total hit count represent groups. ‣ Per group collect information ‣ Most relevant document. ‣ Top three documents. ‣ Aggregated counts Searchworkings.org - The online search community 24 Thursday, May 17, 2012
  • 25. Result grouping Usages ‣ Group documents by a shared property ‣ Product-item by product id (Parent child) ‣ Collapse similar looking documents ‣ E.g. all results from the Wikipedia domains. ‣ Remove duplicates from the search result. ‣ Based on a field that contains a hash Searchworkings.org - The online search community 25 Thursday, May 17, 2012
  • 26. Result grouping Example domain ‣ Each Product-item is a document, but includes the product data. Searchworkings.org - The online search community 26 Thursday, May 17, 2012
  • 27. Result grouping Implementation ‣ Result grouping implemented with Lucene collectors. ‣ Module in trunk and a contrib in 3.x versions. ‣ Two pass result grouping. ‣ Grouping by indexed field, function or doc values. ‣ Single pass result grouping. ‣ Requires block indexing. Searchworkings.org - The online search community 27 Thursday, May 17, 2012
  • 28. Result grouping Two pass implementation ‣ First pass collects the top N groups. ‣ Per group: group value + sort value ‣ Second pass collects data for each top group. ‣ The top N documents per group. ‣ Possible other aggregated information. ‣ Second pass search ignores all documents outside topN groups. Searchworkings.org - The online search community 28 Thursday, May 17, 2012
  • 29. Result grouping Result grouping - Indexing Searchworkings.org - The online search community 29 Thursday, May 17, 2012
  • 30. Result grouping Result grouping - Searching Searchworkings.org - The online search community 30 Thursday, May 17, 2012
  • 31. Result grouping Result grouping made easier ‣ GroupingSearch ‣ Solr ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id ‣ Many more options: ‣ http://wiki.apache.org/solr/FieldCollapsing Searchworkings.org - The online search community 31 Thursday, May 17, 2012
  • 32. Result grouping Parent child result ‣ TopGroups - Equivalent to TopDocs. ‣ Hit count ‣ Group count ‣ Groups ‣ Top documents ‣ Facet and total count can represent groups instead of documents. ‣ But requires more query time. Searchworkings.org - The online search community 32 Thursday, May 17, 2012
  • 33. Conclusion Compare... Thursday, May 17, 2012
  • 34. Conclusion Compare the parent child solutions ‣ Result grouping ‣ + Distributed support & Parent child relation as hit. ‣ - Parent data duplication ‣ - Impact on query time ‣ Joining ‣ + Fast & no data duplication ‣ - Index time join not optimal for updates ‣ - Query time join is limited. Searchworkings.org - The online search community 34 Thursday, May 17, 2012
  • 35. Conclusion Compare the parent child solutions ‣ Compound documents. ‣ + Fast and works out-of-the box with all features. ‣ - Not flexible when it comes to updates. ‣ - Document granularity is set in stone. Searchworkings.org - The online search community 35 Thursday, May 17, 2012
  • 36. Any questions? 36 Thursday, May 17, 2012
  • 37. Extra slides We have time left! Thursday, May 17, 2012
  • 38. Conclusion Future work ‣ Higher level parent-child API. ‣ Needs to cover search & indexing. ‣ Joining ‣ Distributed support. ‣ Represent a hit as a parent child relation in the search result. ‣ Result grouping ‣ Aggregated grouped information like: sum, avg, min, max etc. Searchworkings.org - The online search community 38 Thursday, May 17, 2012
  • 39. Joining ToParentBlockJoinCollector ‣ TopGroups contains a group per top N parent document. ‣ Each group contains a parent and child documents. Searchworkings.org - The online search community 39 Thursday, May 17, 2012
  • 40. Result grouping Groups & facet counts ‣ Faceting and result grouping are different features. ‣ But are often used together! ‣ Facet counts can be based on: ‣ Found documents. ‣ Found groups. ‣ Combination of facet value and group. ‣ All options are supported in Solr. Searchworkings.org - The online search community 40 Thursday, May 17, 2012
  • 41. Result grouping Doc values ‣ Doc values / Column Stride values ‣ Prevents the creation of expensive data structures in FieldCache. ‣ Inverted index is meant for free text search. ‣ All grouping collectors have doc values based implementations! Searchworkings.org - The online search community 41 Thursday, May 17, 2012