Proposal for nested document support in Lucene


Mark Harwood


Nested Documents in Lucene
High-performance support for parent/child document relations
mark@searcharea.co.uk

Problem: The Lucene data model is based on Documents, Fields and Terms. However, many real-world data structures cannot be properly represented when collapsed into a single Lucene document.

Problem: "Cross-matching"
When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused, e.g. more than one qualification in a resume.
Source data: John has A1 in Maths and E1 in Science
Single Lucene document:
  Name: John
  Grade: A1, E1
  Subject: Maths, Science
Result: false match for query Grade:A1 AND Subject:Science
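The cross-matching effect can be reproduced without Lucene. The following is a minimal sketch in plain Python, assuming a toy model in which each field holds a set of values; the names `flat_doc` and `matches_all` are illustrative and not part of any Lucene API.

```python
# Toy illustration (not Lucene code): John's resume with two qualifications.
qualifications = [
    {"Grade": "A1", "Subject": "Maths"},
    {"Grade": "E1", "Subject": "Science"},
]

# Flattening merges the values of every qualification into one
# multi-valued field per attribute, losing which grade went with which subject.
flat_doc = {"Name": {"John"}, "Grade": set(), "Subject": set()}
for q in qualifications:
    flat_doc["Grade"].add(q["Grade"])
    flat_doc["Subject"].add(q["Subject"])


def matches_all(doc, criteria):
    """AND query: every (field, value) pair must be present in the document."""
    return all(value in doc.get(field, set()) for field, value in criteria)


query = [("Grade", "A1"), ("Subject", "Science")]

# Evaluated against the flattened document, the query matches.
flat_match = matches_all(flat_doc, query)

# Evaluated against each qualification separately, nothing matches.
per_qualification_match = any(
    matches_all({field: {value} for field, value in q.items()}, query)
    for q in qualifications
)
```

Here `flat_match` is true although John has no A1 in Science, while `per_qualification_match` is false, which is the answer the user actually wants.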

Unacceptable solution #1
One modeling approach is to store related items in the same field and use proximity operators in queries.
Source data: John has A1 in Maths and E1 in Science
Lucene document:
  Name: John
  GradeAndSubject: A1 Maths….E1 Science
Example query: GradeAndSubject:"A1 Science"~2
Drawbacks:
  Slow
  Not scalable with number of fields
  Only one choice of Analyzer for a given field

Solution: Nested Document Queries
Nested documents need to be queried using a new NestedDocumentQuery class which understands document relationships.
Parent document:
  docType: resume
  Name: John
Child document:
  Grade: A1
  Subject: Maths
Child document:
  Grade: E1
  Subject: Science
The new NestedDocumentQuery:

Reports any matches as a match on the parent document not the child

Super-fast evaluation of joins between child and parent

Requires an indexed field to identify parent documents?
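One way to picture the index layout this implies is a parent document followed directly by its children, with an indexed marker field on the parent. The sketch below is plain Python rather than Lucene code; the function `flatten`, the second resume ("Jane") and the use of `docType` as the parent marker are assumptions made for illustration.

```python
def flatten(resumes):
    """Lay out each parent immediately followed by its children,
    so that every family occupies one contiguous run of doc ids."""
    index = []
    for resume in resumes:
        # The docType field is what marks a document as a parent.
        index.append({"docType": "resume", "Name": resume["Name"]})
        for qualification in resume["qualifications"]:
            index.append(dict(qualification))
    return index


resumes = [
    {"Name": "John", "qualifications": [
        {"Grade": "A1", "Subject": "Maths"},
        {"Grade": "E1", "Subject": "Science"},
    ]},
    # Hypothetical second resume, added only to show a second block.
    {"Name": "Jane", "qualifications": [
        {"Grade": "A1", "Subject": "Science"},
    ]},
]

index = flatten(resumes)

# One flag per document: True where the document is a parent.
parent_bits = [doc.get("docType") == "resume" for doc in index]
```

Each grade now stays in the same small document as its subject, so a child-level query can no longer pair John's A1 with his Science entry.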

Solution: Example Query
Find resume of person called "John" with A1 grade in Maths.
Parent document:
  docType: resume
  Name: John
Child document:
  Grade: A1
  Subject: Maths
Child document:
  Grade: E1
  Subject: Science
The NestedDocumentQuery wrapper simply translates the stream of reported matches from the child-level query criteria into matches on the parent for evaluation of all the parent-level logic.
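The translation step can be sketched as follows. This is a toy model in plain Python, not the NestedDocumentQuery implementation: `nested_query`, `parent_of` and the second resume ("Jane") are hypothetical, and it assumes each parent is stored immediately before its children.

```python
# Toy index: each parent (docType: resume) is followed by its children.
index = [
    {"docType": "resume", "Name": "John"},   # doc 0 (parent)
    {"Grade": "A1", "Subject": "Maths"},     # doc 1
    {"Grade": "E1", "Subject": "Science"},   # doc 2
    {"docType": "resume", "Name": "Jane"},   # doc 3 (parent, hypothetical)
    {"Grade": "A1", "Subject": "Science"},   # doc 4
]


def is_parent(doc):
    return doc.get("docType") == "resume"


def parent_of(doc_id):
    """Walk backwards to the nearest preceding parent document."""
    while not is_parent(index[doc_id]):
        doc_id -= 1
    return doc_id


def satisfies(doc, criteria):
    return all(doc.get(field) == value for field, value in criteria.items())


def nested_query(child_criteria, parent_criteria):
    # 1. Run the child-level criteria against child documents only.
    child_hits = [
        doc_id for doc_id, doc in enumerate(index)
        if not is_parent(doc) and satisfies(doc, child_criteria)
    ]
    # 2. Report each child match as a match on its parent.
    candidate_parents = {parent_of(doc_id) for doc_id in child_hits}
    # 3. Apply the parent-level criteria to those parents.
    return sorted(
        p for p in candidate_parents if satisfies(index[p], parent_criteria)
    )


john_with_a1_maths = nested_query(
    {"Grade": "A1", "Subject": "Maths"}, {"Name": "John"}
)
john_with_a1_science = nested_query(
    {"Grade": "A1", "Subject": "Science"}, {"Name": "John"}
)
```

The first query returns John's resume (doc 0). The second returns nothing, because the only A1 in Science belongs to a different parent.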

Solution: Join speed
Unlike in a database, a join (child to parent) is blisteringly fast:
  1) Match reported on document #356,675
  2) Index directly into cached BitSet at position #356,675
  3) Find first prior set bit, e.g. position #356,670
  4) Attribute match to doc #356,670
Parent BitSet (one set bit per parent document):
  100000100000000100000001000000010000001000010000000001000000100000100001
Query components shown in the diagram: ParentQuery, NestedDocumentQuery, ChildQuery
The BitSet for defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index).
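The four steps above can be sketched with a Python integer standing in for the cached BitSet. This is an illustration only: `make_bitset` and `parent_for` are hypothetical names, the extra parent positions are made up, and a real bit set would scan machine words rather than mask a big integer.

```python
def make_bitset(parent_doc_ids):
    """Build a bit set (as a Python int) with one set bit per parent doc id."""
    bits = 0
    for doc_id in parent_doc_ids:
        bits |= 1 << doc_id
    return bits


def parent_for(parent_bits, doc_id):
    """Return the nearest set bit at or before doc_id."""
    # Keep only bits 0..doc_id, then take the highest one that is set.
    at_or_before = parent_bits & ((1 << (doc_id + 1)) - 1)
    if at_or_before == 0:
        raise ValueError("no parent at or before document %d" % doc_id)
    return at_or_before.bit_length() - 1


# Parent positions: #356,670 comes from the slide, the others are invented.
parent_bits = make_bitset([356_664, 356_670, 356_680])

# Steps 1-4: a child match on #356,675 is attributed to parent #356,670.
parent = parent_for(parent_bits, 356_675)
```

No lookup table from child to parent is needed. The only state is the parent bit set, which is why the memory cost stays at one bit per document.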

Other advantages
Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website).
Nesting levels can be arbitrarily deep.
Very powerful multi-child queries are possible, e.g. find people likely to know person X using resumes' employment histories (multiple employer names/URLs and related date ranges).
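Limiting child results per parent could look roughly like this. It is a sketch in plain Python, assuming ranked child hits and a sorted list of parent doc ids; `limit_per_parent`, `parent_of` and the sample ids are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical layout: website documents at ids 0 and 5, their pages after them.
parent_ids = [0, 5]


def parent_of(doc_id):
    """Nearest parent id at or before doc_id (parent_ids must be sorted)."""
    return parent_ids[bisect.bisect_right(parent_ids, doc_id) - 1]


def limit_per_parent(child_hits, max_per_parent):
    """Keep hits in ranked order, dropping any beyond the per-parent cap."""
    kept = []
    counts = defaultdict(int)
    for doc_id in child_hits:
        parent = parent_of(doc_id)
        if counts[parent] < max_per_parent:
            counts[parent] += 1
            kept.append(doc_id)
    return kept


# Page hits in ranked order: four from the first site, two from the second.
ranked_hits = [1, 2, 3, 4, 6, 7]
limited = limit_per_parent(ranked_hits, max_per_parent=2)
```

With a cap of two pages per website, the first site contributes pages 1 and 2 and the second contributes pages 6 and 7.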


5. Proximity distances must grow.


14. "Lucene is not a database", but…..
Structure matters
  Many data sources are a mix of structured and unstructured content (e.g. microformats). This is unlikely to change.
  Lucene has historically been about unstructured text but has steadily been adding structured capability (Trie, spatial, facets) and become a great solution for hybrid data. However, support for modeling and querying non-trivial data structures is currently missing.
Relationships matter
  This proposal is not to recreate the full capabilities of a SQL database with arbitrary relationships. However, we can benefit greatly from providing simple parent-child relationships.
We have some unique capabilities
  Parent-child joins are very fast
  Unlike SQL, we can return partial, relevance-ranked matches
  Probably more akin to XML databases than SQL databases

15. Next steps
Existing code/unit tests can be released to the Lucene project if there is sufficient interest. This software has been deployed in production on large datasets.
The matching approach is reliant on parents and children being held in the same Lucene index segment. Additional control is needed to enforce this more rigorously, either by
  adding more user control over IndexWriter segment creation, where applications understand/control parent-child dependencies, OR
  making Lucene aware of parent-child relationships, e.g. a new method Document.add(Document)
Query parser support
  XML Query Parser support is available
  End-user query parser could add new syntax, e.g. +candidateLocale:UK +child(grade:A1 AND subject:music)

16. Thoughts? Feedback encouraged on dev@lucene.apache.org
