What is in a Lucene index?

Adrien Grand
@jpountz

Software engineer at Elasticsearch
About me
• Lucene/Solr committer
• Software engineer at Elasticsearch
• I like changing the index file formats!
– stored fields
– term vectors
– doc values
– ...
Why should I learn about Lucene internals?
• Know the cost of the APIs
– to build blazing fast search applications
– don’t commit all the time
– when to use stored fields vs. doc values
– maybe Lucene is not the right tool
• Understand index size
– oh, term vectors are 1/2 of the index size!
– I removed 20% of my documents and index size hasn’t changed
• This is a lot of fun!
Indexing
• Make data fast to search
– duplicate data if it helps
– decide on how to index based on the queries
• Trade update speed for search speed
– Grep vs full-text indexing
– Prefix queries vs edge n-grams
– Phrase queries vs shingles
• Indexing is fast
– 220 GB/hour for 4K docs!
– http://people.apache.org/~mikemccand/lucenebench/indexing.html
Let’s create an index
• Tree structure
– sorted for range queries
– O(log(n)) search
[Figure: a search tree over the terms data, index, Lucene, sql and term, pointing to the documents “Lucene in action” and “Databases”]
Lucene doesn’t work this way
Another index
• Store terms and documents in arrays
– binary search
[Figure: a segment, made of a terms dictionary with its postings lists and an array of documents]
term ordinal   terms dict   postings list
0              data         0,1
1              index        0,1
2              Lucene       0
3              term         0
4              sql          1

doc id   document
0        Lucene in action
1        Databases
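The array-based segment above can be sketched in a few lines of Python. This is only an illustration, not Lucene's implementation: the names `TERMS`, `POSTINGS`, `term_ordinal` and `search` are made up for the example, and the terms are lowercased and kept strictly sorted so that binary search works (which puts sql before term, unlike the figure).

```python
from bisect import bisect_left

# Documents, addressed by doc id (their position in the array).
DOCS = ["Lucene in action", "Databases"]

# Terms dictionary: terms are kept sorted, so a term's ordinal is its index.
TERMS = ["data", "index", "lucene", "sql", "term"]

# Postings lists, parallel to TERMS: sorted doc ids that contain each term.
POSTINGS = [[0, 1], [0, 1], [0], [1], [0]]


def term_ordinal(term):
    """Binary-search the terms dictionary; return the ordinal or -1."""
    i = bisect_left(TERMS, term)
    return i if i < len(TERMS) and TERMS[i] == term else -1


def search(term):
    """Return the documents whose postings list contains the term."""
    ord_ = term_ordinal(term)
    if ord_ < 0:
        return []
    return [DOCS[doc_id] for doc_id in POSTINGS[ord_]]
```

For instance, `search("sql")` resolves the term to its ordinal with a binary search, reads the postings list `[1]` and returns the document with doc id 1.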
Insertions?
• Insertion = write a new segment
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)
[Figure: two single-document segments are merged into one; the doc id of the second segment's document is remapped from 0 to 1]
Segment 1: doc 0 = “Lucene in action”; terms data, index, Lucene, term, each with postings list 0
Segment 2: doc 0 = “Databases”; terms data, index, sql, each with postings list 0
Merged segment: doc 0 = “Lucene in action”, doc 1 = “Databases”; data → 0,1; index → 0,1; Lucene → 0; term → 0; sql → 1
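A minimal sketch of such a merge, assuming each segment is a `(docs, terms, postings)` tuple with sorted terms. The function name `merge_segments` and the tuple layout are hypothetical, and a real merge would also drop deleted documents.

```python
def merge_segments(seg_a, seg_b):
    """Merge two segments given as (docs, sorted terms, postings) tuples."""
    docs_a, terms_a, post_a = seg_a
    docs_b, terms_b, post_b = seg_b
    base = len(docs_a)  # doc ids of the second segment are shifted by this
    docs = docs_a + docs_b  # concatenate docs
    terms, postings = [], []
    i = j = 0
    # Merge-sort style walk over the two sorted terms dictionaries.
    while i < len(terms_a) or j < len(terms_b):
        take_a = j == len(terms_b) or (
            i < len(terms_a) and terms_a[i] <= terms_b[j])
        take_b = i == len(terms_a) or (
            j < len(terms_b) and terms_b[j] <= terms_a[i])
        term = terms_a[i] if take_a else terms_b[j]
        merged = []
        if take_a:
            merged += post_a[i]
            i += 1
        if take_b:
            # Remapped ids are all >= base, so the list stays sorted.
            merged += [base + doc_id for doc_id in post_b[j]]
            j += 1
        terms.append(term)
        postings.append(merged)
    return docs, terms, postings


seg_1 = (["Lucene in action"],
         ["data", "index", "lucene", "term"], [[0], [0], [0], [0]])
seg_2 = (["Databases"], ["data", "index", "sql"], [[0], [0], [0]])
merged_docs, merged_terms, merged_postings = merge_segments(seg_1, seg_2)
```

Neither input segment is modified: the merge only reads them and writes a new segment, which is what lets segments stay immutable.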
Deletions?
• Deletion = turn a bit off
• Ignore deleted documents when searching and merging (merging reclaims space)
• Merge policies favor segments with many deletions
[Figure: a segment with its terms dictionary, postings lists and documents, plus one live docs bit per document: 1 = live, 0 = deleted]
doc id   document           live
0        Lucene in action   1
1        Databases          0
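As a rough sketch, deletion only touches the live docs bits and search filters on them. The names `live_docs`, `delete` and `search` are illustrative, not Lucene's API.

```python
DOCS = ["Lucene in action", "Databases"]
POSTINGS = {"data": [0, 1], "index": [0, 1],
            "lucene": [0], "term": [0], "sql": [1]}

# One bit per document: 1 = live, 0 = deleted.
live_docs = [1, 1]


def delete(doc_id):
    """Deleting a document only turns its bit off."""
    live_docs[doc_id] = 0


def search(term):
    """Walk the postings list, skipping documents whose bit is off."""
    return [DOCS[d] for d in POSTINGS.get(term, []) if live_docs[d]]
```

After `delete(1)`, `search("sql")` returns nothing, but the postings list for sql still holds doc id 1. The space is only reclaimed when the segment is merged.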
Pros/cons
• Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)
Lucene can use several indexes
Many databases can’t
Index intersection
[Figure: leap-frogging between two postings lists; the numbered steps 1 to 9 alternate between the lists]
red:  1, 2, 10, 11, 20, 30, 50, 100
shoe: 2, 20, 21, 22, 30, 40, 100
Lucene’s postings lists support skipping that can be used to “leap-frog”
Many databases just pick the most selective index and ignore the other ones
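Leap-frogging can be sketched as follows. This is an illustration only: Lucene skips using skip data stored with the postings, while this sketch stands in a binary search (`bisect_left`) for the skip operation. `advance` and `intersect` are names chosen for the example.

```python
from bisect import bisect_left


def advance(postings, start, target):
    """Index of the first entry >= target, searching from start onwards."""
    return bisect_left(postings, target, start)


def intersect(a, b):
    """Intersect two sorted postings lists by leap-frogging."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])  # skip a forward to b's current doc
        else:
            j = advance(b, j, a[i])  # skip b forward to a's current doc
    return result


red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
both = intersect(red, shoe)  # doc ids matching red AND shoe
```

Each list jumps directly to the other list's current doc id, so runs such as 10, 11 in red or 21, 22 in shoe are skipped rather than scanned one by one.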
What else?
• We just covered search
• Lucene does more
– term vectors
– norms
– numeric doc values
– binary doc values
– sorted doc values
– sorted set doc values
Term vectors
• Per-document inverted index
• Useful for more-like-this
• Sometimes used for highlighting
[Figure: next to the segment-wide inverted index, each document carries its own small terms dictionary]
Doc 0 “Lucene in action”: 0 data, 1 index, 2 Lucene, 3 term
Doc 1 “Databases”: 0 data, 1 index, 2 sql
Numeric/binary doc values
• Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values
doc id   document           field_a   field_b
0        Lucene in action   42        afc
1        Databases          1         gce
2        Solr in action     3         ppy
3        Java               10        ccn
Sorted (set) doc values
• Ordinal-enabled per-doc and per-field values
– sorted: single-valued, useful for sorting
– sorted set: multi-valued, useful for faceting
doc id   document           ordinals
0        Lucene in action   1,2
1        Databases          0
2        Solr in action     0,1,2
3        Java               1

Terms dictionary for this dv field
ordinal   term
0         distributed
1         Java
2         search
Faceting
• Compute value counts for docs that match a query
– e.g. category counts on an ecommerce website
• Naive solution
– hash table: value to count
– O(#docs) ordinal lookups
– O(#docs) value lookups
• 2nd solution
– hash table: ord to count
– resolve values in the end
– O(#docs) ordinal lookups
– O(#values) value lookups
Since ordinals are dense, this can be a simple array
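The second solution can be sketched like this, reusing the sorted set doc values from the earlier example (per-document ordinals plus a terms dictionary). `DOC_ORDS`, `TERMS` and `facet_counts` are illustrative names, not Lucene classes.

```python
# Terms dictionary of the doc values field: ordinal -> value.
TERMS = ["distributed", "Java", "search"]

# Sorted set doc values: for each doc id, the ordinals of its values.
DOC_ORDS = [[1, 2], [0], [0, 1, 2], [1]]


def facet_counts(matching_docs):
    """Count values over the matching docs using a dense array of counts."""
    counts = [0] * len(TERMS)  # ordinals are dense: no hash table needed
    for doc_id in matching_docs:
        for ord_ in DOC_ORDS[doc_id]:  # O(#docs) ordinal lookups
            counts[ord_] += 1
    # Resolve ordinals to values only at the end: O(#values) value lookups.
    return {TERMS[o]: c for o, c in enumerate(counts) if c > 0}
```

If a query matches docs 0, 2 and 3, `facet_counts([0, 2, 3])` looks up three lists of ordinals and resolves each of the three values once, however many documents matched.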
How can I use these APIs?
• These are the low-level Lucene APIs, everything is built on top of these APIs: searching, faceting, scoring, highlighting, etc.
API                  Useful for                            Method
Inverted index       Term -> doc ids, positions, offsets   AtomicReader.fields
Stored fields        Summaries of search results           IndexReader.document
Live docs            Ignoring deleted docs                 AtomicReader.liveDocs
Term vectors         More like this                        IndexReader.termVectors
Doc values / Norms   Sorting/faceting/scoring              AtomicReader.get*Values
Wrap up
• Data duplicated up to 4 times
– not a waste of space!
– easy to manage thanks to immutability
• Stored fields vs doc values
– Optimized for different access patterns
– get many field values for a few docs: stored fields
– get a few field values for many docs: doc values
[Figure: on-disk order of the (doc, field) values for 3 docs and fields A, B, C]
Stored fields: 0,A 0,B 0,C | 1,A 1,B 1,C | 2,A 2,B 2,C
– At most 1 seek per doc
Doc values: 0,A 1,A 2,A | 0,B 1,B 2,B | 0,C 1,C 2,C
– At most 1 seek per doc per field
– BUT more disk / file-system cache-friendly
File formats
Important rules
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly
Codecs
• File formats are codec-dependent
• Default codec tries to get the best speed for little memory
– To trade memory for speed, don’t use RAMDirectory:
– use MemoryPostingsFormat, MemoryDocValuesFormat, etc. instead
• Detailed file formats available in javadocs
– http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
Compression techniques
• Bit packing / vInt encoding
– postings lists
– numeric doc values
• LZ4
– code.google.com/p/lz4
– lightweight compression algorithm
– stored fields, term vectors
• FSTs
– conceptually a Map<String, ?>
– keys share prefixes and suffixes
– terms index
What happens when I run a TermQuery?
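1. Terms index
• Look up the term in the terms index
– In-memory FST storing terms prefixes
– Gives the offset to look at in the terms dictionary
– Can fast-fail if no terms have this prefix
[Figure: an FST whose arcs carry a character and an optional output: b/2, r, a/1, c, l/4, u, y/3, r. Summing the outputs along a path gives the value for that prefix.]
br = 2
brac = 3
luc = 4
lyr = 7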
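The way outputs add up along a path can be illustrated with a hand-built automaton that mirrors the figure. This is a simplification: the states below form a plain tree, whereas a real FST also shares suffixes. The helper names `state` and `lookup` are made up for the example.

```python
def state(final=False, **arcs):
    """A state: arcs map a character to (output, target state)."""
    return {"final": final, "arcs": arcs}


# Hand-built from the figure; arcs without an explicit output carry 0.
s_brac = state(final=True)
s_bra = state(c=(0, s_brac))
s_br = state(final=True, a=(1, s_bra))
s_b = state(r=(0, s_br))
s_luc = state(final=True)
s_lu = state(c=(0, s_luc))
s_lyr = state(final=True)
s_ly = state(r=(0, s_lyr))
s_l = state(u=(0, s_lu), y=(3, s_ly))
ROOT = state(b=(2, s_b), l=(4, s_l))


def lookup(prefix):
    """Sum the outputs along the path for prefix; None if it is absent."""
    node, total = ROOT, 0
    for ch in prefix:
        arc = node["arcs"].get(ch)
        if arc is None:
            return None  # fast-fail: no term has this prefix
        output, node = arc
        total += output
    return total if node["final"] else None
```

Here `lookup("lyr")` follows l/4, y/3 and r, which sums to 7, while `lookup("lux")` fails on the third character without ever touching the terms dictionary.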
2. Terms dictionary
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the “BlockTree terms dict”
• Read sequentially until the term is found
[Figure: looking up “lucene” in a block of the terms dictionary]
[prefix=luc]                   <- jump here
a, freq=1, offset=101          <- not found
as, freq=1, offset=149         <- not found
ene, freq=9, offset=205        <- found
ky, freq=7, offset=260
rative, freq=5, offset=323
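A sketch of the sequential scan inside one block, using the entries from the figure. The `BLOCK` layout and the `find_term` helper are assumptions made for illustration; the actual BlockTree encoding is binary and more involved.

```python
# One block of the terms dictionary: the shared prefix is stored once,
# followed by (suffix, doc freq, postings offset) entries sorted by suffix.
BLOCK = {
    "prefix": "luc",
    "entries": [
        ("a", 1, 101),
        ("as", 1, 149),
        ("ene", 9, 205),
        ("ky", 7, 260),
        ("rative", 5, 323),
    ],
}


def find_term(block, term):
    """Scan the block sequentially; return (freq, offset) or None."""
    prefix = block["prefix"]
    if not term.startswith(prefix):
        return None
    wanted = term[len(prefix):]
    for suffix, freq, offset in block["entries"]:
        if suffix == wanted:
            return freq, offset
        if suffix > wanted:
            break  # entries are sorted: the term cannot appear later
    return None
```

`find_term(BLOCK, "lucene")` reads the entries for a and as, then stops at ene and returns its frequency and the offset to jump to in the postings lists.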
3. Postings lists
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into blocks of N=128 values
– 3. bit packing per block
– 4. if remaining docs, encode with vInt
Example with N=4
doc ids: 1,3,4,6,8,20,22,26,30,31
deltas:  1,2,1,2,2,12,2,4,4,1
blocks:  [1,2,1,2] (2 bits per value) [2,12,2,4] (4 bits per value) 4, 1 (vInt-encoded)
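The four steps can be sketched as below. To stay short, the sketch does not pack the bits: it only records how many bits per value each full block would need. The vInt helper uses 7 data bits per byte, low-order bits first, with the high bit set when more bytes follow. All function names are illustrative.

```python
def delta_encode(doc_ids):
    """Replace each doc id by its difference with the previous one."""
    prev, deltas = 0, []
    for doc_id in doc_ids:
        deltas.append(doc_id - prev)
        prev = doc_id
    return deltas


def vint(value):
    """Variable-length encoding of a non-negative int, 7 bits per byte."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode(doc_ids, block_size=128):
    """Return (blocks, tail): full blocks as (bits per value, deltas),
    and the remaining deltas vInt-encoded."""
    deltas = delta_encode(doc_ids)
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        # Every value in a block is stored on the same number of bits.
        blocks.append((max(block).bit_length(), block))
    tail = [vint(d) for d in deltas[full:]]
    return blocks, tail


blocks, tail = encode([1, 3, 4, 6, 8, 20, 22, 26, 30, 31], block_size=4)
```

With the slide's example and N=4, the first block needs 2 bits per value, the second needs 4 bits because of the single delta of 12, and the last two deltas take one vInt byte each.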
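4. Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+KB blocks
[Figure: each entry of the in-memory index points to the start of a compressed block of documents]
docId=0, offset=42:  block holding docs 0, 1, 2 (16KB)
docId=3, offset=127: block holding doc 3 (16KB)
docId=4, offset=199: block holding docs 4, 5, 6 (16KB)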
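Finding the block that holds a document is then a binary search over the in-memory index. A minimal sketch with the figure's values; `BLOCK_FIRST_DOC`, `BLOCK_OFFSET` and `block_for` are illustrative names, and the decompression step is left out.

```python
from bisect import bisect_right

# In-memory index: first doc id of each compressed block and the
# offset of that block in the stored fields file.
BLOCK_FIRST_DOC = [0, 3, 4]
BLOCK_OFFSET = [42, 127, 199]


def block_for(doc_id):
    """Return (first doc id, file offset) of the block holding doc_id."""
    if doc_id < 0:
        raise ValueError("doc ids are non-negative")
    # Last block whose first doc id is <= doc_id.
    i = bisect_right(BLOCK_FIRST_DOC, doc_id) - 1
    return BLOCK_FIRST_DOC[i], BLOCK_OFFSET[i]
```

To load doc 5, `block_for(5)` points at offset 199. The reader then seeks there once, decompresses the block and skips over doc 4 to reach doc 5.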
Query execution
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fit into the file-system cache
• “Pulse” optimization
– For unique terms (freq=1), postings are inlined in the terms dict
– Only 1 disk seek
– Will always be used for your primary keys
Quiz
What is happening here?
[Figure: qps as a function of #docs in the index, with two sudden drops labeled 1 and 2]
1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
2. Terms dict/Postings lists not fully in the cache
Thank you!
1 von 38

Recomendados

Espresso: LinkedIn's Distributed Data Serving Platform (Paper) von
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
40.1K views12 Folien
Dynamic filtering for presto join optimisation von
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
1.5K views14 Folien
Introduction to Redis von
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
121K views24 Folien
Performance Optimizations in Apache Impala von
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
10.7K views63 Folien
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 von
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
16.6K views40 Folien
Cassandra Introduction & Features von
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
31.9K views21 Folien

Más contenido relacionado

Was ist angesagt?

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 von
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
61.2K views43 Folien
Introduction to memcached von
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
70.9K views77 Folien
Parquet performance tuning: the missing guide von
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
40.5K views44 Folien
A Thorough Comparison of Delta Lake, Iceberg and Hudi von
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
11.1K views27 Folien
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... von
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
8.4K views48 Folien
Building large scale transactional data lake using apache hudi von
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
686 views32 Folien

Was ist angesagt?(20)

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 von mumrah
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah61.2K views
Parquet performance tuning: the missing guide von Ryan Blue
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue40.5K views
A Thorough Comparison of Delta Lake, Iceberg and Hudi von Databricks
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks11.1K views
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... von Databricks
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks8.4K views
Building large scale transactional data lake using apache hudi von Bill Liu
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu686 views
Iceberg: A modern table format for big data (Strata NY 2018) von Ryan Blue
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue2K views
Schema-on-Read vs Schema-on-Write von Amr Awadallah
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
Amr Awadallah26.9K views
Hive Bucketing in Apache Spark with Tejas Patil von Databricks
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks14.9K views
The Parquet Format and Performance Optimization Opportunities von Databricks
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks8.1K views
Iceberg: a fast table format for S3 von DataWorks Summit
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit7.5K views
Virtual Nodes: Rethinking Topology in Cassandra von Eric Evans
Virtual Nodes: Rethinking Topology in CassandraVirtual Nodes: Rethinking Topology in Cassandra
Virtual Nodes: Rethinking Topology in Cassandra
Eric Evans6.6K views
MySQL innoDB split and merge pages von Marco Tusa
MySQL innoDB split and merge pagesMySQL innoDB split and merge pages
MySQL innoDB split and merge pages
Marco Tusa337 views
Apache Iceberg: An Architectural Look Under the Covers von ScyllaDB
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB1.4K views
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ... von Altinity Ltd
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd116 views
Apache Spark Core—Deep Dive—Proper Optimization von Databricks
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks6.1K views

Destacado

Lucene basics von
Lucene basicsLucene basics
Lucene basicsNitin Pande
27K views41 Folien
Berlin Buzzwords 2013 - How does lucene store your data? von
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Adrien Grand
6.6K views22 Folien
Lucene Introduction von
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
8.8K views26 Folien
Apache Lucene: Searching the Web and Everything Else (Jazoon07) von
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
10.1K views35 Folien
Elasticsearch From the Bottom Up von
Elasticsearch From the Bottom UpElasticsearch From the Bottom Up
Elasticsearch From the Bottom Upfoundsearch
6K views82 Folien
Apache Solr/Lucene Internals by Anatoliy Sokolenko von
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
7.3K views70 Folien

Destacado(11)

Berlin Buzzwords 2013 - How does lucene store your data? von Adrien Grand
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
Adrien Grand6.6K views
Lucene Introduction von otisg
Lucene IntroductionLucene Introduction
Lucene Introduction
otisg8.8K views
Apache Lucene: Searching the Web and Everything Else (Jazoon07) von dnaber
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber10.1K views
Elasticsearch From the Bottom Up von foundsearch
Elasticsearch From the Bottom UpElasticsearch From the Bottom Up
Elasticsearch From the Bottom Up
foundsearch6K views
Apache Solr/Lucene Internals by Anatoliy Sokolenko von Provectus
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Provectus 7.3K views
Introduction to Elasticsearch with basics of Lucene von Rahul Jain
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain13.2K views
Introduction to Elasticsearch von Ruslan Zavacky
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
Ruslan Zavacky7.6K views
Elastic search overview von ABC Talks
Elastic search overviewElastic search overview
Elastic search overview
ABC Talks8.6K views
Elasticsearch presentation 1 von Maruf Hassan
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1
Maruf Hassan4.6K views
SlideShare 101 von Amit Ranjan
SlideShare 101SlideShare 101
SlideShare 101
Amit Ranjan29.7M views

Similar a What is in a Lucene index?

Lucene BootCamp von
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
2.4K views83 Folien
Finite State Queries In Lucene von
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
7.5K views25 Folien
Lucene Bootcamp - 2 von
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2GokulD
1.2K views59 Folien
Intro to Elasticsearch von
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
10.7K views44 Folien
Illuminating Lucene.Net von
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.NetDean Thrasher
3.1K views48 Folien
Portable Lucene Index Format & Applications - Andrzej Bialecki von
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
1.5K views31 Folien

Similar a What is in a Lucene index?(20)

Lucene BootCamp von GokulD
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD2.4K views
Finite State Queries In Lucene von otisg
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
otisg7.5K views
Lucene Bootcamp - 2 von GokulD
Lucene Bootcamp - 2Lucene Bootcamp - 2
Lucene Bootcamp - 2
GokulD1.2K views
Illuminating Lucene.Net von Dean Thrasher
Illuminating Lucene.NetIlluminating Lucene.Net
Illuminating Lucene.Net
Dean Thrasher3.1K views
Portable Lucene Index Format & Applications - Andrzej Bialecki von lucenerevolution
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
lucenerevolution1.5K views
Introduction to elasticsearch von pmanvi
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
pmanvi6.7K views
Musings on Secondary Indexing in HBase von Jesse Yates
Musings on Secondary Indexing in HBaseMusings on Secondary Indexing in HBase
Musings on Secondary Indexing in HBase
Jesse Yates3K views
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup von rcmuir
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
rcmuir2.7K views
Introduction to libre « fulltext » technology von Robert Viseur
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
Robert Viseur561 views
Exploring Direct Concept Search von Steve Rowe
Exploring Direct Concept SearchExploring Direct Concept Search
Exploring Direct Concept Search
Steve Rowe200 views
Is Your Index Reader Really Atomic or Maybe Slow? von lucenerevolution
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?
lucenerevolution4.1K views
SFDC Introduction to Apex von Sujit Kumar
SFDC Introduction to ApexSFDC Introduction to Apex
SFDC Introduction to Apex
Sujit Kumar63 views
Elasticsearch and Spark von Audible, Inc.
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
Audible, Inc.7.2K views
Lucene Bootcamp -1 von GokulD
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
GokulD1.4K views
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr von Sease
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Sease7.3K views
Exploring Direct Concept Search - Steve Rowe, Lucidworks von Lucidworks
Exploring Direct Concept Search - Steve Rowe, LucidworksExploring Direct Concept Search - Steve Rowe, Lucidworks
Exploring Direct Concept Search - Steve Rowe, Lucidworks
Lucidworks574 views

Más de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene von
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
7.7K views88 Folien
State of the Art Logging. Kibana4Solr is Here! von
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
6.2K views21 Folien
Search at Twitter von
Search at TwitterSearch at Twitter
Search at Twitterlucenerevolution
4K views86 Folien
Building Client-side Search Applications with Solr von
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
5.2K views36 Folien
Integrate Solr with real-time stream processing applications von
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
8.7K views39 Folien
Scaling Solr with SolrCloud von
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
2.3K views57 Folien

Más de lucenerevolution(20)

Text Classification Powered by Apache Mahout and Lucene von lucenerevolution
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
lucenerevolution7.7K views
State of the Art Logging. Kibana4Solr is Here! von lucenerevolution
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
lucenerevolution6.2K views
Building Client-side Search Applications with Solr von lucenerevolution
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
lucenerevolution5.2K views
Integrate Solr with real-time stream processing applications von lucenerevolution
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution8.7K views
Administering and Monitoring SolrCloud Clusters von lucenerevolution
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
lucenerevolution1.7K views
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled von lucenerevolution
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
lucenerevolution1.6K views
Using Solr to Search and Analyze Logs von lucenerevolution
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
lucenerevolution4.5K views
Enhancing relevancy through personalization & semantic search von lucenerevolution
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution5.9K views
Real-time Inverted Search in the Cloud Using Lucene and Storm von lucenerevolution
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
lucenerevolution4K views
Solr's Admin UI - Where does the data come from? von lucenerevolution
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
lucenerevolution2.5K views
Schemaless Solr and the Solr Schema REST API von lucenerevolution
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
lucenerevolution9K views
High Performance JSON Search and Relational Faceted Browsing with Lucene von lucenerevolution
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
lucenerevolution5.2K views
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM von lucenerevolution
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
lucenerevolution11.1K views
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke... von lucenerevolution
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
lucenerevolution4.6K views
Shrinking the haystack wes caldwell - final von lucenerevolution
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
lucenerevolution809 views

Último

Serverless computing with Google Cloud (2023-24) von
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)wesley chun
11 views33 Folien
PRODUCT PRESENTATION.pptx von
PRODUCT PRESENTATION.pptxPRODUCT PRESENTATION.pptx
PRODUCT PRESENTATION.pptxangelicacueva6
14 views1 Folie
handbook for web 3 adoption.pdf von
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdfLiveplex
22 views16 Folien
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors von
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensorssugiuralab
19 views15 Folien
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... von
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...James Anderson
85 views32 Folien

Último(20)

Serverless computing with Google Cloud (2023-24) von wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 views
handbook for web 3 adoption.pdf von Liveplex
handbook for web 3 adoption.pdfhandbook for web 3 adoption.pdf
handbook for web 3 adoption.pdf
Liveplex22 views

What is in a Lucene index?

  • 2. WHAT IS IN A LUCENE INDEX Adrien Grand @jpountz Software engineer at Elasticsearch
  • 3. About me
      • Lucene/Solr committer
      • Software engineer at Elasticsearch
      • I like changing the index file formats!
          – stored fields
          – term vectors
          – doc values
          – ...
  • 4. Why should I learn about Lucene internals?
  • 5. Why should I learn about Lucene internals?
      • Know the cost of the APIs
          – to build blazing fast search applications
          – don’t commit all the time
          – when to use stored fields vs. doc values
          – maybe Lucene is not the right tool
      • Understand index size
          – oh, term vectors are 1/2 of the index size!
          – I removed 20% of my documents and index size hasn’t changed
      • This is a lot of fun!
  • 6. Indexing
      • Make data fast to search
          – duplicate data if it helps
          – decide on how to index based on the queries
      • Trade update speed for search speed
          – Grep vs full-text indexing
          – Prefix queries vs edge n-grams
          – Phrase queries vs shingles
      • Indexing is fast
          – 220 GB/hour for 4K docs!
          – http://people.apache.org/~mikemccand/lucenebench/indexing.html
  • 7. Let’s create an index
      • Tree structure
          – sorted for range queries
          – O(log(n)) search
      [Figure: tree over the terms data, index, Lucene, sql, term, pointing to the documents "Lucene in action" and "Databases"]
  • 9. Another index
      • Store terms and documents in arrays
          – binary search

      0   data     0,1
      1   index    0,1
      2   Lucene   0
      3   term     0
      4   sql      1

      0   Lucene in action
      1   Databases
  • 10. Another index
      • Store terms and documents in arrays
          – binary search

      Segment:

      term ordinal   terms dict   postings list
      0              data         0,1
      1              index        0,1
      2              Lucene       0
      3              term         0
      4              sql          1

      doc id   document
      0        Lucene in action
      1        Databases
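The array layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_segment` and `search` are hypothetical names, not Lucene APIs, documents are given as pre-tokenized lowercase term lists, and a real segment stores far more than this.

```python
from bisect import bisect_left

def build_segment(docs):
    """Build a toy segment: a sorted terms array plus one postings list per term ordinal."""
    postings = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, []).append(doc_id)
    terms_dict = sorted(postings)                        # position in this array = term ordinal
    postings_lists = [postings[t] for t in terms_dict]   # doc ids were appended in increasing order
    return terms_dict, postings_lists

def search(segment, term):
    """Binary-search the terms array, then return the postings list stored at that ordinal."""
    terms_dict, postings_lists = segment
    ord_ = bisect_left(terms_dict, term)
    if ord_ < len(terms_dict) and terms_dict[ord_] == term:
        return postings_lists[ord_]
    return []

segment = build_segment([
    ["data", "index", "lucene", "term"],  # doc 0
    ["data", "index", "sql"],             # doc 1
])
```

With this toy data, `search(segment, "data")` returns the postings list for both documents and `search(segment, "sql")` only the second one; a term that is absent yields an empty list.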
  • 11. Insertions?
      • Insertion = write a new segment
      • Merge segments when there are too many of them
          – concatenate docs, merge terms dicts and postings lists (merge sort!)
      [Figure: two one-document segments, (data, index, Lucene, term -> doc 0 "Lucene in action") and (data, index, sql -> doc 0 "Databases"), next to the merged segment: data -> 0,1; index -> 0,1; Lucene -> 0; term -> 0; sql -> 1]
  • 12. Insertions?
      [Figure: same slide as 11, with the document of the second segment renumbered from doc 0 to doc 1]
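A minimal sketch of the merge step, assuming each segment is a pair of (sorted terms array, postings lists) and ignoring deletions. `merge_segments` is a hypothetical helper written for this illustration; it is not how Lucene's segment merger is implemented.

```python
def merge_segments(seg_a, num_docs_a, seg_b):
    """Merge two toy segments given as (sorted terms, postings lists) pairs.

    Documents are concatenated, so doc ids of seg_b are shifted by num_docs_a.
    The two sorted terms arrays are combined with the merge step of merge sort.
    """
    terms_a, posts_a = seg_a
    terms_b, posts_b = seg_b
    merged_terms, merged_posts = [], []
    i = j = 0
    while i < len(terms_a) or j < len(terms_b):
        # Take from a side if the other side is exhausted or its current term is not smaller.
        take_a = j == len(terms_b) or (i < len(terms_a) and terms_a[i] <= terms_b[j])
        take_b = i == len(terms_a) or (j < len(terms_b) and terms_b[j] <= terms_a[i])
        term = terms_a[i] if take_a else terms_b[j]
        docs = []
        if take_a:
            docs += posts_a[i]
            i += 1
        if take_b:
            docs += [d + num_docs_a for d in posts_b[j]]
            j += 1
        merged_terms.append(term)
        merged_posts.append(docs)
    return merged_terms, merged_posts

merged = merge_segments(
    (["data", "index", "lucene", "term"], [[0], [0], [0], [0]]), 1,
    (["data", "index", "sql"], [[0], [0], [0]]),
)
```

Because both inputs are already sorted, the merge reads each segment once, sequentially, which is why bulk merging is cheap compared to updating a sorted structure in place.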
  • 13. Deletions?
      • Deletion = turn a bit off
      • Ignore deleted documents when searching and merging (reclaims space)
      • Merge policies favor segments with many deletions

      term ordinal   terms dict   postings list
      0              data         0,1
      1              index        0,1
      2              Lucene       0
      3              term         0
      4              sql          1

      doc id   document           live docs (1 = live, 0 = deleted)
      0        Lucene in action   1
      1        Databases          0
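The live-docs idea can be sketched as a list of booleans, one per document. The helper names below are hypothetical, and the renumbering in `merge_drop_deleted` is a simplification of what a merge does (for instance, it keeps terms whose postings list becomes empty).

```python
def delete(live_docs, doc_id):
    """Deleting a document only flips its bit; the segment itself is not modified."""
    live_docs[doc_id] = False

def search_live(postings, live_docs):
    """Searching skips doc ids whose bit is off."""
    return [d for d in postings if live_docs[d]]

def merge_drop_deleted(postings_lists, live_docs):
    """Rewrite postings without deleted docs, renumbering the surviving doc ids."""
    new_id = {}
    for old, live in enumerate(live_docs):
        if live:
            new_id[old] = len(new_id)
    return [[new_id[d] for d in p if d in new_id] for p in postings_lists]

live_docs = [True, True, True]
delete(live_docs, 0)
```

Space is only reclaimed by `merge_drop_deleted`, which mirrors the slide's point that removing documents does not shrink the index until a merge rewrites the segment.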
  • 14. Pros/cons
      • Updates require writing a new segment
          – single-doc updates are costly, bulk updates preferred
          – writes are sequential
      • Segments are never modified in place
          – filesystem-cache-friendly
          – lock-free!
      • Terms are deduplicated
          – saves space for high-freq terms
      • Docs are uniquely identified by an ord
          – useful for cross-API communication
          – Lucene can use several indexes in a single query
      • Terms are uniquely identified by an ord
          – important for sorting: compare longs, not strings
          – important for faceting (more on this later)
  • 15. Lucene can use several indexes. Many databases can’t.
  • 16. Index intersection
      red:    1, 2, 10, 11, 20, 30, 50, 100
      shoe:   2, 20, 21, 22, 30, 40, 100
      [Figure: numbered steps 1 to 9 alternating between the two postings lists]
      • Lucene’s postings lists support skipping that can be used to “leap-frog”
      • Many databases just pick the most selective index and ignore the other ones
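A minimal sketch of leap-frog intersection on the two postings lists from the slide. As an assumption for illustration, skipping is emulated with a binary search over a Python list (`advance`); Lucene uses skip data stored with the postings instead.

```python
from bisect import bisect_left

def advance(postings, start, target):
    """Return the index of the first doc id >= target, looking only at positions >= start.

    Stand-in for skip lists: a binary search rather than a linear scan.
    """
    return bisect_left(postings, target, start)

def intersect(a, b):
    """Leap-frog: the list that is behind jumps forward to the other list's current doc id."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return result

red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
matches = intersect(red, shoe)
```

Both lists drive the iteration, so neither index has to be scanned in full: doc ids 10 and 11 in `red` and 21 and 22 in `shoe` are jumped over rather than compared one by one.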
  • 17. What else?
      • We just covered search
      • Lucene does more
          – term vectors
          – norms
          – numeric doc values
          – binary doc values
          – sorted doc values
          – sorted set doc values
  • 18. Term vectors
      • Per-document inverted index
      • Useful for more-like-this
      • Sometimes used for highlighting
      [Figure: the segment-wide inverted index (data -> 0,1; index -> 0,1; Lucene -> 0; term -> 0; sql -> 1) next to one small inverted index per document: "Lucene in action" with data, index, Lucene, term and "Databases" with data, index, sql]
  • 19. Numeric/binary doc values
      • Per doc and per field single numeric values, stored in a column-stride fashion
      • Useful for sorting and custom scoring
      • Norms are numeric doc values

      doc id   document           field_a   field_b
      0        Lucene in action   42        afc
      1        Databases          1         gce
      2        Solr in action     3         ppy
      3        Java               10        ccn
  • 20. Sorted (set) doc values
      • Ordinal-enabled per-doc and per-field values
          – sorted: single-valued, useful for sorting
          – sorted set: multi-valued, useful for faceting

      doc id   document           ordinals
      0        Lucene in action   1,2
      1        Databases          0
      2        Solr in action     0,1,2
      3        Java               1

      Terms dictionary for this dv field:
      0   distributed
      1   Java
      2   search
  • 21. Faceting
      • Compute value counts for docs that match a query
          – e.g. category counts on an ecommerce website
      • Naive solution
          – hash table: value to count
          – O(#docs) ordinal lookups
          – O(#docs) value lookups
      • 2nd solution
          – hash table: ord to count
          – resolve values in the end
          – O(#docs) ordinal lookups
          – O(#values) value lookups
      • Since ordinals are dense, this can be a simple array
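The second solution can be sketched as follows, reusing the ordinals and terms dictionary from the sorted set doc values slide. `facet_counts` is a hypothetical helper for illustration; the set of matching docs is assumed to come from a query.

```python
def facet_counts(matching_docs, doc_ords, terms_dict):
    """Count values for the matching docs using ordinals, resolving terms only at the end."""
    counts = [0] * len(terms_dict)      # ordinals are dense, so a plain array replaces the hash table
    for doc in matching_docs:           # O(#docs) ordinal lookups
        for ord_ in doc_ords[doc]:
            counts[ord_] += 1
    # O(#values) value lookups, done once after counting
    return {terms_dict[o]: c for o, c in enumerate(counts) if c > 0}

terms_dict = ["distributed", "Java", "search"]
doc_ords = [[1, 2], [0], [0, 1, 2], [1]]   # per-document ordinals, as on the slide
counts = facet_counts([0, 2, 3], doc_ords, terms_dict)
```

The inner loop only touches integers; strings are looked up once per distinct value rather than once per matching document.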
  • 22. How can I use these APIs?
      • These are the low-level Lucene APIs, everything is built on top of these APIs: searching, faceting, scoring, highlighting, etc.

      API                  Useful for                            Method
      Inverted index       Term -> doc ids, positions, offsets   AtomicReader.fields
      Stored fields        Summaries of search results           IndexReader.document
      Live docs            Ignoring deleted docs                 AtomicReader.liveDocs
      Term vectors         More like this                        IndexReader.termVectors
      Doc values / Norms   Sorting/faceting/scoring              AtomicReader.get*Values
  • 23. Wrap up
      • Data duplicated up to 4 times
          – not a waste of space!
          – easy to manage thanks to immutability
      • Stored fields vs doc values
          – Optimized for different access patterns
          – get many field values for a few docs: stored fields
          – get a few field values for many docs: doc values

      Stored fields:   0,A 0,B 0,C | 1,A 1,B 1,C | 2,A 2,B 2,C
                       At most 1 seek per doc
      Doc values:      0,A 1,A 2,A | 0,B 1,B 2,B | 0,C 1,C 2,C
                       At most 1 seek per doc per field BUT more disk / file-system cache-friendly
  • 25. Important rules
      • Save file handles
          – don’t use one file per field or per doc
      • Avoid disk seeks whenever possible
          – disk seek on spinning disk is ~10 ms
      • BUT don’t ignore the filesystem cache
          – random access in small files is fine
      • Light compression helps
          – less I/O
          – smaller indexes
          – filesystem-cache-friendly
  • 26. Codecs
      • File formats are codec-dependent
      • Default codec tries to get the best speed for little memory
          – To trade memory for speed, don’t use RAMDirectory; use MemoryPostingsFormat, MemoryDocValuesFormat, etc.
      • Detailed file formats available in javadocs
          – http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html
  • 27. Compression techniques
      • Bit packing / vInt encoding
          – postings lists
          – numeric doc values
      • LZ4
          – code.google.com/p/lz4
          – lightweight compression algorithm
          – stored fields, term vectors
      • FSTs
          – conceptually a Map<String, ?>
          – keys share prefixes and suffixes
          – terms index
  • 28. What happens when I run a TermQuery?
  • 29. 1. Terms index
      • Look up the term in the terms index
          – In-memory FST storing terms prefixes
          – Gives the offset to look at in the terms dictionary
          – Can fast-fail if no terms have this prefix
      [Figure: FST with arcs b/2, r, a/1, c and l/4, u, c, y/3, r, encoding br = 2, brac = 3, luc = 4, lyr = 7]
  • 30. 2. Terms dictionary
      • Jump to the given offset in the terms dictionary
          – compressed based on shared prefixes, similarly to a burst trie
          – called the “BlockTree terms dict”
      • Read sequentially until the term is found

      [prefix=luc]
      a, freq=1, offset=101         <- jump here: not found
      as, freq=1, offset=149        <- not found
      ene, freq=9, offset=205       <- found
      ky, freq=7, offset=260
      rative, freq=5, offset=323
  • 31. 3. Postings lists
      • Jump to the given offset in the postings lists
      • Encoded using modified FOR (Frame of Reference) delta
          – 1. delta-encode
          – 2. split into blocks of N=128 values
          – 3. bit packing per block
          – 4. if remaining docs, encode with vInt

      Example with N=4
      doc ids:   1,3,4,6,8,20,22,26,30,31
      deltas:    1,2,1,2,2,12,2,4,4,1
      blocks:    [1,2,1,2]    2 bits per value
                 [2,12,2,4]   4 bits per value
                 4, 1         vInt-encoded
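The four steps can be sketched as below. This is an illustration of the technique only, not Lucene's on-disk layout: packed blocks are represented as a (bits per value, Python int) pair, and `vint` and `encode_postings` are hypothetical helper names.

```python
def vint(value):
    """Encode a non-negative int in 7-bit groups, low bits first; a set high bit means more bytes follow."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def encode_postings(doc_ids, block_size=128):
    """Delta-encode sorted doc ids, bit-pack full blocks, vInt-encode the remainder."""
    # 1. delta-encode (the first doc id is stored as a delta from 0)
    deltas = [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]
    # 2. split into full blocks of block_size values
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        # 3. bit packing: every value in the block uses the width of the largest delta
        bits = max(block).bit_length()
        packed = 0
        for k, delta in enumerate(block):
            packed |= delta << (k * bits)
        blocks.append((bits, packed))
    # 4. remaining deltas are vInt-encoded
    tail = b"".join(vint(d) for d in deltas[full:])
    return blocks, tail

blocks, tail = encode_postings([1, 3, 4, 6, 8, 20, 22, 26, 30, 31], block_size=4)
```

On the slide's example with N=4, the first block needs 2 bits per value, the second needs 4 because of the delta 12, and the last two deltas take one vInt byte each.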
  • 32. 4. Stored fields
      • In-memory index for a subset of the doc ids
          – memory-efficient thanks to monotonic compression
          – searched using binary search
      • Stored fields
          – stored sequentially
          – compressed (LZ4) in 16+KB blocks
      [Figure: index entries (docId=0, offset=42), (docId=3, offset=127), (docId=4, offset=199) pointing into a file where docs 0 to 6 are stored sequentially in 16KB blocks]
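A minimal sketch of the in-memory lookup, assuming the three index entries in the figure are the first doc id of each compressed block together with that block's file offset (this reading of the figure is an assumption). `block_offset_for` is a hypothetical helper.

```python
from bisect import bisect_right

# One entry per block: first doc id stored in the block, and the block's offset in the file.
block_first_doc = [0, 3, 4]
block_offset = [42, 127, 199]

def block_offset_for(doc_id):
    """Binary-search for the last block whose first doc id is <= doc_id."""
    k = bisect_right(block_first_doc, doc_id) - 1
    return block_offset[k]
```

Fetching a document then costs one seek to the returned offset, followed by decompressing that block and scanning it sequentially for the requested doc.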
  • 33. Query execution
      • 2 disk seeks per field for search
      • 1 disk seek per doc for stored fields
      • It is common that the terms dict / postings lists fit into the file-system cache
      • “Pulse” optimization
          – For unique terms (freq=1), postings are inlined in the terms dict
          – Only 1 disk seek
          – Will always be used for your primary keys
  • 34. Quiz
  • 35. What is happening here?
      [Plot: qps against #docs in the index, with two points marked 1 and 2]
  • 36. What is happening here?
      [Plot: qps against #docs in the index, with two points marked 1 and 2]
      1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
  • 37. What is happening here?
      [Plot: qps against #docs in the index, with two points marked 1 and 2]
      1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
      2. Terms dict/Postings lists not fully in the cache