Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited
How does Lucene store your data?
Adrien Grand
@jpountz
Apache Lucene/Solr committer
Software engineer @ Elasticsearch
Outline
● Segments
● What does a segment store?
● Improvements since Lucene 4.0
Segments
Segments
● Every segment is a fully functional index
● High numbers of segments trigger merges
● Merge: copy all live data from several segments into a new one
Segments
● Immutable (up to deletes)
  ● SSD-friendly (no write amplification)
  ● Great for caches (including the FS cache)
  ● Easy incremental backups
● Merged together when there are too many of them
  ● Merging expunges deleted documents
● An IndexReader is a point-in-time view over a fixed number of segments
  ● Need to reopen to see changes
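The segment lifecycle above can be sketched with a toy model. This is only an illustration in Python, not Lucene code: the `Segment` class and the `merge` function are hypothetical names, and the sketch assumes that a segment's documents never change once written while deletes are recorded in a separate per-document flag list.

```python
class Segment:
    """Hypothetical stand-in for a segment: immutable docs plus live-docs flags."""

    def __init__(self, docs):
        self.docs = tuple(docs)              # written once, never modified
        self.live = [True] * len(self.docs)  # one flag per doc ID

    def delete(self, doc_id):
        # A delete only flips a flag; the document data itself is untouched.
        self.live[doc_id] = False

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]


def merge(segments):
    """Copy all live data from several segments into a new one."""
    merged = []
    for seg in segments:
        merged.extend(seg.live_docs())
    return Segment(merged)


a = Segment(["doc A0", "doc A1", "doc A2"])
b = Segment(["doc B0", "doc B1"])
a.delete(1)
c = merge([a, b])
print(c.docs)
```

The merged segment contains only live documents, which is how a merge reclaims the space of deleted ones; the source segments are left as they were.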
What does a segment store?
What is in a segment?
Component | Stores | Useful for
Segment & Field infos | Metadata | Getting doc count / index options
Live docs | Non-deleted docs | Excluding deleted docs from results
Inverted index | The mapping from terms to docs and positions | Finding matching docs
Norms | Index-time boosts | Scoring
Doc values | Any number or (small) bytes | Sorting, faceting, custom scoring
Stored fields | The original doc | Result summaries
Term vectors | Single doc inverted index | Highlighting, MoreLikeThis
What is in a segment?
Component | API
Field infos | AtomicReader.getFieldInfos()
Live docs | AtomicReader.getLiveDocs()
Inverted index | AtomicReader.fields()
Norms | AtomicReader.getNormValues(String field)
Doc values | AtomicReader.get*Values(String field)
Stored fields | AtomicReader.document(int docID, StoredFieldVisitor visitor)
Term vectors | AtomicReader.getTermVectors(int docID)
Doc IDs
● Lucene gives sequential doc IDs to all documents in a segment, from 0 (inclusive) to AtomicReader.maxDoc() (exclusive)
● A doc ID uniquely identifies a document inside a segment
  ● e.g. if the inverted index API says that document 42 matches the term "bbuzz", I can query the stored fields API with the same ID
● Allows for efficient storage
  ● Doc IDs can be used as ordinals
  ● Small & dense ints are easy to compress
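To make the "same ID across APIs" point concrete, here is a minimal Python sketch. The data and the names `stored_fields`, `postings`, and `search` are hypothetical; the sketch only assumes that every per-segment structure is addressed by the same dense doc ID.

```python
# Hypothetical per-segment structures, all addressed by the same doc ID.
stored_fields = ["first doc", "second doc about bbuzz", "third doc, also bbuzz"]
postings = {"bbuzz": [1, 2], "first": [0]}   # term -> sorted doc IDs
max_doc = len(stored_fields)                 # doc IDs range over [0, max_doc)


def search(term):
    """Find matching doc IDs in the postings, then use them as list ordinals."""
    return [stored_fields[doc_id] for doc_id in postings.get(term, [])]


hits = search("bbuzz")
print(hits)
```

Because doc IDs are dense, the stored fields can live in a plain list (or array) with no hash lookup in between.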
Detour: bit packing
● Efficient technique to store blocks of small ints
  ● Supports random access
  ● Special case: bits per value = 1 is a bit set
● Say you want to store 5 30 1 1 10 12
  ● Raw data: 6 * 32 = 192 bits
  ● Packed: 6 * 5 = 30 bits (84% size reduction!)
00000000000000000000000000000101 = 5
00000000000000000000000000011110 = 30
00000000000000000000000000000001 = 1
00000000000000000000000000000001 = 1
00000000000000000000000000001010 = 10
00000000000000000000000000001100 = 12
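The packing of these six values can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and a single arbitrary-precision Python integer stands in for the fixed-size words a real implementation would use.

```python
def bits_required(values):
    """Smallest bit width able to represent every value (at least 1)."""
    return max(1, max(values).bit_length())


def pack(values, bits_per_value):
    """Concatenate the values into one integer, bits_per_value bits each."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits_per_value)
    return packed


def get(packed, bits_per_value, index):
    """Random access: shift to the slot, then mask off the other values."""
    mask = (1 << bits_per_value) - 1
    return (packed >> (index * bits_per_value)) & mask


values = [5, 30, 1, 1, 10, 12]
bpv = bits_required(values)       # 30 is the largest value and fits in 5 bits
packed = pack(values, bpv)
total_bits = len(values) * bpv    # compare with 6 * 32 bits for raw ints
print(bpv, total_bits, [get(packed, bpv, i) for i in range(len(values))])
```

Random access stays cheap because the slot of value i is always at bit offset i * bits_per_value.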
Fixed-length data
● Dense doc IDs are great for single-valued fixed-length data
  ● Store data sequentially
  ● Data for doc N is at offset N * dataLength
  ● Allows for fast and memory-efficient lookups
● Live docs (1 bit per value)
● Norms (1 byte per value)
● Numeric doc values
  ● Blocks with independent numbers of bits per value
  ● (Figure: consecutive blocks of 4096 values each)
  ● Block idx: docID / 4096
  ● Idx in block: docID % 4096
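The block lookup for numeric doc values can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the on-disk format: values are non-negative, each block is held as one Python integer, and `build_blocks` and `lookup` are hypothetical names.

```python
BLOCK_SIZE = 4096


def build_blocks(values):
    """Split values into blocks; each block records its own bits per value."""
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        chunk = values[start:start + BLOCK_SIZE]
        bpv = max(1, max(chunk).bit_length())
        packed = 0
        for i, v in enumerate(chunk):
            packed |= v << (i * bpv)
        blocks.append((bpv, packed))
    return blocks


def lookup(blocks, doc_id):
    bpv, packed = blocks[doc_id // BLOCK_SIZE]   # block idx
    idx_in_block = doc_id % BLOCK_SIZE           # idx in block
    return (packed >> (idx_in_block * bpv)) & ((1 << bpv) - 1)


# Three blocks with very different value ranges.
values = [d % 7 for d in range(4096)] + [1000 + d for d in range(4096)] + [1] * 100
blocks = build_blocks(values)
print([bpv for bpv, _ in blocks])
```

Choosing the bit width per block means that a few large values only make their own block wider, not the whole field.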
Variable-length data
(Figure: an "end addresses" array pointing into a "bytes" array)
● Binary doc values
● Stored fields
● Term vectors
● Need one level of indirection: store end addresses
  ● Easy to compress since end addresses are increasing
  ● Only store endAddress - (docID+1) * avgLength
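The end-address trick can be sketched like this. It is a minimal Python illustration with hypothetical names (`encode`, `read`), assuming an integer average length; the stored deltas can be negative in general, and how they are then packed is left out.

```python
def encode(blobs):
    """Concatenate the blobs and keep only each end address's deviation."""
    data = b"".join(blobs)
    end_addresses = []
    end = 0
    for blob in blobs:
        end += len(blob)
        end_addresses.append(end)
    avg_length = end // len(blobs) if blobs else 0   # integer average
    # Store endAddress - (docID + 1) * avgLength instead of endAddress.
    deltas = [e - (doc_id + 1) * avg_length
              for doc_id, e in enumerate(end_addresses)]
    return data, avg_length, deltas


def read(data, avg_length, deltas, doc_id):
    """Rebuild the start and end addresses of one doc and slice its bytes."""
    end = deltas[doc_id] + (doc_id + 1) * avg_length
    start = 0 if doc_id == 0 else deltas[doc_id - 1] + doc_id * avg_length
    return data[start:end]


blobs = [b"lucene", b"stores", b"your", b"data!!"]
data, avg_length, deltas = encode(blobs)
print(avg_length, deltas)
```

The end addresses themselves grow with the size of the data, while the deltas stay small as long as lengths are close to the average, which makes them cheap to bit-pack.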
String data
● Terms index
● Sorted (Set) doc values
● MemoryPostingsFormat
● Suggesters
● FST: automaton with weighted arcs
  ○ compact thanks to shared prefixes/suffixes
● Example: Stack = 1, Star = 2, Stop = 3, Top = 4
(Figure: FST over these terms; arcs labelled s/1, t, a, c, k, r/1, o/2, p, t/4, o)
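The idea of weighted arcs can be sketched with a simplified structure. The Python below is not Lucene's FST: it shares prefixes only (a trie, with no suffix sharing or minimization), it assumes non-negative integer values, and `Node`, `build`, and `lookup` are hypothetical names. It does show how the value of a term is the sum of the weights along its path.

```python
class Node:
    def __init__(self):
        self.arcs = {}          # label -> (weight, child Node)
        self.is_final = False
        self.final_weight = 0   # extra weight added when a term ends here


def build(terms):
    """Build a trie whose arc weights sum to each term's value along its path."""
    # First pass: plain trie storing each term's value on its last node.
    root = {"children": {}, "value": None}
    for term, value in terms.items():
        node = root
        for ch in term:
            node = node["children"].setdefault(ch, {"children": {}, "value": None})
        node["value"] = value

    def subtree_min(node):
        candidates = [subtree_min(c) for c in node["children"].values()]
        if node["value"] is not None:
            candidates.append(node["value"])
        return min(candidates)

    # Second pass: push the smallest value of each subtree onto the arc entering it.
    def convert(node, carried):
        out = Node()
        if node["value"] is not None:
            out.is_final = True
            out.final_weight = node["value"] - carried
        for ch, child in sorted(node["children"].items()):
            child_min = subtree_min(child)
            out.arcs[ch] = (child_min - carried, convert(child, child_min))
        return out

    return convert(root, 0)


def lookup(fst, term):
    total, node = 0, fst
    for ch in term:
        if ch not in node.arcs:
            return None
        weight, node = node.arcs[ch]
        total += weight
    return total + node.final_weight if node.is_final else None


fst = build({"stack": 1, "star": 2, "stop": 3, "top": 4})
print([lookup(fst, t) for t in ("stack", "star", "stop", "top")])
```

With these four terms the non-zero weights end up on the arcs s, r, o, and the initial t, so looking up "stop" adds the weight of s and the weight of o.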
Inverted index
● Terms index: maps a term prefix to a block in the dictionary
  ○ FST
● Terms dictionary: statistics + pointer into the postings lists
● Postings lists: encode matching docs in sorted order
  ○ + positions + offsets
Original data: 1 2 4 11 42 43 (6 * 4 = 24 bytes)
Split into blocks of 3 (128 in practice): 1 2 4 | 11 42 43
Delta-encode: 1 1 2 | 11 31 1
Pack values: 2 [1 1 2] | 5 [11 31 1] (1+1+1+2 = 5 bytes)
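The postings example can be reproduced with a short Python sketch. It follows the simplified scheme of the example rather than the actual postings format: each block is delta-encoded on its own starting from 0, one header byte per block is assumed to hold the bits per value, and the function names are hypothetical.

```python
def encode_postings(doc_ids, block_size=3):
    """Return a list of (bits_per_value, deltas) blocks and the size in bytes."""
    blocks, total_bytes = [], 0
    for start in range(0, len(doc_ids), block_size):
        block = doc_ids[start:start + block_size]
        # As in the example, each block is delta-encoded on its own, from 0.
        deltas = [block[0]] + [b - a for a, b in zip(block, block[1:])]
        bpv = max(1, max(deltas).bit_length())
        payload_bytes = (len(deltas) * bpv + 7) // 8   # round up to whole bytes
        total_bytes += 1 + payload_bytes               # 1 header byte for bpv
        blocks.append((bpv, deltas))
    return blocks, total_bytes


def decode_postings(blocks):
    """Undo the delta encoding by accumulating within each block."""
    doc_ids = []
    for _, deltas in blocks:
        current = 0
        for d in deltas:
            current += d
            doc_ids.append(current)
    return doc_ids


doc_ids = [1, 2, 4, 11, 42, 43]
blocks, size = encode_postings(doc_ids)
print(blocks, size)
```

Delta-encoding sorted doc IDs keeps the numbers small, and packing each block with its own bit width is what brings the example down to 5 bytes.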
Improvements since Lucene 4.0
Improvements since Lucene 4.0
● LUCENE-4399 (4.1): no seek on write
● LUCENE-4498 (4.1): terms "pulsed" when freq=1
● Compression:
  ● LUCENE-3892 (4.1): postings encoding moved from vInt to packed ints: smaller & faster!
  ● LUCENE-4226 (4.1): compressed stored fields
  ● LUCENE-4599 (4.2): compressed term vectors
  ● LUCENE-4547 (4.2): better doc values:
    ● blocks of packed ints for numbers
    ● compression of addresses for binary
    ● FST for Sorted (Set)
  ● LUCENE-4936 (4.4): compression for date DV
Performance
●http://people.apache.org/~mikemccand/lucenebench/Term.html
Detour: LZ4
● Super simple, blazing fast compression codec
● http://code.google.com/p/lz4/
● https://github.com/jpountz/lz4-java
● Example
  ● L: literals
  ● R: reference = (offset decrement, length)
  ● Input: 1 2 3 6 7 6 7 6 7 6 7 8 9 1 2 3 6 7 10
  ● Encoded: L 1 2 3 6 7 R(2,6) L 8 9 R(13,5) L 10
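Decoding the example can be sketched in Python. This models only the literal/reference idea at the token level, not the real LZ4 byte format; the tuple representation and the `decode` function are hypothetical.

```python
def decode(tokens):
    """Decode a list of ('L', [values]) and ('R', offset, length) tokens."""
    out = []
    for token in tokens:
        if token[0] == "L":
            out.extend(token[1])
        else:
            _, offset, length = token
            start = len(out) - offset
            # Copy one value at a time so a reference may overlap its own output.
            for i in range(length):
                out.append(out[start + i])
    return out


tokens = [
    ("L", [1, 2, 3, 6, 7]),
    ("R", 2, 6),
    ("L", [8, 9]),
    ("R", 13, 5),
    ("L", [10]),
]
decoded = decode(tokens)
print(decoded)
```

Note that R(2,6) copies 6 values from only 2 positions back: the copy reads values it has just written, which is how a short repeating pattern is encoded by a single reference.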
Detour: LZ4
●https://github.com/ning/jvm-compressor-benchmark
Twitter benchmark
● Quick benchmark on a Twitter corpus
  ● 160908 tweets
  ● WhitespaceAnalyzer
Field | Type | Indexed | Stored | Doc values | Term vectors
id | long | yes | yes | - | -
created_at | long | - | yes | numeric | -
user.name | string | yes | yes | sorted | -
text | text | yes | yes | - | yes
Twitter benchmark
Component | Lucene 4.0 | Lucene 4.4 (not released yet) | Difference
Inverted index | 23.3M | 20.5M | -12%
Norms | 157K | 157K | +0%
Doc values | 3.4M | 3.1M | -9%
Stored fields | 21.2M | 15.7M | -26%
Term vectors | 23.5M | 15.5M | -34%
Overall | ~71.5M | ~55.0M | -23%
Questions?

Berlin Buzzwords 2013 - How does Lucene store your data?

  • 1. Copyright Elasticsearch 2013. Copying, publishing and/or distributing without written permission is strictly prohibited How does Lucene store your data? Adrien Grand @jpountz Apache Lucene/Solr committer Software engineer @ Elasticsearch
  • 2. Outline
      ● Segments
      ● What does a segment store?
      ● Improvements since Lucene 4.0
  • 3. Segments
  • 4. Segments
      ● Every segment is a fully functional index
      ● High numbers of segments trigger merges
      ● Merge: copy all live data from several segments into a new one
  • 5. Segments
      ● Immutable (up to deletes)
          ● SSD-friendly (no write amplification)
          ● Great for caches (including the FS cache)
          ● Easy incremental backups
      ● Merged together when there are too many of them
          ● Merging expunges deleted documents
      ● An IndexReader is a point-in-time view over a fixed number of segments
          ● Need to reopen to see changes
  • 6. What does a segment store?
  • 7. What is in a segment?
                            | Stores                                       | Useful for
      Segment & Field infos | Metadata                                     | Getting doc count / index options
      Live docs             | Non-deleted docs                             | Excluding deleted docs from results
      Inverted index        | The mapping from terms to docs and positions | Finding matching docs
      Norms                 | Index-time boosts                            | Scoring
      Doc values            | Any number or (small) bytes                  | Sorting, faceting, custom scoring
      Stored fields         | The original doc                             | Result summaries
      Term vectors          | Single doc inverted index                    | Highlighting, MoreLikeThis
  • 8. What is in a segment?
                     | API
      Field infos    | AtomicReader.getFieldInfos()
      Live docs      | AtomicReader.getLiveDocs()
      Inverted index | AtomicReader.fields()
      Norms          | AtomicReader.getNormValues(String field)
      Doc values     | AtomicReader.get*Values(String field)
      Stored fields  | AtomicReader.document(int docID, FieldVisitor visitor)
      Term vectors   | AtomicReader.getTermVectors()
  • 9. Doc IDs
      ● Lucene gives sequential doc IDs to all documents in a segment, from 0 (inclusive) to AtomicReader.maxDoc() (exclusive)
      ● A doc ID uniquely identifies a document inside a segment
          ● i.e. if the inverted index API says that document 42 matches the term "bbuzz", I can query the stored fields API with the same ID
      ● Allows for efficient storage
          ● Doc IDs can be used as ordinals
          ● Small & dense ints are easy to compress
  • 10. Detour: bit packing
      ● Efficient technique to store blocks of small ints
          ● Supports random access
          ● Special case: bits per value = 1 is a bit set
      ● Say you want to store 5 30 1 1 10 12
          ● Raw data: 6 * 32 = 192 bits
          ● Packed: 6 * 5 = 30 bits (84% size reduction!)
      00000000000000000000000000000101 = 5
      00000000000000000000000000011110 = 30
      00000000000000000000000000000001 = 1
      00000000000000000000000000000001 = 1
      00000000000000000000000000001010 = 10
      00000000000000000000000000001100 = 12
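The bit-packing slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's packed ints implementation: the helper names `bits_required`, `pack` and `get` are made up for this example, and the packed data is held in a single Python integer rather than in an array of longs.

```python
def bits_required(values):
    """Smallest bit width that can hold every value (at least 1)."""
    return max(1, max(values).bit_length())


def pack(values, bits_per_value):
    """Concatenate the values into one integer, bits_per_value bits each."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits_per_value)
    return packed


def get(packed, bits_per_value, index):
    """Random access: shift to the slot, then mask off its bits."""
    mask = (1 << bits_per_value) - 1
    return (packed >> (index * bits_per_value)) & mask


values = [5, 30, 1, 1, 10, 12]
bpv = bits_required(values)       # 30 is the largest value and needs 5 bits
packed = pack(values, bpv)
total_bits = len(values) * bpv    # compare with 6 * 32 bits for raw ints
```

Because every slot has the same width, reading value `i` never requires decoding the values before it, which is what makes random access cheap.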
  • 11. Fixed-length data
      ● Dense doc IDs are great for single-valued fixed-length data
          ● Store data sequentially
          ● Data for doc N is at offset N * dataLength
          ● Allows for fast and memory-efficient lookups
      ● Live docs (1 bit per value)
      ● Norms (1 byte per value)
      ● Numeric doc values
          ● Blocks with independent numbers of bits per value
          ● Block idx = docID / 4096
          ● Idx in block = docID % 4096
      [Figure: three consecutive blocks of 4096 values each]
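A minimal sketch of the block layout described for numeric doc values, assuming each block simply records its own bits per value next to its packed payload. The block size of 4 is chosen only to keep the example small (the slide uses 4096), and `build_blocks` and `lookup` are hypothetical names, not Lucene APIs.

```python
BLOCK_SIZE = 4  # illustration only; the slide uses blocks of 4096 values


def build_blocks(values, block_size=BLOCK_SIZE):
    """Pack each block with the bit width required by its own largest value."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        bpv = max(1, max(chunk).bit_length())
        packed = 0
        for i, v in enumerate(chunk):
            packed |= v << (i * bpv)
        blocks.append((bpv, packed))
    return blocks


def lookup(blocks, doc_id, block_size=BLOCK_SIZE):
    """Block idx = doc_id // block_size, idx in block = doc_id % block_size."""
    bpv, packed = blocks[doc_id // block_size]
    idx = doc_id % block_size
    return (packed >> (idx * bpv)) & ((1 << bpv) - 1)


values = [3, 1, 2, 0, 1000, 900, 1023, 512, 7]
blocks = build_blocks(values)
widths = [bpv for bpv, _ in blocks]
```

With independent widths, one block of large values does not inflate the storage of blocks that only hold small ones.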
  • 12. Variable-length data
      ● Binary doc values
      ● Stored fields
      ● Term vectors
      ● Need one level of indirection: store end addresses
          ● Easy to compress since end addresses are increasing
          ● Only store endAddress - (docID+1) * avgLength
      [Figure: an "end addresses" array pointing into a "bytes" array]
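The end-address trick can be illustrated as follows. This is a sketch under assumptions: the average length is taken as an integer division over the whole set of documents, and the function names are invented for the example. The slide does not say how Lucene computes or stores the average.

```python
def encode_end_addresses(lengths):
    """Store, per doc, endAddress - (docID + 1) * avgLength instead of endAddress."""
    avg_length = sum(lengths) // len(lengths)  # assumption: integer average
    deltas = []
    end = 0
    for doc_id, length in enumerate(lengths):
        end += length
        deltas.append(end - (doc_id + 1) * avg_length)
    return avg_length, deltas


def end_address(avg_length, deltas, doc_id):
    """Rebuild the real end address from the stored delta."""
    return deltas[doc_id] + (doc_id + 1) * avg_length


def slice_for(avg_length, deltas, doc_id):
    """Byte range of a doc: from the previous doc's end to its own end."""
    start = 0 if doc_id == 0 else end_address(avg_length, deltas, doc_id - 1)
    return start, end_address(avg_length, deltas, doc_id)


lengths = [10, 12, 9, 11, 10]
avg_length, deltas = encode_end_addresses(lengths)
```

The stored deltas stay close to zero when document lengths are similar, so they need far fewer bits than the raw, ever-growing end addresses. They can be negative, so a real encoder needs a signed representation.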
  • 13. String data
      ● Terms index
      ● Sorted (Set) doc values
      ● MemoryPostingsFormat
      ● Suggesters
      ● FST: automaton with weighted arcs
          ○ Compact thanks to shared prefixes/suffixes
      ● Example mapping
          ● Stack = 1
          ● Star = 2
          ● Stop = 3
          ● Top = 4
      [Figure: FST for these four terms, with arcs labelled s/1, t, a, c, k, r/1, o/2, p, t/4, o]
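To show how weighted arcs yield the values above, here is a hand-built automaton in Python that mirrors the arc labels from the slide (s/1, r/1, o/2, t/4, all others with weight 0). The state numbering and the `lookup` helper are assumptions made for this sketch. It does not show how Lucene builds or stores an FST, only how a lookup sums the weights along a path.

```python
# ARCS[state][label] = (output weight, next state); state numbers are arbitrary.
ARCS = {
    0: {"s": (1, 1), "t": (4, 7)},
    1: {"t": (0, 2)},
    2: {"a": (0, 3), "o": (2, 4)},
    3: {"c": (0, 5), "r": (1, 6)},
    4: {"p": (0, 6)},  # shared by "stop" and "top" (common suffix)
    5: {"k": (0, 6)},
    7: {"o": (0, 4)},
}
FINAL_STATES = {6}


def lookup(term):
    """Follow one arc per character and sum the weights; None if not accepted."""
    state, output = 0, 0
    for ch in term:
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return None
        weight, state = arc
        output += weight
    return output if state in FINAL_STATES else None


results = {term: lookup(term) for term in ["stack", "star", "stop", "top"]}
```

The prefix "st" is stored once for three terms and the final "p" arc is shared by "stop" and "top", which is where the compactness comes from.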
  • 14. Inverted index
      ● Terms index: maps a term prefix to a block in the dictionary
          ○ FST
      ● Terms dictionary: statistics + pointer into the postings lists
      ● Postings lists: encode matching docs in sorted order
          ○ + positions + offsets
      Original data:                           1 2 4 11 42 43         (6 * 4 = 24 bytes)
      Split into blocks of 3 (128 in practice): 1 2 4 | 11 42 43
      Delta-encode:                            1 1 2 | 11 31 1
      Pack values:                             2 [1 1 2] | 5 [11 31 1] (1+1+1+2 = 5 bytes)
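The postings example can be reproduced with a short Python sketch. It follows the slide's example in restarting the deltas at the start of every block, and it assumes one header byte per block for the bits-per-value figure, which is how the slide's 1+1+1+2 byte count reads. The function names are hypothetical and this is not Lucene's actual postings format.

```python
def encode(doc_ids, block_size=3):
    """Split into blocks, delta-encode within each block, pick a bit width."""
    encoded = []
    for start in range(0, len(doc_ids), block_size):
        block = doc_ids[start:start + block_size]
        # First value is kept as is; the rest are gaps to the previous doc ID.
        deltas = [block[0]] + [b - a for a, b in zip(block, block[1:])]
        bpv = max(1, max(deltas).bit_length())
        encoded.append((bpv, deltas))
    return encoded


def size_in_bytes(encoded):
    """Assumed layout: 1 header byte per block + packed deltas rounded up to bytes."""
    return sum(1 + (bpv * len(deltas) + 7) // 8 for bpv, deltas in encoded)


def decode(encoded):
    """Running sum inside each block restores the sorted doc IDs."""
    doc_ids = []
    for _, deltas in encoded:
        current = 0
        for d in deltas:
            current += d
            doc_ids.append(current)
    return doc_ids


doc_ids = [1, 2, 4, 11, 42, 43]
encoded = encode(doc_ids)
packed_size = size_in_bytes(encoded)
```

Delta encoding helps because gaps between sorted doc IDs are much smaller than the IDs themselves, so each block needs fewer bits per value.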
  • 15. Improvements since Lucene 4.0
  • 16. Improvements since Lucene 4.0
      ● LUCENE-4399 (4.1): no seek on write
      ● LUCENE-4498 (4.1): terms "pulsed" when freq=1
      ● Compression:
          ● LUCENE-3892 (4.1): postings encoding moved from vInt to packed ints: smaller & faster!
          ● LUCENE-4226 (4.1): compressed stored fields
          ● LUCENE-4599 (4.2): compressed term vectors
          ● LUCENE-4547 (4.2): better doc values:
              ● blocks of packed ints for numbers
              ● compression of addresses for binary
              ● FST for Sorted (Set)
          ● LUCENE-4936 (4.4): compression for date DV
  • 17. Performance
      ● http://people.apache.org/~mikemccand/lucenebench/Term.html
  • 18. Detour: LZ4
      ● Super simple, blazing fast compression codec
      ● http://code.google.com/p/lz4/
      ● https://github.com/jpountz/lz4-java
      ● Example
          ● L: literals
          ● R: reference = (offset decrement, length)
          ● Input:   1 2 3 6 7 6 7 6 7 6 7 8 9 1 2 3 6 7 10
          ● Encoded: L 1 2 3 6 7 R(2,6) L 8 9 R(13,5) L 10
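The literal/reference example can be checked with a small decoder. This sketch works on the slide's symbolic tokens, written here as Python tuples, and not on the real LZ4 byte format. The token representation and the `decode` name are assumptions for illustration.

```python
def decode(tokens):
    """Expand literal runs and back-references into the original sequence."""
    out = []
    for token in tokens:
        if token[0] == "L":
            out.extend(token[1])
        else:
            _, offset, length = token
            start = len(out) - offset
            # Copy one item at a time so that a reference may overlap the
            # output it is producing (length greater than offset).
            for i in range(length):
                out.append(out[start + i])
    return out


tokens = [
    ("L", [1, 2, 3, 6, 7]),
    ("R", 2, 6),
    ("L", [8, 9]),
    ("R", 13, 5),
    ("L", [10]),
]
decoded = decode(tokens)
```

The reference R(2,6) is longer than its offset, so it keeps re-reading items it has just written. This is how a short repeated pattern such as "6 7" expands into a longer run.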
  • 19. Detour: LZ4
      ● https://github.com/ning/jvm-compressor-benchmark
  • 20. Twitter benchmark
      ● Quick benchmark on a Twitter corpus
          ● 160908 tweets
          ● WhitespaceAnalyzer
      Field      | Type   | Indexed | Stored | Doc values | Term vectors
      id         | long   | yes     | yes    | -          | -
      created_at | long   | -       | yes    | numeric    | -
      user.name  | string | yes     | yes    | sorted     | -
      text       | text   | yes     | yes    | -          | yes
  • 21. Twitter benchmark
                     | Lucene 4.0 | Lucene 4.4 (not released yet) | Difference
      Inverted index | 23.3M      | 20.5M                         | -12%
      Norms          | 157K       | 157K                          | +0%
      Doc values     | 3.4M       | 3.1M                          | -9%
      Stored fields  | 21.2M      | 15.7M                         | -26%
      Term vectors   | 23.5M      | 15.5M                         | -34%
      Overall        | ~71.5M     | ~55.0M                        | -23%
  • 22. Questions?