Efficient Data Storage for Analytics
with Apache Parquet 2.0
Julien Le Dem @J_
Processing tools tech lead, Data Platform at Twitter
Nong Li nong@cloudera.com
Software engineer, Cloudera Impala
@ApacheParquet
Outline
- Why we need efficiency
- Properties of efficient algorithms
- Enabling efficiency
- Efficiency in Apache Parquet
Why we need efficiency
Producing a lot of data is easy
Producing a lot of derived data is even easier.

Solution: Compress all the things!
Scanning a lot of data is easy
1% completed
… but not necessarily fast.

Waiting is not productive. We want faster turnaround.

Compression but not at the cost of reading speed.
Trying new tools is easy
[Diagram: a shared Storage layer used by many tools: log collection, ETL, ad-hoc queries, automated dashboard, machine learning, graph processing, external datasources and schema definition, ...]
We need a storage format interoperable with all the tools we use
and keep our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
- Interoperability

- Space efficiency

- Query efficiency
Parquet timeline
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats

- March 2013: OSS announcement; Criteo signs on for Hive integration

- July 2013: 1.0 release. 18 contributors from more than 5 organizations.

- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.

- Parquet 2.0 coming as Apache release
Interoperability
Interoperable
Model agnostic, language agnostic (Java, C++)
[Diagram of the layers around the Parquet file format:]
- Object model: Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, ...
- Converters (assembly/striping): parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, ...
- Column encoding: Parquet file format
- Query execution and encoding: Impala, ...
Frameworks and libraries integrated with Parquet
- Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, ProtocolBuffers, POJOs
Enabling efficiency
Columnar storage
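[Figure: the same table stored in a row layout and in a column layout; a nested schema is handled column by column in the same way]
Logical table representation:
a  b  c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
Row layout: a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5
Column layout: a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
Encoding: each column is stored as an encoded chunk (encoded chunk a, encoded chunk b, encoded chunk c)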
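The two layouts on this slide can be sketched in a few lines of Java. This is only an illustration of the ordering of values: the class and method names are made up for the example, and it is not how Parquet itself writes files.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not part of Parquet.
final class LayoutSketch {
    /** Row layout: the values of one row are stored next to each other. */
    static List<String> rowLayout(String[][] table) {
        List<String> out = new ArrayList<>();
        for (String[] row : table) {
            for (String cell : row) {
                out.add(cell);
            }
        }
        return out;
    }

    /** Column layout: the values of one column are stored next to each other. */
    static List<String> columnLayout(String[][] table) {
        List<String> out = new ArrayList<>();
        int columns = table.length == 0 ? 0 : table[0].length;
        for (int c = 0; c < columns; c++) {
            for (String[] row : table) {
                out.add(row[c]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] table = {
            {"a1", "b1", "c1"},
            {"a2", "b2", "c2"},
            {"a3", "b3", "c3"},
        };
        System.out.println(rowLayout(table));    // [a1, b1, c1, a2, b2, c2, a3, b3, c3]
        System.out.println(columnLayout(table)); // [a1, a2, a3, b1, b2, b3, c1, c2, c3]
    }
}
```

In the column layout all values of one column are contiguous, which is what makes per-column encodings and reading a subset of columns possible.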
Parquet nested representation
Schema (borrowed from the Google Dremel paper):
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
- docid
- links.backward
- links.forward
- name.language.code
- name.language.country
- name.url
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
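As a rough illustration of how a nested schema maps to one column per leaf field, the sketch below walks a schema tree and collects the dotted paths. The `Field` class is a hypothetical stand-in for a schema node, not a Parquet API, and the sketch ignores repetition and definition levels entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: Field is a hypothetical schema node, not a Parquet class.
final class SchemaColumns {
    /** A schema node: a group if it has children, otherwise a leaf. */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, Field... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Collects one dotted, lower-cased path per leaf field below the root. */
    static List<String> leafColumns(Field root) {
        List<String> out = new ArrayList<>();
        for (Field child : root.children) {
            collect(child, "", out);
        }
        return out;
    }

    private static void collect(Field field, String prefix, List<String> out) {
        String path = prefix + field.name.toLowerCase(Locale.ROOT);
        if (field.children.isEmpty()) {
            out.add(path); // a leaf becomes one column
        } else {
            for (Field child : field.children) {
                collect(child, path + ".", out);
            }
        }
    }

    /** The Document schema from the slide. */
    static Field documentSchema() {
        return new Field("Document",
            new Field("DocId"),
            new Field("Links", new Field("Backward"), new Field("Forward")),
            new Field("Name",
                new Field("Language", new Field("Code"), new Field("Country")),
                new Field("Url")));
    }

    public static void main(String[] args) {
        leafColumns(documentSchema()).forEach(System.out::println);
    }
}
```

Running `main` prints the six column paths listed on the slide, from `docid` to `name.url`.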
Statistics for filter and query optimization
[Figure: the a/b/c table shown three times. Vertical partitioning (projection push down) selects a subset of the columns, horizontal partitioning (predicate push down) selects a subset of the rows, and combining both reads only the cells that are needed.]
Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!
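A minimal sketch of how statistics can support predicate push down, assuming each chunk of a column keeps the minimum and maximum of its values. `ChunkStats` and the method names are hypothetical and chosen for this example; they are not the Parquet API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: ChunkStats is a hypothetical type, not a Parquet class.
final class StatsFilterSketch {
    /** Per-chunk statistics: smallest and largest value stored in the chunk. */
    static final class ChunkStats {
        final long min;
        final long max;

        ChunkStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** A chunk can contain the value v only if min <= v <= max. */
    static boolean mightContain(ChunkStats stats, long v) {
        return stats.min <= v && v <= stats.max;
    }

    /** Indexes of the chunks that still have to be read for the filter "column == v". */
    static List<Integer> chunksToRead(List<ChunkStats> chunks, long v) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            if (mightContain(chunks.get(i), v)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<ChunkStats> chunks = new ArrayList<>();
        chunks.add(new ChunkStats(0, 99));
        chunks.add(new ChunkStats(100, 199));
        chunks.add(new ChunkStats(150, 300));
        System.out.println(chunksToRead(chunks, 160)); // [1, 2]
    }
}
```

The filter is conservative: a chunk that passes the check may still contain no matching row, but a chunk that fails it can be skipped without reading its data.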
Properties of efficient algorithms
CPU Pipeline
[Figure: a four-stage pipeline (pipes 1 to 4) executing instructions a, b, c, d against the clock. Ideal case: the stages overlap and the clock axis runs to 7. Mis-prediction ("bubble"): the pipeline has to be refilled after instruction a, and the clock axis runs to 10.]
Optimize for the processor pipeline
“Bubbles” can be caused by:
- ifs
- loops
- virtual calls
- data dependency
cost ~ 12 cycles
Minimize CPU cache misses
A cache miss costs from 10 to hundreds of cycles, depending on the cache level.
[Figure: CPU cache, connected to RAM over the bus]
Encodings in Apache Parquet 2.0
The right encoding for the right job
- Delta encodings: for sorted datasets or signals where the variation is small compared to the absolute value (timestamps, auto-generated ids, metrics, …). Focuses on avoiding branching.
- Prefix coding (delta encoding for strings): when dictionary encoding does not work.
- Dictionary encoding: small (60K) set of values (server IP, experiment id, …).
- Run Length Encoding: repetitive data.
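To make one of these encodings concrete, here is a minimal dictionary-encoding sketch in Java. It only shows the idea (each distinct value is stored once and every occurrence is replaced by a small integer id); the class and field names are hypothetical, and the actual Parquet page layout is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not the Parquet dictionary page format.
final class DictionarySketch {
    final List<String> dictionary = new ArrayList<>(); // id -> distinct value
    final List<Integer> ids = new ArrayList<>();       // one id per input value

    /** Replaces every value by the id of its first occurrence in the dictionary. */
    static DictionarySketch encode(List<String> values) {
        DictionarySketch out = new DictionarySketch();
        Map<String, Integer> index = new HashMap<>();
        for (String value : values) {
            Integer id = index.get(value);
            if (id == null) {
                id = out.dictionary.size();
                index.put(value, id);
                out.dictionary.add(value);
            }
            out.ids.add(id);
        }
        return out;
    }

    /** Rebuilds the original values by looking every id up in the dictionary. */
    List<String> decode() {
        List<String> values = new ArrayList<>();
        for (int id : ids) {
            values.add(dictionary.get(id));
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> ips = List.of("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1");
        DictionarySketch encoded = encode(ips);
        System.out.println(encoded.dictionary); // [10.0.0.1, 10.0.0.2]
        System.out.println(encoded.ids);        // [0, 1, 0, 0]
    }
}
```

The ids are small non-negative integers, so they are themselves good candidates for run-length encoding or bit packing.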
Delta encoding
values (a reference value followed by two blocks of 8 values):
reference: 100
block 1: 101 101 102 101 99 99 100 99
block 2: 100 105 107 114 116 119 120 121
deltas (each value minus the previous one):
block 1: 1 0 1 -1 -2 0 1 -1
block 2: 1 5 2 7 2 3 1 1
Before encoding, each block takes 8 * 64-bit values = 64 bytes.
Delta encoding
Make the deltas >= 0 by subtracting the minimum delta of each block:
block 1: min delta = -2, deltas 1 0 1 -1 -2 0 1 -1 become 3 2 3 1 0 2 3 1
block 2: min delta = 1, deltas 1 5 2 7 2 3 1 1 become 0 4 1 6 1 2 0 0
Pack each block with the smallest bit width that fits its largest value:
block 1: maxbits = 2, 11 10 11 01 00 10 11 01 = 1110110100101101, 8 * 2 bits = 2 bytes
block 2: maxbits = 3, 000 100 001 110 001 010 000 000 = 000100001110001010000000, 8 * 3 bits = 3 bytes
result:
reference: 100
block 1: min delta -2, 2 bits, packing 1110110100101101
block 2: min delta 1, 3 bits, packing 000100001110001010000000
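The steps of the delta encoding example (compute deltas, subtract the block's minimum delta, find the bit width) can be sketched as follows. This is a simplified illustration, not the Parquet implementation: the class name is hypothetical, overflow is ignored, and the sample blocks are chosen so that the deltas, minimum deltas and bit widths match those of the example (-2 and 2 bits, then 1 and 3 bits).

```java
import java.util.Arrays;

// Illustrative sketch only: one block of delta encoding, without the final bit packing.
final class DeltaBlockSketch {
    final long minDelta;   // smallest delta in the block
    final int bitWidth;    // bits needed for the largest adjusted delta
    final long[] adjusted; // deltas minus minDelta, all >= 0

    DeltaBlockSketch(long minDelta, int bitWidth, long[] adjusted) {
        this.minDelta = minDelta;
        this.bitWidth = bitWidth;
        this.adjusted = adjusted;
    }

    /** Encodes one non-empty block; previous is the value just before the block. */
    static DeltaBlockSketch encode(long previous, long[] block) {
        long[] deltas = new long[block.length];
        long min = Long.MAX_VALUE;
        for (int i = 0; i < block.length; i++) {
            deltas[i] = block[i] - previous;
            previous = block[i];
            min = Math.min(min, deltas[i]);
        }
        long or = 0;
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] -= min; // now >= 0
            or |= deltas[i];  // the OR has the same highest set bit as the maximum
        }
        int width = 64 - Long.numberOfLeadingZeros(or);
        return new DeltaBlockSketch(min, width, deltas);
    }

    /** Rebuilds the original block; previous is the value just before the block. */
    long[] decode(long previous) {
        long[] out = new long[adjusted.length];
        for (int i = 0; i < adjusted.length; i++) {
            previous += adjusted[i] + minDelta;
            out[i] = previous;
        }
        return out;
    }

    public static void main(String[] args) {
        long reference = 100;
        long[] block1 = {101, 101, 102, 101, 99, 99, 100, 99};
        long[] block2 = {100, 105, 107, 114, 116, 119, 120, 121};
        DeltaBlockSketch b1 = encode(reference, block1);
        DeltaBlockSketch b2 = encode(block1[block1.length - 1], block2);
        System.out.println(b1.minDelta + " " + b1.bitWidth + " " + Arrays.toString(b1.adjusted));
        System.out.println(b2.minDelta + " " + b2.bitWidth + " " + Arrays.toString(b2.adjusted));
    }
}
```

Storing the minimum delta once per block keeps all packed values non-negative, so a block of small deltas needs only a few bits per value regardless of how large the original values are.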
Binary packing designed for CPU efficiency
naive maxbit (unpredictable branch!):
    int max = 0;
    for (int i = 0; i < values.length; ++i) {
      int current = maxbit(values[i]);
      if (current > max) max = current;  // taken or not depending on the data
    }
better (loop => very predictable branch):
    int orvalues = 0;
    for (int i = 0; i < values.length; ++i) {
      orvalues |= values[i];
    }
    int max = maxbit(orvalues);
even better (unrolled for a block of 32 values, no branching at all!):
    int orvalues = 0;
    orvalues |= values[0];
    …
    orvalues |= values[31];
    int max = maxbit(orvalues);
see paper: “Decoding billions of integers per second through vectorization” by Daniel Lemire and Leonid Boytsov
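A runnable version of the naive and OR-based variants, as a sketch. The slide does not define `maxbit`; here it is assumed to be the number of bits needed for a non-negative int, computed with `Integer.numberOfLeadingZeros`. The class and method names are otherwise made up for the example.

```java
// Illustrative sketch only: maxbit is an assumed helper, not taken from the slides.
final class MaxBitSketch {
    /** Number of bits needed to represent a non-negative int (0 needs 0 bits). */
    static int maxbit(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    /** Naive variant: one data-dependent branch per value. */
    static int naive(int[] values) {
        int max = 0;
        for (int i = 0; i < values.length; ++i) {
            int current = maxbit(values[i]);
            if (current > max) {
                max = current;
            }
        }
        return max;
    }

    /** OR variant: the only branch left is the loop condition. */
    static int orBased(int[] values) {
        int orValues = 0;
        for (int i = 0; i < values.length; ++i) {
            orValues |= values[i];
        }
        return maxbit(orValues);
    }

    public static void main(String[] args) {
        int[] block = {3, 2, 3, 1, 0, 2, 3, 1};
        System.out.println(naive(block));   // 2
        System.out.println(orBased(block)); // 2
    }
}
```

Both variants return the same width for non-negative inputs because the OR of all values has its highest set bit in the same position as the largest value.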
Binary unpacking designed for CPU efficiency
    int j = 0;
    for (int i = 0; i < output.length; i += 32) {
      int maxbit = input[j];                              // bit width of this block
      unpack_32_values(input, j + 1, output, i, maxbit);  // 32 values of maxbit bits each
      j += 1 + maxbit;                                    // width word + maxbit packed 32-bit words
    }
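The unpacking loop can be exercised end to end with the sketch below. It assumes the layout implied by `j += 1 + maxbit`: each block of 32 values is stored as one int holding the bit width, followed by that many packed 32-bit words. `pack32` and `unpack32` are hypothetical generic helpers written for this example; a tuned implementation would instead use one unrolled, branch-free routine per bit width.

```java
import java.util.Arrays;

// Illustrative sketch only: generic loop-based packing, not the unrolled production code.
final class BitPackingSketch {
    /** Packs 32 values of `width` bits each (0..32) into `width` ints. */
    static void pack32(int[] in, int inPos, int[] out, int outPos, int width) {
        long mask = (1L << width) - 1;
        long buffer = 0;
        int bits = 0; // number of valid bits currently in buffer
        for (int k = 0; k < 32; k++) {
            buffer |= (in[inPos + k] & mask) << bits;
            bits += width;
            if (bits >= 32) { // a full word is ready
                out[outPos++] = (int) buffer;
                buffer >>>= 32;
                bits -= 32;
            }
        }
    }

    /** Reverses pack32: reads `width` ints and writes 32 values. */
    static void unpack32(int[] in, int inPos, int[] out, int outPos, int width) {
        long mask = (1L << width) - 1;
        long buffer = 0;
        int bits = 0;
        for (int k = 0; k < 32; k++) {
            if (bits < width) { // refill with the next packed word
                buffer |= (in[inPos++] & 0xFFFFFFFFL) << bits;
                bits += 32;
            }
            out[outPos + k] = (int) (buffer & mask);
            buffer >>>= width;
            bits -= width;
        }
    }

    /** Encodes blocks of 32 values as [width, packed words...]. */
    static int[] encode(int[] values) {
        if (values.length % 32 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 32");
        }
        int[] out = new int[values.length + values.length / 32]; // worst case: 33 ints per block
        int j = 0;
        for (int i = 0; i < values.length; i += 32) {
            int or = 0;
            for (int k = 0; k < 32; k++) {
                or |= values[i + k];
            }
            int width = 32 - Integer.numberOfLeadingZeros(or);
            out[j] = width;
            pack32(values, i, out, j + 1, width);
            j += 1 + width;
        }
        return Arrays.copyOf(out, j);
    }

    /** The decoding loop from the slide; count must be a multiple of 32. */
    static int[] decode(int[] input, int count) {
        int[] output = new int[count];
        int j = 0;
        for (int i = 0; i < output.length; i += 32) {
            int maxbit = input[j];
            unpack32(input, j + 1, output, i, maxbit);
            j += 1 + maxbit;
        }
        return output;
    }

    public static void main(String[] args) {
        int[] values = new int[64];
        for (int i = 0; i < 32; i++) values[i] = i % 8;  // needs 3 bits
        for (int i = 32; i < 64; i++) values[i] = i % 2; // needs 1 bit
        int[] encoded = encode(values);
        System.out.println(encoded.length); // 6 ints instead of 64
        System.out.println(Arrays.equals(decode(encoded, values.length), values)); // true
    }
}
```

Because the width is read once per block, the decoder makes a single decision for every 32 values rather than one per value.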
Compression comparison
TPCH: compression of two 64-bit id columns with delta encoding
[Bar charts: primary key and foreign key columns; relative size from 0% to 100% for plain vs. delta encoding, each with no compression and with Snappy]
Decoding time vs Compression
[Scatter plot: decoding speed (million/second, 0 to 1400) against compression (percent saved, 0% to 100%) for Plain, Plain + Snappy and Delta]
Performance
Size comparison
TPCDS 100GB scale factor (+ Snappy unless otherwise specified)
[Bubble charts for the Lineitem and Store sales tables comparing Text uncompressed, Seq, Avro, RC, Parquet 1 and Parquet 2, plus Text + LZO in one of the charts. The area of the circle is proportional to the file size.]
Impala query performance
[Bar chart: seconds (0 to 300) per query category (Interactive, Reporting, Deep Analytics) for Text, Seq, RC, Parquet 1.0 and Parquet 2.0. TPCDS geometric mean per query category.]
10 machines: 8 cores, 48 GB of RAM, 12 disks
OS buffer cache flushed between every query
Roadmap 2.x
- C++ library: implementation of encodings
- Predicate push down: use statistics to implement filters at the metadata level
- Decimal, Timestamp logical types
Community
Thank you to our contributors
[Figure annotated with two milestones: Open Source announcement and 1.0 release]
Get involved
Mailing lists:
- dev@parquet.incubator.apache.org
Parquet sync ups:
- Regular meetings on Google Hangout
Questions
Questions.foreach( answer(_) )
@ApacheParquet

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Kürzlich hochgeladen (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014

  • 1. Efficient Data Storage for Analytics with Apache Parquet 2.0. Julien Le Dem (@J_), Processing tools tech lead, Data Platform at Twitter. Nong Li (nong@cloudera.com), Software engineer, Cloudera Impala. @ApacheParquet
  • 2. Outline: Why we need efficiency; Properties of efficient algorithms; Enabling efficiency; Efficiency in Apache Parquet.
  • 3. Why we need efficiency
  • 4. Producing a lot of data is easy. Producing a lot of derived data is even easier. Solution: Compress all the things!
  • 5. Scanning a lot of data is easy (1% completed) … but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression, but not at the cost of reading speed.
  • 6. Trying new tools is easy. (Diagram: ETL, storage, ad-hoc queries, log collection, automated dashboard, machine learning, graph processing, external datasources and schema definition, ...) We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
  • 8. Parquet design goals: Interoperability; Space efficiency; Query efficiency.
  • 9. Parquet timeline. Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats. March 2013: OSS announcement; Criteo signs on for Hive integration. July 2013: 1.0 release; 18 contributors from more than 5 organizations. May 2014: Apache Incubator; 40+ contributors, 18 with 1000+ LOC; 26 incremental releases. Parquet 2.0 coming as Apache release.
  • 11. Interoperable: model agnostic and language agnostic (Java, C++). Diagram labels: object model (Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe); converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive); assembly/striping; column encoding; Parquet file format; query execution (Impala, ...) and encoding.
  • 12. Frameworks and libraries integrated with Parquet. Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto. Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite. Data models: Avro, Thrift, ProtocolBuffers, POJOs.
  • 14. Columnar storage. Logical table representation: columns a, b, c with rows (a1 b1 c1) to (a5 b5 c5). Row layout: a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5. Column layout: a1 a2 a3 a4 a5, b1 b2 b3 b4 b5, c1 c2 c3 c4 c5, with each column stored as an encoded chunk. Also labeled on the slide: encoding, nested schema.
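The difference between the two layouts on this slide can be sketched in a few lines of Python. This is only an illustration of value ordering, not of the Parquet file format; the function names `row_layout` and `column_layout` are made up for the example.

```python
# Logical table with columns a, b, c and five rows, labeled as on the slide.
rows = [(f"a{i}", f"b{i}", f"c{i}") for i in range(1, 6)]


def row_layout(rows):
    """Store values row after row: a1 b1 c1 a2 b2 c2 ..."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Store values column after column: a1 .. a5, then b1 .. b5, then c1 .. c5."""
    return [value for column in zip(*rows) for value in column]


print(" ".join(row_layout(rows)))
print(" ".join(column_layout(rows)))
```

In the column layout all values of one column are contiguous, which is what lets each column be encoded and read on its own.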
  • 15. Parquet nested representation. Schema: Document { DocId; Links { Backward; Forward }; Name { Language { Code; Country }; Url } }. Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url. Borrowed from the Google Dremel paper. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
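As a rough sketch of how the column list follows from the schema, the Python below walks a nested dictionary and emits one dotted path per leaf field. The dictionary representation and the `leaf_columns` helper are assumptions made for this example; the repetition and definition levels that Dremel-style formats also store per column are not covered here.

```python
# The Document schema from the slide; None marks a leaf (primitive) field.
schema = {
    "DocId": None,
    "Links": {"Backward": None, "Forward": None},
    "Name": {"Language": {"Code": None, "Country": None}, "Url": None},
}


def leaf_columns(node, prefix=()):
    """Yield one dotted, lower-cased path per leaf field of a nested schema."""
    for name, child in node.items():
        path = prefix + (name.lower(),)
        if child is None:
            yield ".".join(path)
        else:
            yield from leaf_columns(child, path)


columns = list(leaf_columns(schema))
print(columns)
```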
  • 16. Statistics for filter and query optimization. Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need! (Diagram: the table with columns a, b, c and rows 1 to 5, shown once per step.)
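A minimal sketch of both push-downs, assuming a toy in-memory structure in which each row group keeps min/max statistics per column. The names `write_row_group` and `scan` are hypothetical and do not correspond to any Parquet API.

```python
def write_row_group(columns):
    """Keep each column's values together with its min/max statistics."""
    return {
        name: {"values": values, "min": min(values), "max": max(values)}
        for name, values in columns.items()
    }


def scan(row_groups, projection, column, lower, upper):
    """Return projected rows where lower <= column <= upper.

    Row groups whose [min, max] range cannot overlap [lower, upper] are
    skipped using statistics only (predicate push down). Columns outside
    `projection` and the filter column are never touched (projection push down).
    """
    result, groups_read = [], 0
    for group in row_groups:
        stats = group[column]
        if stats["max"] < lower or stats["min"] > upper:
            continue  # nothing in this group can match
        groups_read += 1
        for i, value in enumerate(stats["values"]):
            if lower <= value <= upper:
                result.append(tuple(group[name]["values"][i] for name in projection))
    return result, groups_read


row_groups = [
    write_row_group({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [0.1, 0.2, 0.3]}),
    write_row_group({"a": [10, 11, 12], "b": ["p", "q", "r"], "c": [1.0, 1.1, 1.2]}),
]
matching_rows, groups_read = scan(row_groups, projection=["b"], column="a", lower=11, upper=20)
print(matching_rows, groups_read)
```

Here the first row group is skipped from its statistics alone, and column c is never read.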
  • 18. CPU pipeline. (Diagram: instructions 1 to 4 moving through pipeline stages a, b, c, d over clock cycles 1 to 10. Two cases are shown: the ideal case, and a mis-prediction (“bubble”) that delays the instructions behind it.)
  • 19. Optimize for the processor pipeline. “Bubbles” can be caused by: ifs, loops, virtual calls, data dependency. Cost: ~12 cycles.
  • 20. Minimize CPU cache misses. A cache miss costs 10 to 100s of cycles depending on the level. (Diagram: RAM, bus, CPU, cache.)
  • 21. Encodings in Apache Parquet 2.0
  • 22. The right encoding for the right job. Delta encodings: for sorted datasets or signals where the variation is small compared with the absolute value (timestamps, auto-generated ids, metrics, …); focuses on avoiding branching. Prefix coding (delta encoding for strings): when dictionary encoding does not work. Dictionary encoding: small (60K) set of values (server IP, experiment id, …). Run Length Encoding: repetitive data.
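To make two of these encodings concrete, here is a small Python sketch of dictionary encoding followed by run-length encoding of the resulting ids. The helper names and the sample IP addresses are invented for illustration; the actual Parquet encodings use a bit-packed binary layout that is not reproduced here.

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, index, ids = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


server_ips = ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2 + ["10.0.0.1"]
dictionary, ids = dictionary_encode(server_ips)
runs = run_length_encode(ids)
print(dictionary, ids, runs)
```

Dictionary encoding turns a small set of long values into small integers, and run-length encoding then shrinks the repetitive id sequence.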
  • 23. Delta encoding. Values (two blocks of 8 * 64-bit values = 64 bytes each): reference 100; block 1: 101 101 102 101 99 99 100 99; block 2: 100 105 107 114 116 119 120 121. Deltas: block 1: 1 0 1 -1 -2 0 1 -1; block 2: 1 5 2 7 2 3 1 1.
  • 24. Delta encoding. Make deltas >= 0 by subtracting each block's min delta. Block 1 (min delta -2): 3 2 3 1 0 2 3 1; block 2 (min delta 1): 0 4 1 6 1 2 0 0.
  • 25. Delta encoding. Bit packing. Block 1: maxbits = 2, packed as 11 10 11 01 00 10 11 01 (8 * 2 bits = 2 bytes). Block 2: maxbits = 3, packed as 000 100 001 110 001 010 000 000 (8 * 3 bits = 3 bytes). Result: reference 100; block 1: min delta -2, 2 bits, packed deltas; block 2: min delta 1, 3 bits, packed deltas.
  • 26. Delta encoding (all steps together). Each block of 8 * 64-bit values (64 bytes) is reduced to its min delta, its bit width and the packed deltas: 2 bytes of packed deltas for block 1 (maxbits = 2) and 3 bytes for block 2 (maxbits = 3), after the shared reference value 100.
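The steps on the delta encoding slides (deltas, subtract the min delta, bit-pack) can be sketched as follows. This is an illustrative Python version, not the Parquet implementation: the block size of 8 mirrors the slides, the packed deltas are kept as a string of '0'/'1' characters for readability, and the input values are an assumption chosen so that their deltas match the ones shown on the slides.

```python
BLOCK_SIZE = 8  # values per block, as drawn on the slides


def encode_block(previous, block):
    """Encode one block as (min delta, bit width, packed deltas as a bit string)."""
    deltas = [b - a for a, b in zip([previous] + block[:-1], block)]
    min_delta = min(deltas)
    adjusted = [d - min_delta for d in deltas]  # all >= 0 now
    bit_width = max(adjusted).bit_length()
    packed = "".join(format(d, f"0{bit_width}b") for d in adjusted) if bit_width else ""
    return min_delta, bit_width, packed


def delta_encode(values):
    """Return the reference value and one encoded tuple per block."""
    reference, rest = values[0], values[1:]
    blocks, previous = [], reference
    for start in range(0, len(rest), BLOCK_SIZE):
        block = rest[start:start + BLOCK_SIZE]
        blocks.append(encode_block(previous, block))
        previous = block[-1]
    return reference, blocks


def delta_decode(reference, blocks, count):
    """Rebuild `count` values from the reference and the encoded blocks."""
    values = [reference]
    for min_delta, bit_width, packed in blocks:
        for i in range(min(BLOCK_SIZE, count - len(values))):
            bits = packed[i * bit_width:(i + 1) * bit_width]
            adjusted = int(bits, 2) if bit_width else 0
            values.append(values[-1] + min_delta + adjusted)
    return values


# Assumed input: the reference 100 followed by values whose deltas match the slides.
values = [100,
          101, 101, 102, 101, 99, 99, 100, 99,
          100, 105, 107, 114, 116, 119, 120, 121]
reference, blocks = delta_encode(values)
print(reference, blocks)
```

With this input the first block needs 2 bits per delta and the second 3 bits, so sixteen 64-bit values shrink to 5 bytes of packed deltas plus the per-block headers and the reference.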
  • 27. Binary packing designed for CPU efficiency. Naive maxbit (unpredictable branch!): max = 0; for (int i = 0; i < values.length; ++i) { current = maxbit(values[i]); if (current > max) max = current; } Better (loop => very predictable branch): orvalues = 0; for (int i = 0; i < values.length; ++i) { orvalues |= values[i]; } max = maxbit(orvalues); Even better (no branching at all!): orvalues = 0; orvalues |= values[0]; …; orvalues |= values[32]; max = maxbit(orvalues); See paper: “Decoding billions of integers per second through vectorization” by Daniel Lemire and Leonid Boytsov.
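The two maxbit variants from this slide compute the same result, which the following Python sketch checks. It assumes non-negative integers and takes `maxbit` to mean the number of bits needed to represent a value. Python will not show the branch-prediction benefit the slide is about; the point here is only that OR-ing all values together preserves the highest set bit.

```python
def maxbit(value):
    """Number of bits needed to represent a non-negative integer."""
    return value.bit_length()


def maxbit_naive(values):
    """Compare widths one by one: a data-dependent branch per value."""
    widest = 0
    for value in values:
        current = maxbit(value)
        if current > widest:
            widest = current
    return widest


def maxbit_or(values):
    """OR everything together, then measure once: no data-dependent branch."""
    combined = 0
    for value in values:
        combined |= value
    return maxbit(combined)


block_1 = [3, 2, 3, 1, 0, 2, 3, 1]
block_2 = [0, 4, 1, 6, 1, 2, 0, 0]
print(maxbit_or(block_1), maxbit_or(block_2))
```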
  • 28. Binary unpacking designed for CPU efficiency. int j = 0; for (int i = 0; i < output.length; i += 32) { maxbit = input[j]; unpack_32_values(input, j + 1, output, i, maxbit); j += 1 + maxbit; }
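A runnable sketch of that unpacking loop, under an assumed layout: the input is a list of 32-bit words in which every block of 32 values starts with one header word holding maxbit, followed by maxbit words of packed data (32 values * maxbit bits = maxbit words). The `pack` and `pack_32_values` helpers exist only to build test input; none of these functions is the Parquet implementation, and the value count must be a multiple of 32.

```python
BLOCK = 32               # values per block, as in the slide's loop
WORD_MASK = 0xFFFFFFFF   # one 32-bit word


def pack_32_values(values, maxbit):
    """Pack 32 non-negative ints of `maxbit` bits each into `maxbit` 32-bit words."""
    bits = 0
    for value in values:
        bits = (bits << maxbit) | value
    return [(bits >> (32 * k)) & WORD_MASK for k in reversed(range(maxbit))]


def pack(values):
    """Emit [maxbit, packed words...] for every block of 32 values."""
    words = []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        combined = 0
        for value in block:
            combined |= value
        maxbit = combined.bit_length()
        words.append(maxbit)
        words.extend(pack_32_values(block, maxbit))
    return words


def unpack_32_values(words, start, out, out_start, maxbit):
    """Read `maxbit` words from words[start:] and write 32 values into out."""
    bits = 0
    for k in range(maxbit):
        bits = (bits << 32) | words[start + k]
    mask = (1 << maxbit) - 1
    for n in range(BLOCK):
        out[out_start + n] = (bits >> (maxbit * (BLOCK - 1 - n))) & mask


def unpack(words, count):
    """The loop from the slide: one header word, then maxbit words per block."""
    out = [0] * count
    j = 0
    for i in range(0, count, BLOCK):
        maxbit = words[j]
        unpack_32_values(words, j + 1, out, i, maxbit)
        j += 1 + maxbit
    return out


values = list(range(32)) + [0] * 32 + [1000] * 32
words = pack(values)
print(len(words), unpack(words, len(values)) == values)
```

The only branch in `unpack` is the loop condition, and the position of the next block is known as soon as the header word is read.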
  • 29. Compression comparison. TPCH: compression of two 64-bit id columns with delta encoding. (Bar chart, primary key: relative size from 0% to 100% for plain vs delta, with no compression and with Snappy.)
  • 30. Compression comparison. TPCH: compression of two 64-bit id columns with delta encoding. (Bar charts, primary key and foreign key: relative size from 0% to 100% for plain vs delta, with no compression and with Snappy.)
  • 31. Decoding time vs compression. (Chart: decoding speed in million/second, 0 to 1400, against compression as percent saved, 0% to 100%, for Delta, Plain + Snappy and Plain.)
  • 33. Size comparison. TPCDS 100GB scale factor (+ Snappy unless otherwise specified). (Two charts, Store sales and Lineitem; the area of each circle is proportional to the file size. Formats shown: Text uncompressed, Seq, Avro, Text + LZO (one chart only), RC, Parquet 1, Parquet 2.)
  • 34. Impala query performance. TPCDS geometric mean per query category. (Bar chart: seconds, 0 to 300, for the Interactive, Reporting and Deep Analytics categories; formats: Text, Seq, RC, Parquet 1.0, Parquet 2.0.) Setup: 10 machines: 8 cores, 48 GB of RAM, 12 disks. OS buffer cache flushed between every query.
  • 36. Roadmap 2.x. C++ library: implementation of encodings. Predicate push down: use statistics to implement filters at the metadata level. Decimal, Timestamp logical types.
  • 38. Thank you to our contributors. (Chart markers: Open Source announcement, 1.0 release.)
  • 39. Get involved. Mailing lists: dev@parquet.incubator.apache.org. Parquet sync ups: regular meetings on Google Hangout.