ORC 2015: Faster, Better, Smaller

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Prasanth Jayachandran
Apache Hive Team, Hortonworks
@prasanth_j

Apache ORC – Optimized Row-Columnar File
§  Apache TLP – orc.apache.org
§  Type Specific Encodings
§  Came out of Apache Hive
§  Vectorized Readers (Java, C++)
§  Projection and Predicate Pushdown
§  Columnar Storage
§  Block Compression
§  Hive ACID transactions
§  Single SerDe Format
§  Protobuf Metadata Storage

ORC: Format Specification

How does ORC store data?

ORC File Layout
§  File Footer and Postscript
§  Stripes
   §  Indexes (row group indexes and Bloom filters, interleaved)
      §  Min/Max stats, positions for every 10K rows
   §  Data
      §  Multiple streams per column, encoded and compressed independently
   §  Stripe Footer
      §  Locations of streams, type of encoding
§  Full specification at (1)

ORC Writer
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
§  One tree writer per flattened column
§  Multiple streams per column
   §  PRESENT
   §  DATA
   §  LENGTH
   §  DICTIONARY_DATA
   §  SECONDARY
   §  ROW_INDEX
   §  BLOOM_FILTER
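
A minimal sketch of what "one tree writer per flattened column" could look like for the schema on this slide. It assumes columns are numbered by a pre-order walk of the type tree, with the enclosing struct as column 0. `TypeNode` and `flatten` are hypothetical names for illustration, not ORC classes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """Hypothetical schema node: a field name, a type kind, and child nodes."""
    name: str
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def flatten(root: TypeNode) -> List[Tuple[int, str, str]]:
    """Walk the type tree in pre-order and give every node its own column id."""
    out: List[Tuple[int, str, str]] = []

    def visit(node: TypeNode, path: str) -> None:
        full = f"{path}.{node.name}" if path else node.name
        out.append((len(out), full, node.kind))  # next free id = current length
        for child in node.children:
            visit(child, full)

    visit(root, "")
    return out


# The slide's schema: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
schema = TypeNode("root", "struct", [
    TypeNode("i", "int"),
    TypeNode("m", "map", [
        TypeNode("k", "string"),
        TypeNode("v", "struct", [TypeNode("s", "string"), TypeNode("d", "double")]),
    ]),
    TypeNode("t", "time"),
])

columns = flatten(schema)
for col_id, path, kind in columns:
    print(col_id, path, kind)
```

Under these assumptions the slide's schema flattens to eight columns (ids 0 to 7), and each of them would get its own set of streams.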

ORC Data Streams
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
§  Streams can be suppressed.
   §  Example: the PRESENT stream is suppressed when all values in a stripe are non-null.
[Figure: the IS_PRESENT, DATA, DICTIONARY, LENGTH and SECONDARY streams and their compression buffers]
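
A simplified sketch of stream suppression, assuming plain Python lists stand in for the encoded and compressed streams. `build_streams` and `read_values` are hypothetical helpers; the point is only that PRESENT is omitted when a stripe has no nulls and that the reader can still reconstruct the values.

```python
from typing import Dict, List, Optional


def build_streams(values: List[Optional[int]]) -> Dict[str, list]:
    """Split one stripe's column values into PRESENT and DATA streams."""
    present = [v is not None for v in values]
    streams: Dict[str, list] = {"DATA": [v for v in values if v is not None]}
    if not all(present):
        # PRESENT is only written when at least one value in the stripe is null.
        streams["PRESENT"] = present
    return streams


def read_values(streams: Dict[str, list]) -> List[Optional[int]]:
    """Rebuild the column; a missing PRESENT stream means 'no nulls'."""
    data = iter(streams["DATA"])
    present = streams.get("PRESENT")
    if present is None:
        return list(data)
    return [next(data) if is_set else None for is_set in present]


no_nulls = build_streams([1, 2, 3])
with_nulls = build_streams([1, None, 3])
print(sorted(no_nulls), sorted(with_nulls))
```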

ORC: Features Timeline

How did ORC improve over time?

Timeline
February 2013
§  Stinger Initiative Announcement*
§  Roadmap to improve Apache Hive’s performance by 100x
§  Delivered in 100% Apache Open Source
* http://hortonworks.com/blog/100x-faster-hive/
SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORC) + Distributed Execution (Apache Tez) = 100x

Timeline
March 2013
Optimized Row Columnar (ORC) file format committed to Hive
§  Hive version: 0.11
§  Native data format in Hive

Timeline
March 2013
Predicate Pushdown
§  SARG interface
§  Prune stripes and row groups based on min/max statistics
Improved Run Length Encoding
§  Tighter bit packing
§  Longer runs
§  DELTA, SHORT_REPEAT, DIRECT, PATCHED_BASE
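
The pruning step behind predicate pushdown can be sketched as follows, assuming each row group carries only a minimum and a maximum for the column. `RowGroupStats` and the predicate helpers are hypothetical names, not the SARG interface itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one row group of one column."""
    minimum: int
    maximum: int


def may_contain_equal(stats: RowGroupStats, literal: int) -> bool:
    """col = literal can only match if literal lies inside [min, max]."""
    return stats.minimum <= literal <= stats.maximum


def may_contain_less_than(stats: RowGroupStats, literal: int) -> bool:
    """col < literal can only match if the smallest value is below literal."""
    return stats.minimum < literal


def select_row_groups(index: List[RowGroupStats],
                      predicate: Callable[[RowGroupStats], bool]) -> List[int]:
    """Return the row groups that must still be read; the rest are skipped."""
    return [i for i, stats in enumerate(index) if predicate(stats)]


index = [RowGroupStats(1, 100), RowGroupStats(101, 250), RowGroupStats(240, 900)]
equal_245 = select_row_groups(index, lambda s: may_contain_equal(s, 245))
below_50 = select_row_groups(index, lambda s: may_contain_less_than(s, 50))
print(equal_245, below_50)
```

The statistics can only rule row groups out: a row group that passes the check may still contain no matching row, so the predicate is evaluated again on the rows that are read.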

Run Length Encoding Improvements
RLE (hive 0.11) vs. RLE (hive >= 0.12): compression ratio, encoding time and decoding time (in ms)

Dataset | 0.11 Ratio | 0.11 Encoding (ms) | 0.11 Decoding (ms) | >= 0.12 Ratio | >= 0.12 Encoding (ms) | >= 0.12 Decoding (ms)
Twitter Census API ID (24,556,361 records) | 2.32 | 1770 | 1263 | 6.97 | 1558 | 864
HTTP Archive (bytes.json) | 79.4 | 198 | 191 | 200.82 | 263 | 125
Github Archive (root.payload.name.txt.dict-len) | 114.05 | 21 | 15 | 260.73 | 23 | 15
AOL Querylog Epoch (36,389,577 records) | 2.51 | 553 | 364 | 3.7 | 652 | 246

Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx

Timeline
April 2013
Vectorized ORC readers
§  Read and process columns in batches of size 1024
June 2013
Null stream suppression
§  Suppress PRESENT stream if no nulls in a stripe
§  Enables fast path in vectorization
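
A rough sketch of how batches of 1024 and the no-nulls fast path fit together. `LongColumnBatch` is a hypothetical stand-in loosely modeled on a column vector with an `isNull` array and a `noNulls` flag; the real readers work on primitive arrays rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

BATCH_SIZE = 1024  # batch size named on the slide


@dataclass
class LongColumnBatch:
    """Hypothetical column batch: values, per-row null flags, and a no-nulls flag."""
    values: List[int]
    is_null: List[bool] = field(default_factory=list)
    no_nulls: bool = True


def batches(column: List[Optional[int]], size: int = BATCH_SIZE) -> Iterator[LongColumnBatch]:
    """Cut a column into batches; null flags are only kept when a batch has nulls."""
    for start in range(0, len(column), size):
        chunk = column[start:start + size]
        nulls = [v is None for v in chunk]
        has_null = any(nulls)
        yield LongColumnBatch(
            values=[0 if v is None else v for v in chunk],
            is_null=nulls if has_null else [],
            no_nulls=not has_null,
        )


def add_constant(batch: LongColumnBatch, constant: int) -> List[Optional[int]]:
    """Evaluate col + constant over a whole batch."""
    if batch.no_nulls:
        # Fast path: a tight loop with no per-row null check.
        return [v + constant for v in batch.values]
    return [None if null else v + constant
            for v, null in zip(batch.values, batch.is_null)]


dense = list(batches(list(range(2500))))
sparse = list(batches([1, None, 3]))
print([len(b.values) for b in dense], add_constant(sparse[0], 1))
```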

Timeline
October 2013
Statistics Interface
§  Writer – Update statistics during load time
§  Reader – ANALYZE TABLE .. NOSCAN
November 2013
Split Elimination
§  Stripe level column statistics
§  Eliminate stripes that do not satisfy predicate conditions

Timeline
February 2014
Zero copy read path
§  HDFS caching APIs to read directly into memory without extra data copies
June 2014
Serialization Improvements
§  Bit width alignment (trade off space for speed)
§  Unrolled bit packing and unpacking
§  Buffered double reader and writer
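
A small sketch of the space-for-speed trade-off in bit width alignment. It assumes the aligned widths are 1, 2, 4, 8, 16, 24, 32, 40, 48, 56 and 64 (the bit widths this deck benchmarks); `aligned_width`, `pack` and `unpack` are hypothetical helpers and use a generic bit packer, not ORC's unrolled routines.

```python
from typing import List

# Assumed set of aligned widths (the bit widths benchmarked in this deck).
ALIGNED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def bits_needed(value: int) -> int:
    """Bits required to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def aligned_width(bits: int) -> int:
    """Round a bit width up to the next aligned width."""
    for width in ALIGNED_WIDTHS:
        if bits <= width:
            return width
    raise ValueError("values wider than 64 bits are not supported")


def pack(values: List[int], width: int) -> bytes:
    """Pack values back to back, most significant bits first, padded to a whole byte."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = (-total_bits) % 8
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack(data: bytes, width: int, count: int) -> List[int]:
    """Inverse of pack for `count` values of `width` bits each."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    return [(acc >> (total_bits - width * (i + 1))) & mask for i in range(count)]


values = [5, 300, 7]
exact = max(bits_needed(v) for v in values)   # 9 bits
aligned = aligned_width(exact)                # rounded up to 16 bits
print(exact, aligned, len(pack(values, exact)), len(pack(values, aligned)))
```

With 9-bit values the exact packing needs 4 bytes and the aligned packing 6, but every aligned value starts on a byte boundary, which is what makes a fixed, unrolled unpacking loop per width practical.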

ORC Read Integer Performance (smaller is better)
[Chart: Mean Time (ms), axis 0 to 1800, against Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64) for hive 0.13 unpacking and hive-1.0 unpacking (new)]

ORC Read Double Performance (smaller is better)
Double Read Mode | Mean Time (ms)
hive <= 0.13 | 241.679
buffered + BE | 171.045
buffered + LE | 174.163
~1.4x improvement

Timeline
June 2014
Adaptive compression buffer size
§  For tables with more than 1000 columns, adjust the compression buffer size based on available memory
§  Avoids wide table OOMs
July 2014
Fast stripe level file merging
§  Many small files to few large files
§  No Decompression, No Decoding
§  ALTER TABLE … CONCATENATE

Fast File Merging
ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better)
CONCAT Supporting File Format | Load Time (s) | Merge Time (s) | Total Time (s)
ORC | 1091 | 245 | 1336
RCFile | 651 | 816 | 1467
~3.33x improvement in merge time

Timeline
July 2014
ORC Padding Improvements
§  Pad bytes to avoid remote HDFS reads
§  Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)
Decouple stripe size vs block size
§  Smaller stripes (64MB)
§  More stripes per block (4 per block)
§  Better parallelism & split elimination
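
A simplified sketch of the padding decision. It assumes a 256 MB HDFS block (inferred from 4 stripes of 64 MB per block) and reads the 5% worst case as a tolerance relative to the stripe size; `padding_before_stripe` is a hypothetical helper, and it only models the pad-or-not choice, not the resizing of the last stripe.

```python
MB = 1024 * 1024


def padding_before_stripe(offset: int,
                          stripe_len: int,
                          block_size: int = 256 * MB,   # assumed: 4 stripes of 64 MB
                          stripe_size: int = 64 * MB,
                          tolerance: float = 0.05) -> int:
    """Return how many pad bytes to write before the next stripe.

    Padding pushes the stripe to the start of the next block so that it never
    straddles two blocks (which could force a remote read).
    """
    remaining = block_size - (offset % block_size)
    if stripe_len <= remaining:
        return 0                      # stripe fits in the current block
    if remaining <= tolerance * stripe_size:
        return remaining              # small gap: waste it and start a new block
    return 0                          # gap too large to waste in this sketch


print(padding_before_stripe(254 * MB, 64 * MB) // MB)  # gap of 2 MB is padded
```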

Timeline
September 2014
String Dictionary Improvements
§  Row group level checking
§  Remember decision across stripes
§  Avoids expensive RBTree insertions
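
One way to picture the row group level check and the remembered decision, as a sketch only. It assumes the check compares the ratio of distinct to non-null values against a threshold of 0.8 (an assumed value) and that a column never switches back to dictionary encoding once it has fallen back to direct. `StringColumnEncoder` is a hypothetical class.

```python
from typing import List, Optional


def distinct_ratio(values: List[Optional[str]]) -> float:
    """Distinct values divided by non-null values (0.0 for an all-null sample)."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null) if non_null else 0.0


class StringColumnEncoder:
    """Decides DICTIONARY vs DIRECT from the first row group of each stripe."""

    def __init__(self, threshold: float = 0.8) -> None:  # threshold is an assumption
        self.threshold = threshold
        self.use_dictionary = True

    def encoding_for_stripe(self, first_row_group: List[Optional[str]]) -> str:
        if self.use_dictionary and distinct_ratio(first_row_group) > self.threshold:
            # Too many distinct keys: stop building a dictionary, and remember
            # that for later stripes so they skip the dictionary inserts.
            self.use_dictionary = False
        return "DICTIONARY" if self.use_dictionary else "DIRECT"


encoder = StringColumnEncoder()
first = encoder.encoding_for_stripe(["a", "b", "a", "a"])
second = encoder.encoding_for_stripe([str(i) for i in range(100)])
third = encoder.encoding_for_stripe(["a"] * 10)
print(first, second, third)
```

Checking only the first row group keeps the cost of a wrong guess to one row group of tree insertions instead of a whole stripe.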

String Dictionary Improvements
String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
Hive Version | Load Time (seconds)
hive <= 0.13 | 767
hive > 0.13 | 540
~1.4x improvement

Timeline
September 2014
Improved ZLIB compression
§  Different streams compressed with different zlib strategies/levels
§  Compress integers and doubles differently
§  Data and Dictionary stream
   - Looks for smaller byte patterns
§  All other streams
   - Less LZ77, More Huffman
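
To make the idea of per-stream zlib strategies concrete, here is a sketch using Python's `zlib` module. It assumes "Less LZ77, More Huffman" corresponds to zlib's `Z_FILTERED` strategy and that the data and dictionary streams stay on `Z_DEFAULT_STRATEGY`; the sample payloads and the `deflate` helper are made up for illustration.

```python
import struct
import zlib


def deflate(data: bytes, strategy: int, level: int = 6) -> bytes:
    """Compress with an explicit zlib strategy (and level)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                  zlib.DEF_MEM_LEVEL, strategy)
    return compressor.compress(data) + compressor.flush()


# Made-up payloads: repetitive text-like bytes and packed doubles.
text_like = b"".join(b"order-%05d;" % (i % 50) for i in range(2000))
binary_like = struct.pack("<2000d", *[i * 0.25 for i in range(2000)])

for name, payload in (("text-like", text_like), ("binary-like", binary_like)):
    default = deflate(payload, zlib.Z_DEFAULT_STRATEGY)
    filtered = deflate(payload, zlib.Z_FILTERED)
    print(name, len(payload), len(default), len(filtered))
```

Both strategies produce ordinary deflate output, so the reader needs no extra information; only the writer's choice of strategy and level differs from stream to stream.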

ZLIB Improvements
Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
File Format + Compression Codec | Data Size (GB)
ORC + (old ZLIB) | 178.5
ORC + (new ZLIB) | 172.2
ORC + SNAPPY | 225.1
New ZLIB: ~4% improvement over old ZLIB, ~1.3x smaller than SNAPPY

ZLIB Improvements
Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better)
File Format + Compression Codec | Load Time
ORC + (old ZLIB) | 674
ORC + (new ZLIB) | 433
ORC + SNAPPY | 389
New ZLIB: ~1.6x improvement over old ZLIB, only ~10% slower than SNAPPY

Timeline
September 2014
ACID transactions
§  Order of millions of rows
§  Not designed for OLTP requirements
§  Streaming Ingest via Flume or Storm
§  Atomically add base and delta directories
§  Minor compaction – Merge many delta files
§  Major compaction – Re-write base files to incorporate delta file changes
Broken pattern: Add Partitions for Atomicity

Timeline
January 2015
hasNull flag in ORC internal index
§  Better pruning of row groups
§  Improves the performance of SELECT .. WHERE column IS NULL;
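
A minimal sketch of why a hasNull flag helps: min/max statistics say nothing about nulls, so without the flag an IS NULL predicate cannot rule out any row group. `RowGroupIndexEntry` is a hypothetical structure; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupIndexEntry:
    """Hypothetical index entry: min/max of the non-null values plus a null flag."""
    minimum: str
    maximum: str
    has_null: bool


def row_groups_for_is_null(index: List[RowGroupIndexEntry]) -> List[int]:
    """Only row groups that recorded at least one null need to be read."""
    return [i for i, entry in enumerate(index) if entry.has_null]


index = [
    RowGroupIndexEntry("1992-01-02", "1993-06-30", has_null=False),
    RowGroupIndexEntry("1993-07-01", "1995-02-14", has_null=True),
    RowGroupIndexEntry("1995-02-15", "1998-12-01", has_null=False),
]
to_read = row_groups_for_is_null(index)
print(to_read)
```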

hasNull in Index Improvement
select * from lineitem where l_shipdate is null (smaller is better)
Hive Version | Execution Time (seconds)
hive < 1.1.0 | 66.73
hive >= 1.1.0 | 7.87
Bytes Read: 208.77 GB vs 539 MB
Execution Time: ~8.5x improvement

Timeline
February 2015
Bloom Filter Index
§  Much better row group pruning when compared to min/max
§  Bloom filter evaluated after the fast Min/Max based elimination
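
The two-step check can be sketched like this: the cheap min/max test runs first and the Bloom filter is consulted only for row groups that survive it. The `BloomFilter` below is a toy written for illustration (SHA-256 based double hashing, arbitrary sizes) and is not ORC's Bloom filter implementation.

```python
import hashlib
from typing import List


class BloomFilter:
    """Toy Bloom filter; sizes and hash function are illustrative assumptions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value: int) -> List[int]:
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: int) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_group_may_match(minimum: int, maximum: int,
                        bloom: BloomFilter, literal: int) -> bool:
    """Equality predicate: fast min/max elimination first, Bloom filter second."""
    if not (minimum <= literal <= maximum):
        return False
    return bloom.might_contain(literal)


group_values = [10, 500, 990]
bloom = BloomFilter()
for v in group_values:
    bloom.add(v)

print(row_group_may_match(10, 990, bloom, 500))   # a stored value is never ruled out
print(row_group_may_match(10, 990, bloom, 2000))  # outside min/max, filter not consulted
```

Min/max cannot exclude a key that falls inside the range but is absent from the row group; that is the case the Bloom filter is there to catch, at the price of occasional false positives.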

Bloom Filter Indexes Improvements
select * from tpch_1000.lineitem where l_orderkey = 1212000001; (log scale – smaller is better)
Index Type | Rows Read
No Indexes | 5,999,989,709
Min-Max Indexes | 540,000
Bloomfilter Indexes | 10,000

Bloom Filter Indexes Improvements
select * from tpch_1000.lineitem where l_orderkey=1212000001; (smaller is better)
Index Type | Time Taken (seconds)
No Indexes | 74
Min-Max Indexes | 4.5
Bloomfilter Indexes | 1.34
~16x improvement (No Indexes to Min-Max Indexes), ~3.3x improvement (Min-Max Indexes to Bloomfilter Indexes)

Timeline
April 2015
Split Strategies
§  BI – Skip reading file footer
§  ETL – Read and cache file footer
§  HYBRID – Default. Chooses BI/ETL based on number of files and average file size
§  Group splits based on columnar projection size instead of file size
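
A sketch of how a HYBRID choice between BI and ETL might be made from the number of files and the average file size. The direction of the rule (large or few files go to ETL, many small files go to BI) and both thresholds are assumptions made for illustration; `choose_split_strategy` is a hypothetical function.

```python
from typing import List

MB = 1024 * 1024


def choose_split_strategy(file_sizes: List[int],
                          mode: str = "HYBRID",
                          max_split_size: int = 256 * MB,   # assumed threshold
                          etl_file_threshold: int = 100) -> str:  # assumed threshold
    """Pick BI (skip footers) or ETL (read and cache footers)."""
    if mode in ("BI", "ETL"):
        return mode                      # explicit setting wins
    if not file_sizes:
        return "BI"                      # nothing to read footers from
    average = sum(file_sizes) / len(file_sizes)
    if average > max_split_size or len(file_sizes) <= etl_file_threshold:
        # Few or large files: reading footers is cheap relative to the gain
        # from stripe-level split elimination.
        return "ETL"
    # Many small files: reading every footer would dominate split generation.
    return "BI"


print(choose_split_strategy([1024 * MB] * 5))
print(choose_split_strategy([1 * MB] * 5000))
```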

Timeline
April 2015
ORC became Apache Top Level Project
§  C++ reader with contributions from Hortonworks, HP and Microsoft
§  Column encryption to encrypt sensitive columns
http://orc.apache.org/

ORC at Facebook
§  Saved more than 1,400 servers worth of storage. (2)
§  Compression ratio increased from 5x to 8x globally. (2)

ORC: LLAP
§  JIT Performance for short queries
§  Row-group level caching
§  Asynchronous IO Elevator
§  Multi-threaded Column Vector processing

ORC: Vectorization + SIMD
§  Allocation-free tight inner loops enable the JDK’s auto-vectorization
§  Vectors can be filtered early in ORC
§  String dictionary can be used to binary-search
§  Vectorized SIMD Join
§  Improves performance for single key joins

Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure to have hotspot disassembler in $JAVA_HOME/jre/lib
Generated Assembly:
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3
;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

AVX - Vector Addition Packed Double
4 doubles loaded to 256 bit registers

Endnotes
(1)  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification

(2)  https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

(3)  http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014

(4)  http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3
