Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. Much has changed in the more than eight years since Hive was announced to the world: Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL-on-Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response times.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with the LLAP execution engine for production use. In this webinar, we will go through the architecture of the Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about the significant improvements in modern Hive architecture and how Gluent and Hive LLAP on the Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
1. Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Tanel Poder
2. 2
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”
• Gluent Data Platform
• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
Who we are
3. 3
Enterprise Applications run on Enterprise Databases
… but traditional databases don’t cut it anymore!
[Diagram: enterprise databases struggling to handle new Big Data and IoT sources]
5. 5
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access
• SQL layer over HDFS, cloud storage (HiveQL)
• Cost-based optimizer, indexing, partitioning, etc.
• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
What is Apache Hive?
More on these later!
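As a sketch of the schema-on-read vs. schema-on-write distinction above (table and path names are hypothetical), HiveQL can first expose raw files as-is and then rewrite them into a binary, columnar format:

```sql
-- Hypothetical example: expose raw tab-delimited log files in HDFS
-- as a queryable table without converting the data (schema-on-read)
CREATE EXTERNAL TABLE web_logs (
  ts      string,
  user_id bigint,
  url     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- The same data rewritten into columnar ORC storage
-- (schema-on-write) for faster analytic queries
CREATE TABLE web_logs_orc STORED AS ORC
AS SELECT * FROM web_logs;
```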
6. 6
Apache Hive - a brief history
• 2007: Facebook created Hive, the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop
• 2008: Apache Hive incubating project created
• 2010: First Apache Hive release (v0.3)
• 2013: Hortonworks announces the Stinger initiative, promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/)
• 2013: Hive on Tez released via Hortonworks Data Platform 2.0
• 2016: Hive LLAP included in Apache Hive 2.0
• 2016: Hive LLAP included in Azure HDInsight
7. 7
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregation (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)
• Apache Tez
• Built on top of YARN
• Dataflow graph: processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory
Hive data processing engines
Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
8. 8
• YARN-based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others
• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Data is not physically stored in between tasks as in MapReduce
• Data processing is defined as a “graph”
• Vertices: the processing of data (where the query logic resides)
• Edges: movement of data between processing steps (task routing/scheduling)
Apache Tez
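One quick way to see the DAG described above is Hive's `EXPLAIN` output when Tez is the execution engine. A minimal sketch (the `customers` table is assumed from the ACID example later in this deck):

```sql
-- Run Hive queries on the Tez engine for this session
SET hive.execution.engine=tez;

-- EXPLAIN prints the Tez plan Hive builds before the job runs:
-- the vertices (Map/Reducer stages) and the edges between them
EXPLAIN
SELECT state, count(*)
FROM customers
GROUP BY state;
```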
9. 9
• Query vectorization
• Process rows in "blocks" of 1024 containing vectors of column values
• Improves the performance of scans, aggregations, filters, and joins
• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory
• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns
• ORC file format
• Columnar data compression
• Built-in "storage indexes"
Hive performance optimizations
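The optimizations listed above can be combined in one table definition. A hedged sketch (the `sales` table, its columns, and the bucket count are hypothetical; the two `SET` properties are standard Hive vectorization settings):

```sql
-- Vectorized execution: process rows in batches of 1024
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Hypothetical table combining the optimizations above:
--   partitioning: one HDFS directory per sale_date value,
--   bucketing:    customer_id hash-distributed into 32 files,
--   ORC storage:  columnar compression + built-in "storage indexes"
CREATE TABLE sales (
  customer_id bigint,
  amount      decimal(10,2)
)
PARTITIONED BY (sale_date string)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
```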
12. 12
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez); LLAP enhances the Hive execution model
• Built for fast query response times against smaller data volumes
• Allows concurrent execution of analytical workloads
• Intelligent memory caching for quick startup and data sharing
• Caches the most active data in RAM
• Shared cache across clients
• Persistent servers used to instantly execute queries
• LLAP daemons are “always on”
• Data is passed to execution as it becomes ready
Introducing Hive LLAP
15. 15
Hive data processing
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
[Diagram comparing MapReduce (result set written to disk after each operation), Tez, and Tez with LLAP (data cached in memory and shared across clients)]
16. 16
Query performance - Tez vs Tez + LLAP
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
17. 17
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
LLAP features
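The hybrid execution mode above is controlled per session. A minimal sketch, assuming a cluster where LLAP daemons are already running:

```sql
-- Route query fragments to the always-on LLAP daemons.
-- Valid modes include: none, map, auto, only, all
SET hive.llap.execution.mode=all;
SET hive.execution.engine=tez;  -- LLAP runs on top of Tez
```

With `auto`, Hive decides per operator whether to use LLAP or plain Tez containers, which is the "hybrid execution" the slide refers to.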
18. 18
Caching efficiently - LLAP’s tricks
Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered
• Decentralized: there is no centralized store of “what’s cached and where”; the cache side-steps block metadata size concerns.
• Columnar: the cache contains no dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT.
• Automatic: admins don’t need to run “cache table” as new partitions are created, and data updates are detected as well.
• Additive: when a new column or partition is used, the cache adds to itself incrementally, unlike immutable caches.
• Packed: caches data with dictionary and RLE encodings intact, to reduce footprint.
• Layered: caches ORC indexes, which trigger skips too; a scan for city = ‘San Francisco’ allows a later scan for city = ‘Los Angeles’ to use the cached index data to skip.
20. 20
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data
• MERGE support now available
• Note: Hive transactions are not OLTP!
Hive ACID - transactional operations in Hadoop
CREATE TABLE customers (
  name    string,
  address string,
  city    string,
  state   string
)
CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
Enable transactions
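To illustrate the MERGE support mentioned above, a hedged sketch against the transactional `customers` table from the slide (the `customer_updates` staging table and its `op` change-type column are hypothetical):

```sql
-- Apply a batch of staged changes (updates, deletes, and new rows)
-- to the transactional table in a single atomic statement
MERGE INTO customers t
USING customer_updates s
ON t.name = s.name
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  address = s.address, city = s.city, state = s.state
WHEN NOT MATCHED THEN INSERT
  VALUES (s.name, s.address, s.city, s.state);
```

This is the typical pattern for the slowly changing dimension and bulk update use cases listed on the slide.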
23. 23
Microsoft Azure Hadoop Stack
Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
24. 24
• Easy deployment
• Elasticity - expand or shrink resources as needed
• Launch transient services for “large” or temporary data processing
• Managed storage
• Never run out of space!
• Hardware maintenance is handled
by the cloud provider
Hadoop in the cloud
26. 26
Gluent’s transparent data virtualization
[Diagram - Before: the Application runs against the Database alone. After: “No-ETL” Data Sync and On Demand Data Access connect the Database to On Demand Compute, with no existing app code changes, new analytic tools, a much smaller footprint & cost, and additional data sources]
27. 27
• Query performance is key for Gluent’s transparent data virtualization
Gluent and Hive with LLAP