Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. Much has changed in the more than eight years since Hive was announced to the world: Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL-on-Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response times.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with the LLAP execution engine for production use. In this webinar, we will go through the architecture of the Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about the significant improvements in modern Hive architecture and how Gluent and Hive LLAP on the Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
1. Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Tanel Poder
2. 2
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”
• Gluent Data Platform
• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
Who we are
3. 3
Enterprise Applications run on Enterprise Databases
… but traditional databases don’t cut it anymore!
[Diagram: enterprise databases struggling to handle new Big Data and IoT sources]
5. 5
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access
• SQL layer over HDFS, cloud storage (HiveQL)
• Cost-based optimizer, indexing, partitioning, etc.
• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
What is Apache Hive?
More on these later!
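As a sketch of the schema-on-read vs. schema-on-write distinction above (table and path names are hypothetical), HiveQL can first expose raw files as-is and then rewrite them into a binary, columnar format:

```sql
-- Hypothetical example: expose raw tab-delimited log files in HDFS
-- as a queryable table without converting the data (schema-on-read)
CREATE EXTERNAL TABLE web_logs (
  ts      string,
  user_id bigint,
  url     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- The same data rewritten into columnar ORC storage
-- (schema-on-write) for faster analytic queries
CREATE TABLE web_logs_orc STORED AS ORC
AS SELECT * FROM web_logs;
```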
6. 6
Apache Hive - a brief history
• 2007: Facebook created Hive, the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop
• 2008: Apache Hive incubating project created
• 2010: First Apache Hive release (v0.3)
• 2013: Hortonworks announces the Stinger initiative, promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/)
• 2013: Hive on Tez released via Hortonworks Data Platform 2.0
• 2016: Hive LLAP included in Apache Hive 2.0
• 2016: Hive LLAP included in Azure HDInsight
7. 7
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregation (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)
• Apache Tez
• Built on top of YARN
• Dataflow graph: processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory
Hive data processing engines
Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
8. 8
• YARN-based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others
• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Data is not physically stored in between tasks as in MapReduce
• Data processing is defined as a “graph”
• Vertices: the processing of data (where the query logic resides)
• Edges: movement of data between processing steps (task routing/scheduling)
Apache Tez
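One quick way to see the DAG described above is Hive's `EXPLAIN` output when Tez is the execution engine. A minimal sketch (the `customers` table is assumed from the ACID example later in this deck):

```sql
-- Run Hive queries on the Tez engine for this session
SET hive.execution.engine=tez;

-- EXPLAIN prints the Tez plan Hive builds before the job runs:
-- the vertices (Map/Reducer stages) and the edges between them
EXPLAIN
SELECT state, count(*)
FROM customers
GROUP BY state;
```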
9. 9
• Query vectorization
• Process rows in "blocks" of 1024 containing vectors of column values
• Improves the performance of scans, aggregations, filters, and joins
• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory
• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns
• ORC file format
• Columnar data compression
• Built-in "storage indexes"
Hive performance optimizations
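The optimizations listed above can be combined in one table definition. A hedged sketch (the `sales` table, its columns, and the bucket count are hypothetical; the two `SET` properties are standard Hive vectorization settings):

```sql
-- Vectorized execution: process rows in batches of 1024
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

-- Hypothetical table combining the optimizations above:
--   partitioning: one HDFS directory per sale_date value,
--   bucketing:    customer_id hash-distributed into 32 files,
--   ORC storage:  columnar compression + built-in "storage indexes"
CREATE TABLE sales (
  customer_id bigint,
  amount      decimal(10,2)
)
PARTITIONED BY (sale_date string)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
```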
12. 12
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez); LLAP enhances the Hive execution model
• Built for fast query response times against smaller data volumes
• Allows concurrent execution of analytical workloads
• Intelligent memory caching for quick startup and data sharing
• Caches the most active data in RAM
• Shared cache across clients
• Persistent servers used to instantly execute queries
• LLAP daemons are “always on”
• Data is passed to execution as it becomes ready
Introducing Hive LLAP
15. 15
Hive data processing
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
[Diagram comparing MapReduce (result set written to disk after each operation), Tez, and Tez with LLAP (data cached in memory and shared across clients)]
16. 16
Query performance - Tez vs Tez + LLAP
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
17. 17
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
LLAP features
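The hybrid execution mode above is controlled per session. A minimal sketch, assuming a cluster where LLAP daemons are already running:

```sql
-- Route query fragments to the always-on LLAP daemons.
-- Valid modes include: none, map, auto, only, all
SET hive.llap.execution.mode=all;
SET hive.execution.engine=tez;  -- LLAP runs on top of Tez
```

With `auto`, Hive decides per operator whether to use LLAP or plain Tez containers, which is the "hybrid execution" the slide refers to.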
18. 18
Caching efficiently - LLAP’s tricks
Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered
• Decentralized: there is no centralized store of “what’s cached and where”; the cache side-steps block metadata size concerns.
• Columnar: the cache contains no dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT.
• Automatic: admins don’t need to run “cache table” as new partitions are created, and data updates are detected as well.
• Additive: when a new column or partition is used, the cache adds to itself incrementally, unlike immutable caches.
• Packed: caches data with dictionary and RLE encodings intact, to reduce footprint.
• Layered: caches ORC indexes, which trigger skips too; a scan for city = ‘San Francisco’ allows a later scan for city = ‘Los Angeles’ to use the cached index data to skip.
20. 20
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data
• MERGE support now available
• Note: Hive transactions are not OLTP!
Hive ACID - transactional operations in Hadoop
CREATE TABLE customers (
  name    string,
  address string,
  city    string,
  state   string
)
CLUSTERED BY (name) INTO 10 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
Enable transactions
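To illustrate the MERGE support mentioned above, a hedged sketch against the transactional `customers` table from the slide (the `customer_updates` staging table and its `op` change-type column are hypothetical):

```sql
-- Apply a batch of staged changes (updates, deletes, and new rows)
-- to the transactional table in a single atomic statement
MERGE INTO customers t
USING customer_updates s
ON t.name = s.name
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  address = s.address, city = s.city, state = s.state
WHEN NOT MATCHED THEN INSERT
  VALUES (s.name, s.address, s.city, s.state);
```

This is the typical pattern for the slowly changing dimension and bulk update use cases listed on the slide.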
23. 23
Microsoft Azure Hadoop Stack
Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
24. 24
• Easy deployment
• Elasticity - expand or shrink resources as needed
• Launch transient services for “large” or temporary data processing
• Managed storage
• Never run out of space!
• Hardware maintenance is handled
by the cloud provider
Hadoop in the cloud
26. 26
Gluent’s transparent data virtualization
[Diagram - Before: the Application runs against the Database alone. After: “No-ETL” Data Sync and On Demand Data Access connect the Database to On Demand Compute, with no existing app code changes, new analytic tools, a much smaller footprint & cost, and additional data sources]
27. 27
• Query performance is key for Gluent’s transparent data virtualization
Gluent and Hive with LLAP