Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Tanel Poder
2
• Enterprise database & performance background (Oracle focused)
• “All enterprise data can just be a query away”
• Gluent Data Platform
• Supports all major Hadoop distributions, on-premises or in the cloud
• Consolidates data into a centralized location in open data formats
• Transparent Data Virtualization provides simple data sharing across the enterprise
Who we are
3
… but traditional databases don’t cut it anymore!
[Diagram: Big Data, IoT]
Enterprise Applications run on Enterprise Databases
4
Gluent Data Virtualization
5
• An open source big data warehouse system on Hadoop
• Metadata + table structure + data access
• SQL layer over HDFS, cloud storage (HiveQL)
• Cost based optimizer, indexing, partitions, etc
• Used for:
• Access to huge datasets
• Parse large text files, log files, JSON (schema-on-read)
• Binary, columnar storage (schema-on-write)
• Very large queries (running for hours, days)
• Enterprise Data Warehouse offload
• Integration with Business Intelligence tools (fast, interactive queries)
• Insert, Update, Delete, Merge data in Hadoop
What is Apache Hive?
More on these later!
6
Apache Hive - a brief history
2007: Facebook created the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop, called Hive
2008: Apache Hive incubating project created
2010: Apache Hive first release (v0.3)
2013: Hortonworks announces the Stinger initiative - promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/)
2013: Hive on Tez released via Hortonworks Data Platform 2.0
2016: Hive LLAP included in Apache Hive 2.0
2016: Hive LLAP included in Azure HDInsight
7
• MapReduce
• Original data processing framework for Hadoop
• Map: filtering, sorting, etc.
• Reduce: aggregate (sum, count, etc.)
• Each Map + Reduce intermediate result is written to disk (I/O intensive)
• Apache Tez
• Built on top of YARN
• Dataflow graph - processing steps defined before the job begins
• Low latency, high throughput
• Intermediate results transferred via memory
Hive data processing engines
Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
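As a rough illustration of the two engines above, the execution engine can be switched per session with the `hive.execution.engine` property. This is a minimal sketch: it reuses the `customers` table defined later in this deck, and whether a given engine is available depends on how the cluster was installed.

```sql
-- Run the session on classic MapReduce (intermediate results go to disk).
SET hive.execution.engine=mr;

-- Switch the same session to Apache Tez (one DAG per query).
SET hive.execution.engine=tez;

-- EXPLAIN prints the query plan; under Tez it is shown as
-- vertices (for example Map 1, Reducer 2) connected by edges.
EXPLAIN
SELECT state, COUNT(*) AS customer_count
FROM customers
GROUP BY state;
```

Comparing the EXPLAIN output under each engine is a quick way to see how the same query is broken into stages or vertices.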
8
• YARN based framework for data processing applications in Hadoop
• Used by Apache Hive, Apache Pig, and others
• Can execute a complex Directed Acyclic Graph (DAG) when processing data
• Any given SQL query can be expressed as a single job
• Data is not physically stored in between tasks as in MapReduce
• Data processing defined as a “graph”
• Vertices - the processing of data (where the query logic resides)
• Edges - movement of data in-between processing (task routing/scheduling)
Apache Tez
9
• Query vectorization
• Process rows in "blocks" of 1024 containing vectors of column values
• Improves operations of scans, aggregations, filters, and joins
• Partitioning
• Reduce the amount of data read to improve I/O
• Each partition becomes a directory
• Bucketing
• Similar to hash subpartitioning
• Each bucket becomes a file
• Best for high cardinality columns
• ORC file format
• Columnar data compression
• Built-in "storage indexes"
Hive performance optimizations
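The optimizations on this slide can be combined in a single table definition. The sketch below is illustrative only: the table `page_views`, its columns, the bucket count, and the date literal are all hypothetical, and sensible values depend on data volume and Hive version.

```sql
-- Enable vectorized execution for this session (rows are processed in batches).
SET hive.vectorized.execution.enabled=true;

-- Hypothetical table combining partitioning, bucketing and ORC.
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
PARTITIONED BY (view_date STRING)        -- one directory per view_date value
CLUSTERED BY (user_id) INTO 32 BUCKETS   -- one file per bucket; user_id is high cardinality
STORED AS ORC                            -- columnar, compressed, with built-in indexes
TBLPROPERTIES ('orc.compress'='ZLIB');

-- Filtering on the partition column lets Hive read only the matching directory.
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2018-03-01'
GROUP BY url;
```

The partition column is chosen to match the most common filter, while the bucketing column is a high-cardinality key that is often used in joins.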
10
Fast, sub-second query response time!
Hive on Tez is great, but what is missing?
11
Introducing Hive LLAP
Now called Hive Interactive Query
12
• “Live Long and Process” or “Low Latency Analytical Processing”
• Not an execution engine (like Tez), LLAP simply enhances the Hive execution model
• Built for fast query response time against smaller data volumes
• Allows concurrent execution of analytical workloads
• Intelligent memory caching for quick startup and data sharing
• Caches most active data in RAM
• Shared cache across clients
• Persistent server used to instantly execute queries
• LLAP daemons are “always on”
• Data passed to execution as it becomes ready
Introducing Hive LLAP
13
LLAP Daemons: HiveServer2 Interactive
14
[Diagram: Hive 2 with LLAP, architecture overview (© Hortonworks Inc. 2011–2016). Components shown: ODBC/JDBC clients sending SQL queries; HiveServer2 (query endpoint); query coordinators; LLAP daemons with query executors and an in-memory cache shared across all users, running in a YARN cluster; deep storage (HDFS and compatible stores: S3, WASB, Isilon). Callout: persistent daemon.]
LLAP architecture
Source: https://www.slideshare.net/Hadoop_Summit/an-apache-hive-based-data-warehouse-80225129
15
Hive data processing
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
[Figure: MapReduce vs. Tez vs. Tez with LLAP. Callouts: "Write resultset to disk after each operation" (MapReduce); "Data cached in-memory & shared across clients" (Tez with LLAP).]
16
Query performance - Tez vs Tez + LLAP
Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
17
• Persistent columnar caching
• Data, metadata, and indexes are cached in-memory
• Clients share cached data for faster processing (less I/O, less CPU)
• Query fragments
• Higher priority queries can pre-empt other queries
• Fragmenting allows lower priority queries to continue, even if pre-empted
• Smarter map joins
• Build the hash table once and cache it in-memory for sharing with other processes
• Hybrid execution
• Hive queries can run in LLAP, Tez, or a hybrid of both
• Multi-threaded processing
• Data reads, decoding, and processing executed on separate threads
• Dynamic runtime filtering
• Bloom filter automatically built to eliminate rows that cannot match
LLAP features
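The hybrid execution bullet above is controlled through session properties. The following is a sketch under the assumption of a Hive 2.x cluster with LLAP daemons already running; property defaults and accepted values can differ between versions and distributions, so treat the values as examples rather than recommendations.

```sql
-- LLAP runs on top of Tez, so Tez stays the execution engine.
SET hive.execution.engine=tez;

-- Run query fragments in the long-lived LLAP daemons
-- instead of per-query containers.
SET hive.execution.mode=llap;

-- How much of each query is attempted in LLAP:
--   none = do not use LLAP, all = try every operator in LLAP,
--   auto = let Hive decide per query.
SET hive.llap.execution.mode=all;

-- Example query against the customers table defined later in this deck.
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city;
```

Setting `hive.llap.execution.mode` back to `none` in the same session gives a simple before/after comparison for a given query.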
18
Caching efficiently - LLAP’s tricks
Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered
• Decentralized: There is no centralized store of “what’s cached and where” - the cache side-steps the block metadata size concerns.
• Columnar: The cache does not contain any dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT.
• Automatic: Admins don’t need to run “cache table”, including for new partitions as they are created. Data updates are detected as well.
• Additive: When a new column or partition is used, the cache adds to itself incrementally - unlike immutable caches.
• Packed: Caches data with intact dictionary and RLE encodings, to reduce footprint.
• Layered: Caches ORC indexes which trigger skips too - a scan for city = ‘San Francisco’ allows city = ‘Los Angeles’ to use cached index data to skip.
19
LLAP demo
20
• Transactions on data stored in HDFS (no longer just INSERT!)
• Uses base files and delta files where insert, update, and delete operations are recorded
• Useful for
• Slowly changing dimensions
• Data corrections
• Bulk updates
• Streaming ingest of data
• MERGE support now available
• Note: Hive transactions are not OLTP!
Hive ACID - transactional operations in Hadoop
CREATE TABLE customers (
  name    STRING,
  address STRING,
  city    STRING,
  state   STRING
)
CLUSTERED BY (name) INTO 10 BUCKETS       -- ACID tables are bucketed in Hive 1.x/2.x
STORED AS ORC                             -- ACID tables are stored as ORC
TBLPROPERTIES ('transactional'='true');   -- enable transactions
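With the transactional `customers` table in place, the operations listed on this slide might look like the sketch below. The sample values and the staging table `customer_updates` are hypothetical, and the two session settings are the ones commonly needed on Hive 1.x/2.x clusters (some distributions set them cluster-wide). MERGE assumes Hive 2.2 or later.

```sql
-- Session settings typically required for ACID operations.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Each statement below is recorded in delta files next to the base files.
INSERT INTO customers VALUES ('Alice', '1 Main St', 'Austin', 'TX');

-- The bucketing column (name) cannot be updated; other columns can.
UPDATE customers SET city = 'Dallas' WHERE name = 'Alice';

DELETE FROM customers WHERE state = 'CA';

-- Upsert from a hypothetical staging table with the same four columns.
MERGE INTO customers AS t
USING customer_updates AS s
ON t.name = s.name
WHEN MATCHED THEN UPDATE SET address = s.address, city = s.city, state = s.state
WHEN NOT MATCHED THEN INSERT VALUES (s.name, s.address, s.city, s.state);
```

Batching changes into a staging table and applying them with one MERGE fits the bulk-update and slowly-changing-dimension use cases better than many single-row statements.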
21
Hive roadmap
Source: https://hortonworks.com/apache/hive/#section_3
22
Azure HDInsight
Hive LLAP in the cloud
23
Microsoft Azure Hadoop Stack
Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
24
• Easy deployment
• Elasticity - expand or shrink resources as needed
• Launch transient services for “large” or temporary data processing
• Managed storage
• Never run out of space!
• Hardware maintenance is handled by the cloud provider
Hadoop in the cloud
25
Hive LLAP performance on HDInsight
Source: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/
[Bar chart: Total Time (s), axis 0 to 3000, for LLAP cached (Text), LLAP uncached (Text), LLAP cached (ORC), LLAP uncached (ORC), Presto (ORC), and Spark (Parquet). Bar values as extracted, with the mapping to series not recoverable: 1478, 1878, 1061, 2216, 2416, 1503.]
26
[Diagram: Before: Application, Database. After: Application, Database, with “No-ETL” Data Sync, On Demand Data Access, and On Demand Compute. Callouts: No existing app code changes! New analytic tools. Much smaller footprint & cost. Additional data sources.]
Gluent’s transparent data virtualization
27
• Query performance is key for Gluent’s transparent data virtualization
Gluent and Hive with LLAP
28
Gluent + LLAP Demo
Thank you!
info@gluent.com
gluent.com
@gluent

  • 2. Who we are • Enterprise database & performance background (Oracle focused) • “All enterprise data can just be a query away” • Gluent Data Platform • Supports all major Hadoop distributions, on-premises or in the cloud • Consolidates data into a centralized location in open data formats • Transparent Data Virtualization provides simple data sharing across the enterprise
  • 3. Enterprise Applications run on Enterprise Databases … but traditional databases don’t cut it anymore! (Diagram labels: Big Data, IoT)
  • 5. What is Apache Hive? • An open source big data warehouse system on Hadoop • Metadata + table structure + data access • SQL layer over HDFS, cloud storage (HiveQL) • Cost based optimizer, indexing, partitions, etc • Used for: • Access to huge datasets • Parse large text files, log files, JSON (schema-on-read) • Binary, columnar storage (schema-on-write) • Very large queries (running for hours, days) • Enterprise Data Warehouse offload • Integration with Business Intelligence tools (fast, interactive queries) • Insert, Update, Delete, Merge data in Hadoop • More on these later!
  • 6. Apache Hive - a brief history • 2007: Facebook created the first SQL abstraction layer for writing MapReduce Java code to access data in Hadoop called Hive • 2008: Apache Hive incubating project created • 2010: Apache Hive first release (v0.3) • 2013: Hortonworks announces the Stinger initiative - promising 100x faster Hive (https://hortonworks.com/blog/100x-faster-hive/) • 2013: Hive on Tez released via Hortonworks Data Platform 2.0 • 2016: Hive LLAP included in Apache Hive 2.0 • 2016: Hive LLAP included in Azure HDInsight
  • 7. Hive data processing engines • MapReduce • Original data processing framework for Hadoop • Map: filtering, sorting, etc • Reduce: aggregate (sum, count, etc) • Each Map + Reduce intermediate result is written to disk (I/O intensive) • Apache Tez • Built on top of YARN • Dataflow graph - processing steps defined before the job begins • Low latency, high throughput • Intermediate results transferred via memory • Source: https://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
  • 8. Apache Tez • YARN based framework for data processing applications in Hadoop • Used by Apache Hive, Apache Pig, and others • Can execute a complex Directed Acyclic Graph (DAG) when processing data • Any given SQL query can be expressed as a single job • Data is not physically stored in between tasks as in MapReduce • Data processing defined as a “graph” • Vertices - the processing of data (where the query logic resides) • Edges - movement of data in-between processing (task routing/scheduling)
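One way to see the vertices and edges described on the Apache Tez slide is to ask Hive for a query plan while Tez is the execution engine. The following is a minimal sketch, assuming a hypothetical table web_logs with a status column; neither name comes from the slides.

```sql
-- Assumption: web_logs and its status column are hypothetical, for illustration only.
-- Run the session on Tez rather than MapReduce.
SET hive.execution.engine=tez;

-- EXPLAIN prints the plan instead of running the query.
EXPLAIN
SELECT status, COUNT(*) AS requests
FROM web_logs
GROUP BY status;
```

With Tez as the engine, the plan output typically lists the DAG's vertices (for example a map vertex and a reducer vertex) and the edges that connect them, so the whole query appears as a single job.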
  • 9. Hive performance optimizations • Query vectorization • Process rows in "blocks" of 1024 containing vectors of column values • Improves operations of scans, aggregations, filters, and joins • Partitioning • Reduce the amount of data read to improve I/O • Each partition becomes a directory • Bucketing • Similar to hash subpartitioning • Each bucket becomes a file • Best for high cardinality columns • ORC file format • Columnar data compression • Built-in "storage indexes"
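The optimizations on this slide can be combined in a single table definition. The sketch below is illustrative only: the sales table, its columns, the bucket count, and the compression codec are assumptions, not taken from the slides.

```sql
-- Assumption: table name, columns, bucket count and codec are hypothetical.
-- Enable vectorized execution (rows processed in batches of column vectors).
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(12,2)
)
-- Each distinct sale_date value becomes its own directory under the table path.
PARTITIONED BY (sale_date STRING)
-- Rows are hashed on a high-cardinality column; each bucket becomes a file.
CLUSTERED BY (customer_id) INTO 16 BUCKETS
-- Columnar, compressed storage with built-in min/max indexes.
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- The partition predicate lets Hive read only one partition directory.
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
WHERE sale_date = '2017-06-01'
GROUP BY customer_id;
```

Partitioning on a low-cardinality column such as a date and bucketing on a high-cardinality column such as a customer key follows the guidance in the bullets above; the right bucket count depends on data volume.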
  • 10. Hive on Tez is great, but what is missing? Fast, sub-second query response time!
  • 11. Introducing Hive LLAP (now called Hive Interactive Query)
  • 12. Introducing Hive LLAP • “Live Long and Process” or “Low Latency Analytical Processing” • Not an execution engine (like Tez), LLAP simply enhances the Hive execution model • Built for fast query response time against smaller data volumes • Allows concurrent execution of analytical workloads • Intelligent memory caching for quick startup and data sharing • Caches most active data in RAM • Shared cache across clients • Persistent server used to instantly execute queries • LLAP daemons are “always on” • Data passed to execution as it becomes ready
  • 15. Hive data processing (diagram comparing MapReduce, Tez, and Tez with LLAP) • Write resultset to disk after each operation • Data cached in-memory & shared across clients • Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
  • 16. Query performance - Tez vs Tez + LLAP • Source: https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
  • 17. LLAP features • Persistent columnar caching • Data, metadata, and indexes are cached in-memory • Clients share cached data for faster processing (less I/O, less CPU) • Query fragments • Higher priority queries can pre-empt other queries • Fragmenting allows lower priority queries to continue, even if pre-empted • Smarter map joins • Build the hash table once and cache it in-memory for sharing with other processes • Hybrid execution • Hive queries can run in LLAP, Tez, or a hybrid of both • Multi-threaded processing • Data reads, decoding, and processing executed on separate threads • Dynamic runtime filtering • Bloom filter automatically built to eliminate rows that cannot match
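Several of these features are controlled by session settings. The sketch below assumes a cluster where LLAP daemons are already running; property names, accepted values, and defaults can differ between Hive versions and distributions, so treat it as a starting point rather than a definitive configuration.

```sql
-- Assumption: LLAP daemons are already deployed and reachable from this session.
-- LLAP builds on Tez, so Tez remains the execution engine.
SET hive.execution.engine=tez;

-- Send query fragments to the LLAP daemons instead of plain Tez containers.
SET hive.execution.mode=llap;

-- How much of each query runs inside LLAP; 'all' attempts to run every fragment there.
-- Other commonly documented values include none, auto, map and only.
SET hive.llap.execution.mode=all;

-- Let the optimizer convert eligible joins to map joins.
SET hive.auto.convert.join=true;

-- Dynamic runtime filtering: build a Bloom filter from one side of a join
-- to skip rows on the other side that cannot match.
SET hive.tez.dynamic.semijoin.reduction=true;
```

Switching hive.llap.execution.mode between values such as all and none is one way to compare LLAP against plain Tez for the same query.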
  • 18. Caching efficiently - LLAP’s tricks. LLAP’s cache is decentralized, columnar, automatic, additive, packed, and layered. • Decentralized: There is no centralized store of “what’s cached and where” - the cache side-steps the block metadata size concerns. • Columnar: The cache does not contain any dead columns. If you run TPC-H with LLAP, you’ll notice it never caches billions of values in L_COMMENT. • Automatic: Admins don’t need to run “cache table” or new partitions as they are created. Data updates are detected as well. • Additive: When a new column or partition is used, the cache adds to itself incrementally - unlike immutable caches. • Packed: Caches data with intact dictionary and RLE encodings, to reduce footprint. • Layered: Caches ORC indexes which trigger skips too - a scan for city = ‘San Francisco’, allows city = ‘Los Angeles’ to use cached index data to skip. • Source: https://www.slideshare.net/t3rmin4t0r/llap-locality-is-dead/13
  • 20. Hive ACID - transactional operations in Hadoop • Transactions on data stored in HDFS (no longer just INSERT!) • Uses base files and delta files where insert, update, and delete operations are recorded • Useful for • Slowly changing dimensions • Data corrections • Bulk updates • Streaming ingest of data • MERGE support now available • Note: Hive transactions are not OLTP! • Enable transactions: CREATE TABLE customers ( name string, address string, city string, state string ) CLUSTERED BY (name) INTO 10 BUCKETS STORED AS ORC TBLPROPERTIES('transactional'='true');
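To illustrate the MERGE support mentioned above, here is a minimal sketch that applies changes to the transactional customers table from this slide. The staging table customer_updates is hypothetical and is assumed to have the same four columns. The two SET statements reflect commonly documented client-side ACID prerequisites, and the exact requirements vary by Hive version.

```sql
-- Assumption: customer_updates is a hypothetical staging table with
-- name, address, city and state columns matching the customers table.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

MERGE INTO customers AS t
USING customer_updates AS s
ON t.name = s.name
-- Existing customer: record the change as an update.
-- The bucketing column (name) is not updated.
WHEN MATCHED THEN
  UPDATE SET address = s.address, city = s.city, state = s.state
-- New customer: insert a row.
WHEN NOT MATCHED THEN
  INSERT VALUES (s.name, s.address, s.city, s.state);
```

A statement like this suits the slowly changing dimension and bulk update cases listed on the slide. It is a batch operation, in keeping with the note that Hive transactions are not OLTP.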
  • 23. Microsoft Azure Hadoop Stack • Source: https://f.ch9.ms/public/MLDS2016/OptimizingApacheHivePerformanceHDInsight.pptx
  • 24. Hadoop in the cloud • Easy deployment • Elasticity - expand or shrink resources as needed • Launch transient services for “large” or temporary data processing • Managed storage • Never run out of space! • Hardware maintenance is handled by the cloud provider
  • 25. Hive LLAP performance on HDInsight • Source: https://azure.microsoft.com/en-us/blog/hdinsight-interactive-query-performance-benchmarks-and-integration-with-power-bi-direct-query/ • [Bar charts of total time (s) for LLAP cached (ORC), LLAP uncached (ORC), LLAP cached (Text), LLAP uncached (Text), Presto (ORC), and Spark (Parquet); bar labels in extraction order: 1478, 1878, 1061, 2216, 2416, 1503]
  • 26. Gluent’s transparent data virtualization (before/after diagram) • Before: Application, Database • After: Application, Database, “No-ETL” Data Sync, On Demand Data Access, On Demand Compute, New analytic tools, Additional data sources • No existing app code changes! • Much smaller footprint & cost
  • 27. Gluent and Hive with LLAP • Query performance is key for Gluent’s transparent data virtualization