What's new in Apache Hive
1. © Hortonworks Inc. 2011–2018. All rights reserved.
What is new in Apache Hive?
Ashutosh Chauhan
2.
Apache Hive – Distant Past – First Five Years
• Initial use case: batch processing
• Circa 2008
• Read-only data
• MapReduce
• HiveQL
3.
Apache Hive – Past 5 Years
• Effort to take Hive beyond its batch processing roots
• Started in Apache Hive 0.10.0 (January 2013)
• Latest released version: Apache Hive 3.0 (May 2018)
• Extensive renovation along four different axes
• Runtime: enable sub-second queries (LLAP)
• Compiler: cost-based optimizer
• SQL support: improved coverage of SQL syntax
• Transactional support: ACID
4.
Hive – Today
• Comprehensive ANSI SQL including all TPC-DS Queries.
• The only Hadoop SQL with ACID MERGE for easy updates.
• In-Memory caching for MPP performance at Hadoop scale.
• Enables Per-User dynamic row and column security.
• Enables Replication and DR for critical workloads.
• Compatible with every major BI Tool.
• Proven at 300+ PB Scale.
5.
Apache Hive: Fast Facts
• Most queries per hour: 100,000
• Analytics performance: 100 million rows/s per node
• Largest Hive warehouse: 300+ PB raw storage
• Largest cluster: 4,500+ nodes
6.
Hive: Serving ETL Workloads to BI Systems
[Diagram: BI systems served by materialized views, improved stats, constraints, query result cache, workload management, and ACID v2]
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
• Allows more queries to be run in parallel
• Reduce resource starvation in large clusters
• Also: Active/Passive HA
• More “tools” for optimizer to use
• More “tools” for DBAs to tune/optimize
• Invisible tuning of DB from users’ perspective
• ACID v2 is as fast as regular tables
7.
• SIGMOD Software Systems Award
• “For developing seminal software systems that served to bring relational-style
declarative programming to the Hadoop ecosystem.”
• Past recipients include Postgres, SQLite and MonetDB
8.
Hive – How Did We Get Here?
• LLAP Enhancements
• CBO Enhancements
• ACID Enhancements
9.
Materialized Views in Hive
10.
Accelerating Query Processing
• Change data physical properties (distribute, sort)
• Filter rows
• Denormalize
• Preaggregate
Optimization based on access patterns
11.
Materialized Views to the Rescue
• Speed up aggregates and joins via MVs
• View navigation via CBO/Calcite
• Optionally allow rewrites against out-of-date materializations
12.
Materialized Views in Hive 3
• Multiple storage options: Hive, Druid
• Multiple options to control the materialized view lifecycle
13.
Materialized View-based Rewriting
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
  lo_revenue,
  lo_extendedprice * lo_discount AS d_price,
  lo_revenue - lo_supplycost
FROM
  customer, dates, lineorder, part, supplier
WHERE
  lo_orderdate = d_datekey
  and lo_partkey = p_partkey
  and lo_suppkey = s_suppkey
  and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice*lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
[Diagram: lineorder joined with customer, dates, part and supplier produces the mv contents, from which the query results are computed]
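The rewrite on this slide can be checked on toy data. The following is a minimal sketch, not Hive itself: it uses Python's built-in sqlite3 module, emulates the materialized view as a plain table (SQLite has no materialized views), and keeps only two of the dimension tables. The sample rows and the `c_city` column are made up for illustration. It also assumes every lineorder row matches exactly one row in each dimension table, which is what makes answering the query from the wider join safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_city TEXT);
CREATE TABLE lineorder (
    lo_orderdate INTEGER, lo_custkey INTEGER,
    lo_extendedprice INTEGER, lo_discount INTEGER
);
INSERT INTO dates VALUES (20130101, 2013), (20140101, 2014);
INSERT INTO customer VALUES (1, 'Oslo'), (2, 'Lima');
INSERT INTO lineorder VALUES
    (20130101, 1, 100, 2),
    (20130101, 2, 200, 3),
    (20130101, 1, 300, 5),   -- discount outside 1..3
    (20140101, 2, 400, 1);   -- wrong year

-- Stand-in for the materialized view: a denormalized, precomputed table.
CREATE TABLE mv AS
SELECT d_year, c_city, lo_discount,
       lo_extendedprice * lo_discount AS d_price
FROM lineorder, dates, customer
WHERE lo_orderdate = d_datekey AND lo_custkey = c_custkey;
""")

# Original query against the base tables.
original = cur.execute("""
    SELECT SUM(lo_extendedprice * lo_discount)
    FROM lineorder, dates
    WHERE lo_orderdate = d_datekey
      AND d_year = 2013
      AND lo_discount BETWEEN 1 AND 3
""").fetchone()[0]

# Rewritten query that reads only the precomputed table.
rewritten = cur.execute("""
    SELECT SUM(d_price) FROM mv
    WHERE d_year = 2013 AND lo_discount BETWEEN 1 AND 3
""").fetchone()[0]

print(original, rewritten)
```

Both queries return the same sum. In Hive the optimizer performs this substitution automatically; here the two queries are simply written out by hand.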
14.
Rebuilding Materialized Views
• ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
• Incremental materialized view maintenance
• Only refresh data that has changed in source tables
15.
Accelerating Query Processing with
Materialized Views in Apache Hive
Jesus Camacho Rodriguez
Tuesday, June 19
2:50 PM - 3:30 PM
Executive Ballroom 210A/E
17.
Overview
• Effectively share LLAP cluster resources
• Resource allocation per user policy; separate ETL and BI, etc.
• Resource-based guardrails
• Protect against long running queries, high memory usage
• Improved, query-aware scheduling
• Scheduler is aware of query characteristics, types, etc.
• Fragments are easier to pre-empt than containers
• Queries get guaranteed fractions of the cluster, but can use empty space
18.
Resource Plans
• Resource plan is a workload management configuration for a cluster
• Switching is allowed without stopping queries, e.g. based on time of day
• Cluster is divided into query pools (optionally nested)
• Each pool defines query parallelism, cluster resources percentage
• Queries are automatically routed to pools based on user name, app, etc.
• Rules (triggers) to kill, move, or deprioritize queries based on DFS usage, runtime, etc.
• Example:
CREATE RESOURCE PLAN daytime;
CREATE POOL bi IN daytime (resource_percent=75, concurrent_queries=5);
CREATE POOL etl IN daytime (resource_percent=25, concurrent_queries=10);
CREATE RULE downgrade IN daytime WHEN total_runtime > 120 THEN MOVE etl;
ADD RULE downgrade TO bi IN daytime;
CREATE MAPPING tableau IN daytime (application='Tableau', pool=bi);
ALTER PLAN daytime SET default_pool='etl';
APPLY PLAN daytime;
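To make the routing and rule behaviour of the example plan concrete, here is a small Python sketch. It only models the decisions described above (mapping by application, falling back to the default pool, and moving a query when a rule fires). The dictionary layout and the function names `route` and `apply_triggers` are invented for this illustration and are not part of Hive.

```python
# The "daytime" plan from the example, as plain data.
plan = {
    "pools": {
        "bi": {"resource_percent": 75, "concurrent_queries": 5},
        "etl": {"resource_percent": 25, "concurrent_queries": 10},
    },
    "mappings": {"Tableau": "bi"},      # application name -> pool
    "default_pool": "etl",
    # pool -> list of (counter, limit, pool to move the query to)
    "triggers": {"bi": [("total_runtime", 120, "etl")]},
}


def route(plan, application):
    """Pick the pool for a new query based on the application mapping."""
    return plan["mappings"].get(application, plan["default_pool"])


def apply_triggers(plan, pool, counters):
    """Return the pool a running query belongs in after checking its rules."""
    for counter, limit, target_pool in plan["triggers"].get(pool, []):
        if counters.get(counter, 0) > limit:
            return target_pool
    return pool


print(route(plan, "Tableau"))                               # mapped application
print(route(plan, "nightly-load"))                          # unmapped application
print(apply_triggers(plan, "bi", {"total_runtime": 300}))   # rule fires
```

A Tableau query starts in the bi pool, an unmapped application lands in the default etl pool, and a bi query whose total_runtime exceeds 120 is moved to etl.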
19.
Decentralized Guaranteed Resources
• A guaranteed task for each resource (executor slot)
• HS2 gives N guaranteed tasks to each AM based on the configured resource plan
• Each AM marks N of its most important tasks as guaranteed at any given time
• Guaranteed tasks pre-empt speculative tasks
20.
Guaranteed Tasks – BI and ETL Example
[Diagram: HS2 hands out guaranteed tasks across LLAP Daemons 1–3, each with executors and a wait queue; 18 executors total]
• BI pool (80% = 14 guaranteed), Query 1: 10 active tasks (all running): 10 guaranteed (running), 4 guaranteed slots unused for now
• ETL pool (20% = 4 guaranteed), Query 2: 19 active tasks (8 running): 4 guaranteed (4 running), 15 speculative (4 running)
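The numbers in this example can be reproduced with a few lines of Python. This is a sketch of the arithmetic only. The slide does not say how fractional shares are rounded, so the largest-remainder rule used below is an assumption, and the function names are invented for illustration.

```python
import math


def split_guaranteed(total_executors, pool_fractions):
    """Split executor slots among pools.

    Largest-remainder rounding is an assumption of this sketch; it keeps
    the shares summing to total_executors.
    """
    exact = {p: f * total_executors for p, f in pool_fractions.items()}
    shares = {p: math.floor(x) for p, x in exact.items()}
    leftover = total_executors - sum(shares.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - shares[p], reverse=True)
    for pool in by_remainder[:leftover]:
        shares[pool] += 1
    return shares


def schedule(total_executors, shares, active_tasks):
    """Return per-pool counts of running guaranteed and speculative tasks."""
    guaranteed = {p: min(active_tasks[p], shares[p]) for p in shares}
    free = total_executors - sum(guaranteed.values())
    speculative = {}
    for pool in shares:
        # Slots that no guaranteed task is using can run speculative tasks.
        extra = min(active_tasks[pool] - guaranteed[pool], free)
        speculative[pool] = extra
        free -= extra
    return guaranteed, speculative


shares = split_guaranteed(18, {"bi": 0.8, "etl": 0.2})
guaranteed, speculative = schedule(18, shares, {"bi": 10, "etl": 19})
print(shares, guaranteed, speculative)
```

BI uses only 10 of its 14 guaranteed slots, so ETL runs 4 speculative tasks on top of its 4 guaranteed ones. If BI submitted more tasks, those speculative tasks would be the ones pre-empted.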
22.
Caching for BI Workloads
• Fine-grained (columnar), compact (dictionary, RLE encoded)
• Important due to projections over many wide EDW tables
• Prioritized – indexes are cached with higher priority
• Important to make use of predicate pushdown
• Off-heap (no extra GC), supports SSD
• LRFU replacement policy avoids the damage from large scans
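Why LRFU resists large scans can be seen in a toy cache. The sketch below is a textbook-style LRFU (each entry carries a combined recency/frequency score that decays over time), not LLAP's implementation. The class name, the logical clock, and the decay parameter value are assumptions made for this illustration.

```python
class LrfuCache:
    """Toy LRFU cache: evicts the entry with the lowest decayed score."""

    def __init__(self, capacity, lam=0.01):
        self.capacity = capacity
        self.lam = lam            # 0 behaves like LFU, larger values approach LRU
        self.clock = 0            # logical time, one tick per access
        self.entries = {}         # key -> (score, time of last access)

    def _decay(self, age):
        return 0.5 ** (self.lam * age)

    def _score_now(self, key):
        score, last = self.entries[key]
        return score * self._decay(self.clock - last)

    def access(self, key):
        """Record an access; return True on a hit, False on a miss."""
        self.clock += 1
        if key in self.entries:
            # A hit adds 1 to the decayed score of all earlier accesses.
            self.entries[key] = (1.0 + self._score_now(key), self.clock)
            return True
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self._score_now)
            del self.entries[victim]
        self.entries[key] = (1.0, self.clock)
        return False


cache = LrfuCache(capacity=3)
for _ in range(5):
    cache.access("hot")               # a frequently used block
for i in range(10):
    cache.access(f"scan{i}")          # a large one-pass scan
print("hot" in cache.entries)
```

The frequently accessed block survives the scan because every scan block has a lower score. A plain LRU cache of the same capacity would evict it as soon as the third scan block arrived.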
23.
Caching for BI Workloads – Formats, Zero-ETL
• ORC, Parquet
• Cached natively
• Zero-ETL analytics on CSV and JSON data with text caching
• Text is efficiently encoded in background; once cached, queries speed up
24.
In-memory Processing – Native Columnar (ORC)
[Diagram: on the I/O threads, a read planner and compression codec bring compressed data from the distributed FS into the off-heap cache and SSD cache as compact encoded data, governed by the replacement policy; the ORC decoder turns cached columns (col1, col2) into native data vectors; on the execution thread, a fragment's Hive operators consume them with vectorized processing]
25.
Running Hive queries fast in the cloud
Nita Dembla
Wednesday, June 20
4:00 PM - 4:40 PM
Grand Ballroom 220C
26.
Druid + Apache Hive
Hive Layer
• Data access pattern: large-scale analytics
• Features: joins, subqueries, windowing functions, transformations, complex aggregations, advanced sorting, UDFs
Druid Layer
• Data access pattern: needle-in-a-haystack queries with large numbers of dimensions
• Features: dimensional aggregates, top-N queries, min/max values, timeseries queries, approximate distinct count, approximate histograms
27.
Druid Integration
• Pushdown of aggregate queries
• Pushdown of complex expressions
• Improvements in Druid to support SQL-standard NULL semantics
• Store materialized views in Druid
28.
Hive 3: Real-time Ingestion
[Diagram: Kafka-Druid-Hive ingest connecting Hive and Druid for real-time analytics]
• Druid answers in near real-time
29.
Druid and Hive Together: Interactive
Realtime Analytics at Scale
Nishant Bangarwa
Tuesday, June 19
4:50 PM - 5:30 PM
Grand Ballroom 220B
30.
ACID v2
• New on-disk storage format for ACID tables
• Run major compactions before you upgrade
• Update = Delete + Insert
• Performance on par with non-ACID tables
• Support for LOAD statements
• New streaming ingestion library
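The "Update = Delete + Insert" idea can be illustrated with a toy in-memory model. This is a sketch only, not the ACID v2 on-disk format. The class `AcidTable`, its row ids, and its method names are invented here to show how a reader can merge inserted rows with delete records, and what a major compaction then cleans up.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class AcidTable:
    """Toy model: rows are only ever inserted; deletes are recorded separately."""
    inserts: dict = field(default_factory=dict)   # row_id -> row
    deletes: set = field(default_factory=set)     # row_ids marked as deleted
    _ids: count = field(default_factory=count)

    def insert(self, row):
        row_id = next(self._ids)
        self.inserts[row_id] = dict(row)
        return row_id

    def delete(self, row_id):
        self.deletes.add(row_id)

    def update(self, row_id, changes):
        # Update = delete the old version + insert the new version.
        new_row = {**self.inserts[row_id], **changes}
        self.delete(row_id)
        return self.insert(new_row)

    def scan(self):
        # Readers merge on the fly: inserted rows minus deleted row ids.
        return [r for rid, r in self.inserts.items() if rid not in self.deletes]

    def major_compact(self):
        # Rewrite only the live rows and drop the delete records.
        self.inserts = {rid: r for rid, r in self.inserts.items()
                        if rid not in self.deletes}
        self.deletes.clear()


table = AcidTable()
first = table.insert({"id": 1, "qty": 5})
table.insert({"id": 2, "qty": 7})
table.update(first, {"qty": 6})
rows = sorted(table.scan(), key=lambda r: r["id"])
print(rows)
```

No stored row is ever modified in place: the old version stays until compaction, and readers simply skip it.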
31.
Insert-only Tables
• Transactional Semantics for non-ORC tables
• For INSERT INTO and INSERT OVERWRITE
• With near-zero overhead
• No rename() - Cloud friendly
32.
Transactional Operations in Apache Hive
Eugene Koifman
Wednesday, June 20
11:50 AM - 12:30 PM
Executive Ballroom 210A/E
33.
Disaster Recovery for Hive Data
[Diagram: data sets A and B replicated between two on-premise data centers, (a) and (b), under centralized security and governance. Scheduled policy (A) runs at 2am, 10am and 6pm daily; scheduled policy (B) runs at 2am daily.]
1. Data replication with scheduled policy
2. Disaster takes down Data Center (b)
3. Failover to Data Center (a); data set B made active
4. Active data set B changes to B’ in Data Center (a)
34.
Hive-based Replication
• Replv2 introduces new REPL commands
• Incremental replication: only copy delta changes
• Point-in-time replication
• Hive maintains the replication state
• Additional support for other database objects, e.g. functions and constraints
• Reduced number of copies
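Incremental replication with tracked state can be sketched in a few lines of Python. This toy model is not the REPL command implementation. The `Source` and `Target` classes, the event tuple layout, and storing the last applied event id on the target are assumptions chosen to illustrate copying only the delta since the previous cycle.

```python
class Source:
    """Source warehouse with an append-only log of change events."""

    def __init__(self):
        self.events = []          # (event_id, key, value)

    def put(self, key, value):
        self.events.append((len(self.events) + 1, key, value))

    def dump(self, after_event_id):
        """Return only the events newer than after_event_id (the delta)."""
        return [e for e in self.events if e[0] > after_event_id]


class Target:
    """Replica that remembers how far it has replicated."""

    def __init__(self):
        self.data = {}
        self.last_event_id = 0    # replication state

    def load(self, events):
        for event_id, key, value in events:
            self.data[key] = value
            self.last_event_id = event_id
        return len(events)


src, tgt = Source(), Target()
src.put("t1", "v1")
src.put("t2", "v1")
first_cycle = tgt.load(src.dump(tgt.last_event_id))    # initial copy
src.put("t1", "v2")
second_cycle = tgt.load(src.dump(tgt.last_event_id))   # only the new change
print(first_cycle, second_cycle, tgt.data)
```

The second cycle transfers a single event rather than the whole data set, and the recorded event id marks the point in time the replica corresponds to.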
35.
Seamless Replication and Disaster
Recovery for Apache Hive Warehouse
Sankar Hariappan
Thursday, June 21
9:30 AM - 10:10 AM
Meeting Room 211A/B/C/D
36.
One Metastore to Rule Them All
[Diagram: HMS at the center, connected to HDFS/S3, Kafka, Hive LLAP, Hive on Tez, Spark, Atlas, Ranger and Schema Registry (SR)]
37.
Between Us and the Grand Vision
• Make HMS separable from Hive
• Standalone Metastore
• Unify HMS and Schema Registry so batch and streaming can see each other’s data
• Also reduces the number of metadata systems admins have to install and maintain
38.
Sharing Metadata Across the Data Lake
and Streams
Alan Gates
Wednesday, June 20
11:50 AM - 12:30 PM
Meeting Room 230A
39.
External Access – Spark-LLAP
40.
External Access – Relational View for Everyone
• Hive-on-Tez and other DAG executors can use LLAP directly
• LLAP also provides a "relational datanode" view of the data
• Anyone (with access) can push (approved) code in, from complex query fragments to simple data reads
• E.g. a Spark DataFrame can be created with LlapInputFormat
• Gives external services access to:
• Hive data: centralized, secure data access
• Ability to read all Hive table types, like ACID transactional tables
• Hive features: from column-level security, to LLAP columnar cache
41.
Support Row/Column-level Security in Spark
spark-shell
pyspark
42.
What Is Required?
• Apache Ranger
• Apache Hive with LLAP
• Spark-LLAP
• A library to integrate above tech with SparkSQL
43.
HiveServer2 + LLAP + Ranger
[Diagram: client app, HiveServer2 (plan generation, Ranger dynamic policies), Hive query coordinator and LLAP daemons in a YARN cluster; numbers 1–5 mark the steps below]
SQL query: select name from users
Plan: TableScan: users; Filter: state = 'CA'; Projection: mask(name)
1. Client sends the query to HiveServer2.
2. HiveServer2 generates the query plan. Ranger security policies are applied and the plan is modified based on dynamic security policies.
3. Query plan is sent to the query coordinator.
4. Query plan is sent to the LLAP daemons for execution. Filtering/masking is performed.
5. Results are consolidated and sent to the client.
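Step 2, where the plan is modified before execution, is the part that makes the security invisible to the client. The sketch below illustrates that idea in plain Python; it is not Ranger's or Hive's API. The plan and policy dictionaries, the function names, and the character-class masking rule are all assumptions made for the example.

```python
def mask(value):
    """Replace letters and digits with placeholder characters."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("n")
        else:
            out.append(ch)
    return "".join(out)


def apply_policies(plan, policy):
    """Return a new plan with the policy's row filter and column masks added."""
    secured = dict(plan)
    if policy.get("row_filter"):
        secured["filters"] = plan.get("filters", []) + [policy["row_filter"]]
    masked = policy.get("masked_columns", ())
    secured["projections"] = [
        (col, mask if col in masked else None) for col in plan["columns"]
    ]
    return secured


def execute(plan, rows):
    """Run the secured plan: filter rows, then project (and mask) columns."""
    result = []
    for row in rows:
        if all(f(row) for f in plan.get("filters", [])):
            result.append(tuple(fn(row[c]) if fn else row[c]
                                for c, fn in plan["projections"]))
    return result


users = [
    {"name": "Alice", "state": "CA"},
    {"name": "Bob", "state": "NY"},
    {"name": "Carol7", "state": "CA"},
]
plan = {"table": "users", "columns": ["name"]}     # select name from users
policy = {"row_filter": lambda r: r["state"] == "CA", "masked_columns": {"name"}}
result = execute(apply_policies(plan, policy), users)
print(result)
```

The client asked only for `name`, yet it receives masked values for California users alone, because the filter and mask were injected into the plan rather than left to the client.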
44.
HiveServer2 + LLAP + Ranger
[Diagram: client app using LLAP InputFormat, HiveServer2 (plan generation, Ranger dynamic policies), Hive query coordinator and LLAP daemons in a YARN cluster; numbers 1–4 mark the steps below]
SQL query: select name from users
Plan: TableScan: users; Filter: state = 'CA'; Projection: mask(name)
1. Client requests data locations, known as “splits”, from HiveServer2.
2. HiveServer2 generates the query plan. Ranger security policies are applied and the plan is modified based on dynamic security policies.
3. Splits, which include the signed query plan, are returned to the client.
4. The client uses the LLAP splits to securely submit the query plan to LLAP. Filtering/masking is performed and data is returned to the client.
45.
“Other” Improvements
• Query reoptimization
• Constraints
• Vectorization
• Query Cache
• Active/Passive HS2 HA for LLAP
• HLL BitVectors
• CachedStore
• Numerous enhancements in Spark Integration
46.
Future
• Standalone Metastore
• Materialized Views – Automatic Recommendations
• Better integration with cloud storage
• HS2 scalability
47.
Thanks to the open source community for continued success over the last 10 years.
Now, onwards to the next 10 years.