Apache Hive is a rapidly evolving project that is widely loved across the big data ecosystem. Hive continues to expand its support for analytics, reporting, and interactive queries, and the community is working to improve it across many other dimensions and use cases. In this talk, we introduce the latest and greatest features and optimizations that landed in the project over the last year. This includes benchmarks covering LLAP, materialized views and Apache Druid integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. I will also tell you a little about what you can expect in the future.
What's New in Apache Hive 3.0?
1. © Hortonworks Inc. 2011–2018. All rights reserved
Apache Hive 3.0, A New Horizon
Alan Gates
Hortonworks Co-founder, Apache Hive PMC member
@alanfgates
2. Apache Hive - Data Warehousing for Big Data
• Comprehensive ANSI SQL
• Only open source Hadoop SQL with transactions, INSERT/UPDATE/DELETE/MERGE
• BI queries with MPP performance at big data scales
• ETL jobs scale with your cluster
• Enables per-user dynamic row and column security
• Enables replication for HA and DR
• Compatible with every major BI tool
• Proven at 300+ PB scale
3. Hive on Tez
[Diagram: SQL queries arrive over ODBC/JDBC at HiveServer2 (query endpoint). The Hadoop cluster holds Tez AMs and Tez containers, each container running query executors. Deep storage is HDFS and compatible stores: S3, WASB, Isilon.]
4. Hive LLAP - MPP Performance at Hadoop Scale
[Diagram: SQL queries arrive over ODBC/JDBC at HiveServer2 (query endpoint). The Hadoop cluster holds query coordinators and LLAP daemons, each daemon running query executors, with an in-memory cache shared across all users. Deep storage is HDFS and compatible stores: S3, WASB, Isilon.]
5. Hive3: EDW analyst pipeline
[Diagram components: BI tools; Materialized view; Surrogate key; Constraints; Query Result Cache; Workload management; ACID v2; Cloud Storage; Connectors]
• Results return from HDFS/cache directly
• Reduce load from repetitive queries
• Allows more queries to be run in parallel
• Reduce resource starvation in large clusters
• Active/Passive HA
• More "tools" for optimizer to use
• More "tools" for DBAs to tune/optimize
• Invisible tuning of DB from users' perspective
• ACID v2 is as fast as regular tables
• Hive 3 is optimized for S3/WASB/GCP
• Support for JDBC/Kafka/Druid out of the box
6. Benchmark journey
TPCDS 10TB scale on 10 node cluster
• Ran all 99 TPCDS queries
• Total query runtime has improved multifold in each release!
[Chart: releases in order are HDP 2.5 Hive1, HDP 2.5 LLAP, HDP 2.6 LLAP, HDP 3.0 LLAP; timeline 2016, 2017, 2018; speedup labels 25x, 3x, 2x; additional label "ACID tables"]
7. New Features
8. Transactional Read and Write
• Originally Hive supported writes only by adding partitions or loading new files into existing partitions
• Starting in version 0.13, Hive added transactions and INSERT, UPDATE, DELETE
• Supports
  • Slowly changing dimensions
  • Correcting mis-loaded data
  • GDPR's right to be forgotten
• Not OLTP!
• Drawbacks:
  • Transactional tables had to be stored in ORC and had to be bucketed
  • Reading transactional tables was significantly slower than reading non-transactional tables
  • No support for MERGE or UPSERT functionality
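As a rough illustration of the pre-3.0 restrictions listed above, a transactional table had to be declared as bucketed ORC before UPDATE or DELETE would work. The table and column names below are hypothetical, and the two session settings assume a cluster where the transaction manager is not already configured.

```sql
-- Assumption: concurrency and the DB transaction manager are not yet enabled.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical dimension table; before Hive 3.0 an ACID table
-- needed both ORC storage and a CLUSTERED BY ... INTO n BUCKETS clause.
CREATE TABLE customer_dim_v1 (
  customer_id BIGINT,
  email       STRING,
  country     STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Correct mis-loaded data in place.
UPDATE customer_dim_v1 SET country = 'DE' WHERE customer_id = 42;

-- Honor a right-to-be-forgotten request.
DELETE FROM customer_dim_v1 WHERE customer_id = 1337;
```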
9. ACID v2
• In 3.0 ACID storage has been reworked
  • Performance penalty for ACID is now negligible even when compactor has not run
  • With other optimizations ACID can result in speed up (more on this in the performance talk)
• Added MERGE support
  • CDC can be regularly merged into a fact table with upsert functionality
• Removed restrictions:
  • Tables no longer have to be bucketed
  • Non-ORC based tables supported (INSERT & SELECT only)
• Still not OLTP!
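A minimal sketch of the CDC upsert pattern described above, assuming a Hive 3 cluster with transactions enabled. All table and column names (sales_fact, sales_cdc, op, clickstream_text) are hypothetical and only serve the illustration.

```sql
-- Hypothetical fact table: no CLUSTERED BY clause is needed in Hive 3.
CREATE TABLE sales_fact (
  sale_id BIGINT,
  amount  DECIMAL(12,2),
  status  STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Hypothetical staging table holding one batch of CDC records.
-- op = 'D' marks a delete; any other value is an insert or update.
CREATE TABLE sales_cdc (
  sale_id BIGINT,
  amount  DECIMAL(12,2),
  status  STRING,
  op      STRING
);

-- Upsert the batch into the fact table in one statement.
MERGE INTO sales_fact AS t
USING sales_cdc AS s
ON t.sale_id = s.sale_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT VALUES (s.sale_id, s.amount, s.status);

-- Non-ORC transactional table: insert-only, so no UPDATE, DELETE or MERGE.
CREATE TABLE clickstream_text (ts STRING, url STRING)
STORED AS TEXTFILE
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');
```

When a MERGE has both an UPDATE and a DELETE branch, Hive requires the first WHEN MATCHED clause to carry an extra AND condition, which is why the delete branch comes first here.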
10. Materialized Views
1. Create materialized view using Hive tables
  • Stored by Hive or Druid
2. User or dashboard sends queries to Hive
  • Hive rewrites queries using available materialized views
  • Execute rewritten query
CREATE MATERIALIZED VIEW `ssb_mv`
STORED AS 'org.apache.hadoop.hive.druid.DruidStorageHandler'
ENABLE REWRITE
AS
<query>;
[Diagram: ① a DBA or recommendation system creates materialized views over the data; ② dashboards and BI tools send queries]
11. Materialized view-based rewriting example
• Materialized view definition
CREATE MATERIALIZED VIEW mv AS
SELECT <dims>,
lo_revenue,
lo_extendedprice * lo_discount AS d_price,
lo_revenue - lo_supplycost
FROM
customer, dates, lineorder, part, supplier
WHERE
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and lo_custkey = c_custkey;
• Query
SELECT sum(lo_extendedprice * lo_discount)
FROM
lineorder, dates
WHERE
lo_orderdate = d_datekey
and d_year = 2013
and lo_discount between 1 and 3;
• Materialized view-based rewriting
SELECT SUM(d_price)
FROM mv
WHERE
d_year = 2013
and lo_discount between 1 and 3;
[Diagram: join graph over supplier, part, dates, customer, lineorder]
mv contents
d_year | lo_discount | <dims> | d_price
2013 | 2 | ... | 7.55
2014 | 4 | ... | 432.60
2013 | 2 | ... | 34.45
2012 | 2 | ... | 2.05
… | … | ... | …
Query results
sum
42.0
…
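One way to check whether the optimizer actually picked up the view is to EXPLAIN the original query and look for a scan of mv instead of the base tables. This is a sketch under assumptions: mv is enabled for rewriting, its <dims> include d_year and lo_discount, and the base tables are in a state the rewriting engine accepts.

```sql
-- Rewriting is controlled by this session/site property.
SET hive.materializedview.rewriting=true;

-- If the rewrite applies, the plan should read from mv
-- rather than joining lineorder and dates.
EXPLAIN
SELECT sum(lo_extendedprice * lo_discount)
FROM lineorder, dates
WHERE lo_orderdate = d_datekey
  AND d_year = 2013
  AND lo_discount BETWEEN 1 AND 3;
```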
12. Materialized view - Maintenance
• Partial table rewrites are supported
  • Typical: Denormalize last month of data only
  • Rewrite engine will produce union of latest and historical data
• Updates to base tables
  • Invalidates views, but
  • Can choose to allow stale views (max staleness) for performance
  • Can partial match views and compute delta after updates
• Incremental updates
  • Common classes of views allow for incremental updates
  • Others need full refresh
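The maintenance options above map onto two statements, sketched here for the mv view from the earlier example. The 10-minute window is an arbitrary illustrative value, and whether the rebuild runs incrementally depends on the view definition and on the base tables being transactional.

```sql
-- Refresh the view after base-table changes. Hive attempts an incremental
-- rebuild where the view definition allows it, otherwise a full rebuild.
ALTER MATERIALIZED VIEW mv REBUILD;

-- Allow the optimizer to keep using views that are up to 10 minutes stale
-- (illustrative value) instead of treating them as invalidated.
SET hive.materializedview.rewriting.time.window=10min;
```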
13. LLAP workload management
• Effectively share LLAP cluster resources
  – Resource allocation per user policy; separate ETL and BI, etc.
• Resource-based guardrails
  – Protect against long running queries, high memory usage
• Improved, query-aware scheduling
  – Scheduler is aware of query characteristics, types, etc.
  – Fragments are easier to pre-empt than containers
  – Queries get guaranteed fractions of the cluster, but can use empty space
14. Plus More
• Constraints (primary/foreign keys, not null) and default values supported
• Surrogate keys - default values, unique, not monotonically increasing
• Replication of data and metadata between Hive instances
• SQL Standard Information Schema now supported
• In Hive 3 much work has been done to optimize Hive for object stores
  • Hive uses its ACID system to determine which files to read rather than trust the storage
  • Moves eliminated wherever possible
  • More aggressive caching of file metadata and data to reduce file system operations
• Apache Parquet and text files now supported in LLAP
• Query cache to return results from repeated queries in under 100ms (requires ACID)
• Metastore cache for faster query compilation and planning, especially in the cloud
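A short sketch of how constraints, default values, and surrogate keys can look in a table definition, followed by an Information Schema lookup. The table and column names are hypothetical, and the last query assumes the information_schema database has been set up on the cluster.

```sql
-- Hypothetical dimension table. Primary keys are declared but not enforced
-- (DISABLE NOVALIDATE); NOT NULL and DEFAULT are applied on write.
CREATE TABLE customer_dim (
  customer_sk BIGINT DEFAULT SURROGATE_KEY(),  -- unique, not monotonically increasing
  customer_id STRING NOT NULL,
  country     STRING DEFAULT 'unknown',
  PRIMARY KEY (customer_sk) DISABLE NOVALIDATE
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Omitted columns pick up their defaults, including the surrogate key.
INSERT INTO customer_dim (customer_id) VALUES ('c-001'), ('c-002');

-- List tables through the SQL standard Information Schema.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'default';
```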
16. EDW ingestion pipeline
[Diagram labels: LLAP interface; Kafka-Druid-Hive ingest; Kafka-Hive streaming ingest; Druid; ACID tables; JDBC sources (MySQL, Postgres, Oracle)]
Real-time analytics
• Druid answers in near real-time
• JDBC sources
• Kafka sources
Easy to use
• Query any data via LLAP
• No need to de-ACID tables
• No bucketing required
• Calcite talks SQL
• Materialization just works
• Cache just works
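To make the Kafka source concrete, here is a minimal sketch of exposing a topic as a Hive table. It assumes a Hive build that ships the Kafka storage handler and a topic carrying JSON records; the topic name, broker address, table name, and columns are all hypothetical.

```sql
-- Hypothetical topic and broker; columns must match the fields in the records.
CREATE EXTERNAL TABLE kafka_events (
  user_id STRING,
  page    STRING,
  clicks  INT
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "events",
  "kafka.bootstrap.servers" = "localhost:9092"
);

-- The topic can then be queried like any other table.
SELECT page, SUM(clicks) AS clicks
FROM kafka_events
GROUP BY page;
```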
17. [Diagram: Spark (driver, executors, Spark metastore) connects through HWC (JDBC) to Hive (HiveServer+Tez, LLAP daemons, Hive metastore) over ACID tables]
1. Driver submits query to HiveServer
2. Compile query and return "splits" to Driver
3. Execute query on LLAP
hive.executeQuery("SELECT * FROM t").sort("A").show()
18. [Diagram: Spark (driver, executors, Spark metastore) connects through HWC (Arrow) to Hive (HiveServer+Tez, LLAP daemons, Hive metastore) over ACID tables]
4. Executor tasks run for each split
5. Tasks read Arrow data from LLAP
6. HWC returns ArrowColumnVectors to Spark
hive.executeQuery("SELECT * FROM t").sort("A").show()
19. Druid capabilities
• Streaming ingestion capability
• Data freshness: analyze events as they occur
• Fast response time (ideally < 1sec query time)
• Arbitrary slicing and dicing
• Multi-tenancy: 1000s of concurrent users
• Scalability and availability
• Rich real-time visualization with Superset
Apache Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
20. Hive and Druid, Better Together
Technology | Strengths | Issues
Hive | SQL 2011, JDBC/ODBC; fast scans; ACID | Not optimized for slice and dice and drill down (OLAP cubing) operations
Druid | Dimensional aggregates support OLAP cubes; timeseries queries; realtime ingestion of streaming data | Lacks SQL interface; no joins
Problem: You don't want two systems to manage and load data into
Solution: For data that fits best in Druid, load it in Druid and access it with Hive
• Hive supports push down of queries to Druid, optimizer knows what to push and what to run in Hive
• Enables SQL and JDBC/ODBC access to data in Druid
• Enables join of historical and realtime data
• Enables Hive support of slice & dice, drill down for OLAP cubing
• Can also create materialized views in Hive and store them in Druid
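A minimal sketch of the "load it in Druid and access it with Hive" approach, assuming Hive is already configured with the Druid broker address. The datasource name, the Hive-side customer table, and all column names are hypothetical.

```sql
-- Map an existing Druid datasource into Hive; columns are discovered from Druid.
CREATE EXTERNAL TABLE druid_events
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "events");

-- Join realtime Druid data with a historical Hive table. The optimizer decides
-- which parts to push down to Druid and which to run in Hive.
SELECT c.region, SUM(e.clicks) AS clicks
FROM druid_events e
JOIN customer c ON e.customer_id = c.customer_id
GROUP BY c.region;
```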
21. SOLUTIONS: Heuristic recommendation engine
Fully self-serviced query and storage optimization