Page1 © Hortonworks Inc. 2014
Hive ACID
Hive Streaming & SQL Insert/Update/Delete
Raj Bains, Fall 2015
Page2 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, and CRM systems; unstructured documents and emails; clickstream; server logs; sentiment and web data; sensor/machine data; geolocation.
Workloads served by Hive SQL: interactive analytics, batch reports / deep analytics, and ETL / ELT.
Page3 © Hortonworks Inc. 2014
Hive: Batch to Sub-Second
• Hive 0.10 – Batch processing.
• Hive 0.13 – Human interactive (~10 seconds). Stinger Initiative: vectorized SQL engine, Tez execution engine, ORC columnar format, cost-based optimizer, faster map joins. 52x average query speedup (7.8 days to 9.3 hours).
• Hive 0.14 – Human interactive (~5 seconds). 3x average query speedup.
• Hive 1.2 – Human interactive (~5 seconds).
• Hive 2.0 – Sub-second. Stinger.Next Initiative: LLAP in-memory cache, LLAP resident process, new metastore for compile, vectorization improvements. Significant query speedup.
(Speedups measured using the TPC-DS benchmark.)
Page4 © Hortonworks Inc. 2014
Transaction Overview
Page5 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Transaction Use Cases
• Reporting with Analytics (YES)
• Reporting on data with occasional updates
• Corrections to the fact tables, evolving dimension tables
• Low concurrency updates, low TPS
• SQL INSERT / UPDATE / DELETE Support
• Operational Reporting (Next)
• High throughput ingest from operational (OLTP) database
• Periodic inserts every 5-30 minutes
• Bulk updates and deletes are not supported
• SQL Merge
• Requires tool support and changes to Hive's transaction implementation
• Operational (OLTP) Database (NO)
• Small transactions, each doing single-row inserts
• High Concurrency - Hundreds to thousands of connections
[Diagrams: an OLTP system replicating into Hive for operational reporting; analytics with occasional modifications running in Hive; a high-concurrency OLTP workload, which Hive is not suited for.]
Page6 © Hortonworks Inc. 2014
Transaction Use Cases
• Streaming Ingest
• Use Hive Streaming API
• Designed for tools
[Diagram: streaming tools append data into Hive, where analytics run on it.]
Page7 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Transaction Compactions
1. Original file – a task reads the latest read-optimized ORCFile.
2. Edits made – a task reads the read-optimized ORCFile and merges in the delta file containing the edits.
3. Edits merged – a task reads the updated, merged read-optimized ORCFile.
The Hive ACID Compactor periodically merges the delta files in the background. Because HDFS files cannot be modified in place, edits are written to separate delta files rather than into the base ORCFile.
Page8 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
More About Compaction
• Minor compaction (triggered at ~10%, local) joins several delta files into a single delta file.
• Major compaction (triggered at ~10%, global) merges the delta files into the read-optimized base ORCFile.
Compactions can also be requested manually, as sketched below.
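A minimal sketch of requesting a compaction explicitly (normally the compactor triggers compactions automatically based on the thresholds configured later in this deck); the table name web_events and the partition value are hypothetical:

ALTER TABLE web_events PARTITION (ds = '2015-08-25') COMPACT 'minor';
ALTER TABLE web_events PARTITION (ds = '2015-08-25') COMPACT 'major';
-- progress can be monitored with:
SHOW COMPACTIONS;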
Page9 © Hortonworks Inc. 2014
Setting up Transactions in Ambari 2.1
Page10 © Hortonworks Inc. 2014
Transaction Internals
Page11 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Compaction: Scheduling
[Architecture diagram: clients (HiveCLI; Beeline and BI tools such as MicroStrategy over JDBC/ODBC) talk to HiveServer 2 (driver.compile, driver.execute), which runs on the Tez execution layer; the Hive Metastore holds the schema definitions (db, Table1/Schema1, Table2/Schema2) backed by the Hive Metastore DB.]
The Metastore schedules compaction jobs as the owner of the HDFS data, in that owner's default queue.
Note: you have to set up impersonation so the Metastore can run as the owner of the data.
Compaction jobs – workload management and usability limitations:
• Can I schedule compaction in the queue I want?
• Can I easily figure out whether compaction is taking up N% of my cluster resources?
Page12 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Locking and Concurrency
• Hive follows snapshot isolation:
• Writers do traditional two-phase locking (and write newer versions of the data)
• Readers read against the latest version number as of when the query arrived
• Readers and writers do not block each other:
• Writers can write newer versions
• Readers read a consistent view based on the version number
• Locking is done at the table level and the partition level:
• If the partition cannot be determined, table-level locking is used
• Two transactions trying to update the same table/partition will block behind one another
Page13 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Transactions Implementation
• INSERTS
• Inserts write delta files instead of appending rows to new base files
• Requires the full list of columns:
• INSERT INTO TABLE T VALUES (1, 2, 3) – OK
• UPDATES
• UPDATE T SET name = 'fred' WHERE name = 'freddy';
• Writes the new values to the delta file, complete with transaction information
• A new UPDATE privilege was added to authorization
• Updated columns are passed to the authorizer; SELECT privileges are also required
• DELETES
• DELETE FROM T WHERE name = 'freddy';
• Writes the deleted values to the delta file, complete with transaction information
• The DELETE privilege already existed in authorization; SELECT privileges are also required
A complete worked example follows.
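A minimal end-to-end sketch of these statements; the table t and its columns are hypothetical, and the DDL anticipates the restrictions listed on the next slide (ORC storage, buckets, and the transactional table property):

CREATE TABLE t (a INT, name STRING, c INT)
CLUSTERED BY (a) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

INSERT INTO TABLE t VALUES (1, 'freddy', 10);      -- written as a delta file
UPDATE t SET name = 'fred' WHERE name = 'freddy';  -- rewritten internally as an insert of new row versions
DELETE FROM t WHERE name = 'fred';                 -- deleted row IDs recorded in a delta file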
Page14 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Transaction Gotchas – Update Example
• Update Implementation
• User statement
• UPDATE T SET name = 'fred' WHERE name = 'freddy';
• Rewritten as
• INSERT INTO T SELECT ROW__ID, a, 'fred', c FROM T WHERE name = 'freddy' (assuming T has columns a, b, c)
• ROW__ID is row-identifier information from AcidInputFormat
• Some consequences
• Subqueries are disallowed only on the right-hand side of SET; they are fine in the WHERE clause:
• BAD: update T set name = (select name from popular_names order by name limit 1);
• OK: update T set popular = true where name in (select name from popular_names)
Page15 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Restrictions
OVERALL – ACID is V1; it ships with restrictions, and usability improvements are still to come.
• The table must be declared with the transactional property
• The table must be bucketed
INSERTS
• Still need to spell out the "TABLE" and "PARTITION" keywords (fix TODO):
• INSERT INTO TABLE T
• INSERT INTO TABLE T PARTITION(ds = 'today') VALUES ...
• INSERT OVERWRITE is disallowed
UPDATES
• Cannot update the partition column or the bucket column (this would require a delete plus an insert, and there is no easy way to do that at the moment)
• Expressions on the right side of SET must be supported as projections by Hive, so no subqueries
Compactions
• Workload management still needs to be built
Page16 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Transaction Performance
• Insert – no scans; a single partition/bucket write
• Update – full table scan unless the partition key is in the WHERE clause (see the sketch below)
• Delete – full table scan unless the partition key is in the WHERE clause
• Updating or deleting multiple rows of values in a single statement produces a single table scan
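Including the partition key in the WHERE clause lets Hive prune partitions instead of scanning the whole table. A sketch using the user_info table defined later in the deck (partitioned by ds); the lastname value and user_id are made up:

-- scans every partition of the table:
UPDATE user_info SET lastname = 'Smith' WHERE user_id = 12345;

-- scans only the ds = '2015-08-25' partition:
UPDATE user_info SET lastname = 'Smith' WHERE ds = '2015-08-25' AND user_id = 12345;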
Page17 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Other commands and Settings
• SHOW TRANSACTIONS
• SHOW TRANSACTIONS is for use by administrators when Hive transactions are being used. It returns a list of all currently
open and aborted transactions in the system, including this information:
• transaction ID
• transaction state
• user who started the transaction
• machine where the transaction was started
• SHOW COMPACTIONS
• SHOW COMPACTIONS returns a list of all tables and partitions currently being compacted or scheduled for compaction when
Hive transactions are being used, including this information:
• database name
• table name
• partition name (if the table is partitioned)
• whether it is a major or minor compaction
• the state the compaction is in, which can be:
• "initiated" – waiting in the queue to be compacted
• "working" – being compacted
• "ready for cleaning" – the compaction has been done and the old files are scheduled to be cleaned
• thread ID of the worker thread doing the compaction (only if in "working" state)
• the time at which the compaction started (only if in "working" or "ready for cleaning" state)
• SHOW LOCKS
• SHOW LOCKS <table_name>;
• SHOW LOCKS <table_name> EXTENDED;
• SHOW LOCKS <table_name> PARTITION (<partition_desc>);
• SHOW LOCKS <table_name> PARTITION (<partition_desc>) EXTENDED;
Page18 © Hortonworks Inc. 2014 HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION
Configuration
Configuration keys, values, and notes:
• hive.txn.manager
  Default: org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager. For transactions: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager. DummyTxnManager replicates pre-Hive-0.13 behavior and provides no transactions.
• hive.txn.timeout
  Default: 300. Time in seconds after which transactions are declared aborted if the client has not sent a heartbeat.
• hive.txn.max.open.batch
  Default: 1000. Maximum number of transactions that can be fetched in one call to open_txns(). Controls how many transactions streaming agents such as Flume or Storm open simultaneously.
• hive.compactor.initiator.on
  Default: false. To turn on transactions: true (for exactly one instance of the Thrift metastore service). Whether to run the initiator and cleaner threads on this metastore instance.
• hive.compactor.worker.threads
  Default: 0. To turn on transactions: > 0 on at least one instance of the Thrift metastore service. How many compactor worker threads to run on this metastore instance.
• hive.compactor.worker.timeout
  Default: 86400. Time in seconds after which a compaction job will be declared failed and re-queued.
• hive.compactor.check.interval
  Default: 300. Time in seconds between checks to see whether any tables or partitions need to be compacted.
• hive.compactor.delta.num.threshold
  Default: 10. Number of delta directories in a table or partition that will trigger a minor compaction.
• hive.compactor.delta.pct.threshold
  Default: 0.1. Fractional size of the delta files relative to the base that will trigger a major compaction (1 = 100%, so the default 0.1 = 10%).
• hive.compactor.abortedtxn.threshold
  Default: 1000. Number of aborted transactions involving a given table or partition that will trigger a major compaction.
• hive.enforce.bucketing: true
• hive.exec.dynamic.partition.mode: nonstrict
• hive.support.concurrency: true
An example of applying these settings follows.
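A sketch of how these keys are typically applied; in practice Ambari writes them into hive-site.xml, and the session-level SET form below is shown only for illustration:

-- client / HiveServer2 side:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- metastore side (hive-site.xml; initiator on exactly one metastore instance):
--   hive.compactor.initiator.on = true
--   hive.compactor.worker.threads = 1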
Page19 © Hortonworks Inc. 2014
Streaming Solution Goals
What business goals are required?
Page20 © Hortonworks Inc. 2014
Background - Hive Partitions and Buckets
CREATE TABLE user_info(
user_id BIGINT,
firstname STRING,
lastname STRING)
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 255 BUCKETS;
Partitions
• All rows in a partition have the same value of the partition key
• A partition is a directory
Buckets
• Within every partition, rows are distributed into buckets using a hash function:
• hash_fn(bucketing_column) mod num_buckets
[Diagram: Table → Partitions date-0, 2015-08-25, …, date-N; each partition is divided into Buckets 1 through 255.]
Page21 © Hortonworks Inc. 2014
Hive Transactions and Buckets – Physical Structure
CREATE TABLE user_info(
user_id BIGINT,
firstname STRING,
lastname STRING)
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 255 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

insert into table user_info values (,,), (,,), (,,), (,,), (,,);  -- schematic; a filled-in example follows
[Diagram: Table → Partition 2015-08-25 containing Delta1 (Buckets 1…255) and Delta2 (Buckets 1…255); the other partitions (…, date-N) have the same layout.]
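The (,,) placeholders above stand for concrete rows. A filled-in version, with made-up names and an explicit target partition, might look like this; each such insert creates a new delta directory under the partition:

INSERT INTO TABLE user_info PARTITION (ds = '2015-08-25')
VALUES (1, 'Ada', 'Lovelace'),
       (2, 'Alan', 'Turing'),
       (3, 'Grace', 'Hopper');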
Page22 © Hortonworks Inc. 2014
Example Streaming Solution
[Diagram: Kafka → Storm topology of Hive Bolts → Hive, via the Hive Streaming API. Streaming analytics run inside the Storm topology; SQL analytics run in Hive.]
Page23 © Hortonworks Inc. 2014
Streaming into Hive without Hive Streaming
[Diagram: Hive Bolts writing directly into Hive partitions without the Hive Streaming API.]
Page24 © Hortonworks Inc. 2014
What are your Streaming Sink goals?
Write Throughput
• Determines the hardware cost of the solution
• (events * size) per node
Ingest-to-Query Latency
• How long after ingest is the data available for query?
Query Speed
• What query speed is needed when reading the data?
• How do I want to lay out the data to achieve that?
Data Quality Guarantees
• What are the semantics of the solution: at least once, or exactly once?
Page25 © Hortonworks Inc. 2014
Hive Streaming
Streaming API using ACID + ORC
Page26 © Hortonworks Inc. 2014
Hive Streaming API Basics
[Diagram: Storm Hive Bolt → Hive Streaming API → Hive Metastore and HDFS.]
The call sequence looks roughly like this (a sketch close to the org.apache.hive.hcatalog.streaming API):

HiveEndPoint endPoint = new HiveEndPoint(metastoreURI, db, table, partitions);
StreamingConnection conn = endPoint.newConnection(true);
TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
while (txnBatch.remainingTransactions() > 0) {
  txnBatch.beginNextTransaction();
  txnBatch.write(row);   // write batchSize rows per transaction
  txnBatch.commit();
}
txnBatch.close();

Performance note: fewer Metastore calls are better, so use large transaction batches.
Page27 © Hortonworks Inc. 2014
ORC File Basics
• Row data is indexed in row groups of 10K rows each
• A stripe consists of roughly 500K rows at ~500 bytes per row
Read Considerations
When reading a subset of the data – the common case – you want to push the predicate down and skip over most of the data by looking at the metadata:
• Indexes cover most primitive types
• Min/max statistics are most effective for sorted data
• Bloom filters (Hive 1.2+) are effective even for non-sorted data (see the sketch below)
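Bloom filters are opted into per column when the ORC table is created. A minimal sketch, assuming a hypothetical clicks table whose user_id column is high-cardinality and unsorted:

CREATE TABLE clicks (user_id BIGINT, url STRING, ts TIMESTAMP)
STORED AS ORC
TBLPROPERTIES (
  "orc.bloom.filter.columns" = "user_id",
  "orc.bloom.filter.fpp" = "0.05"   -- false-positive probability
);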
Page28 © Hortonworks Inc. 2014
Hive Streaming + ORC: Writes
• Bolts write delta files
• Multiple bolts/threads can write to a bucket, but each writes a separate delta file
• One thread per bucket yields larger files
• Rows are uniquely identified by <Bucket, Txn, Id>
[Diagram: several Hive Bolts writing delta files into Buckets 1–4 of a partition, alongside the base file.]
Page29 © Hortonworks Inc. 2014
Hive Streaming + ORC: Reads
• The mappers read the buckets and merge the delta files at read time
• Sorting within files on keys helps the merge go fast
• If there is more than one mapper per bucket:
• they can split the base file,
• but each will have to merge all delta files
[Diagram: multiple map tasks reading the base file and delta files of Buckets 1–4 within a partition.]
Page30 © Hortonworks Inc. 2014
Hive Streaming + ORC: Compactions
Note: there are no base files until the first major compaction.
• Minor compaction joins multiple delta files into one
• Major compaction merges the delta files with the base file
[Diagram: a partition's Buckets 1–4 shown in three stages – many delta files plus a base file, a single delta file per bucket plus a base file after minor compaction, and a base file only after major compaction.]
Page31 © Hortonworks Inc. 2014
Storm Overview
Page32 © Hortonworks Inc. 2014
Storm Overview
[Diagram: Kafka Topic 1, Partitions 0–2 → Kafka Spouts → Bolts (analytics) → Hive Bolts → Hive (HDFS).]
Page33 © Hortonworks Inc. 2014
Storm Guarantees and Constraints
[Diagram: Kafka Topic 1 partitions → Kafka Spouts → Bolts → Hive Bolts → Hive (HDFS), with state held in memory and ACKs flowing back from Hive.]
• Storm would like to guarantee that data is delivered to Hive exactly once
• State needs to be preserved until the data is acknowledged by Hive; otherwise the Kafka data is replayed
• This means large inserts can cause the in-memory state to fill up memory
Page34 © Hortonworks Inc. 2014
Balancing Constraints
Page35 © Hortonworks Inc. 2014
Balancing Storm and Hive Preferences
• Storm wants to hold on to minimal state
• Smaller writes make data visible to readers sooner
• Hive prefers larger writes
• Write throughput is better with larger writes; however, compaction can help consolidate smaller writes
Page36 © Hortonworks Inc. 2014
Questions?
Speaker Notes

INSERT implementation
• INSERT INTO TABLE T VALUES (1, 2, 3) – OK (assuming T has three int columns)
• Implemented as: dump the values to a temp table, then rewrite the insert as INSERT INTO T SELECT * FROM temp_table;
UPDATE implementation
• Implemented by rewriting the query to: INSERT INTO T SELECT ROW__ID, a, 'fred', c FROM T WHERE name = 'freddy' (assuming T has columns a, b, c)
• ROW__ID is row-identifier information from AcidInputFormat
• FileSinkOperator is informed this is an update rather than a standard insert
DELETE implementation
• Implemented by rewriting the query to: INSERT INTO T SELECT ROW__ID FROM T WHERE name = 'freddy'
• ROW__ID is row-identifier information from AcidInputFormat
• FileSinkOperator is informed this is a delete rather than a standard insert