Hortonworks Technical Workshop: Interactive Query with Apache Hive 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Interactive Query With Apache Hive
Dec 4, 2014
Ajay Singh
2. Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Agenda
• HDP 2.2
• Apache Hive & Stinger Initiative
• Stinger.Next
• Putting It Together
• Q&A
3. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP 2.2 Generally Available
[Slide diagram: Hortonworks Data Platform 2.2 — batch, interactive and real-time data access engines (Script: Pig on Tez; SQL: Hive on Tez; Java/Scala: Cascading on Tez; In-Memory: Spark; Stream: Storm; Search: Solr; NoSQL: HBase and Accumulo via Slider; plus ISV engines) run on YARN, the data operating system for cluster resource management, over HDFS. Around the engines: data workflow, lifecycle & governance (Falcon, Oozie, Sqoop, Flume, Kafka, NFS, WebHDFS), security (authentication, authorization, accounting and data protection across storage, resources, access, pipeline and cluster via Knox and Ranger), and operations (provision, manage & monitor with Ambari and Zookeeper). Deployment choice: Linux or Windows, on-premises or cloud.]
YARN is the architectural center of HDP:
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
• Delivered completely in the OPEN
4. Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
[Slide table: Apache component versions shipped across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014) — Hadoop & YARN, Pig, Hive & HCatalog, Tez, HBase, Phoenix, Accumulo, Storm, Spark, Solr, Kafka, Sqoop, Flume, Oozie, Falcon, Zookeeper, Knox, Ranger, Slider and Ambari — grouped into Data Management, Data Access, Governance & Integration, Security and Operations.]
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
5. Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Complete List of New Features in HDP 2.2
Apache Hadoop YARN
• Slide existing services onto YARN through ‘Slider’
• GA release of HBase, Accumulo, and Storm on
YARN
• Support long running services: handling of logs,
containers not killed when AM dies, secure token
renewal, YARN Labels for tagging nodes for specific
workloads
• Support for CPU Scheduling and CPU Resource
Isolation through CGroups
Apache Hadoop HDFS
• Heterogeneous storage: Support for archival
• Rolling Upgrade (This is an item that applies to the
entire HDP Stack. YARN, Hive, HBase, everything.
We now support comprehensive Rolling Upgrade
across the HDP Stack).
• Multi-NIC Support
• Heterogeneous storage: Support memory as a
storage tier (TP)
• HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez
• Hive Cost Based Optimizer: Function Pushdown &
Join re-ordering support for other join types: star &
bushy.
• Hive SQL Enhancements including:
• ACID Support: Insert, Update, Delete
• Temporary Tables
• Metadata-only queries return instantly
• Pig on Tez
• Including DataFu for use with Pig
• Vectorized shuffle
• Tez Debug Tooling & UI
Hue
• Support for HiveServer 2
• Support for Resource Manager HA
Apache Spark
• Refreshed Tech Preview to Spark 1.1.0 (available
now)
• ORC File support & Hive 0.13 integration
• Planned for GA of Spark 1.2.0
• Operations integration via YARN ATS and Ambari
• Security: Authentication
Apache Solr
• Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr
Cascading
• Cascading 3.0 on Tez distributed with HDP — coming soon
Apache Falcon
• Authentication Integration
• Lineage – now GA (previously a tech preview feature)
• Improve UI for pipeline management & editing: list,
detail, and create new (from existing elements)
• Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
• Sqoop import support for Hive types via HCatalog
• Secure Windows cluster support: Sqoop, Flume,
Oozie
• Flume streaming support: sink to HCat on secure
cluster
• Oozie HA now supports secure clusters
• Oozie Rolling Upgrade
• Operational improvements for Oozie to better support
Falcon
• Capture workflow job logs in HDFS
• Don’t start new workflows for re-run
• Allow job property updates on running jobs
Apache HBase, Apache Phoenix, & Apache
Accumulo
• HBase & Accumulo on YARN via Slider
• HBase HA
• Replicas update in real-time
• Fully supports region split/merge
• Scan API now supports standby RegionServers
• HBase Block cache compression
• HBase optimizations for low latency
• Phoenix Robust Secondary Indexes
• Performance enhancements for bulk import into
Phoenix
• Hive over HBase Snapshots
• Hive Connector to Accumulo
• HBase & Accumulo wire-level encryption
• Accumulo multi-datacenter replication
Apache Storm
• Storm-on-YARN via Slider
• Ingest & notification for JMS (IBM MQ not supported)
• Kafka bolt for Storm – supports sophisticated
chaining of topologies through Kafka
• Kerberos support
• Hive update support – Streaming Ingest
• Connector improvements for HBase and HDFS
• Deliver Kafka as a companion component
• Kafka install, start/stop via Ambari
• Security Authorization Integration with Ranger
Apache Slider
• Allow on-demand create and run different versions of
heterogeneous applications
• Allow users to configure different application
instances differently
• Manage operational lifecycle of application instances
• Expand / shrink application instances
• Provide application registry for publish and discovery
Apache Knox & Apache Ranger (Argus) & HDP
Security
• Apache Ranger – Support authorization and auditing
for Storm and Knox
• Introducing REST APIs for managing policies in
Apache Ranger
• Apache Ranger – Support native grant/revoke
permissions in Hive and HBase
• Apache Ranger – Support Oracle DB and storing of
audit logs in HDFS
• Apache Ranger to run on Windows environment
• Apache Knox to protect YARN RM
• Apache Knox support for HDFS HA
• Apache Ambari install, start/stop of Knox
Apache Ambari
• Support for HDP 2.2 Stack, including support for
Kafka, Knox and Slider
• Enhancements to Ambari Web configuration
management including: versioning, history and
revert, setting final properties and downloading client
configurations
• Launch and monitor HDFS rebalance
• Perform Capacity Scheduler queue refresh
• Configure High Availability for ResourceManager
• Ambari Administration framework for managing user
and group access to Ambari
• Ambari Views development framework for
customizing the Ambari Web user experience
• Ambari Stacks for extending Ambari to bring custom
Services under Ambari management
• Ambari Blueprints for automating cluster
deployments
• Performance improvements and enterprise usability
guardrails
6. Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Just How Many New Features are in HDP 2.2?
88
Astonishing amount of innovation in the OPEN Apache Community
HDP is Apache Hadoop
7. Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Hive & Stinger Initiative
8. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation
Workloads: Interactive Analytics; Batch Reports / Deep Analytics; ETL / ELT
Hive – SQL
9. Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive Scales To Any Workload
Page 9
" The original developers of Hive.
" More data than existing RDBMS could handle.
" 100+ PB of data under management.
" 15+ TB of data loaded daily.
" 60,000+ Hive queries per day.
" More than 1,000 users per day.
10. Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive Join Strategies
Page 10
Type | Approach | Pros | Cons
Shuffle Join | Join keys are shuffled using map/reduce and joins performed reduce side. | Works regardless of data size or layout. | Most resource-intensive and slowest join type.
Broadcast Join | Small tables are loaded into memory in all nodes; the mapper scans through the large table and joins. | Very fast, single scan through largest table. | All but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join | Mappers take advantage of co-location of keys to do efficient joins. | Very fast for tables of any size. | Data must be bucketed ahead of time.
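As a rough sketch (table and column names are illustrative; the settings are from Hive 0.13-era releases), the planner can be steered toward the broadcast (map) join and sort-merge-bucket join strategies described above:

```sql
-- Let Hive automatically convert shuffle joins into broadcast (map) joins
-- when the smaller table fits under the size threshold, in bytes.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Sort-merge-bucket joins require both tables to be bucketed and sorted
-- on the join key with compatible bucket counts.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- With the settings above, the small dimension table is broadcast
-- to every node and joined in the mappers.
SELECT s.state, SUM(s.amount)
FROM sales s
JOIN states st ON (s.state_id = st.state_id)
GROUP BY s.state;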
11. Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Stinger Initiative
• Stinger Initiative – DELIVERED
Next generation SQL-based interactive query in Hadoop
Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
An Open Community at its finest: Apache Hive Contribution
1,672 Jira Tickets Closed
145 Developers
44 Companies
360,000 Lines of Code Added (2.5x)
[Slide diagram: SQL, business analytics and custom apps run on Apache Hive over Apache Tez and Apache MapReduce, on Apache YARN across nodes 1…N, over HDFS (Hadoop Distributed File System).]
13 Months: Hive 10 — 100s to 1000s of seconds; Hive 13 — seconds
Dramatically faster queries speed time to insight
12. Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Stinger Initiative - Key Innovations
File Format (ORCFile) + Execution Engine (Tez) + Query Planner (CBO) = 100X+
13. Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
• Who else is involved?
– Hortonworks, Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
14. Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hive – MR Hive – Tez
Comparing: Hive/MR vs. Hive/Tez
Page 14
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Slide diagram: on Hive/MR the query runs as a chain of map/reduce jobs — JOIN (a, b), then JOIN with c, then the GROUP BY with COUNT(*) and AVG(c.price) — with each job writing intermediate results to HDFS. On Hive/Tez the same plan runs as a single DAG that streams data between stages with no intermediate HDFS writes.]
Tez avoids unneeded
writes to HDFS
15. Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
– Uses type-specific encoders
– Stores statistics (min, max, sum, count)
• Has light-weight index
– Skip over blocks of rows that don’t matter
Page 15
16. Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
ORCFile – Columnar Storage for Hive
Large block size ideal for
map/reduce.
Columnar format enables
high compression and high
performance.
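A minimal sketch of creating an ORC-backed table (table and column names are illustrative; the compression property is from the Hive 0.13-era ORC writer):

```sql
-- Store the table in ORC format with ZLIB compression.
-- The ORC writer records per-column statistics (min, max, sum, count)
-- and a lightweight index used to skip row groups at read time.
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING,
  views   INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

-- Populate from an existing table; statistics are gathered as data is written.
INSERT INTO TABLE page_views_orc
SELECT user_id, url, views FROM page_views;
```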
17. Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Query Planner – Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) uses statistics within Hive tables to
produce optimal query plans
Why cost-based optimization?
• Ease of Use – Join Reordering
• Reduces the need for specialists to tune queries.
• More efficient query plans lead to better cluster utilization.
Page 17
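A sketch of switching the optimizer on and feeding it the statistics it needs (setting names from Hive 0.13/0.14; the table name is illustrative):

```sql
SET hive.cbo.enable=true;                 -- turn on the cost-based optimizer
SET hive.compute.query.using.stats=true;  -- answer metadata-only queries from stats
SET hive.stats.fetch.column.stats=true;   -- let the planner read column statistics

-- CBO can only reorder joins for tables that actually have statistics:
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```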
18. Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Statistics: Foundations for CBO
Kinds of statistics
Table Statistics – Collected on load per
partition
• Uncompressed size
• Number of rows
• Number of files
Column Statistics – Required by CBO
• NDV (Number of Distinct Values)
• Nulls, Min, Max
Usability – how does the data get statistics?
Analyze Table Command
• Analyze entire table
• Run this command per partition
• Run for some partitions and the compiler will extrapolate
statistics
Collecting statistics on load
• Table stats can be collected automatically on INSERT via Hive with set hive.stats.autogather=true
• Not with LOAD DATA — files loaded directly bypass statistics collection
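As a sketch (table names are illustrative), statistics can be gathered automatically during inserts and inspected afterwards:

```sql
SET hive.stats.autogather=true;  -- collect table stats during INSERT (not LOAD DATA)

INSERT INTO TABLE sales
SELECT * FROM sales_staging;

-- Inspect the collected statistics (numRows, rawDataSize, etc.)
DESCRIBE FORMATTED sales;
```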
19. Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
A Journey to SQL Compliance
Evolution of SQL Compliance in Hive

SQL Datatypes | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE | GROUP BY, ORDER BY, HAVING
BOOLEAN | JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION | Inner, outer, cross and semi joins
STRING | Sub-queries in the FROM clause
BINARY | ROLLUP and CUBE
TIMESTAMP | UNION
DECIMAL | Standard aggregations (sum, avg, etc.)
DATE | Custom Java UDFs
VARCHAR | Windowing functions (OVER, RANK, etc.)
CHAR | Advanced UDFs (ngram, XPath, URL)
 | JOINs in WHERE clause
 | Sub-queries for IN/NOT IN, HAVING

Legend: features are shaded by release on the original slide — Hive 10 or earlier, Hive 11, Hive 12, Hive 13
20. Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the
end of the beginning.
-Winston Churchill
Hive 0.13
21. Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Stinger.Next
22. Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Stinger.Next: Delivery Themes
Hive 0.14
• Transactions with ACID allowing insert, update and delete
• Streaming Ingest
• Cost Based Optimizer optimizes star and bushy join queries
Sub-Second (1st Half 2015)
• Sub-second queries with LLAP
• Hive-Spark Machine Learning integration
• Operational reporting with Hive Streaming Ingest and Transactions
Richer Analytics (2nd Half 2015)
• Toward SQL:2011 Analytics
• Materialized Views
• Cross-Geo Queries
• Workload Management via YARN and LLAP integration
23. Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Transaction Use Cases
Reporting with Analytics (YES)
• Reporting on data with occasional updates
• Corrections to the fact tables, evolving dimension tables
• Low concurrency updates, low TPS
Operational Reporting (YES)
• High throughput ingest from operational (OLTP) database
• Periodic inserts every 5-30 minutes
• Requires tool support and changes in Hive Transactions
Operational (OLTP) Database (NO)
• Small transactions, each doing single-line inserts
• High concurrency – hundreds to thousands of connections
[Slide diagram: an OLTP database replicates into Hive, which serves analytics and modifications; direct high-concurrency OLTP stays outside Hive.]
24. Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Deep Dive: Transactions
Transaction Support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE.
• Split into phases:
• Phase 1: Hive Streaming Ingest (append) [Done]
• Phase 2: INSERT / UPDATE / DELETE Support [Done]
• Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
[Slide diagram:
1. Original File – a task reads the latest read-optimized ORCFile.
2. Edits Made – a task reads the ORCFile and merges the delta file containing the edits.
3. Edits Merged – a task reads the updated read-optimized ORCFile.
The Hive ACID Compactor periodically merges the delta files in the background.]
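A minimal sketch of the Phase 2 DML on an ACID table (the table matches the one created in Step 5 later in this deck; values are illustrative, and the Step 1 transaction settings must already be in place):

```sql
-- Requires a bucketed ORC table created with
-- TBLPROPERTIES ("transactional"="true"); INSERT ... VALUES is new in Hive 0.14.
INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
VALUES (1, 'alpha'), (2, 'beta');

-- Updates and deletes land in delta files,
-- which the ACID Compactor merges in the background.
UPDATE test SET val = 'gamma' WHERE id = 1;
DELETE FROM test WHERE id = 2;
```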
25. Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Transactions - Requirements
The table must be declared with the transactional table property
The table must be stored in ORC format
The table must be bucketed
Page 25
26. Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Putting It Together
27. Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 1 - Turn On Transactions
Hive Configuration
§ hive.support.concurrency=true
§ hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
§ hive.compactor.initiator.on=true
§ hive.compactor.worker.threads=2
§ hive.enforce.bucketing=true
§ hive.exec.dynamic.partition.mode=nonstrict
Page 27
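For experimentation, the client-side properties above can also be set per-session (a sketch; cluster-wide values normally go in hive-site.xml):

```sql
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- The compactor settings (hive.compactor.initiator.on,
-- hive.compactor.worker.threads) take effect on the metastore service
-- and belong in hive-site.xml rather than a client session.
```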
28. Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 2 – Enable Concurrency By Defining Queues
YARN Configuration
§ yarn.scheduler.capacity.root.default.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4
§ yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4
§ yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2
§ yarn.scheduler.capacity.root.queues=default,hiveserver
Default
Hive1
Hive2
Cluster Capacity
29. Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 3 – Deliver Capacity Guarantees by Enabling YARN Preemption
YARN Configuration
§ yarn.resourcemanager.scheduler.monitor.enable=true
§ yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourceman
ager.monitor.capacity.ProportionalCapacityPreemptionPolicy
§ yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000
§ yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000
§ yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
30. Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 4 – Enable Tez Execution Engine & Tez Sessions
Enable Sessions For Hive Queues
Hive Configuration
§ hive.execution.engine=tez
§ hive.server2.tez.initialize.default.sessions=true
§ hive.server2.tez.default.queues=hive1,hive2
§ hive.server2.tez.sessions.per.default.queue=1
§ hive.server2.enable.doAs=false
§ hive.vectorized.groupby.maxentries=10240
§ hive.vectorized.groupby.flush.percent=0.1
31. Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 5 - Create Partitioned & Bucketed ORC Tables
CREATE TABLE IF NOT EXISTS test (id int, val string)
PARTITIONED BY (year string, month string, day string)
CLUSTERED BY (id) INTO 7 BUCKETS
STORED AS ORC TBLPROPERTIES ("transactional"="true");
Note:
§ Transactions require bucketed tables in ORC format. Tables cannot be sorted.
§ transactional=true must be set in the table properties
§ For performance, table partitioning is recommended but not mandatory
§ Partition on filter columns with low cardinality
§ For optimal performance stay below 1000 partitions
§ Cluster on join columns
§ Number of buckets is contingent on dataset size
32. Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 6 - Loading Data into ORC table
§ SQOOP, FLUME & STORM support direct ingestion to ORC Tables
§ Have a Text File ?
§ Load to a Table stored as textfile
§ Transfer to ORC Table using Hive insert statement
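The text-file path above can be sketched as follows (file path and names are illustrative; the table is the one from Step 5):

```sql
-- 1. Stage the raw file in a table stored as textfile.
CREATE TABLE test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- 2. Transfer into the partitioned, bucketed ORC table with a Hive INSERT.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE test PARTITION (year, month, day)
SELECT id, val, '2014', '12', '04' FROM test_staging;
```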
33. Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Step 7 - Compute Statistics
§ Compute Table Stats
analyze table test partition(year,month,day) compute
statistics;
§ Compute Column Stats
analyze table test partition(year,month,day) compute
statistics for columns;
§ Keep Stats Updated
§ Speed computation by limiting it to partitions that have
changed
Note:
§ In Hive 0.14, column stats can be calculated for all partitions in a single statement
§ To limit computation to a specific partition, specify partition keys
34. Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Sample Code – Sqoop Import To ORC Table
Use HCatalog to import to an ORC table:
sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons \
  --username root --hcatalog-table persons \
  --hcatalog-storage-stanza "stored as orc" -m 1
35. Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Sample Code – Flume Configuration For Hive
Streaming Ingest
## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
36. Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Q&A