Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Interactive Query With Apache Hive
Dec 4, 2014
Ajay Singh
Page 2
Agenda
•  HDP 2.2
•  Apache Hive & Stinger Initiative
•  Stinger.Next
•  Putting It Together
•  Q&A
Page 3
HDP 2.2 Generally Available
Hortonworks Data Platform 2.2
YARN: Data Operating System
(Cluster Resource Management)
Batch, interactive & real-time data access engines, all running on YARN:
•  Script: Pig (on Tez)
•  SQL: Hive (on Tez)
•  Java/Scala: Cascading (on Tez)
•  Stream: Storm
•  Search: Solr
•  NoSQL: HBase, Accumulo (via Slider)
•  In-Memory: Spark
•  Others: ISV Engines
HDFS (Hadoop Distributed File System)
Governance & Integration: Falcon (data workflow, lifecycle & governance), plus Sqoop, Flume, Kafka, NFS and WebHDFS for data movement
Operations: Ambari (provision, manage & monitor), Zookeeper, Oozie (scheduling)
Security: authentication, authorization, accounting and data protection across the stack (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger)
Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP: it enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and offers the widest range of deployment options.
Delivered Completely in the OPEN
Page 4
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2: component versions across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014)
Components, grouped as on the slide:
•  Data Management: Hadoop & YARN
•  Data Access: Pig, Hive & HCatalog, Tez, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Slider
•  Governance & Integration: Falcon, Sqoop, Flume, Kafka
•  Security: Knox, Ranger
•  Operations: Ambari, Zookeeper, Oozie
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
Page 5
Complete List of New Features in HDP 2.2
Apache Hadoop YARN
•  Slide existing services onto YARN through ‘Slider’
•  GA release of HBase, Accumulo, and Storm on
YARN
•  Support long running services: handling of logs,
containers not killed when AM dies, secure token
renewal, YARN Labels for tagging nodes for specific
workloads
•  Support for CPU Scheduling and CPU Resource
Isolation through CGroups
Apache Hadoop HDFS
•  Heterogeneous storage: Support for archival
•  Rolling Upgrade (This is an item that applies to the
entire HDP Stack. YARN, Hive, HBase, everything.
We now support comprehensive Rolling Upgrade
across the HDP Stack).
•  Multi-NIC Support
•  Heterogeneous storage: Support memory as a
storage tier (TP)
•  HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez
•  Hive Cost Based Optimizer: Function Pushdown &
Join re-ordering support for other join types: star &
bushy.
•  Hive SQL Enhancements including:
•  ACID Support: Insert, Update, Delete
•  Temporary Tables
•  Metadata-only queries return instantly
•  Pig on Tez
•  Including DataFu for use with Pig
•  Vectorized shuffle
•  Tez Debug Tooling & UI
Hue
•  Support for HiveServer 2
•  Support for Resource Manager HA
Apache Spark
•  Refreshed Tech Preview to Spark 1.1.0 (available
now)
•  ORC File support & Hive 0.13 integration
•  Planned for GA of Spark 1.2.0
•  Operations integration via YARN ATS and Ambari
•  Security: Authentication
Apache Solr
•  Added Banana, a rich and flexible UI for visualizing
time series data indexed in Solr
Cascading
•  Cascading 3.0 on Tez distributed with HDP: coming soon
Apache Falcon
•  Authentication Integration
•  Lineage – now GA. (it’s been a tech preview
feature…)
•  Improve UI for pipeline management & editing: list,
detail, and create new (from existing elements)
•  Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
•  Sqoop import support for Hive types via HCatalog
•  Secure Windows cluster support: Sqoop, Flume,
Oozie
•  Flume streaming support: sink to HCat on secure
cluster
•  Oozie HA now supports secure clusters
•  Oozie Rolling Upgrade
•  Operational improvements for Oozie to better support
Falcon
•  Capture workflow job logs in HDFS
•  Don’t start new workflows for re-run
•  Allow job property updates on running jobs
Apache HBase, Apache Phoenix, & Apache
Accumulo
•  HBase & Accumulo on YARN via Slider
•  HBase HA
•  Replicas update in real-time
•  Fully supports region split/merge
•  Scan API now supports standby RegionServers
•  HBase Block cache compression
•  HBase optimizations for low latency
•  Phoenix Robust Secondary Indexes
•  Performance enhancements for bulk import into
Phoenix
•  Hive over HBase Snapshots
•  Hive Connector to Accumulo
•  HBase & Accumulo wire-level encryption
•  Accumulo multi-datacenter replication
Apache Storm
•  Storm-on-YARN via Slider
•  Ingest & notification for JMS (IBM MQ not supported)
•  Kafka bolt for Storm – supports sophisticated
chaining of topologies through Kafka
•  Kerberos support
•  Hive update support – Streaming Ingest
•  Connector improvements for HBase and HDFS
•  Deliver Kafka as a companion component
•  Kafka install, start/stop via Ambari
•  Security Authorization Integration with Ranger
Apache Slider
•  Allow on-demand create and run different versions of
heterogeneous applications
•  Allow users to configure different application
instances differently
•  Manage operational lifecycle of application instances
•  Expand / shrink application instances
•  Provide application registry for publish and discovery
Apache Knox & Apache Ranger (Argus) & HDP
Security
•  Apache Ranger – Support authorization and auditing
for Storm and Knox
•  Introducing REST APIs for managing policies in
Apache Ranger
•  Apache Ranger – Support native grant/revoke
permissions in Hive and HBase
•  Apache Ranger – Support Oracle DB and storing of
audit logs in HDFS
•  Apache Ranger to run on Windows environment
•  Apache Knox to protect YARN RM
•  Apache Knox support for HDFS HA
•  Apache Ambari install, start/stop of Knox
Apache Ambari
•  Support for HDP 2.2 Stack, including support for
Kafka, Knox and Slider
•  Enhancements to Ambari Web configuration
management including: versioning, history and
revert, setting final properties and downloading client
configurations
•  Launch and monitor HDFS rebalance
•  Perform Capacity Scheduler queue refresh
•  Configure High Availability for ResourceManager
•  Ambari Administration framework for managing user
and group access to Ambari
•  Ambari Views development framework for
customizing the Ambari Web user experience
•  Ambari Stacks for extending Ambari to bring custom
Services under Ambari management
•  Ambari Blueprints for automating cluster
deployments
•  Performance improvements and enterprise usability
guardrails
Page 6
Just How Many New Features are in HDP 2.2?
88 new features: an astonishing amount of innovation
in the OPEN Apache Community
HDP is Apache Hadoop
Page 7
Apache Hive & Stinger Initiative
Page 8
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment and web data; sensor and machine data; geolocation.
Hive (SQL) serves all of them, for:
•  Interactive Analytics
•  Batch Reports / Deep Analytics
•  ETL / ELT
Page 9
Hive Scales To Any Workload
•  The original developers of Hive.
•  More data than existing RDBMS could handle.
•  100+ PB of data under management.
•  15+ TB of data loaded daily.
•  60,000+ Hive queries per day.
•  More than 1,000 users per day.
Page 10
Hive Join Strategies
Shuffle Join
•  Approach: join keys are shuffled using map/reduce and joins are performed reduce side.
•  Pros: works regardless of data size or layout.
•  Cons: most resource-intensive and slowest join type.
Broadcast Join
•  Approach: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins.
•  Pros: very fast, single scan through the largest table.
•  Cons: all but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join
•  Approach: mappers take advantage of co-location of keys to do efficient joins.
•  Pros: very fast for tables of any size.
•  Cons: data must be bucketed ahead of time.
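The broadcast and sort-merge-bucket strategies above are usually enabled through configuration rather than rewritten queries. A hedged sketch (table names, bucket count and size threshold are illustrative, not from the deck):

```sql
-- Let the optimizer convert shuffle joins to broadcast (map) joins
-- when the small side fits under the size threshold.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=128000000;  -- bytes, illustrative

-- Sort-Merge-Bucket joins require both tables bucketed and sorted
-- on the join key with compatible bucket counts.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Example bucketed layout (hypothetical table):
CREATE TABLE orders (id INT, item_id INT, price DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;
```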
  
Page 11
Stinger Initiative
• Stinger Initiative – DELIVERED
Next-generation SQL-based interactive query in Hadoop
Speed: Hive query performance improved by 100x to allow for interactive query times (seconds)
Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop
(Delivered in HDP 2.1)
An Open Community at its finest: Apache Hive contribution over 13 months:
•  1,672 JIRA tickets closed
•  145 developers
•  44 companies
•  360,000 lines of code added… (2.5x)
Architecture: custom apps and business analytics issue SQL to Apache Hive, which runs on Apache Tez (alongside Apache MapReduce) on Apache YARN over HDFS (Hadoop Distributed File System).
Hive 10: 100's to 1000's of seconds. Hive 13: seconds.
Dramatically faster queries speed time to insight.
Page 12
Stinger Initiative - Key Innovations
File Format (ORCFile) + Execution Engine (Tez) + Query Planner (CBO) = 100x+
Page 13
Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
• Who else is involved?
– Hortonworks, Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
Page 14
Comparing: Hive/MR vs. Hive/Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Hive – MR: the query runs as a chain of map/reduce jobs (JOIN(a, b), then JOIN(a, c), then GROUP BY with COUNT(*) and AVG(c.price)), and each job writes its intermediate result to HDFS for the next job to read.
Hive – Tez: the same query runs as a single DAG; intermediate results flow directly between tasks, so Tez avoids unneeded writes to HDFS.
Page 15
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
– Uses type-specific encoders
– Stores statistics (min, max, sum, count)
• Has light-weight index
– Skip over blocks of rows that don’t matter
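The storage layout above can be exercised with plain DDL; a minimal sketch (table name, columns and compression choice are illustrative, not from the deck):

```sql
-- ORC stores columns separately, with per-stripe statistics
-- (min, max, sum, count) and a lightweight index used to skip rows.
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING,
  ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");  -- type-aware encoding plus compression

-- A filter such as WHERE ts >= '2014-12-01' can then skip whole
-- blocks of rows whose min/max statistics rule them out.
```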
Page 16
ORCFile – Columnar Storage for Hive
Large block size ideal for
map/reduce.
Columnar format enables
high compression and high
performance.
Page 17
Query Planner – Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) uses statistics within Hive tables to
produce optimal query plans
Why cost-based optimization?
•  Ease of Use – Join Reordering
•  Reduces the need for specialists to tune queries.
•  More efficient query plans lead to better cluster utilization.
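The planner described above is driven by configuration in Hive 0.13/0.14; a hedged sketch of the usual settings (these enable the CBO and let it read the statistics it needs):

```sql
SET hive.cbo.enable=true;                   -- turn on the cost-based optimizer
SET hive.compute.query.using.stats=true;    -- answer metadata-only queries from stats
SET hive.stats.fetch.column.stats=true;     -- let the planner read column stats
SET hive.stats.fetch.partition.stats=true;  -- and per-partition table stats
```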
Page 18
Statistics: Foundations for CBO
Kinds of statistics
Table statistics – collected on load, per partition:
•  Uncompressed size
•  Number of rows
•  Number of files
Column statistics – required by CBO:
•  NDV (Number of Distinct Values)
•  Nulls, Min, Max
Usability: how the data gets statistics
ANALYZE TABLE command:
•  Analyze the entire table
•  Run the command per partition
•  Run it for some partitions and the compiler will extrapolate statistics
Collecting statistics on load:
•  Table stats can be collected on INSERT via Hive by setting hive.stats.autogather=true
•  Not with LOAD DATA
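Both collection paths can be sketched as follows (table and partition names are illustrative):

```sql
-- Path 1: stats gathered automatically on insert.
SET hive.stats.autogather=true;
INSERT INTO TABLE sales PARTITION (day='2014-12-04')
SELECT * FROM sales_staging;

-- Path 2: stats gathered explicitly (required after LOAD DATA,
-- which bypasses autogather).
ANALYZE TABLE sales PARTITION (day='2014-12-04') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (day='2014-12-04') COMPUTE STATISTICS FOR COLUMNS;
```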
Page 19
A Journey to SQL Compliance
Evolution of SQL Compliance in Hive (features color-coded on the slide by the release that added them: Hive 10 or earlier, Hive 11, Hive 12, Hive 13)
SQL Datatypes:
•  INT/TINYINT/SMALLINT/BIGINT
•  FLOAT/DOUBLE
•  BOOLEAN
•  ARRAY, MAP, STRUCT, UNION
•  STRING
•  BINARY
•  TIMESTAMP
•  DECIMAL
•  DATE
•  VARCHAR
•  CHAR
SQL Semantics:
•  SELECT, INSERT
•  GROUP BY, ORDER BY, HAVING
•  JOIN on explicit join key
•  Inner, outer, cross and semi joins
•  Sub-queries in the FROM clause
•  ROLLUP and CUBE
•  UNION
•  Standard aggregations (sum, avg, etc.)
•  Custom Java UDFs
•  Windowing functions (OVER, RANK, etc.)
•  Advanced UDFs (ngram, XPath, URL)
•  JOINs in the WHERE clause
•  Sub-queries for IN/NOT IN, HAVING
(Delivered in HDP 2.1)
Page 20
Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the
end of the beginning.
-Winston Churchill
Hive 0.13
Page 21
Stinger.Next
Page 22
Stinger.Next: Delivery Themes
Hive 0.14: Streaming Ingest and Transactions
•  Transactions with ACID allowing insert, update and delete
•  Streaming Ingest
•  Cost Based Optimizer optimizes star and bushy join queries
1st Half 2015: Sub-Second
•  Sub-second queries with LLAP
•  Hive-Spark Machine Learning integration
•  Operational reporting with Hive
2nd Half 2015: Richer Analytics
•  Toward SQL:2011 Analytics
•  Materialized Views
•  Cross-Geo Queries
•  Workload Management via YARN and LLAP integration
Page 23
Transaction Use Cases
Reporting with Analytics (YES)
•  Reporting on data with occasional updates
•  Corrections to the fact tables, evolving dimension tables
•  Low-concurrency updates, low TPS
Operational Reporting (YES)
•  High-throughput ingest from an operational (OLTP) database, replicated into Hive
•  Periodic inserts every 5-30 minutes
•  Requires tool support and changes in our Transactions
Operational (OLTP) Database (NO)
•  Small transactions, each doing single-line inserts
•  High concurrency: hundreds to thousands of connections
Page 24
Deep Dive: Transactions
Transaction support in Hive with ACID semantics
•  Hive native support for INSERT, UPDATE, DELETE.
•  Split into phases:
•  Phase 1: Hive Streaming Ingest (append) [Done]
•  Phase 2: INSERT / UPDATE / DELETE support [Done]
•  Phase 3: BEGIN / COMMIT / ROLLBACK transactions [Next]
How edits are stored and merged:
1. Original file: tasks read the latest read-optimized ORCFile.
2. Edits made: tasks read the read-optimized ORCFile and merge in the delta file containing the edits.
3. Edits merged: the Hive ACID compactor periodically merges the delta files in the background, after which tasks read the updated read-optimized ORCFile.
Page 25
Transactions - Requirements
•  The table must be declared with the transactional table property
•  The table must be stored in ORC format
•  The table must be bucketed
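Given a table that meets these requirements (such as the test table defined later in this deck), row-level DML looks like ordinary SQL. A hedged sketch (partition values are illustrative):

```sql
INSERT INTO test PARTITION (year='2014', month='12', day='04')
VALUES (1, 'a'), (2, 'b');

-- Updates and deletes are written as delta files,
-- merged later by the compactor.
UPDATE test SET val = 'c' WHERE id = 1;

DELETE FROM test WHERE id = 2;
```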
Page 26
Putting It Together
Page 27
Step 1 - Turn On Transactions
Hive Configuration
§  hive.support.concurrency=true
§  hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
§  hive.compactor.initiator.on=true
§  hive.compactor.worker.threads=2
§  hive.enforce.bucketing=true
§  hive.exec.dynamic.partition.mode=nonstrict
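With DbTxnManager and the compactor enabled, the transaction state lives in the metastore and can be inspected from the Hive CLI; a quick sanity check (these commands exist in Hive 0.13+):

```sql
SHOW TRANSACTIONS;  -- open/aborted transactions tracked by DbTxnManager
SHOW COMPACTIONS;   -- compactor activity from the initiator/worker threads
SHOW LOCKS;         -- locks now come from the metastore-backed lock manager
```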
Page 28
Step 2 – Enable Concurrency By Defining Queues
YARN Configuration
§  yarn.scheduler.capacity.root.default.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4
§  yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4
§  yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2
§  yarn.scheduler.capacity.root.queues=default,hiveserver
(Cluster capacity is split 50/50 between the default queue and the hiveserver queue, whose hive1 and hive2 sub-queues each take 50%.)
Page 29
Step 3 – Deliver Capacity Guarantees By Enabling YARN Preemption
YARN Configuration
§  yarn.resourcemanager.scheduler.monitor.enable=true
§  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
§  yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000
§  yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000
§  yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
Page 30
Step 4 – Enable Tez Execution Engine & Tez Sessions (Enable Sessions For Hive Queues)
Hive Configuration
§  hive.execution.engine=tez
§  hive.server2.tez.initialize.default.sessions=true
§  hive.server2.tez.default.queues=hive1,hive2
§  hive.server2.tez.sessions.per.default.queue=1
§  hive.server2.enable.doAs=false
§  hive.vectorized.groupby.maxentries=10240
§  hive.vectorized.groupby.flush.percent=0.1
Page 31
Step 5 - Create Partitioned & Bucketed ORC Tables
CREATE TABLE IF NOT EXISTS test (id INT, val STRING)
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (id) INTO 7 BUCKETS
STORED AS ORC TBLPROPERTIES ("transactional"="true");
Note:
§  Transactions require bucketed tables in ORC format; the tables cannot be sorted.
§  "transactional"="true" must be set in the table properties.
§  Partitioning is recommended for performance but not mandatory:
§  Partition on filter columns with low cardinality
§  For optimal performance stay below 1,000 partitions
§  Cluster on join columns; the number of buckets is contingent on dataset size
Page 32
Step 6 - Loading Data into ORC table
§  Sqoop, Flume & Storm support direct ingestion to ORC tables
§  Have a text file?
§  Load it into a table stored as TEXTFILE
§  Transfer it to the ORC table with a Hive INSERT statement
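The text-file path in SQL form (staging table, delimiter and file path are illustrative):

```sql
CREATE TABLE test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- Rewrite into the transactional ORC table:
INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
SELECT id, val FROM test_staging;
```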
Page 33
Step 7 - Compute Statistics
§  Compute table stats:
analyze table test partition(year,month,day) compute statistics;
§  Compute column stats:
analyze table test partition(year,month,day) compute statistics for columns;
§  Keep stats updated: speed computation by limiting it to partitions that have changed
Note:
§  In Hive 0.14, column stats can be calculated for all partitions in a single statement
§  To limit computation to a specific partition, specify partition keys
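Limiting the computation to one partition, per the note above (partition values are illustrative):

```sql
ANALYZE TABLE test PARTITION (year='2014', month='12', day='04')
COMPUTE STATISTICS FOR COLUMNS;
```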
Page 34
Sample Code – Sqoop Import To ORC Table
Use HCatalog to import directly into an ORC table:

sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons --username root --hcatalog-table persons --hcatalog-storage-stanza "stored as orc" -m 1
Page 35
Sample Code – Flume Configuration For Hive
Streaming Ingest
## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
Page 36
Q&A

 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive commands
Hive commandsHive commands
Hive commands
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetup
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive case studies
Hive case studiesHive case studies
Hive case studies
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 

Ähnlich wie Hortonworks Technical Workshop: Interactive Query with Apache Hive

Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
DataWorks Summit
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 

Ähnlich wie Hortonworks Technical Workshop: Interactive Query with Apache Hive (20)

Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
 
Hadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache KnoxHadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache Knox
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
messaging.pptx
messaging.pptxmessaging.pptx
messaging.pptx
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
Past, Present, and Future of Apache Storm
Past, Present, and Future of Apache StormPast, Present, and Future of Apache Storm
Past, Present, and Future of Apache Storm
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 

Mehr von Hortonworks

Mehr von Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Hortonworks Technical Workshop: Interactive Query with Apache Hive

  • 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Interactive Query With Apache Hive Dec 4, 2014 Ajay Singh
  • 2. Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Agenda •  HDP 2.2 •  Apache Hive & Stinger Initiative •  Stinger.Next •  Putting It Together •  Q&A
  • 3. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP 2.2 Generally Available
[Architecture diagram: Hortonworks Data Platform 2.2. YARN, the Data Operating System (cluster resource management), sits at the center, above HDFS (Hadoop Distributed File System). Data-access engines run on YARN: Pig (script), Hive (SQL), and Cascading (Java/Scala) on Tez; Spark (in-memory); Storm (stream); Solr (search); HBase and Accumulo (NoSQL) via Slider; plus ISV engines. Surrounding pillars: governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), security (authentication, authorization, accounting, and data protection across HDFS, YARN, Hive, Falcon, Knox, and Ranger), and operations (Ambari, Zookeeper, Oozie). Deployment choice: Linux, Windows, on-premises, cloud.]
YARN is the architectural center of HDP. It enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and supports the widest range of deployment options. Delivered completely in the OPEN.
  • 4. Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.
[Table: Apache component versions shipped in HDP 2.0 (October 2013), HDP 2.1 (April 2014), and HDP 2.2 (October 2014), grouped into Data Management, Data Access, Governance & Integration, Security, and Operations. For example, Hadoop & YARN moves 2.2.0 → 2.4.0 → 2.6.0 and Hive & HCatalog 0.12.0 → 0.13.0 → 0.14.0; other rows cover Pig, HBase, Phoenix, Accumulo, Storm, Spark, Kafka, Solr, Tez, Slider, Sqoop, Flume, Oozie, Falcon, Zookeeper, Knox, Ranger, and Ambari.]
* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.
  • 5. Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Complete List of New Features in HDP 2.2
Apache Hadoop YARN: slide existing services onto YARN through Slider; GA release of HBase, Accumulo, and Storm on YARN; support for long-running services (handling of logs, containers not killed when the AM dies, secure token renewal, YARN labels for tagging nodes for specific workloads); support for CPU scheduling and CPU resource isolation through CGroups.
Apache Hadoop HDFS: heterogeneous storage with support for archival; rolling upgrade (this applies to the entire HDP stack: YARN, Hive, HBase, everything; comprehensive rolling upgrade is now supported across the stack); multi-NIC support; heterogeneous storage with memory as a storage tier (TP); HDFS transparent data encryption (TP).
Apache Hive, Apache Pig, and Apache Tez: Hive cost-based optimizer with function pushdown and join re-ordering support for other join types (star and bushy); Hive SQL enhancements including ACID support (insert, update, delete), temporary tables, and metadata-only queries that return instantly; Pig on Tez, including DataFu for use with Pig; vectorized shuffle; Tez debug tooling and UI.
Hue: support for HiveServer2 and for Resource Manager HA.
Apache Spark: refreshed tech preview to Spark 1.1.0 (available now); ORC file support and Hive 0.13 integration; planned GA of Spark 1.2.0; operations integration via YARN ATS and Ambari; security (authentication).
Apache Solr: added Banana, a rich and flexible UI for visualizing time-series data indexed in Solr.
Cascading: Cascading 3.0 on Tez distributed with HDP, coming soon.
Apache Falcon: authentication integration; lineage, now GA (it had been a tech preview feature); improved UI for pipeline management and editing (list, detail, and create new from existing elements); replication to cloud (Azure and S3).
Apache Sqoop, Apache Flume, and Apache Oozie: Sqoop import support for Hive types via HCatalog; secure Windows cluster support for Sqoop, Flume, and Oozie; Flume streaming support (sink to HCat on a secure cluster); Oozie HA on secure clusters; Oozie rolling upgrade; operational improvements for Oozie to better support Falcon (capture workflow job logs in HDFS, don't start new workflows for re-runs, allow job property updates on running jobs).
Apache HBase, Apache Phoenix, and Apache Accumulo: HBase and Accumulo on YARN via Slider; HBase HA (replicas update in real time, full support for region split/merge, Scan API support for standby RegionServers); HBase block cache compression; HBase optimizations for low latency; Phoenix robust secondary indexes; performance enhancements for bulk import into Phoenix; Hive over HBase snapshots; Hive connector to Accumulo; HBase and Accumulo wire-level encryption; Accumulo multi-datacenter replication.
Apache Storm: Storm-on-YARN via Slider; ingest and notification for JMS (IBM MQ not supported); Kafka bolt for Storm, supporting sophisticated chaining of topologies through Kafka; Kerberos support; Hive update support (streaming ingest); connector improvements for HBase and HDFS; Kafka delivered as a companion component, with install and start/stop via Ambari; security authorization integration with Ranger.
Apache Slider: on-demand creation and running of different versions of heterogeneous applications; per-instance application configuration; management of the operational lifecycle of application instances; expand/shrink of application instances; an application registry for publish and discovery.
Apache Knox, Apache Ranger (Argus), and HDP Security: Ranger support for authorization and auditing for Storm and Knox; REST APIs for managing policies in Ranger; Ranger support for native grant/revoke permissions in Hive and HBase; Ranger support for Oracle DB and for storing audit logs in HDFS; Ranger on Windows environments; Knox protection for the YARN RM; Knox support for HDFS HA; Ambari install and start/stop of Knox.
Apache Ambari: support for the HDP 2.2 stack, including Kafka, Knox, and Slider; enhancements to Ambari web configuration management (versioning, history and revert, setting final properties, downloading client configurations); launch and monitoring of HDFS rebalance; Capacity Scheduler queue refresh; ResourceManager HA configuration; an administration framework for managing user and group access to Ambari; the Ambari Views development framework for customizing the Ambari web user experience; Ambari Stacks for bringing custom services under Ambari management; Ambari Blueprints for automating cluster deployments; performance improvements and enterprise usability guardrails.
  • 6. Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Just How Many New Features are in HDP 2.2?
[The slide repeats the full feature list from slide 5, overlaid with the total count.]
88. An astonishing amount of innovation in the OPEN Apache community. HDP is Apache Hadoop.
  • 7. Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Hive & Stinger Initiative
  • 8. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, and CRM systems; unstructured documents and emails; clickstream; server logs; sentiment and web data; sensor and machine data; geolocation.
Workloads: interactive analytics, batch reports / deep analytics, and ETL / ELT, all through Hive SQL.
  • 9. Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive Scales To Any Workload
•  The original developers of Hive.
•  More data than existing RDBMS could handle.
•  100+ PB of data under management.
•  15+ TB of data loaded daily.
•  60,000+ Hive queries per day.
•  More than 1,000 users per day.
  • 10. Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive Join Strategies
Shuffle join: join keys are shuffled using map/reduce and the join is performed reduce-side. Pros: works regardless of data size or layout. Cons: the most resource-intensive and slowest join type.
Broadcast join: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins. Pros: very fast, a single scan through the largest table. Cons: all but one table must be small enough to fit in RAM.
Sort-merge-bucket join: mappers take advantage of co-location of keys to do efficient joins. Pros: very fast for tables of any size. Cons: data must be bucketed ahead of time.
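Which of these strategies Hive picks is driven by session settings and table layout. A minimal HiveQL sketch, with illustrative table names and a threshold value chosen for the example (exact defaults vary by Hive version):

```sql
-- Broadcast (map) join: let Hive auto-convert joins whose small side
-- fits under the configured size threshold into in-memory map joins.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456;  -- ~256 MB, illustrative

-- Sort-merge-bucket join: both sides must be bucketed and sorted
-- on the join key ahead of time.
CREATE TABLE orders_bucketed (id INT, item_id INT, price DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;

-- If neither condition is met, Hive falls back to the shuffle join.
SELECT o.id, c.name
FROM orders_bucketed o JOIN customers c ON (o.id = c.id);
```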
  • 11. Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger Initiative
Stinger Initiative – DELIVERED: next-generation SQL-based interactive query in Hadoop.
Speed: Hive query performance increased by 100x to allow for interactive query times (seconds).
Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB.
SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop.
An open community at its finest, the Apache Hive contribution: 1,672 JIRA tickets closed, 145 developers, 44 companies, 360,000 lines of code added (2.5x), in 13 months.
Hive 10: 100s to 1000s of seconds. Hive 13: seconds. Dramatically faster queries speed time to insight.
  • 12. Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger Initiative - Key Innovations: ORCFile (file format) + Tez (execution engine) + CBO (query planner) = 100X+
  • 13. Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Tez (“Speed”) • What is it? – A data processing framework as an alternative to MapReduce • Who else is involved? – Hortonworks, Facebook, Twitter, Yahoo, Microsoft • Why does it matter? – Widens the platform for Hadoop use cases – Crucial to improving the performance of low-latency applications – Core to the Stinger initiative – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
  • 14. Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Comparing: Hive/MR vs. Hive/Tez Page 14 Example query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state [Diagram: the Hive/MR plan splits the query into multiple map/reduce jobs, writing intermediate results to HDFS between each job; the Hive/Tez plan executes the same joins and aggregation as a single DAG with no intermediate HDFS writes.] Tez avoids unneeded writes to HDFS
  • 15. Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved ORCFile – Columnar Storage for Hive • Columns stored separately • Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count) • Has light-weight index – Skip over blocks of rows that don’t matter Page 15
  • 16. Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved ORCFile – Columnar Storage for Hive Large block size ideal for map/reduce. Columnar format enables high compression and high performance.
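As a sketch, the storage characteristics above map onto ORC options set at table creation time (the table and column names are hypothetical; the TBLPROPERTIES keys are standard ORC options):

```sql
-- Hypothetical example: an ORC-backed table with explicit storage properties.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "ZLIB",            -- column-wise compression (SNAPPY and NONE also supported)
  "orc.stripe.size" = "268435456",    -- 256 MB stripes: the large block size suited to map/reduce
  "orc.create.index" = "true",        -- build the lightweight index used to skip blocks of rows
  "orc.row.index.stride" = "10000"    -- index entry (min/max statistics) every 10,000 rows
);
```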
  • 17. Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Query Planner – Cost Based Optimizer in Hive The Cost-Based Optimizer (CBO) uses statistics within Hive tables to produce optimal query plans Why cost-based optimization? •  Ease of Use – Join Reordering •  Reduces the need for specialists to tune queries. •  More efficient query plans lead to better cluster utilization. Page 17
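A minimal sketch of turning the optimizer on for a session (these properties exist in Hive 0.13/0.14; defaults vary by release):

```sql
-- Enable cost-based optimization and let the planner consume table/column statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;    -- answer metadata-only queries (count, min, max) from stats
SET hive.stats.fetch.column.stats=true;     -- feed column statistics (NDV, nulls) to the planner
SET hive.stats.fetch.partition.stats=true;  -- feed per-partition statistics to the planner
```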
  • 18. Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Statistics: Foundations for CBO Kinds of statistics: Table statistics, collected on load per partition (uncompressed size, number of rows, number of files). Column statistics, required by the CBO (NDV (number of distinct values), nulls, min, max). Usability: how does the data get statistics? The ANALYZE TABLE command: analyze an entire table, run the command per partition, or run it for some partitions and the compiler will extrapolate statistics. Collecting statistics on load: table stats can be collected if you insert via Hive with set hive.stats.autogather=true, but not with LOAD DATA.
  • 19. Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved A Journey to SQL Compliance. Evolution of SQL compliance in Hive. SQL datatypes: INT/TINYINT/SMALLINT/BIGINT, FLOAT/DOUBLE, BOOLEAN, ARRAY, MAP, STRUCT, UNION, STRING, BINARY, TIMESTAMP, DECIMAL, DATE, VARCHAR, CHAR. SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; standard aggregations (sum, avg, etc.); custom Java UDFs; windowing functions (OVER, RANK, etc.); advanced UDFs (ngram, XPath, URL); JOINs in WHERE clause; sub-queries for IN/NOT IN, HAVING. Legend: Hive 10 or earlier, Hive 11, Hive 12, Hive 13.
  • 20. Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning. -Winston Churchill Hive 0.13
  • 21. Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger.Next
  • 22. Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger.Next: Delivery Themes. Hive 0.14: transactions with ACID allowing insert, update and delete; streaming ingest; cost-based optimizer optimizes star and bushy join queries. Sub-Second (1st half 2015): sub-second queries with LLAP; Hive-Spark machine learning integration; operational reporting with Hive streaming ingest and transactions. Richer Analytics (2nd half 2015): toward SQL:2011 analytics; materialized views; cross-geo queries; workload management via YARN and LLAP integration.
  • 23. Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Transaction Use Cases. Reporting with Analytics (YES): reporting on data with occasional updates; corrections to the fact tables, evolving dimension tables; low-concurrency updates, low TPS. Operational Reporting (YES): high-throughput ingest from an operational (OLTP) database; periodic inserts every 5-30 minutes; requires tool support and changes in our transactions. Operational (OLTP) Database (NO): small transactions, each doing single-line inserts; high concurrency - hundreds to thousands of connections.
  • 24. Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Deep Dive: Transactions. Transaction support in Hive with ACID semantics: Hive native support for INSERT, UPDATE, DELETE. Split into phases: Phase 1: Hive streaming ingest (append) [Done]; Phase 2: INSERT / UPDATE / DELETE support [Done]; Phase 3: BEGIN / COMMIT / ROLLBACK txn [Next]. [Diagram: 1. Original file - the task reads the latest read-optimized ORCFile. 2. Edits made - the task reads the ORCFile and merges in the delta file containing the edits. 3. Edits merged - the task reads the updated read-optimized ORCFile.] The Hive ACID compactor periodically merges the delta files in the background.
  • 25. Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Transactions - Requirements: The table must be declared with the transactional property. The table must be in ORC format. Tables must be bucketed. Page 25
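Given those requirements, the new DML verbs in Hive 0.14 can be sketched as follows (the `accounts` table is hypothetical; it is assumed to have been created as a bucketed ORC table with the transactional property set):

```sql
-- Sketch: ACID DML in Hive 0.14 against a transactional table (names hypothetical).
-- Assumes: CREATE TABLE accounts (id INT, name STRING, balance DOUBLE)
--          CLUSTERED BY (id) INTO 4 BUCKETS STORED AS ORC
--          TBLPROPERTIES ("transactional"="true");

INSERT INTO TABLE accounts VALUES (1, 'alice', 100.0);   -- row-level insert, new in 0.14

UPDATE accounts SET balance = 150.0 WHERE id = 1;        -- edits land in a delta file
DELETE FROM accounts WHERE id = 1;                       -- also a delta; the compactor merges later
```

Note that UPDATE cannot modify the bucketing column (`id` here); updates to other columns are rewritten as delta files merged at read time, as described above.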
  • 26. Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Putting It Together
  • 27. Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 1 - Turn On Transactions Hive Configuration §  hive.support.concurrency=true §  hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager §  hive.compactor.initiator.on=true §  hive.compactor.worker.threads=2 §  hive.enforce.bucketing=true §  hive.exec.dynamic.partition.mode=nonstrict Page 27
  • 28. Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 2 – Enable Concurrency By Defining Queues YARN Configuration §  yarn.scheduler.capacity.root.default.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4 §  yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4 §  yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2 §  yarn.scheduler.capacity.root.queues=default,hiveserver [Diagram: cluster capacity split between the default queue and the hiveserver queue, with hive1 and hive2 sub-queues.]
  • 29. Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 3 – Deliver Capacity Guarantees By Enabling YARN Preemption YARN Configuration §  yarn.resourcemanager.scheduler.monitor.enable=true §  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy §  yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000 §  yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000 §  yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
  • 30. Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Enable Sessions For Hive Queues Step 4 – Enable Tez Execution Engine & Tez Sessions Hive Configuration §  hive.execution.engine=tez §  hive.server2.tez.initialize.default.sessions=true §  hive.server2.tez.default.queues=hive1,hive2 §  hive.server2.tez.sessions.per.default.queue=1 §  hive.server2.enable.doAs=false §  hive.vectorized.groupby.maxentries=10240 §  hive.vectorized.groupby.flush.percent=0.1
  • 31. Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 5 - Create Partitioned & Bucketed ORC Tables create table if not exists test (id int, val string) partitioned by (year string, month string, day string) clustered by (id) into 7 buckets stored as orc TBLPROPERTIES ("transactional"="true"); Note: §  Transactions require bucketed tables in ORC format. Tables cannot be sorted. §  transactional=true must be set in the table properties §  For performance, table partitioning is recommended but not mandatory §  Partition on filter columns with low cardinality §  For optimal performance stay below 1000 partitions §  Cluster on join columns §  Number of buckets is contingent on dataset size
  • 32. Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 6 - Loading Data into ORC table §  Sqoop, Flume & Storm support direct ingestion to ORC tables §  Have a text file? §  Load it into a table stored as textfile §  Transfer to the ORC table using a Hive insert statement
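The text-file path above can be sketched end to end (the staging table name and file path are hypothetical; the `test` table is the partitioned ORC table created in Step 5):

```sql
-- Sketch: staging a delimited text file, then inserting into the ORC table.
CREATE TABLE IF NOT EXISTS test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA moves the file as-is; no statistics are gathered by this step.
LOAD DATA INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- Insert into the partitioned, transactional ORC table from Step 5.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE test PARTITION (year, month, day)
SELECT id, val, '2014', '12', '04' FROM test_staging;
```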
  • 33. Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 7 - Compute Statistics §  Compute table stats: analyze table test partition(year,month,day) compute statistics; §  Compute column stats: analyze table test partition(year,month,day) compute statistics for columns; §  Keep stats updated §  Speed computation by limiting it to partitions that have changed Note: §  In Hive 0.14, column stats can be calculated for all partitions in a single statement §  To limit computation to a specific partition, specify the partition keys
  • 34. Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Sample Code – Sqoop Import To ORC Table sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons --username root --hcatalog-table persons --hcatalog-storage-stanza "stored as orc" -m 1 Use HCatalog to import to an ORC table
  • 35. Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Sample Code – Flume Configuration For Hive Streaming Ingest
## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
  • 36. Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Q&A