Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Interactive Query With Apache Hive
Dec 4, 2014
Ajay Singh
Page 2
Agenda
•  HDP 2.2
•  Apache Hive & Stinger Initiative
•  Stinger.Next
•  Putting It Together
•  Q&A
Page 3
HDP 2.2 Generally Available
Hortonworks Data Platform 2.2
YARN: Data Operating System
(Cluster Resource Management)
Batch, interactive & real-time data access engines, all running on YARN:
•  Script: Pig (on Tez)
•  SQL: Hive (on Tez)
•  Java/Scala: Cascading (on Tez)
•  Stream: Storm
•  Search: Solr
•  NoSQL: HBase, Accumulo (via Slider)
•  In-Memory: Spark
•  Others: ISV Engines
HDFS (Hadoop Distributed File System)
Governance & Integration: Falcon (data workflow, lifecycle & governance), plus Sqoop, Flume, Kafka, NFS and WebHDFS for data movement
Operations: Ambari (provision, manage & monitor), Zookeeper, Oozie (scheduling)
Security: authentication, authorization, accounting and data protection across the stack (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger)
Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP: it enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and offers the widest range of deployment options.
Delivered Completely in the OPEN
Page 4
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2: component versions across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014)
Components, grouped as on the slide:
•  Data Management: Hadoop & YARN
•  Data Access: Pig, Hive & HCatalog, Tez, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Slider
•  Governance & Integration: Falcon, Sqoop, Flume, Kafka
•  Security: Knox, Ranger
•  Operations: Ambari, Zookeeper, Oozie
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
Page 5
Complete List of New Features in HDP 2.2
Apache Hadoop YARN
•  Slide existing services onto YARN through ‘Slider’
•  GA release of HBase, Accumulo, and Storm on
YARN
•  Support long running services: handling of logs,
containers not killed when AM dies, secure token
renewal, YARN Labels for tagging nodes for specific
workloads
•  Support for CPU Scheduling and CPU Resource
Isolation through CGroups
Apache Hadoop HDFS
•  Heterogeneous storage: Support for archival
•  Rolling Upgrade (This is an item that applies to the
entire HDP Stack. YARN, Hive, HBase, everything.
We now support comprehensive Rolling Upgrade
across the HDP Stack).
•  Multi-NIC Support
•  Heterogeneous storage: Support memory as a
storage tier (TP)
•  HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez
•  Hive Cost Based Optimizer: Function Pushdown &
Join re-ordering support for other join types: star &
bushy.
•  Hive SQL Enhancements including:
•  ACID Support: Insert, Update, Delete
•  Temporary Tables
•  Metadata-only queries return instantly
•  Pig on Tez
•  Including DataFu for use with Pig
•  Vectorized shuffle
•  Tez Debug Tooling & UI
Hue
•  Support for HiveServer 2
•  Support for Resource Manager HA
Apache Spark
•  Refreshed Tech Preview to Spark 1.1.0 (available
now)
•  ORC File support & Hive 0.13 integration
•  Planned for GA of Spark 1.2.0
•  Operations integration via YARN ATS and Ambari
•  Security: Authentication
Apache Solr
•  Added Banana, a rich and flexible UI for visualizing
time series data indexed in Solr
Cascading
•  Cascading 3.0 on Tez distributed with HDP: coming soon
Apache Falcon
•  Authentication Integration
•  Lineage – now GA. (it’s been a tech preview
feature…)
•  Improve UI for pipeline management & editing: list,
detail, and create new (from existing elements)
•  Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
•  Sqoop import support for Hive types via HCatalog
•  Secure Windows cluster support: Sqoop, Flume,
Oozie
•  Flume streaming support: sink to HCat on secure
cluster
•  Oozie HA now supports secure clusters
•  Oozie Rolling Upgrade
•  Operational improvements for Oozie to better support
Falcon
•  Capture workflow job logs in HDFS
•  Don’t start new workflows for re-run
•  Allow job property updates on running jobs
Apache HBase, Apache Phoenix, & Apache
Accumulo
•  HBase & Accumulo on YARN via Slider
•  HBase HA
•  Replicas update in real-time
•  Fully supports region split/merge
•  Scan API now supports standby RegionServers
•  HBase Block cache compression
•  HBase optimizations for low latency
•  Phoenix Robust Secondary Indexes
•  Performance enhancements for bulk import into
Phoenix
•  Hive over HBase Snapshots
•  Hive Connector to Accumulo
•  HBase & Accumulo wire-level encryption
•  Accumulo multi-datacenter replication
Apache Storm
•  Storm-on-YARN via Slider
•  Ingest & notification for JMS (IBM MQ not supported)
•  Kafka bolt for Storm – supports sophisticated
chaining of topologies through Kafka
•  Kerberos support
•  Hive update support – Streaming Ingest
•  Connector improvements for HBase and HDFS
•  Deliver Kafka as a companion component
•  Kafka install, start/stop via Ambari
•  Security Authorization Integration with Ranger
Apache Slider
•  Allow on-demand create and run different versions of
heterogeneous applications
•  Allow users to configure different application
instances differently
•  Manage operational lifecycle of application instances
•  Expand / shrink application instances
•  Provide application registry for publish and discovery
Apache Knox & Apache Ranger (Argus) & HDP
Security
•  Apache Ranger – Support authorization and auditing
for Storm and Knox
•  Introducing REST APIs for managing policies in
Apache Ranger
•  Apache Ranger – Support native grant/revoke
permissions in Hive and HBase
•  Apache Ranger – Support Oracle DB and storing of
audit logs in HDFS
•  Apache Ranger to run on Windows environment
•  Apache Knox to protect YARN RM
•  Apache Knox support for HDFS HA
•  Apache Ambari install, start/stop of Knox
Apache Ambari
•  Support for HDP 2.2 Stack, including support for
Kafka, Knox and Slider
•  Enhancements to Ambari Web configuration
management including: versioning, history and
revert, setting final properties and downloading client
configurations
•  Launch and monitor HDFS rebalance
•  Perform Capacity Scheduler queue refresh
•  Configure High Availability for ResourceManager
•  Ambari Administration framework for managing user
and group access to Ambari
•  Ambari Views development framework for
customizing the Ambari Web user experience
•  Ambari Stacks for extending Ambari to bring custom
Services under Ambari management
•  Ambari Blueprints for automating cluster
deployments
•  Performance improvements and enterprise usability
guardrails
Page 6
Just How Many New Features are in HDP 2.2?
88 new features: an astonishing amount of innovation
in the OPEN Apache Community
HDP is Apache Hadoop
Page 7
Apache Hive & Stinger Initiative
Page 8
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment and web data; sensor and machine data; geolocation.
Hive (SQL) serves all of them, for:
•  Interactive Analytics
•  Batch Reports / Deep Analytics
•  ETL / ELT
Page 9
Hive Scales To Any Workload
•  The original developers of Hive.
•  More data than existing RDBMS could handle.
•  100+ PB of data under management.
•  15+ TB of data loaded daily.
•  60,000+ Hive queries per day.
•  More than 1,000 users per day.
Page 10
Hive Join Strategies
Shuffle Join
•  Approach: join keys are shuffled using map/reduce and joins are performed reduce side.
•  Pros: works regardless of data size or layout.
•  Cons: most resource-intensive and slowest join type.
Broadcast Join
•  Approach: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins.
•  Pros: very fast, single scan through the largest table.
•  Cons: all but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join
•  Approach: mappers take advantage of co-location of keys to do efficient joins.
•  Pros: very fast for tables of any size.
•  Cons: data must be bucketed ahead of time.
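The broadcast and sort-merge-bucket strategies above are usually enabled through configuration rather than rewritten queries. A hedged sketch (table names, bucket count and size threshold are illustrative, not from the deck):

```sql
-- Let the optimizer convert shuffle joins to broadcast (map) joins
-- when the small side fits under the size threshold.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=128000000;  -- bytes, illustrative

-- Sort-Merge-Bucket joins require both tables bucketed and sorted
-- on the join key with compatible bucket counts.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Example bucketed layout (hypothetical table):
CREATE TABLE orders (id INT, item_id INT, price DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;
```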
  
Page 11
Stinger Initiative
• Stinger Initiative – DELIVERED
Next-generation SQL-based interactive query in Hadoop
Speed: Hive query performance improved by 100x to allow for interactive query times (seconds)
Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop
(Delivered in HDP 2.1)
An Open Community at its finest: Apache Hive contribution over 13 months:
•  1,672 JIRA tickets closed
•  145 developers
•  44 companies
•  360,000 lines of code added… (2.5x)
Architecture: custom apps and business analytics issue SQL to Apache Hive, which runs on Apache Tez (alongside Apache MapReduce) on Apache YARN over HDFS (Hadoop Distributed File System).
Hive 10: 100's to 1000's of seconds. Hive 13: seconds.
Dramatically faster queries speed time to insight.
Page 12
Stinger Initiative - Key Innovations
File Format (ORCFile) + Execution Engine (Tez) + Query Planner (CBO) = 100x+
Page 13
Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
• Who else is involved?
– Hortonworks, Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
Page 14
Comparing: Hive/MR vs. Hive/Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Hive – MR: the query runs as a chain of map/reduce jobs (JOIN(a, b), then JOIN(a, c), then GROUP BY with COUNT(*) and AVG(c.price)), and each job writes its intermediate result to HDFS for the next job to read.
Hive – Tez: the same query runs as a single DAG; intermediate results flow directly between tasks, so Tez avoids unneeded writes to HDFS.
Page 15
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
– Uses type-specific encoders
– Stores statistics (min, max, sum, count)
• Has light-weight index
– Skip over blocks of rows that don’t matter
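The storage layout above can be exercised with plain DDL; a minimal sketch (table name, columns and compression choice are illustrative, not from the deck):

```sql
-- ORC stores columns separately, with per-stripe statistics
-- (min, max, sum, count) and a lightweight index used to skip rows.
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING,
  ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");  -- type-aware encoding plus compression

-- A filter such as WHERE ts >= '2014-12-01' can then skip whole
-- blocks of rows whose min/max statistics rule them out.
```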
Page 16
ORCFile – Columnar Storage for Hive
Large block size ideal for
map/reduce.
Columnar format enables
high compression and high
performance.
Page 17
Query Planner – Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) uses statistics within Hive tables to
produce optimal query plans
Why cost-based optimization?
•  Ease of Use – Join Reordering
•  Reduces the need for specialists to tune queries.
•  More efficient query plans lead to better cluster utilization.
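The planner described above is driven by configuration in Hive 0.13/0.14; a hedged sketch of the usual settings (these enable the CBO and let it read the statistics it needs):

```sql
SET hive.cbo.enable=true;                   -- turn on the cost-based optimizer
SET hive.compute.query.using.stats=true;    -- answer metadata-only queries from stats
SET hive.stats.fetch.column.stats=true;     -- let the planner read column stats
SET hive.stats.fetch.partition.stats=true;  -- and per-partition table stats
```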
Page 18
Statistics: Foundations for CBO
Kinds of statistics
Table statistics – collected on load, per partition:
•  Uncompressed size
•  Number of rows
•  Number of files
Column statistics – required by CBO:
•  NDV (Number of Distinct Values)
•  Nulls, Min, Max
Usability: how the data gets statistics
ANALYZE TABLE command:
•  Analyze the entire table
•  Run the command per partition
•  Run it for some partitions and the compiler will extrapolate statistics
Collecting statistics on load:
•  Table stats can be collected on INSERT via Hive by setting hive.stats.autogather=true
•  Not with LOAD DATA
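Both collection paths can be sketched as follows (table and partition names are illustrative):

```sql
-- Path 1: stats gathered automatically on insert.
SET hive.stats.autogather=true;
INSERT INTO TABLE sales PARTITION (day='2014-12-04')
SELECT * FROM sales_staging;

-- Path 2: stats gathered explicitly (required after LOAD DATA,
-- which bypasses autogather).
ANALYZE TABLE sales PARTITION (day='2014-12-04') COMPUTE STATISTICS;
ANALYZE TABLE sales PARTITION (day='2014-12-04') COMPUTE STATISTICS FOR COLUMNS;
```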
Page 19
A Journey to SQL Compliance
Evolution of SQL Compliance in Hive (features color-coded on the slide by the release that added them: Hive 10 or earlier, Hive 11, Hive 12, Hive 13)
SQL Datatypes:
•  INT/TINYINT/SMALLINT/BIGINT
•  FLOAT/DOUBLE
•  BOOLEAN
•  ARRAY, MAP, STRUCT, UNION
•  STRING
•  BINARY
•  TIMESTAMP
•  DECIMAL
•  DATE
•  VARCHAR
•  CHAR
SQL Semantics:
•  SELECT, INSERT
•  GROUP BY, ORDER BY, HAVING
•  JOIN on explicit join key
•  Inner, outer, cross and semi joins
•  Sub-queries in the FROM clause
•  ROLLUP and CUBE
•  UNION
•  Standard aggregations (sum, avg, etc.)
•  Custom Java UDFs
•  Windowing functions (OVER, RANK, etc.)
•  Advanced UDFs (ngram, XPath, URL)
•  JOINs in the WHERE clause
•  Sub-queries for IN/NOT IN, HAVING
(Delivered in HDP 2.1)
Page 20
Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the
end of the beginning.
-Winston Churchill
Hive 0.13
Page 21
Stinger.Next
Page 22
Stinger.Next: Delivery Themes
Hive 0.14: Streaming Ingest and Transactions
•  Transactions with ACID allowing insert, update and delete
•  Streaming Ingest
•  Cost Based Optimizer optimizes star and bushy join queries
1st Half 2015: Sub-Second
•  Sub-second queries with LLAP
•  Hive-Spark Machine Learning integration
•  Operational reporting with Hive
2nd Half 2015: Richer Analytics
•  Toward SQL:2011 Analytics
•  Materialized Views
•  Cross-Geo Queries
•  Workload Management via YARN and LLAP integration
Page 23
Transaction Use Cases
Reporting with Analytics (YES)
•  Reporting on data with occasional updates
•  Corrections to the fact tables, evolving dimension tables
•  Low-concurrency updates, low TPS
Operational Reporting (YES)
•  High-throughput ingest from an operational (OLTP) database, replicated into Hive
•  Periodic inserts every 5-30 minutes
•  Requires tool support and changes in our Transactions
Operational (OLTP) Database (NO)
•  Small transactions, each doing single-line inserts
•  High concurrency: hundreds to thousands of connections
Page 24
Deep Dive: Transactions
Transaction support in Hive with ACID semantics
•  Hive native support for INSERT, UPDATE, DELETE.
•  Split into phases:
•  Phase 1: Hive Streaming Ingest (append) [Done]
•  Phase 2: INSERT / UPDATE / DELETE support [Done]
•  Phase 3: BEGIN / COMMIT / ROLLBACK transactions [Next]
How edits are stored and merged:
1. Original file: tasks read the latest read-optimized ORCFile.
2. Edits made: tasks read the read-optimized ORCFile and merge in the delta file containing the edits.
3. Edits merged: the Hive ACID compactor periodically merges the delta files in the background, after which tasks read the updated read-optimized ORCFile.
Page 25
Transactions - Requirements
•  The table must be declared with the transactional table property
•  The table must be stored in ORC format
•  The table must be bucketed
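Given a table that meets these requirements (such as the test table defined later in this deck), row-level DML looks like ordinary SQL. A hedged sketch (partition values are illustrative):

```sql
INSERT INTO test PARTITION (year='2014', month='12', day='04')
VALUES (1, 'a'), (2, 'b');

-- Updates and deletes are written as delta files,
-- merged later by the compactor.
UPDATE test SET val = 'c' WHERE id = 1;

DELETE FROM test WHERE id = 2;
```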
Page 26
Putting It Together
Page 27
Step 1 - Turn On Transactions
Hive Configuration
§  hive.support.concurrency=true
§  hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
§  hive.compactor.initiator.on=true
§  hive.compactor.worker.threads=2
§  hive.enforce.bucketing=true
§  hive.exec.dynamic.partition.mode=nonstrict
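With DbTxnManager and the compactor enabled, the transaction state lives in the metastore and can be inspected from the Hive CLI; a quick sanity check (these commands exist in Hive 0.13+):

```sql
SHOW TRANSACTIONS;  -- open/aborted transactions tracked by DbTxnManager
SHOW COMPACTIONS;   -- compactor activity from the initiator/worker threads
SHOW LOCKS;         -- locks now come from the metastore-backed lock manager
```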
Page 28
Step 2 – Enable Concurrency By Defining Queues
YARN Configuration
§  yarn.scheduler.capacity.root.default.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4
§  yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50
§  yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4
§  yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2
§  yarn.scheduler.capacity.root.queues=default,hiveserver
(Cluster capacity is split 50/50 between the default queue and the hiveserver queue, whose hive1 and hive2 sub-queues each take 50%.)
Page 29
Step 3 – Deliver Capacity Guarantees By Enabling YARN Preemption
YARN Configuration
§  yarn.resourcemanager.scheduler.monitor.enable=true
§  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
§  yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000
§  yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000
§  yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
Page 30
Step 4 – Enable Tez Execution Engine & Tez Sessions (Enable Sessions For Hive Queues)
Hive Configuration
§  hive.execution.engine=tez
§  hive.server2.tez.initialize.default.sessions=true
§  hive.server2.tez.default.queues=hive1,hive2
§  hive.server2.tez.sessions.per.default.queue=1
§  hive.server2.enable.doAs=false
§  hive.vectorized.groupby.maxentries=10240
§  hive.vectorized.groupby.flush.percent=0.1
Page 31
Step 5 - Create Partitioned & Bucketed ORC Tables
CREATE TABLE IF NOT EXISTS test (id INT, val STRING)
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (id) INTO 7 BUCKETS
STORED AS ORC TBLPROPERTIES ("transactional"="true");
Note:
§  Transactions require bucketed tables in ORC format; the tables cannot be sorted.
§  "transactional"="true" must be set in the table properties.
§  Partitioning is recommended for performance but not mandatory:
§  Partition on filter columns with low cardinality
§  For optimal performance stay below 1,000 partitions
§  Cluster on join columns; the number of buckets is contingent on dataset size
Page 32
Step 6 - Loading Data into ORC table
§  Sqoop, Flume & Storm support direct ingestion to ORC tables
§  Have a text file?
§  Load it into a table stored as TEXTFILE
§  Transfer it to the ORC table with a Hive INSERT statement
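The text-file path in SQL form (staging table, delimiter and file path are illustrative):

```sql
CREATE TABLE test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- Rewrite into the transactional ORC table:
INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
SELECT id, val FROM test_staging;
```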
Page 33
Step 7 - Compute Statistics
§  Compute table stats:
analyze table test partition(year,month,day) compute statistics;
§  Compute column stats:
analyze table test partition(year,month,day) compute statistics for columns;
§  Keep stats updated: speed computation by limiting it to partitions that have changed
Note:
§  In Hive 0.14, column stats can be calculated for all partitions in a single statement
§  To limit computation to a specific partition, specify partition keys
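Limiting the computation to one partition, per the note above (partition values are illustrative):

```sql
ANALYZE TABLE test PARTITION (year='2014', month='12', day='04')
COMPUTE STATISTICS FOR COLUMNS;
```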
Page 34
Sample Code – Sqoop Import To ORC Table
Use HCatalog to import directly into an ORC table:

sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons --username root --hcatalog-table persons --hcatalog-storage-stanza "stored as orc" -m 1
Page 35
Sample Code – Flume Configuration For Hive
Streaming Ingest
## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
Page 36
Q&A

 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive commands
Hive commandsHive commands
Hive commands
 
LLAP Nov Meetup
LLAP Nov MeetupLLAP Nov Meetup
LLAP Nov Meetup
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache HiveAdding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
Adding ACID Transactions, Inserts, Updates, and Deletes in Apache Hive
 
Hive case studies
Hive case studiesHive case studies
Hive case studies
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 

Ähnlich wie Hortonworks Technical Workshop: Interactive Query with Apache Hive

Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
DataWorks Summit
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 

Ähnlich wie Hortonworks Technical Workshop: Interactive Query with Apache Hive (20)

Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Hadoop Security Today and Tomorrow
Hadoop Security Today and TomorrowHadoop Security Today and Tomorrow
Hadoop Security Today and Tomorrow
 
Hadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache KnoxHadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache Knox
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
messaging.pptx
messaging.pptxmessaging.pptx
messaging.pptx
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
Past, Present, and Future of Apache Storm
Past, Present, and Future of Apache StormPast, Present, and Future of Apache Storm
Past, Present, and Future of Apache Storm
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 

Mehr von Hortonworks

Mehr von Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Hortonworks Technical Workshop: Interactive Query with Apache Hive

  • 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Interactive Query With Apache Hive Dec 4, 2014 Ajay Singh
  • 2. Page 2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Agenda •  HDP 2.2 •  Apache Hive & Stinger Initiative •  Stinger.Next •  Putting It Together •  Q&A
  • 3. Page 3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP 2.2 Generally Available
[Architecture diagram: Hortonworks Data Platform 2.2. YARN, the Data Operating System (cluster resource management), sits at the center, above HDFS (Hadoop Distributed File System). Data-access engines run on YARN: Pig (script), Hive (SQL), and Cascading (Java/Scala) on Tez; Spark (in-memory); Storm (stream); Solr (search); HBase and Accumulo (NoSQL) via Slider; plus ISV engines. Surrounding pillars: governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS), security (authentication, authorization, accounting, and data protection across HDFS, YARN, Hive, Falcon, Knox, and Ranger), and operations (Ambari, Zookeeper, Oozie). Deployment choice: Linux, Windows, on-premises, cloud.]
YARN is the architectural center of HDP. It enables batch, interactive and real-time workloads, provides comprehensive enterprise capabilities, and supports the widest range of deployment options. Delivered completely in the OPEN.
  • 4. Page 4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.
[Table: Apache component versions shipped in HDP 2.0 (October 2013), HDP 2.1 (April 2014), and HDP 2.2 (October 2014), grouped into Data Management, Data Access, Governance & Integration, Security, and Operations. For example, Hadoop & YARN moves 2.2.0 → 2.4.0 → 2.6.0 and Hive & HCatalog 0.12.0 → 0.13.0 → 0.14.0; other rows cover Pig, HBase, Phoenix, Accumulo, Storm, Spark, Kafka, Solr, Tez, Slider, Sqoop, Flume, Oozie, Falcon, Zookeeper, Knox, Ranger, and Ambari.]
* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.
  • 5. Page 5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Complete List of New Features in HDP 2.2
Apache Hadoop YARN: slide existing services onto YARN through Slider; GA release of HBase, Accumulo, and Storm on YARN; support for long-running services (handling of logs, containers not killed when the AM dies, secure token renewal, YARN labels for tagging nodes for specific workloads); support for CPU scheduling and CPU resource isolation through CGroups.
Apache Hadoop HDFS: heterogeneous storage with support for archival; rolling upgrade (this applies to the entire HDP stack: YARN, Hive, HBase, everything; comprehensive rolling upgrade is now supported across the stack); multi-NIC support; heterogeneous storage with memory as a storage tier (TP); HDFS transparent data encryption (TP).
Apache Hive, Apache Pig, and Apache Tez: Hive cost-based optimizer with function pushdown and join re-ordering support for other join types (star and bushy); Hive SQL enhancements including ACID support (insert, update, delete), temporary tables, and metadata-only queries that return instantly; Pig on Tez, including DataFu for use with Pig; vectorized shuffle; Tez debug tooling and UI.
Hue: support for HiveServer2 and for Resource Manager HA.
Apache Spark: refreshed tech preview to Spark 1.1.0 (available now); ORC file support and Hive 0.13 integration; planned GA of Spark 1.2.0; operations integration via YARN ATS and Ambari; security (authentication).
Apache Solr: added Banana, a rich and flexible UI for visualizing time-series data indexed in Solr.
Cascading: Cascading 3.0 on Tez distributed with HDP, coming soon.
Apache Falcon: authentication integration; lineage, now GA (it had been a tech preview feature); improved UI for pipeline management and editing (list, detail, and create new from existing elements); replication to cloud (Azure and S3).
Apache Sqoop, Apache Flume, and Apache Oozie: Sqoop import support for Hive types via HCatalog; secure Windows cluster support for Sqoop, Flume, and Oozie; Flume streaming support (sink to HCat on a secure cluster); Oozie HA on secure clusters; Oozie rolling upgrade; operational improvements for Oozie to better support Falcon (capture workflow job logs in HDFS, don't start new workflows for re-runs, allow job property updates on running jobs).
Apache HBase, Apache Phoenix, and Apache Accumulo: HBase and Accumulo on YARN via Slider; HBase HA (replicas update in real time, full support for region split/merge, Scan API support for standby RegionServers); HBase block cache compression; HBase optimizations for low latency; Phoenix robust secondary indexes; performance enhancements for bulk import into Phoenix; Hive over HBase snapshots; Hive connector to Accumulo; HBase and Accumulo wire-level encryption; Accumulo multi-datacenter replication.
Apache Storm: Storm-on-YARN via Slider; ingest and notification for JMS (IBM MQ not supported); Kafka bolt for Storm, supporting sophisticated chaining of topologies through Kafka; Kerberos support; Hive update support (streaming ingest); connector improvements for HBase and HDFS; Kafka delivered as a companion component, with install and start/stop via Ambari; security authorization integration with Ranger.
Apache Slider: on-demand creation and running of different versions of heterogeneous applications; per-instance application configuration; management of the operational lifecycle of application instances; expand/shrink of application instances; an application registry for publish and discovery.
Apache Knox, Apache Ranger (Argus), and HDP Security: Ranger support for authorization and auditing for Storm and Knox; REST APIs for managing policies in Ranger; Ranger support for native grant/revoke permissions in Hive and HBase; Ranger support for Oracle DB and for storing audit logs in HDFS; Ranger on Windows environments; Knox protection for the YARN RM; Knox support for HDFS HA; Ambari install and start/stop of Knox.
Apache Ambari: support for the HDP 2.2 stack, including Kafka, Knox, and Slider; enhancements to Ambari web configuration management (versioning, history and revert, setting final properties, downloading client configurations); launch and monitoring of HDFS rebalance; Capacity Scheduler queue refresh; ResourceManager HA configuration; an administration framework for managing user and group access to Ambari; the Ambari Views development framework for customizing the Ambari web user experience; Ambari Stacks for bringing custom services under Ambari management; Ambari Blueprints for automating cluster deployments; performance improvements and enterprise usability guardrails.
  • 6. Page 6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Just How Many New Features are in HDP 2.2?
[The slide repeats the full feature list from slide 5, overlaid with the total count.]
88. An astonishing amount of innovation in the OPEN Apache community. HDP is Apache Hadoop.
  • 7. Page 7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Hive & Stinger Initiative
  • 8. Page 8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, and CRM systems; unstructured documents and emails; clickstream; server logs; sentiment and web data; sensor and machine data; geolocation.
Workloads: interactive analytics, batch reports / deep analytics, and ETL / ELT, all through Hive SQL.
  • 9. Page 9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive Scales To Any Workload
•  The original developers of Hive.
•  More data than existing RDBMS could handle.
•  100+ PB of data under management.
•  15+ TB of data loaded daily.
•  60,000+ Hive queries per day.
•  More than 1,000 users per day.
  • 10. Page 10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hive Join Strategies
Shuffle join: join keys are shuffled using map/reduce and the join is performed reduce-side. Pros: works regardless of data size or layout. Cons: the most resource-intensive and slowest join type.
Broadcast join: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins. Pros: very fast, a single scan through the largest table. Cons: all but one table must be small enough to fit in RAM.
Sort-merge-bucket join: mappers take advantage of co-location of keys to do efficient joins. Pros: very fast for tables of any size. Cons: data must be bucketed ahead of time.
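Which of these strategies Hive picks is driven by session settings and table layout. A minimal HiveQL sketch, with illustrative table names and a threshold value chosen for the example (exact defaults vary by Hive version):

```sql
-- Broadcast (map) join: let Hive auto-convert joins whose small side
-- fits under the configured size threshold into in-memory map joins.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456;  -- ~256 MB, illustrative

-- Sort-merge-bucket join: both sides must be bucketed and sorted
-- on the join key ahead of time.
CREATE TABLE orders_bucketed (id INT, item_id INT, price DOUBLE)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;

-- If neither condition is met, Hive falls back to the shuffle join.
SELECT o.id, c.name
FROM orders_bucketed o JOIN customers c ON (o.id = c.id);
```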
  • 11. Page 11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger Initiative
Stinger Initiative – DELIVERED: next-generation SQL-based interactive query in Hadoop.
Speed: Hive query performance increased by 100x to allow for interactive query times (seconds).
Scale: the only SQL interface to Hadoop designed for queries that scale from TB to PB.
SQL: support the broadest range of SQL semantics for analytic applications running against Hadoop.
An open community at its finest, the Apache Hive contribution: 1,672 JIRA tickets closed, 145 developers, 44 companies, 360,000 lines of code added (2.5x), in 13 months.
Hive 10: 100s to 1000s of seconds. Hive 13: seconds. Dramatically faster queries speed time to insight.
  • 12. Page 12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger Initiative - Key Innovations: ORCFile (file format) + Tez (execution engine) + CBO (query planner) = 100X+
  • 13. Page 13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Tez (“Speed”) • What is it? – A data processing framework as an alternative to MapReduce • Who else is involved? – Hortonworks, Facebook, Twitter, Yahoo, Microsoft • Why does it matter? – Widens the platform for Hadoop use cases – Crucial to improving the performance of low-latency applications – Core to the Stinger initiative – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
  • 14. Page 14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Comparing: Hive/MR vs. Hive/Tez Page 14 Example query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state [Diagram: the Hive/MR plan splits the query into multiple map/reduce jobs, writing intermediate results to HDFS between each job; the Hive/Tez plan executes the same joins and aggregation as a single DAG with no intermediate HDFS writes.] Tez avoids unneeded writes to HDFS
  • 15. Page 15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved ORCFile – Columnar Storage for Hive • Columns stored separately • Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count) • Has light-weight index – Skip over blocks of rows that don’t matter Page 15
  • 16. Page 16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved ORCFile – Columnar Storage for Hive Large block size ideal for map/reduce. Columnar format enables high compression and high performance.
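As a sketch, the storage characteristics above map onto ORC options set at table creation time (the table and column names are hypothetical; the TBLPROPERTIES keys are standard ORC options):

```sql
-- Hypothetical example: an ORC-backed table with explicit storage properties.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "ZLIB",            -- column-wise compression (SNAPPY and NONE also supported)
  "orc.stripe.size" = "268435456",    -- 256 MB stripes: the large block size suited to map/reduce
  "orc.create.index" = "true",        -- build the lightweight index used to skip blocks of rows
  "orc.row.index.stride" = "10000"    -- index entry (min/max statistics) every 10,000 rows
);
```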
  • 17. Page 17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Query Planner – Cost Based Optimizer in Hive The Cost-Based Optimizer (CBO) uses statistics within Hive tables to produce optimal query plans Why cost-based optimization? •  Ease of Use – Join Reordering •  Reduces the need for specialists to tune queries. •  More efficient query plans lead to better cluster utilization. Page 17
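A minimal sketch of turning the optimizer on for a session (these properties exist in Hive 0.13/0.14; defaults vary by release):

```sql
-- Enable cost-based optimization and let the planner consume table/column statistics.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;    -- answer metadata-only queries (count, min, max) from stats
SET hive.stats.fetch.column.stats=true;     -- feed column statistics (NDV, nulls) to the planner
SET hive.stats.fetch.partition.stats=true;  -- feed per-partition statistics to the planner
```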
  • 18. Page 18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Statistics: Foundations for CBO Kinds of statistics: Table statistics, collected on load per partition (uncompressed size, number of rows, number of files). Column statistics, required by the CBO (NDV (number of distinct values), nulls, min, max). Usability: how does the data get statistics? The ANALYZE TABLE command: analyze an entire table, run the command per partition, or run it for some partitions and the compiler will extrapolate statistics. Collecting statistics on load: table stats can be collected if you insert via Hive with set hive.stats.autogather=true, but not with LOAD DATA.
  • 19. Page 19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved A Journey to SQL Compliance. Evolution of SQL compliance in Hive. SQL datatypes: INT/TINYINT/SMALLINT/BIGINT, FLOAT/DOUBLE, BOOLEAN, ARRAY, MAP, STRUCT, UNION, STRING, BINARY, TIMESTAMP, DECIMAL, DATE, VARCHAR, CHAR. SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; standard aggregations (sum, avg, etc.); custom Java UDFs; windowing functions (OVER, RANK, etc.); advanced UDFs (ngram, XPath, URL); JOINs in WHERE clause; sub-queries for IN/NOT IN, HAVING. Legend: Hive 10 or earlier, Hive 11, Hive 12, Hive 13.
  • 20. Page 20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning. -Winston Churchill Hive 0.13
  • 21. Page 21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger.Next
  • 22. Page 22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Stinger.Next: Delivery Themes. Hive 0.14: transactions with ACID allowing insert, update and delete; streaming ingest; cost-based optimizer optimizes star and bushy join queries. Sub-Second (1st half 2015): sub-second queries with LLAP; Hive-Spark machine learning integration; operational reporting with Hive streaming ingest and transactions. Richer Analytics (2nd half 2015): toward SQL:2011 analytics; materialized views; cross-geo queries; workload management via YARN and LLAP integration.
  • 23. Page 23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Transaction Use Cases. Reporting with Analytics (YES): reporting on data with occasional updates; corrections to the fact tables, evolving dimension tables; low-concurrency updates, low TPS. Operational Reporting (YES): high-throughput ingest from an operational (OLTP) database; periodic inserts every 5-30 minutes; requires tool support and changes in our transactions. Operational (OLTP) Database (NO): small transactions, each doing single-line inserts; high concurrency - hundreds to thousands of connections.
  • 24. Page 24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Deep Dive: Transactions. Transaction support in Hive with ACID semantics: Hive native support for INSERT, UPDATE, DELETE. Split into phases: Phase 1: Hive streaming ingest (append) [Done]; Phase 2: INSERT / UPDATE / DELETE support [Done]; Phase 3: BEGIN / COMMIT / ROLLBACK txn [Next]. [Diagram: 1. Original file - the task reads the latest read-optimized ORCFile. 2. Edits made - the task reads the ORCFile and merges in the delta file containing the edits. 3. Edits merged - the task reads the updated read-optimized ORCFile.] The Hive ACID compactor periodically merges the delta files in the background.
  • 25. Page 25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Transactions - Requirements: The table must be declared with the transactional property. The table must be in ORC format. Tables must be bucketed. Page 25
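Given those requirements, the new DML verbs in Hive 0.14 can be sketched as follows (the `accounts` table is hypothetical; it is assumed to have been created as a bucketed ORC table with the transactional property set):

```sql
-- Sketch: ACID DML in Hive 0.14 against a transactional table (names hypothetical).
-- Assumes: CREATE TABLE accounts (id INT, name STRING, balance DOUBLE)
--          CLUSTERED BY (id) INTO 4 BUCKETS STORED AS ORC
--          TBLPROPERTIES ("transactional"="true");

INSERT INTO TABLE accounts VALUES (1, 'alice', 100.0);   -- row-level insert, new in 0.14

UPDATE accounts SET balance = 150.0 WHERE id = 1;        -- edits land in a delta file
DELETE FROM accounts WHERE id = 1;                       -- also a delta; the compactor merges later
```

Note that UPDATE cannot modify the bucketing column (`id` here); updates to other columns are rewritten as delta files merged at read time, as described above.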
  • 26. Page 26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Putting It Together
  • 27. Page 27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 1 - Turn On Transactions Hive Configuration §  hive.support.concurrency=true §  hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager §  hive.compactor.initiator.on=true §  hive.compactor.worker.threads=2 §  hive.enforce.bucketing=true §  hive.exec.dynamic.partition.mode=nonstrict Page 27
  • 28. Page 28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 2 – Enable Concurrency By Defining Queues YARN Configuration §  yarn.scheduler.capacity.root.default.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4 §  yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50 §  yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4 §  yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2 §  yarn.scheduler.capacity.root.queues=default,hiveserver [Diagram: cluster capacity split between the default queue and the hiveserver queue, with hive1 and hive2 sub-queues.]
  • 29. Page 29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 3 – Deliver Capacity Guarantees By Enabling YARN Preemption YARN Configuration §  yarn.resourcemanager.scheduler.monitor.enable=true §  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy §  yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000 §  yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000 §  yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
  • 30. Page 30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Enable Sessions For Hive Queues Step 4 – Enable Tez Execution Engine & Tez Sessions Hive Configuration §  hive.execution.engine=tez §  hive.server2.tez.initialize.default.sessions=true §  hive.server2.tez.default.queues=hive1,hive2 §  hive.server2.tez.sessions.per.default.queue=1 §  hive.server2.enable.doAs=false §  hive.vectorized.groupby.maxentries=10240 §  hive.vectorized.groupby.flush.percent=0.1
  • 31. Page 31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 5 - Create Partitioned & Bucketed ORC Tables create table if not exists test (id int, val string) partitioned by (year string, month string, day string) clustered by (id) into 7 buckets stored as orc TBLPROPERTIES ("transactional"="true"); Note: §  Transactions require bucketed tables in ORC format. Tables cannot be sorted. §  transactional=true must be set in the table properties §  For performance, table partitioning is recommended but not mandatory §  Partition on filter columns with low cardinality §  For optimal performance stay below 1000 partitions §  Cluster on join columns §  Number of buckets is contingent on dataset size
  • 32. Page 32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 6 - Loading Data into ORC table §  Sqoop, Flume & Storm support direct ingestion to ORC tables §  Have a text file? §  Load it into a table stored as textfile §  Transfer to the ORC table using a Hive insert statement
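The text-file path above can be sketched end to end (the staging table name and file path are hypothetical; the `test` table is the partitioned ORC table created in Step 5):

```sql
-- Sketch: staging a delimited text file, then inserting into the ORC table.
CREATE TABLE IF NOT EXISTS test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA moves the file as-is; no statistics are gathered by this step.
LOAD DATA INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- Insert into the partitioned, transactional ORC table from Step 5.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE test PARTITION (year, month, day)
SELECT id, val, '2014', '12', '04' FROM test_staging;
```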
  • 33. Page 33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Step 7 - Compute Statistics §  Compute table stats: analyze table test partition(year,month,day) compute statistics; §  Compute column stats: analyze table test partition(year,month,day) compute statistics for columns; §  Keep stats updated §  Speed computation by limiting it to partitions that have changed Note: §  In Hive 0.14, column stats can be calculated for all partitions in a single statement §  To limit computation to a specific partition, specify the partition keys
  • 34. Page 34 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Sample Code – Sqoop Import To ORC Table sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons --username root --hcatalog-table persons --hcatalog-storage-stanza "stored as orc" -m 1 Use HCatalog to import to an ORC table
  • 35. Page 35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Sample Code – Flume Configuration For Hive Streaming Ingest
## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
  • 36. Page 36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Q&A