Enabling the Active Data Warehouse with Apache Kudu

December 2019
Grant Henke
Software Engineer

© 2019 Cloudera, Inc. All rights reserved.
AGENDA
• What is an Active Data Warehouse?
• Use Cases
• What is Apache Kudu?
• The Active Data Warehouse with Apache Kudu
• Future Plans for Kudu
• Examples & Resources

What is an Active Data Warehouse?

An active data warehouse allows you to continuously
collect, modify, and analyze data from varied sources to
provide meaningful business insights in real-time.

An active data warehouse enables real-time analytics,
dashboarding, and operational use cases while still
supporting traditional ad-hoc bulk analytics and archival
use cases.

In an active data warehouse, not only is the data
continuously ingested and changing, but the schema may
also be changing.

Query and Analyze Massive Amounts of Real-time Data
• Businesses collect ever-growing volumes of time-series data
– IoT devices, sensors, financial transactions, user activity…
• Businesses need to process these signals to make decisions
– Monitor, repair, and replace malfunctioning equipment
– Detect and react to anomalies in user behavior
– Take advantage of opportunities
• Analyzing data even minutes after it arrives is often too late

Use case: Telecommunications
• Data:
– Network and user events
– Sensor and IoT signals
• Results:
– Detect and repair outages
– Prevent and detect fraud
– Preventive maintenance
– On-demand and predictive provisioning
– Reduce downtime and improve utilization
– Up to 50% reduction of data by deduping on ingest

Use case: Utilities
• Data:
– Noise levels (acoustic data) in real time from turbines
– Power station data across plants
– Data from smart meters
• Results:
– Detect anomalies
– Monitor turbine health in real time and predict failures before they happen
– Lower downtime
– Lower maintenance cost

Use case: Financial services
• Data:
– Banking and trading transactions
– Signals from ATM and POS devices
– Mobile and web app telemetry
• Results:
– Detect and prevent fraud
– Analyze trends and react in real time
– Improve customer experience with relevant and timely messaging
– Unlock revenue with relevant customer offers delivered at the right time

An Open Source Data Storage Engine That Makes
Fast Analytics on Fast and Changing Data Easy

Open source & open data standards are especially important
when storing your data.
Apache Kudu is a top-level Apache Software Foundation project released under the
Apache 2 license and values community participation.
We believe that Kudu's long-term success depends on building a vibrant community
of developers and users from diverse organizations and backgrounds.

Allows users to focus on the use case and not the storage details.
Manages the storage of your data including schema, layout, encoding,
compression and compaction to allow for efficient disk usage and minimize IO.
Separates storage management from computation. Though Kudu utilizes
pushdown projections, predicates/filters, and more to optimize data access, it
leverages tools like Impala, Hive, and Spark for complex computation.
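
The pushdown idea above can be illustrated with a toy column store. This is a minimal Python sketch, not Kudu's implementation or client API; the `scan` function, the column names, and the data are all hypothetical. It shows that a scan carrying a projection and a predicate only needs to read the filtered and projected columns, leaving joins and aggregations to an engine such as Impala, Hive, or Spark.

```python
from typing import Any, Callable, Dict, List

# Toy columnar table: each column is stored as its own list (hypothetical data).
TABLE: Dict[str, List[Any]] = {
    "host": ["a", "b", "a", "c"],
    "ts": [100, 101, 102, 103],
    "cpu": [0.50, 0.90, 0.70, 0.20],
    "note": ["ok", "hot", "ok", "ok"],
}


def scan(
    table: Dict[str, List[Any]],
    projection: List[str],
    column: str,
    predicate: Callable[[Any], bool],
) -> List[Dict[str, Any]]:
    """Return projected rows whose `column` value satisfies `predicate`.

    Only `column` and the projected columns are read. Other columns
    (here "ts" and "note") are never touched, which is the benefit of
    pushing projections and predicates down to columnar storage.
    """
    matching = [i for i, value in enumerate(table[column]) if predicate(value)]
    return [{name: table[name][i] for name in projection} for i in matching]


rows = scan(TABLE, projection=["host", "cpu"], column="cpu",
            predicate=lambda value: value > 0.6)
print(rows)
```

In a real deployment the compute engine decides which projections and predicates to push down; the sketch only models the storage side of that contract.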

Provides a combination of fast ingest and efficient columnar scans to enable
multiple real-time analytic workloads across a single storage layer.
Designed to strike a balance between full scan performance and low-latency random
access allowing it to address a wide array of analytical use cases.
Scales up and out to utilize all of the resources given to it across the cluster and on
each node.
Designed for next-generation hardware.

It is important to support a
variety of workloads.

Data is immediately available to be analyzed as soon as it lands in Kudu.
Supports updates and deletes in order to address a wide variety of use cases without
exotic workarounds.
Supports sustained high throughput ingest to capture all of your data,
streaming or batch.

Kudu was built to be simple to deploy, monitor, operate and use.
Uses familiar concepts such as tables, partitions, and insert/update/delete operations to
minimize the expertise required to use it effectively.
A simple data model and mutability make it a breeze to port legacy analytical
applications or build new ones.
Integrates with the big data ecosystem, and integrating it with other data processing
frameworks is simple.

Ecosystem Integration
Flow Process Query Security Cloud


The Active Data Warehouse with Apache Kudu
[Architecture diagram: IoT devices, applications, metrics, and logs & files; CDF; hot storage; cold storage (HDFS / object storage); SQL; real-time analytics, alerting, event-driven applications, and dashboards]

● Data is ingested into Kudu via Spark and NiFi, which support
almost any data source.
● Ingest is often streaming but may also be
scheduled in batches.
● Ingest may contain late arriving data and UPSERT,
UPDATE, and DELETE operations.
● Kudu tables are often time-oriented fact tables or
low volume dimension/lookup tables.
● Kudu tables can be used to enrich the data via NiFi
and Spark during ingest.
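
The ingest semantics listed above can be sketched with a small in-memory model. This is a hedged illustration in Python, not the Kudu client API: `ToyTable`, the `(device_id, event_time)` key, and the `temp` column are hypothetical. It shows why a primary key plus UPSERT lets replayed and late-arriving events land without creating duplicate rows, and how UPDATE and DELETE differ from UPSERT by requiring the row to exist.

```python
from typing import Any, Dict, Tuple

Key = Tuple[str, int]  # hypothetical primary key: (device_id, event_time)


class ToyTable:
    """In-memory stand-in for a primary-keyed table; not the Kudu client API."""

    def __init__(self) -> None:
        self.rows: Dict[Key, Dict[str, Any]] = {}

    def upsert(self, key: Key, values: Dict[str, Any]) -> None:
        # Insert when the key is new, otherwise overwrite the given columns.
        self.rows.setdefault(key, {}).update(values)

    def update(self, key: Key, values: Dict[str, Any]) -> None:
        # Unlike upsert, update requires the row to exist already.
        if key not in self.rows:
            raise KeyError(f"row not found: {key}")
        self.rows[key].update(values)

    def delete(self, key: Key) -> None:
        # Deleting a missing row is reported as an error in this sketch.
        if key not in self.rows:
            raise KeyError(f"row not found: {key}")
        del self.rows[key]


table = ToyTable()
table.upsert(("sensor-1", 100), {"temp": 20.5})
table.upsert(("sensor-1", 100), {"temp": 20.5})  # replayed event: no duplicate row
table.upsert(("sensor-1", 50), {"temp": 19.0})   # late-arriving event: still accepted
table.update(("sensor-1", 50), {"temp": 19.5})   # correction to an existing row
table.delete(("sensor-1", 100))                  # retraction of an earlier event
print(sorted(table.rows.items()))
```

Because the key identifies the event rather than its arrival order, the outcome does not depend on when each record shows up, which is what makes streaming and batch ingest interchangeable in this model.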

● Data is available to query immediately.
● Kudu manages schema, encoding, compression,
replication, and compaction automatically
○ No small files problem on HDFS or S3.
● Kudu’s columnar layout, primary keys, and
partitioning support allow for minimal IO and
blazing fast queries.
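
To make the link between partitioning and minimal IO concrete, here is a small Python sketch of partition pruning. It assumes a table that is hash partitioned on a host column and range partitioned on a timestamp column; the bucket count, range bounds, and function names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the storage engine uses.

```python
import zlib
from typing import Set, Tuple

NUM_HASH_BUCKETS = 4               # hypothetical number of hash buckets on "host"
RANGE_BOUNDS = [0, 100, 200, 300]  # hypothetical ranges on "ts": [0,100), [100,200), [200,300)


def hash_bucket(host: str) -> int:
    # crc32 is a stand-in stable hash, not the engine's real hash function.
    return zlib.crc32(host.encode("utf-8")) % NUM_HASH_BUCKETS


def partitions_to_scan(host: str, ts_start: int, ts_end: int) -> Set[Tuple[int, int]]:
    """Partitions that can hold rows for `host` with ts in [ts_start, ts_end)."""
    bucket = hash_bucket(host)
    overlapping_ranges = {
        i
        for i in range(len(RANGE_BOUNDS) - 1)
        if RANGE_BOUNDS[i] < ts_end and ts_start < RANGE_BOUNDS[i + 1]
    }
    return {(bucket, r) for r in overlapping_ranges}


total = NUM_HASH_BUCKETS * (len(RANGE_BOUNDS) - 1)
selected = partitions_to_scan("host-a", 120, 180)
print(f"scan {len(selected)} of {total} partitions")
```

An equality predicate on the hashed column selects one bucket and a time-range predicate selects only the overlapping ranges, so every other partition can be skipped without reading any data.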

● Time-oriented data can be seamlessly offloaded
into HDFS or object storage.
● This reduces cost and increases scale while still
maintaining data access.
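
One way to keep access transparent while data moves between tiers is to split every read at a time boundary. The Python sketch below assumes such a boundary exists and is advanced only after older data has been copied to cold storage; `unified_scan`, the row layout, and the dates are hypothetical, and the two lists stand in for a hot Kudu table and cold files on HDFS or object storage.

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def unified_scan(hot: List[Row], cold: List[Row], boundary: date,
                 start: date, end: date) -> List[Row]:
    """Read days in [start, end] across both tiers, splitting at `boundary`.

    Days on or after the boundary are served from hot storage and earlier
    days from cold storage, so a row that temporarily exists in both
    tiers during an offload is returned only once.
    """
    from_hot = [r for r in hot if r["day"] >= boundary and start <= r["day"] <= end]
    from_cold = [r for r in cold if r["day"] < boundary and start <= r["day"] <= end]
    return sorted(from_cold + from_hot, key=lambda r: r["day"])


hot = [
    {"day": date(2019, 12, 1), "v": 1},
    {"day": date(2019, 12, 2), "v": 2},
    {"day": date(2019, 12, 3), "v": 3},
]
# Dec 1 has already been copied to cold storage but not yet dropped from hot.
cold = [
    {"day": date(2019, 11, 30), "v": 0},
    {"day": date(2019, 12, 1), "v": 1},
]

before = unified_scan(hot, cold, date(2019, 12, 1), date(2019, 11, 30), date(2019, 12, 2))
after = unified_scan(hot, cold, date(2019, 12, 2), date(2019, 11, 30), date(2019, 12, 2))
print([r["v"] for r in before], [r["v"] for r in after])
```

Moving the boundary changes which tier serves a given day but not the result, which is why readers do not need to know where a row currently lives.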

Transparent Hierarchical Storage Pattern

● Analyze and explore the data via SQL using your
computation engine (Impala, Hive, Spark) and
interface of choice.
● Using Impala’s JDBC or ODBC support, use
almost any third-party business intelligence tool.
● Use Cloudera Data Science Workbench (CDSW)
to build distributed machine learning algorithms.

An enterprise data warehouse
must be secure

• Authentication: Kerberos
• Encryption: NavEncrypt
• Authorization
• Audit & Lineage

● Authentication via Kerberos prevents untrusted actors
from gaining access to Kudu.
● Authentication securely identifies the connecting users or
services for authorization checks.
● Easily integrated, deployed, and managed by Cloudera
Manager.

● Wire encryption via TLS without requiring you to
manually deploy certificates on every node.
● At-rest encryption can be achieved using Cloudera
NavEncrypt to encrypt the volumes storing Kudu data.

● Coarse-Grained authorization via Kudu configuration.
○ All or nothing
● Fine-Grained authorization via Apache Sentry and
Apache Ranger.
○ Native Apache Sentry support in CDH 6.3
○ Native Apache Ranger support coming soon
○ Ranger support via Impala & Hive works today

● Audit data access and activities.
● Use lineage to see how data moves through the
environment.
● CDH: Cloudera Navigator events for integrations.
● CDP: Apache Atlas support for integrations.
● Native Apache Atlas support coming soon.

Active Data Warehouse in Cloudera Ecosystem
How can you deploy an Active Data Warehouse today?
• On CDH 6.3 with Sentry
• On CDP Data Center 7.0
• On CDP Public Cloud
• Available in the Cloudera Data Hub
• In the future, Kudu will be available in Cloudera Data Warehouse too

First, you should upgrade Kudu
• Kudu development is very active and recent releases have a lot of great
improvements.
• The Kudu community highly prioritizes improving Kudu usability and
stability.
• Upgrading Kudu is easy because clients are forward and backward
compatible.

Near future :: WIP
• Native integration with Apache Ranger for fine-grained authorization
• Native integration with Apache Atlas for audit & lineage
• More data types
– Varchar, Date, Array, Map
• Maintenance mode for Kudu tablet servers
• Automated rolling restart of Kudu tablet servers
• Automated tablet rebalancing
• Built-in NTP client
• NiFi Kudu Lookup Service

Kudu future :: Medium/Long term
• Auto-generated keys & keyless tables
• Dynamic master configuration
• Secondary indexes
• Transactional bulk load
• Aggregations and rollups

Kudu future :: Cloud
• Autoscaling Kudu tablet servers
• Automatic offload of cold data to object storage
• Global stretch clusters
• Graceful decommission of tablet servers
• Pause/Resume Kudu cluster

Apache Kudu Quickstart Cluster
https://kudu.apache.org/docs/quickstart.html
A Docker based quickstart cluster for local experimentation
git clone https://github.com/apache/kudu
cd kudu
export KUDU_QUICKSTART_IP=$(ifconfig | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' | tail -1)
# Starts a 3 master server, 5 tablet server docker cluster.
docker-compose -f docker/quickstart.yml up -d
# Visit the master server web-ui by visiting localhost:8050

Apache Kudu Quickstart Cluster + Kudu CLI
https://kudu.apache.org/docs/command_line_tools_reference.html
Getting familiar with the command line tools
# Get a bash shell in the kudu-master-1 container
docker exec -it $(docker ps -aqf "name=kudu-master-1") /bin/bash
# Check the cluster health
kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
# List the tables in Kudu
kudu table list kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251

Apache Kudu + Apache Spark Quickstart
https://github.com/apache/kudu/tree/master/examples/quickstart/spark
Load, query, and modify a real data set in Apache Kudu.

Apache Kudu + Apache NiFi Quickstart
https://github.com/apache/kudu/tree/master/examples/quickstart/nifi
Ingest user data into Apache Kudu.

Apache Kudu + Apache Impala Example
https://kudu.apache.org/docs/kudu_impala_integration.html
DDL & DML Example

Apache Kudu + Apache Hive Example
https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
Experimental Query Support in Hive 4.0 & CDP-DC 7.0

Related Kudu Blog Posts
• CDH 6.3 Release: What’s new in Kudu
– https://blog.cloudera.com/cdh-6-3-release-whats-new-in-kudu/
• Fine-Grained Authorization with Apache Kudu and Impala
– https://blog.cloudera.com/fine-grained-authorization-with-apache-kudu-and-impala/
– Useful pattern for Sentry before CDH 6.3
– Useful pattern for Ranger in CDP-DC 7.0

Related Kudu Blog Posts
• Transparent Hierarchical Storage Management with Apache Kudu and Impala
– https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
• Testing Apache Kudu Applications on the JVM
– https://blog.cloudera.com/testing-apache-kudu-applications-on-the-jvm/

Cloudera Time Series Analytics Reference Architecture
https://www.cloudera.com/campaign/time-series.html
[Diagram: data sources 1 through N; NiFi / CDF; Kafka; Spark Streaming; Kudu; Impala; Parquet on HDFS / S3 / etc.; SQL users; Spark; CDSW; data scientists]

Documentation
• Kudu Documentation
– https://kudu.apache.org/
– Downloads, release notes, examples, etc.
• Cloudera Documentation
– https://docs.cloudera.com/
– CDH, CDP Public Cloud, and CDP Data Center

Help & Contacts
• Apache Community Slack & Mailing Lists
– https://kudu.apache.org/community.html
• Cloudera Community Forum
– https://community.cloudera.com/
• Email
– Grant Henke - grant@cloudera.com
