Enabling The Active Data Warehouse With Apache Kudu
December 2019
Grant Henke
Software Engineer
© 2019 Cloudera, Inc. All rights reserved. 2
AGENDA
• What is an Active Data Warehouse?
• Use Cases
• What is Apache Kudu?
• The Active Data Warehouse with Apache Kudu
• Future Plans for Kudu
• Examples & Resources
What is an Active Data Warehouse?
An active data warehouse allows you to continuously
collect, modify, and analyze data from varied sources to
provide meaningful business insights in real-time.
What is an Active Data Warehouse?
An active data warehouse enables real-time analytics,
dashboarding, and operational use cases while still
supporting traditional ad-hoc bulk analytics and archival
use cases.
What is an Active Data Warehouse?
In an active data warehouse, not only is the data
continuously ingested and changing, but the schema may
also be changing.
Use Cases
Query and Analyze Massive Amounts of Real-time Data
• Businesses collect ever-growing volumes of time-series data
– IoT devices, sensors, financial transactions, user activity…
• Businesses need to process these signals to make decisions
– Monitor, repair, and replace malfunctioning equipment
– Detect and react to anomalies in user behavior
– Take advantage of opportunities
• Analyzing data even minutes after it arrives is often too late
Use case: Telecommunications
• Data:
– Network and user events
– Sensor and IoT signals
• Results:
– Detect and repair outages
– Prevent and detect fraud
– Preventive maintenance
– On-demand and predictive provisioning
– Reduce downtime and improve utilization
– Up to 50% reduction of data by deduping on ingest
Use case: Utilities
• Data:
– Noise levels (acoustic data) in real-time from turbines
– Power station data across plants
– Data from smart meters
• Results:
– Detect anomalies
– Monitor turbine health in real time and predict failures before they happen
– Lower downtime
– Lower maintenance cost
Use case: Financial services
• Data:
– Banking and trading transactions
– Signals from ATM and POS devices
– Mobile and web app telemetry
• Results:
– Detect and prevent fraud
– Analyze trends and react in real-time
– Improve customer experience with relevant and timely messaging
– Unlock revenue with relevant customer offers delivered at the right time
What is Apache Kudu?
Apache Kudu is...
An Open Source Data Storage Engine
That Makes
Fast Analytics on Fast And Changing Data Easy
Open source & open data standards are especially important
when storing your data.
Apache Kudu is a top-level Apache Software Foundation project released under the
Apache 2 license and values community participation.
We believe that Kudu's long-term success depends on building a vibrant community
of developers and users from diverse organizations and backgrounds.
Allows users to focus on the use case and not the storage details.
Manages the storage of your data including schema, layout, encoding,
compression and compaction to allow for efficient disk usage and minimize IO.
Separates storage management from computation. Though Kudu utilizes
pushdown projections, predicates/filters, and more to optimize data access, it
leverages tools like Impala, Hive, and Spark for complex computation.
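To make the idea of pushdown concrete, here is a minimal sketch in plain Python. It is not Kudu's API or implementation; the table layout, the `scan` function, and the column names are all hypothetical. It only illustrates why a columnar store that receives the projection and the filter can avoid reading columns the query never asked for.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "columnar" table: one Python list per column.
COLUMNS: Dict[str, list] = {
    "host": ["a", "b", "a", "c"],
    "ts": [100, 101, 102, 103],
    "cpu": [0.5, 0.9, 0.7, 0.2],
}


def scan(
    columns: Dict[str, list],
    projection: List[str],
    predicate_col: str,
    predicate: Callable[[object], bool],
) -> List[Tuple]:
    """Return the projected columns for rows that pass the predicate.

    Only the predicate column and the projected columns are touched.
    Every other column is never read, which is the point of pushing
    the projection and the filter down to the storage layer.
    """
    matching = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    return [tuple(columns[c][i] for c in projection) for i in matching]


rows = scan(COLUMNS, ["host", "cpu"], "ts", lambda ts: ts >= 102)
print(rows)  # [('a', 0.7), ('c', 0.2)]
```

In this sketch the compute engine (the caller) still does any joins or aggregations itself; the storage side only narrows what it hands back.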
Provides a combination of fast ingest and efficient columnar scans to enable
multiple real-time analytic workloads across a single storage layer.
Designed to strike a balance between full scan performance and low-latency random
access allowing it to address a wide array of analytical use cases.
Scale up and out to utilize all of the resources given to it across the cluster and on
each node.
Designed for next-generation hardware.
It is important to support a
variety of workloads.
Data is immediately available to be analyzed as soon as it lands in Kudu.
Supports updates and deletes in order to address a wide variety of use cases without
exotic workarounds.
Supports sustained high throughput ingest to capture all of your data,
streaming or batch.
Kudu was built to be simple to deploy, monitor, operate and use.
Familiar concepts such as tables, partitions, and insert/update/delete operations to
minimize the expertise required to use it effectively.
Simple data model and mutability makes it a breeze to port legacy analytical
applications or build new ones.
Integrates with the big data ecosystem, and integrating it with other data processing
frameworks is simple.
Ecosystem Integration
Categories: Flow, Process, Query, Security, Cloud
The Active Data Warehouse with
Apache Kudu
The Active Data Warehouse with Apache Kudu
[Architecture diagram. Sources: IoT devices, applications, metrics, logs & files. Ingest: CDF. Storage: hot storage; cold storage (HDFS/object storage). Access: SQL. Consumers: real-time analytics, alerting, event-driven applications, dashboards.]
The Active Data Warehouse with Apache Kudu
● Data is ingested into Kudu via Spark and NiFi, which
support almost any data source.
● Ingest is often streaming but may also be
scheduled in batches.
● Ingest may contain late arriving data and UPSERT,
UPDATE, and DELETE operations.
● Kudu tables are often time-oriented fact tables or
low volume dimension/lookup tables.
● Kudu tables can be used to enrich the data via NiFi
and Spark during ingest.
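The ingest behavior described above (late-arriving rows plus UPSERT, UPDATE, and DELETE) can be pictured with a small keyed-table model. This is a simplified sketch in plain Python, not the Kudu client API; the `KeyedTable` class, the `(device_id, event_time)` key, and the `temp` column are hypothetical. It shows why a primary key makes duplicate deliveries and late rows harmless.

```python
from typing import Dict, Tuple

Key = Tuple[str, int]  # hypothetical primary key: (device_id, event_time)


class KeyedTable:
    """Toy model of a table whose rows are addressed by primary key."""

    def __init__(self) -> None:
        self.rows: Dict[Key, dict] = {}

    def upsert(self, key: Key, values: dict) -> None:
        # Insert the row if the key is new, otherwise overwrite the given columns.
        self.rows[key] = {**self.rows.get(key, {}), **values}

    def update(self, key: Key, values: dict) -> bool:
        # Change an existing row; report False if the key does not exist.
        if key not in self.rows:
            return False
        self.rows[key].update(values)
        return True

    def delete(self, key: Key) -> bool:
        # Remove a row by key; report False if the key does not exist.
        return self.rows.pop(key, None) is not None


table = KeyedTable()
table.upsert(("sensor-1", 1000), {"temp": 20.5})
table.upsert(("sensor-1", 1000), {"temp": 20.5})  # duplicate delivery, still one row
table.upsert(("sensor-1", 900), {"temp": 19.0})   # late-arriving event
table.update(("sensor-1", 900), {"temp": 19.5})   # correction
table.delete(("sensor-1", 1000))                  # retraction
print(table.rows)  # {('sensor-1', 900): {'temp': 19.5}}
```

Because the second upsert lands on an existing key, replaying a batch does not create duplicates, which is one way to read the "deduping on ingest" result mentioned in the telecommunications use case.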
The Active Data Warehouse with Apache Kudu
● Data is available to query immediately.
● Kudu manages schema, encoding, compression,
replication, and compaction automatically
○ No small files problem on HDFS or S3.
● Kudu’s columnar layout, primary keys, and
partitioning support allow for minimal IO and
blazing fast queries.
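One way partitioning keeps IO low is by letting a time-bounded query skip partitions that cannot contain matching rows. The sketch below assumes range partitions on an integer timestamp; the split points in `BOUNDS` and the function name are hypothetical, and this is an illustration of the general pruning idea rather than Kudu's own planner.

```python
from bisect import bisect_right
from typing import List

# Hypothetical split points for range partitions on an integer timestamp.
# Partition i covers the half-open interval [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 100, 200, 300, 400]


def partitions_for_range(bounds: List[int], lo: int, hi: int) -> List[int]:
    """Return the indexes of partitions that overlap the query range [lo, hi)."""
    if lo >= hi:
        return []
    # Partition containing lo (clamped to the first partition).
    first = max(bisect_right(bounds, lo) - 1, 0)
    # Partition containing the last included value, hi - 1 (clamped to the last partition).
    last = min(bisect_right(bounds, hi - 1) - 1, len(bounds) - 2)
    return list(range(first, last + 1))


print(partitions_for_range(BOUNDS, 150, 250))  # [1, 2]
print(partitions_for_range(BOUNDS, 200, 300))  # [2]
```

A query for timestamps 150 to 250 only needs two of the four partitions here; the other two are never scanned.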
The Active Data Warehouse with Apache Kudu
● Time-oriented data can be seamlessly offloaded
into HDFS or object storage.
● This reduces cost and increases scale while still
maintaining data access.
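A rough way to picture the offload is a single time boundary: rows newer than the boundary are read from hot storage, older rows from cold storage, and queries see the union. The sketch below is a plain Python model under that assumption; the `TieredStore` class and its methods are hypothetical and stand in for what would really be tables and a view.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (event_time, payload); both hypothetical


class TieredStore:
    """Toy model of hot and cold tiers joined by one time boundary."""

    def __init__(self, boundary: int) -> None:
        self.boundary = boundary
        self.hot: List[Row] = []
        self.cold: List[Row] = []

    def ingest(self, row: Row) -> None:
        # New data always lands in the hot tier.
        self.hot.append(row)

    def offload(self, new_boundary: int) -> None:
        # 1. Copy rows older than the new boundary into the cold tier.
        self.cold.extend(r for r in self.hot if r[0] < new_boundary)
        # 2. Advance the boundary so reads switch to the cold copies.
        self.boundary = new_boundary
        # 3. Drop the offloaded rows from the hot tier.
        self.hot = [r for r in self.hot if r[0] >= new_boundary]

    def query(self, lo: int, hi: int) -> List[Row]:
        # Readers see one logical table: cold below the boundary, hot at or above it.
        # Note: in this sketch a late row older than the boundary would stay hidden in hot.
        cold = [r for r in self.cold if r[0] < self.boundary and lo <= r[0] < hi]
        hot = [r for r in self.hot if r[0] >= self.boundary and lo <= r[0] < hi]
        return sorted(cold + hot)


store = TieredStore(boundary=0)
for row in [(10, "a"), (20, "b"), (30, "c")]:
    store.ingest(row)

before = store.query(0, 100)
store.offload(new_boundary=25)
after = store.query(0, 100)
print(before == after)  # True
```

The query result is the same before and after the offload, which is what "still maintaining data access" amounts to in this model.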
Transparent Hierarchical Storage Pattern
The Active Data Warehouse with Apache Kudu
● Analyze and explore the data via SQL using your
computation engine (Impala, Hive, Spark) and
interface of choice.
● Using Impala’s JDBC or ODBC support, use
almost any third-party business intelligence tool.
● Use Cloudera Data Science Workbench (CDSW)
to build distributed machine learning algorithms.
An enterprise data warehouse
must be secure
The Active Data Warehouse with Apache Kudu
[Architecture diagram with a security layer: authentication (Kerberos), authorization, audit & lineage, encryption (NavEncrypt).]
The Active Data Warehouse with Apache Kudu
● Authentication via Kerberos prevents untrusted actors
from gaining access to Kudu.
● Authentication securely identifies the connecting user or
services for authorization checks.
● Easily integrated, deployed, and managed by Cloudera
Manager.
The Active Data Warehouse with Apache Kudu
● Wire encryption via TLS without requiring you to
manually deploy certificates on every node.
● At-rest encryption can be achieved using Cloudera
NavEncrypt to encrypt the volumes storing Kudu data.
The Active Data Warehouse with Apache Kudu
● Coarse-Grained authorization via Kudu configuration.
○ All or nothing
● Fine-Grained authorization via Apache Sentry and
Apache Ranger.
○ Native Apache Sentry support in CDH 6.3
○ Native Apache Ranger support coming soon
○ Ranger support via Impala & Hive works today
The Active Data Warehouse with Apache Kudu
● Audit data access and activities.
● Use lineage to see how data moves through the
environment.
● CDH: Cloudera Navigator events for integrations.
● CDP: Apache Atlas support for integrations.
● Native Apache Atlas support coming soon.
Active Data Warehouse in Cloudera Ecosystem
How can you deploy an Active Data Warehouse today?
• On CDH 6.3 with Sentry
• On CDP Data Center 7.0
• On CDP Public Cloud
• Available in the Cloudera Data Hub
• In the future, Kudu will be available in Cloudera Data Warehouse too
Future plans for Kudu
First, you should upgrade Kudu
• Kudu development is very active and recent releases have a lot of great
improvements.
• The Kudu community highly prioritizes improving Kudu usability and
stability.
• Upgrading Kudu is easy because clients are forward and backward
compatible.
Near future :: WIP
• Native integration with Apache Ranger for fine grained authorization
• Native integration with Apache Atlas for audit & lineage
• More data types
– Varchar, Date, Array, Map
• Maintenance mode for Kudu tablet servers
• Automated rolling restart of Kudu tablet servers
• Automated tablet rebalancing
• Built-in NTP client
• NiFi Kudu Lookup Service
Kudu future :: Medium/Long term
• Auto-generated keys & keyless tables
• Dynamic master configuration
• Secondary indexes
• Transactional bulk load
• Aggregations and rollups
Kudu future :: Cloud
• Autoscaling Kudu tablet servers
• Automatic offload of cold data to object storage
• Global stretch clusters
• Graceful decommission of tablet servers
• Pause/Resume Kudu cluster
Examples & Resources
Apache Kudu Quickstart Cluster
https://kudu.apache.org/docs/quickstart.html
A Docker based quickstart cluster for local experimentation
git clone https://github.com/apache/kudu
cd kudu
export KUDU_QUICKSTART_IP=$(ifconfig | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' | tail -1)
# Starts a 3 master server, 5 tablet server docker cluster.
docker-compose -f docker/quickstart.yml up -d
# Visit the master server web-ui by visiting localhost:8050
Apache Kudu Quickstart Cluster + Kudu CLI
https://kudu.apache.org/docs/command_line_tools_reference.html
Getting familiar with the command line tools
# Get a bash shell in the kudu-master-1 container
docker exec -it $(docker ps -aqf "name=kudu-master-1") /bin/bash
# Check the cluster health
kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
# List the tables in Kudu
kudu table list kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
Apache Kudu + Apache Spark Quickstart
https://github.com/apache/kudu/tree/master/examples/quickstart/spark
Load, query, and modify a real data set in Apache Kudu.
Apache Kudu + Apache NiFi Quickstart
https://github.com/apache/kudu/tree/master/examples/quickstart/nifi
Ingest user data into Apache Kudu.
Apache Kudu + Apache Impala Example
https://kudu.apache.org/docs/kudu_impala_integration.html
DDL & DML Example
Apache Kudu + Apache Hive Example
https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
Experimental Query Support in Hive 4.0 & CDP-DC 7.0
Related Kudu Blog Posts
• CDH 6.3 Release: What’s new in Kudu
– https://blog.cloudera.com/cdh-6-3-release-whats-new-in-kudu/
• Fine-Grained Authorization with Apache Kudu and Impala
– https://blog.cloudera.com/fine-grained-authorization-with-apache-kudu-and-impala/
– Useful pattern for Sentry before CDH 6.3
– Useful pattern for Ranger in CDP-DC 7.0
• Transparent Hierarchical Storage Management with Apache Kudu and Impala
– https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
• Testing Apache Kudu Applications on the JVM
– https://blog.cloudera.com/testing-apache-kudu-applications-on-the-jvm/
Cloudera Time Series Analytics Reference Architecture
https://www.cloudera.com/campaign/time-series.html
[Diagram components: data sources 1 through N; NiFi / CDF; Kafka; Spark Streaming; Kudu; Parquet on HDFS / S3 / etc.; Impala; SQL users; Spark; CDSW; data scientists.]
Documentation
• Kudu Documentation
– https://kudu.apache.org/
– Downloads, release notes, examples, etc.
• Cloudera Documentation
– https://docs.cloudera.com/
– CDH, CDP Public Cloud, and CDP Data Center
Help & Contacts
• Apache Community Slack & Mailing Lists
– https://kudu.apache.org/community.html
• Cloudera Community Forum
– https://community.cloudera.com/
• Email
– Grant Henke - grant@cloudera.com
THANK YOU

HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 

Recently uploaded (20)

Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 

Enabling the Active Data Warehouse with Apache Kudu

  • 1. Enabling The Active Data Warehouse With Apache Kudu December 2019 Grant Henke Software Engineer
  • 2. © 2019 Cloudera, Inc. All rights reserved. AGENDA
      • What is an Active Data Warehouse?
      • Use Cases
      • What is Apache Kudu?
      • The Active Data Warehouse with Apache Kudu
      • Future Plans for Kudu
      • Examples & Resources
  • 3. What is an Active Data Warehouse?
  • 4. What is an Active Data Warehouse? An active data warehouse allows you to continuously collect, modify, and analyze data from varied sources to provide meaningful business insights in real time.
  • 5. What is an Active Data Warehouse? An active data warehouse enables real-time analytics, dashboarding, and operational use cases while still supporting traditional ad hoc bulk analytics and archival use cases.
  • 6. What is an Active Data Warehouse? In an active data warehouse, not only is the data continuously ingested and changing, but the schema may also be changing.
  • 8. Query and Analyze Massive Amounts of Real-time Data
      • Businesses collect ever-growing volumes of time-series data
          – IoT devices, sensors, financial transactions, user activity…
      • Businesses need to process these signals to make decisions
          – Monitor, repair, and replace malfunctioning equipment
          – Detect and react to anomalies in user behavior
          – Take advantage of opportunities
      • Analyzing data even minutes after it arrives is often too late
  • 9. Use case: Telecommunications
      • Data:
          – Network and user events
          – Sensor and IoT signals
      • Results:
          – Detect and repair outages
          – Prevent and detect fraud
          – Preventive maintenance
          – On-demand and predictive provisioning
          – Reduce downtime and improve utilization
          – Up to 50% reduction of data by deduping on ingest
  • 10. Use case: Utilities
      • Data:
          – Noise levels (acoustic data) in real time from turbines
          – Power station data across plants
          – Data from smart meters
      • Results:
          – Detect anomalies
          – Monitor turbine health in real time and predict failures before they happen
          – Lower downtime
          – Lower maintenance cost
  • 11. Use case: Financial services
      • Data:
          – Banking and trading transactions
          – Signals from ATM and POS devices
          – Mobile and web app telemetry
      • Results:
          – Detect and prevent fraud
          – Analyze trends and react in real time
          – Improve customer experience with relevant and timely messaging
          – Unlock revenue with relevant customer offers delivered at the right time
  • 12. What is Apache Kudu?
  • 14. An Open Source Data Storage Engine That Makes Fast Analytics on Fast And Changing Data Easy
  • 15. Open source
      • Open source & open data standards are especially important when storing your data.
      • Apache Kudu is a top-level Apache Software Foundation project released under the Apache 2 license and values community participation.
      • We believe that Kudu's long-term success depends on building a vibrant community of developers and users from diverse organizations and backgrounds.
  • 16. Data storage engine
      • Allows users to focus on the use case and not the storage details.
      • Manages the storage of your data, including schema, layout, encoding, compression, and compaction, to use disk efficiently and minimize IO.
      • Separates storage management from computation. Though Kudu uses pushdown of projections, predicates/filters, and more to optimize data access, it relies on tools like Impala, Hive, and Spark for complex computation.
  • 17. Fast analytics
      • Provides a combination of fast ingest and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.
      • Designed to strike a balance between full scan performance and low-latency random access, allowing it to address a wide array of analytical use cases.
      • Scales up and out to use all of the resources given to it across the cluster and on each node.
      • Designed for next-generation hardware.
  • 18. It is important to support a variety of workloads.
  • 19. Fast and changing data
      • Data is available to be analyzed as soon as it lands in Kudu.
      • Supports updates and deletes in order to address a wide variety of use cases without exotic workarounds.
      • Supports sustained high-throughput ingest to capture all of your data, streaming or batch.
  • 20. Easy
      • Kudu was built to be simple to deploy, monitor, operate, and use.
      • Uses familiar concepts such as tables, partitions, and insert/update/delete operations to minimize the expertise required to use it effectively.
      • A simple data model and mutability make it a breeze to port legacy analytical applications or build new ones.
      • Integrates with the big data ecosystem, and integrating it with other data processing frameworks is simple.
  • 21. Ecosystem Integration: Flow, Process, Query, Security, Cloud
  • 22. The Active Data Warehouse with Apache Kudu
  • 23. The Active Data Warehouse with Apache Kudu [architecture diagram; labels: CDF, IoT Devices, Applications, Metrics, Logs & Files, HDFS/Object Storage, Hot Storage, Cold Storage, SQL, Real-Time Analytics, Alerting, Event Driven Applications, Dashboards]
  • 24. The Active Data Warehouse with Apache Kudu
      ● Data is ingested into Kudu via Spark and NiFi, which support almost any data source.
      ● Ingest is often streaming but may also be scheduled in batches.
      ● Ingest may contain late-arriving data and UPSERT, UPDATE, and DELETE operations.
      ● Kudu tables are often time-oriented fact tables or low-volume dimension/lookup tables.
      ● Kudu tables can be used to enrich the data via NiFi and Spark during ingest.
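The ingest behaviour described on slide 24 (late-arriving rows plus UPSERT, UPDATE, and DELETE against a primary-keyed table) can be modelled in a few lines. This is a minimal Python sketch of the semantics only, not Kudu client code: the `KeyedTable` class, its method names, and the sample sensor rows are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]   # (device_id, event_time) plays the role of the primary key
Row = Dict[str, float]  # the remaining columns


class KeyedTable:
    """Toy in-memory stand-in for a primary-keyed table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[Key, Row] = {}

    def upsert(self, key: Key, row: Row) -> None:
        # Insert the row, or overwrite it if the key already exists.
        self._rows[key] = dict(row)

    def update(self, key: Key, changes: Row) -> bool:
        # Modify an existing row; return False if the key is absent.
        if key not in self._rows:
            return False
        self._rows[key].update(changes)
        return True

    def delete(self, key: Key) -> bool:
        # Remove a row; return False if the key is absent.
        return self._rows.pop(key, None) is not None

    def get(self, key: Key) -> Optional[Row]:
        return self._rows.get(key)

    def __len__(self) -> int:
        return len(self._rows)


table = KeyedTable()
events = [
    (("sensor-1", 100), {"temp": 20.5}),
    (("sensor-1", 101), {"temp": 20.7}),
    (("sensor-1", 100), {"temp": 20.5}),  # duplicate delivery, absorbed by the key
    (("sensor-1", 99), {"temp": 20.1}),   # late-arriving event
]
for key, row in events:
    table.upsert(key, row)

table.update(("sensor-1", 101), {"temp": 21.0})  # correction to an existing row
table.delete(("sensor-1", 99))                   # retraction
print(len(table), table.get(("sensor-1", 101)))
```

Because every write is addressed by key, redelivered events collapse into one row, which is also one way to read the "deduping on ingest" result mentioned in the telecommunications use case.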
  • 25. The Active Data Warehouse with Apache Kudu
      ● Data is available to query immediately.
      ● Kudu manages schema, encoding, compression, replication, and compaction automatically.
          ○ No small files problem on HDFS or S3.
      ● Kudu’s columnar layout, primary keys, and partitioning support allow for minimal IO and blazing fast queries.
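To illustrate why partitioning keeps IO low, the sketch below models range partitions on a timestamp column and shows that a time-bounded scan only needs to touch the partitions that overlap its predicate. It is a hedged illustration in plain Python: the `prune` function and the daily partition layout are assumptions made for the example, not Kudu's implementation.

```python
from typing import List, Tuple

Partition = Tuple[int, int]  # [lower, upper) bounds on the timestamp column


def prune(partitions: List[Partition], lo: int, hi: int) -> List[Partition]:
    """Return only the partitions that can hold rows with lo <= ts < hi."""
    return [(p_lo, p_hi) for p_lo, p_hi in partitions if p_lo < hi and lo < p_hi]


# Hypothetical layout: 30 daily range partitions, expressed as day numbers.
daily = [(day, day + 1) for day in range(30)]

# A query over the last three days skips the other 27 partitions entirely.
touched = prune(daily, lo=27, hi=30)
print(len(touched), "of", len(daily), "partitions scanned")
```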
  • 26. The Active Data Warehouse with Apache Kudu
      ● Time-oriented data can be seamlessly offloaded into HDFS or object storage.
      ● This reduces cost and increases scale while still maintaining data access.
  • 27-29. Transparent Hierarchical Storage Pattern
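One way to picture the hot/cold offload described on slide 26 is a boundary timestamp: reads at or after the boundary come from hot storage, reads before it come from cold storage, and an offload job copies aged rows across before moving the boundary. The sketch below is a toy Python model under that assumption; the `TieredStore` class and its methods are hypothetical and do not represent the Kudu or Impala implementation.

```python
from typing import Dict, List

Row = Dict[str, int]


class TieredStore:
    """Toy model: recent rows in a hot tier, older rows in a cold tier."""

    def __init__(self, boundary: int) -> None:
        self.boundary = boundary  # rows with ts >= boundary are read from hot
        self.hot: List[Row] = []
        self.cold: List[Row] = []

    def ingest(self, row: Row) -> None:
        # Assumes row["ts"] >= self.boundary; older rows are out of scope here.
        self.hot.append(row)

    def query(self, lo: int, hi: int) -> List[Row]:
        # Unified read: each tier serves only its own side of the boundary,
        # so a row is never returned twice while it exists in both tiers.
        cold = [r for r in self.cold if lo <= r["ts"] < min(hi, self.boundary)]
        hot = [r for r in self.hot if max(lo, self.boundary) <= r["ts"] < hi]
        return sorted(cold + hot, key=lambda r: r["ts"])

    def offload(self, new_boundary: int) -> None:
        # 1. Copy the aged-out rows to cold storage.
        self.cold += [r for r in self.hot if r["ts"] < new_boundary]
        # 2. Move the boundary so readers switch tiers for that range.
        self.boundary = new_boundary
        # 3. Drop the now-unused rows from hot storage.
        self.hot = [r for r in self.hot if r["ts"] >= new_boundary]


store = TieredStore(boundary=0)
for ts in range(10):
    store.ingest({"ts": ts, "value": ts * 10})

before = store.query(0, 10)
store.offload(new_boundary=7)
after = store.query(0, 10)
print(before == after, len(store.hot), len(store.cold))
```

The point of the design is that query results stay the same across an offload, which is what makes the tiering transparent to readers.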
  • 30. The Active Data Warehouse with Apache Kudu
      ● Analyze and explore the data via SQL using your computation engine (Impala, Hive, Spark) and interface of choice.
      ● Using Impala’s JDBC or ODBC support, use almost any third-party business intelligence tool.
      ● Use Cloudera Data Science Workbench (CDSW) to build distributed machine learning algorithms.
  • 31. An enterprise data warehouse must be secure
  • 32. The Active Data Warehouse with Apache Kudu [architecture diagram with security layers: Authentication (Kerberos), Encryption (NavEncrypt), Authorization, Audit & Lineage]
  • 33. The Active Data Warehouse with Apache Kudu: Authentication
      ● Authentication via Kerberos prevents untrusted actors from gaining access to Kudu.
      ● Authentication securely identifies the connecting user or service for authorization checks.
      ● Easily integrated, deployed, and managed by Cloudera Manager.
  • 34. The Active Data Warehouse with Apache Kudu: Encryption
      ● Wire encryption via TLS without requiring you to manually deploy certificates on every node.
      ● At-rest encryption can be achieved using Cloudera NavEncrypt to encrypt the volumes storing Kudu data.
  • 35. The Active Data Warehouse with Apache Kudu: Authorization
      ● Coarse-grained authorization via Kudu configuration.
          ○ All or nothing
      ● Fine-grained authorization via Apache Sentry and Apache Ranger.
          ○ Native Apache Sentry support in CDH 6.3
          ○ Native Apache Ranger support coming soon
          ○ Ranger support via Impala & Hive works today
  • 36. The Active Data Warehouse with Apache Kudu: Audit & Lineage
      ● Audit data access and activities.
      ● Use lineage to see how data moves through the environment.
      ● CDH: Cloudera Navigator events for integrations.
      ● CDP: Apache Atlas support for integrations.
      ● Native Apache Atlas support coming soon.
  • 37. How can you deploy an Active Data Warehouse today? Active Data Warehouse in Cloudera Ecosystem
      • On CDH 6.3 with Sentry
      • On CDP Data Center 7.0
      • On CDP Public Cloud
      • Available in the Cloudera Data Hub
      • In the future Kudu will be available in Cloudera Data Warehouse too
  • 39. First, you should upgrade Kudu
      • Kudu development is very active and recent releases have a lot of great improvements.
      • The Kudu community highly prioritizes improving Kudu usability and stability.
      • Upgrading Kudu is easy because clients are forward and backward compatible.
  • 40. Near future :: WIP
      • Native integration with Apache Ranger for fine-grained authorization
      • Native integration with Apache Atlas for audit & lineage
      • More data types
          – Varchar, Date, Array, Map
      • Maintenance mode for Kudu tablet servers
      • Automated rolling restart of Kudu tablet servers
      • Automated tablet rebalancing
      • Built-in NTP client
      • NiFi Kudu Lookup Service
  • 41. Kudu future :: Medium/Long term
      • Auto-generated keys & keyless tables
      • Dynamic master configuration
      • Secondary indexes
      • Transactional bulk load
      • Aggregations and rollups
  • 42. Kudu future :: Cloud
      • Autoscaling Kudu tablet servers
      • Automatic offload of cold data to object storage
      • Global stretch clusters
      • Graceful decommission of tablet servers
      • Pause/Resume Kudu cluster
  • 44. Apache Kudu Quickstart Cluster
      https://kudu.apache.org/docs/quickstart.html
      A Docker-based quickstart cluster for local experimentation

      git clone https://github.com/apache/kudu
      cd kudu
      export KUDU_QUICKSTART_IP=$(ifconfig | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' | tail -1)
      # Starts a 3 master server, 5 tablet server docker cluster.
      docker-compose -f docker/quickstart.yml up -d
      # Visit the master server web-ui by visiting localhost:8050
  • 45. Apache Kudu Quickstart Cluster + Kudu CLI
      https://kudu.apache.org/docs/command_line_tools_reference.html
      Getting familiar with the command line tools

      # Get a bash shell in the kudu-master-1 container
      docker exec -it $(docker ps -aqf "name=kudu-master-1") /bin/bash
      # Check the cluster health
      kudu cluster ksck kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
      # List the tables in Kudu
      kudu table list kudu-master-1:7051,kudu-master-2:7151,kudu-master-3:7251
  • 46. Apache Kudu + Apache Spark Quickstart
      https://github.com/apache/kudu/tree/master/examples/quickstart/spark
      Load, query, and modify a real data set in Apache Kudu.
  • 47. Apache Kudu + Apache NiFi Quickstart
      https://github.com/apache/kudu/tree/master/examples/quickstart/nifi
      Ingest user data into Apache Kudu.
  • 48. Apache Kudu + Apache Impala Example
      https://kudu.apache.org/docs/kudu_impala_integration.html
      DDL & DML Example
  • 49. Apache Kudu + Apache Hive Example
      https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
      Experimental Query Support in Hive 4.0 & CDP-DC 7.0
  • 50. Related Kudu Blog Posts
      • CDH 6.3 Release: What’s new in Kudu
          – https://blog.cloudera.com/cdh-6-3-release-whats-new-in-kudu/
      • Fine-Grained Authorization with Apache Kudu and Impala
          – https://blog.cloudera.com/fine-grained-authorization-with-apache-kudu-and-impala/
          – Useful pattern for Sentry before CDH 6.3
          – Useful pattern for Ranger in CDP-DC 7.0
  • 51. Related Kudu Blog Posts
      • Transparent Hierarchical Storage Management with Apache Kudu and Impala
          – https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
      • Testing Apache Kudu Applications on the JVM
          – https://blog.cloudera.com/testing-apache-kudu-applications-on-the-jvm/
  • 52. Cloudera Time Series Analytics Reference Architecture
      https://www.cloudera.com/campaign/time-series.html
      [diagram labels: Data sources 1..N; NiFi / CDF; Kafka; Spark Streaming; Kudu; Impala; Parquet on HDFS / S3 / etc.; SQL users; Spark; CDSW; Data scientists]
  • 53. Documentation
      • Kudu Documentation
          – https://kudu.apache.org/
          – Downloads, release notes, examples, etc.
      • Cloudera Documentation
          – https://docs.cloudera.com/
          – CDH, CDP Public Cloud, and CDP Data Center
  • 54. Help & Contacts
      • Apache Community Slack & Mailing Lists
          – https://kudu.apache.org/community.html
      • Cloudera Community Forum
          – https://community.cloudera.com/
      • Email
          – Grant Henke - grant@cloudera.com