SlideShare ist ein Scribd-Unternehmen logo
1 von 44
1© Cloudera, Inc. All rights reserved.
Using Kafka and Kudu for fast,
low-latency SQL analytics on
streaming data
Mike Percy, Software Engineer, Cloudera
Ashish Singh, Software Engineer, Cloudera
2© Cloudera, Inc. All rights reserved.
The problem
• Building a system that supports low-latency streaming and batch workloads
simultaneously is hard
• Current solutions to this are complicated and error-prone (e.g. lambda arch)
• Too much expertise is currently required to make it work
3© Cloudera, Inc. All rights reserved.
Building a near-real time data architecture
How can we...
1.Enable data to flow into our query system quickly, reliably, and efficiently
2.Allow for stream-processing of this data if desired
3.Enable batch processing to access up-to-the-second information
...while keeping complexity at a minimum?
4© Cloudera, Inc. All rights reserved.
Problem domains that require stream + batch
Credit Card
& Monetary
Transactions
Identify
fraudulent
transactions
as soon as
they occur.
Transportation
& Logistics
• Real-time
traffic
conditions
• Tracking
fleet and cargo locations and
dynamic re-routing to meet
SLAs
Retail
• Real-time
in-store
Offers and Recommendations.
• Email and
marketing campaigns based on real-
time social trends
Consumer
Internet,
Mobile &
E-Commerce
Optimize user
engagement based
on user’s current
behavior. Deliver
recommendations relevant “in
the moment”
Manufacturing
• Identify
equipment
failures and
react instantly
• Perform proactive
maintenance.
• Identify product
quality defects immediately to
prevent resource wastage.
Security & Surveillance
Identify
threats
and intrusions,
both digital and physical, in real-
time.
Digital
Advertising
& Marketing
Optimize and personalize digital
ads based on real-time
information.
Continuously
monitor patient
vital stats and proactively identify
at-risk patients.
Report on this data.
Healthcare
5© Cloudera, Inc. All rights reserved.
Agenda
• A new low-latency, high throughput architecture for analytics
• Building an ingest pipeline
• Storing and querying structured data
• Design tradeoffs
• Demo
• Q&A
6© Cloudera, Inc. All rights reserved.
A modern, low-latency analytics architecture
Data
Sources
Kafka Kudu
(optional)
Impala
or
SparkSQL
7© Cloudera, Inc. All rights reserved.
“Traditional” real-time analytics in Hadoop
Fraud detection in the real world means storage complexity
Considerations:
● How do I handle failure
during this process?
● How often do I reorganize
data streaming in into a
format appropriate for
reporting?
● When reporting, how do I see
data that has not yet been
reorganized?
● How do I ensure that
important jobs aren’t
interrupted by maintenance?
New Partition
Most Recent Partition
Historical Data
HBase
Parquet
File
Have we
accumulated
enough data?
Reorganize
HBase file
into Parquet
• Wait for running operations to complete
• Define new Impala partition referencing
the newly written Parquet file
Kafka
Reporting
Request
Storage in HDFS
8© Cloudera, Inc. All rights reserved.
Real-time analytics in Hadoop with Kudu
Improvements:
● Fewer systems to operate
● No cron jobs or background
processes
● Handle late arrivals or data
corrections with ease
● New data available
immediately for analytics or
operations
Historical and Real-time
Data
Kafka
Reporting
Request
Storage in Kudu
9© Cloudera, Inc. All rights reserved.
Xiaomi use case
• World’s 4th largest smart-phone maker (most popular in China)
• Gather important RPC tracing events from mobile app and backend service.
• Service monitoring & troubleshooting tool.
High write throughput
• >5 Billion records/day and growing
Query latest data and quick response
• Identify and resolve issues quickly
Can search for individual records
• Easy for troubleshooting
10© Cloudera, Inc. All rights reserved.
Xiaomi big data analytics pipeline
Before Kudu
Large ETL pipeline delays
● High data visibility latency
(from 1 hour up to 1 day)
● Data format conversion woes
Ordering issues
● Log arrival (storage) not
exactly in correct order
● Must read 2 – 3 days of data
to get all of the data points
for a single day
11© Cloudera, Inc. All rights reserved.
Xiaomi big data analytics pipeline
Simplified with Kafka and Kudu
Low latency ETL pipeline
● ~10s data latency
● For apps that need to avoid
direct backpressure or need
ETL for record enrichment
Direct zero-latency path
● For apps that can tolerate
backpressure and can use the
NoSQL APIs
● Apps that don’t need ETL
enrichment for storage /
retrieval
OLAP scan
Side table lookup
Result store
12© Cloudera, Inc. All rights reserved.
A modern, low-latency analytics architecture
Data
Sources
Kafka Kudu
(optional)
Impala
or
SparkSQL
13© Cloudera, Inc. All rights reserved.
14© Cloudera, Inc. All rights reserved.
Client Backend
Data Pipelines Start like this.
15© Cloudera, Inc. All rights reserved.
Client Backend
Client
Client
Client
Then we reuse them
16© Cloudera, Inc. All rights reserved.
Client Backend
Client
Client
Client
Then we add multiple backends
Another
Backend
17© Cloudera, Inc. All rights reserved.
Client Backend
Client
Client
Client
Then it starts to look like this
Another
Backend
Another
Backend
Another
Backend
18© Cloudera, Inc. All rights reserved.
Client Backend
Client
Client
Client
With maybe some of this
Another
Backend
Another
Backend
Another
Backend
19© Cloudera, Inc. All rights reserved.
Adding applications should be easier
We need:
• Shared infrastructure for sending records
• Infrastructure must scale
• Set of agreed-upon record schemas
20© Cloudera, Inc. All rights reserved.
Kafka decouples data pipelines
Why Kafka
20
Source System Source System Source System Source System
Hadoop Security Systems
Real-time
monitoring
Data Warehouse
Kafka
Producers
Broker
Consumers
21© Cloudera, Inc. All rights reserved.
About Kafka
• Publish/Subscribe Messaging System From LinkedIn
• High throughput (100’s of k messages/sec)
• Low latency (sub-second to low seconds)
• Fault-tolerant (Replicated and Distributed)
• Supports Agnostic Messaging
• Standardizes format and delivery
• Huge community
22© Cloudera, Inc. All rights reserved.
Architecture
Producer
Consumer Consumer
Producers
Kafka
Cluster
Consumers
Broker Broker Broker Broker
Producer
Zookeeper
23© Cloudera, Inc. All rights reserved.
24© Cloudera, Inc. All rights reserved.
Kudu is a high-performance distributed storage engine
Storage for fast (low latency) analytics on fast (high throughput) data
• Simplifies the architecture for building
analytic applications on changing data
• Optimized for fast analytic performance
• Natively integrated with the Hadoop
ecosystem of components
FILESYSTEM
HDFS
NoSQL
HBASE
INGEST – SQOOP, FLUME, KAFKA
DATA INTEGRATION & STORAGE
SECURITY – SENTRY
RESOURCE MANAGEMENT – YARN
UNIFIED DATA SERVICES
BATCH STREAM SQL SEARCH MODEL ONLINE
DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS
SPARK,
HIVE, PIG
SPARK IMPALA SOLR SPARK HBASE
RELATIONAL
KUDU
25© Cloudera, Inc. All rights reserved.
Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Store tables like a normal database
• Individual record-level access to 100+ billion row tables
26© Cloudera, Inc. All rights reserved.
• High throughput for big scans
Goal: Within 2x of Parquet
• Low-latency for short accesses
Goal: 1ms read/write on SSD
• Database-like semantics
Initially, single-row atomicity
• Relational data model
• SQL queries should be natural and easy
• Include NoSQL-style scan, insert, and update APIs
Kudu design goals
27© Cloudera, Inc. All rights reserved.
Kudu storage system interfaces
• A Kudu table has a SQL-like schema
• And a finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,
TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java, C++, and Python NoSQL-style APIs
• Insert(), Update(), Delete(), Scan()
• Integrations with Kafka, MapReduce, Spark, Flume, and Impala
• Apache Drill work-in-progress
28© Cloudera, Inc. All rights reserved.
Kudu use cases
Kudu is best for use cases requiring:
•Simultaneous combination of sequential and random reads and writes
•Minimal to zero data latencies
Time series
•Examples: Streaming market data, fraud detection / prevention, risk monitoring
•Workload: Insert, updates, scans, lookups
Machine data analytics
•Example: Network threat detection
•Workload: Inserts, scans, lookups
Online reporting / data warehousing
•Example: Operational data store (ODS)
•Workload: Inserts, updates, scans, lookups
29© Cloudera, Inc. All rights reserved.
Tables and tablets
• Each table is horizontally partitioned into tablets
• Range or hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
• Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100
• Each tablet has N replicas (3 or 5), kept consistent with Raft consensus
• Tablet servers host tablets on local disk drives
30© Cloudera, Inc. All rights reserved.
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them
31© Cloudera, Inc. All rights reserved.
Impala integration
• CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS
AS SELECT … FROM …
• INSERT/UPDATE/DELETE
• Optimizations like predicate pushdown, scan parallelism, plans for
more on the way
32© Cloudera, Inc. All rights reserved.
Spark DataSource integration
sqlContext.load("org.kududb.spark",
Map("kudu.table" -> “foo”,
"kudu.master" -> “master.example.com”))
.registerTempTable(“mytable”)
df = sqlContext.sql(
“select col_a, col_b from mytable “ +
“where col_c = 123”)
33© Cloudera, Inc. All rights reserved.
TPC-H (analytics benchmark)
• 75 server cluster
• 12 (spinning) disks each, enough RAM to fit dataset
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders,
lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey =
o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey =
n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date
'1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;
34© Cloudera, Inc. All rights reserved.
• Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
35© Cloudera, Inc. All rights reserved.
Versus other NoSQL storage
• Apache Phoenix: OLTP SQL engine built on HBase
• 10 node cluster (9 worker, 1 master)
• TPC-H LINEITEM table only (6B rows)
36© Cloudera, Inc. All rights reserved.
• Custom client, i.e., Kafka consumer that writes to Kudu
Getting Data from Kafka into Kudu
37© Cloudera, Inc. All rights reserved.
• Custom client, i.e., Kafka consumer that writes to Kudu
• Kafka-Flume source/channel + Kudu-Flume sink
Getting Data from Kafka into Kudu
38© Cloudera, Inc. All rights reserved.
• Custom client, i.e., Kafka consumer that writes to Kudu
• Kafka-Flume source/channel + Kudu-Flume sink
• Kafka connect
Getting Data from Kafka into Kudu
39© Cloudera, Inc. All rights reserved.
Kafka + Kudu: A low latency data visibility path
• Upstream application pushes data to Kafka
• Kafka then acts as a buffer in order to handle backpressure from Kudu
• The Kafka Connect plugin pushes data to Kudu as it becomes available
• As soon as the data is ingested into Kudu, it becomes available
40© Cloudera, Inc. All rights reserved.
Tradeoffs
What if I need a zero-latency path?
• Possible to write to Kudu directly using the NoSQL API
• However, the app will need to tolerate queueing and backpressure itself
What if I want to store unstructured data or large binary blobs?
• Consider using Kafka + HBase instead of Kudu
•But you won’t get the same SQL query performance
41© Cloudera, Inc. All rights reserved.
Demo
Data
Sources
Kafka Kudu
(optional)
Impala
or
SparkSQL
42© Cloudera, Inc. All rights reserved.
About the Kudu project
•Apache Software Foundation incubating project
•Latest version 0.8.0 (beta) released in April
•Plans are for a 1.0 version to be released in August
•Web site: getkudu.io (also kudu.incubator.apache.org soon)
•Slack chat room for devs and users (auto-invite): getkudu-slack.herokuapp.com
•Twitter handle: @ApacheKudu
•Code: github.com/apache/incubator-kudu
Want to hear more about Kudu and Spark?
• Come to the Vancouver Spark meetup tonight here at the Hyatt at 6pm
• More info: www.meetup.com/Vancouver-Spark/
43© Cloudera, Inc. All rights reserved.
Thank you
44© Cloudera, Inc. All rights reserved.
Questions?
Mike Percy | @mike_percy
mpercy@cloudera.com
Ashish Singh | @singhasdev
asingh@cloudera.com

Weitere ähnliche Inhalte

Was ist angesagt?

Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in ImpalaCloudera, Inc.
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisYue Chen
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 

Was ist angesagt? (20)

Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 

Andere mochten auch

Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphDataWorks Summit
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab CreateTuri, Inc.
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsTuri, Inc.
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 

Andere mochten auch (11)

HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache GiraphHadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
 
Machine Learning with GraphLab Create
Machine Learning with GraphLab CreateMachine Learning with GraphLab Create
Machine Learning with GraphLab Create
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge DatasetsScaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 

Ähnlich wie Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data

Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Datamichaelguia
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming ArchitecturesCloudera, Inc.
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Dataconomy Media
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 

Ähnlich wie Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data (20)

Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
 
SFHUG Kudu Talk
SFHUG Kudu TalkSFHUG Kudu Talk
SFHUG Kudu Talk
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 

Kürzlich hochgeladen

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data

  • 1. 1© Cloudera, Inc. All rights reserved. Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data Mike Percy, Software Engineer, Cloudera Ashish Singh, Software Engineer, Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. The problem • Building a system that supports low-latency streaming and batch workloads simultaneously is hard • Current solutions to this are complicated and error-prone (e.g. lambda arch) • Too much expertise is currently required to make it work
  • 3. 3© Cloudera, Inc. All rights reserved. Building a near-real time data architecture How can we... 1.Enable data to flow into our query system quickly, reliably, and efficiently 2.Allow for stream-processing of this data if desired 3.Enable batch processing to access up-to-the-second information ...while keeping complexity at a minimum?
  • 4. 4© Cloudera, Inc. All rights reserved. Problem domains that require stream + batch Credit Card & Monetary Transactions Identify fraudulent transactions as soon as they occur. Transportation & Logistics • Real-time traffic conditions • Tracking fleet and cargo locations and dynamic re-routing to meet SLAs Retail • Real-time in-store Offers and Recommendations. • Email and marketing campaigns based on real- time social trends Consumer Internet, Mobile & E-Commerce Optimize user engagement based on user’s current behavior. Deliver recommendations relevant “in the moment” Manufacturing • Identify equipment failures and react instantly • Perform proactive maintenance. • Identify product quality defects immediately to prevent resource wastage. Security & Surveillance Identify threats and intrusions, both digital and physical, in real- time. Digital Advertising & Marketing Optimize and personalize digital ads based on real-time information. Continuously monitor patient vital stats and proactively identify at-risk patients. Report on this data. Healthcare
  • 5. 5© Cloudera, Inc. All rights reserved. Agenda • A new low-latency, high throughput architecture for analytics • Building an ingest pipeline • Storing and querying structured data • Design tradeoffs • Demo • Q&A
  • 6. 6© Cloudera, Inc. All rights reserved. A modern, low-latency analytics architecture Data Sources Kafka Kudu (optional) Impala or SparkSQL
  • 7. 7© Cloudera, Inc. All rights reserved. “Traditional” real-time analytics in Hadoop Fraud detection in the real world means storage complexity Considerations: ● How do I handle failure during this process? ● How often do I reorganize data streaming in into a format appropriate for reporting? ● When reporting, how do I see data that has not yet been reorganized? ● How do I ensure that important jobs aren’t interrupted by maintenance? New Partition Most Recent Partition Historical Data HBase Parquet File Have we accumulated enough data? Reorganize HBase file into Parquet • Wait for running operations to complete • Define new Impala partition referencing the newly written Parquet file Kafka Reporting Request Storage in HDFS
  • 8. 8© Cloudera, Inc. All rights reserved. Real-time analytics in Hadoop with Kudu Improvements: ● Fewer systems to operate ● No cron jobs or background processes ● Handle late arrivals or data corrections with ease ● New data available immediately for analytics or operations Historical and Real-time Data Kafka Reporting Request Storage in Kudu
  • 9. 9© Cloudera, Inc. All rights reserved. Xiaomi use case • World’s 4th largest smart-phone maker (most popular in China) • Gather important RPC tracing events from mobile app and backend service. • Service monitoring & troubleshooting tool. High write throughput • >5 Billion records/day and growing Query latest data and quick response • Identify and resolve issues quickly Can search for individual records • Easy for troubleshooting
  • 10. 10© Cloudera, Inc. All rights reserved. Xiaomi big data analytics pipeline Before Kudu Large ETL pipeline delays ● High data visibility latency (from 1 hour up to 1 day) ● Data format conversion woes Ordering issues ● Log arrival (storage) not exactly in correct order ● Must read 2 – 3 days of data to get all of the data points for a single day
  • 11. 11© Cloudera, Inc. All rights reserved. Xiaomi big data analytics pipeline Simplified with Kafka and Kudu Low latency ETL pipeline ● ~10s data latency ● For apps that need to avoid direct backpressure or need ETL for record enrichment Direct zero-latency path ● For apps that can tolerate backpressure and can use the NoSQL APIs ● Apps that don’t need ETL enrichment for storage / retrieval OLAP scan Side table lookup Result store
  • 12. 12© Cloudera, Inc. All rights reserved. A modern, low-latency analytics architecture Data Sources Kafka Kudu (optional) Impala or SparkSQL
  • 13. 13© Cloudera, Inc. All rights reserved.
  • 14. 14© Cloudera, Inc. All rights reserved. Client Backend Data Pipelines Start like this.
  • 15. 15© Cloudera, Inc. All rights reserved. Client Backend Client Client Client Then we reuse them
  • 16. 16© Cloudera, Inc. All rights reserved. Client Backend Client Client Client Then we add multiple backends Another Backend
  • 17. 17© Cloudera, Inc. All rights reserved. Client Backend Client Client Client Then it starts to look like this Another Backend Another Backend Another Backend
  • 18. 18© Cloudera, Inc. All rights reserved. Client Backend Client Client Client With maybe some of this Another Backend Another Backend Another Backend
  • 19. 19© Cloudera, Inc. All rights reserved. Adding applications should be easier We need: • Shared infrastructure for sending records • Infrastructure must scale • Set of agreed-upon record schemas
  • 20. 20© Cloudera, Inc. All rights reserved. Kafka decouples data pipelines Why Kafka 20 Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka Producers Broker Consumers
  • 21. 21© Cloudera, Inc. All rights reserved. About Kafka • Publish/Subscribe Messaging System From LinkedIn • High throughput (100’s of k messages/sec) • Low latency (sub-second to low seconds) • Fault-tolerant (Replicated and Distributed) • Supports Agnostic Messaging • Standardizes format and delivery • Huge community
  • 22. 22© Cloudera, Inc. All rights reserved. Architecture Producer Consumer Consumer Producers Kafka Cluster Consumers Broker Broker Broker Broker Producer Zookeeper
  • 23. 23© Cloudera, Inc. All rights reserved.
  • 24. 24© Cloudera, Inc. All rights reserved. Kudu is a high-performance distributed storage engine Storage for fast (low latency) analytics on fast (high throughput) data • Simplifies the architecture for building analytic applications on changing data • Optimized for fast analytic performance • Natively integrated with the Hadoop ecosystem of components FILESYSTEM HDFS NoSQL HBASE INGEST – SQOOP, FLUME, KAFKA DATA INTEGRATION & STORAGE SECURITY – SENTRY RESOURCE MANAGEMENT – YARN UNIFIED DATA SERVICES BATCH STREAM SQL SEARCH MODEL ONLINE DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE RELATIONAL KUDU
  • 25. 25© Cloudera, Inc. All rights reserved. Kudu: Scalable and fast tabular storage Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node Tabular • Store tables like a normal database • Individual record-level access to 100+ billion row tables
  • 26. 26© Cloudera, Inc. All rights reserved. • High throughput for big scans Goal: Within 2x of Parquet • Low-latency for short accesses Goal: 1ms read/write on SSD • Database-like semantics Initially, single-row atomicity • Relational data model • SQL queries should be natural and easy • Include NoSQL-style scan, insert, and update APIs Kudu design goals
  • 27. 27© Cloudera, Inc. All rights reserved. Kudu storage system interfaces • A Kudu table has a SQL-like schema • And a finite number of columns (unlike HBase/Cassandra) • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP • Some subset of columns makes up a possibly-composite primary key • Fast ALTER TABLE • Java, C++, and Python NoSQL-style APIs • Insert(), Update(), Delete(), Scan() • Integrations with Kafka, MapReduce, Spark, Flume, and Impala • Apache Drill work-in-progress
  • 28. 28© Cloudera, Inc. All rights reserved. Kudu use cases Kudu is best for use cases requiring: •Simultaneous combination of sequential and random reads and writes •Minimal to zero data latencies Time series •Examples: Streaming market data, fraud detection / prevention, risk monitoring •Workload: Insert, updates, scans, lookups Machine data analytics •Example: Network threat detection •Workload: Inserts, scans, lookups Online reporting / data warehousing •Example: Operational data store (ODS) •Workload: Inserts, updates, scans, lookups
  • 29. 29© Cloudera, Inc. All rights reserved. Tables and tablets • Each table is horizontally partitioned into tablets • Range or hash partitioning • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS • Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100 • Each tablet has N replicas (3 or 5), kept consistent with Raft consensus • Tablet servers host tablets on local disk drives
  • 30. 30© Cloudera, Inc. All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
  • 31. 31© Cloudera, Inc. All rights reserved. Impala integration • CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM … • INSERT/UPDATE/DELETE • Optimizations like predicate pushdown, scan parallelism, plans for more on the way
  • 32. 32© Cloudera, Inc. All rights reserved. Spark DataSource integration sqlContext.load("org.kududb.spark", Map("kudu.table" -> “foo”, "kudu.master" -> “master.example.com”)) .registerTempTable(“mytable”) df = sqlContext.sql( “select col_a, col_b from mytable “ + “where col_c = 123”)
  • 33. 33© Cloudera, Inc. All rights reserved. TPC-H (analytics benchmark) • 75 server cluster • 12 (spinning) disks each, enough RAM to fit dataset • TPC-H Scale Factor 100 (100GB) • Example query: • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;
  • 34. 34© Cloudera, Inc. All rights reserved. • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
  • 35. 35© Cloudera, Inc. All rights reserved. Versus other NoSQL storage • Apache Phoenix: OLTP SQL engine built on HBase • 10 node cluster (9 worker, 1 master) • TPC-H LINEITEM table only (6B rows)
  • 36. 36© Cloudera, Inc. All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu Getting Data from Kafka into Kudu
  • 37. 37© Cloudera, Inc. All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu • Kafka-Flume source/channel + Kudu-Flume sink Getting Data from Kafka into Kudu
  • 38. 38© Cloudera, Inc. All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu • Kafka-Flume source/channel + Kudu-Flume sink • Kafka connect Getting Data from Kafka into Kudu
  • 39. 39© Cloudera, Inc. All rights reserved. Kafka + Kudu: A low latency data visibility path • Upstream application pushes data to Kafka • Kafka then acts as a buffer in order to handle backpressure from Kudu • The Kafka Connect plugin pushes data to Kudu as it becomes available • As soon as the data is ingested into Kudu, it becomes available
  • 40. 40© Cloudera, Inc. All rights reserved. Tradeoffs What if I need a zero-latency path? • Possible to write to Kudu directly using the NoSQL API • However, the app will need to tolerate queueing and backpressure itself What if I want to store unstructured data or large binary blobs? • Consider using Kafka + HBase instead of Kudu •But you won’t get the same SQL query performance
  • 41. 41© Cloudera, Inc. All rights reserved. Demo Data Sources Kafka Kudu (optional) Impala or SparkSQL
  • 42. 42© Cloudera, Inc. All rights reserved. About the Kudu project •Apache Software Foundation incubating project •Latest version 0.8.0 (beta) released in April •Plans are for a 1.0 version to be released in August •Web site: getkudu.io (also kudu.incubator.apache.org soon) •Slack chat room for devs and users (auto-invite): getkudu-slack.herokuapp.com •Twitter handle: @ApacheKudu •Code: github.com/apache/incubator-kudu Want to hear more about Kudu and Spark? • Come to the Vancouver Spark meetup tonight here at the Hyatt at 6pm • More info: www.meetup.com/Vancouver-Spark/
  • 43. 43© Cloudera, Inc. All rights reserved. Thank you
  • 44. 44© Cloudera, Inc. All rights reserved. Questions? Mike Percy | @mike_percy mpercy@cloudera.com Ashish Singh | @singhasdev asingh@cloudera.com