Suche senden
Hochladen
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
•
Als PPTX, PDF herunterladen
•
15 gefällt mir
•
9,913 views
Mike Percy
Folgen
Presentation given at the Apache: Big Data 2016 conference in Vancouver
Weniger lesen
Mehr lesen
Technologie
Melden
Teilen
Melden
Teilen
1 von 44
Jetzt herunterladen
Empfohlen
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
A Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
Strata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Manish Maheshwari
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Cloudera, Inc.
The Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
Introduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
Spark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
Empfohlen
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
Cloudera, Inc.
A Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
Strata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
Manish Maheshwari
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
Cloudera, Inc.
The Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
Introduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
Spark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
Kudu Deep-Dive
Kudu Deep-Dive
Supriya Sahay
Query Compilation in Impala
Query Compilation in Impala
Cloudera, Inc.
Hive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
Cloudera Impala Internals
Cloudera Impala Internals
David Groozman
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
Cloudera, Inc.
Apache Kudu: Technical Deep Dive
Apache Kudu: Technical Deep Dive
Cloudera, Inc.
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
Apache kudu
Apache kudu
Asim Jalis
Memory Management in Apache Spark
Memory Management in Apache Spark
Databricks
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
Optimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
Yue Chen
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
Weitere ähnliche Inhalte
Was ist angesagt?
Kudu Deep-Dive
Kudu Deep-Dive
Supriya Sahay
Query Compilation in Impala
Query Compilation in Impala
Cloudera, Inc.
Hive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
Cloudera Impala Internals
Cloudera Impala Internals
David Groozman
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
Cloudera, Inc.
Apache Kudu: Technical Deep Dive
Apache Kudu: Technical Deep Dive
Cloudera, Inc.
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
Apache kudu
Apache kudu
Asim Jalis
Memory Management in Apache Spark
Memory Management in Apache Spark
Databricks
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
Optimizing Hive Queries
Optimizing Hive Queries
Owen O'Malley
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Anant Corporation
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
Yue Chen
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Paris Data Engineers !
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
Was ist angesagt?
(20)
Kudu Deep-Dive
Kudu Deep-Dive
Query Compilation in Impala
Query Compilation in Impala
Hive: Loading Data
Hive: Loading Data
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Cloudera Impala Internals
Cloudera Impala Internals
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
Apache Kudu: Technical Deep Dive
Apache Kudu: Technical Deep Dive
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
Apache kudu
Apache kudu
Memory Management in Apache Spark
Memory Management in Apache Spark
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Optimizing Hive Queries
Optimizing Hive Queries
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Andere mochten auch
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
DataWorks Summit/Hadoop Summit
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
DataWorks Summit
Machine Learning with GraphLab Create
Machine Learning with GraphLab Create
Turi, Inc.
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
sscdotopen
Time Series Analysis with Spark
Time Series Analysis with Spark
Sandy Ryza
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Ryan Bosshart
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
rhatr
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Turi, Inc.
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
Andere mochten auch
(11)
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Hadoop Graph Processing with Apache Giraph
Hadoop Graph Processing with Apache Giraph
Machine Learning with GraphLab Create
Machine Learning with GraphLab Create
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
Time Series Analysis with Spark
Time Series Analysis with Spark
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Ähnlich wie Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
End to End Streaming Architectures
End to End Streaming Architectures
Cloudera, Inc.
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Dataconomy Media
SFHUG Kudu Talk
SFHUG Kudu Talk
Felicia Haggarty
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
Spark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Cloudera, Inc.
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Stefan Lipp
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
Ähnlich wie Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
(20)
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Kudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
End to End Streaming Architectures
End to End Streaming Architectures
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data - Rüdige...
SFHUG Kudu Talk
SFHUG Kudu Talk
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Spark One Platform Webinar
Spark One Platform Webinar
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Kürzlich hochgeladen
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
sammart93
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
wesley chun
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
The Digital Insurer
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
DianaGray10
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
Khem
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
ThousandEyes
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Gabriella Davis
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Maria Levchenko
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
apidays
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Miguel Araújo
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Principled Technologies
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
lior mazor
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Safe Software
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
Radu Cotescu
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Rafal Los
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
Boston Institute of Analytics
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
sudhanshuwaghmare1
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Antenna Manufacturer Coco
Kürzlich hochgeladen
(20)
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
1.
1© Cloudera, Inc.
All rights reserved. Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data Mike Percy, Software Engineer, Cloudera Ashish Singh, Software Engineer, Cloudera
2.
2© Cloudera, Inc.
All rights reserved. The problem • Building a system that supports low-latency streaming and batch workloads simultaneously is hard • Current solutions to this are complicated and error-prone (e.g. lambda arch) • Too much expertise is currently required to make it work
3.
3© Cloudera, Inc.
All rights reserved. Building a near-real time data architecture How can we... 1.Enable data to flow into our query system quickly, reliably, and efficiently 2.Allow for stream-processing of this data if desired 3.Enable batch processing to access up-to-the-second information ...while keeping complexity at a minimum?
4.
4© Cloudera, Inc.
All rights reserved. Problem domains that require stream + batch Credit Card & Monetary Transactions Identify fraudulent transactions as soon as they occur. Transportation & Logistics • Real-time traffic conditions • Tracking fleet and cargo locations and dynamic re-routing to meet SLAs Retail • Real-time in-store Offers and Recommendations. • Email and marketing campaigns based on real- time social trends Consumer Internet, Mobile & E-Commerce Optimize user engagement based on user’s current behavior. Deliver recommendations relevant “in the moment” Manufacturing • Identify equipment failures and react instantly • Perform proactive maintenance. • Identify product quality defects immediately to prevent resource wastage. Security & Surveillance Identify threats and intrusions, both digital and physical, in real- time. Digital Advertising & Marketing Optimize and personalize digital ads based on real-time information. Continuously monitor patient vital stats and proactively identify at-risk patients. Report on this data. Healthcare
5.
5© Cloudera, Inc.
All rights reserved. Agenda • A new low-latency, high throughput architecture for analytics • Building an ingest pipeline • Storing and querying structured data • Design tradeoffs • Demo • Q&A
6.
6© Cloudera, Inc.
All rights reserved. A modern, low-latency analytics architecture Data Sources Kafka Kudu (optional) Impala or SparkSQL
7.
7© Cloudera, Inc.
All rights reserved. “Traditional” real-time analytics in Hadoop Fraud detection in the real world means storage complexity Considerations: ● How do I handle failure during this process? ● How often do I reorganize data streaming in into a format appropriate for reporting? ● When reporting, how do I see data that has not yet been reorganized? ● How do I ensure that important jobs aren’t interrupted by maintenance? New Partition Most Recent Partition Historical Data HBase Parquet File Have we accumulated enough data? Reorganize HBase file into Parquet • Wait for running operations to complete • Define new Impala partition referencing the newly written Parquet file Kafka Reporting Request Storage in HDFS
8.
8© Cloudera, Inc.
All rights reserved. Real-time analytics in Hadoop with Kudu Improvements: ● Fewer systems to operate ● No cron jobs or background processes ● Handle late arrivals or data corrections with ease ● New data available immediately for analytics or operations Historical and Real-time Data Kafka Reporting Request Storage in Kudu
9.
9© Cloudera, Inc.
All rights reserved. Xiaomi use case • World’s 4th largest smart-phone maker (most popular in China) • Gather important RPC tracing events from mobile app and backend service. • Service monitoring & troubleshooting tool. High write throughput • >5 Billion records/day and growing Query latest data and quick response • Identify and resolve issues quickly Can search for individual records • Easy for troubleshooting
10.
10© Cloudera, Inc.
All rights reserved. Xiaomi big data analytics pipeline Before Kudu Large ETL pipeline delays ● High data visibility latency (from 1 hour up to 1 day) ● Data format conversion woes Ordering issues ● Log arrival (storage) not exactly in correct order ● Must read 2 – 3 days of data to get all of the data points for a single day
11.
11© Cloudera, Inc.
All rights reserved. Xiaomi big data analytics pipeline Simplified with Kafka and Kudu Low latency ETL pipeline ● ~10s data latency ● For apps that need to avoid direct backpressure or need ETL for record enrichment Direct zero-latency path ● For apps that can tolerate backpressure and can use the NoSQL APIs ● Apps that don’t need ETL enrichment for storage / retrieval OLAP scan Side table lookup Result store
12.
12© Cloudera, Inc.
All rights reserved. A modern, low-latency analytics architecture Data Sources Kafka Kudu (optional) Impala or SparkSQL
13.
13© Cloudera, Inc.
All rights reserved.
14.
14© Cloudera, Inc.
All rights reserved. Client Backend Data Pipelines Start like this.
15.
15© Cloudera, Inc.
All rights reserved. Client Backend Client Client Client Then we reuse them
16.
16© Cloudera, Inc.
All rights reserved. Client Backend Client Client Client Then we add multiple backends Another Backend
17.
17© Cloudera, Inc.
All rights reserved. Client Backend Client Client Client Then it starts to look like this Another Backend Another Backend Another Backend
18.
18© Cloudera, Inc.
All rights reserved. Client Backend Client Client Client With maybe some of this Another Backend Another Backend Another Backend
19.
19© Cloudera, Inc.
All rights reserved. Adding applications should be easier We need: • Shared infrastructure for sending records • Infrastructure must scale • Set of agreed-upon record schemas
20.
20© Cloudera, Inc.
All rights reserved. Kafka decouples data pipelines Why Kafka 20 Source System Source System Source System Source System Hadoop Security Systems Real-time monitoring Data Warehouse Kafka Producers Broker Consumers
21.
21© Cloudera, Inc.
All rights reserved. About Kafka • Publish/Subscribe Messaging System From LinkedIn • High throughput (100’s of k messages/sec) • Low latency (sub-second to low seconds) • Fault-tolerant (Replicated and Distributed) • Supports Agnostic Messaging • Standardizes format and delivery • Huge community
22.
22© Cloudera, Inc.
All rights reserved. Architecture Producer Consumer Consumer Producers Kafka Cluster Consumers Broker Broker Broker Broker Producer Zookeeper
23.
23© Cloudera, Inc.
All rights reserved.
24.
24© Cloudera, Inc.
All rights reserved. Kudu is a high-performance distributed storage engine Storage for fast (low latency) analytics on fast (high throughput) data • Simplifies the architecture for building analytic applications on changing data • Optimized for fast analytic performance • Natively integrated with the Hadoop ecosystem of components FILESYSTEM HDFS NoSQL HBASE INGEST – SQOOP, FLUME, KAFKA DATA INTEGRATION & STORAGE SECURITY – SENTRY RESOURCE MANAGEMENT – YARN UNIFIED DATA SERVICES BATCH STREAM SQL SEARCH MODEL ONLINE DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS SPARK, HIVE, PIG SPARK IMPALA SOLR SPARK HBASE RELATIONAL KUDU
25.
25© Cloudera, Inc.
All rights reserved. Kudu: Scalable and fast tabular storage Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node Tabular • Store tables like a normal database • Individual record-level access to 100+ billion row tables
26.
26© Cloudera, Inc.
All rights reserved. • High throughput for big scans Goal: Within 2x of Parquet • Low-latency for short accesses Goal: 1ms read/write on SSD • Database-like semantics Initially, single-row atomicity • Relational data model • SQL queries should be natural and easy • Include NoSQL-style scan, insert, and update APIs Kudu design goals
27.
27© Cloudera, Inc.
All rights reserved. Kudu storage system interfaces • A Kudu table has a SQL-like schema • And a finite number of columns (unlike HBase/Cassandra) • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP • Some subset of columns makes up a possibly-composite primary key • Fast ALTER TABLE • Java, C++, and Python NoSQL-style APIs • Insert(), Update(), Delete(), Scan() • Integrations with Kafka, MapReduce, Spark, Flume, and Impala • Apache Drill work-in-progress
28.
28© Cloudera, Inc.
All rights reserved. Kudu use cases Kudu is best for use cases requiring: •Simultaneous combination of sequential and random reads and writes •Minimal to zero data latencies Time series •Examples: Streaming market data, fraud detection / prevention, risk monitoring •Workload: Insert, updates, scans, lookups Machine data analytics •Example: Network threat detection •Workload: Inserts, scans, lookups Online reporting / data warehousing •Example: Operational data store (ODS) •Workload: Inserts, updates, scans, lookups
29.
29© Cloudera, Inc.
All rights reserved. Tables and tablets • Each table is horizontally partitioned into tablets • Range or hash partitioning • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS • Translation: bucketNumber = hashCode(row[‘timestamp’]) % 100 • Each tablet has N replicas (3 or 5), kept consistent with Raft consensus • Tablet servers host tablets on local disk drives
30.
30© Cloudera, Inc.
All rights reserved. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
31.
31© Cloudera, Inc.
All rights reserved. Impala integration • CREATE TABLE … DISTRIBUTE BY HASH(col1) INTO 16 BUCKETS AS SELECT … FROM … • INSERT/UPDATE/DELETE • Optimizations like predicate pushdown, scan parallelism, plans for more on the way
32.
32© Cloudera, Inc.
All rights reserved. Spark DataSource integration sqlContext.load("org.kududb.spark", Map("kudu.table" -> “foo”, "kudu.master" -> “master.example.com”)) .registerTempTable(“mytable”) df = sqlContext.sql( “select col_a, col_b from mytable “ + “where col_c = 123”)
33.
33© Cloudera, Inc.
All rights reserved. TPC-H (analytics benchmark) • 75 server cluster • 12 (spinning) disks each, enough RAM to fit dataset • TPC-H Scale Factor 100 (100GB) • Example query: • SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;
34.
34© Cloudera, Inc.
All rights reserved. • Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
35.
35© Cloudera, Inc.
All rights reserved. Versus other NoSQL storage • Apache Phoenix: OLTP SQL engine built on HBase • 10 node cluster (9 worker, 1 master) • TPC-H LINEITEM table only (6B rows)
36.
36© Cloudera, Inc.
All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu Getting Data from Kafka into Kudu
37.
37© Cloudera, Inc.
All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu • Kafka-Flume source/channel + Kudu-Flume sink Getting Data from Kafka into Kudu
38.
38© Cloudera, Inc.
All rights reserved. • Custom client, i.e., Kafka consumer that writes to Kudu • Kafka-Flume source/channel + Kudu-Flume sink • Kafka connect Getting Data from Kafka into Kudu
39.
39© Cloudera, Inc.
All rights reserved. Kafka + Kudu: A low latency data visibility path • Upstream application pushes data to Kafka • Kafka then acts as a buffer in order to handle backpressure from Kudu • The Kafka Connect plugin pushes data to Kudu as it becomes available • As soon as the data is ingested into Kudu, it becomes available
40.
40© Cloudera, Inc.
All rights reserved. Tradeoffs What if I need a zero-latency path? • Possible to write to Kudu directly using the NoSQL API • However, the app will need to tolerate queueing and backpressure itself What if I want to store unstructured data or large binary blobs? • Consider using Kafka + HBase instead of Kudu •But you won’t get the same SQL query performance
41.
41© Cloudera, Inc.
All rights reserved. Demo Data Sources Kafka Kudu (optional) Impala or SparkSQL
42.
42© Cloudera, Inc.
All rights reserved. About the Kudu project •Apache Software Foundation incubating project •Latest version 0.8.0 (beta) released in April •Plans are for a 1.0 version to be released in August •Web site: getkudu.io (also kudu.incubator.apache.org soon) •Slack chat room for devs and users (auto-invite): getkudu-slack.herokuapp.com •Twitter handle: @ApacheKudu •Code: github.com/apache/incubator-kudu Want to hear more about Kudu and Spark? • Come to the Vancouver Spark meetup tonight here at the Hyatt at 6pm • More info: www.meetup.com/Vancouver-Spark/
43.
43© Cloudera, Inc.
All rights reserved. Thank you
44.
44© Cloudera, Inc.
All rights reserved. Questions? Mike Percy | @mike_percy mpercy@cloudera.com Ashish Singh | @singhasdev asingh@cloudera.com
Jetzt herunterladen