1© Cloudera, Inc. All rights reserved.
Michael Crutcher
Director, Product Management - Storage
Lambda Architecture
Agenda
• Big Data Challenges
• What is Lambda?
• Lambda Advantages and Disadvantages
• Kudu as a Lambda alternative
Big Data Challenges
“Something interesting is happening”
The world’s largest taxi company owns ZERO vehicles.
The world’s largest accommodation provider owns ZERO real estate.
The world’s most popular media owner creates ZERO content.
The world’s leading music platform owns no music.
Data is now a strategic asset
• Instrumentation: Today, everything that can be measured will be measured.
• Consumerization: Today, data IS the application.
• Experimentation: Today, becoming data-driven is a business imperative.
“It will soon be technically feasible & affordable to record & store everything…”
-- New York Times
“Digital technologies will, in the near future, accomplish many tasks once considered uniquely human.”
-- Second Machine Age
Data is abundant, diverse & shared freely
As is how we store, process and analyze it
Streaming, Machine Learning, BI, ETL, Modeling
The new analytics paradigm
• Determine what happened
• Understand why it happened
• Change what happens next
• Make it happen consistently
So Why Big Data?
Better Business Forecasting
What does the reporting look like at your business today? What if it could happen in half the time, or half that time?
Better Views of Customers
What data are you looking at? What data do you want to know about your customers? How can you best use external data?
Full Fidelity Data Access
Too often data is archived, combined, or simplified to save space and reduce strain on systems. Once data is combined we lose the ability to dig deeper.
What is Lambda Architecture?
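[Diagram: Lambda architecture]
• New Data feeds both the Batch Layer and the Speed Layer
• Batch Layer (Hadoop): Data Lake (HDFS) → Batch Recompute → Precompute Views
• Speed Layer (Storm/Spark): Stream or Micro Batch → “Real-time” Increment → Increment Views
• Serving Layer (HBase, Impala): Merge → Data Application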
Batch Layer
• Manages the master data set, an immutable, append-only set of raw data
• Pre-computes views of the data
• “Traditionally” this has been in HDFS and processed with MapReduce
• There has already been some shift to cloud-based object storage and processing
in other frameworks like Spark
Speed Layer
• This layer ingests streaming data or micro-batches
• Spark and Storm are traditionally used
• In some cases micro-batches are directly ingested into NoSQL data stores like
HBase
• This data is periodically expunged
• In many “Lambda-like” architectures I’ve seen, this layer is used to provide an
“active partition” that provides a limited window of mutability
Serving Layer
• As you might guess from the name, this is the layer that serves data
• It would be unusual for raw data to be served directly
• This could be an application written directly against a data store like HBase
• It could be a SQL engine on top of a file system; Impala + Parquet is an example
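The three layers described above can be sketched in a few lines of Python. This is a toy illustration rather than any product's API: the event shape and the names `batch_view`, `speed_view`, and `serve` are hypothetical, and in-memory collections stand in for HDFS, the stream processor, and the serving store.

```python
from collections import Counter

# Hypothetical raw events: (timestamp, key) pairs, e.g. page views per user.
master_data = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]  # already in the batch layer
recent_data = [(5, "a"), (6, "b")]                       # arrived since the last batch run


def batch_view(events):
    """Batch layer: recompute the view from the full, immutable master data set."""
    return Counter(key for _, key in events)


def speed_view(view, event):
    """Speed layer: incrementally update a small view as each new event arrives."""
    _, key = event
    view[key] += 1
    return view


def serve(batch, speed, key):
    """Serving layer: merge the precomputed batch view with the real-time increment."""
    return batch.get(key, 0) + speed.get(key, 0)


batch = batch_view(master_data)
speed = Counter()
for event in recent_data:
    speed_view(speed, event)

print(serve(batch, speed, "a"))  # 3: two events from the batch view plus one from the speed view
```

Even in this sketch the counting logic exists twice, once in the batch path and once in the speed path.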
What is a Kappa Architecture?
[Diagram: the same layer diagram as “What is Lambda Architecture?”, with the Batch, Speed, and Serving layers and the labels Hadoop, Storm/Spark, HBase, and Impala]
Everything Has a New Name
[Diagram: the same Lambda layer diagram (Batch, Speed, and Serving layers), annotated with older names for the same roles:]
• System of Record (OLTP)
• Operational Data Store
• Derived Tables (EDW)
• In-Memory Database
• Star/Snowflake, Cubes, or In-Memory Tables
The Log as Storage
• Lambda and Kappa architectures are both predicated on immutable source data
• Data can be modeled as a series of events about entities, each recorded at a
specific point in time
• Updates are modeled as new events, and the current or historic value associated
with an entity can be reconstructed from the collected events
• Kappa calls this ordered set of events “a log”; it’s safe to say they didn’t invent
this term
[Figure: a log of ten events, A B C D B E F A G B, at positions 1 through 10, ordered over time]
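As a rough sketch of how a current or historic value can be reconstructed from such a log, the following replays events in order. The entity sequence matches the figure; the `Event` type, the payload values such as `"b2"`, and the `value_as_of` helper are hypothetical and exist only for illustration.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    position: int   # order in the log (1 = oldest)
    entity: str     # which entity the event is about
    value: str      # the new value carried by the event (made-up payloads)


# The log is append-only: an "update" to entity B is just another event for B.
log = [
    Event(1, "A", "a1"), Event(2, "B", "b1"), Event(3, "C", "c1"),
    Event(4, "D", "d1"), Event(5, "B", "b2"), Event(6, "E", "e1"),
    Event(7, "F", "f1"), Event(8, "A", "a2"), Event(9, "G", "g1"),
    Event(10, "B", "b3"),
]


def value_as_of(log, entity: str, position: int) -> Optional[str]:
    """Return the entity's value as of a log position by replaying events in order."""
    current = None
    for event in log:
        if event.position > position:
            break
        if event.entity == entity:
            current = event.value
    return current


print(value_as_of(log, "B", 10))  # b3 (current value)
print(value_as_of(log, "B", 6))   # b2 (historic value)
print(value_as_of(log, "D", 3))   # None (D has no event yet)
```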
Is Raw Data the Right Logical Model?
• It’s possible to derive many higher-level logical abstractions from raw data
• As an example, I could construct a customer account balance from raw account
activity data
• This doesn’t mean it’s a good idea
Account Activity (ordered over time, t0 to t12)
Customer:  A     B     C     B     A     C     B     A     A     C
Amount:   +$10  +$20  +$15  -$10  +$35  -$5   +$25  +$15  -$20  +$10
Easy:
What was the last account event for Customer C?
Harder:
What is the account balance for Customer A at t12?
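The difference between the two questions can be made concrete with a small sketch over the account activity shown on this slide. The function names are hypothetical, the list index stands in for event time because the slide's exact timestamps are not recoverable, and an opening balance of zero is assumed.

```python
# Account activity from the slide, in log order. The list index stands in for
# event time (an assumption), and every account is assumed to start at $0.
activity = [
    ("A", 10), ("B", 20), ("C", 15), ("B", -10), ("A", 35),
    ("C", -5), ("B", 25), ("A", 15), ("A", -20), ("C", 10),
]


def last_event(activity, customer):
    """Easy: scan backwards and stop at the first match."""
    for index in range(len(activity) - 1, -1, -1):
        if activity[index][0] == customer:
            return index, activity[index][1]
    return None


def balance(activity, customer):
    """Harder: every event for the customer must be read and summed."""
    return sum(amount for name, amount in activity if name == customer)


print(last_event(activity, "C"))  # (9, 10): the final event, a $10 credit
print(balance(activity, "A"))     # 40 = 10 + 35 + 15 - 20
```

The balance query touches the customer's entire history, and its cost grows with the length of the log unless a derived view is maintained.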
There are Only Two Hard CS Problems
1) Cache invalidation
2) Naming things
-- Phil Karlton
Data Engineering has one hard problem
• When should I denormalize to maximize performance?
• When should I normalize to minimize maintenance problems?
[Diagram: a cycle] “I wish things were faster!” → Denormalize Everything! → “I wish things were easier to maintain!” → Normalize Everything! → (repeat)
Lambda Advantages and Disadvantages
Lambda Advantages
• Marries diverse strengths of existing open source software into a unified
architecture
• Provides scalability via the batch layer
• Provides real-time performance via the speed layer
Lambda Disadvantages
• Complexity
• Many moving parts
• Restatement is difficult
• Two code bases must be kept in sync
• Proper failure handling is complex
Lambda Complexity
[Diagram: the same Lambda layer diagram (Batch, Speed, and Serving layers; Hadoop, Storm/Spark, HBase, Impala), with two callouts:]
• Code must be kept in sync
• Restatement is difficult
Lambda Complexity
[Diagram: the same Lambda layer diagram, captioned “Hmm… this data looks fishy”. One component is marked “Problem Here?” and six others are marked “Here?”]
The Log as Storage
• The idea of representing data as immutable log information is not new and is not
without tradeoffs:
• Space amplification: how many bytes of data are stored, relative to how many
logical bytes the database contains
• Write amplification: how many bytes of data are written by the database
compared to the number of bytes changed by the user
• Read amplification: how many bytes the database has to physically read to
return values to the user compared to the bytes returned
• Complexity: am I solving a CS problem or a customer problem?
• These are not simple issues and there’s no straightforward “right” answer
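The three amplification factors above are all ratios of physical work to logical work. A back-of-the-envelope sketch, using made-up numbers for an append-only log that keeps every version of a record, might look like this (the `amplification` helper and all sizes are hypothetical):

```python
def amplification(physical_bytes: int, logical_bytes: int) -> float:
    """Ratio of bytes the storage engine handles to bytes the user cares about."""
    return physical_bytes / logical_bytes


# Hypothetical numbers for an append-only log that keeps every version of a record.
record_size = 100    # bytes per version of one record
versions_kept = 5    # the record was "updated" four times, so five events exist

# Space: five versions are stored, but the database logically holds one record.
space_amp = amplification(record_size * versions_kept, record_size)

# Read: without a precomputed view, all five versions are read to return the latest one.
read_amp = amplification(record_size * versions_kept, record_size)

# Write: a user changes 10 bytes of the record, but a whole new 100-byte version is appended.
write_amp = amplification(record_size, 10)

print(space_amp, read_amp, write_amp)  # 5.0 5.0 10.0
```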
Premature Optimization
Programmers waste enormous amounts of time thinking about, or worrying about,
the speed of noncritical parts of their programs, and these attempts at efficiency
actually have a strong negative impact when debugging and maintenance are
considered. We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.
--Donald Knuth
Gap Filling vs. Optimization
• Some Lambda implementations are deployed on big data systems that don’t require
significant optimization to deliver desired SLAs
• Often, Lambda architectures are used to bridge the very stark difference in workload
processing capabilities between the technologies typically used for the batch layer (long
scans) and the fast layer (quick point lookups)
• Anecdotally, Lambda architectures seem to be deployed much more often with current
generation open source technology than they were with legacy commercial offerings
• Part of this is because of the data volume, variety, and velocity of our increasingly
data-driven world, but I think part of it is also because legacy technologies haven’t had
as stark a difference in which workloads they’re optimal for
• Are you deploying a Lambda architecture because you need to squeeze out all of the
performance possible, or because you have a mixed workload that can’t be deployed on
one single storage technology?
Gap Filling v2: Lack of Mutability
• Some Lambda implementations aim to fill the gap left by
the lack of mutability in HDFS
• Raw, master data should be immutable, but in the real
world raw data may need to be adjusted
• Sensors could have been miscalibrated, data may have
been incorrectly entered, raw data might be an
approximation before finalization, etc.
• Derived aggregations might be more efficiently modified
in place than recalculated from raw data; recalculating
all of history is often not practical
[Diagram labels: Incoming Data (Messaging System); New Partition; Most Recent Partition (HBase); Historic Data (Parquet File); Reporting Request; Impala on HDFS]
Steps annotated on the diagram:
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file
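The pattern on this slide, a mutable recent partition in front of immutable historic partitions, can be modeled with a toy class. This is only a sketch under stated assumptions: `ActivePartitionStore` and its methods are hypothetical, a dictionary stands in for the mutable store, tuples stand in for immutable files, and the step of waiting for running operations to complete is not modeled.

```python
class ActivePartitionStore:
    """Toy model: a mutable recent partition plus immutable historic partitions."""

    def __init__(self):
        self.recent = {}     # stands in for the mutable store (HBase on the slide)
        self.historic = []   # stands in for immutable files (Parquet on the slide)

    def upsert(self, key, value):
        """Writes and corrections only ever touch the most recent partition."""
        self.recent[key] = value

    def roll_over(self):
        """Freeze the recent partition into a new immutable historic partition."""
        frozen = tuple(sorted(self.recent.items()))
        self.historic.append(frozen)  # analogous to registering the new partition
        self.recent = {}

    def scan(self):
        """A reporting request reads the historic partitions and the recent partition."""
        for partition in self.historic:
            yield from partition
        yield from sorted(self.recent.items())


store = ActivePartitionStore()
store.upsert("sensor-1", 20.5)
store.upsert("sensor-1", 21.0)  # correction while the data is still mutable
store.roll_over()
store.upsert("sensor-2", 19.0)

print(list(store.scan()))  # [('sensor-1', 21.0), ('sensor-2', 19.0)]
```

Once `roll_over` has run, a correction to `sensor-1` can no longer be applied in place, which is the limited window of mutability this design accepts.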
Kudu as a Lambda alternative
Kudu: Fast Analytics on Fast-Changing Data
New storage engine enables new Hadoop use cases
[Chart: Pace of Data (Unchanging → Fast Changing, Frequent Updates) vs. Pace of Analysis (Append-Only → Real-Time)]
• HDFS: Fast Scans, Analytics and Processing of Stored Data; Arbitrary Storage (Active Archive)
• HBase: Fast On-Line Updates & Data Serving
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)
• Analytic Gap: Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS
• Kudu fills the Gap
Kudu Increases the Value of Time Series Data
Time series data is most valuable if you can analyze it to change outcomes in real time.
Kudu simultaneously enables:
• Time series data inserted/updated as it arrives
• Analytic scans to find trends on fresh time series data
• Lookups to quickly visit the point in time where an event occurred for further investigation
Use case: Time Series
Workload: Inserts, updates, scans, lookups
Examples: Stream market data, fraud detection & prevention, risk monitoring
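To make the three access patterns on this slide concrete (insert/update on arrival, analytic scans, and point lookups against one store), here is a toy in-memory table kept sorted by a composite key. It is not the Kudu API: `TimeSeriesTable`, its methods, and the `ticker-*` series names are all hypothetical.

```python
import bisect


class TimeSeriesTable:
    """Toy in-memory table sorted by (series, timestamp); not the Kudu API."""

    def __init__(self):
        self._keys = []   # sorted list of (series, timestamp) primary keys
        self._rows = {}   # primary key -> value

    def upsert(self, series, timestamp, value):
        """Insert a new point, or update it in place if the key already exists."""
        key = (series, timestamp)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def lookup(self, series, timestamp):
        """Point lookup by primary key."""
        return self._rows.get((series, timestamp))

    def scan(self, series, start, end):
        """Range scan over one series for start <= timestamp < end."""
        lo = bisect.bisect_left(self._keys, (series, start))
        hi = bisect.bisect_left(self._keys, (series, end))
        return [(key[1], self._rows[key]) for key in self._keys[lo:hi]]


table = TimeSeriesTable()
table.upsert("ticker-1", 1, 100.0)
table.upsert("ticker-1", 2, 101.5)
table.upsert("ticker-2", 1, 50.0)
table.upsert("ticker-1", 2, 101.0)   # late correction to an existing point

print(table.lookup("ticker-1", 2))   # 101.0
print(table.scan("ticker-1", 1, 3))  # [(1, 100.0), (2, 101.0)]
```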
Kudu Keeps Your Business Operational
Kudu can help spot problems before they happen. Real-time data inserts, combined with the ability to analyze trends, help identify potential problems.
Kudu identifies trouble through:
• Extreme scale, allowing better historic trend analysis
• Fast inserts to enable an up-to-date view of your business
• Fast scans to identify/flag undesired states for remedy
Use case: Machine Data Analytics
Workload: Inserts, scans, lookups
Examples: Network threat detection, IoT, predictive maintenance and failure detection
More Versatility in Online Reporting
Online reporting has traditionally been limited by data volume and analytic capability, keeping only recent data designed for granular queries.
Kudu adds online reporting versatility through:
• Fast inserts and updates to keep data fresh
• Fast lookups and analytic scans in one data store
Use case: Online Reporting
Workload: Inserts, updates, scans, lookups
Examples: “Active” Reporting
Xiaomi use case
• World’s 4th largest smart-phone maker (most popular in China)
• Gather important RPC tracing events from mobile app and backend service.
• Service monitoring & troubleshooting tool.
High write throughput
• >5 Billion records/day and growing
Query latest data and quick response
• Identify and resolve issues quickly
Can search for individual records
• Easy for troubleshooting
Xiaomi big data analytics pipeline
Large ETL pipeline delays
● High data visibility latency
(from 1 hour up to 1 day)
● Data format conversion woes
Ordering issues
● Log arrival (storage) not
exactly in correct order
● Must read 2 – 3 days of data
to get all of the data points
for a single day
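The ordering issue can be illustrated with a small sketch: when records are stored by arrival date but queried by event date, a single day's data is spread across several partitions. The records, dates, and the `events_for_day` helper below are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (event_date, arrival_date, payload). Logs can arrive late,
# so a record is stored under its arrival date rather than its event date.
records = [
    (date(2016, 5, 1), date(2016, 5, 1), "r1"),
    (date(2016, 5, 1), date(2016, 5, 2), "r2"),  # arrived one day late
    (date(2016, 5, 1), date(2016, 5, 3), "r3"),  # arrived two days late
    (date(2016, 5, 2), date(2016, 5, 2), "r4"),
]

# Storage is partitioned by arrival date.
partitions = defaultdict(list)
for event_date, arrival_date, payload in records:
    partitions[arrival_date].append((event_date, payload))


def events_for_day(day, partitions_to_read):
    """Collect all records whose event date is `day` from the given partitions."""
    return [
        payload
        for partition_date in partitions_to_read
        for event_date, payload in partitions[partition_date]
        if event_date == day
    ]


target = date(2016, 5, 1)
print(events_for_day(target, [date(2016, 5, 1)]))  # ['r1']: incomplete
print(events_for_day(target, [date(2016, 5, 1), date(2016, 5, 2), date(2016, 5, 3)]))
# ['r1', 'r2', 'r3']: complete only after reading three days of partitions
```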
Xiaomi big data analytics pipeline
Simplified with Kudu
Low latency ETL pipeline
● ~10s data latency
● For apps that need to avoid
direct backpressure or need
ETL for record enrichment
Direct zero-latency path
● For apps that can tolerate
backpressure and can use the
NoSQL APIs
● Apps that don’t need ETL
enrichment for storage /
retrieval
OLAP scan
Side table lookup
Result store
Conclusions
• Lambda has a real place in big data architectures
• Optimize as needed, but beware of the cost of premature optimization
• Kudu is designed to be a simple solution for when you need a data store that’s
updatable and provides “good enough” performance for analytic and real-time
workloads simultaneously

Moving Beyond Lambda Architectures with Apache Kudu

  • 1. 1© Cloudera, Inc. All rights reserved. Michael Crutcher Director, Product Management - Storage Lambda Architecture
  • 2. Agenda • Big Data Challenges • What is Lambda? • Lambda Advantages and Disadvantages • Kudu as a Lambda alternative
  • 3. Big Data Challenges
  • 4. “Something interesting is happening” The world’s largest taxi company owns ZERO vehicles. The world’s largest accommodation provider owns ZERO real estate. The world’s most popular media owner creates ZERO content. The world’s leading music platform owns no music.
  • 5. Data is now a strategic asset • Instrumentation: Today, everything that can be measured will be measured. • Consumerization: Today, data IS the application. • Experimentation: Today, becoming data-driven is a business imperative.
  • 6. “It will soon be technically feasible & affordable to record & store everything…” -- New York Times “Digital technologies will, in the near future, accomplish many tasks once considered uniquely human.” -- Second Machine Age Data is abundant, diverse & shared freely, as is how we store, process and analyze it: Streaming, Machine Learning, BI, ETL, Modeling
  • 7. The new analytics paradigm • Understand why it happened • Change what happens next • Determine what happened • Make it happen consistently
  • 8. So Why Big Data? • Better Business Forecasting: What does the reporting look like at your business today? What if it could happen in half the time, or half that time? • Better Views of Customers: What data are you looking at? What data do you want to know about your customers? How can you best use external data? • Full Fidelity Data Access: Too often data is archived, combined, or simplified to save space and strain on systems. Once data is combined we lose the ability to dig deeper.
  • 9. What is Lambda architecture?
  • 10. What is Lambda Architecture? [Diagram: New Data feeds a Batch Layer (Data Lake on HDFS; Batch Recompute into Precompute Views; Hadoop) and a Speed Layer (Stream or Micro Batch; “Real-time” Increment into Increment Views; Storm/Spark). A Serving Layer (HBase, Impala) merges both sets of views for the Data Application.]
  • 11. Batch Layer • Manages the master data set, an immutable, append-only set of raw data • Pre-computes views of the data • “Traditionally” this has been in HDFS and processed with MapReduce • There has already been some shift to cloud-based object storage and to processing in other frameworks like Spark
  • 12. Speed Layer • This layer ingests streaming data or micro-batches • Spark and Storm are traditionally used • In some cases micro-batches are directly ingested into NoSQL data stores like HBase • This data is periodically expunged • In many “Lambda-like” architectures I’ve seen, this layer is used to provide an “active partition” that provides a limited window of mutability
  • 13. Serving Layer • As you might guess from the name, this is the layer that serves data • It would be unusual for raw data to be served directly • This could be an application written directly against a data store like HBase • It could be a SQL engine on top of a file system; Impala + Parquet is an example
  • 14. What is a Kappa Architecture? [Diagram with the same labels as the Lambda diagram on slide 10: Batch Layer, Speed Layer, Serving Layer, New Data, Data Lake (HDFS), Precompute Views, Stream or Micro Batch, Increment Views, Data Application, “Real-time” Increment, Batch Recompute, Merge, Hadoop, Storm/Spark, HBase, Impala.]
  • 15. Everything Has a New Name [Diagram: the Lambda diagram labels from slide 10, annotated with traditional names: System of Record (OLTP); Operational Data Store; Derived Tables (EDW); In-Memory Database; Star/Snowflake, Cubes, or In-Memory Tables.]
  • 16. The Log as Storage • Lambda and Kappa architectures are both predicated on immutable source data • Data can be modeled as a series of events recorded at specific points in time about entities • Updates are modeled as new events, and the current or historic value associated with an entity can be reconstructed through the collected events • Kappa calls this ordered set of events “a log”; it’s safe to say they didn’t invent this term [Diagram: log of events A B C D B E F A G B at positions 1 to 10, ordered over time.]
  • 17. Is Raw Data the Right Logical Model? • It’s possible to derive many higher level logical abstractions from raw data • As an example, I could construct a customer account balance from raw account activity data • This doesn’t mean it’s a good idea [Diagram: account activity from t0 to t12 for customers A B C B A C B A A C, with amounts +$10 +$20 +$15 -$10 +$35 -$5 +$25 +$15 -$20 +$10.] Easy: What was the last account event for Customer C? Harder: What is the account balance for Customer A at t12?
  • 18. There are Only Two Hard CS Problems: 1) Cache invalidation 2) Naming things -- Phil Karlton
  • 19. Data Engineering has one hard problem • When should I denormalize to maximize performance? • When should I normalize to minimize maintenance problems? Denormalize Everything! Normalize Everything! I wish things were faster! I wish things were easier to maintain!
  • 20. Lambda Advantages and Disadvantages
  • 21. Lambda Advantages • Marries diverse strengths of existing open source software into a unified architecture • Provides scalability via the batch layer • Provides real time performance via the speed layer
  • 22. Lambda Disadvantages • Complexity • Many moving parts • Restatement is difficult • Two code bases must be kept in sync • Proper failure handling is complex
  • 23. Lambda Complexity [Diagram: the Lambda diagram from slide 10 with two callouts: “Code must be kept in sync” and “Restatement is difficult”.]
  • 24. Lambda Complexity [Diagram: the Lambda diagram from slide 10 with the callout “Hmm… this data looks fishy” and the question “Problem Here?” followed by six more “Here?” markers.]
  • 25. The Log as Storage • The idea of representing data as immutable log information is not new and is not without tradeoffs: • Space amplification: how many bytes of data are stored, relative to how many logical bytes the database contains • Write amplification: how many bytes of data are written by the database compared to the number of bytes changed by the user • Read amplification: how many bytes the database has to physically read to return values to the user compared to the bytes returned • Complexity: am I solving a CS problem or a customer problem? • These are not simple issues and there’s no straightforward “right” answer
  • 26. Premature Optimization Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. -- Donald Knuth
  • 27. Gap Filling vs. Optimization • Some Lambda implementations are deployed on big data systems that don’t require significant optimization to deliver desired SLAs • Often, Lambda architectures are used to fill the very stark difference in workload processing capabilities between the technologies typically used for the batch layer (long scans) and the fast layer (quick point lookups) • Anecdotally, Lambda architectures seem to be deployed much more often with current generation open source technology than they were with legacy commercial offerings • Part of this is because of the data volume, variety, and velocity caused by our increasingly data-driven world, but I think part of this is also because legacy technologies haven’t had as stark a difference in what workloads they’re optimal for • Are you deploying a Lambda architecture because you need to squeeze out all of the performance possible, or because you have a mixed workload that can’t be deployed on one single storage technology?
  • 28. Gap Filling v2: Lack of Mutability • Some Lambda implementations aim to fill the gap left by the lack of mutability in HDFS • Raw, master data should be immutable, but in the real world raw data could potentially need to be adjusted • Sensors could have been miscalibrated, data may have been incorrectly entered, raw data might be an approximation before finalization, etc. • Derived aggregations might be more efficiently modified in place than recalculated from raw data; recalculating all of history is often not practically possible [Diagram labels: Incoming Data (Messaging System); New Partition; Most Recent Partition; Historic Data; HBase; Parquet File; Reporting Request; Impala on HDFS. Steps shown: • Wait for running operations to complete • Define new Impala partition referencing the newly written Parquet file]
  • 29. Kudu as a Lambda alternative
  • 30. Kudu: Fast Analytics on Fast-Changing Data. New storage engine enables new Hadoop use cases. Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS. [Diagram labels: HDFS; HBase; Kudu; Fast Scans, Analytics and Processing of Stored Data; Fast On-Line Updates & Data Serving; Arbitrary Storage (Active Archive); Fast Analytics (on fast-changing or frequently-updated data); Unchanging; Fast Changing; Frequent Updates; Append-Only; Real-Time; Analytic Gap; Kudu fills the Gap. Axes: Pace of Analysis, Pace of Data.]
  • 31. Kudu Increases the Value of Time Series Data. Time Series workload: inserts, updates, scans, lookups. Examples: stream market data, fraud detection & prevention, risk monitoring. Time series data is most valuable if you can analyze it to change outcomes in real time. Kudu simultaneously enables: • Time series data inserted/updated as it arrives • Analytic scans to find trends on fresh time series data • Lookups to quickly visit the point in time where an event occurred for further investigation
  • 32. Kudu Keeps Your Business Operational. Machine Data Analytics workload: inserts, scans, lookups. Examples: network threat detection, IoT, predictive maintenance and failure detection. Kudu can help spot problems before they happen. Real-time data inserts with the ability to analyze trends identify potential problems. Kudu identifies trouble through: • Extreme scale, allowing better historic trend analysis • Fast inserts to enable an up-to-date view of your business • Fast scans to identify/flag undesired states for remedy
  • 33. More Versatility in Online Reporting. Online Reporting workload: inserts, updates, scans, lookups. Examples: “Active” Reporting. Online reporting has traditionally been limited by data volume and analytic capability, keeping only recent data designed for granular queries. Kudu adds online reporting versatility through: • Fast inserts and updates to keep data fresh • Fast lookups and analytic scans in one data store
  • 34. Xiaomi use case • World’s 4th largest smart-phone maker (most popular in China) • Gather important RPC tracing events from mobile app and backend service. • Service monitoring & troubleshooting tool. High write throughput • >5 Billion records/day and growing Query latest data and quick response • Identify and resolve issues quickly Can search for individual records • Easy for troubleshooting
  • 35. Xiaomi big data analytics pipeline Large ETL pipeline delays ● High data visibility latency (from 1 hour up to 1 day) ● Data format conversion woes Ordering issues ● Log arrival (storage) not exactly in correct order ● Must read 2 – 3 days of data to get all of the data points for a single day
  • 36. Xiaomi big data analytics pipeline Simplified with Kudu Low latency ETL pipeline ● ~10s data latency ● For apps that need to avoid direct backpressure or need ETL for record enrichment Direct zero-latency path ● For apps that can tolerate backpressure and can use the NoSQL APIs ● Apps that don’t need ETL enrichment for storage / retrieval OLAP scan Side table lookup Result store
  • 37. Conclusions • Lambda has a real place in big data architectures • Optimize as needed, but beware of the cost of premature optimization • Kudu is designed to be a simple solution for when you need a data store that’s updatable and provides “good enough” performance for analytic and real time workloads simultaneously
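
The account-activity example on slide 17 can be sketched in a few lines of Python. This is only an illustration, not code from the deck: the `Event` class, the function names, and the integer log positions 0 to 9 are assumptions (the slide marks only t0 and t12), and pairing each customer letter with an amount by position is a reading of the flattened slide text. It shows why the "easy" question needs a short backward scan while the "harder" one replays the log, and what a precomputed view buys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: int         # hypothetical log position; the slide only labels t0 and t12
    customer: str
    amount: int    # signed dollars


# Customers and amounts in slide order, paired by position (an assumption).
CUSTOMERS = "ABCBACBAAC"
AMOUNTS = [10, 20, 15, -10, 35, -5, 25, 15, -20, 10]
LOG = [Event(t, c, a) for t, (c, a) in enumerate(zip(CUSTOMERS, AMOUNTS))]


def last_event(log, customer):
    """Easy: walk backwards from the tail until the customer appears."""
    for event in reversed(log):
        if event.customer == customer:
            return event
    return None


def balance_at(log, customer, t):
    """Harder: replay the log up to time t.

    Returns (balance, events_scanned) so the read cost is visible.
    """
    balance = 0
    scanned = 0
    for event in log:
        if event.t > t:
            break
        scanned += 1
        if event.customer == customer:
            balance += event.amount
    return balance, scanned


def build_balance_view(log):
    """Precompute current balances once, so later reads are a single lookup."""
    view = {}
    for event in log:
        view[event.customer] = view.get(event.customer, 0) + event.amount
    return view


if __name__ == "__main__":
    print(last_event(LOG, "C"))
    print(balance_at(LOG, "A", 12))
    print(build_balance_view(LOG))
```

Answering the balance question from raw events touches every event in the log, which is the read amplification described on slide 25; the precomputed view trades that for extra storage and the work of keeping the view current.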