SlideShare ist ein Scribd-Unternehmen logo
1 von 63
Resilient Predictive Data
Pipelines
Sid Anand (@r39132)
QCon London 2016
1
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/agari-resilient-predictive-data-pipes
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
About Me
2
Work [ed | s] @
Maintainer on
Report to
Co-Chair for
airbnb/airflow
Motivation
Why is a Data Pipeline talk in this High
Availability Track?
3
Different Types of Data Pipelines
4
ETL
• used for : loading data related
to business health into a Data
Warehouse
• user-engagement stats (e.g.
social networking)
• product success stats (e.g.
e-commerce)
• audience : Business, BizOps
• downtime? : 1-3 days
Predictive
• used for :
•building recommendation
products (e.g. social
networking, shopping)
•updating fraud prevention
endpoints (e.g. security,
payments, e-commerce)
• audience : Customers
• downtime? : < 1 hour
VS
5
ETL
• used for : loading data into a
Data Warehouse —> Reports
• user-engagement stats (e.g.
social networking)
• product success stats (e.g.
e-commerce)
• audience : Business
• downtime? : 1-3 days
Predictive
• used for :
•building recommendation
products (e.g. social
networking, shopping)
•updating fraud prevention
endpoints (e.g. security,
payments, e-commerce)
• audience : Customers
• downtime? : < 1 hour
VS
Different Types of Data Pipelines
Why Do We Care About Resilience?
6
d4d5d6d7d8
DB
Search
Engines
Recommenders
d1 d2 d3
DB
Why Do We Care About Resilience?
7
d4
d6d7d8
Search
Engines
Recommenders
d1 d2 d3
d5
Why Do We Care About Resilience?
8
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
Why Do We Care About Resilience?
9
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
Any Take-aways?
10
d7d8
DB
Search
Engines
Recommenders
d1 d2 d3 d4
d5
d6
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast
radius
• The bugs can affect customers and a company’s
profits & reputation!
Design Goals
Desirable Qualities of a Resilient Data Pipeline
11
12
• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
13
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
Desirable Qualities of a Resilient
Data Pipeline
14
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Instrumented
15
Instrumentation must reveal SLA metrics at each stage of the pipeline!
What SLA metrics do we care about? Correctness & Timeliness
• Correctness
• No Data Loss
• No Data Corruption
• No Data Duplication
• A Defined Acceptable Staleness of Intermediate Data
• Timeliness
• A late result == a useless result
• Delayed processing of now()’s data may delay the processing of future
data
Instrumented, Monitored, & Alert-
enabled
16
• Instrument
• Instrument Correctness & Timeliness SLA metrics at each stage of the
pipeline
• Monitor
• Continuously monitor that SLA metrics fall within acceptable bounds (i.e.
pre-defined SLAs)
• Alert
• Alert when we miss SLAs
Desirable Qualities of a Resilient
Data Pipeline
17
• Scalable
• Build your pipelines using [infinitely] scalable components
• The scalability of your system is determined by its least-scalable
component
• Available
• Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Quickly Recoverable
18
• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR
Implementation
Using AWS to meet Design Goals
19
SQS
Simple Queue Service
20
SQS - Overview
21
AWS’s low-latency, highly scalable, highly available message queue
Infinitely Scalable Queue (though not FIFO)
Low End-to-end latency (generally sub-second)
Pull-based
How it Works!
Producers publish messages, which can be batched, to an SQS queue
Consumers
consume messages, which can be batched, from the queue
commit message contents to a data store
ACK the messages as a batch
visibility
timer
SQS - Typical Operation Flow
22
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 1: A consumer reads a message from
SQS. This starts a visibility timer!
visibility
timer
SQS - Typical Operation Flow
23
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 2: Consumer persists message
contents to DB
visibility
timer
SQS - Typical Operation Flow
24
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 3: Consumer ACKs message in SQS
visibility
timer
SQS - Time Out Example
25
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 1: A consumer reads a message from
SQS
visibility
timer
SQS - Time Out Example
26
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 2: Consumer attempts persists
message contents to DB
visibility
time out
SQS - Time Out Example
27
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 3: A Visibility Timeout occurs & the
message becomes visible again.
visibility
timer
SQS - Time Out Example
28
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
m1
SQS
Step 4: Another consumer reads and
persists the same message
visibility
timer
SQS - Time Out Example
29
Producer
Producer
Producer
m1m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Step 5: Consumer ACKs message in SQS
SQS - Dead Letter Queue
30
SQS - DLQ
visibility
timer
Producer
Producer
Producer
m2m3m4m5
Consumer
Consumer
Consumer
DB
m1
SQS
Redrive
rule : 2x
m1
SNS
Simple Notification Service
31
SNS - Overview
32
Highly Scalable, Highly Available, Push-based Topic Service
Whereas SQS ensures each message is seen by at least 1 consumer
SNS ensures that each message is seen by every consumer
Reliable Multi-Push
Whereas SQS is pull-based, SNS is push-based
There is no message retention & there is a finite retry count
No Reliable Message Delivery
Can we work around this limitation?
SNS + SQS Design Pattern
33
m1m2
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Reliable
Multi
Push
Reliable
Message
Delivery
SNS + SQS
34
Producer
Producer
Producer
m1m2
Consumer
Consumer
Consumer
DB
m1
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Consumer
Consumer
Consumer
ES
m1
S3 + SNS + SQS Design Pattern
35
m1m2
m1m2
m1m2
SQS Q1
SQS Q2
SNS T1
Reliable
Multi
Push
Transactions
S3
d1
d2
Batch Pipeline Architecture
Putting the Pieces Together
36
Architecture
37
Architectural Elements
38
A Schema-aware Data format for all data (Avro)
The entire pipeline is built from Highly-Available/Highly-Scalable
components
S3, SNS, SQS, ASG, EMR Spark (exception DB)
The pipeline is never blocked because we use a DLQ for messages we
cannot process
We use queue-based auto-scaling to get high on-demand ingest rates
We manage everything with Airflow
Every stage in the pipeline is idempotent
Every stage in the pipeline is instrumented
ASG
Auto Scaling Group
39
ASG - Overview
40
What is it?
A means to automatically scale out/in
clusters to handle variable load/traffic
A means to keep a cluster/service always
up
Fulfills AWS’s pay-per-use promise!
When to use it?
Feed-processing, web traffic load
balancing, zone outage, etc…
ASG - Data Pipeline
41
importer
importer
importer
importer
Importer
ASG
scaleout/in
SQS
DB
ASG : CPU-based
42
Sent
CPU
ACKd/Recvd
CPU-based auto-scaling is
good at scaling in/out to
keep the average CPU
constant
ASG : CPU-based
43
Sent
CPU
Recv
Premature
Scale-in
Premature Scale-in: The CPU drops to noise-levels before all
messages are consumed. This causes scale in to occur while the
last few messages are still being committed resulting in a long time-
to-drain for the queue!
ASG - Queue-based
44
Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0)
Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is
ACK’d)
This causes the
ASG to grow
This causes the
ASG to shrink
Architecture
45
Reliable Hourly Job
Scheduling
Workflow Automation & Scheduling
46
47
Historical Context
Our first cut at the pipeline used cron to schedule hourly runs of Spark
Problem
We only knew if Spark succeeded. What if a downstream task failed?
We needed something smarter than cron that
Reliably managed a graph of tasks (DAG - Directed Acyclic Graph)
Orchestrated hourly runs of that DAG
Retried failed tasks
Tracked the performance of each run and its tasks
Reported on failure/success of runs
Our Needs
Airflow
Workflow Automation & Scheduling
48
Airflow - DAG Dashboard
49
Airflow: It’s easy to manage multiple DAGs
Airflow - Authoring DAGs
50
Airflow: Visualizing a DAG
51
Airflow: Author DAGs in Python! No need to bundle many config files!
Airflow - Authoring DAGs
Airflow - Performance Insights
52
Airflow: Gantt chart view reveals the slowest tasks for a run!
53
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
Airflow - Alerting
54
Airflow: …And easy to integrate with Ops tools!
Airflow - Monitoring
55
56
Airflow - Join the Community
With >30 Companies, >100 Contributors , and >2500 Commits,
Airflow is growing rapidly!
We are looking for more contributors to help support the
community!
Disclaimer : I’m a maintainer on the project
Design Goal Scorecard
Are We Meeting Our Design Goals?
57
58
• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable
Desirable Qualities of a Resilient
Data Pipeline
59
• Scalable
• Build using scalable components from AWS
• SQS, SNS, S3, ASG, EMR Spark
• Exception = DB (WIP)
• Available
• Build using available components from AWS
• Airflow for reliable job scheduling
• Instrumented, Monitored, & Alert-enabled
• Airflow
• Quickly Recoverable
• Airflow, DLQs, ASGs, Spark & DB
Desirable Qualities of a Resilient
Data Pipeline
Questions? (@r39132)
60
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/agari-
resilient-predictive-data-pipes

Weitere ähnliche Inhalte

Mehr von C4Media

Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 

Mehr von C4Media (20)

Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 

Kürzlich hochgeladen

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Resilient Predictive Data Pipelines

  • 1. Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /agari-resilient-predictive-data-pipes
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. About Me 2 Work [ed | s] @ Maintainer on Report to Co-Chair for airbnb/airflow
  • 5. Motivation Why is a Data Pipeline talk in this High Availability Track? 3
  • 6. Different Types of Data Pipelines 4 ETL • used for : loading data related to business health into a Data Warehouse • user-engagement stats (e.g. social networking) • product success stats (e.g. e-commerce) • audience : Business, BizOps • downtime? : 1-3 days Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS
  • 7. 5 ETL • used for : loading data into a Data Warehouse —> Reports • user-engagement stats (e.g. social networking) • product success stats (e.g. e-commerce) • audience : Business • downtime? : 1-3 days Predictive • used for : •building recommendation products (e.g. social networking, shopping) •updating fraud prevention endpoints (e.g. security, payments, e-commerce) • audience : Customers • downtime? : < 1 hour VS Different Types of Data Pipelines
  • 8. Why Do We Care About Resilience? 6 d4d5d6d7d8 DB Search Engines Recommenders d1 d2 d3
  • 9. DB Why Do We Care About Resilience? 7 d4 d6d7d8 Search Engines Recommenders d1 d2 d3 d5
  • 10. Why Do We Care About Resilience? 8 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  • 11. Why Do We Care About Resilience? 9 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6
  • 12. Any Take-aways? 10 d7d8 DB Search Engines Recommenders d1 d2 d3 d4 d5 d6 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • The bugs can affect customers and a company’s profits & reputation!
  • 13. Design Goals Desirable Qualities of a Resilient Data Pipeline 11
  • 14. 12 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 15. 13 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 16. Desirable Qualities of a Resilient Data Pipeline 14 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  • 17. Instrumented 15 Instrumentation must reveal SLA metrics at each stage of the pipeline! What SLA metrics do we care about? Correctness & Timeliness • Correctness • No Data Loss • No Data Corruption • No Data Duplication • A Defined Acceptable Staleness of Intermediate Data • Timeliness • A late result == a useless result • Delayed processing of now()’s data may delay the processing of future data
  • 18. Instrumented, Monitored, & Alert- enabled 16 • Instrument • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline • Monitor • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs) • Alert • Alert when we miss SLAs
  • 19. Desirable Qualities of a Resilient Data Pipeline 17 • Scalable • Build your pipelines using [infinitely] scalable components • The scalability of your system is determined by its least-scalable component • Available • Ditto • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable
  • 20. Quickly Recoverable 18 • Bugs happen! • Bugs in Predictive Data Pipelines have a large blast radius • Optimize for MTTR
  • 21. Implementation Using AWS to meet Design Goals 19
  • 23. SQS - Overview 21 AWS’s low-latency, highly scalable, highly available message queue Infinitely Scalable Queue (though not FIFO) Low End-to-end latency (generally sub-second) Pull-based How it Works! Producers publish messages, which can be batched, to an SQS queue Consumers consume messages, which can be batched, from the queue commit message contents to a data store ACK the messages as a batch
  • 24. visibility timer SQS - Typical Operation Flow 22 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS. This starts a visibility timer!
  • 25. visibility timer SQS - Typical Operation Flow 23 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer persists message contents to DB
  • 26. visibility timer SQS - Typical Operation Flow 24 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: Consumer ACKs message in SQS
  • 27. visibility timer SQS - Time Out Example 25 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 1: A consumer reads a message from SQS
  • 28. visibility timer SQS - Time Out Example 26 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 2: Consumer attempts persists message contents to DB
  • 29. visibility time out SQS - Time Out Example 27 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 3: A Visibility Timeout occurs & the message becomes visible again.
  • 30. visibility timer SQS - Time Out Example 28 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 m1 SQS Step 4: Another consumer reads and persists the same message
  • 31. visibility timer SQS - Time Out Example 29 Producer Producer Producer m1m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Step 5: Consumer ACKs message in SQS
  • 32. SQS - Dead Letter Queue 30 SQS - DLQ visibility timer Producer Producer Producer m2m3m4m5 Consumer Consumer Consumer DB m1 SQS Redrive rule : 2x m1
  • 34. SNS - Overview 32 Highly Scalable, Highly Available, Push-based Topic Service Whereas SQS ensures each message is seen by at least 1 consumer SNS ensures that each message is seen by every consumer Reliable Multi-Push Whereas SQS is pull-based, SNS is push-based There is no message retention & there is a finite retry count No Reliable Message Delivery Can we work around this limitation?
  • 35. SNS + SQS Design Pattern 33 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Reliable Message Delivery
  • 37. S3 + SNS + SQS Design Pattern 35 m1m2 m1m2 m1m2 SQS Q1 SQS Q2 SNS T1 Reliable Multi Push Transactions S3 d1 d2
  • 38. Batch Pipeline Architecture Putting the Pieces Together 36
  • 40. Architectural Elements 38 A Schema-aware Data format for all data (Avro) The entire pipeline is built from Highly-Available/Highly-Scalable components S3, SNS, SQS, ASG, EMR Spark (exception DB) The pipeline is never blocked because we use a DLQ for messages we cannot process We use queue-based auto-scaling to get high on-demand ingest rates We manage everything with Airflow Every stage in the pipeline is idempotent Every stage in the pipeline is instrumented
  • 42. ASG - Overview 40 What is it? A means to automatically scale out/in clusters to handle variable load/traffic A means to keep a cluster/service always up Fulfills AWS’s pay-per-use promise! When to use it? Feed-processing, web traffic load balancing, zone outage, etc…
  • 43. ASG - Data Pipeline 41 importer importer importer importer Importer ASG scaleout/in SQS DB
  • 44. ASG : CPU-based 42 Sent CPU ACKd/Recvd CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant
  • 45. ASG : CPU-based 43 Sent CPU Recv Premature Scale-in Premature Scale-in: The CPU drops to noise-levels before all messages are consumed. This causes scale in to occur while the last few messages are still being committed resulting in a long time- to-drain for the queue!
  • 46. ASG - Queue-based 44 Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0) Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d) This causes the ASG to grow This causes the ASG to shrink
  • 48. Reliable Hourly Job Scheduling Workflow Automation & Scheduling 46
  • 49. 47 Historical Context Our first cut at the pipeline used cron to schedule hourly runs of Spark Problem We only knew if Spark succeeded. What if a downstream task failed? We needed something smarter than cron that Reliably managed a graph of tasks (DAG - Directed Acyclic Graph) Orchestrated hourly runs of that DAG Retried failed tasks Tracked the performance of each run and its tasks Reported on failure/success of runs Our Needs
  • 51. Airflow - DAG Dashboard 49 Airflow: It’s easy to manage multiple DAGs
  • 52. Airflow - Authoring DAGs 50 Airflow: Visualizing a DAG
  • 53. 51 Airflow: Author DAGs in Python! No need to bundle many config files! Airflow - Authoring DAGs
  • 54. Airflow - Performance Insights 52 Airflow: Gantt chart view reveals the slowest tasks for a run!
  • 55. 53 Airflow: …And we can easily see performance trends over time Airflow - Performance Insights
  • 56. Airflow - Alerting 54 Airflow: …And easy to integrate with Ops tools!
  • 58. 56 Airflow - Join the Community With >30 Companies, >100 Contributors , and >2500 Commits, Airflow is growing rapidly! We are looking for more contributors to help support the community! Disclaimer : I’m a maintainer on the project
  • 59. Design Goal Scorecard Are We Meeting Our Design Goals? 57
  • 60. 58 • Scalable • Available • Instrumented, Monitored, & Alert-enabled • Quickly Recoverable Desirable Qualities of a Resilient Data Pipeline
  • 61. 59 • Scalable • Build using scalable components from AWS • SQS, SNS, S3, ASG, EMR Spark • Exception = DB (WIP) • Available • Build using available components from AWS • Airflow for reliable job scheduling • Instrumented, Monitored, & Alert-enabled • Airflow • Quickly Recoverable • Airflow, DLQs, ASGs, Spark & DB Desirable Qualities of a Resilient Data Pipeline
  • 63. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/agari- resilient-predictive-data-pipes