Spark & Storm: When & Where?
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics code that operates on the data as it comes in, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tools
○ Mammoth will set up a visualization tool (e.g. Tableau or Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham (right above Toast)
● Lead Consultant on all things DevOps and Spark
● @carsondial on Twitter
Me!
● Quick overview of Spark Streaming
● Reasons why Spark Streaming can be tricky in practice
● Performance and tuning tips we’ve learnt over the past two years
● …and when to pack it all in and use Storm instead
What This Talk Is About
This IS WEB SCALE!
● I kid, Rails!
● (mostly)
Beyond Web Scale
● Spark & Storm - millions of requests / second on commodity hardware
● Different problems at different scales!
Beyond Web Scale
● Directed Acyclic Graph Data Processing Engine
● Based around the Resilient Distributed Dataset (RDD) primitive
Spark
Spark Streaming — Overview
Spark Streaming — In Production?
● Yes!
● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
● Streaming by running batches very quickly!
● Batch length: can be as low as 0.5s / batch
● Every X seconds, get Y records (DStream/RDDs)
Spark Streaming — Overview
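The "every X seconds, get Y records" idea can be sketched in plain Scala without Spark. This is only a toy model of micro-batching, not the Spark Streaming API: `Record`, `toBatches` and the millisecond timestamps are hypothetical names chosen for illustration, and arrival times are assumed to be non-negative.

```scala
object MicroBatchSketch {
  // Hypothetical record type: an arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, payload: String)

  // Group records into consecutive batches of `batchIntervalMs`.
  // The key is the batch index: batch 0 covers [0, interval),
  // batch 1 covers [interval, 2 * interval), and so on.
  def toBatches(records: Seq[Record], batchIntervalMs: Long): Map[Long, Seq[Record]] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    records.groupBy(r => r.arrivalMs / batchIntervalMs)
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(Record(100, "a"), Record(450, "b"), Record(700, "c"), Record(1200, "d"))
    // A 0.5s batch interval, the lower bound mentioned on the slide.
    val batches = toBatches(records, batchIntervalMs = 500)
    batches.toSeq.sortBy(_._1).foreach { case (index, batch) =>
      println(s"batch $index: ${batch.map(_.payload).mkString(", ")}")
    }
  }
}
```

Each batch here plays the role of one RDD in a DStream: a shorter interval means smaller batches and lower latency, at the cost of more scheduling overhead per record.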
● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
● Access to rest of Spark - DataFrames, MLlib, GraphX, etc.
Spark Streaming — Good Things
● What happens if you can’t process Y records in X seconds?
● What happens if you require sub-second latency?
Spark Streaming — Bad Things!
Spark Streaming — I’m so sorry.
● What happens if you can’t process Y records in X seconds?
● Data builds up in executors
● Executors run out of memory…
Spark Streaming — Bad Things!
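A rough way to see why memory runs out: if more records arrive per batch interval than can be processed in that interval, the leftover grows by the difference every batch. The sketch below is a simplified model with constant rates; `backlogPerBatch` and its parameters are hypothetical names, and real workloads are burstier than this.

```scala
object BacklogSketch {
  // Records left waiting after each batch, given a fixed number of arrivals
  // per batch interval and a fixed number of records processed per interval.
  def backlogPerBatch(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Seq[Long] =
    Iterator
      .iterate(0L)(backlog => math.max(0L, backlog + arrivalsPerBatch - capacityPerBatch))
      .drop(1)        // skip the initial empty backlog
      .take(batches)  // backlog after batch 1, 2, ..., `batches`
      .toList

  def main(args: Array[String]): Unit = {
    // 12,000 records arrive per interval but only 10,000 can be processed.
    println(backlogPerBatch(arrivalsPerBatch = 12000L, capacityPerBatch = 10000L, batches = 3))
  }
}
```

With these example numbers the backlog grows by 2,000 records per batch and never drains, which is the build-up in the executors that the slide describes.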
● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
Spark Streaming — Bad Things!
Spark Streaming — It Will Be Okay
● As a former Ops person:
● WE WILL REMEMBER.
Spark Streaming — Bad Things!
● Do you need low-latency?
● If so, a 10-minute nap is advisable!
● Everybody else, let’s dive in…
Spark Streaming — Tuning
● Easiest method — alter the batch window until it’s all fine!
● Tiny batches provide tight execution times!
Spark Streaming — Down In The Hole
● Use Kafka.
● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in Spark 1.3+)
● (other sources get the features…eventually)
Spark Streaming — Tuning
● Use Scala.
● CPython = slower in execution
● PyPy is much faster…but…
● New features always come to Scala first.
Spark Streaming — Tuning
● (or Java if you really must)
Spark Streaming — Tuning
● Spark Streaming = data receivers + Spark
● spark.cores.max = x * number of receivers
● For Great Data Locality and Parallelism!
Spark Streaming — Cores
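As a small illustration of the sizing rule above, the property can be derived from the receiver count. The helper `coresSetting` and the parameter `coresPerReceiver` (the slide's "x") are hypothetical; the right multiplier depends on the workload, and the only fixed constraint assumed here is that each receiver occupies a core of its own, so x should be greater than 1.

```scala
object CoresSketch {
  // Build the property named on the slide. `coresPerReceiver` is the slide's
  // "x"; its value is an assumption to be tuned for the workload.
  def coresSetting(receivers: Int, coresPerReceiver: Int): (String, String) = {
    require(receivers > 0 && coresPerReceiver > 0, "both values must be positive")
    "spark.cores.max" -> (receivers * coresPerReceiver).toString
  }

  def main(args: Array[String]): Unit = {
    // Example: 3 receivers with x = 4 gives spark.cores.max = 12.
    val (key, value) = coresSetting(receivers = 3, coresPerReceiver = 4)
    println(s"$key=$value")
  }
}
```

The resulting key/value pair is the kind of entry that would go into the application's Spark configuration.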
● Are you using a foreachRDD loop?
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory while it is reused
  …
  rdd.unpersist()  // release it once the batch has been handled
}
Spark Streaming — Caching
● If routing to multiple stores / iterating over an RDD multiple times, using cache() is a quick win
● It really shouldn’t work so well…
Spark Streaming — Caching
● Hurrah for Spark 1.5!
● spark.streaming.backpressure.enabled = true
● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
● Works for all data sources (for once!)
Spark Streaming — Backpressure
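The effect of backpressure, where unprocessed data waits in Kafka instead of in executor memory, can be illustrated with a toy model. This sketch does not reproduce Spark's actual rate-control algorithm; `State`, `step` and `run` are hypothetical names, and the per-batch capacity is assumed to be constant.

```scala
object BackpressureSketch {
  // Where unprocessed records sit after a batch: still upstream (e.g. in
  // Kafka) or already pulled into executor memory.
  final case class State(upstreamLag: Long, executorBacklog: Long)

  def step(state: State, arrivals: Long, capacity: Long, backpressure: Boolean): State =
    if (backpressure) {
      // Pull only what can be processed in this batch; the rest stays upstream.
      val available = state.upstreamLag + arrivals
      val ingested = math.min(available, capacity)
      State(available - ingested, state.executorBacklog)
    } else {
      // Pull everything; whatever is not processed piles up in the executors.
      val pending = state.executorBacklog + arrivals
      State(state.upstreamLag, math.max(0L, pending - capacity))
    }

  def run(arrivalsPerBatch: Seq[Long], capacity: Long, backpressure: Boolean): State =
    arrivalsPerBatch.foldLeft(State(0L, 0L))((s, a) => step(s, a, capacity, backpressure))

  def main(args: Array[String]): Unit = {
    val load = Seq(12000L, 12000L, 12000L)
    println(run(load, capacity = 10000L, backpressure = false))
    println(run(load, capacity = 10000L, backpressure = true))
  }
}
```

In both runs the same 6,000 records are left unprocessed; the difference is only where they wait, which is what keeps the executors from running out of memory.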
● I really need that low-latency response!
Storm
● Directed Acyclic Graph Data Processing Engine
Storm
Spark
“Very Good, Sir”
Storm
“Here you go!”
● Stream of tuples
● Bolts
● Spouts
● Topologies
Storm Concepts
● Unbounded stream of tuples
● Tuples are defined via schema (usual base types plus custom serializers)
Storm — Streams
● Sources of tuples in a topology
● Read from external sources (e.g. Kafka) and emit the data as tuples
● Can emit multiple streams from a spout!
Storm — Spouts
● Where your processing happens
● Roll your own aggregations / filtering / windowing
● Bolts can feed into other bolts
● Potentially easier to test than Spark Streaming
● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc)
Storm — Bolts
● The DAG of the spouts and bolts
● Built programmatically in code and submitted to the Storm cluster
● Flux - Do It In YAML (and then complain about whitespace)
Storm — Topologies
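To make the spout / bolt / topology vocabulary concrete, here is a toy model in plain Scala. It is not the Storm API: `Spout`, `Bolt`, `StreamTuple` and `run` are hypothetical stand-ins, the topology is a simple linear chain rather than a general DAG, and everything runs in one thread on a finite input.

```scala
object TopologySketch {
  // Hypothetical tuple representation: field name -> value.
  type StreamTuple = Map[String, Any]

  trait Spout { def nextTuples(): Seq[StreamTuple] }            // source of tuples
  trait Bolt  { def execute(tuple: StreamTuple): Seq[StreamTuple] } // processing step

  // A linear topology: one spout feeding a chain of bolts, each bolt's
  // output becoming the next bolt's input.
  def run(spout: Spout, bolts: Seq[Bolt]): Seq[StreamTuple] =
    bolts.foldLeft(spout.nextTuples()) { (tuples, bolt) =>
      tuples.flatMap(t => bolt.execute(t))
    }

  // Example components: emit lines, split them into words, keep one word.
  val lines: Spout = new Spout {
    def nextTuples(): Seq[StreamTuple] =
      Seq(Map("line" -> "spark or storm"), Map("line" -> "storm"))
  }
  val splitWords: Bolt = new Bolt {
    def execute(tuple: StreamTuple): Seq[StreamTuple] =
      tuple("line").toString.split(" ").toSeq.map(w => Map("word" -> w))
  }
  val onlyStorm: Bolt = new Bolt {
    def execute(tuple: StreamTuple): Seq[StreamTuple] =
      if (tuple("word") == "storm") Seq(tuple) else Seq.empty
  }

  def main(args: Array[String]): Unit =
    run(lines, Seq(splitWords, onlyStorm)).foreach(println)
}
```

The filtering and splitting are written by hand here, which mirrors the "roll your own" point made about bolts earlier in the deck.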
● Each bolt or spout runs 'tasks' across the cluster
● How parallelism works in Storm
● Set in topology submission
Storm — Tasks
● Where the topology runs
● 1 worker = 1 JVM
● Tasks run as threads on a worker
● Storm distributes tasks evenly across cluster
Storm — Workers
● True Streaming
● Tuples processed as they enter topology - low latency
● Scales far beyond Spark Streaming (currently)
Storm — Good Things
● Battle-tested at Twitter & Yahoo!
● Yahoo! has 300-node clusters and is working to support 1000+ nodes
● Single node clocked at over 1.5m tuples / second at Twitter
Storm — Good Things
● Very DIY (bring your own aggregations, ML, etc)
● Your DAG construction may not be optimal
● Operationally more complex (and Storm WebUI is more primitive)
● Where’s Me REPL?
Storm — Bad Things
Spark or Storm?
● SLA on latency?
Spark or Storm?
● Storm!
● (though simply because it’s possible doesn’t mean you’ll get it!)
Spark or Storm?
● Insane data needs (e.g. ~100m records/second?)
Spark or Storm?
● Storm!
● (though, again, it’s not a magic bullet!)
Spark or Storm?
● For almost anything else? Spark.
● High-level vs. Low-level
● Each new version of Spark delivers improvements!
Spark or Storm?
● Other frameworks that show promise:
○ Flink
○ Apex
○ Samza
○ Heron (Twitter’s not-public Storm replacement)
Other Listing Magazines Are Available
Questions?

All Things Open - Spark & Storm - Where & When?