Spark and Storm at Yahoo 
Why choose one over the other?
Presented by Bobby Evans and Tom Graves
Tom Graves 
Bobby Evans (bobby@apache.org) 
▪ Committers and PMC/PPMC Members for
› Apache Storm incubating (Bobby)
› Apache Hadoop (Tom and Bobby)
› Apache Spark (Tom and Bobby)
› Apache Tez (Tom and Bobby)
▪ Low Latency Big Data team at Yahoo (Part of the Hadoop Team)
› Apache Storm as a service
• 1,300+ nodes total, 250 node cluster (soon to be 4000 nodes).
› Apache Spark on YARN
• 40,000 nodes total, 5000+ node cluster
› Help with distributed ML and deep learning.
Where we come from 
Yahoo Champaign: 
• 100+ engineers 
• Located in UIUC Research Park http://researchpark.illinois.edu/ 
• Split between Advertising and Data Platform team and Hadoop team. 
• Hadoop team provides the Hadoop ecosystem as a service to all of Yahoo. 
• Site is 7 years old, and we are building a new building with room for 200. 
• We are Hiring 
• resume-hadoop@yahoo-inc.com 
• http://bit.ly/1ybTXMe
Agenda 
Spark Overview (1.1) 
Storm Overview (0.9.2) 
Things to Consider 
Example Architectures 
Apache Spark 
Spark Key Concepts
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
▪ Collections of objects spread across a cluster, stored in RAM or on disk
▪ Built through parallel transformations
▪ Automatically rebuilt on failure
Operations
▪ Transformations (e.g. map, filter, groupBy)
▪ Actions (e.g. count, collect, save)
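The split between lazy transformations and eager actions can be illustrated without a cluster. The following is only a conceptual sketch in plain Python: `LazyDataset`, `read_numbers`, and `reads` are hypothetical names invented for this illustration, not part of Spark's API. It records transformations as lineage and replays them from the source only when an action is called, which is also the idea behind rebuilding lost data on failure.

```python
# Conceptual sketch only: a toy, single-process stand-in for an RDD.
# LazyDataset and its methods are hypothetical names, not Spark's API.

class LazyDataset:
    def __init__(self, source, lineage=()):
        self._source = source      # callable that re-reads the input data
        self._lineage = lineage    # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the recorded lineage from the source. Recomputing from
        # lineage is what makes rebuilding after a failure possible.
        items = iter(self._source())
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def count(self):
        # Action: triggers the computation and returns a value.
        return sum(1 for _ in self._compute())

    def collect(self):
        # Action: triggers the computation and returns all elements.
        return list(self._compute())


reads = []  # records every time the source is actually read


def read_numbers():
    reads.append(1)
    return range(10)


# Nothing is read here: filter and map only extend the lineage.
evens_squared = (LazyDataset(read_numbers)
                 .filter(lambda n: n % 2 == 0)
                 .map(lambda n: n * n))
```

Calling `evens_squared.collect()` or `evens_squared.count()` is what finally reads the source; each action replays the whole lineage, which is why real systems offer caching in RAM.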
Working With RDDs
[Diagram: a chain of RDDs linked by transformations, ending in an action that yields a value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()  # returns 74
linesWithSpark.first()  # returns "# Apache Spark"
Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = (lines.flatMap(lambda line: line.split(" "))
      .map(lambda word: (word, 1))
      .reduceByKey(lambda x, y: x + y))
Input lines: "to be or", "not to be"
After flatMap: "to", "be", "or", "not", "to", "be"
After map: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
After reduceByKey: (be, 2), (not, 1), (or, 1), (to, 2)
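The three stages of the word count can be traced on a single machine. This is a minimal sketch in plain Python, assuming no Spark installation: `flat_map` and `reduce_by_key` are hypothetical helpers that imitate the behavior of the corresponding RDD operations on in-memory lists.

```python
from itertools import chain


def flat_map(fn, items):
    # Apply fn to every item and flatten the resulting sequences.
    return chain.from_iterable(fn(item) for item in items)


def reduce_by_key(fn, pairs):
    # Merge all values that share a key using the binary function fn.
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


lines = ["to be or", "not to be"]
words = flat_map(lambda line: line.split(" "), lines)
pairs = ((word, 1) for word in words)
counts = reduce_by_key(lambda x, y: x + y, pairs)

print(sorted(counts.items()))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```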
Spark Streaming Word Count
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.foldLeft(0)(_ + _)
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}
…
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(word => (word, 1))
val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)
ssc.start()
ssc.awaitTermination()
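The stateful update is easier to follow when the micro-batches are made explicit. The sketch below is plain Python, not the Spark Streaming API: `update_func` mirrors the role of `updateFunc`, while `update_state_by_key` and the two hard-coded batches are assumptions made for illustration.

```python
def update_func(values, state):
    # Add the values seen in this batch to the previous running count.
    return sum(values) + (state or 0)


def update_state_by_key(state, batch_pairs):
    # Group this batch's values by key, then fold them into the state.
    # Simplification: only keys present in the batch are updated, which
    # gives the same result here because an empty batch adds nothing.
    grouped = {}
    for key, value in batch_pairs:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)
    for key, values in grouped.items():
        new_state[key] = update_func(values, state.get(key))
    return new_state


# Each inner list stands for the lines received during one batch interval.
batches = [["to be or"], ["not to be"]]

state = {}
for batch in batches:
    pairs = [(word, 1) for line in batch for word in line.split(" ")]
    state = update_state_by_key(state, pairs)
```

After the first batch the state is `{"to": 1, "be": 1, "or": 1}`; after the second it carries the totals across both batches.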
Apache Storm
Storm Concepts
1. Streams
› Unbounded sequence of tuples
2. Spouts
› Source of a stream
› E.g. read from the Twitter streaming API
3. Bolts
› Process input streams and produce new streams
› E.g. functions, filters, aggregations, joins
4. Topologies
› Network of spouts and bolts
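These four concepts can be mimicked with Python generators. This is a conceptual sketch only, not Storm's API: `sentence_spout`, `split_bolt`, `count_bolt`, and `build_topology` are hypothetical names, and the spout is a finite list rather than an unbounded stream so that the example terminates.

```python
def sentence_spout(sentences):
    # Spout: the source of the stream, emitting one tuple per sentence.
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    # Bolt: consumes sentence tuples and emits one tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    # Bolt: keeps a running count and emits (word, count) for each input,
    # one event at a time rather than once per batch.
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def build_topology(sentences):
    # Topology: the network that wires the spout into the bolts.
    return count_bolt(split_bolt(sentence_spout(sentences)))


emitted = list(build_topology(["to be or", "not to be"]))
```

Every incoming word produces an updated count immediately, which is the per-event behavior that distinguishes this model from micro-batching.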
Storm Architecture
[Diagram: a master node (Nimbus), cluster coordination (three Zookeeper nodes), and four Supervisors that launch the worker processes]
Trident (Storm) Word Count
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
        new Fields("count")).parallelismHint(6);
"to be or" → "to", "be", "or" → (to, 1), (be, 1), (or, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1)
Aggregated state: (be, 2), (not, 1), (or, 1), (to, 2)
Use the Right Tool for the Job 
https://www.flickr.com/photos/hikingartist/4193330368/
Things to Consider 
Scale
Latency
Iterative Processing
› Are there suitable non-iterative alternatives?
Use What You Know
Code Reuse
Maturity
When We Recommend Spark 
▪ Iterative Batch Processing (most Machine Learning)
› There really is nothing else right now.
› Has some scale issues.
▪ Tried ETL (Not at Yahoo scale yet)
▪ Tried Shark/Interactive Queries (Not at Yahoo scale yet)
▪ < 1 TB (or memory size of your cluster)
▪ Tuning it to run well can be a pain
▪ Databricks and others are working on scaling.
▪ Streaming is all micro-batch (μ-batch), so latency is at least 1 sec
▪ Streaming still has single points of failure
▪ All streaming inputs are replicated in memory
When We Recommend Storm 
▪ Latency < 1 second (single event at a time)
› There is little else (especially not open source)
▪ "Real Time" …
› Analytics
› Budgeting
› ML
› Anything
▪ Lower-level API than Spark
▪ No built-in concept of look-back aggregations
▪ Takes more effort to combine batch with streaming
Fictitious Example: My Commute App 
▪ Mobile app that lets users track their commute.
▪ Cities, users, companies, etc. compete daily for
› Shortest commute time
› Greenest commute
▪ Make money by selling location-based ads and aggregate data to
› Governments
› Advertisers
▪ Feel free to steal my crazy idea; I just want to be invited to the launch party, and I wouldn't say no to some stock.
Chicago vs. Champaign Urbana 
Champaign Urbana: 14-15 min
Chicago: 20-30 min
[Bar chart: commute time in minutes (axis 0 to 35) for Bobby, CU, and Chicago]
Source: http://project.wnyc.org/commute-times-us/embed.html#5.00/42.000/-89.500
Things to Consider 
Scale
› everyone in the world!!!
Latency
› a few seconds max
Iterative Processing
› Possibly for targeting, but there are alternatives
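The rules of thumb from the two "When We Recommend" slides can be applied to these requirements in a few lines. The function below is a rough sketch, not an official decision procedure: `suggest_engines`, its parameter names, and the data-size figures in the example call are assumptions made for illustration, while the 1 second latency threshold and the "fits in cluster memory" guideline come from the slides.

```python
def suggest_engines(latency_budget_s, iterative, data_tb, cluster_memory_tb):
    """Rule-of-thumb encoding of the recommendation slides (a simplification)."""
    suggestions = []
    if latency_budget_s < 1.0:
        # Sub-second latency: the slides recommend Storm.
        suggestions.append("storm: sub-second, one event at a time")
    else:
        # Micro-batching adds roughly a second, which is tolerable here.
        suggestions.append("storm or spark streaming: micro-batch latency is acceptable")
    if iterative:
        note = "spark: iterative batch processing"
        if data_tb > cluster_memory_tb:
            note += " (data exceeds cluster memory, expect scale issues)"
        suggestions.append(note)
    return suggestions


# Commute app: a few seconds of latency, possibly iterative for targeting.
# The data sizes are made-up placeholder values.
commute_app = suggest_engines(latency_budget_s=3.0, iterative=True,
                              data_tb=0.5, cluster_memory_tb=1.0)
```

Scale, familiarity, code reuse, and maturity still have to be weighed by hand; this sketch only covers latency and iterative processing.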
Architecture 
[Diagram components: App, Web Service (User, Commute ID, Location History, MPG), Kafka, Storm, HBase/NoSQL, HDFS, Spark, Customer]
Architecture (Alternative) 
[Diagram components: App, Web Service (User, Commute ID, Location History, MPG), HBase/NoSQL, HDFS, Spark, Customer]
Go directly to Spark Streaming, but data loss potential goes up.
Architecture (Alternative 2) 
[Diagram components: App, Web Service (User, Commute ID, Location History, MPG), Kafka, Storm, HBase/NoSQL, Customer]
Streaming Operations Only (Kappa Architecture)
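In a streaming-only design, the same update code serves both live processing and reprocessing: a view can be rebuilt by replaying the retained event log through the streaming path. The sketch below illustrates that idea in plain Python under stated assumptions: the in-memory `log` stands in for a durable log such as Kafka, the dictionary views stand in for a store such as HBase, and the event fields and values are invented for the example.

```python
def apply_event(view, event):
    # Streaming update: fold one commute event into the per-city view,
    # stored as (total minutes, number of commutes).
    total, n = view.get(event["city"], (0, 0))
    view[event["city"]] = (total + event["minutes"], n + 1)
    return view


def replay(log):
    # Reprocessing: rebuild a fresh view by replaying the whole log
    # through the same streaming update function.
    view = {}
    for event in log:
        apply_event(view, event)
    return view


# Hypothetical events; in practice these would come from the log.
log = [
    {"city": "Chicago", "minutes": 30},
    {"city": "Champaign Urbana", "minutes": 14},
    {"city": "Chicago", "minutes": 20},
]

live_view = {}
for event in log:
    apply_event(live_view, event)   # events processed as they arrive

rebuilt_view = replay(log)          # same result obtained by replay
```

Because there is one code path, there is no separate batch job to keep in sync with the streaming logic; the trade-off is that the log must be retained long enough to replay.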
Fictitious Example 2: Web Scale Monitoring 
▪ Look for trends that can indicate a problem.
› Alert or provide automated corrections
▪ Provide an interface to visualize
› Current data very quickly
› Historical data in depth
▪ If you commercialize this one, please give me/Yahoo a free license for life (open source works too).
Things to Consider 
Scale
› Lots of events from many different servers
Latency
› a few seconds max, but the fewer the better
Iterative Processing
› For in-depth analysis, definitely
Fictitious Example 2: Web Scale Monitoring 
[Diagram components: Servers, Kafka, Storm, HBase, HDFS, Spark, JDBC Server, UI; labels: Rules, Alert!!, ML and trend analysis]
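The kind of rule that could raise an alert on the low-latency path can be sketched as a sliding-window check. This is an illustration only: `ErrorRateRule`, its thresholds, and the sample events are hypothetical and not taken from the talk, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque


class ErrorRateRule:
    """Fires when more than max_errors errors fall inside the time window."""

    def __init__(self, window_s, max_errors):
        self.window_s = window_s
        self.max_errors = max_errors
        self._error_times = deque()

    def observe(self, timestamp_s, is_error):
        # Record the event, then drop errors that aged out of the window.
        if is_error:
            self._error_times.append(timestamp_s)
        while self._error_times and self._error_times[0] <= timestamp_s - self.window_s:
            self._error_times.popleft()
        return len(self._error_times) > self.max_errors


rule = ErrorRateRule(window_s=10, max_errors=2)

# (timestamp in seconds, is_error) pairs from a hypothetical server.
events = [(0, True), (1, False), (2, True), (3, True), (20, True)]
alerts = [t for t, is_error in events if rule.observe(t, is_error)]
```

Only the third error within ten seconds triggers an alert; the error at t=20 arrives after the earlier ones have left the window.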
Questions? 
bobby@apache.org resume-hadoop@yahoo-inc.com 
http://bit.ly/1ybTXMe



Editor's notes

  1. RDD: colloquially referred to as RDDs (e.g. caching in RAM). Transformations are lazy operations that build RDDs from other RDDs; actions return a result or write it to storage.
  2. Let me illustrate this with some bad PowerPoint diagrams and animations. This diagram is LOGICAL,
  3. Trend analysis is difficult, but sketches for approximations on many aggregates and gradient descent or VW for ML make this still an attractive option.
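One example of a sketch for approximate aggregates is the count-min sketch, which estimates per-key counts in fixed memory and never under-counts when all increments are non-negative. The talk does not say which sketches were meant, so this choice, along with the class name, the table dimensions, and the host names, is an assumption made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counts using depth rows of width counters each."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Derive one hash per row by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        # Collisions can only inflate counters, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


sketch = CountMinSketch()
for host in ["host-a", "host-b", "host-a", "host-c", "host-a"]:
    sketch.add(host)
```

`sketch.estimate("host-a")` is at least 3, and memory use stays constant no matter how many distinct hosts report events.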