SlideShare ist ein Scribd-Unternehmen logo
1 von 44
Downloaden Sie, um offline zu lesen
Productionalizing
Spark Streaming
Spark Summit 2013
Ryan Weald
@rweald

@rweald
Monday, December 2, 13
What We’re Going to Cover
•What we do and Why we choose Spark
•Fault tolerance for long lived streaming jobs
•Common patterns and functional abstractions
•Testing before we “do it live”

@rweald
Monday, December 2, 13
Special focus on
common patterns and
their solutions
@rweald
Monday, December 2, 13
What is Sharethrough?
Advertising for the Modern Internet

Form

@rweald
Monday, December 2, 13

Function
What is Sharethrough?

@rweald
Monday, December 2, 13
Why Spark Streaming?

@rweald
Monday, December 2, 13
Why Spark Streaming
•Liked theoretical foundation of mini-batch
•Scala codebase + functional API
•Young project with opportunities to contribute
•Batch model for iterative ML algorithms

@rweald
Monday, December 2, 13
Great...
Now productionalize it

@rweald
Monday, December 2, 13
Fault Tolerance

@rweald
Monday, December 2, 13
Keys to Fault Tolerance

1.Receiver fault tolerance
2.Monitoring job progress

@rweald
Monday, December 2, 13
Receiver Fault Tolerance

•Use Actors with supervisors
•Use self healing connection pools

@rweald
Monday, December 2, 13
Use Actors
class RabbitMQStreamReceiver (uri:String, exchangeName: String,
routingKey: String) extends Actor with Receiver with Logging {
implicit val system = ActorSystem()
override def preStart() = {
//Your code to setup connections and actors
//Include inner class to process messages
}
def receive: Receive = {
case _
=> logInfo("unknown message")
}
}

@rweald
Monday, December 2, 13
Track All Outputs
•Low watermarks - Google MillWheel
•Database updated_at
•Expected output file size alerting

@rweald
Monday, December 2, 13
Common Patterns
&
Functional
Programming
@rweald
Monday, December 2, 13
Common Job Pattern

Map -> Aggregate ->Store

@rweald
Monday, December 2, 13
Mapping Data

inputData.map { rawRequest =>
val params = QueryParams.parse(rawRequest)
(params.getOrElse("beaconType", "unknown"), 1L)
}

@rweald
Monday, December 2, 13
Aggregation

@rweald
Monday, December 2, 13
Basic Aggregation

//beacons is DStream[String, Long]
//example Seq(("click", 1L), ("click", 1L))
val sum: (Long, Long) => Long = _ + _
beacons.reduceByKey(sum)

@rweald
Monday, December 2, 13
What Happens when
we want to sum
multiple things?
@rweald
Monday, December 2, 13
Long Basic Aggregation
val inputData = Seq(
("user_1",(1L, 1L, 1L)),
("user_1",(2L, 2L, 2L))
)
def sum(l: (Long, Long, Long),
r: (Long, Long, Long)) = {
(l._1 + r._1, l._2 + r._2, l._3 + r._3)
}
inputData.reduceByKey(sum)

@rweald
Monday, December 2, 13
Now Sum 4 Ints
instead
(ノಥ益ಥ)ノ ┻━┻
@rweald
Monday, December 2, 13
Monoids to the Rescue

@rweald
Monday, December 2, 13
WTF is a Monoid?
trait Monoid[T] {
def zero: T
def plus(r: T, l: T): T
}
* Just need to make sure plus is associative.
(1+ 5) + 2 == (2 + 1) + 5

@rweald
Monday, December 2, 13
Monoid Based Aggregation
object LongMonoid extends Monoid[(Long, Long, Long)]
{
def zero = (0, 0, 0)
def plus(r: (Long, Long, Long),
l: (Long, Long, Long)) = {
(l._1 + r._1, l._2 + r._2, l._3 + r._3)
}
}
inputData.reduceByKey(LongMonid.plus(_, _))

@rweald
Monday, December 2, 13
Twitter Algebird
http://github.com/twitter/algebird

@rweald
Monday, December 2, 13
Algebird Based Aggregation

import com.twitter.algebird._
val aggregator = implicitly[Monoid[(Long,Long, Long)]]
inputData.reduceByKey(aggregator.plus(_, _))

@rweald
Monday, December 2, 13
How many unique
users per publisher?

@rweald
Monday, December 2, 13
Too big for memory
based naive Map

@rweald
Monday, December 2, 13
HyperLogLog FTW

@rweald
Monday, December 2, 13
HLL Aggregation

import com.twitter.algebird._
val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))

@rweald
Monday, December 2, 13
Monoids == Reusable
Aggregation

@rweald
Monday, December 2, 13
Common Job Pattern

Map -> Aggregate ->Store

@rweald
Monday, December 2, 13
Store

@rweald
Monday, December 2, 13
How do we store the
results?

@rweald
Monday, December 2, 13
Storage API Requirements
•Incremental updates (preferably associative)
•Pluggable to support “big data” stores
•Allow for testing jobs

@rweald
Monday, December 2, 13
Storage API
trait MergeableStore[K,
def get(key: K): V
def put(kv: (K,V)): V
/*
* Should follow same
* as our Monoid from
*/
def merge(kv: (K,V)):
}

@rweald
Monday, December 2, 13

V] {

associative property
earlier
V
Twitter Storehaus
http://github.com/twitter/storehaus

@rweald
Monday, December 2, 13
Storing Spark Results
def saveResults(result: DStream[String, Long],
store: RedisStore[String, Long]) = {
result.foreach { rdd =>
rdd.foreach { element =>
val (keys, value) = element
store.merge(keys, impressions)
}
}
}

@rweald
Monday, December 2, 13
Everyone can benefit

@rweald
Monday, December 2, 13
Potential API additions?

class PairDStreamFunctions[K, V] {
def aggregateByKey(aggregator: Monoid[V])
def store(store: MergeableStore[K, V])
}

@rweald
Monday, December 2, 13
Twitter Summingbird
http://github.com/twitter/summingbird
*https://github.com/twitter/summingbird/issues/387

@rweald
Monday, December 2, 13
Testing Your Jobs

@rweald
Monday, December 2, 13
Testing best Practices
•Try and avoid full integration tests
•Use in-memory stores for testing
•Keep logic outside of Spark
•Use Summingbird in memory platform???

@rweald
Monday, December 2, 13
Thank You
Ryan Weald
@rweald

@rweald
Monday, December 2, 13

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
An Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Scylla Summit 2018: From SAP to Scylla - Tracking the Fleet at GPS Insight
Scylla Summit 2018: From SAP to Scylla - Tracking the Fleet at GPS InsightScylla Summit 2018: From SAP to Scylla - Tracking the Fleet at GPS Insight
Scylla Summit 2018: From SAP to Scylla - Tracking the Fleet at GPS Insight
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
M|18 How MariaDB Server Scales with Spider
M|18 How MariaDB Server Scales with SpiderM|18 How MariaDB Server Scales with Spider
M|18 How MariaDB Server Scales with Spider
 
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
 
Building a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at YieldbotBuilding a Lambda Architecture with Elasticsearch at Yieldbot
Building a Lambda Architecture with Elasticsearch at Yieldbot
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
 
Apache Cassandra Lunch #70: Basics of Apache Cassandra
Apache Cassandra Lunch #70: Basics of Apache CassandraApache Cassandra Lunch #70: Basics of Apache Cassandra
Apache Cassandra Lunch #70: Basics of Apache Cassandra
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins...
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 

Ähnlich wie Productionalizing Spark Streaming

Russell 2012 introduction to spring integration and spring batch
Russell 2012   introduction to spring integration and spring batchRussell 2012   introduction to spring integration and spring batch
Russell 2012 introduction to spring integration and spring batch
GaryPRussell
 

Ähnlich wie Productionalizing Spark Streaming (20)

Spark at-hackthon8jan2014
Spark at-hackthon8jan2014Spark at-hackthon8jan2014
Spark at-hackthon8jan2014
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Treasure Data Cloud Strategy
Treasure Data Cloud StrategyTreasure Data Cloud Strategy
Treasure Data Cloud Strategy
 
December 2013 HUG: Spark at Yahoo!
December 2013 HUG: Spark at Yahoo!December 2013 HUG: Spark at Yahoo!
December 2013 HUG: Spark at Yahoo!
 
Use Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB GuruUse Your MySQL Knowledge to Become a MongoDB Guru
Use Your MySQL Knowledge to Become a MongoDB Guru
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the BusinessGive Me My Damn Report: Making NoSQL Data Accessible to the Business
Give Me My Damn Report: Making NoSQL Data Accessible to the Business
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Why Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) ModelWhy Spark Is the Next Top (Compute) Model
Why Spark Is the Next Top (Compute) Model
 
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaCassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala
 
Grails 2.0 Update
Grails 2.0 UpdateGrails 2.0 Update
Grails 2.0 Update
 
Cloudera Impala - HUG Karlsruhe, July 04, 2013
Cloudera Impala - HUG Karlsruhe, July 04, 2013Cloudera Impala - HUG Karlsruhe, July 04, 2013
Cloudera Impala - HUG Karlsruhe, July 04, 2013
 
Introduction to Ruby & Ruby on Rails
Introduction to Ruby & Ruby on RailsIntroduction to Ruby & Ruby on Rails
Introduction to Ruby & Ruby on Rails
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark 101
Spark 101Spark 101
Spark 101
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Russell 2012 introduction to spring integration and spring batch
Russell 2012   introduction to spring integration and spring batchRussell 2012   introduction to spring integration and spring batch
Russell 2012 introduction to spring integration and spring batch
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Kürzlich hochgeladen (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

Productionalizing Spark Streaming