Taboola's Experience with
Apache Spark
tal.s@taboola.com
ruthy.g@taboola.com
Engine Focused on Maximizing CTR & Post Click Engagement
• Context – Metadata
• Geo – Region-based Recommendations
• User Behavior – Cookie Data
• Social – Facebook/Twitter API
• Collaborative Filtering – Bucketed Consumption Groups
Largest Content Discovery and Monetization Network
• 3B daily recommendations
• 1M+ sourced content providers
• 0M monthly unique users
• 1M+ sourced content items
What Does it Mean?
• 5 Data Centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated calibration of recommendation algorithms
– Real-time analytics
About Spark
• Open source
• Apache top-level project (since Feb. 19th)
• DataBricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos and ZooKeeper, or YARN
Spark Development Community
• With over 100 developers and 25 companies, one of
the most active communities in big data

Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)

Past 6 months: more active devs than Hadoop MapReduce!
The Spark Community
Spark Performance
[Figure: three bar charts.
Streaming – throughput (MB/s/node): Spark vs. Storm.
SQL – response time (s): Shark (mem), Shark (disk), Hive, Impala (mem), Impala (disk).
Graph – response time (min): GraphX, Giraph, Hadoop.]
Spark API
• Simple to write through easy APIs in
Java, Scala and Python
• The same analytics code can be used for both
streaming data and offline data processing
Spark Key Concepts
Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Working With RDDs
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark

[Diagram: RDD → transformations → RDD → action → value]
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                       # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()            # action
messages.filter(lambda s: "php" in s).count()
. . .

[Diagram: the Driver sends tasks to three Workers and receives results; each Worker reads one block (Block 1, 2, 3) and keeps its partition in a cache (Cache 1, 2, 3)]

Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec vs. 20s for on-disk
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware to avoid shuffles

[Diagram: a task graph over RDDs A to F, built from groupBy, map, filter and join operations and divided into Stage 1, Stage 2 and Stage 3; the legend distinguishes RDDs from cached partitions]
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …

[Diagram: your application holds a SparkContext, which runs local threads or talks to a cluster manager; Workers each run a Spark executor on top of HDFS or other storage]
System Architecture & Data Flow @ Taboola
[Diagram components: FE Servers, Driver + Consumers, Spark Cluster, C* Cluster, MySQL Cluster]
Execution Graph @ Taboola
rdd1 = Context.parallelize([data])
• Data start point (dates, etc.)

rdd2 = rdd1.mapPartitions(loadfunc())
• Loading data from external sources (Cassandra, MySQL, etc.)

rdd3 = rdd2.reduce(reduceFunc())
• Aggregating the data and storing results

rdd4 = rdd3.mapPartitions(saverfunc())
• Saving the results to a DB

rdd4.count()
• Executing the above graph by forcing an output operation
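
The final count() matters because Spark transformations are lazy: nothing is loaded or saved until an action runs. The sketch below is only a plain-Python analogy, not Spark code. Generators stand in for RDDs, and load_func, aggregate, saver_func and the sample dates are hypothetical names chosen for illustration.

```python
def load_func(dates, log):
    # Stand-in for loading rows from an external source, one date at a time.
    for d in dates:
        log.append("load " + d)
        yield (d, len(d))  # fake payload: the length of the date string

def aggregate(rows, log):
    # Stand-in for the reduce step; the body runs only when consumed.
    total = 0
    for _, n in rows:
        total += n
    log.append("aggregate")
    yield total

def saver_func(rows, log):
    # Stand-in for writing results to a database.
    for row in rows:
        log.append("save")
        yield row

log = []
rdd1 = ["2014-02-19", "2014-02-20"]   # data start point
rdd2 = load_func(rdd1, log)           # builds the graph, runs nothing
rdd3 = aggregate(rdd2, log)
rdd4 = saver_func(rdd3, log)
steps_before = list(log)              # still empty at this point

count = sum(1 for _ in rdd4)          # consuming the chain executes every step
```

Until the last line runs, log stays empty; consuming rdd4 pulls data through load, aggregate and save in that order.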
Cassandra as a Distributed Storage
• Event Log Files saved as blobs to a dedicated keyspace
• C* Tables holding the Event Log Files are partitioned by day – new Table per day. This makes maintenance easier and loading into Spark simpler
• Using Astyanax driver + CQL3
– Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
• Worked well, but ended up not using it – instead using mapPartitions
– Very simple, no overhead, no need to be tied to Hadoop
– Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this

Table layout (one table per day, e.g. userevent_2014-02-19, userevent_2014-02-20):
Key (String)                      | Data (blob)
GUID (originally log file name)   | Gzipped file
GUID                              | Gzipped file
…                                 | …
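
The "split by keys, then load by key in batches, in parallel partitions" recipe can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deck's implementation: a dict stands in for one daily Cassandra table, a thread pool stands in for Spark partitions, and table, split, load_partition and the batch size are all hypothetical.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one daily table: GUID key -> gzipped log file.
table = {
    f"guid-{i}": gzip.compress(f"line {i}a\nline {i}b".encode("utf-8"))
    for i in range(10)
}

def split(keys, num_partitions):
    """Deal the keys round-robin into num_partitions lists."""
    return [keys[i::num_partitions] for i in range(num_partitions)]

def load_partition(keys, batch_size=4):
    """Load one partition's blobs in batches and return the decoded lines."""
    lines = []
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        blobs = [table[k] for k in batch]  # stands in for one multi-key read
        for blob in blobs:
            lines.extend(gzip.decompress(blob).decode("utf-8").splitlines())
    return lines

all_keys = sorted(table)                          # step 1: fetch all keys
partitions = split(all_keys, num_partitions=3)    # step 2: split by key
with ThreadPoolExecutor(max_workers=3) as pool:   # step 3: load in parallel
    lines = [l for part in pool.map(load_partition, partitions) for l in part]
```

In a Spark job the list of key partitions would be the parallelized input and load_partition would be the function handed to mapPartitions.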
Sample – Click Counting for Campaign Stopping

1. mapPartitions – mapping from strings to objects with a predesigned click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
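
The four steps above can be sketched locally without a cluster. This is a minimal illustration, assuming a comma-separated input format and a click key of (user, item); neither is given in the deck, and Click, parse and reduce_by_key are hypothetical helpers that imitate the Spark operations on plain Python lists.

```python
from collections import namedtuple

Click = namedtuple("Click", "user item campaign day ts")

def parse(line):
    # Assumed input format: user,item,campaign,day,timestamp
    user, item, campaign, day, ts = line.split(",")
    return Click(user, item, campaign, day, int(ts))

def reduce_by_key(pairs, func):
    """Local imitation of reduceByKey: fold together values sharing a key."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

raw = [
    "u1,i1,c1,2014-02-19,100",
    "u1,i1,c1,2014-02-19,105",  # duplicate: same user, same item
    "u2,i1,c1,2014-02-19,110",
    "u3,i2,c2,2014-02-19,120",
]

# 1. Map strings to objects keyed by the click key.
keyed = [((c.user, c.item), c) for c in map(parse, raw)]
# 2. Remove duplicate clicks, keeping the oldest per key.
unique = reduce_by_key(keyed, lambda a, b: a if a.ts <= b.ts else b)
# 3. Switch to a campaign+day key.
by_campaign_day = [((c.campaign, c.day), 1) for c in unique.values()]
# 4. Aggregate by campaign+day.
counts = reduce_by_key(by_campaign_day, lambda a, b: a + b)
```

With this sample input, campaign c1 ends up with two clicks and c2 with one, because the second click from u1 on i1 is dropped in step 2.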
Campaign Stopping – Removing Dup Clicks

• When more than one click is found from the same user on the same item, keep only the oldest
• Using accumulators to track the number of duplicates
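
A reducer along these lines could look like the sketch below. It is an assumption-laden illustration: Counter is a plain-Python stand-in for a Spark accumulator, and the click dictionaries and the keep_oldest name are hypothetical.

```python
from functools import reduce

class Counter:
    """Plain-Python stand-in for a Spark accumulator (add-only counter)."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

duplicates = Counter()

def keep_oldest(a, b):
    """Reduce two clicks with the same key: count the drop, keep the older."""
    duplicates.add(1)
    return a if a["ts"] <= b["ts"] else b

# Three clicks by the same user on the same item (hypothetical sample data).
clicks = [
    {"user": "u1", "item": "i1", "ts": 105},
    {"user": "u1", "item": "i1", "ts": 100},
    {"user": "u1", "item": "i1", "ts": 130},
]
oldest = reduce(keep_oldest, clicks)
```

One caveat when moving this to a real cluster: accumulator updates made inside transformations can be applied more than once if a task is re-executed, so such counts are best treated as operational indicators rather than exact figures.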
Our Deployment
• 7 nodes, each:
– 24 cores
– 256G RAM
– 6 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total Cluster Power
– 1760GB RAM
– 168 CPUs
– 42 TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric Deployment
– Mesos + Spark
– Cassandra
• More
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
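
As a rough back-of-the-envelope check of these figures, the arithmetic can be written out as below. It assumes 1 TB = 1000 GB and that replication is the only storage overhead; both are simplifications, not statements from the deck.

```python
nodes = 7
cores_per_node = 24
disks_per_node, disk_tb = 6, 1
replication_factor = 3

total_cores = nodes * cores_per_node                         # 168 CPUs
raw_storage_tb = nodes * disks_per_node * disk_tb            # 42 TB raw
effective_storage_tb = raw_storage_tb / replication_factor   # 14 TB of unique data

# "~1 terabyte in ~3 minutes", expressed as throughput.
data_gb, seconds = 1000, 3 * 60
cluster_gb_per_s = data_gb / seconds            # about 5.6 GB/s for the cluster
per_node_gb_per_s = cluster_gb_per_s / nodes    # about 0.8 GB/s per node
```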
Things that work well with Spark
(from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other Map Reduce options
– Simple concepts, “small” API

• Easy to Unit Test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you
do not use anonymous classes

• Very resilient
• Can read/write to/from any data source, including
RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
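
On the unit-testing point, the idea is that a mapper written as a named function (rather than an anonymous one) can be tested with no Spark context at all. A minimal sketch, with to_campaign_day and the click dictionary as hypothetical examples:

```python
def to_campaign_day(click):
    """Map a parsed click to a ((campaign, day), 1) pair."""
    return ((click["campaign"], click["day"]), 1)

def test_to_campaign_day():
    click = {"campaign": "c1", "day": "2014-02-19", "user": "u1"}
    assert to_campaign_day(click) == (("c1", "2014-02-19"), 1)

test_to_campaign_day()  # runs as an ordinary unit test, no cluster required
```

The same function can then be passed unchanged to a map call in the job, so the tested code is the code that runs.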
Things that do not work that well
(from our experience)
• Long (endless) running tasks require some workarounds
– Temp files - Spark creates a lot of files in
spark.local.dir, requires periodic cleanup
– Use spark.cleaner.ttl for long running tasks
– Periodic Driver restarts (spark 0.8.1)

• Spark Streaming – not fully mature
– Some edge cases can cause loss of data
– Sliding window / batch model does not fit our needs
• We always load some history to deal with late arriving data
• State management left to the user and not trivial
– BUT – we were able to easily implement a bullet proof home
grown, near real time, streaming solution with minimal amount
of code
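
The deck does not describe how the home-grown solution is built. Purely to illustrate the "always load some history" idea, the sketch below picks which daily tables a run would reload so that late-arriving events are still counted. The day granularity, the history_days parameter and the tables_to_load helper are assumptions; only the userevent_YYYY-MM-DD naming comes from the Cassandra slide.

```python
from datetime import date, timedelta

def tables_to_load(today, history_days):
    """Return daily table names for today plus history_days earlier days."""
    days = [today - timedelta(days=n) for n in range(history_days, -1, -1)]
    return ["userevent_" + d.isoformat() for d in days]

tables = tables_to_load(date(2014, 2, 20), history_days=1)
```

Each run recomputes its aggregates from these tables, so an event that arrives late for the previous day is picked up on the next run instead of being lost.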
General / Optimization Tips
• Use Spark Accumulators to collect and report
operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
• Use Leader Election for Driver H/A
– In Spark 0.9 may not be needed with the new option to run
the driver inside the cluster
Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse grained) for Spark

• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark

• Curator
– Leader Election – Taboola’s Spark Driver

• Cassandra
– Distributed Data Store

• Monitoring – http://metrics.codahale.com/
Attributions
Many of the general Spark slides were taken from the
DataBricks Spark Summit 2013 slides.
There are great materials at:
https://spark.incubator.apache.org/
http://spark-summit.org/summit-2013/
Thank You!
tal.s@taboola.com
ruthy.g@taboola.com

Taboola's experience with Apache Spark (presentation @ Reversim 2014)

  • 1. Taboola's Experience with Apache Spark tal.s@taboola.com ruthy.g@taboola.com
  • 2. Engine Focused on Maximizing CTR & Post Click Engagement
      • Context: metadata
      • Geo: region-based recommendations
      • User Behavior: cookie data
      • Social: Facebook/Twitter API
      • Collaborative Filtering: bucketed consumption groups
  • 3. Largest Content Discovery and Monetization Network
      • 3B daily recommendations
      • 1M+ sourced content providers
      • 0M monthly unique users
      • 1M+ sourced content items
  • 4. What Does it Mean?
      • 5 data centers across the globe
      • Terabytes of data per day (many billions of events)
      • Data must be processed and analyzed in real time, for example:
          – Real-time, per-user content recommendations
          – Real-time expenditure reports
          – Automated campaign management
          – Automated calibration of recommendation algorithms
          – Real-time analytics
  • 5. About Spark
      • Open source
      • Apache top-level project (since Feb. 19th)
      • Databricks: a commercial company that supports it
      • Hadoop-compatible computing engine
      • Can run side by side with Hadoop/Hive on the same data
      • Drastically faster than Hadoop through in-memory computing
      • Multiple H/A options: standalone cluster, Apache Mesos and ZooKeeper, or YARN
  • 6. Spark Development Community
      • With over 100 developers and 25 companies, one of the most active communities in big data
      • Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
      • Past 6 months: more active devs than Hadoop MapReduce!
  • 8. Spark Performance (three bar charts)
      • Streaming: throughput (MB/s/node), Storm vs. Spark
      • SQL: response time (s), Hive vs. Impala (disk), Impala (mem), Shark (disk), Shark (mem)
      • Graph: response time (min), Hadoop vs. Giraph vs. GraphX
  • 9. Spark API • Simple to write through easy APIs in Java, Scala and Python • The same analytics code can be used for both streaming data and offline data processing
  • 10. Spark Key Concepts: write programs in terms of transformations on distributed datasets
      • Resilient Distributed Datasets
          – Collections of objects spread across a cluster, stored in RAM or on disk
          – Built through parallel transformations
          – Automatically rebuilt on failure
      • Operations
          – Transformations (e.g. map, filter, groupBy)
          – Actions (e.g. count, collect, save)
  • 11. Working With RDDs (diagram: transformations turn RDDs into new RDDs; an action turns an RDD into a value)
      textFile = sc.textFile("SomeFile.txt")
      linesWithSpark = textFile.filter(lambda line: "Spark" in line)
      linesWithSpark.count()    (returns 74)
      linesWithSpark.first()    (returns "# Apache Spark")
  • 12. Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns
      lines = spark.textFile("hdfs://...")                         (base RDD)
      errors = lines.filter(lambda s: s.startswith("ERROR"))       (transformed RDD)
      messages = errors.map(lambda s: s.split("\t")[2])
      messages.cache()
      messages.filter(lambda s: "mysql" in s).count()              (action)
      messages.filter(lambda s: "php" in s).count()
      . . .
      (diagram: the driver sends tasks to three workers; each worker reads its block, Block 1 to 3, keeps the messages in its cache, Cache 1 to 3, and sends results back to the driver)
      • Full-text search of Wikipedia: 60GB on 20 EC2 machines
      • 0.5 sec vs. 20s for on-disk
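The log-mining flow on slide 12 can be tried without a cluster. What follows is a minimal plain-Python sketch, not Spark code: generator expressions stand in for lazy transformations, a list stands in for `cache()`, and the sample log lines and the `count_matching` helper are invented for illustration.

```python
# Plain-Python stand-in for the slide's log-mining pipeline (no Spark needed).
# The sample log lines are invented; only the tab-separated layout with the
# message in field [2] follows the slide.
sample_log = [
    "ERROR\t2014-02-19\tmysql connection refused",
    "INFO\t2014-02-19\tjob started",
    "ERROR\t2014-02-19\tphp fatal error in index.php",
    "ERROR\t2014-02-20\tmysql timeout",
]

# "Transformations": generator expressions are lazy, like RDD transformations.
errors = (line for line in sample_log if line.startswith("ERROR"))
messages_lazy = (line.split("\t")[2] for line in errors)

# "cache()": materialize once so that repeated queries reuse the result.
messages = list(messages_lazy)


def count_matching(cached_messages, pattern):
    """Action: count the cached messages that contain `pattern`."""
    return sum(1 for message in cached_messages if pattern in message)


mysql_errors = count_matching(messages, "mysql")
php_errors = count_matching(messages, "php")
print(mysql_errors, php_errors)
```

Nothing is read or filtered until `list(...)` runs, which mirrors how Spark defers work until an action is called; the second query then reuses the materialized messages instead of re-reading the log.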
  • 13. Task Scheduler
      • General task graphs
      • Automatically pipelines functions
      • Data locality aware
      • Partitioning aware to avoid shuffles
      (diagram: RDDs A to F grouped into Stage 1, Stage 2 and Stage 3 and linked by groupBy, map, filter and join; the legend marks RDDs and cached partitions)
  • 14. Software Components
      • Spark runs as a library in your program (1 instance per app)
      • Runs tasks locally or on a cluster
          – Mesos, YARN or standalone mode
      • Accesses storage systems via the Hadoop InputFormat API
          – Can use HBase, HDFS, S3, …
      (diagram: your application holds a SparkContext, which uses local threads or a cluster manager; each worker runs a Spark executor; executors read HDFS or other storage)
  • 15. System Architecture & Data Flow @ Taboola (diagram components: FE Servers, Driver + Consumers, Spark Cluster, C* Cluster, MySQL Cluster)
  • 16. Execution Graph @ Taboola
      rdd1 = Context.parallelize([data])           # data start point (dates, etc.)
      rdd2 = rdd1.mapPartitions(loadfunc())        # loading data from external sources (Cassandra, MySQL, etc.)
      rdd3 = rdd2.reduce(reduceFunc())             # aggregating the data and storing results
      rdd4 = rdd3.mapPartitions(saverfunc())       # saving the results to a DB
      rdd4.count()                                 # executing the above graph by forcing an output operation
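The five steps of slide 16 can be sketched in plain Python to show why the final `count()` matters. This is an illustration under assumptions, not Taboola's code: `FAKE_SOURCE`, `RESULTS_DB`, `load_partition`, `merge_counts` and `save_partition` are hypothetical names, and generators stand in for Spark's lazy evaluation.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-ins for external systems (Cassandra / MySQL in the talk).
FAKE_SOURCE = {
    "2014-02-19": ["click", "view", "click"],
    "2014-02-20": ["view", "click"],
}
RESULTS_DB = {}  # pretend results table


def load_partition(dates):
    """mapPartitions-style loader: one partition of dates -> per-date counts."""
    for date in dates:
        yield Counter(FAKE_SOURCE.get(date, []))


def merge_counts(left, right):
    """Reduce function: combine two partial aggregates."""
    return left + right


def save_partition(aggregates):
    """mapPartitions-style saver: write each aggregate, yield rows written."""
    for aggregate in aggregates:
        RESULTS_DB.update(aggregate)
        yield len(aggregate)


# Step 1: the data start point is just a list of dates, split into partitions.
partitions = [["2014-02-19"], ["2014-02-20"]]
# Step 2: load each partition (lazy until consumed).
loaded = (counts for part in partitions for counts in load_partition(part))
# Step 3: aggregate the partial results.
total = reduce(merge_counts, loaded, Counter())
# Steps 4 and 5: the saver is lazy as well; consuming it with sum() forces it
# to run, which is the role rdd4.count() plays on the slide.
rows_written = sum(save_partition([total]))
print(RESULTS_DB, rows_written)
```

Without the last line nothing would be written, because `save_partition` is a generator; the same holds for a chain of Spark transformations that never reaches an action.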
  • 17. Cassandra as a Distributed Storage
      • Event log files are saved as blobs to a dedicated keyspace
      • C* tables holding the event log files are partitioned by day, with a new table per day. This makes maintenance easier and loading into Spark simpler
      • Using Astyanax driver + CQL3
          – Recipe to load all keys of a table very fast (hundreds of thousands / sec)
          – Split by keys and then load data by key in batches, in parallel partitions
      • Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
          – The DataStax InputFormat had issues and at the time was not formally supported
      • Worked well, but ended up not using it, instead using mapPartitions
          – Very simple, no overhead, no need to be tied to Hadoop
          – Will probably use the InputFormat when we deploy a Shark solution
      • Plans to open source all this
      • Table layout (one table per day, e.g. userevent_2014-02-19, userevent_2014-02-20):
          – Key (String): GUID (originally log file name)
          – Data (blob): gzipped file
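The "split by keys, then load by key in batches, in parallel partitions" recipe from slide 17 could look roughly like the sketch below. It is plain Python with an in-memory dictionary in place of Cassandra; `FAKE_TABLE`, `split_keys`, `fetch_batch`, `load_partition` and the batch size are assumptions made for illustration, and the partitions are processed sequentially here rather than in parallel.

```python
import gzip

# Hypothetical in-memory stand-in for one daily table: key -> gzipped log file.
FAKE_TABLE = {
    f"guid-{i}": gzip.compress(f"event {i} a\nevent {i} b\n".encode("utf-8"))
    for i in range(5)
}


def split_keys(keys, num_partitions):
    """Deal the keys round-robin into `num_partitions` lists."""
    return [keys[i::num_partitions] for i in range(num_partitions)]


def fetch_batch(table, batch):
    """Stand-in for one multi-key read against the store."""
    return [table[key] for key in batch]


def load_partition(table, keys, batch_size=2):
    """mapPartitions-style function: fetch blobs in batches, yield log lines."""
    for start in range(0, len(keys), batch_size):
        for blob in fetch_batch(table, keys[start:start + batch_size]):
            yield from gzip.decompress(blob).decode("utf-8").splitlines()


key_partitions = split_keys(sorted(FAKE_TABLE), num_partitions=2)
lines = [
    line
    for part in key_partitions
    for line in load_partition(FAKE_TABLE, part)
]
print(len(lines), "lines loaded from", len(FAKE_TABLE), "blobs")
```

In a Spark job the list of key partitions would be the parallelized input and `load_partition` the function handed to `mapPartitions`, so each executor would fetch and decompress only its own keys.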
  • 18. Sample – Click Counting for Campaign Stopping
      1. mapPartitions: map from strings to objects with a pre-designed click key
      2. reduceByKey: remove duplicate clicks (see next slide)
      3. map: switch keys to a campaign+day key
      4. reduceByKey: aggregate the data by campaign+day
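The four steps of slide 18 can be walked through on a handful of records. The sketch below is plain Python with a local `reduce_by_key` helper standing in for Spark's `reduceByKey`; the comma-separated input format, the `Click` fields and the choice of (user, item) as the click key are assumptions, since the slides do not show the real key design.

```python
from collections import namedtuple

# Hypothetical record layout; the real click objects are not shown in the talk.
Click = namedtuple("Click", "user item campaign day timestamp")


def parse_partition(lines):
    """Step 1 (mapPartitions): strings -> (click key, Click) pairs."""
    for line in lines:
        user, item, campaign, day, timestamp = line.split(",")
        yield (user, item), Click(user, item, campaign, day, int(timestamp))


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


def oldest(a, b):
    """Step 2 reducer: of two clicks with the same key, keep the earlier one."""
    return a if a.timestamp <= b.timestamp else b


raw = [
    "u1,i1,c1,2014-02-19,100",
    "u1,i1,c1,2014-02-19,250",  # duplicate of the first click
    "u2,i1,c1,2014-02-19,300",
    "u1,i2,c2,2014-02-19,400",
]

# Steps 1 and 2: parse, then drop duplicate clicks.
unique = reduce_by_key(parse_partition(raw), oldest)
# Step 3: switch to a campaign+day key, one count per surviving click.
per_campaign = [((click.campaign, click.day), 1) for _, click in unique]
# Step 4: aggregate by campaign+day.
click_counts = dict(reduce_by_key(per_campaign, lambda a, b: a + b))
print(click_counts)
```

Campaign c1 ends up with two clicks rather than three, because the second click by u1 on i1 is dropped in step 2 before the counts are aggregated.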
  • 19. Campaign Stopping – Removing Dup Clicks
      • When more than one click is found from the same user on the same item, keep only the oldest
      • Using accumulators to track the number of duplicates
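The duplicate-removal rule of slide 19, together with a counter for the discarded clicks, might be sketched as follows. `CountAccumulator` is a tiny local stand-in for an add-only Spark accumulator, and the `(user, item, timestamp)` tuples are a made-up input shape; neither comes from the talk.

```python
class CountAccumulator:
    """Tiny local stand-in for an add-only Spark accumulator."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def keep_oldest(clicks, duplicates):
    """Keep the oldest click per (user, item) and count the discarded ones.

    `clicks` is an iterable of (user, item, timestamp) tuples; every click
    beyond the first for a key adds 1 to the `duplicates` accumulator.
    """
    oldest = {}
    for user, item, timestamp in clicks:
        key = (user, item)
        if key in oldest:
            duplicates.add(1)
            oldest[key] = min(oldest[key], timestamp)
        else:
            oldest[key] = timestamp
    return oldest


duplicates = CountAccumulator()
clicks = [
    ("u1", "i1", 250),
    ("u1", "i1", 100),
    ("u2", "i1", 300),
    ("u1", "i1", 175),
]
deduped = keep_oldest(clicks, duplicates)
print(deduped, "duplicates dropped:", duplicates.value)
```

The accumulator only ever receives increments and is read once at the end, which matches how operational counters of this kind are typically reported back to the driver.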
  • 20. Our Deployment
      • 7 nodes, each:
          – 24 cores
          – 256G RAM
          – 6 1TB SSD disks, JBOD configuration
          – 10G Ethernet
      • Total cluster power
          – 1760GB RAM
          – 168 CPUs
          – 42 TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
      • Symmetric deployment
          – Mesos + Spark
          – Cassandra
      • More
          – RabbitMQ on 3 nodes
          – ZooKeeper on 3 nodes
          – MySQL cluster outside this cluster
      • Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
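The totals on slide 20 follow from the per-node figures, and the last bullet implies a rough throughput. The arithmetic below is a back-of-the-envelope sketch: it assumes 1 TB = 1000 GB and that all data is stored at replication factor 3. Note that 7 × 256 GB comes to 1792 GB, slightly above the 1760 GB on the slide; the talk does not explain the difference.

```python
# Per-node figures taken from the slide.
NODES = 7
CORES_PER_NODE = 24
RAM_GB_PER_NODE = 256
SSD_TB_PER_NODE = 6 * 1  # six 1 TB disks
REPLICATION_FACTOR = 3

total_cores = NODES * CORES_PER_NODE      # 168, as on the slide
total_ram_gb = NODES * RAM_GB_PER_NODE    # 1792 by this product
raw_storage_tb = NODES * SSD_TB_PER_NODE  # 42, as on the slide
# Upper bound on unique data if everything is stored three times.
max_unique_data_tb = raw_storage_tb / REPLICATION_FACTOR

# Throughput implied by "~1 TB in ~3 minutes" (assuming 1 TB = 1000 GB).
cluster_gb_per_s = 1000 / (3 * 60)
node_gb_per_s = cluster_gb_per_s / NODES

print(f"{total_cores} cores, {total_ram_gb} GB RAM, {raw_storage_tb} TB raw storage")
print(f"at most {max_unique_data_tb:.0f} TB of unique data at RF={REPLICATION_FACTOR}")
print(f"~{cluster_gb_per_s:.1f} GB/s for the cluster, ~{node_gb_per_s:.2f} GB/s per node")
```

Since both the volume and the duration are approximate on the slide, the per-node figure is an order-of-magnitude estimate rather than a benchmark result.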
  • 21. Things that work well with Spark (from our experience)
      • Very easy to code complex jobs
          – Harder than SQL, but better than other MapReduce options
          – Simple concepts, "small" API
      • Easy to unit test
          – Runs in local mode, so ideal for micro E2E tests
          – Each mapper/reducer can be unit tested without Spark, if you do not use anonymous classes
      • Very resilient
      • Can read from and write to any data source, including RDBMS, Cassandra, HDFS, local files, etc.
      • Great monitoring
      • Easy to deploy & upgrade
      • Blazing fast
  • 22. Things that do not work that well (from our experience)
      • Long (endless) running tasks require some workarounds
          – Temp files: Spark creates a lot of files in spark.local.dir, which requires periodic cleanup
          – Use spark.cleaner.ttl for long-running tasks
          – Periodic Driver restarts (Spark 0.8.1)
      • Spark Streaming: not fully mature
          – Some edge cases can cause loss of data
          – Sliding window / batch model does not fit our needs
              • We always load some history to deal with late-arriving data
              • State management is left to the user and is not trivial
          – BUT we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
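The "periodic cleanup" of temp files mentioned on slide 22 could be as simple as an age-based sweep. The sketch below is one possible approach, not the one used in the talk: `clean_old_entries` is a hypothetical helper, the one-hour threshold is arbitrary, and the demo runs against a throwaway directory instead of a real `spark.local.dir`.

```python
import os
import shutil
import tempfile
import time


def clean_old_entries(directory, max_age_seconds, now=None):
    """Delete top-level entries of `directory` older than `max_age_seconds`.

    Age is judged by modification time. Returns the sorted removed names.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if now - os.path.getmtime(path) <= max_age_seconds:
            continue  # recently modified, possibly still in use
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
        else:
            os.remove(path)
        removed.append(name)
    return sorted(removed)


# Demo on a throwaway directory (a stand-in for spark.local.dir).
demo_dir = tempfile.mkdtemp()
for name in ("old.tmp", "new.tmp"):
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write("x")
two_hours_ago = time.time() - 7200
os.utime(os.path.join(demo_dir, "old.tmp"), (two_hours_ago, two_hours_ago))

removed = clean_old_entries(demo_dir, max_age_seconds=3600)
remaining = os.listdir(demo_dir)
shutil.rmtree(demo_dir)
print("removed:", removed, "remaining:", remaining)
```

Modification time alone does not prove that a running job no longer needs a file, so the threshold would have to be chosen well above the longest time any stage keeps its shuffle or spill files.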
  • 23. General / Optimization Tips
      • Use Spark Accumulators to collect and report operational data
      • 10G Ethernet
      • Multiple SSD disks per node, JBOD configuration
      • A lot of memory for the cluster
      • Use Leader Election for Driver H/A
          – In Spark 0.9 this may not be needed, with the new option to run the driver inside the cluster
  • 24. Technologies Taboola Uses for Spark
      • Spark: computing cluster
      • Mesos: cluster resource manager
          – Better resource allocation (coarse grained) for Spark
      • ZooKeeper: distributed coordination
          – Enables multi-master for Mesos & Spark
      • Curator: leader election
          – Taboola's Spark Driver
      • Cassandra: distributed data store
      • Monitoring: http://metrics.codahale.com/
  • 25. Attributions. Many of the general Spark slides were taken from the Databricks Spark Summit 2013 slides. There are great materials at:
      • https://spark.incubator.apache.org/
      • http://spark-summit.org/summit-2013/

Editor's notes

  1. One of the most exciting things you’ll find. Growing all the time. NASCAR slide. Including several sponsors of this event are just starting to get involved… If your logo is not up here, forgive us – it’s hard to keep up!
  2. Colloquially referred to as RDDs (e.g. caching in RAM). Lazy operations to build RDDs from other RDDs. Return a result or write it to storage.
  3. Let me illustrate this with some bad PowerPoint diagrams and animations. This diagram is LOGICAL,
  4. Add “variables” to the “functions” in functional programming
  5. NOT a modified version of Hadoop