SlideShare a Scribd company logo
1 of 23
Download to read offline
Real-Time Analytical Processing (RTAP)
Using the Spark Stack
Jason Dai

jason.dai@intel.com
Intel Software and Services Group
Project Overview
Research & open source projects initiated by AMPLab
in UC Berkeley

BDAS: Berkeley Data Analytics Stack
(Ref: https://amplab.cs.berkeley.edu/software/)

Intel closely collaborating with AMPLab & the community
on open source development
• Apache incubation since June 2013
• The “most active cluster data processing engine
after Hadoop MapReduce”

Intel partnering with several big websites
• Building next-gen big data analytics using the Spark stack

• Real-time analytical processing (RTAP)
• E.g., Alibaba , Baidu iQiyi, Youku, etc.

2
Next-Gen Big Data Analytics
Volume
• Massive scale & exponential growth

Variety
• Multi-structured, diverse sources & inconsistent schemas
Value

• Simple (SQL) – descriptive analytics
• Complex (non-SQL) – predictive analytics
Velocity
• Interactive – the speed of thought
• Streaming/online – drinking from the firehose

3
Next-Gen Big Data Analytics
Volume
• Massive scale & exponential growth

Variety
• Multi-structured, diverse sources & inconsistent schemas
Value

• Simple (SQL) – descriptive analytics
• Complex (non-SQL) – predictive analytics
Velocity
• Interactive – the speed of thought
• Streaming/online – drinking from the firehose

4

The Spark stack
Real-Time Analytical Processing (RTAP)
A vision for next-gen big data analytics
• Data captured & processed in a (semi) streaming/online fashion
• Real-time & history data combined and mined interactively and/or iteratively
– Complex OLAP / BI in interactive fashion
– Iterative, complex machine learning & graph analysis

• Predominantly memory-based computation

Events /
Logs

Messaging /
Queue

Stream
Processing

In-Memory
Store

Persistent Storage

NoSQL
Data
Warehouse

5

Online Analysis
/ Dashboard
Low Latency
Processing
Engine

Ad-hoc,
interactive
OLAP / BI

Iterative, complex
machine learning &
graph analysis
Real-World Use Case #1

(Semi) Real-Time Log Aggregation & Analysis
Logs continuously collected & streamed in
• Through queuing/messaging systems

Incoming logs processed in a (semi) streaming fashion
• Aggregations for different time periods, demographics, etc.
• Join logs and history tables when necessary
Aggregation results then consumed in a (semi) streaming fashion
• Monitoring, alerting, etc.

6
(Semi) Real-Time Log Aggregation: MiniBatch Jobs

Log Collector

Log Collector

Kafka
Client

Kafka
Client

…

Server
Kafka

Server

Server
Spark

Mesos

Spark

Kafka

Kafka

Mesos

…

Spark
Mesos

Architecture

• Log collected and published to Kafka continuously

RDBMS

– Kafka cluster collocated with Spark Cluster

• Independent Spark applications launched at regular interval
– A series of mini-batch apps running in “fine-grained” mode on Mesos
– Process newly received data in Kafka (e.g., aggregation by keys) & write results out

In production in the user’s environment today
• 10s of seconds latency

7

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
(Semi) Real-Time Log Aggregation: MiniBatch Jobs
Design decisions
• Kafka and Spark servers collocated
– A distinct Spark task for each Kafka partition, usually reading from local file system cache

• Individual Spark applications launched periodically
– Apps logging current (Kafka partition) offsets in ZooKeeper
– Data of failed batch simply discarded (other strategies possible)

• Spark running in “fine-grained” mode on Mesos
– Allow the next batch to start even if the current one is late

Performance
• Input (log collection) bottlenecked by network bandwidth (1GbE)
• Processing (log aggregation) bottlenecked by CPUs
– Easily scaled to several seconds

8

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
(Semi) Real-Time Log Aggregation: Spark
Streaming
Log
Collectors

Kafka
Cluster

Spark
Cluster

RDBMS

Implications
• Better streaming framework support
– Complex (e.g., stateful) analysis, fault-tolerance, etc.

• Kafka & Spark not collocated
– DStream retrieves logs in background (over network) and caches blocks in memory

• Memory tuning to reduce GC is critical
– spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size)
– Storage level (MEMORY_ONLY_SER2)

• Lowe latency (several seconds)
– No startup overhead (reusing SparkContext)

9

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Real-World Use Case #2
Complex, Interactive OLAP / BI
Significant speedup of ad-hoc, complex OLAP / BI
• Spark/Shark cluster runs alongside Hadoop/Hive clusters

• Directly query on Hadoop/Hive data (interactively)
• No ETL, no storage overhead, etc.
In-memory, real-time queries
• Data of interest loaded into memory
create table XYZ tblproperties ("shark.cache" = "true") as select ...

• Typically frontended by lightweight UIs
– Data visualization and exploration (e.g., drill down / up)

10
Time Series Analysis: Unique Event Occurrence
Computing unique event occurrence across the time range
• Input time series
<TimeStamp, ObjectId, EventId, ...>

– E.g., watch of a particular video by a specific user
– E.g., transactions of a particular stock by a specific account

• Output:
<ObjectId, TimeRange, Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#>

– E.g., accumulated unique event# for each day in a week (staring from Monday)

…
1

2

…

7

• Implementation
– 2-level aggregation using Spark
• Specialized partitioning, general execution graph, in-memory cached data

– Speedup from 20+ hours to several minutes
11

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Time Series Analysis: Unique Event Occurrence
Input time series
TimeStamp

ObjectId

EventId

…

Partitioned by (ObjectId, EventId)

Aggregation
(shuffle)
ObjectId

12

EventId

Day
(1, 2, … ,7)

1

ObjectId

EventId

Day
Count
(1, 2, … ,7)

Potentially cached in mem
Time Series Analysis: Unique Event Occurrence
Input time series
TimeStamp

ObjectId

EventId

…

Partitioned by (ObjectId, EventId)

Aggregation
(shuffle)
ObjectId

EventId

Day
(1, 2, … ,7)

1

ObjectId

ObjectId

ObjectId

EventId

TimeRage
[1, D]

EventId

TimeRage
[1, D]

1

Potentially cached in mem

Day
Count
(1, 2, … ,7)

Accumulated
Count

1 (if count ≥2)
0 otherwise

…

1 (if count ≥n)
0 otherwise

Aggregation
(shuffle)
ObjectId

Partitioned by ObjectId

13

TimeRage
[1, D]

Unique Event# Unique Event(≥2)#

…

Unique Event(≥n)#
Real-World Use Case #3

Complex Machine Learning & Graph Analysis
Algorithm: complex math operations
• Mostly matrix based
– Multiplication, factorization, etc.

• Sometime graph-based
– E.g., sparse matrix

Iterative computations

• Matrix (graph) cached in memory across iterations

14
Graph Analysis: N-Degree Association
N-degree association in the graph
• Computing associations between two
vertices that are n-hop away
• E.g., friends of friend

Graph-parallel implementation

Weight1(u, v) = edge(u, v) ∈ (0, 1)
Weightn(u, v) = 𝑥→𝑣 Weightn−1(u, x)∗Weight1(x, v)

• Bagel (Pregel on Spark) and GraphX
– Memory optimizations for efficient graph
caching critical

• Speedup from 20+ minutes to <2 minutes

15

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Graph Analysis: N-Degree Association
u

v

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

w

State[w] = list of Weight(x, w)
(for current top K weights to vertex w)

State[v] = list of Weight(x, v)
(for current top K weights to vertex v)

16
Graph Analysis: N-Degree Association
u

v

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

w

State[w] = list of Weight(x, w)
(for current top K weights to vertex w)

State[v] = list of Weight(x, v)
(for current top K weights to vertex v)

u
Messages = {D(x, u) =
Weight(x, v) * edge(w, u)}
(for weight(x, v) in State[v])

Messages = {D(x, u) =
Weight(x, w) * edge(w, u)}
(for weight(x, w) in State[w])

v

17

w
Graph Analysis: N-Degree Association
u

v

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

w

State[w] = list of Weight(x, w)
(for current top K weights to vertex w)

State[v] = list of Weight(x, v)
(for current top K weights to vertex v)

Weight(u, x) =

u

u
Messages = {D(x, u) =
Weight(x, v) * edge(w, u)}
(for weight(x, v) in State[v])

Messages = {D(x, u) =
Weight(x, w) * edge(w, u)}
(for weight(x, w) in State[w])

v

18

v

w

𝐷 𝑥,𝑢 𝑖𝑛 𝑀𝑒𝑠𝑠𝑎𝑔𝑒𝑠 D(x,

u)

State[u] = list of Weight(x, u)
(for current top K weights to vertex w)

w
Graph Analysis: N-Degree Association
u

v

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

w

State[w] = list of Weight(x, w)
(for current top K weights to vertex w)

State[v] = list of Weight(x, v)
(for current top K weights to vertex v)

Weight(u, x) =

u

u
Messages = {D(x, u) =
Weight(x, v) * edge(w, u)}
(for weight(x, v) in State[v])

Messages = {D(x, u) =
Weight(x, w) * edge(w, u)}
(for weight(x, w) in State[w])

v

19

v

w

𝐷 𝑥,𝑢 𝑖𝑛 𝑀𝑒𝑠𝑠𝑎𝑔𝑒𝑠 D(x,

u)

State[u] = list of Weight(x, u)
(for current top K weights to vertex w)

w
Contributions by Intel
Netty based shuffle for Spark
FairScheduler for Spark

Spark job log files
Metrics system for Spark
Configurations system for Spark
Spark shell on YARN
Spark (standalone mode) integration with security Hadoop
Byte code generation for Shark
Co-partitioned join in Shark
...
20
Summary
The Spark stack: lightning-fast analytics over Hadoop data
• Active communities and early adopters evolving
– Apache incubator project

• A growing stack: SQL, streaming, graph-parallel, machine learning, …

Work with us on next-gen big data analytics using the Spark stack
• Interactive and in-memory OLAP / BI
• Complex machine learning & graph analysis
• Near real-time, streaming processing
• And many more!

21
Notices and Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL®
PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL
PROPERTY RIGHT.
Intel may make changes to specifications, product descriptions, and plans at any time, without
notice. Designers must not rely on the absence or characteristics of any features or instructions
marked "reserved" or "undefined". Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from future changes to
them. The information here is subject to change without notice. Do not finalize a design with
this information.
The products described in this document may contain design defects or errors known as errata
which may cause the product to deviate from published specifications. Current characterized
errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and
before placing your product order.

All dates provided are subject to change without notice.
Intel and Intel logoare trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2013, Intel Corporation. All rights reserved.
22
Intel realtime analytics_spark

More Related Content

What's hot

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Databricks
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot modePrakash Chockalingam
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Tathagata Das
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPDiscretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPTathagata Das
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 

What's hot (20)

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPDiscretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 

Viewers also liked

Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Josef A. Habdank
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Spark Summit
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeSocialmetrix
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 

Viewers also liked (7)

Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtime
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 

Similar to Intel realtime analytics_spark

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014Luigi Dell'Aquila
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Codemotion
 

Similar to Intel realtime analytics_spark (20)

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
 
Spark cep
Spark cepSpark cep
Spark cep
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Intel realtime analytics_spark

  • 1. Real-Time Analytical Processing (RTAP) Using the Spark Stack Jason Dai jason.dai@intel.com Intel Software and Services Group
  • 2. Project Overview Research & open source projects initiated by AMPLab in UC Berkeley BDAS: Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/) Intel closely collaborating with AMPLab & the community on open source development • Apache incubation since June 2013 • The “most active cluster data processing engine after Hadoop MapReduce” Intel partnering with several big websites • Building next-gen big data analytics using the Spark stack • Real-time analytical processing (RTAP) • E.g., Alibaba , Baidu iQiyi, Youku, etc. 2
  • 3. Next-Gen Big Data Analytics Volume • Massive scale & exponential growth Variety • Multi-structured, diverse sources & inconsistent schemas Value • Simple (SQL) – descriptive analytics • Complex (non-SQL) – predictive analytics Velocity • Interactive – the speed of thought • Streaming/online – drinking from the firehose 3
  • 4. Next-Gen Big Data Analytics Volume • Massive scale & exponential growth Variety • Multi-structured, diverse sources & inconsistent schemas Value • Simple (SQL) – descriptive analytics • Complex (non-SQL) – predictive analytics Velocity • Interactive – the speed of thought • Streaming/online – drinking from the firehose 4 The Spark stack
  • 5. Real-Time Analytical Processing (RTAP) A vision for next-gen big data analytics • Data captured & processed in a (semi) streaming/online fashion • Real-time & history data combined and mined interactively and/or iteratively – Complex OLAP / BI in interactive fashion – Iterative, complex machine learning & graph analysis • Predominantly memory-based computation Events / Logs Messaging / Queue Stream Processing In-Memory Store Persistent Storage NoSQL Data Warehouse 5 Online Analysis / Dashboard Low Latency Processing Engine Ad-hoc, interactive OLAP / BI Iterative, complex machine learning & graph analysis
  • 6. Real-World Use Case #1 (Semi) Real-Time Log Aggregation & Analysis Logs continuously collected & streamed in • Through queuing/messaging systems Incoming logs processed in a (semi) streaming fashion • Aggregations for different time periods, demographics, etc. • Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion • Monitoring, alerting, etc. 6
  • 7. (Semi) Real-Time Log Aggregation: MiniBatch Jobs Log Collector Log Collector Kafka Client Kafka Client … Server Kafka Server Server Spark Mesos Spark Kafka Kafka Mesos … Spark Mesos Architecture • Log collected and published to Kafka continuously RDBMS – Kafka cluster collocated with Spark Cluster • Independent Spark applications launched at regular interval – A series of mini-batch apps running in “fine-grained” mode on Mesos – Process newly received data in Kafka (e.g., aggregation by keys) & write results out In production in the user’s environment today • 10s of seconds latency 7 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 8. (Semi) Real-Time Log Aggregation: MiniBatch Jobs Design decisions • Kafka and Spark servers collocated – A distinct Spark task for each Kafka partition, usually reading from local file system cache • Individual Spark applications launched periodically – Apps logging current (Kafka partition) offsets in ZooKeeper – Data of failed batch simply discarded (other strategies possible) • Spark running in “fine-grained” mode on Mesos – Allow the next batch to start even if the current one is late Performance • Input (log collection) bottlenecked by network bandwidth (1GbE) • Processing (log aggregation) bottlenecked by CPUs – Easily scaled to several seconds 8 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 9. (Semi) Real-Time Log Aggregation: Spark Streaming Log Collectors Kafka Cluster Spark Cluster RDBMS Implications • Better streaming framework support – Complex (e.g., stateful) analysis, fault-tolerance, etc. • Kafka & Spark not collocated – DStream retrieves logs in background (over network) and caches blocks in memory • Memory tuning to reduce GC is critical – spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) – Storage level (MEMORY_ONLY_SER2) • Lowe latency (several seconds) – No startup overhead (reusing SparkContext) 9 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 10. Real-World Use Case #2 Complex, Interactive OLAP / BI Significant speedup of ad-hoc, complex OLAP / BI • Spark/Shark cluster runs alongside Hadoop/Hive clusters • Directly query on Hadoop/Hive data (interactively) • No ETL, no storage overhead, etc. In-memory, real-time queries • Data of interest loaded into memory create table XYZ tblproperties ("shark.cache" = "true") as select ... • Typically frontended by lightweight UIs – Data visualization and exploration (e.g., drill down / up) 10
  • 11. Time Series Analysis: Unique Event Occurrence Computing unique event occurrence across the time range • Input time series <TimeStamp, ObjectId, EventId, ...> – E.g., watch of a particular video by a specific user – E.g., transactions of a particular stock by a specific account • Output: <ObjectId, TimeRange, Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#> – E.g., accumulated unique event# for each day in a week (staring from Monday) … 1 2 … 7 • Implementation – 2-level aggregation using Spark • Specialized partitioning, general execution graph, in-memory cached data – Speedup from 20+ hours to several minutes 11 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 12. Time Series Analysis: Unique Event Occurrence Input time series TimeStamp ObjectId EventId … Partitioned by (ObjectId, EventId) Aggregation (shuffle) ObjectId 12 EventId Day (1, 2, … ,7) 1 ObjectId EventId Day Count (1, 2, … ,7) Potentially cached in mem
  • 13. Time Series Analysis: Unique Event Occurrence Input time series TimeStamp ObjectId EventId … Partitioned by (ObjectId, EventId) Aggregation (shuffle) ObjectId EventId Day (1, 2, … ,7) 1 ObjectId ObjectId ObjectId EventId TimeRage [1, D] EventId TimeRage [1, D] 1 Potentially cached in mem Day Count (1, 2, … ,7) Accumulated Count 1 (if count ≥2) 0 otherwise … 1 (if count ≥n) 0 otherwise Aggregation (shuffle) ObjectId Partitioned by ObjectId 13 TimeRage [1, D] Unique Event# Unique Event(≥2)# … Unique Event(≥n)#
  • 14. Real-World Use Case #3 Complex Machine Learning & Graph Analysis Algorithm: complex math operations • Mostly matrix based – Multiplication, factorization, etc. • Sometime graph-based – E.g., sparse matrix Iterative computations • Matrix (graph) cached in memory across iterations 14
  • 15. Graph Analysis: N-Degree Association N-degree association in the graph • Computing associations between two vertices that are n-hop away • E.g., friends of friend Graph-parallel implementation Weight1(u, v) = edge(u, v) ∈ (0, 1) Weightn(u, v) = 𝑥→𝑣 Weightn−1(u, x)∗Weight1(x, v) • Bagel (Pregel on Spark) and GraphX – Memory optimizations for efficient graph caching critical • Speedup from 20+ minutes to <2 minutes 15 For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 16. Graph Analysis: N-Degree Association u v State[u] = list of Weight(x, u) (for current top K weights to vertex u) w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) 16
  • 17. Graph Analysis: N-Degree Association u v State[u] = list of Weight(x, u) (for current top K weights to vertex u) w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v 17 w
  • 18. Graph Analysis: N-Degree Association u v State[u] = list of Weight(x, u) (for current top K weights to vertex u) w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) Weight(u, x) = u u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v 18 v w 𝐷 𝑥,𝑢 𝑖𝑛 𝑀𝑒𝑠𝑠𝑎𝑔𝑒𝑠 D(x, u) State[u] = list of Weight(x, u) (for current top K weights to vertex w) w
  • 19. Graph Analysis: N-Degree Association u v State[u] = list of Weight(x, u) (for current top K weights to vertex u) w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) Weight(u, x) = u u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v 19 v w 𝐷 𝑥,𝑢 𝑖𝑛 𝑀𝑒𝑠𝑠𝑎𝑔𝑒𝑠 D(x, u) State[u] = list of Weight(x, u) (for current top K weights to vertex w) w
  • 20. Contributions by Intel Netty based shuffle for Spark FairScheduler for Spark Spark job log files Metrics system for Spark Configurations system for Spark Spark shell on YARN Spark (standalone mode) integration with security Hadoop Byte code generation for Shark Co-partitioned join in Shark ... 20
  • 21. Summary The Spark stack: lightning-fast analytics over Hadoop data • Active communities and early adopters evolving – Apache incubator project • A growing stack: SQL, streaming, graph-parallel, machine learning, … Work with us on next-gen big data analytics using the Spark stack • Interactive and in-memory OLAP / BI • Complex machine learning & graph analysis • Near real-time, streaming processing • And many more! 21
  • 22. Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. All dates provided are subject to change without notice. Intel and Intel logoare trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2013, Intel Corporation. All rights reserved. 22