© Cloudera, Inc. All rights reserved.
Intro to Apache Spark
Anand Iyer
Senior Product Manager, Cloudera
Target Audience
• New to Spark, or have only rudimentary knowledge of Spark
• Have basic knowledge of MapReduce
If you are an advanced Spark developer, you are unlikely to get much out of this talk.
• No performance tuning or debugging tips
Spark: Easy and Fast Big Data
• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run
  • General execution graphs
  • In-memory Caching
Easy to code API
RDD: Resilient Distributed Datasets
Abstraction to represent the large distributed sets of data that are being processed.
RDDs are:
• Broken up into partitions, which are distributed across nodes
  • In practice, RDDs usually have between 100 and 10,000 partitions
• Partitions operated upon in parallel
• Immutable
• Fault-Tolerant via concept of lineage
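The partitioning and immutability properties listed above can be illustrated with a small single-machine sketch. This is a toy model, not Spark's API: `partition` and `map_partitions` are hypothetical helper names, and threads stand in for the nodes that would hold the partitions.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split a list into num_partitions slices (toy stand-in for RDD partitions)."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def map_partitions(partitions, func):
    """Apply func to every element, one worker per partition.

    Returns NEW partitions; the inputs are left untouched, which mirrors
    the immutability of RDDs.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [func(x) for x in part], partitions))

numbers = list(range(10))
parts = partition(numbers, 3)
squared = map_partitions(parts, lambda x: x * x)
```

Each partition is processed independently, so the work parallelizes without coordination, and `parts` is unchanged after the call.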
Spark jobs are DAGs of operations on RDDs
Operations on RDDs
• Transformations: Create a new RDD from existing RDDs
• Actions: Run computation on RDD, return values to the driver
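The split between Transformations and Actions can be sketched in plain Python. `ToyRDD` is a hypothetical single-machine class written for this illustration, not Spark's implementation: transformations only record what to do, and nothing runs until an action asks for values.

```python
from itertools import islice

class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, ops=()):
        self._source = source      # original data
        self._ops = tuple(ops)     # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work.
    def map(self, func):
        return ToyRDD(self._source, self._ops + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._source, self._ops + (("filter", pred),))

    def _compute(self):
        # Push each element through the whole chain of recorded operations.
        for item in self._source:
            keep = True
            for kind, func in self._ops:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: run the recorded operations and return plain values.
    def collect(self):
        return list(self._compute())

    def take(self, n):
        return list(islice(self._compute(), n))

calls = []

def double(x):
    calls.append(x)        # record every invocation
    return 2 * x

rdd = ToyRDD(range(1, 6))
evens_doubled = rdd.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = len(calls)   # still 0: transformations are lazy
result = evens_doubled.collect()   # the action triggers the computation
```

Because `filter` and `map` each return a new object, `rdd` itself is never modified.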
[Figure: example DAG of RDDs A through G linked by map, filter, groupBy, join, and take operations]
Rich Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
Example: Logistic Regression
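from math import exp
import numpy
from pyspark import SparkContext

sc = SparkContext(…)
rawData = sc.textFile("hdfs://…")
# parserFunc turns each line into a point with fields x (features) and y (label)
data = rawData.map(parserFunc).cache()  # cached: re-used on every iteration
w = numpy.random.rand(D)  # D = number of features
for i in range(iterations):
    # Sum over all points of the logistic-loss gradient (sigmoid(y * w.x) - 1) * y * x
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)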
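The same map-then-reduce gradient step can be tried locally without a cluster. The following is a minimal pure-Python sketch that uses the standard logistic-loss gradient, (sigmoid(y * w.x) - 1) * y * x, summed over all points. The toy data, the helper names, and the step size of 0.1 are assumptions made for illustration only.

```python
from functools import reduce
from math import exp, log

# Hypothetical toy data: (features, label) pairs with labels in {-1, +1}.
points = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -1.5), -1), ((-2.0, -1.0), -1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(w):
    """Logistic loss summed over all points."""
    return sum(log(1 + exp(-y * dot(w, x))) for x, y in points)

def point_gradient(w, point):
    """Gradient contribution of one point: (sigmoid(y * w.x) - 1) * y * x."""
    x, y = point
    scale = (1 / (1 + exp(-y * dot(w, x))) - 1) * y
    return tuple(scale * xi for xi in x)

def gradient(w):
    """Map every point to its contribution, then reduce by element-wise sum."""
    contributions = map(lambda p: point_gradient(w, p), points)
    return reduce(lambda a, b: tuple(ai + bi for ai, bi in zip(a, b)), contributions)

w = (0.0, 0.0)
losses = [loss(w)]
for _ in range(5):
    g = gradient(w)
    w = tuple(wi - 0.1 * gi for wi, gi in zip(w, g))  # step size 0.1 is an assumption
    losses.append(loss(w))
```

In this sketch the loss recorded in `losses` shrinks on every iteration, which is a quick sanity check that the sign of the gradient is right.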
Execution model and Spark Internals
Driver & Executors
• Driver: the master process
  • One Driver per Spark App
  • Runs the main(…) function of your app
• Executors: processes on the worker nodes
Logical graph to physical execution plan
[Figure: DAG of RDDs A through G linked by map, filter, groupBy, join, and take operations, with a legend marking RDDs and cached partitions]
• Execution graph is broken into Stages
• Each Stage consists of multiple Tasks
• Task is unit of computation that is scheduled on an Executor
• A Stage consists of multiple operations that can be pipelined
• Stages are split when data needs to be “shuffled”
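The stage-splitting rule above can be sketched for the simplest case, a linear chain of operations. This is a deliberate simplification: a real execution graph is a DAG rather than a chain, and both `split_into_stages` and the contents of `SHUFFLE_OPS` are assumptions made for this illustration.

```python
# Operations assumed, for this sketch, to require a shuffle.
SHUFFLE_OPS = {"groupBy", "join", "reduceByKey"}

def split_into_stages(ops):
    """Group a linear chain of operation names into stages.

    Consecutive non-shuffle operations are pipelined into one stage; each
    shuffle operation starts a new stage because its input must be
    redistributed first.
    """
    stages, current = [], []
    for op in ops:
        if op in SHUFFLE_OPS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(["map", "filter", "groupBy", "map", "join", "take"])
# stages == [["map", "filter"], ["groupBy", "map"], ["join", "take"]]
```

Within a stage, each Task can apply all of the stage's operations to its partition in one pass, which is what pipelining means here.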
Shuffle
• Redistributes data among partitions
• Triggered by operations such as reduceByKey, groupBy, join
• Hash keys to buckets
• Conceptually similar to the MapReduce shuffle
• Shuffle entails writes to disk
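The idea of hashing keys to buckets can be sketched in a few lines of plain Python. This toy version keeps everything in memory, whereas a real shuffle writes to disk and moves data across the network. The function names are hypothetical, and `zlib.crc32` is used only to get a hash that is stable across runs.

```python
import zlib
from collections import defaultdict

def bucket_for(key, num_buckets):
    """Deterministically hash a string key to a bucket index."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def shuffle(partitions, num_buckets):
    """Redistribute (key, value) pairs so every key lands in exactly one bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for part in partitions:
        for key, value in part:
            buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets

def reduce_by_key(buckets, func):
    """Combine values per key inside each bucket; no cross-bucket traffic is needed."""
    result = {}
    for bucket in buckets:
        grouped = defaultdict(list)
        for key, value in bucket:
            grouped[key].append(value)
        for key, values in grouped.items():
            acc = values[0]
            for v in values[1:]:
                acc = func(acc, v)
            result[key] = acc
    return result

input_partitions = [
    [("spark", 1), ("hadoop", 1)],
    [("spark", 1), ("yarn", 1)],
    [("hadoop", 1), ("spark", 1)],
]
buckets = shuffle(input_partitions, num_buckets=2)
counts = reduce_by_key(buckets, lambda a, b: a + b)
```

After the shuffle, all records for a given key sit in the same bucket, so the per-key reduction can run on each bucket independently.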
Spark WebUI lets you visualize DAG
Drivers & Executors revisited
• Driver
  • One Driver per Spark App
  • Runs the main(…) function of your app
  • Creates logical DAG and physical execution plan
  • Schedules Tasks
  • Driver receives and collects the results of Actions
• Executors
  • Hold RDD partitions
  • Execute Tasks as scheduled by Driver
Spark runs on Cluster Managers
• Spark does not itself manage the cluster of machines
• Runs on YARN, Mesos, or Standalone (a cluster manager built specifically for Spark)
Why is Spark Fast?
Memory management leads to greater performance
Trends:
• ½ price every 18 months
• 2x bandwidth every 3 years
[Figure labels: 128–384 GB, 12–24 cores, 50 GB per sec]
Memory can be an enabler for high-performance big data applications
Persisting or Caching RDDs
• If an RDD will be re-used, persist it to prevent re-computation
• Very common in iterative algorithms
• By default, cached RDDs are held in memory
• But memory may not suffice
• MEMORY_AND_DISK persistence: Spill the partitions that don’t fit in memory to disk
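The benefit of persisting a re-used dataset can be shown with a toy counter. `ToyDataset` is a hypothetical class for illustration, not Spark's API. It models only the in-memory case and counts how often the underlying data has to be rebuilt.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD that can optionally be persisted."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._cached = None
        self._persist = False
        self.compute_calls = 0    # how many times the data was rebuilt

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # served from the cache, no recomputation
        self.compute_calls += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

def expensive_parse():
    return [x * x for x in range(5)]

uncached = ToyDataset(expensive_parse)
cached = ToyDataset(expensive_parse).persist()
for _ in range(3):            # e.g. three passes of an iterative algorithm
    uncached.collect()
    cached.collect()
```

After three passes the unpersisted dataset has been rebuilt three times and the persisted one only once, which is why iterative algorithms benefit so much from caching.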
Lineage for Fault-Tolerance
[Figure: DAG of RDDs A through G linked by map, filter, groupBy, join, and take operations]
Lineage Truncation
[Figure: DAG of RDDs A through H linked by map, filter, groupBy, join, and take operations]
Lineage gets truncated at an RDD when:
• RDD is persisted to memory or disk
• RDD already materialized on disk due to shuffle
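Lineage-based recovery and its truncation can be sketched with a chain of nodes that each remember their parent and the function that derives them. `Node` is a hypothetical class for this illustration. It models a linear chain on one machine, while real lineage is tracked per partition across a DAG.

```python
class Node:
    """Hypothetical RDD node: remembers its parent and how it is derived from it."""

    def __init__(self, name, parent=None, func=None, source=None):
        self.name = name
        self.parent = parent
        self.func = func
        self.source = source
        self.materialized = None   # set when persisted (or written by a shuffle)

    def persist(self):
        self.materialized = self.compute([])
        return self

    def compute(self, trace):
        """Rebuild this node's data, recording which nodes had to be recomputed."""
        if self.materialized is not None:
            return self.materialized          # lineage is truncated here
        trace.append(self.name)
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.compute(trace))

a = Node("A", source=range(6))
b = Node("B", a, lambda xs: [x + 1 for x in xs])
c = Node("C", b, lambda xs: [x for x in xs if x % 2 == 0])
d = Node("D", c, lambda xs: [x * 10 for x in xs])

full_trace = []
d.compute(full_trace)                 # nothing persisted: walk back to the source

b.persist()
short_trace = []
recovered = d.compute(short_trace)    # recovery now stops at the persisted node B
```

Without any persisted node, rebuilding D replays D, C, B, and A. Once B is persisted, only D and C are replayed.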
Summary of what makes Spark fast
• Maximize use of memory
• Re-used RDDs can be explicitly cached to prevent re-computation
• Leverage Lineage & Pipelining to minimize writing intermediate data to disk
• Efficient Task Scheduler
• Ensure worker nodes are kept busy via quick scheduling of Tasks
• More optimizations coming in Spark SQL
• Compact binary in-memory data representation, etc
• More details in subsequent slides
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Spark Streaming
• Incoming data represented as DStreams (Discretized Streams)
• Data commonly read from streaming data channels like Kafka or Flume
• A Spark Streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)
Discretized Stream
• Incoming data stream is broken down into micro-batches
• Micro-batch size is user defined, usually 0.3 to 1 second
• Micro-batches are disjoint
• Each micro-batch is an RDD
• Effectively, a DStream is a sequence of RDDs, one per micro-batch
• Spark Streaming known for high throughput
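Discretization into micro-batches can be sketched as bucketing timestamped events by a fixed batch interval. `discretize` is a hypothetical helper and plain lists stand in for the per-batch RDDs, so this is a toy model and not how Spark Streaming receives data.

```python
from collections import defaultdict

def discretize(events, batch_seconds):
    """Group (timestamp, value) events into disjoint micro-batches.

    Batch k holds events with k*batch_seconds <= timestamp < (k+1)*batch_seconds.
    Returns the batches in time order; intervals with no events yield empty batches.
    """
    if not events:
        return []
    by_index = defaultdict(list)
    for ts, value in events:
        by_index[int(ts // batch_seconds)].append(value)
    last = max(by_index)
    return [by_index.get(i, []) for i in range(last + 1)]

events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.7, "d")]
batches = discretize(events, batch_seconds=0.5)
# batches == [["a", "b"], ["c"], [], ["d"]]
```

Every event falls into exactly one batch, and the resulting sequence of batches plays the role of the DStream.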
Windowed DStreams
• Defined by specifying a window size and a step size
• Both are multiples of micro-batch size
• Operations invoked on each window’s data
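Windowing can be sketched as a sliding slice over a sequence of micro-batches. In this toy model, `windows` is a hypothetical helper, and the window size and step are counted in micro-batches, so both are multiples of the batch interval by construction.

```python
def windows(batches, window_size, step):
    """Return sliding windows over a list of micro-batches.

    window_size and step are counted in micro-batches. Each window is the
    flattened contents of window_size consecutive batches.
    """
    out = []
    for end in range(window_size, len(batches) + 1, step):
        window = batches[end - window_size:end]
        out.append([item for batch in window for item in batch])
    return out

batches = [[1, 2], [3], [4, 5], [6], [7]]
# The operation (here: sum) is invoked on each window's data.
sums = [sum(w) for w in windows(batches, window_size=3, step=2)]
# sums == [15, 22]
```

With a window of three batches and a step of two, consecutive windows overlap by one batch, which is why the batch `[4, 5]` contributes to both sums.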
Maintain and update arbitrary state
updateStateByKey(...)
• Define initial state
• Provide state update function
• Continuously update with new information
• State maintained as RDD, updated via Transformation
Examples:
• Running count of words seen in text stream
• Per user session state from activity stream
Note: Requires periodic checkpointing to fault-tolerant storage, every N (~10-15) micro-batches
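The running word count example can be sketched as a state dictionary that is replaced, not mutated, after every micro-batch. `update_state_by_key` here is a local toy function, not the Spark method of the same name. Unlike Spark, it calls the update function only for keys that appear in the current batch, and it does no checkpointing.

```python
def update_state_by_key(state, batch, update_func):
    """Return a NEW state dict after folding one micro-batch into the old state.

    update_func(new_values, old_state) receives all new values for a key and
    the previous state for that key (None if the key is new).
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)
    for key, values in grouped.items():
        new_state[key] = update_func(values, state.get(key))
    return new_state

def running_count(new_values, old_count):
    """State update function: add the new occurrences to the previous count."""
    return (old_count or 0) + sum(new_values)

state = {}   # initial state: no words seen yet
for batch in [["a", "b", "a"], ["b"], ["c", "a"]]:
    pairs = [(word, 1) for word in batch]
    state = update_state_by_key(state, pairs, running_count)
# state == {"a": 3, "b": 2, "c": 1}
```

Building a fresh dictionary on each step mirrors the slide's point that state is maintained as an RDD and updated via a Transformation.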
Spark SQL & Dataframes
Dataframes
• Distributed collection of data organized as named, typed columns
• Like RDDs, they consist of partitions, can be cached, and have fault-tolerance via lineage
• Can be constructed from:
  • Structured data files: JSON, Avro, Parquet, etc.
  • Tables in Hive
  • Tables in an RDBMS
  • Existing RDDs by programmatically applying schema
Spark SQL
• SQL statements to process Dataframes
• Embed SQL statements in your Scala, Java, or Python Spark application
• Queries can also be issued via JDBC/ODBC
Why Spark SQL? Ease of programming
• Easy to code against schema’d records
• SQL is often an easier alternative to code for non-complex operations on relational data
• Embed SQL in your Scala, Java, or Python applications to seamlessly mix “regular” Spark for complex operations with SQL
Why Spark SQL? Performance
SQL processed by Query Optimizer → Automatic Optimizations
• Compressed memory format (as opposed to Java-serialized objects in RDDs)
• Predicate pushdown (read less data to reduce IO)
• Optimal pipelining of operations
• Cost based optimizer
• …
MLlib
Collection of popular machine learning algorithms:
Classifiers: logistic regression, boosted trees, random forests, etc.
Clustering: k-means, LDA
Recommender Systems: ALS
Dimensionality Reduction: PCA and SVD
Feature Engineering: TF-IDF, Word2Vec, etc.
Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
Pipelines API: Chain together feature engineering, training, and model validation into one pipeline
Thank You
And of course….we are hiring!!!

Intro to Apache Spark

  • 1. 1© Cloudera, Inc. All rights reserved. Intro to Apache Spark Anand Iyer Senior Product Manager, Cloudera
  • 2. 2© Cloudera, Inc. All rights reserved. Target Audience • New to Spark, or have very rudimentary knowledge of Spark. • Have basic knowledge of Map-Reduce If you are an advanced Spark developer, you are unlikely to get much out of this talk. • No performance tuning or debugging tips
  • 3. 3© Cloudera, Inc. All rights reserved. Spark: Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • Fast to Run • General execution graphs • In-memory Caching
  • 4. 4© Cloudera, Inc. All rights reserved. Easy to code API
  • 5. 5© Cloudera, Inc. All rights reserved. RDD: Resilient Distributed Datasets Abstraction to represent the large distributed sets of data that are being processed. RDDs are: • Broken up into partitions, which are distributed across nodes • In practice, RDDs usually have between 100 to 10K partitions • Partitions operated upon in parallel • Immutable • Fault-Tolerant via concept of lineage
  • 6. 6© Cloudera, Inc. All rights reserved. Spark jobs are DAGs of operations on RDDs Operations on RDDs • Transformations: Create a new RDD from existing RDDs • Actions: Run computation on RDD, return values to the driver = RDD join filter groupBy B: C: D: E: G: Ç√Ω map A: map take F:
  • 7. 7© Cloudera, Inc. All rights reserved. Rich Expressive API • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip sample take first partitionBy mapWith pipe save ...
  • 8. 8© Cloudera, Inc. All rights reserved. Example: Logistic Regression sc = SparkContext(…) rawData = sc.textFile(“hdfs://…”) data = rawData.map(parserFunc).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w
  • 9. 9© Cloudera, Inc. All rights reserved. Execution model and Spark Internals
  • 10. 10© Cloudera, Inc. All rights reserved. Driver & Executors • Driver: Master node • One Driver per Spark App • Runs the main(…) function of your app • Executors: Worker nodes
  • 11. 11© Cloudera, Inc. All rights reserved. Logical graph to physical execution plan = cached partition = RDD join filter groupBy B: F: C: D: E: G: map A: map take • Execution graph is broken into Stages • Each Stage consists of multiple Tasks • Task is unit of computation that is scheduled on an Executor • A Stage consists of multiple operations that can be pipelined • Stages are split when data needs to be “shuffled”
  • 12. 12© Cloudera, Inc. All rights reserved. Shuffle • Redistributes data among partitions • reduce, groupBy, join • Hash keys to buckets • Identical to MapReduce Shuffle • Shuffle entails writes to disk
  • 13. Spark Web UI lets you visualize the DAG
  • 14. Drivers & Executors revisited • Driver • One Driver per Spark App • Runs the main(…) function of your app • Creates logical DAG and physical execution plan • Schedules Tasks • Driver receives and collects the results of Actions • Executors • Hold RDD partitions • Execute Tasks as scheduled by Driver
  • 15. Spark runs on Cluster Managers • Spark does not itself manage a cluster of machines • Runs on YARN, Mesos or Standalone (a cluster manager built specifically for Spark)
  • 16. Why is Spark Fast?
  • 17. Memory management leads to greater performance • Trends: ½ price every 18 months; 2x bandwidth every 3 years • Memory can be an enabler for high-performance big data applications [Figure labels: 128 – 384 GB, 12-24 cores, 50 GB per sec]
  • 18. Persisting or Caching RDDs • If an RDD will be re-used, persist it to prevent re-computation • Very common in iterative algorithms • By default, cached RDDs are held in memory • But memory may not suffice • MEMORY_AND_DISK persistence: spill the partitions that don’t fit in memory to disk
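Why persisting matters for iterative algorithms can be seen by counting recomputations in a toy model. The class below is an assumption made for illustration (`CachedDataset`, `compute_count` and `expensive_parse` are hypothetical names); it keeps data in a plain Python attribute and does not model memory limits or spilling to disk.

```python
class CachedDataset:
    """Toy model of persist(): compute once, reuse afterwards."""

    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._data = None
        self.compute_count = 0  # how many times the data was rebuilt

    def persist(self):
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._data is not None:
            return self._data  # served from the cache
        self.compute_count += 1
        data = self._compute()
        if self._persisted:
            self._data = data
        return data


def expensive_parse():
    # Stand-in for reading and parsing input
    return [x * 2 for x in range(5)]


uncached = CachedDataset(expensive_parse)
cached = CachedDataset(expensive_parse).persist()
for _ in range(3):  # e.g. three iterations of an algorithm
    uncached.collect()
    cached.collect()
```

After the loop, `uncached.compute_count` is 3 and `cached.compute_count` is 1: without persistence every iteration pays the parsing cost again.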
  • 19. Lineage for Fault-Tolerance [Figure: DAG of RDDs A to G, with operations map, filter, groupBy, join and take]
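Lineage-based recovery can be sketched as "remember how each RDD was derived, and replay that recipe for a lost partition". The following is a minimal single-process model, assuming one parent per RDD and per-partition functions; `Node` and `compute_partition` are hypothetical names, not Spark classes.

```python
class Node:
    """One RDD in a toy lineage graph with a single parent."""

    def __init__(self, name, parent=None, fn=None, source=None):
        self.name = name
        self.parent = parent
        self.fn = fn          # how to derive a partition from the parent's
        self.source = source  # input partitions, only set on the root

    def compute_partition(self, i):
        # Walk the lineage back to the input and rebuild partition i
        if self.parent is None:
            return list(self.source[i])
        return self.fn(self.parent.compute_partition(i))


source = [[1, 2, 3], [4, 5, 6]]  # two input partitions
a = Node("A", source=source)
b = Node("B", parent=a, fn=lambda part: [x * 10 for x in part])
c = Node("C", parent=b, fn=lambda part: [x for x in part if x > 20])

materialized = {i: c.compute_partition(i) for i in range(2)}
del materialized[1]                       # simulate losing an executor
materialized[1] = c.compute_partition(1)  # rebuild only the lost partition
```

Only partition 1 is recomputed; partition 0 is untouched, and no replica of the data was ever needed.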
  • 24. Lineage Truncation • Lineage gets truncated at an RDD when: • the RDD is persisted to memory or disk • the RDD is already materialized on disk due to a shuffle [Figure: DAG of RDDs A to H, with operations map, filter, groupBy, join and take]
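Truncation can be pictured as the recovery walk stopping early at the first RDD whose data is still available. A rough sketch, assuming a hypothetical linear chain A -> B -> C -> D and a hypothetical helper `recompute_path`; real lineage is a graph and availability is tracked per partition.

```python
def recompute_path(rdd, parents, materialized):
    """Return the RDDs that must be recomputed to rebuild `rdd`."""
    path = []
    current = rdd
    # Walk towards the input until a materialized RDD (or the root) is hit
    while current is not None and current not in materialized:
        path.append(current)
        current = parents.get(current)
    return list(reversed(path))


# Hypothetical linear chain A -> B -> C -> D
parents = {"D": "C", "C": "B", "B": "A", "A": None}

full = recompute_path("D", parents, materialized=set())
truncated = recompute_path("D", parents, materialized={"B"})
# full == ["A", "B", "C", "D"]; truncated == ["C", "D"]
```

With B persisted (or already on disk as shuffle output), rebuilding D no longer reaches back to A, which is the saving the slide describes.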
  • 29. Summary of what makes Spark fast • Maximize use of memory • Re-used RDDs can be explicitly cached to prevent re-computation • Leverage Lineage & Pipelining to minimize writing intermediate data to disk • Efficient Task Scheduler • Ensure worker nodes are kept busy via quick scheduling of Tasks • More optimizations coming in Spark SQL • Compact binary in-memory data representation, etc. • More details in subsequent slides
  • 30. Spark will replace MapReduce to become the standard execution engine for Hadoop
  • 31. Spark Streaming
  • 32. Spark Streaming • Incoming data represented as DStreams (Discretized Streams) • Data commonly read from streaming data channels like Kafka or Flume • A Spark Streaming application is a DAG of Transformations and Actions on DStreams (and RDDs)
  • 33. Discretized Stream • Incoming data stream is broken down into micro-batches • Micro-batch size is user defined, usually 0.3 to 1 second • Micro-batches are disjoint • Each micro-batch is an RDD • Effectively, a DStream is a sequence of RDDs, one per micro-batch • Spark Streaming is known for high throughput
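Discretization itself is simple to sketch: bucket each incoming event by which batch interval its timestamp falls into. The helper below is an illustrative assumption (`discretize` is a hypothetical name, and events are given as `(timestamp_in_seconds, value)` pairs); plain lists stand in for the per-batch RDDs.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Split (timestamp, value) events into disjoint micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    if not batches:
        return []
    last = max(batches)
    # Emit one list per interval, including intervals with no data
    return [batches.get(i, []) for i in range(last + 1)]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.7, "d")]
dstream = discretize(events, batch_seconds=0.5)
# dstream == [["a", "b"], ["c"], [], ["d"]]
```

Each event lands in exactly one batch, so the batches are disjoint, and an interval with no arrivals still produces an (empty) batch.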
  • 34. Windowed DStreams • Defined by specifying a window size and a step size • Both are multiples of the micro-batch size • Operations are invoked on each window’s data
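A sliding window over micro-batches can be sketched as slicing a list of batches. In this toy version (a hypothetical `windows` generator, not the Spark Streaming API) both the window size and the step are expressed as a number of micro-batches, which is one way to honor the "multiples of micro-batch size" constraint.

```python
def windows(batches, window_size, step):
    """Yield sliding windows; both sizes are counted in micro-batches."""
    for end in range(window_size, len(batches) + 1, step):
        window = batches[end - window_size:end]
        # Flatten the batches in the window into one collection
        yield [item for batch in window for item in batch]


batches = [[1], [2, 3], [4], [5], [6, 7], [8]]
sums = [sum(w) for w in windows(batches, window_size=3, step=2)]
# sums == [10, 22]
```

The first window covers batches 0 to 2 and the second covers batches 2 to 4, so with a step smaller than the window size consecutive windows overlap.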
  • 35. Maintain and update arbitrary state: updateStateByKey(...) • Define initial state • Provide state update function • Continuously update with new information • State maintained as RDD, updated via Transformation. Examples: • Running count of words seen in text stream • Per-user session state from activity stream. Note: requires periodic checkpointing to fault-tolerant storage, every N (~10-15) micro-batches
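The running word count example can be sketched with an ordinary dictionary as the state. This is a simplified stand-in, not Spark's `updateStateByKey`: `update_state_by_key` and `running_count` are hypothetical helpers, the update function is assumed to take the new values for a key plus that key's previous state, and only keys seen in the current batch are updated. Checkpointing is not modeled.

```python
def update_state_by_key(state, batch, update_fn):
    """Return a new state dict after folding one micro-batch into it."""
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    new_state = dict(state)  # the old state is left untouched
    for key, values in new_values.items():
        new_state[key] = update_fn(values, state.get(key))
    return new_state


def running_count(values, previous):
    # previous is None the first time a key is seen
    return (previous or 0) + sum(values)


micro_batches = [
    [("spark", 1), ("hadoop", 1)],
    [("spark", 1), ("spark", 1)],
    [("yarn", 1)],
]
state = {}
for batch in micro_batches:
    state = update_state_by_key(state, batch, running_count)
# state == {"spark": 3, "hadoop": 1, "yarn": 1}
```

Building a new state from the old one on every batch, rather than mutating it in place, mirrors the slide's point that state is itself an immutable dataset updated via a transformation.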
  • 36. Spark SQL & DataFrames
  • 37. DataFrames • Distributed collection of data organized as named, typed columns • Like RDDs, they consist of partitions, can be cached, and have fault-tolerance via lineage • Can be constructed from: • Structured data files: JSON, Avro, Parquet, etc. • Tables in Hive • Tables in an RDBMS • Existing RDDs, by programmatically applying a schema
  • 38. Spark SQL • SQL statements to process DataFrames • Embed SQL statements in your Scala, Java or Python Spark application • Queries can also be issued via JDBC/ODBC
  • 39. Why Spark SQL? Ease of programming • Easy to code against schema’d records • SQL is often an easier alternative to code, for non-complex operations on relational data • Embed SQL in your Scala, Java or Python applications to seamlessly mix “regular” Spark for complex operations with SQL
  • 40. Why Spark SQL? Performance. SQL processed by Query Optimizer → Automatic Optimizations • Compressed memory format (as opposed to Java serialized objects in RDDs) • Predicate pushdown (read less data to reduce IO) • Optimal pipelining of operations • Cost-based optimizer • …
  • 41. MLlib: collection of popular machine learning algorithms • Classifiers: logistic regression, boosted trees, random forests, etc. • Clustering: k-means, LDA • Recommender Systems: ALS • Dimensionality Reduction: PCA and SVD • Feature Engineering: TF-IDF, Word2Vec, etc. • Statistical Functions: Chi-Squared Test, Pearson Correlation, etc. • Pipelines API: chain together feature engineering, training and model validation into one pipeline
  • 42. Thank You. And of course… we are hiring!

Editor's Notes

  1. Compared to 10c/GB and 100 MBps for disk storage. Hot data is often a small fraction of total data.
  2. Show example usage of lineage with caching. Then show example usage of lineage where it goes to the shuffle files.
  3. Create a logical execution plan for the DAG.
  4. DStream is the abstraction, and each DStream has transformations and actions like RDDs, though only a subset of them.