Running on a Cluster
Basics of RDD
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Spark Runtime Architecture
Basics of RDD
● Process where the main() method runs
● When you launch a Spark shell, you’ve created a driver program
● Once the driver terminates, the application is finished.
● While running, it performs the following:
○ Converting a user program into tasks
■ Tasks are the units of execution
■ Converts the DAG (logical graph) into a physical execution plan
○ Scheduling tasks on executors
The Driver
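As a concrete illustration (a minimal sketch assuming a spark-shell session with an existing SparkContext sc; the input path is hypothetical), the transformations below only build the logical DAG, and the driver turns it into stages and tasks only when the action runs:

val lines = sc.textFile("/data/sample.txt")       // no job yet - only builds the DAG
val words = lines.flatMap(_.split(" "))           // transformation: extends the DAG
val counts = words.map((_, 1)).reduceByKey(_ + _) // shuffle boundary => new stage
counts.count()                                    // action: driver builds the physical plan and schedules tasks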
Basics of RDD
● Coordinates the scheduling of individual tasks on executors
● Schedules tasks based on data placement
● Tracks cached data and uses it to schedule future tasks
● Runs the Spark web interface on port 4040
Driver: Scheduling tasks on executors
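For example (a minimal sketch in spark-shell; the path is hypothetical), caching changes where the driver schedules future tasks:

val logs = sc.textFile("/data/access.log")
logs.cache()                             // ask executors to keep partitions in memory
logs.count()                             // first action computes and caches the partitions
logs.filter(_.contains("ERROR")).count() // driver now prefers executors holding the cached blocks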
Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Basics of RDD
● It is a pluggable component in Spark.
● This allows Spark to run on YARN, Mesos, or the built-in Standalone manager
Cluster Manager
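Because the cluster manager is pluggable, the same application code selects it only through the master URL (a minimal sketch; the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")    // illustrative name
  .setMaster("local[*]")  // or "spark://host:7077", "mesos://host:5050"; usually passed via spark-submit --master instead of hard-coding
val sc = new SparkContext(conf)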
Basics of RDD
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
Basics of RDD
● Worker processes that run the tasks of a job
● Return results to the driver
● Launched once at the beginning of a Spark application
● Run for the entire lifetime of an application
● Provide in-memory storage for cached RDDs via the Block Manager
Executors
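A short sketch of how an application hands data to the executors' Block Managers (the path is hypothetical; MEMORY_ONLY is one of several storage levels trading memory for recomputation cost):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("/data/sample.txt")
data.persist(StorageLevel.MEMORY_ONLY)  // kept in memory by each executor's Block Manager
data.count()                            // materializes and caches the partitions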
Understanding The Architecture
Worker Node
Machine or Node 1
Worker Node
Machine or Node 2
Worker Node
Machine or Node 3
Driver
Machine or Node 4
User
Resource Manager
YARN/MESOS/EC2/Standalone
Spark Context
Executor Executor Executor
Task Task Task Task Task
Maintains RDD &
executes workloads
Converts the user’s program into
tasks & launches the Spark
application
Basics of RDD
Launching a Program
Basics of RDD
Launching a Program
Spark-Submit Driver
Cluster
Manager
Starts
Executors
● The user submits an application using spark-submit.
● spark-submit launches the driver program
● The driver invokes main() and creates a SparkContext
● The driver program contacts the cluster manager for resources
● The cluster manager launches executors
● The driver process runs through the user application
● The driver sends work to executors in the form of tasks
● Tasks run on executor processes to compute and save results
● Executors are terminated and resources released when the driver’s main() exits or sc.stop() is called
Exit
sc.stop()
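Putting the lifecycle together, a minimal (hypothetical) application skeleton that releases resources on exit:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyApp"))
    try {
      println(sc.parallelize(1 to 100).sum()) // the actual job
    } finally {
      sc.stop() // terminates executors and releases cluster resources
    }
  }
}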
Running On A Cluster
Getting Started - Two Modes
1. Local Mode
2. Cluster Mode
spark-shell --master ….
Running On A Cluster
Local Mode or Spark in-process
1. The default mode
2. Does not require any resource manager
a. Simply download and run
3. Good for utilizing multiple cores for processing
4. The number of partitions generally equals the number of CPU cores
5. Generally used for testing
Running On A Cluster
We can run spark-shell or spark-submit with:
○ spark-shell (no master given: defaults to local[*])
○ spark-shell --master local (one worker thread)
○ spark-shell --master local[n] (n worker threads)
○ spark-shell --master local[*] (one worker thread per CPU core)
Getting Started - Local Mode
Running On A Cluster
Local Mode - Check!
scala> sc.isLocal
res0: Boolean = true
scala> sc.master
res1: String = local[*]
Running On A Cluster
Local Mode - Hands On!
Running On A Cluster
Cluster Modes
Different kinds of resource managers:
a. Standalone
b. YARN
c. Mesos
d. EC2
Running On A Cluster
Cluster Mode - Standalone
Uses Spark’s built-in resource manager
How to set it up?
a. Install Spark on all nodes
b. Inform all nodes about each other
c. Launch Spark on all nodes
d. The Spark nodes will discover each other
Running On A Cluster
Installing Standalone Cluster
1. Copy a compiled version of Spark to the same location on all your machines—for
example, /home/yourname/spark.
2. Set up password-less SSH access from your master machine to the others.
3. Edit the conf/slaves file on your master and fill in the workers’ hostnames.
4. Run sbin/start-all.sh on your master
5. Check http://masternode:8080
6. To stop the cluster, run sbin/stop-all.sh on your master node.
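Once the cluster is up, applications connect to the standalone master (default port 7077; the hostname, class, and JAR below are placeholders):

spark-shell --master spark://masternode:7077
spark-submit --master spark://masternode:7077 --class com.example.MyApp myapp.jar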
Running On A Cluster
To run Spark inside Hadoop's YARN:
Tasks run inside YARN containers
How to use?
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-shell --master yarn
Cluster Mode - YARN
Running On A Cluster
Launching a program on yarn - Hands On
1. export YARN_CONF_DIR=/etc/hadoop/conf/
2. export HADOOP_CONF_DIR=/etc/hadoop/conf/
3. spark-submit --master yarn --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
Running On A Cluster
Launching a program on yarn - Hands On
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
git clone https://github.com/cloudxlab/bigdata
cd bigdata
cd spark/
cd projects/
cd apache-log-parsing_sbt/
sbt clean
sbt package
spark-submit --master yarn
target/scala-2.10/apache-log-parsing_2.10-0.0.1.jar 10 10
/data/spark/project/access/access.log.45.gz
Running On A Cluster
Hands On
Launching a program on yarn
Running On A Cluster
Cluster Mode - MESOS
1. Mesos is a general-purpose cluster manager
2. It runs both analytics workloads and long-running services (e.g., databases)
3. To use Spark on Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp
4. You can use ZooKeeper to elect the master in Mesos in a multi-master setup
5. Use a mesos://zk:// URI pointing to a list of ZooKeeper nodes
6. E.g., if you have 3 nodes (n1, n2, n3) running ZooKeeper on port 2181, use URI:
mesos://zk://n1:2181/mesos,n2:2181/mesos,n3:2181/mesos
Running On A Cluster
Cluster Mode - Amazon EC2
● Spark comes with a built-in script to launch clusters on Amazon EC2.
● First, create an Amazon Web Services (AWS) account
● Obtain an access key ID and secret access key
● Export these as environment variables:
○ export AWS_ACCESS_KEY_ID="..."
○ export AWS_SECRET_ACCESS_KEY="..."
● Create an EC2 SSH key pair and download its private key file (needed for SSH)
● Launch a cluster with the spark-ec2 script:
○ cd /path/to/spark/ec2
○ ./spark-ec2 -k mykeypair -i mykeypair.pem launch mycluster
Running On A Cluster
Deployment Modes
● Based on where the driver runs
● Two ways:
○ Client - launches the driver program locally (default)
○ Cluster - launches the driver on one of the worker machines inside the
cluster
Running On A Cluster
Architecture Yarn Client Mode
1. The driver application runs outside YARN
a. On the machine where it is launched
2. If the launching process shuts down, the application is killed
3. Less resilient, but quicker to start
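For comparison with the cluster-mode example that follows, the same SparkPi job in client mode (client is the default, so --deploy-mode client may be omitted):

export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10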
Running On A Cluster
1. The driver application runs inside YARN, in the Application Master
2. If the launcher shuts down, the application continues like a batch process
a. in the background
3. The preferred way to run long-running processes
Architecture Yarn Cluster Mode
Running On A Cluster
Architecture Yarn Cluster Mode - Example
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --deploy-mode cluster --class
org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
To check the status, use:
○ http://e.cloudxlab.com:4040/
○ http://a.cloudxlab.com:8088/cluster
Running On A Cluster
Architecture Yarn Cluster Mode - Demo
Hands On
Running On A Cluster
Which Cluster Manager to Use?
1. Start with local mode if this is a new deployment.
2. To use richer resource scheduling capabilities (e.g., queues), use YARN or Mesos
3. When sharing amongst many users is the primary criterion, use Mesos
4. In all cases, it is best to run Spark on the same nodes as HDFS for fast access to
storage.
a. You can either install a Mesos or Standalone cluster on the Datanodes
b. Or use a Hadoop distribution, which installs YARN and HDFS together
Running On A Cluster
Packaging Your Code and Dependencies
1. Bundle all the libraries that your program depends upon
2. No need to bundle the Spark libraries (org.apache.spark) or language libraries
(java…)
3. Python users can:
a. Either install dependencies on all nodes using pip or easy_install
b. Or use the --py-files argument of spark-submit (ships files to every node’s working directory)
4. Java & Scala users can:
a. Submit libraries using --jars
b. When there are many libraries, use a build tool such as sbt or Maven (see the sketch below)
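A minimal build.sbt sketch (names and versions are illustrative, chosen to match the scala-2.10 artifact used earlier); marking Spark as "provided" keeps it out of the packaged JAR since the cluster supplies it, and bundling many third-party dependencies into one uber JAR would typically use the sbt-assembly plugin:

name := "apache-log-parsing"
version := "0.0.1"
scalaVersion := "2.10.6"
// "provided" keeps Spark itself out of the packaged JAR - the cluster supplies it
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"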
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--master Indicates the cluster manager to connect to. The options for
this flag are described earlier.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--deploy-mode Whether to launch the driver program locally (“client”) or
on one of the worker machines inside the cluster (“cluster”).
In client mode spark-submit will run your driver on the same
machine where spark-submit is itself being invoked. In
cluster mode, the driver will be shipped to execute on a
worker node in the cluster. The default is client mode.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--class The “main” class of your application if you’re running a Java
or Scala program.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--name A human-readable name for your application. This will be
displayed in Spark’s web UI.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--jars A list of JAR files to upload and place on the classpath of
your application. If your application depends on a small
number of third-party JARs, you can add them here.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--files A list of files to be placed in the working directory of
your application. This can be used for data files that you
want to distribute to each node.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--py-files A list of files to be added to the PYTHONPATH of
your application. This can contain .py, .egg, or .zip files.
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--executor-memory The amount of memory to use for executors, in bytes.
Suffixes can be used to specify larger quantities such as
“512m” (512 megabytes) or “15g” (15 gigabytes).
Running On A Cluster
Common flags for spark-submit
Flag Explanation
--driver-memory The amount of memory to use for the driver process,
in bytes. Suffixes can be used to specify larger quantities
such as “512m” (512 megabytes) or “15g” (15
gigabytes).
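Combining several of these flags in one (illustrative) submission; the name, class, JAR, and arguments are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name "log-parser" \
  --executor-memory 2g \
  --driver-memory 1g \
  --class com.example.LogParser \
  myapp.jar arg1 arg2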
Thank you!
Running on a Cluster