Welcome to the Session on Spark Architecture
Agenda
• World Prior to Spark
• Philosophy of Distributed Systems
• Google File System & its Architecture
• Introduction to Spark Architecture
World Prior to Spark?
Exercise
Find the sum of all these multiplications.
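The multiplications themselves were shown as an image that is not reproduced here, so the pairs below are made-up placeholders. Under that assumption, a minimal sketch of how the exercise can be split among several workers, each summing its own share before the partial sums are combined (all function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (a, b) pairs standing in for the multiplications on the slide.
PAIRS = [(2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (12, 13)]


def partial_sum(chunk):
    """Sum of products for one worker's share of the pairs."""
    return sum(a * b for a, b in chunk)


def split(items, parts):
    """Split items into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def distributed_sum(pairs, workers=3):
    """Each worker computes a partial sum; the partial sums are then added."""
    chunks = split(pairs, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


total = distributed_sum(PAIRS)
print(total)
```

Threads are used only to keep the sketch self-contained; the same split, compute, combine pattern is what a cluster applies across machines.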
➢ Distributed Systems :-
• A collection of individual computing devices that can communicate with each other
• The computing devices are autonomous in nature
• Each independent computing device is called a node
• Nodes can act independently of each other
• Nodes are programmed to achieve common goals, which they realize by exchanging messages with each other (message-passing system)
• Distribution software called middleware runs on top of the OS of each node
• To its users, it should appear as a single coherent system
➢ Properties of Distributed Systems :-
• Concurrency : Multiple programs run at the same time
• Shared Data : Data is accessed simultaneously by multiple entities
• No Global Clock : Each component has only a local notion of time
• Interdependency : Components run independently but depend on each other to reach the common goal
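Because there is no global clock, nodes often order events with logical clocks instead of wall-clock time. The slides do not name a specific technique; the sketch below uses Lamport logical clocks as one well-known option, and the class and method names are chosen only for illustration.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp an outgoing message (sending counts as an event)."""
        return self.tick()

    def receive(self, message_time):
        """Merge the sender's timestamp, then count the receive event."""
        self.time = max(self.time, message_time) + 1
        return self.time


node_a = LamportClock()
node_b = LamportClock()

node_a.tick()              # local event on A, A's clock becomes 1
stamp = node_a.send()      # A sends a message stamped 2
node_b.receive(stamp)      # B jumps past the stamp: max(0, 2) + 1 = 3
print(stamp, node_b.time)
```

The receive always ends up with a larger timestamp than the send, so cause and effect stay ordered even though the two nodes never compare real clocks.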
Logical Design of Distributed System
➢ Distributed Computing System Design Challenges:-
• Communication :- Communication among processes
• Processes :- Management of processes/threads on clients and servers
• Synchronization :- Coordination among the processes is essential
• Fault Tolerance :- Handling failures of links, nodes, and processes
• Transparency :- Hiding the implementation policies from the user (single coherent system)
➢ Algorithmic challenges in Distributed Computing Systems:-
• Synchronization/Coordination Mechanism :- Processes must be able to operate concurrently
➢ Algorithms:-
• Leader Election
• Mutual Exclusion
• Termination Detection
• Garbage Collection
• Fault Tolerance :-
➢ Algorithms:-
• Consensus Algorithm
• Voting and Quorum Systems
• Self-Stabilizing Systems
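As an illustration of the voting and quorum idea, here is a minimal sketch that assumes a simple majority rule: a value is accepted only if more than half of all nodes, including those that did not reply, voted for it. The function names and the "commit"/"abort" values are hypothetical.

```python
from collections import Counter


def majority_quorum(n_nodes):
    """Smallest number of votes that is more than half of all nodes."""
    return n_nodes // 2 + 1


def decide(votes, n_nodes):
    """Return the value backed by a majority of all nodes, or None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count >= majority_quorum(n_nodes) else None


# Five nodes, two of which have failed and sent no vote.
agreed = decide(["commit", "commit", "commit"], n_nodes=5)
undecided = decide(["commit", "commit", "abort"], n_nodes=5)
print(agreed, undecided)
```

Counting against the total number of nodes rather than the number of replies is what lets the system tolerate failed nodes without two different values both being accepted.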
Google File System
➢ GFS :- Google File System is a scalable distributed file system for large data-intensive applications
➢ Motivation for GFS:-
1) Exploiting commodity hardware: Linux machines
2) Maximize performance per dollar
➢ Goals :-
1) Performance
2) Scalability
3) Reliability
4) Availability
➢ Design of GFS is Driven by :-
1) Component failures, which are the norm rather than the exception
2) Huge files
3) Mutation of files, mostly by appending rather than overwriting
4) File system API co-designed with the applications
Cluster Architecture
➢ GFS Overview :-
• Single Master :- Centralized management
• Files Stored as Chunks :- Fixed size of 64 MB each
• Reliability through Replication :- Each chunk is replicated across 3 or more chunk servers
• No Data Caching :- File data is not cached by clients or chunk servers, because the data sets are large
• Interface :- A familiar file system interface (create, delete, open, close, read, write), plus snapshot and record append
➢ Role of MASTER :- Maintains all file metadata
• File namespace
• File-to-chunk mapping :- 1 chunk = 64 MB
• Chunk location information
• Monitoring :- Tracks chunk servers through periodic heartbeat messages
• Centralized controller
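The master's metadata can be pictured as two lookup tables. The sketch below is a simplified, in-memory illustration, not GFS code: the class, method, file, and server names are hypothetical, and the offset-to-chunk-index step that a GFS client performs is folded into the same method for brevity.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks


class Master:
    """Toy model of the metadata held by the master."""

    def __init__(self):
        self.file_to_chunks = {}   # file name -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> chunk servers holding a replica

    def add_chunk(self, filename, handle, servers):
        """Record a new chunk of a file and where its replicas live."""
        self.file_to_chunks.setdefault(filename, []).append(handle)
        self.chunk_locations[handle] = list(servers)

    def lookup(self, filename, byte_offset):
        """Translate (file, byte offset) into a chunk handle and its replicas."""
        index = byte_offset // CHUNK_SIZE
        handle = self.file_to_chunks[filename][index]
        return handle, self.chunk_locations[handle]


master = Master()
master.add_chunk("/logs/web.log", "chunk-0001", ["cs1", "cs2", "cs3"])
master.add_chunk("/logs/web.log", "chunk-0002", ["cs2", "cs3", "cs4"])

# Byte 100 MB falls in the second 64 MB chunk (index 1).
handle, servers = master.lookup("/logs/web.log", 100 * 1024 * 1024)
print(handle, servers)
```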
➢ Operation Log :- Metadata maintained by the master
• Persistent record of critical metadata changes
• Replicated on multiple remote machines
• The master recovers its file system state by replaying the operation log
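Recovery from the log amounts to replaying the recorded metadata changes in order. The sketch below assumes a made-up JSON record format with three operations ("create", "add_chunk", "delete"); the real log format is not described in these slides, and an in-memory list stands in for the replicated on-disk log.

```python
import json


class OperationLog:
    """Append-only record of metadata changes."""

    def __init__(self):
        self.records = []  # stands in for a persistent, replicated log

    def append(self, op, **args):
        """Serialize one metadata change and add it to the end of the log."""
        self.records.append(json.dumps({"op": op, **args}))


def replay(log):
    """Rebuild the file-to-chunks mapping by applying every record in order."""
    namespace = {}
    for line in log.records:
        record = json.loads(line)
        if record["op"] == "create":
            namespace[record["path"]] = []
        elif record["op"] == "add_chunk":
            namespace[record["path"]].append(record["handle"])
        elif record["op"] == "delete":
            namespace.pop(record["path"], None)
    return namespace


log = OperationLog()
log.append("create", path="/a")
log.append("add_chunk", path="/a", handle="chunk-0001")
log.append("create", path="/b")
log.append("delete", path="/b")

recovered = replay(log)
print(recovered)
```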
GFS Architecture
Consistency Model
Correlation to SPARK
➢ SPARK Keywords:
• Driver -> SparkSession <-> Master in GFS
• Cluster Manager
• Executor <-> Processes running on nodes in GFS
• Worker Node <-> Nodes in GFS
• DAG <-> Metadata in GFS
• Partition <-> Chunk in GFS
➢ Driver: The driver is the process that clients use to submit an application to Spark.
➢ Cluster Manager: The cluster manager launches executors on the worker nodes on behalf of the driver.
➢ SparkSession: The SparkSession object represents a connection to a Spark cluster.
➢ Executor: Spark executors are the JVM processes in which the tasks of the Spark DAG run.
➢ DAG (Directed Acyclic Graph): A DAG in Spark is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied on the RDDs.
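To make the DAG definition concrete, the sketch below models a small word-count lineage in plain Python rather than through the Spark API. The RDD names are hypothetical; the edge labels are the names of common RDD operations, used here only as labels. A topological sort then gives one valid order in which the RDDs could be computed.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (parent RDD, child RDD, operation that derives the child from the parent)
edges = [
    ("lines", "words", "flatMap"),
    ("words", "pairs", "map"),
    ("pairs", "counts", "reduceByKey"),
]

# Build the graph as: RDD -> set of parent RDDs it depends on.
graph = {}
for parent, child, _operation in edges:
    graph.setdefault(parent, set())
    graph.setdefault(child, set()).add(parent)

# Parents always come before children in the resulting order.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because the graph is acyclic, such an order always exists, which is what allows the execution to be planned before any data is processed.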
SPARK Architecture
➢ Role of Driver:-
• Takes the application processing input from the client
• Takes all transformations/actions and creates the DAG
• Stores metadata about all RDDs and their partitions
• Plans the physical execution of the program
• Holds information about the executors
• Monitors the set of running executors
➢ Role of Executor:-
• An executor reserves CPU and memory resources on worker nodes in the cluster
• Executors work in parallel
• Before executors begin execution, they register themselves with the driver program
➢ Role of Worker Nodes:-
• Worker nodes host the executor processes
• Each worker node has a finite, fixed number of executors allotted
➢ Calculation for number of Executors
Configuration:- Hardware: 6 nodes, each with 16 cores and 64 GB RAM
Calculation:-
Assumption:- On each node, 1 core and 1 GB are reserved for the operating system and Hadoop daemons, leaving 15 cores and 63 GB RAM per node
Number of cores per executor = number of concurrent tasks an executor can run
Optimization number: 5, meaning at most 5 concurrent tasks per executor
Hence, No of Cores/Executor = 5
Usable Cores/Node = 15
No of Executors/Node = 15 / 5 = 3
Total No of Executors = 6 * 3 = 18
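The same arithmetic as a small helper, so it can be rerun for other cluster sizes. The function and parameter names are illustrative, and the defaults simply encode the assumptions stated above (1 core reserved per node, 5 cores per executor).

```python
def executors_for_cluster(nodes, cores_per_node,
                          reserved_cores=1, cores_per_executor=5):
    """Return (executors per node, total executors) for a cluster."""
    usable_cores = cores_per_node - reserved_cores            # 16 - 1 = 15
    executors_per_node = usable_cores // cores_per_executor   # 15 // 5 = 3
    return executors_per_node, executors_per_node * nodes


per_node, total = executors_for_cluster(nodes=6, cores_per_node=16)
print(per_node, total)
```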
➢ Role of Cluster Manager:-
• Launches executors on worker nodes on behalf of the driver
• Monitors the worker nodes
➢ SPARK Overview :-
• Apache Spark is a fast and general-purpose cluster computing system.
• It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
• It supports:
o Spark SQL - For SQL and structured data processing
o MLlib - For machine learning
o GraphX - For graph processing
o Spark Streaming - For streaming data
➢ Key features of SPARK:-
• Data Parallelism
• Fault Tolerance
The End
