
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spark Tutorial | Simplilearn

This presentation on Spark architecture gives an overview of what Apache Spark is, the essential features of Spark, and the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark architecture.

YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ

What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.

What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos

What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using Spark SQL
6. Gain a thorough understanding of Spark Streaming features
7. Master and describe the features of Spark ML programming and GraphX programming

Who should take this Apache Spark and Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark

Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spark Tutorial |Simplilearn

1. What's in it for you? 1. What is Spark? 2. Components of Spark: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, GraphX 3. Apache Spark Architecture 4. Running a Spark Application
2. What is Apache Spark? Apache Spark is a top-level open-source cluster computing framework used for real-time processing and analysis of large amounts of data
3. What is Apache Spark? (continued) Fast processing: Spark processes data faster since it saves time in reading and writing operations
4. What is Apache Spark? (continued) Real-time streaming: Spark allows real-time streaming and processing of data
5. What is Apache Spark? (continued) In-memory computation: Spark has a DAG execution engine that provides in-memory computation
6. What is Apache Spark? (continued) Fault tolerant: Spark is fault tolerant through RDDs, which are designed to handle the failure of any worker node in the cluster
7. Spark Components
8. Apache Spark Components: Spark Core
9. Apache Spark Components: Spark Core, Spark SQL
10. Apache Spark Components: Spark Core, Spark SQL, Spark Streaming
11. Apache Spark Components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib
12. Apache Spark Components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, GraphX
13. Spark Core: Spark Core is the core engine for large-scale parallel and distributed data processing
14. Spark Core: It performs the following: memory management and fault recovery; scheduling, distributing, and monitoring jobs on a cluster; and interacting with storage systems
15. Spark RDD: Resilient Distributed Datasets (RDDs) are the building blocks of any Spark application. Flow: Create RDD -> Transformations -> RDD -> Actions -> Results. Transformations are operations (such as map, filter, join, union) performed on an RDD that yield a new RDD containing the result. Actions are operations (such as reduce, first, count) that return a value after running a computation on an RDD
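
To make the transformation/action distinction on slide 15 concrete, here is a minimal Scala sketch (not part of the deck's demo; the data and variable names are illustrative, and a local Spark installation is assumed):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("RddBasics")
      .master("local[*]")   // illustrative: run locally on all cores
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from an in-memory collection
    val lines = sc.parallelize(Seq("spark is fast", "spark is fault tolerant"))

    // Transformations (lazy): each yields a new RDD, nothing runs yet
    val words   = lines.flatMap(line => line.split(" "))
    val noStops = words.filter(word => word != "is")

    // Actions (eager): trigger the computation and return values
    println(noStops.count())                         // 5
    println(noStops.first())                         // "spark"
    println(noStops.map(_.length).reduce(_ + _))     // total characters: 27

    spark.stop()
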
16. Spark SQL: Spark SQL is Apache Spark's module for working with structured data
17. Spark SQL features: Integrated: you can integrate Spark SQL with Spark programs and query structured data inside Spark programs
18. Spark SQL features (continued): High compatibility: you can run unmodified Hive queries on existing warehouses in Spark SQL; with existing Hive data, queries, and UDFs, Spark SQL offers full compatibility
19. Spark SQL features (continued): Scalability: Spark SQL leverages the RDD model, as it supports large jobs and mid-query fault tolerance; moreover, it uses the same engine for both interactive and long queries
20. Spark SQL features (continued): Standard connectivity: you can easily connect Spark SQL through JDBC or ODBC, both of which are industry norms for connectivity to business intelligence tools
21. Spark SQL architecture: Spark SQL and HQL -> DataFrame DSL -> DataFrame API -> Data Source API (CSV, JSON, JDBC)
22. Spark SQL has three main layers. Language API: Spark is compatible with, and even supported by, languages like Python, HiveQL, Scala, and Java. SchemaRDD: as Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table. Data Sources: data sources for Spark SQL can be JSON documents, Hive tables, or a Cassandra database
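
As a quick illustration of the "integrated" and "data sources" points above, here is a minimal hedged sketch in Scala; the table name, columns, and values are invented for the example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SparkSqlDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Structured data created inside a Spark program
    val people = Seq(("Mathew", 29), ("Justin", 31)).toDF("name", "age")

    // Register the DataFrame as a temporary table (the SchemaRDD layer)
    people.createOrReplaceTempView("people")

    // Query it with standard SQL, on the same engine used for batch jobs
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    // The Data Source API makes the same queries work over JSON, CSV, JDBC, ...
    // val fromJson = spark.read.json("people.json")   // hypothetical file
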
23. Spark SQL: Spark allows you to define custom SQL functions called User Defined Functions (UDFs). Below is a UDF that removes all the whitespace and lowercases all the characters in a string:

    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.StringType
    // createDF comes from the third-party spark-daria helper library
    import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

    def lowerRemoveAllWhiteSpaces(s: String): String = {
      s.toLowerCase().replaceAll("\\s", "")
    }

    val lowerRemoveAllWhiteSpacesUDF =
      udf[String, String](lowerRemoveAllWhiteSpaces)

    val sourceDF = spark.createDF(
      List(
        (" WELCOME "),
        (" SpaRk SqL ")
      ),
      List(
        ("text", StringType, true)
      )
    )

    sourceDF.select(
      lowerRemoveAllWhiteSpacesUDF(col("text")).as("clean_text")
    ).show()

Output:

    clean_text
    ----------
    welcome
    sparksql
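
Since createDF on slide 23 requires the third-party spark-daria library, here is a hedged sketch of the same UDF using only core Spark APIs (it assumes an existing SparkSession named spark):

    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._

    def lowerRemoveAllWhiteSpaces(s: String): String =
      s.toLowerCase.replaceAll("\\s", "")

    // Wrap the plain Scala function as a Spark SQL UDF
    val lowerRemoveAllWhiteSpacesUDF = udf(lowerRemoveAllWhiteSpaces _)

    // Build the source DataFrame from a plain Seq instead of createDF
    val sourceDF = Seq(" WELCOME ", " SpaRk SqL ").toDF("text")

    sourceDF.select(
      lowerRemoveAllWhiteSpacesUDF(col("text")).as("clean_text")
    ).show()
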
24. Spark Streaming: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
25. Spark Streaming (continued): Data can be ingested from many sources, and the processed data can be pushed out to different file systems
26. Spark Streaming (continued): (diagram) Streaming data sources and static data sources
27. Spark Streaming (continued): (diagram) Streaming data sources and static data sources feed into Spark Streaming
28. Spark Streaming (continued): (diagram) Streaming data sources and static data sources feed into Spark Streaming, which writes out to data storage
29. Spark Streaming (continued): Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. (diagram) Input data stream -> batches of input data -> streaming engine -> batches of processed data
30. Spark Streaming (continued): Here is an example of a basic RDD operation to extract individual words from lines of text in an input data stream. A flatMap operation turns the lines DStream (lines from time 0-1, 1-2, 2-3, 3-4) into the words DStream (words from time 0-1, 1-2, 2-3, 3-4)
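
The word-extraction DStream on slide 30 can be sketched in a few lines of Scala. This is a hedged example, assuming a plain-text source on localhost:9999 (for instance one started with nc -lk 9999):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("WordStream")
    val ssc  = new StreamingContext(conf, Seconds(1))   // 1-second batches

    // Lines DStream: one RDD of lines per batch interval
    val lines = ssc.socketTextStream("localhost", 9999)

    // flatMap operation: lines DStream -> words DStream
    val words = lines.flatMap(line => line.split(" "))
    words.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // wait until the job is stopped
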
31. Spark MLlib: MLlib is Spark's machine learning library. Its goal is to make practical machine learning scalable and easy
32. Spark MLlib (continued): At a high level, it provides the following. ML algorithms: classification, regression, clustering, and collaborative filtering
33. Spark MLlib (continued): Featurization: feature extraction, transformation, dimensionality reduction, and selection
34. Spark MLlib (continued): Pipelines: tools for constructing, evaluating, and tuning ML pipelines
35. Spark MLlib (continued): Utilities: linear algebra, statistics, data handling
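
A minimal sketch tying these pieces together: featurization (Tokenizer, HashingTF), an ML algorithm (LogisticRegression), and a Pipeline. The tiny training set is invented for the example:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val training = Seq(
      ("spark is great", 1.0),
      ("slow batch job", 0.0)
    ).toDF("text", "label")

    // Featurization: split text into words, then hash words into features
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    // Pipeline: chain featurization and the algorithm, then fit a model
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model    = pipeline.fit(training)

    model.transform(training).select("text", "prediction").show()
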
36. GraphX: GraphX is a component in Spark for graphs and graph-parallel computation. GraphX is used to model relations between objects: a graph has vertices (objects) and edges (relationships). Example: the vertices Mathew and Justin connected by an edge with the relationship "Friends"
37. GraphX (continued): Provides a uniform tool for ETL, exploratory data analysis, and interactive graph computations
38. GraphX (continued): The following are applications of GraphX: PageRank, fraud detection, geographic information systems, and disaster management
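
As a hedged illustration, the two-vertex friendship graph from slide 36 can be built and run through PageRank (one of the applications listed above) in a few lines of Scala:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Vertices (objects) and edges (relationships)
    val vertices = sc.parallelize(Seq((1L, "Mathew"), (2L, "Justin")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, "Friends")))

    val graph = Graph(vertices, edges)

    // Run PageRank until convergence within a tolerance of 0.001
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)
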
39. Spark Architecture
40. Spark Architecture: The Spark architecture is based on two important abstractions
41. Spark Architecture (continued): Resilient Distributed Dataset (RDD): RDDs are the fundamental units of data in Apache Spark; they are split into partitions and can be executed on different nodes of a cluster
42. Spark Architecture (continued): Directed Acyclic Graph (DAG): the DAG is the scheduling layer of the Spark architecture; it implements stage-oriented scheduling and eliminates the Hadoop MapReduce multi-stage execution model. (diagram) Stage 1: parallelize -> filter -> map; Stage 2: reduceByKey -> map
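
The stage split on slide 42 can be reproduced with a small hedged sketch (illustrative data; assumes a local SparkSession): narrow operations like filter and map are pipelined into one stage, while reduceByKey introduces a shuffle boundary that starts a new stage.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DagDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val result = sc.parallelize(1 to 100)        // Stage 1: parallelize
      .filter(n => n % 2 == 0)                   // Stage 1: filter (pipelined)
      .map(n => (n % 10, 1))                     // Stage 1: map (pipelined)
      .reduceByKey(_ + _)                        // shuffle boundary: Stage 2 begins
      .map { case (k, v) => s"key $k -> $v" }    // Stage 2: map

    result.collect().foreach(println)
    println(result.toDebugString)   // prints the lineage with its shuffle boundary
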
43. Spark Architecture: Apache Spark uses a master-slave architecture that consists of a driver, which runs on a master node, and multiple executors, which run across the worker nodes in the cluster. The master node has a driver program; the Spark code behaves as a driver program and creates a SparkContext, which is a gateway to all the Spark functionalities
44. Spark Architecture (continued): Spark applications run as independent sets of processes on a cluster. The driver program and SparkContext take care of the job execution within the cluster, via the cluster manager
45. Spark Architecture (continued): A job is split into multiple tasks that are distributed over the worker nodes. When an RDD is created in the SparkContext, it can be distributed across various nodes. Worker nodes are slaves that execute the different tasks
46. Spark Architecture (continued): The executor is responsible for the execution of these tasks. Worker nodes execute the tasks assigned by the cluster manager and return the results back to the SparkContext
47. Spark Architecture (continued): Each worker node holds an executor with a cache and tasks; worker nodes execute the tasks assigned by the cluster manager and return the results to the SparkContext
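
As a hedged illustration of how a driver attaches to a cluster manager, here is a minimal Scala sketch; the master URL, host name, and resource settings are illustrative values, not prescriptions from the deck:

    import org.apache.spark.sql.SparkSession

    // Hypothetical master URL and resource settings; a real deployment would
    // substitute its own cluster manager address (standalone, YARN, ...)
    val spark = SparkSession.builder()
      .appName("ClusterDemo")
      .master("spark://master-host:7077")        // standalone cluster manager
      .config("spark.executor.memory", "2g")     // memory per executor
      .config("spark.executor.cores", "2")       // cores per executor
      .getOrCreate()

    // The SparkContext inside the session is the driver's gateway to the
    // cluster: jobs go through it, get split into tasks, and run on executors
    println(spark.sparkContext.master)
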
48. Running a Spark Application
49. How does a Spark application run on a cluster? Spark applications run as independent processes, coordinated by the SparkSession object in the driver program
50. How does a Spark application run on a cluster? (continued) The resource manager or cluster manager assigns tasks to workers, one task per partition
51. How does a Spark application run on a cluster? (continued) A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations
52. How does a Spark application run on a cluster? (continued) Results are sent back to the driver application or can be saved to disk
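
Slide 52's two delivery paths look like this in a minimal Scala sketch (the data and the output path are invented for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("Results").master("local[*]").getOrCreate()

    val counts = spark.sparkContext
      .parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 1) Send results back to the driver program (fine for small results)
    counts.collect().foreach(println)

    // 2) Save directly to disk from the executors (illustrative path)
    counts.saveAsTextFile("/tmp/word-counts")
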
