SPARK ARCHITECTURE
PRESENTED BY:
GAURAV BISWAS
BIT MESRA
SPARK COMPONENTS
• Spark Core is complemented by a set of powerful, higher-level libraries:
  - Spark SQL
  - MLlib (for machine learning)
  - GraphX
• All of them build on the RDD (Resilient Distributed Dataset), the core abstraction provided by Spark Core.
Spark SQL Introduction
• Part of the core distribution since Spark 1.0 (2014)
• Integrated with the rest of the Spark stack
• Supports querying data via either SQL or the Hive Query Language (HiveQL)
• Originated as a port of Apache Hive to run on top of Spark (in place of MapReduce)
• Can weave SQL queries together with programmatic transformations
• Can expose Spark datasets over a JDBC API, so that SQL-like queries can be run on Spark data from traditional BI and visualization tools
• Bindings in Python, Scala, and Java
SQL Execution Plans
• Logical and physical query plans
  - Both are trees representing query evaluation
  - Internal nodes are operators over the data
  - The logical plan is higher-level and algebraic
  - The physical plan is lower-level and operational
• Logical plan operators conceptually describe what operation needs to be performed
• Physical plan operators correspond to implemented access methods
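The split between the two kinds of plan can be illustrated with a toy sketch in plain Python. This is not Spark's planner: the operator classes (Scan, Filter, Project), the TABLES dictionary, and to_physical are hypothetical names made up for illustration. The logical tree only says what to compute; to_physical picks one concrete implementation (here, a Python iterator) for each node.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]

# Logical operators: describe WHAT to compute (hypothetical names).
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    child: object
    column: str
    value: object

@dataclass
class Project:
    child: object
    columns: List[str]

# Toy "storage" for this sketch: table name -> rows (an assumption).
TABLES: Dict[str, List[Row]] = {
    "people": [
        {"name": "Ann", "age": 31},
        {"name": "Bob", "age": 25},
        {"name": "Cy", "age": 31},
    ]
}

def to_physical(plan) -> Callable[[], Iterator[Row]]:
    """Map each logical node to one concrete implementation (an iterator)."""
    if isinstance(plan, Scan):
        # Access method: full scan of the in-memory table.
        return lambda: iter(TABLES[plan.table])
    if isinstance(plan, Filter):
        child = to_physical(plan.child)
        return lambda: (r for r in child() if r[plan.column] == plan.value)
    if isinstance(plan, Project):
        child = to_physical(plan.child)
        return lambda: ({c: r[c] for c in plan.columns} for r in child())
    raise TypeError(f"unknown logical operator: {plan!r}")

# Roughly: SELECT name FROM people WHERE age = 31
logical = Project(Filter(Scan("people"), "age", 31), ["name"])
physical = to_physical(logical)
result = list(physical())
print(result)
```

A real planner would choose among several physical alternatives per logical operator (for example, an index lookup instead of a full scan); the sketch hard-codes one choice per node to keep the tree-to-tree mapping visible.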
Key Features of MLlib
• A low-level machine learning library in Spark
• Built-in data analysis workflow
• Free performance gains (from improvements to the underlying Spark engine)
• Scalable
• Python, Scala, and Java APIs
• Broad coverage of applications and algorithms
• Rapid improvements in speed and robustness
• Easy to use
• Integrated workflow
MLlib
• MLlib is a machine learning library that provides algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on.
• Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares and k-means clustering (with more on the way).
• Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces with Spark MLlib.
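To make "linear regression on streaming data" concrete, here is a minimal plain-Python sketch, assuming a single feature and data arriving in small batches. It is not MLlib's implementation and does not use any Spark API; the class name StreamingOLS is hypothetical. It keeps only running sums, so each new batch updates the ordinary-least-squares fit without revisiting old data.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class StreamingOLS:
    """Fit y = intercept + slope * x from running sums (one feature only)."""
    n: int = 0
    sx: float = 0.0    # sum of x
    sy: float = 0.0    # sum of y
    sxx: float = 0.0   # sum of x * x
    sxy: float = 0.0   # sum of x * y

    def update(self, batch: Iterable[Tuple[float, float]]) -> None:
        """Fold one batch of (x, y) pairs into the running sums."""
        for x, y in batch:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self) -> Tuple[float, float]:
        """Return (intercept, slope) of the least-squares line seen so far."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

model = StreamingOLS()
model.update([(0.0, 1.0), (1.0, 3.0)])   # first batch
model.update([(2.0, 5.0), (3.0, 7.0)])   # second batch arrives later
intercept, slope = model.coefficients()
print(intercept, slope)  # the points lie on y = 1 + 2x
```

Because the state is just five sums, partial states computed on different machines could be added together, which is the property that lets this kind of algorithm scale out on a cluster.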
GraphX
• GraphX is Spark's library and API for graphs and graph-parallel computation.
• It is a network graph analytics engine.
• It is built on the Spark RDD API, so it can create directed graphs with arbitrary properties attached to each vertex and edge.
• GraphX also provides various operators and algorithms for manipulating graphs.
• Clustering, classification, traversal, searching, and pathfinding are all possible in GraphX.
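The idea of a directed graph with arbitrary properties on vertices and edges can be sketched in plain Python as two collections: one of vertices and one of edges. This is an illustrative toy, not the GraphX API; the function names triplets, subgraph, and in_degrees are only loosely modelled on graph operators of that kind, and the sample data is invented.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

class Edge(NamedTuple):
    src: int    # source vertex id
    dst: int    # destination vertex id
    attr: str   # arbitrary edge property

# Vertex collection: id -> arbitrary properties (sample data, made up).
vertices: Dict[int, dict] = {
    1: {"name": "alice", "role": "student"},
    2: {"name": "bob", "role": "professor"},
    3: {"name": "carol", "role": "postdoc"},
}
# Edge collection: directed, each edge carrying its own property.
edges: List[Edge] = [
    Edge(1, 2, "advisor"),
    Edge(3, 2, "colleague"),
    Edge(1, 3, "collaborator"),
]

def triplets(vs: Dict[int, dict], es: List[Edge]) -> List[Tuple[dict, str, dict]]:
    """Join each edge with the properties of its two endpoints."""
    return [(vs[e.src], e.attr, vs[e.dst]) for e in es]

def subgraph(vs: Dict[int, dict], es: List[Edge],
             keep_vertex: Callable[[dict], bool]):
    """Keep matching vertices and only the edges between kept vertices."""
    kept = {vid: props for vid, props in vs.items() if keep_vertex(props)}
    kept_edges = [e for e in es if e.src in kept and e.dst in kept]
    return kept, kept_edges

def in_degrees(es: List[Edge]) -> Dict[int, int]:
    """Count incoming edges per vertex."""
    degrees: Dict[int, int] = {}
    for e in es:
        degrees[e.dst] = degrees.get(e.dst, 0) + 1
    return degrees

staff, staff_edges = subgraph(vertices, edges, lambda p: p["role"] != "student")
print(triplets(vertices, edges)[0])
print(staff_edges)
print(in_degrees(edges))
```

Keeping vertices and edges as ordinary collections is what allows the same data to be treated either as a graph or as plain records.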
Spark GraphX Features
• Flexibility:
  - Works with both graphs and collections.
  - Unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system.
  - We can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms.
• Speed:
  - Provides performance comparable to the fastest specialized graph processing systems, while retaining Spark's flexibility, fault tolerance, and ease of use.
• Growing algorithm library:
  - We can choose from a growing library of graph algorithms.
  - Popular algorithms include PageRank, connected components, label propagation, strongly connected components, and triangle count.
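As an example of an iterative graph algorithm from the list above, the following plain-Python sketch computes connected components by repeatedly propagating the smallest vertex id along edges. It is a sequential toy under simplifying assumptions (an undirected view of the edges, everything in memory), not the distributed GraphX implementation; the function name and sample graph are made up.

```python
from typing import Dict, Iterable, Tuple

def connected_components(vertex_ids: Iterable[int],
                         edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertex_ids}   # each vertex starts as its own label
    edge_list = list(edges)
    changed = True
    while changed:                        # iterate until no label changes
        changed = False
        for a, b in edge_list:
            low = min(labels[a], labels[b])
            if labels[a] != low:
                labels[a] = low
                changed = True
            if labels[b] != low:
                labels[b] = low
                changed = True
    return labels

labels = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 4)])
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

Vertices 1, 2, and 3 end up sharing label 1, vertices 4 and 5 share label 4, and the isolated vertex 6 keeps its own id.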
Spark Core
• Home of the API that forms the backbone of Spark, i.e. RDDs
• The basic functionality of Spark lives in Spark Core:
  - memory management
  - fault recovery
  - interaction with storage systems
  - task scheduling and dispatching, and basic I/O functionality
Resilient Distributed Dataset (RDD)
• Spark introduces the concept of an RDD: an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
• An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.
RDD Operations
• RDDs support two types of operations:
• Transformations (such as map, filter, join, and union) are performed on an RDD and yield a new RDD containing the result; that is, they create a new dataset from an existing one.
• Actions (such as reduce, count, first, collect, and save) trigger the computation and return a value to the driver program, or write the result to a file, after running the computation on the dataset.
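The difference between the two kinds of operation can be sketched with a toy class in plain Python. ToyRDD is a hypothetical, single-machine stand-in and not Spark's RDD class: transformations only wrap a recipe and return a new ToyRDD, while actions actually run the recipe and return a value.

```python
from functools import reduce as fold
from typing import Callable, Iterable, Iterator, List

class ToyRDD:
    """Single-machine stand-in: holds a recipe, not the computed data."""

    def __init__(self, compute: Callable[[], Iterator]):
        self._compute = compute

    @staticmethod
    def from_collection(data: Iterable) -> "ToyRDD":
        items = list(data)
        return ToyRDD(lambda: iter(items))

    # Transformations: return a NEW ToyRDD, nothing is computed yet.
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def union(self, other: "ToyRDD") -> "ToyRDD":
        def compute():
            yield from self._compute()
            yield from other._compute()
        return ToyRDD(compute)

    # Actions: run the recipe and return a plain value.
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())

    def first(self):
        return next(self._compute())

    def reduce(self, f: Callable):
        return fold(f, self._compute())

calls = []  # records every time the mapped function really runs

def square(x):
    calls.append(x)
    return x * x

numbers = ToyRDD.from_collection(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)                   # still 0: nothing ran yet
result = evens_squared.collect()                   # [4, 16]
total = evens_squared.reduce(lambda a, b: a + b)   # 20, recomputed from scratch
print(calls_before_action, result, total, len(calls))
```

Note that the second action recomputes the whole chain, since nothing is cached in this toy; `len(calls)` is 10 after both actions.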
Properties of RDDs
• Immutability
• Cacheable: an RDD can be persisted, and rebuilt from its lineage if lost
• Lazy evaluation (defining a transformation is different from executing it)
• Type inferred
• Two ways to create RDDs:
  - parallelizing an existing collection in your driver program
  - referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, Cassandra, or any data source offering a Hadoop InputFormat
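Two of these properties, parallelizing a collection and rebuilding lost data from lineage, can be sketched in plain Python. This is a toy under stated assumptions and not Spark code: the slicing rule in parallelize, the lineage list of functions, and the dictionary used as a cache are all illustrative choices.

```python
from typing import Callable, List, Sequence

def parallelize(data: Sequence, num_slices: int) -> List[list]:
    """Split a driver-side collection into roughly equal partitions."""
    if num_slices < 1:
        raise ValueError("num_slices must be positive")
    n = len(data)
    return [list(data[(i * n) // num_slices:((i + 1) * n) // num_slices])
            for i in range(num_slices)]

def compute_partition(source: list, lineage: List[Callable]) -> list:
    """Replay the recorded transformations on one source partition."""
    out = source
    for f in lineage:
        out = [f(x) for x in out]   # builds a new list; the source is untouched
    return out

partitions = parallelize(list(range(10)), 3)
print(partitions)   # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Lineage: the ordered transformations that define the derived dataset.
lineage = [lambda x: x + 1, lambda x: x * 2]

# "Persist": keep the computed partitions around for reuse.
cache = {i: compute_partition(p, lineage) for i, p in enumerate(partitions)}

# Simulate losing one cached partition, then rebuild only that partition.
del cache[1]
cache[1] = compute_partition(partitions[1], lineage)
print(cache[1])     # [8, 10, 12]
```

Because the source partitions are never modified, replaying the lineage on partition 1 alone is enough to recover it; the other partitions are not recomputed.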
Spark Streaming
• Spark Streaming is the Spark component used to process real-time streaming data.
• It enables high-throughput, fault-tolerant processing of live data streams.
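One common way to process a live stream is to cut it into small time-based batches and run an ordinary batch computation on each one. The plain-Python sketch below illustrates that idea with a per-batch word count. It is a toy, not the Spark Streaming API; the function names, the one-second interval, and the sample events are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> Dict[int, List[str]]:
    """Group (timestamp, line) events into batches of `interval` seconds."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // interval)].append(line)
    return dict(sorted(batches.items()))

def word_counts(lines: Iterable[str]) -> Counter:
    """The batch job applied to every micro-batch."""
    return Counter(word for line in lines for word in line.split())

# Sample events: (arrival time in seconds, text). Invented for this sketch.
events = [
    (0.2, "spark streaming"),
    (0.9, "spark core"),
    (1.4, "spark sql"),
    (2.1, "graphx"),
]

per_batch = {batch: word_counts(lines)
             for batch, lines in micro_batches(events, 1.0).items()}
for batch, counts in per_batch.items():
    print(batch, dict(counts))
```

Each batch is processed independently here, so a failed batch could simply be recomputed from its input events.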
END!
