SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Introduction to Apache Spark
Olalekan Fuad Elesin, Data Engineer
https://twitter.com/elesinOlalekan
https://github.com/OElesin
https://www.linkedin.com/in/elesinolalekan
00: Getting Started
Introduction
Necessary downloads and installations
Intro: Achievements
By the end of this session, you will be comfortable
with the following:
• open a Spark Shell
• explore data sets loaded from HDFS, etc.
• review Spark SQL, Spark Streaming,
• use the Spark Notebook
• developer community resources, etc.
• return to workplace and demo use of Spark!
Intro: Preliminaries
I believe we all have
basic Scala programming skills
01: Getting Started
Installations
hands-on: 5 mins (max)
Installation:
Step 1: Install JDK 7/8 on MacOs or Windows or Linux
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads
Step 2: Download Spark 2.0.1 from this URL http://spark.apache.org/downlo
(for session, please copy all installations from the USB disk or hard drive )
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell…
./bin/spark-shell
Let’s create some data from the “Scala” REPL prompt
val data = 1 to 100000
Step 4: Now, let’s create some RDD
val dataRDD = sc.parallelize(data)
then we filter out some
dataRDD.filter(_ < 35 ).collect()
Step 3: Run Spark Shell
We’ll run Spark’s interactive shell…
./bin/spark-shell
Let’s create some data from the “Scala” REPL prompt
val data = 1 to 100000
Step 4: Now, let’s create some RDD
val dataRDD = sc.parallelize(data)
then we filter out some
dataRDD.filter(_ < 35 ).collect()Check point 1
What was your result?
02: Why Spark
Why Spark
Talk time: 6 mins (max)
Why Spark
• Most machine learning algorithms are iterative
• A large number of computations on data are also iterative
• With Disk based approached in Hadoop MapReduce, each iteration is written to
disk. This makes process very slow
Input Data
on Disk
Tuples
(On Disk)
Tuples
(On Disk)
Tuples
(On Disk)
Output Data
on Disk
http://www.wiziq.com/blog/hype-around-apache-spark/
Input Data
on Disk
RDD1
(in memory)
RDD2
(in memory)
RDD3
(in memory)
Output Data
on Disk
Hadoop Execution Flow
Spark Execution Flow
03: About Apache Spark
About Apache Spark
Talk time: 4 mins (max)
About Apache Spark
• Initial started at UC Berkeley in 2009 as PhD thesis project by Matei Zarahia
• Fast and general purpose cluster computing system
• 10x (on disk) - 100x (In memory) faster than MapReduce
• Popular for running iterative machine learning algorithms, batch and streaming
computations on data, its SQL interface and data frames.
• Can also be used for Graph processing
• Provides high level API in:
• Scala
• Java
• Python
• R
• Seamless integration with Hadoop and its Ecosystem. Can also read data from a
number of existing data sources
• More info: http://spark.apache.org/
04: Spark Stack
Spark Built-in Libraries
Talk time: 4 mins (max)
Spark Stack
• Spark SQL lets you query
structured data inside Spark
programs, using either SQL or a
familiar DataFrame API.
• Spark Streaming lets you write
streaming jobs the same way you
write batch jobs.
• Spark MLlib & ML: Machine
learning algorithms
• Graphx unifies ETL, exploratory
analysis, and iterative graph
computation within a single system
05: Spark Execution Flow
Spark Execution Flow
Talk time: 4 mins (max)
Execution Flow
http://spark.apache.org/docs/latest/cluster-overview.html
Image Courtesy: Apache Spark Website
06: Terminology
BuzzWords !!!
Talk time: 6 mins (max)
Term Meaning
Application User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar A jar containing the user's Spark application. In some cases users will want to create an
"uber jar" containing their application along with its dependencies. The user's jar should
never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program The process running the main() function of the application and creating the SparkContext
Cluster manager An external service for acquiring resources on the cluster (e.g. standalone manager,
Mesos, YARN)
Deploy mode Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver outside
of the cluster.
Worker node Any node that can run application code in the cluster
Executor A process launched for an application on a worker node, that runs tasks and keeps data in
memory or disk storage across them. Each application has its own executors.
Task A unit of work that will be sent to one executor
Job A parallel computation consisting of multiple tasks that gets spawned in response to a
Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage Each job gets divided into smaller sets of tasks called stages that depend on each other
(similar to the map and reduce stages in MapReduce); you'll see this term used in the
driver's logs.
07: Resilient Distributed Dataset
RDD !!!
Talk time: 4 mins (max)
• Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark
• Immutable, Partitioned collection of elements that can be operated in parallel
• Basic Operations
– map
– filter
– persist
• Multiple Implementation
– PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join
– DoubleRDDFunctions : Operation related to double values
– SequenceFileRDDFunctions : Operation related to SequenceFiles
• RDD main characteristics:
– A list of partitions – A function for computing each split
– A list of dependencies on other RDDs
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned)
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for
an HDFS file)
• Custom RDD can be also implemented (by overriding functions)
Resilient Distributed Dataset
08: Cluster Deployment
Cluster Deployment
Talk time: 3 mins (max)
• Standalone Deploy Mode
– simplest way to deploy Spark on a private cluster
• Amazon EC2
– EC2 scripts are available
– Very quick launching a new cluster
• Apache Mesos
• Hadoop YARN
Cluster Deployment
09: Monitoring
Monitoring Spark with
WebUI
Talk time: 2 mins (max)
09: Hand-Ons Time
Hands-on with Spark-shell
and
Spark Notebook
Talk time: ~ mins (max)

Weitere ähnliche Inhalte

Was ist angesagt?

Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiSpark Summit
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Sigmoid
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSpark Summit
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Spark Summit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 

Was ist angesagt? (20)

Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
Lessons Learned from Dockerizing Spark Workloads: Spark Summit East talk by T...
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 

Ähnlich wie Introduction to Apache Spark :: Lagos Scala Meetup session 2

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 

Ähnlich wie Introduction to Apache Spark :: Lagos Scala Meetup session 2 (20)

Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Spark core
Spark coreSpark core
Spark core
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 

Mehr von Olalekan Fuad Elesin

How to go from Idea to First Customer
How to go from Idea to First CustomerHow to go from Idea to First Customer
How to go from Idea to First CustomerOlalekan Fuad Elesin
 
Platform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprisePlatform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterpriseOlalekan Fuad Elesin
 
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big DataOlalekan Fuad Elesin
 
Predictive Analytics for Non-programmers
Predictive Analytics for Non-programmersPredictive Analytics for Non-programmers
Predictive Analytics for Non-programmersOlalekan Fuad Elesin
 

Mehr von Olalekan Fuad Elesin (6)

How to go from Idea to First Customer
How to go from Idea to First CustomerHow to go from Idea to First Customer
How to go from Idea to First Customer
 
Platform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprisePlatform approach to scaling machine learning across the enterprise
Platform approach to scaling machine learning across the enterprise
 
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data�
Olalekan Elesin - Big Data MBA Certificate | Bill Schmarzo, Dean of Big Data
 
Graphs
GraphsGraphs
Graphs
 
Predictive Analytics for Non-programmers
Predictive Analytics for Non-programmersPredictive Analytics for Non-programmers
Predictive Analytics for Non-programmers
 
CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)CV Olalekan Elesin (Spanish)
CV Olalekan Elesin (Spanish)
 

Kürzlich hochgeladen

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 

Kürzlich hochgeladen (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

Introduction to Apache Spark :: Lagos Scala Meetup session 2

  • 1. Introduction to Apache Spark Olalekan Fuad Elesin, Data Engineer https://twitter.com/elesinOlalekan https://github.com/OElesin https://www.linkedin.com/in/elesinolalekan
  • 2. 00: Getting Started Introduction Necessary downloads and installations
  • 3. Intro: Achievements By the end of this session, you will be comfortable with the following: • open a Spark Shell • explore data sets loaded from HDFS, etc. • review Spark SQL, Spark Streaming, • use the Spark Notebook • developer community resources, etc. • return to workplace and demo use of Spark!
  • 4. Intro: Preliminaries I believe we all have basic Scala programming skills
  • 6. Installation: Step 1: Install JDK 7/8 on MacOs or Windows or Linux http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads Step 2: Download Spark 2.0.1 from this URL http://spark.apache.org/downlo (for session, please copy all installations from the USB disk or hard drive ) Step 3: Run Spark Shell We’ll run Spark’s interactive shell… ./bin/spark-shell Let’s create some data from the “Scala” REPL prompt val data = 1 to 100000 Step 4: Now, let’s create some RDD val dataRDD = sc.parallelize(data) then we filter out some dataRDD.filter(_ < 35 ).collect()
  • 7. Step 3: Run Spark Shell We’ll run Spark’s interactive shell… ./bin/spark-shell Let’s create some data from the “Scala” REPL prompt val data = 1 to 100000 Step 4: Now, let’s create some RDD val dataRDD = sc.parallelize(data) then we filter out some dataRDD.filter(_ < 35 ).collect()Check point 1 What was your result?
  • 8. 02: Why Spark Why Spark Talk time: 6 mins (max)
  • 9. Why Spark • Most machine learning algorithms are iterative • A large number of computations on data are also iterative • With Disk based approached in Hadoop MapReduce, each iteration is written to disk. This makes process very slow Input Data on Disk Tuples (On Disk) Tuples (On Disk) Tuples (On Disk) Output Data on Disk http://www.wiziq.com/blog/hype-around-apache-spark/ Input Data on Disk RDD1 (in memory) RDD2 (in memory) RDD3 (in memory) Output Data on Disk Hadoop Execution Flow Spark Execution Flow
  • 10. 03: About Apache Spark About Apache Spark Talk time: 4 mins (max)
  • 11. About Apache Spark • Initial started at UC Berkeley in 2009 as PhD thesis project by Matei Zarahia • Fast and general purpose cluster computing system • 10x (on disk) - 100x (In memory) faster than MapReduce • Popular for running iterative machine learning algorithms, batch and streaming computations on data, its SQL interface and data frames. • Can also be used for Graph processing • Provides high level API in: • Scala • Java • Python • R • Seamless integration with Hadoop and its Ecosystem. Can also read data from a number of existing data sources • More info: http://spark.apache.org/
  • 12. 04: Spark Stack Spark Built-in Libraries Talk time: 4 mins (max)
  • 13. Spark Stack • Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. • Spark Streaming lets you write streaming jobs the same way you write batch jobs. • Spark MLlib & ML: Machine learning algorithms • Graphx unifies ETL, exploratory analysis, and iterative graph computation within a single system
  • 14. 05: Spark Execution Flow Spark Execution Flow Talk time: 4 mins (max)
  • 17. Term Meaning Application User program built on Spark. Consists of a driver program and executors on the cluster. Application jar A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. Driver program The process running the main() function of the application and creating the SparkContext Cluster manager An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN) Deploy mode Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. Worker node Any node that can run application code in the cluster Executor A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. Task A unit of work that will be sent to one executor Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. Stage Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  • 18. 07: Resilient Distributed Dataset RDD !!! Talk time: 4 mins (max)
  • 19. • Resilient Distributed Dataset (RDD) is a basic Abstraction in Spark • Immutable, Partitioned collection of elements that can be operated in parallel • Basic Operations – map – filter – persist • Multiple Implementation – PairRDDFunctions : RDD of Key-Value Pairs, groupByKey, Join – DoubleRDDFunctions : Operation related to double values – SequenceFileRDDFunctions : Operation related to SequenceFiles • RDD main characteristics: – A list of partitions – A function for computing each split – A list of dependencies on other RDDs – Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash partitioned) – Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) • Custom RDD can be also implemented (by overriding functions) Resilient Distributed Dataset
  • 20. 08: Cluster Deployment Cluster Deployment Talk time: 3 mins (max)
  • 21. • Standalone Deploy Mode – simplest way to deploy Spark on a private cluster • Amazon EC2 – EC2 scripts are available – Very quick launching a new cluster • Apache Mesos • Hadoop YARN Cluster Deployment
  • 22. 09: Monitoring Monitoring Spark with WebUI Talk time: 2 mins (max)
  • 23. 09: Hand-Ons Time Hands-on with Spark-shell and Spark Notebook Talk time: ~ mins (max)