Content
▪ Introduction
▪ What is Apache Spark?
▪ Apache Spark Features
▪ Components of Apache Spark Ecosystem
▪ Apache Spark Languages
▪ Apache Spark History
▪ Why You Should Learn Apache Spark
▪ Do We Need Hadoop to Run Spark?
▪ Apache Spark Installation
▪ Apache Spark Example
▪ Apache Spark Use Cases
▪ Apache Spark Books
▪ Apache Spark Certifications
▪ Apache Spark Training
▪ Final Words
Introduction
For big data analysis, the industry makes extensive use of Apache
Spark. Hadoop provides a flexible, scalable, cost-effective, and
fault-tolerant computing solution, but the main concern is maintaining
speed while processing big data. The industry needs a powerful engine
that can respond within seconds, process data in memory, and handle
stream processing as well as batch processing. This need is what
brought Apache Spark into existence!
This comprehensive guide will help you learn Apache Spark. Starting
from the introduction, I’ll walk you through the essentials of Apache
Spark. Sounds good? Let’s dive right in.
What is Apache Spark?
Spark is an Apache project, popularly described as “lightning-fast
cluster computing”. It is an open-source framework for processing
large datasets and one of the most active Apache projects. Spark is
written in Scala and provides APIs in Scala, Java, Python, and R.
The most important feature of Apache Spark is its in-memory cluster
computing, which increases the speed of data processing. Spark
provides a more general and faster data processing platform than
Hadoop MapReduce: it can run programs up to 100 times faster in
memory and up to 10 times faster on disk.
Apache Spark Features
▪ Multiple Language Support
Apache Spark supports multiple languages: it provides APIs in Scala,
Java, Python, and R, so users can write applications in the language
they prefer.
▪ Speed
The most important feature of Apache Spark is its processing speed. An
application on a Hadoop cluster can run up to 100 times faster in
memory and up to 10 times faster on disk than with Hadoop MapReduce.
▪ Runs Everywhere
Spark can run on multiple platforms without affecting the processing
speed. It can run on Hadoop, Kubernetes, Mesos, in standalone mode, and
even in the cloud.
▪ General Purpose
Spark is powered by a stack of libraries: SQL and DataFrames, MLlib for
machine learning, Spark Streaming, and GraphX. You can combine these
libraries seamlessly in a single application. This ability to combine
streaming, SQL, and complex analytics in the same application makes
Spark a general-purpose framework.
▪ Advanced Analytics
Apache Spark supports the ‘Map’ and ‘Reduce’ operations. Beyond
MapReduce, it also supports streaming data, SQL queries, graph
algorithms, and machine learning. Thus, Apache Spark is a great means
of performing advanced analytics.
Apache Spark Components
The Apache Spark ecosystem comprises several components that are responsible for the
functioning of Apache Spark. The following 5 components constitute the Apache Spark
ecosystem.
▪ Spark Core
Spark Core is the main execution engine of the Spark platform. All the functionality of
Apache Spark depends on Spark Core, including memory management, task scheduling,
fault recovery, and more. It enables in-memory processing and defines the API for the
RDD (Resilient Distributed Dataset), the main programming abstraction of Spark.
▪ Spark SQL and DataFrames
Spark SQL is the Spark component for working with structured data. It comes with a
programming abstraction known as DataFrames. Spark SQL lets developers combine SQL
queries with the programmatic data manipulations supported by RDDs in different
languages.
▪ Spark Streaming
This Spark component handles the processing of live data streams, such as log files
created by production web servers. It provides an API for manipulating data streams,
which makes the project easy to learn. It is designed to provide the same throughput,
scalability, and fault tolerance as Spark Core.
▪ MLlib
MLlib is Spark’s built-in machine learning library. It provides various ML algorithms,
such as clustering, classification, regression, and collaborative filtering, along with
supporting functionality. MLlib also contains many low-level machine learning
primitives.
▪ GraphX
GraphX is the library that enables graph computations. It provides an API for graph
computation that lets users create a directed graph with arbitrary properties attached
to each vertex and edge.
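To make the GraphX description more concrete, here is a minimal sketch in plain Python (not the GraphX API itself) of a directed graph whose vertices and edges carry arbitrary properties. The names Vertex, Edge, PropertyGraph, and out_degrees, as well as the sample data, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    vid: int
    props: Dict[str, Any] = field(default_factory=dict)  # arbitrary vertex properties


@dataclass
class Edge:
    src: int  # id of the source vertex
    dst: int  # id of the destination vertex (edges are directed)
    props: Dict[str, Any] = field(default_factory=dict)  # arbitrary edge properties


@dataclass
class PropertyGraph:
    vertices: List[Vertex]
    edges: List[Edge]

    def out_degrees(self) -> Dict[int, int]:
        """Count outgoing edges per vertex id; vertices with none map to 0."""
        counts = Counter(edge.src for edge in self.edges)
        return {vertex.vid: counts.get(vertex.vid, 0) for vertex in self.vertices}


# Hypothetical sample graph, for illustration only.
graph = PropertyGraph(
    vertices=[
        Vertex(1, {"name": "alice"}),
        Vertex(2, {"name": "bob"}),
        Vertex(3, {"name": "carol"}),
    ],
    edges=[
        Edge(1, 2, {"relation": "follows"}),
        Edge(1, 3, {"relation": "follows"}),
        Edge(2, 3, {"relation": "likes"}),
    ],
)
print(graph.out_degrees())
```

In Spark, the vertex and edge collections would be distributed across the cluster rather than held in local lists; this sketch only shows the shape of the data.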
Apache Spark Languages
Apache Spark is written in Scala, so Scala is the native language
for interacting with Spark Core. Spark provides APIs in the
following languages:
▪ Scala
▪ Java
▪ Python
▪ R
As the Spark framework is built on Scala, Scala can offer some
advantages over the other Apache Spark languages. Using Scala
with Apache Spark gives you access to the latest features.
According to a Spark survey on Apache Spark languages, 71% of
Spark developers use Scala, 58% use Python, 31% use Java, and
18% use R. The figures total more than 100%, presumably because
respondents could name more than one language.
Apache Spark History
No introduction to Apache Spark is complete without mentioning its
history. In brief, Spark was first introduced in 2009 at the UC Berkeley
R&D Lab, now the AMPLab, by Matei Zaharia. Spark was then open-sourced
under the BSD license in 2010.
In 2013, the Spark project was donated to the Apache Software Foundation
and its license changed from BSD to Apache 2.0. In 2014, Spark became a
top-level project of the Apache Foundation, known as Apache Spark.
In 2015, with the effort of over 1000 contributors, Apache Spark became
one of the most active Apache projects as well as one of the most active
open-source big data projects. At the time of writing, the latest version
is Apache Spark 2.3.0, released on Feb 28th, 2018.
Why You Should Learn Apache Spark
As businesses generate more and more big data, analyzing that data for
business insights has become very important. Spark is a revolutionary
framework in the big data processing landscape. Enterprises are adopting
Spark extensively, which in turn is increasing demand for Apache Spark
developers.
According to the O'Reilly Data Science Salary Survey, developers' salaries
are related to their Apache Spark skills. Scala and Apache Spark skills
can give a good boost to your existing salary, and Apache Spark developers
are reported to be among the highest-paid programmers. With the increasing
demand for Apache Spark developers and their salary level, it is the right
time for development professionals to learn Apache Spark and thus help
enterprises analyze their data.
Here are the top 5 reasons to learn Apache Spark and boost your
development career:
▪ To get more access to big data
▪ To grow with the growing adoption of Apache Spark
▪ To benefit from existing big data investments
▪ To meet the demand for Spark developers
▪ To make big money
Do You Need Hadoop to Run Spark?
Spark and Hadoop are two of the most popular big data processing
frameworks. Being faster than MapReduce, Apache Spark has an edge over
Hadoop in terms of speed. Spark can also process different kinds of
data, including real-time data, whereas Hadoop MapReduce is designed
for batch processing.
Although Hadoop and Spark don’t do the same thing, they can still work
together: Spark provides faster, real-time processing of data stored in
Hadoop. To achieve maximum benefit, you can run Spark in distributed
mode using HDFS.
So, we do not always need Hadoop to run Spark. But if you want to run
Spark with Hadoop, HDFS is the main requirement for running Spark in
distributed mode.
Apache Spark Installation
Installing Apache Spark is not a single-step process; it involves
a series of steps. Note that Java and Scala are prerequisites for
installing Spark. Let’s walk through the 7-step Apache Spark
installation process.
Step 1: Verify if Java is Installed
Step 2: Verify if Scala is Installed
Step 3: Download Scala
Step 4: Install Scala
Step 5: Download Spark
Step 6: Install Spark
Step 7: Verify Spark Installation
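The verification steps (1, 2, and 7) can be sketched as a small Python script that checks whether the relevant executables are on the PATH. This is only an illustrative sketch, not an official installation check: the helper name find_tools is hypothetical, and it assumes the executables are called java, scala, and spark-shell and that Spark’s bin directory has been added to the PATH.

```python
import shutil
from typing import Dict, Iterable, Optional

# Assumed executable names; adjust them to match your installation.
TOOLS = ("java", "scala", "spark-shell")


def find_tools(names: Iterable[str] = TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path if path else 'not found on PATH'}")
```

A tool reported as not found either is not installed yet or lives in a directory that is not on the PATH.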
Spark Example: Word Count Application
Let’s understand Spark with an example: how to run a word count
application. The word count application counts the number of occurrences
of each word in a document. Consider an input text that has been saved
as input.txt in the home directory.
The procedure to execute the word count application is as follows:
Step 1: Open Spark shell
Step 2: Create RDD
Step 3: Execute word count logic
Step 4: Apply action
Step 5: Check output
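The word count logic in Step 3 can be sketched in plain Python. This is not Spark code: it only mirrors, on local lists, the flatMap, map, and reduceByKey operations that an RDD-based word count typically chains together. The function name word_count and the sample lines are hypothetical, since the original input.txt is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # Like flatMap: split every line into words and flatten into one list.
    words: List[str] = [word for line in lines for word in line.split()]
    # Like map: pair each word with an initial count of 1.
    pairs: List[Tuple[str, int]] = [(word, 1) for word in words]
    # Like reduceByKey: sum the counts of identical words.
    counts: Dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample text standing in for the contents of input.txt.
sample_lines = ["spark is fast", "spark is general"]
print(word_count(sample_lines))
```

In Spark itself, the lines would come from an RDD created from input.txt (Step 2), and the counts would only be computed once an action is applied (Step 4).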
Apache Spark Use Cases
Now that we have gone through the Apache Spark introduction and
installation, it’s time for an overview of Apache Spark use cases, which
explain where Apache Spark can be used. Before reading them, let’s
understand why companies should use Apache Spark. Businesses should
adopt, or have already adopted, Apache Spark for its
▪ Ease of use
▪ High-performance gains
▪ Advanced analytics
▪ Real-time data streaming
▪ Ease of deployment
Apache Spark use cases help businesses understand the types of
challenges and problems where Apache Spark can be used
effectively. Let’s have a quick sampling of the top Apache Spark
use cases in different industries!
▪ E-Commerce Industry
▪ Healthcare Industry
▪ Travel Industry
▪ Game Industry
▪ Security Industry
Apache Spark Books
Here is the list of top 10 Apache Spark books:
▪ Learning Spark: Lightning-Fast Big Data Analysis
▪ High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark
▪ Mastering Apache Spark
▪ Apache Spark in 24 Hours, Sams Teach Yourself
▪ Spark Cookbook
▪ Apache Spark Graph Processing
▪ Advanced Analytics with Spark: Patterns for Learning from Data at Scale
▪ Spark: The Definitive Guide – Big Data Processing Made Simple
▪ Spark GraphX in Action
▪ Big Data Analytics with Spark
Apache Spark Certifications
With the increasing popularity of Apache Spark in the big data industry, the
demand for Apache Spark developers is also increasing. Companies, however, are
looking for candidates with validated Apache Spark skills, i.e. professionals with
an Apache Spark certification.
Apache Spark certifications help you start a big data career by validating
your Apache Spark skills and expertise. Getting an Apache Spark certification
makes you stand out from the crowd by demonstrating your skills to employers and
peers. Here is the list of the top 5 Apache Spark certifications:
▪ HDP Certified Apache Spark Developer
▪ O’Reilly Developer Certification for Apache Spark
▪ Cloudera Spark and Hadoop Developer
▪ Databricks Certification for Apache Spark
▪ MapR Certified Spark Developer
Apache Spark Training
As the demand for Apache Spark developers is on the rise in the
industry, it becomes important to enhance your Apache Spark skills.
Good Apache Spark training helps big data professionals get hands-on
experience that meets industry standards. Nowadays, enterprises are
looking for Hadoop developers who are skilled in implementing
Apache Spark best practices.
Whizlabs Apache Spark Training helps you learn Apache Spark and
prepares you for the HDPCD Certification exam. This online Apache
Spark training helps you get familiar with deploying Apache Spark
to develop complex and sophisticated solutions for enterprises.
Whizlabs online training for Apache Spark certification is one
of the best Apache Spark training courses in the industry. The
Whizlabs Hortonworks Apache Spark Developer Certification Online
Training helps you to
▪ validate your Apache Spark expertise
▪ demonstrate your Apache Spark skills
▪ stay up to date with the latest releases
▪ get your queries answered by industry experts
▪ get accredited as a certified Spark developer
▪ earn a higher salary
Final Words
In this presentation, we have covered a comprehensive guide to Apache
Spark. It is a must-read guide for those who want to learn Apache Spark
and also for those who want to extend their Apache Spark skills. Whether
you want to learn about Apache Spark components or find the best Apache
Spark certifications, you can find it here!
This guide is a one-stop destination for answers to your questions about
Apache Spark. Apache Spark has the power to simplify challenging
processing tasks on different types of large datasets. It performs
complex analytics by integrating graph algorithms and machine learning.
Spark has brought big data processing to everyone. Just check it out!
Thank You!

Learn Apache Spark: A Comprehensive Guide

  • 2. Content ▪ Introduction ▪ What is Apache Spark? ▪ Apache Spark Features ▪ Components of Apache Spark Ecosystem ▪ Apache Spark Languages ▪ Apache Spark History ▪ Why You Should Learn Apache Spark ▪ Do We Need Hadoop to Run Spark?
  • 3. Content ▪ Apache Spark Installation ▪ Apache Spark Example ▪ Apache Spark Use Cases ▪ Apache Spark Books ▪ Apache Spark Certifications ▪ Apache Spark Training ▪ Final Words
  • 4. Introduction The industry uses Apache Spark extensively for big data analysis. Hadoop provides a flexible, scalable, cost-effective, and fault-tolerant computing solution, but the main concern is maintaining speed while processing big data. The industry needs a powerful engine that can respond in under a second, perform in-memory processing, and handle both stream processing and batch processing of data. This need is what brought Apache Spark into existence! This is a comprehensive guide that will help you learn Apache Spark. Starting from the introduction, I’ll show you everything you want to know about Apache Spark. Sounds good? Let’s dive right in.
  • 5. What is Apache Spark? Spark is an Apache project, popularly described as “lightning-fast cluster computing”. It is an open-source framework for processing large datasets and is one of the most active Apache projects at present. Spark is written in Scala and provides APIs in Python, Scala, Java, and R. The most important feature of Apache Spark is its in-memory cluster computing, which increases the speed of data processing. Spark is known to provide a more general and faster data processing platform. It helps you run programs faster than Hadoop: up to 100 times faster in memory and up to 10 times faster on disk.
  • 6. Apache Spark Features ▪ Multiple Language Support Apache Spark supports multiple languages; it provides APIs in Scala, Java, Python, and R, so users can write applications in different languages. ▪ Fast Speed The most important feature of Apache Spark is its processing speed. It allows an application to run on a Hadoop cluster up to 100 times faster in memory and up to 10 times faster on disk. ▪ Runs Everywhere Spark can run on multiple platforms without affecting the processing speed. It can run on Hadoop, Kubernetes, Mesos, in standalone mode, and in the cloud.
  • 8. Apache Spark Features ▪ General Purpose Spark is powered by a range of libraries: MLlib for machine learning, DataFrames, and SQL, along with Spark Streaming and GraphX. These libraries can be combined coherently in a single application. The ability to combine streaming, SQL, and complex analytics in the same application makes Spark a general-purpose framework. ▪ Advanced Analytics Apache Spark supports the ‘Map’ and ‘Reduce’ operations. Along with MapReduce, it also supports streaming data, SQL queries, graph algorithms, and machine learning. Thus, Apache Spark is a great means of performing advanced analytics.
  • 9. Apache Spark Components The Apache Spark ecosystem comprises several components that are responsible for the functioning of Apache Spark. There are 5 components that constitute the Apache Spark ecosystem. ▪ Spark Core The main execution engine of the Spark platform is known as Spark Core. All of the functionality of Apache Spark depends on Spark Core, including memory management, task scheduling, fault recovery, and more. It enables in-memory processing and defines the API for the RDD (Resilient Distributed Dataset), which is the programming abstraction of Spark. ▪ Spark SQL and DataFrames Spark SQL is the main Spark component for working with structured data and supports structured data processing. Spark SQL comes with a programming abstraction known as DataFrames. It enables developers to combine SQL queries with the programmatic data manipulations supported by RDDs in different languages.
  • 11. Apache Spark Components ▪ Spark Streaming This Spark component is responsible for processing live streams of data, such as log files created by production web servers. It provides an API for manipulating data streams that is easy to learn for anyone familiar with the rest of the Spark project. This component also offers the same throughput, scalability, and fault tolerance as Spark Core. ▪ MLlib MLlib is Spark's built-in machine learning library. It provides various ML algorithms such as clustering, classification, regression, and collaborative filtering, along with supporting functionality. MLlib also contains many low-level machine learning primitives. ▪ GraphX GraphX is the library that enables graph computations. GraphX also provides an API for graph computation that lets users create a directed graph with arbitrary properties attached to each vertex and edge.
  • 12. Apache Spark Languages Apache Spark is written in Scala, so Scala is the native language used to interact with Spark Core. In addition, Apache Spark provides APIs in the following languages: ▪ Scala ▪ Java ▪ Python ▪ R Because the Spark framework is built on Scala, Scala offers some advantages over the other Apache Spark languages; using Scala with Apache Spark gives you access to the latest features. According to a Spark Survey on Apache Spark Languages, 71% of Spark developers use Scala, 58% use Python, 31% use Java, and 18% use R.
  • 14. Apache Spark History An introduction to Apache Spark would not be complete without mentioning its history. In brief: Spark was first introduced in 2009 at the UC Berkeley R&D Lab, now AMPLab, by M. Zaharia. Spark was then open-sourced under a BSD license in 2010. In 2013, the Spark project was donated to the Apache Software Foundation and the license changed from BSD to Apache 2.0. In 2014, Spark became a top-level project of the Apache Software Foundation, known as Apache Spark. In 2015, with the effort of over 1000 contributors, Apache Spark became one of the most active Apache projects as well as one of the most active open-source big data projects. At the time of writing, the latest version is Apache Spark 2.3.0, released on Feb 28th, 2018.
  • 16. Why You Should Learn Apache Spark As businesses generate more big data, it has become very important to analyze that data to gain business insights. Spark is a revolutionary framework in the big data processing landscape. Enterprises are adopting Spark extensively, which in turn is increasing the demand for Apache Spark developers. According to the O'Reilly Data Science Salary Survey, developers' salaries are related to their Apache Spark skills; Scala and Apache Spark skills give a good boost to your existing salary, and Apache Spark developers are among the highest-paid programmers. With the increasing demand for Apache Spark developers and their salary level, it is the right time for development professionals to learn Apache Spark and help enterprises analyze their data.
  • 17. Why You Should Learn Apache Spark Here are the top 5 reasons you should learn Apache Spark to boost your development career. ▪ To get more access to Big Data ▪ To grow with the growing Apache Spark adoption ▪ To benefit from existing big data investments ▪ To meet the demand for Spark developers ▪ To make big money
  • 18. Do You Need Hadoop to Run Spark? Spark and Hadoop are the most popular big data processing frameworks. Being faster than MapReduce, Apache Spark has an edge over Hadoop in terms of speed. Also, Spark can process different kinds of data, including real-time data, whereas Hadoop MapReduce is limited to batch processing. Although Hadoop and Spark don’t do the same thing, they can still work together: Spark provides faster, real-time processing of data stored in Hadoop. To achieve maximum benefit, one can run Spark in distributed mode using HDFS. So, we do not always need Hadoop to run Spark. But if you want to run Spark with Hadoop, HDFS is the main requirement for running Spark in distributed mode.
  • 19. Apache Spark Installation Installing Apache Spark is not a single-step process; it requires a series of steps. Note that Java and Scala are prerequisites for installing Spark. Let’s walk through the 7-step Apache Spark installation process. Step 1: Verify that Java is installed Step 2: Verify that Scala is installed Step 3: Download Scala Step 4: Install Scala Step 5: Download Spark Step 6: Install Spark Step 7: Verify the Spark installation
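
The verification steps (Steps 1, 2, and 7) can be sketched in a few lines of Python. This is a minimal sketch, not an official installer check: it assumes the default command names `java`, `scala`, and `spark-shell`, and it only tests whether each command is on the PATH, not which version is installed. The function name `check_prerequisites` is hypothetical.

```python
import shutil

# Command names are an assumption: the defaults provided by typical
# Java, Scala, and Spark installations.
REQUIRED_COMMANDS = ["java", "scala", "spark-shell"]


def check_prerequisites(commands=REQUIRED_COMMANDS):
    """Return a dict mapping each command name to True if it is on the PATH."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}


if __name__ == "__main__":
    # Print one line per command so missing prerequisites are easy to spot.
    for cmd, found in check_prerequisites().items():
        print(f"{cmd}: {'found' if found else 'missing'}")
```

A command reported as missing points to the download and install steps for that tool (Steps 3 to 6).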
  • 20. Spark Example: Word Count Application Let’s understand Spark with an example: how to run a word count application. The word count application counts the number of occurrences of each word in a document. Consider an input text that has been saved as input.txt in the home directory. The procedure to execute the word count application is as follows: Step 1: Open the Spark shell Step 2: Create an RDD Step 3: Execute the word count logic Step 4: Apply an action Step 5: Check the output
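
The word count logic in Step 3 can be illustrated without a cluster. The sketch below is plain Python, not the Spark API: each stage is written to mirror the RDD operations commonly used for this task (`flatMap`, `map`, and `reduceByKey`). The function name `word_count` and the sample lines, which stand in for the contents of input.txt, are assumptions made for illustration.

```python
from collections import defaultdict


def word_count(lines):
    """Count word occurrences, mirroring the flatMap -> map -> reduceByKey pattern."""
    # Stage 1 (like flatMap): split every line into words and flatten the result.
    words = [word for line in lines for word in line.split()]
    # Stage 2 (like map): pair each word with an initial count of 1.
    pairs = [(word, 1) for word in words]
    # Stage 3 (like reduceByKey): sum the counts of identical words.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical stand-in for the contents of input.txt.
sample_lines = ["spark is fast", "spark is general"]
counts = word_count(sample_lines)
print(counts)
```

In Spark itself, these stages run in parallel across the partitions of the RDD, and nothing is computed until an action (Step 4) is applied.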
  • 21. Apache Spark Use Cases After the Apache Spark introduction and installation, it’s time for an overview of Apache Spark use cases. What do these use cases signify? They explain where Apache Spark can be used. Before reading the use cases, let’s understand why companies should use Apache Spark. Businesses have adopted Apache Spark due to its ▪ Ease of use ▪ High-performance gains ▪ Advanced analytics ▪ Real-time data streaming ▪ Ease of deployment
  • 23. Apache Spark Use Cases Use cases help businesses understand the types of challenges and problems where Apache Spark can be used effectively. Let’s take a quick sampling of top Apache Spark use cases in different industries! ▪ E-Commerce Industry ▪ Healthcare Industry ▪ Travel Industry ▪ Game Industry ▪ Security Industry
  • 24. Apache Spark Books Here is the list of top 10 Apache Spark books: ▪ Learning Spark: Lightning-Fast Big Data Analysis ▪ High-Performance Spark: Best Practices for Scaling and Optimizing Spark ▪ Mastering Apache Spark ▪ Apache Spark in 24 Hours, Sams Teach Yourself ▪ Spark Cookbook ▪ Apache Spark Graph Processing ▪ Advanced Analytics with Spark: Patterns for Learning from Data at Scale ▪ Spark: The Definitive Guide – Big Data Processing Made Simple ▪ Spark GraphX in Action ▪ Big Data Analytics with Spark
  • 25. Apache Spark Certifications With the increasing popularity of Apache Spark in the big data industry, the demand for Apache Spark developers is also increasing. But companies are looking for candidates with validated Apache Spark skills, i.e. professionals with an Apache Spark certification. Apache Spark certifications help you start a big data career by validating your Apache Spark skills and expertise. Getting an Apache Spark certification will make you stand out from the crowd by demonstrating your skills to employers and peers. Here is the list of top 5 Apache Spark certifications: ▪ HDP Certified Apache Spark Developer ▪ O’Reilly Developer Certification for Apache Spark ▪ Cloudera Spark and Hadoop Developer ▪ Databricks Certification for Apache Spark ▪ MapR Certified Spark Developer
  • 26. Apache Spark Training As the demand for Apache Spark developers is on the rise in the industry, it becomes important to enhance your Apache Spark skills. Good Apache Spark training helps big data professionals get hands-on experience that meets industry standards. Nowadays, enterprises are looking for Hadoop developers who are skilled in implementing Apache Spark best practices. Whizlabs Apache Spark Training helps you learn Apache Spark and prepares you for the HDPCD Certification exam. This Apache Spark online training helps you get familiar with deploying Apache Spark to develop complex and sophisticated solutions for enterprises.
  • 27. Apache Spark Training Whizlabs online training for Apache Spark certification is among the best Apache Spark training in the industry. Whizlabs Hortonworks Apache Spark Developer Certification Online Training helps you to ▪ validate your Apache Spark expertise ▪ demonstrate your Apache Spark skills ▪ stay updated with the latest releases ▪ get your queries answered by industry experts ▪ get accredited as a certified Spark developer ▪ earn more through a raise in your salary
  • 28. Final Words In this presentation, we have covered a comprehensive guide to Apache Spark. It is a must-read guide for those who want to learn Apache Spark and also for those who want to extend their Apache Spark skills. Whether you want to learn about Apache Spark components or need to find the best Apache Spark certifications, you can find it here! This guide is a one-stop destination for answers to questions about Apache Spark. Apache Spark has the power to simplify challenging processing tasks on different types of large datasets. It performs complex analytics by integrating graph algorithms and machine learning. Spark has brought big data processing to everyone. Just check it out!
  • 29. Reference Links 1. https://spark.apache.org/ 2. https://www.whizlabs.com/blog/learn-apache-spark/ 3. https://www.whizlabs.com/blog/importance-of-apache-spark/ 4. https://www.whizlabs.com/blog/best-apache-spark-books/ 5. https://hortonworks.com/ 6. https://www.cloudera.com/ Thank You!