SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Navid Kalaei
Fatemeh Jamali
Dr. Esmaeili
Shiraz University of Technology
Contents:
• Spark
• Frameworks
• Ecosystem
• Resilient Distributed
Datasets(RDD)
• A Simplified Data
Flow
• Executors
• Iterative Operations
• Fault-tolerance
• Comparisons
• Who uses Spark?!
• Datasets
• DataFrame
• Scala
• Practices
• Pi Estimation
• Spark Stream
• Practice
• Compile and Deploy
• Spark SQL
• PageView
• References
2
Spark
 Apache Spark™ is a unified analytics engine for
large-scale data processing.
 Created by AMPLab now Databricks
 Written in Scala
 Licensed under Apache
 Lives in Github
3
Frameworks
4
Ecosystem
Hadoop Spark
Hive SparkSQL
Apache Mahout MLLib
Impala SparkSQL
Apache Giraph Graphax
Apache Storm Spark streaming
5
Resilient Distributed
Datasets
 RDD is a fundamental data structure of Spark
stored in memory.
 It is an immutable distributed collection of objects.
 Each dataset in RDD is divided into logical
partitions, which may be computed on different
nodes of the cluster.
6
Resilient Distributed
Datasets
 RDDs can contain any type of Python, Java, or
Scala objects, including user-defined classes.
 Formally, an RDD is a read-only, partitioned
collection of records.
 RDD is a fault-tolerant collection of elements that
can be operated on in parallel.
7
A Simplified Data Flow
8
Executors
9
Iterative Operations
MapRedu
ce
Spark
10
Fault-tolerance
 RDDs are remember
the sequence of
operations that
created it from the
original fault-tolerant
input data
 Batches of input data
are replicated in
memory of multiple
worker nodes,
therefore fault-tolerant
 Data lost due to
worker failure, can be
11
Breaking the Records!
Hadoop MR
Record
Spark
Record
Spark
1 PB
Data Size 102.5 TB 100 TB 1000 TB
Elapsed Time 72 mins 23 mins 234 mins
# Nodes 2100 206 190
# Cores 50400 physical 6592 virtualized 6080 virtualized
Cluster disk
throughput
3150 GB/s
(est.)
618 GB/s 570 GB/s
Sort Benchmark
Daytona Rules
Yes Yes No
Network dedicated data
center, 10Gbps
virtualized
(EC2) 10Gbps
network
virtualized
(EC2) 10Gbps
network
Sort rate 1.42 TB/min 4.27 TB/min 4.27 TB/min
Sort rate/node 0.67 GB/min 20.7 GB/min 22.5 GB/min
12
Performance
Word
Count
13
Performance
PageRan
k
14
Salary Comparison
15
Who uses Spark?!
16
Spark SQL and DataSet
 Spark SQL is a Spark module for structured data
processing
 Spark SQL uses this extra information to perform
extra optimizations
 Dataset is a new interface that provides the
benefits of RDDs with the benefits of Spark SQL’s
optimized engine
17
DataFrame
 A DataFrame is a Dataset organized into named
columns
 It is conceptually equivalent to a table in a
relational database or a data frame in R/Python
 It benefits from richer optimizations under the
hood
18
Scala
 Scala combines object-oriented and functional
programming in one concise high-level language.
 Scala's static types help avoid bugs in complex
applications
 Its JVM and JavaScript runtimes let you build
high performance systems
 It has an easy access to huge ecosystems of
libraries.
19
Practices
 WordCount
 Pi Estimation
 Text Search
20
Pi Estimation
∏
4
1
1
21
What is streaming?
22
Spark Stream
 Framework for large scale stream processing
 Scales to 100s of nodes
 Provides a simple batch-like API for
implementing complex algorithm
23
Stream Processing
24
 Run a streaming computation as a series of
very small, deterministic batch jobs
 Chop up the live stream into batches of X
seconds
 Spark treats each batch of data as RDDs and
processes them using RDD operations
Practices
25
 Run “Stateless NetworkWordCount”
 Compile Deploy a java file
 Run “Stateful NetworkWordCount”
 Run “PageView” and it’s generator
 Execute simple data operations with Spark SQL
Stream The Words!
26
Compile and Deploy
27
Compile
1. Generate project with Maven
2. Copy the Java file to the end point
3. Edit the pom.xml
4. Package the project
Deploy
1. Submit the .jar file to Spark
Apache Maven
28
 Maven is a tool that can now be used for building
and managing any Java-based project
 Making the build process easy
 Providing a uniform build system
 Providing quality project information
 Providing guidelines for best practices
development
 Allowing transparent migration to new features
pom.xml
29
 Contains project identifiers
 Defines source and target versions of Java to be
used
 Dependencies are subjected from Maven’s
repository
Let’s Compile and Deploy!
30
PageView
 The aim is to analyze hit and miss ratio of a
website
 The generator simulates 100 users from 2
regions on 10 threads
 Hit rate: 95%
31
100Users
• 94709
• 94117
Zip
Codes
• /index
0.7
• /news
0.2
• /contact
0.1
Pages
Spark SQL
32
 The Spark master node connects to relational databases and
loads data from a specific table or using a specific SQL query.
 The Spark master node distributes data to worker nodes for
transformation
 The Worker node connects to the relational database and writes
data
 The user can choose to use row-by-row insertion or bulk insert
References
 [1]: Apache Spark officially sets a new record in
large-scale sorting
 [2]: 2014 Data Science Salary Survey
 [3]: The Performance Comparison of Hadoop and
Spark
 [4] Apache Maven Project
 [5] The Scala Programming Language
33
Archive
 The commands are available on this gist
 Please do not hesitate to contact us if we could
be any further of assistance:
navidkalaei@gmail.com or linkedin.com/in/navid-
kalaei fatemeh.jamalii70@gmail.com
34

Weitere ähnliche Inhalte

Was ist angesagt?

Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...Edureka!
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseSnowflake Computing
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementDatabricks
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes Minio
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMongoDB
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 

Was ist angesagt? (20)

SQL & NoSQL
SQL & NoSQLSQL & NoSQL
SQL & NoSQL
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Tra...
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Spark
SparkSpark
Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
NoSql
NoSqlNoSql
NoSql
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 

Ähnlich wie Spark Workshop

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersKnoldus Inc.
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 

Ähnlich wie Spark Workshop (20)

Spark
SparkSpark
Spark
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Spark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All DevelopersSpark Unveiled Essential Insights for All Developers
Spark Unveiled Essential Insights for All Developers
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache spark
Apache sparkApache spark
Apache spark
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 

Kürzlich hochgeladen

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 

Kürzlich hochgeladen (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 

Spark Workshop

  • 1. Navid Kalaei Fatemeh Jamali Dr. Esmaeili Shiraz University of Technology
  • 2. Contents: • Spark • Frameworks • Ecosystem • Resilient Distributed Datasets(RDD) • A Simplified Data Flow • Executors • Iterative Operations • Fault-tolerance • Comparisons • Who uses Spark?! • Datasets • DataFrame • Scala • Practices • Pi Estimation • Spark Stream • Practice • Compile and Deploy • Spark SQL • PageView • References 2
  • 3. Spark  Apache Spark™ is a unified analytics engine for large-scale data processing.  Created by AMPLab now Databricks  Written in Scala  Licensed under Apache  Lives in Github 3
  • 5. Ecosystem Hadoop Spark Hive SparkSQL Apache Mahout MLLib Impala SparkSQL Apache Giraph Graphax Apache Storm Spark streaming 5
  • 6. Resilient Distributed Datasets  RDD is a fundamental data structure of Spark stored in memory.  It is an immutable distributed collection of objects.  Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. 6
  • 7. Resilient Distributed Datasets  RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.  Formally, an RDD is a read-only, partitioned collection of records.  RDD is a fault-tolerant collection of elements that can be operated on in parallel. 7
  • 11. Fault-tolerance  RDDs are remember the sequence of operations that created it from the original fault-tolerant input data  Batches of input data are replicated in memory of multiple worker nodes, therefore fault-tolerant  Data lost due to worker failure, can be 11
  • 12. Breaking the Records! Hadoop MR Record Spark Record Spark 1 PB Data Size 102.5 TB 100 TB 1000 TB Elapsed Time 72 mins 23 mins 234 mins # Nodes 2100 206 190 # Cores 50400 physical 6592 virtualized 6080 virtualized Cluster disk throughput 3150 GB/s (est.) 618 GB/s 570 GB/s Sort Benchmark Daytona Rules Yes Yes No Network dedicated data center, 10Gbps virtualized (EC2) 10Gbps network virtualized (EC2) 10Gbps network Sort rate 1.42 TB/min 4.27 TB/min 4.27 TB/min Sort rate/node 0.67 GB/min 20.7 GB/min 22.5 GB/min 12
  • 17. Spark SQL and DataSet  Spark SQL is a Spark module for structured data processing  Spark SQL uses this extra information to perform extra optimizations  Dataset is a new interface that provides the benefits of RDDs with the benefits of Spark SQL’s optimized engine 17
  • 18. DataFrame  A DataFrame is a Dataset organized into named columns  It is conceptually equivalent to a table in a relational database or a data frame in R/Python  It benefits from richer optimizations under the hood 18
  • 19. Scala  Scala combines object-oriented and functional programming in one concise high-level language.  Scala's static types help avoid bugs in complex applications  Its JVM and JavaScript runtimes let you build high performance systems  It has an easy access to huge ecosystems of libraries. 19
  • 20. Practices  WordCount  Pi Estimation  Text Search 20
  • 23. Spark Stream  Framework for large scale stream processing  Scales to 100s of nodes  Provides a simple batch-like API for implementing complex algorithm 23
  • 24. Stream Processing 24  Run a streaming computation as a series of very small, deterministic batch jobs  Chop up the live stream into batches of X seconds  Spark treats each batch of data as RDDs and processes them using RDD operations
  • 25. Practices 25  Run “Stateless NetworkWordCount”  Compile Deploy a java file  Run “Stateful NetworkWordCount”  Run “PageView” and it’s generator  Execute simple data operations with Spark SQL
  • 27. Compile and Deploy 27 Compile 1. Generate project with Maven 2. Copy the Java file to the end point 3. Edit the pom.xml 4. Package the project Deploy 1. Submit the .jar file to Spark
  • 28. Apache Maven 28  Maven is a tool that can now be used for building and managing any Java-based project  Making the build process easy  Providing a uniform build system  Providing quality project information  Providing guidelines for best practices development  Allowing transparent migration to new features
  • 29. pom.xml 29  Contains project identifiers  Defines source and target versions of Java to be used  Dependencies are subjected from Maven’s repository
  • 30. Let’s Compile and Deploy! 30
  • 31. PageView  The aim is to analyze hit and miss ratio of a website  The generator simulates 100 users from 2 regions on 10 threads  Hit rate: 95% 31 100Users • 94709 • 94117 Zip Codes • /index 0.7 • /news 0.2 • /contact 0.1 Pages
  • 32. Spark SQL 32  The Spark master node connects to relational databases and loads data from a specific table or using a specific SQL query.  The Spark master node distributes data to worker nodes for transformation  The Worker node connects to the relational database and writes data  The user can choose to use row-by-row insertion or bulk insert
  • 33. References  [1]: Apache Spark officially sets a new record in large-scale sorting  [2]: 2014 Data Science Salary Survey  [3]: The Performance Comparison of Hadoop and Spark  [4] Apache Maven Project  [5] The Scala Programming Language 33
  • 34. Archive  The commands are available on this gist  Please do not hesitate to contact us if we could be any further of assistance: navidkalaei@gmail.com or linkedin.com/in/navid- kalaei fatemeh.jamalii70@gmail.com 34