A gentle introduction to Apache Spark: from the theory of Resilient Distributed Datasets (RDDs) to deploying software on the core platform, Spark Streaming, and Spark SQL
2. Contents:
• Spark
• Frameworks
• Ecosystem
• Resilient Distributed Datasets (RDD)
• A Simplified Data Flow
• Executors
• Iterative Operations
• Fault-tolerance
• Comparisons
• Who uses Spark?
• Datasets
• DataFrame
• Scala
• Practices
• Pi Estimation
• Spark Streaming
• Practice
• Compile and Deploy
• Spark SQL
• PageView
• References
3. Spark
Apache Spark™ is a unified analytics engine for large-scale data processing.
Created at UC Berkeley's AMPLab, whose team later founded Databricks
Written in Scala
Licensed under the Apache License
Developed on GitHub
6. Resilient Distributed Datasets
An RDD is a fundamental data structure of Spark, stored in memory.
It is an immutable distributed collection of objects.
Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
7. Resilient Distributed Datasets
RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records.
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
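The idea of an immutable collection split into logical partitions that are computed in parallel can be sketched in plain Python, without Spark. This is a toy illustration only; `partition` and `map_partitions` are made-up names, not Spark's API.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Split a collection into logical partitions, as an RDD is."""
    size = math.ceil(len(records) / num_partitions)
    return [records[i:i + size] for i in range(0, len(records), size)]

def map_partitions(partitions, fn):
    """Apply fn to every partition in parallel; the input stays untouched (immutable)."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda part: [fn(x) for x in part], partitions))

data = list(range(10))
parts = partition(data, 4)                       # 4 logical partitions
squared = map_partitions(parts, lambda x: x * x) # each partition computed by a worker thread
```

In real Spark the partitions live on different cluster nodes and the threads are executors; the shape of the computation is the same.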
11. Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data (their lineage).
Batches of input data are replicated in the memory of multiple worker nodes, making them fault-tolerant.
Data lost due to a worker failure can be recomputed from the lineage.
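Recovery by lineage can be sketched as follows: instead of replicating results, a toy RDD records the chain of operations that produced it and can replay that chain over the original input after a failure. `LineageRDD` is a hypothetical illustration, not Spark's implementation.

```python
class LineageRDD:
    """Toy RDD that stores its source plus the chain of transformations
    (its lineage), so a lost partition can be recomputed, not restored from a copy."""

    def __init__(self, source, lineage=()):
        self.source = source      # original fault-tolerant input data
        self.lineage = lineage    # sequence of operations that created this RDD

    def map(self, fn):
        return LineageRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return LineageRDD(self.source, self.lineage + (("filter", pred),))

    def compute(self):
        """Replay the lineage over the source, e.g. after a worker failure."""
        data = list(self.source)
        for op, fn in self.lineage:
            data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
        return data

rdd = LineageRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
```

Note that `map` and `filter` build a new RDD instead of computing anything: like Spark, the work happens lazily when `compute()` is called.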
12. Breaking the Records!
Sort Benchmark results:

|                              | Hadoop MR Record              | Spark Record                   | Spark 1 PB                     |
|------------------------------|-------------------------------|--------------------------------|--------------------------------|
| Data Size                    | 102.5 TB                      | 100 TB                         | 1000 TB                        |
| Elapsed Time                 | 72 mins                       | 23 mins                        | 234 mins                       |
| # Nodes                      | 2100                          | 206                            | 190                            |
| # Cores                      | 50400 physical                | 6592 virtualized               | 6080 virtualized               |
| Cluster disk throughput      | 3150 GB/s (est.)              | 618 GB/s                       | 570 GB/s                       |
| Sort Benchmark Daytona Rules | Yes                           | Yes                            | No                             |
| Network                      | dedicated data center, 10Gbps | virtualized (EC2) 10Gbps       | virtualized (EC2) 10Gbps       |
| Sort rate                    | 1.42 TB/min                   | 4.27 TB/min                    | 4.27 TB/min                    |
| Sort rate/node               | 0.67 GB/min                   | 20.7 GB/min                    | 22.5 GB/min                    |
17. Spark SQL and Dataset
Spark SQL is a Spark module for structured data processing.
Unlike the basic RDD API, its interfaces carry extra information about the structure of the data, which Spark SQL uses to perform extra optimizations.
A Dataset is a newer interface that provides the benefits of RDDs together with the benefits of Spark SQL's optimized engine.
18. DataFrame
A DataFrame is a Dataset organized into named columns.
It is conceptually equivalent to a table in a relational database or a data frame in R/Python.
It benefits from richer optimizations under the hood.
19. Scala
Scala combines object-oriented and functional programming in one concise, high-level language.
Scala's static types help avoid bugs in complex applications.
Its JVM and JavaScript runtimes let you build high-performance systems.
It has easy access to huge ecosystems of libraries.
23. Spark Streaming
A framework for large-scale stream processing
Scales to hundreds of nodes
Provides a simple batch-like API for implementing complex algorithms
24. Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
Chop up the live stream into batches of X seconds.
Spark treats each batch of data as an RDD and processes it using RDD operations.
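The micro-batch model above can be sketched in plain Python: timestamped events are chopped into fixed-length batches, and each batch is then processed as one small, deterministic job. The stream, the one-second batch interval, and the function names are all illustrative, not Spark's API.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Chop a timestamped stream into batches of `batch_seconds` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[k] for k in sorted(batches)]

# Simulated live stream: (seconds-since-start, word) pairs.
stream = [(0.2, "spark"), (0.9, "rdd"), (1.4, "spark"), (2.1, "sql")]

for batch in micro_batches(stream, 1):
    # In Spark Streaming each batch becomes an RDD
    # and is processed with ordinary RDD operations.
    counts = {word: batch.count(word) for word in set(batch)}
```

Because each batch job is deterministic, a failed batch can simply be re-run, which is how the micro-batch design inherits RDD fault tolerance.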
25. Practices
Run "Stateless NetworkWordCount"
Compile and deploy a Java file
Run "Stateful NetworkWordCount"
Run "PageView" and its generator
Execute simple data operations with Spark SQL
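The difference between the stateless and stateful word-count exercises can be sketched without Spark: the stateless version counts each batch independently, while the stateful version merges new counts into a running state (the role played by `updateStateByKey` in Spark Streaming). The function names and input lines here are illustrative.

```python
from collections import Counter

def stateless_counts(batch):
    """Stateless: each batch of lines is counted on its own."""
    return Counter(batch.split())

def stateful_counts(state, batch):
    """Stateful: new counts are merged into the running totals,
    as Spark Streaming's updateStateByKey does."""
    state = state.copy()
    state.update(batch.split())
    return state

state = Counter()
for line in ["spark streaming", "spark sql"]:
    state = stateful_counts(state, line)
```

After both batches, the stateless view of the second batch knows only `spark` and `sql`, while `state` remembers `streaming` from the first batch as well.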
27. Compile and Deploy
Compile
1. Generate a project with Maven
2. Copy the Java file to the endpoint
3. Edit the pom.xml
4. Package the project
Deploy
1. Submit the .jar file to Spark
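On the command line, the steps above might look roughly like this. All concrete names (`com.example`, `wordcount`, `NetworkWordCount`, the jar name, and the `localhost 9999` arguments) are placeholders, not the deck's actual values.

```shell
# 1. Generate a project skeleton with Maven
mvn archetype:generate -DgroupId=com.example -DartifactId=wordcount \
    -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

# 2. Copy the Java file into the project's source tree
cp NetworkWordCount.java wordcount/src/main/java/com/example/

# 3. Edit pom.xml (add the Spark dependency, set the Java version), then
# 4. Package the project into a .jar
cd wordcount && mvn package

# Deploy: submit the jar to Spark
spark-submit --class com.example.NetworkWordCount \
    --master local[2] target/wordcount-1.0-SNAPSHOT.jar localhost 9999
```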
28. Apache Maven
Maven is a tool for building and managing any Java-based project:
Making the build process easy
Providing a uniform build system
Providing quality project information
Providing guidelines for best-practices development
Allowing transparent migration to new features
29. pom.xml
Contains the project identifiers
Defines the source and target versions of Java to be used
Dependencies are resolved from Maven's repository
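The relevant pieces of a pom.xml for such a project might look like the fragment below. The Java and Spark versions shown are examples only, not the ones the deck used, and the fragment omits the surrounding `<project>` element.

```xml
<!-- Java source/target versions to compile against (example values) -->
<properties>
  <maven.compiler.source>11</maven.compiler.source>
  <maven.compiler.target>11</maven.compiler.target>
</properties>

<!-- Dependencies resolved from Maven's repository -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.5.0</version>
    <!-- "provided": the Spark cluster supplies these classes at run time -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```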
31. PageView
The aim is to analyze the hit/miss ratio of a website.
The generator simulates 100 users from 2 regions (zip codes 94709 and 94117) on 10 threads.
Hit rate: 95%
Pages and their weights: /index (0.7), /news (0.2), /contact (0.1)
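The core of such a generator can be sketched in plain Python: draw a page by its weight, a region, and a hit-or-miss outcome at the 95% rate. This is a single-threaded sketch of the idea, not the deck's actual generator; the status codes and function name are made up for illustration.

```python
import random

random.seed(42)  # deterministic for the example

PAGES = {"/index": 0.7, "/news": 0.2, "/contact": 0.1}
ZIP_CODES = ["94709", "94117"]
HIT_RATE = 0.95

def page_view():
    """One simulated request: a weighted page, a region, and a hit (200) or miss (404)."""
    page = random.choices(list(PAGES), weights=list(PAGES.values()))[0]
    zip_code = random.choice(ZIP_CODES)
    status = 200 if random.random() < HIT_RATE else 404
    return page, zip_code, status

views = [page_view() for _ in range(10_000)]
hit_ratio = sum(status == 200 for _, _, status in views) / len(views)
```

With enough simulated views, `hit_ratio` settles near the configured 95%, which is what the analysis job is meant to recover from the stream.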
32. Spark SQL
The Spark master node connects to relational databases and loads data from a specific table or via a specific SQL query.
The master node distributes the data to worker nodes for transformation.
The worker nodes connect to the relational database and write the data back.
The user can choose between row-by-row insertion and bulk insert.
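The row-by-row versus bulk-insert choice can be illustrated with Python's built-in `sqlite3` standing in for the relational database (the table name and data are invented for the example):

```python
import sqlite3

rows = [(i, f"user{i}") for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (id INTEGER, user TEXT)")

# Row-by-row insertion: one round trip per record.
for row in rows[:10]:
    conn.execute("INSERT INTO page_views VALUES (?, ?)", row)

# Bulk insert: many records in one batched statement (usually much faster).
conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows[10:])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM page_views").fetchone()[0]
```

Both paths end with the same table contents; bulk insert simply amortizes the per-statement overhead, which is why it is usually preferred when a worker writes a whole partition.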
34. Archive
The commands are available on this gist.
Please do not hesitate to contact us if we can be of any further assistance:
navidkalaei@gmail.com or linkedin.com/in/navid-kalaei
fatemeh.jamalii70@gmail.com