Analyzing Data at Scale with Apache Spark
1. Analyzing Data at Scale with Apache Spark
Nicola Ferraro (@ni_ferraro)
Senior Software Engineer at Red Hat
Naples, November 24th 2017
3. Myself
Nicola Ferraro
Senior Software Engineer at Red Hat
Working on Apache Camel, JBoss Fuse, Fuse Integration Services for Openshift, Syndesis, Oshinko (Radanalytics).
Follow me on Twitter: @ni_ferraro
4. Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
5. Big Data Systems: why?
Systems capable of handling data with high:
● Volume
○ Terabytes/petabytes of data collected over the years
● Velocity
○ High-speed streaming data to be analyzed in near real-time
● Variety
○ Not just tabular data or JSON/XML, but also images, videos, free text
[Venn diagram: Big Data sits at the intersection of Volume, Velocity, and Variety]
8. An Example?
Back to the Future II (weather forecasting).
We can collect data from static sensors and moving cars to predict the exact moment when it will stop raining!
E.g. https://goo.gl/FDzfdx
9. Big Data Systems: how?
By scaling horizontally to 1000s of machines!
A single machine can be slow, but together they have huge processing power!
10. Evolution of Big Data Systems: Software
2006: Hadoop
2008: Pig (scripting)
2010: Hive (SQL)
2014+: …
11. Evolution of Big Data Systems: Infrastructure
2006: Commodity Hardware
2011: Big Data Appliances
2014: Virtual Machines
2018: ?
12. Evolution of Big Data Systems: Architectures
2006: Batch (Data Lake)
2011: Hybrid (Lambda)
2016+: Streaming (Kappa)
13. Batch Architecture (Hadoop v1)
[Diagram: Ingest → HDFS → MapReduce → HDFS → MapReduce → … → HDFS → serving layer]
1. Ingest to HDFS
2. Input and output from HDFS with MapReduce
3. Export to external systems using HDFS tools
16. Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
17. Map Reduce Example: Word Count
Users implemented two function classes (Map and Reduce) and one config file, as in the sketch below.
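For illustration, a minimal sketch of what the two functions compute, in plain Scala (an assumption for readability: the real Hadoop v1 API is a pair of Java classes extending Mapper and Reducer, plus the job configuration):

object WordCountSketch {
  // Map: for each input line, emit (word, 1) pairs
  def map(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Reduce: for one word, sum all the 1s emitted by the mappers
  def reduce(word: String, counts: Iterable[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do is to be")
    val counts = lines.flatMap(map)                          // map phase
      .groupBy(_._1)                                         // shuffle: group pairs by key
      .map { case (w, pairs) => reduce(w, pairs.map(_._2)) } // reduce phase
    counts.foreach(println)
  }
}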
18. Old Data Processing Model: Map Reduce
Hadoop: batch architecture.
[Diagram: four machines load data partitions from HDFS (replication factor 3), run MAP in parallel, cache the intermediate results, shuffle them across the network, run REDUCE in parallel, and store the output (usually back to HDFS).]
Most of the work is done in parallel by all machines!
19. Introducing Spark
Fast data processing platform.
● Batch processing
● Streaming (structured or micro-batching)
● Machine Learning
● Graph-based Algorithms
Multi-language: Scala, Java, Python, R
20. Apache Spark: RDD
The core Spark API is based on the concept of Resilient Distributed Dataset.
val events: RDD[Event] = …
An RDD (e.g. the set of all events received) is like a Scala collection (but lazy): it can be loaded from sources such as HDFS, JDBC, NoSQL stores or Kafka, and is split into partitions (P1 … P6) spread across the cluster. A runnable sketch follows below.
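A minimal runnable sketch (assumptions: a local[4] master and generated sample data, just to keep the snippet self-contained):

import org.apache.spark.{SparkConf, SparkContext}

object RddIntro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-intro").setMaster("local[4]"))

    // Like a Scala collection, but split into partitions (the P1 … P6 boxes)
    val events = sc.parallelize(1 to 1000000, numSlices = 6)

    val doubled = events.map(_ * 2)    // lazy: no job runs yet
    println(doubled.getNumPartitions)  // 6
    println(doubled.count())           // count() is an action and triggers the job

    sc.stop()
  }
}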
21. Apache Spark: Functional Programming Model
Java 8 streams:
List<String> firstnames = people.stream()
.filter(p -> p.getAge() < 30)
.map(p -> p.getFirstname())
.distinct()
.collect(Collectors.toList());
Get all distinct first names of people
under 30 from a Java collection.
Apache Spark (Scala):
val firstnames = people
.filter(p => p.age < 30)
.map(p => p.firstname)
.distinct()
.collect()
The only difference: people is a 20TB
RDD and computation is performed by
several machines in parallel
22. Apache Spark: Streaming (or micro-batching)
DStream = Discretized Stream: the input stream is chopped into a sequence of small RDDs (micro-batches).
The size of each micro-batch is specified by the user (in seconds).
A sliding window mode re-aggregates the last N micro-batches at each slide interval, as in the sketch below.
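A minimal DStream sketch (the socket source, host/port and the batch/window sizes are assumptions; feed it text with "nc -lk 9999"):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))  // micro-batch size chosen by the user

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Sliding window mode: re-aggregate the last 30 seconds, every 10 seconds
    val windowed = counts.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    windowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}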
23. Apache Spark 2.0: Dataframes/Datasets
RDD/DStream are the core APIs for processing data, but they are now considered too low-level.
Batch → RDD[Temperature]
Streaming → DStream[Temperature]
Spark 2.0 introduced Structured Streaming:
● Using the same API for streaming and still data
● Treating a stream of events as a growing append-only collection
The plan is to remove the RDD/DStream APIs in Spark 3.0.
For now, structured streaming is not feature-complete (Spark 2.2.0).
[Diagram: a stream of events appended, row by row, to an append-only table with columns col1, col2, …]
A minimal example follows below.
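A Structured Streaming sketch, where the socket stream is queried as a growing table (host/port assumed; same "nc -lk 9999" trick as before):

import org.apache.spark.sql.SparkSession

object StructuredSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("structured-sketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // An unbounded Dataset: rows are appended as events arrive
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", "9999").load()

    // Same API as batch: word count over the append-only table
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
  }
}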
24. Apache Spark: Machine Learning
Spark MLlib has built-in algorithms:
● Classification: logistic regression, decision trees, support vector machines, …
● Regression
● Clustering: K-Means, LDA, GMM, …
● Collaborative Filtering
● …
Available for both the RDD and Dataframe/Dataset APIs (the latter still incomplete). A small example follows below.
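As a hedged example of the Dataframe-based API, K-Means on a tiny in-memory dataset (the data points and k=2 are assumptions for illustration only):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("kmeans-sketch").master("local[2]").getOrCreate()

    // Four 2D points forming two obvious clusters
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    )).toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)  // two centers, near (0,0) and (9,9)

    spark.stop()
  }
}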
25. Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
26. Openshift
Container orchestration platform, built on Kubernetes (born at Google).
● Running Containers
● Virtual Namespaces
● Virtual Networks
● Service Discovery
● Load Balancing
● Auto-Scaling
● Health-checking and auto-recovery
● Monitoring and Logging
[Diagram: one layer for creating containers, one for orchestrating them; Openshift positioned as a Kubernetes "Enterprise Edition"]
27. Spark Architecture
● Driver App (Main.class): executed by the Driver, which sends tasks to the executors. Task = "do something on a data partition".
● Cluster Manager: assigns executors to the App; on Openshift this is provided via Oshinko (Radanalytics).
● Workers: host the Executors, each running Tasks.
A sketch of the driver side follows below.
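A hedged sketch of the driver side (the app name and the master URL, e.g. a standalone master such as the one Oshinko provisions, are assumptions):

import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The master URL tells the driver which cluster manager to contact
    val spark = SparkSession.builder
      .appName("my-driver-app")
      .master("spark://my-cluster:7077")  // assumed standalone master endpoint
      .getOrCreate()

    // Transformations become tasks, one per data partition; the cluster manager
    // assigns executors and the driver schedules the tasks onto them
    val sum = spark.sparkContext.parallelize(1 to 100, 4).map(_ * 2).sum()
    println(sum)

    spark.stop()
  }
}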
28. Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
29. Demo
You’ll see:
● Apache Spark on Openshift with Oshinko
● Kafka on Openshift (EnMasse)
● Spring-Boot + Apache Camel simulator
Sources and instructions available here:
https://github.com/nicolaferraro/iot-day-napoli-2017-demo