Analyzing Data at Scale
with Apache Spark
Nicola Ferraro (@ni_ferraro)
Senior Software Engineer at Red Hat
Naples, November 24th 2017
Myself
Nicola Ferraro
Senior Software Engineer at Red Hat
Working on Apache Camel, JBoss Fuse,
Fuse Integration Services for Openshift,
Syndesis, Oshinko Radanalytics.
Follow me on Twitter
@ni_ferraro
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Big Data Systems: why?
Systems capable of handling data with
high:
● Volume
○ Terabytes/Petabytes of data collected
over the years
● Velocity
○ High speed streaming data to be
analyzed in near real-time
● Variety
○ Not just tabular data or JSON/XML, but also
images, videos, free text
[Figure: Volume, Velocity, Variety, with the label "There!"]
Big Data Systems: why IoT?
Big Data Systems: which devices?
An Example?
Back to the Future II (Weather forecasting)
We can collect data from static sensors and moving cars to understand the exact
moment when it will stop raining!
E.g. https://goo.gl/FDzfdx
Big Data Systems: how?
By scaling horizontally to
1000s of machines!
A single machine can be
slow, but together they have
huge processing power!
Evolution of Big Data Systems: Software
2006: Hadoop
2008: Pig (scripting)
2010: Hive (SQL)
2014+: ...
Evolution of Big Data Systems: Infrastructure
2006: Commodity Hardware
2011: Big Data Appliances
2014: Virtual Machines
2018: ?
Evolution of Big Data Systems: Architectures
2006: Batch (Data Lake)
2011: Hybrid (Lambda)
2016+: Streaming (Kappa)
Batch Architecture
Hadoop v1
[Diagram: Ingest, then four HDFS nodes each running Map and Reduce, then To serving layer]
1. Ingest to HDFS
2. Input-output from HDFS with MapReduce
3. Export to external systems using HDFS tools
Lambda Architecture
[Diagram labels: Ingest, Messaging, Streaming, HDFS, Batch, NoSQL, Interactive Queries, To serving layer]
Batch processing every
night or every n days...
Kappa Architecture
[Diagram labels: Distributed Event Log, Streaming, To serving layer]
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Map Reduce Example: Word Count
Users implemented 2 function classes (Map and Reduce) and 1 config file
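The two functions can be sketched in plain Java. This is only an illustration of the model, assuming nothing beyond the standard library: it does not use the Hadoop API, and the names WordCountSketch, map, reduce and wordCount are hypothetical. The grouping step in wordCount stands in for the shuffle that the framework performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the MapReduce word count; no Hadoop classes are used.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum all the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Run map on every line, group the pairs by word (the "shuffle"),
    // then run reduce once per word.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffled.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be", "to do")));
    }
}
```

In a real cluster the map calls run on the machines that hold the input blocks, and the reduce calls run on whichever machines receive each word after the shuffle.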
Old Data Processing Model: Map Reduce
Hadoop: batch architecture
[Diagram: Machines 1 to 4 each load data (usually from HDFS, replication factor 3), run MAP, cache, shuffle, run REDUCE, then store]
Most of the work is done in parallel by all machines!
Introducing Spark
Fast data processing platform.
● Batch processing
● Streaming (structured or micro-batching)
● Machine Learning
● Graph-based Algorithms
Multi-language: Scala, Java, Python, R
Apache Spark: RDD
The core Spark API is based on the concept of Resilient Distributed Dataset.
RDD (Set of all events received)
val events: RDD[Event] = …
Like a Scala collection
(but lazy)
Sources: HDFS, JDBC, NoSQL, Kafka
Partitions: P1 P2 P3 P4 P5 P6
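What "lazy" means can be shown on a single machine with Java 8 streams, used here only as a stand-in for an RDD: declaring a transformation does no work, and elements are inspected only when a terminal operation asks for a result. Spark follows the same split between transformations (such as filter and map) and actions (such as collect). The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Single-machine illustration of lazy evaluation using Java streams.
class LazyPipelineSketch {

    // Records every element that the filter actually inspects.
    static final List<Integer> inspected = new ArrayList<>();

    // Declares a transformation; no element is evaluated by this call.
    static Stream<Integer> evenNumbers(List<Integer> source) {
        return source.stream().filter(n -> {
            inspected.add(n);
            return n % 2 == 0;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = evenNumbers(Arrays.asList(1, 2, 3, 4));
        System.out.println("inspected before collect: " + inspected.size());

        // The terminal operation triggers the actual computation.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("result: " + result);
        System.out.println("inspected after collect: " + inspected.size());
    }
}
```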
Apache Spark: Functional Programming Model
Java 8 streams:
List<String> firstnames = people.stream()
.filter(p -> p.getAge() < 30)
.map(p -> p.getFirstname())
.distinct()
.collect(Collectors.toList());
Get all distinct first names of people
under 30 from a Java collection.
Apache Spark (Scala):
val firstnames = people
.filter(p => p.age < 30)
.map(p => p.firstname)
.distinct()
.collect();
The only difference: people is a 20 TB
RDD and the computation is performed by
several machines in parallel.
Apache Spark: Streaming (or micro-batching)
DStream = Discretized Stream
The size of each micro-batch is
specified by the user (in seconds)
Sliding window mode
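The two ideas on this slide, discretizing a stream into fixed-size micro-batches and aggregating over a sliding window of batches, can be sketched in plain Java. This is a simplified model that does not use the Spark Streaming API; it assumes events are represented only by integer timestamps in seconds, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of micro-batching and sliding windows; not the Spark API.
class MicroBatchSketch {

    // Discretize: batch k covers [k * batchSeconds, (k + 1) * batchSeconds).
    // Returns the number of events that fall into each of the first numBatches batches.
    static List<Integer> batchCounts(List<Integer> timestamps, int batchSeconds, int numBatches) {
        int[] counts = new int[numBatches];
        for (int ts : timestamps) {
            if (ts < 0) {
                continue; // ignore events before the start of the stream
            }
            int batch = ts / batchSeconds;
            if (batch < numBatches) {
                counts[batch]++;
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int c : counts) {
            result.add(c);
        }
        return result;
    }

    // Sliding window: after each batch, sum the counts of the last windowBatches batches.
    static List<Integer> slidingWindowCounts(List<Integer> perBatch, int windowBatches) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < perBatch.size(); i++) {
            int sum = 0;
            for (int j = Math.max(0, i - windowBatches + 1); j <= i; j++) {
                sum += perBatch.get(j);
            }
            result.add(sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> perBatch = batchCounts(Arrays.asList(0, 1, 1, 2, 5, 6, 7), 2, 4);
        System.out.println("per batch: " + perBatch);
        System.out.println("window of 2 batches: " + slidingWindowCounts(perBatch, 2));
    }
}
```

Here the window slides by one batch at a time, so each event contributes to several consecutive window results.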
Apache Spark 2.0: Dataframes/Datasets
RDD/DStream are the core APIs for processing data, but they are now considered too
low-level.
Streaming → DStream[Temperature]
Batch → RDD[Temperature]
Spark 2.0 introduced Structured Streaming:
● Using the same API for streaming and still data
● Treating a stream of events as a growing append-only collection
The plan is to remove the RDD/DStream
API in Spark 3.0.
For now, Structured Streaming is
not feature-complete (as of Spark 2.2.0).
[Diagram: Stream, seen as an append-only table with columns col1, col2, …]
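The "stream as an append-only table" idea can be sketched in plain Java: one query definition is applied both to still data and to a table that keeps growing as events arrive. This is a conceptual model only, not the Structured Streaming API. The slides name the type Temperature but not its fields, so the sensor and celsius fields, like the other names here, are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual model of "a stream is an append-only table"; not the Spark API.
class AppendOnlyTableSketch {

    // Hypothetical row type: the fields are assumptions made for this sketch.
    static final class Temperature {
        final String sensor;
        final double celsius;

        Temperature(String sensor, double celsius) {
            this.sensor = sensor;
            this.celsius = celsius;
        }
    }

    // One query definition, usable on still data and on the growing table alike.
    static Map<String, Double> averageBySensor(List<Temperature> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Temperature t) -> t.sensor,
                Collectors.averagingDouble((Temperature t) -> t.celsius)));
    }

    // The stream seen as a table: rows are only ever appended, never changed.
    static final class AppendOnlyTable {
        private final List<Temperature> rows = new ArrayList<>();

        void append(Temperature row) {
            rows.add(row);
        }

        List<Temperature> snapshot() {
            return Collections.unmodifiableList(rows);
        }
    }

    public static void main(String[] args) {
        AppendOnlyTable table = new AppendOnlyTable();
        table.append(new Temperature("s1", 20.0));
        System.out.println(averageBySensor(table.snapshot()));
        table.append(new Temperature("s1", 22.0));
        System.out.println(averageBySensor(table.snapshot()));
    }
}
```

This sketch recomputes the query over the whole table after every append, which is the simplest way to show the semantics; an engine would typically update the result incrementally.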
Apache Spark: Machine Learning
Spark MLlib has built-in algorithms:
● Classification: logistic regression, decision trees, support vector machines, …
● Regression
● Clustering: K-Means, LDA, GMM, …
● Collaborative Filtering
● …
Available for RDD and Dataframe/Datasets (incomplete)
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Openshift
Built on Kubernetes, the container orchestration platform born at Google.
● Running Containers
● Virtual Namespaces
● Virtual Networks
● Service Discovery
● Load Balancing
● Auto-Scaling
● Health-checking and auto-recovery
● Monitoring and Logging
[Diagram labels: Creating Containers; Orchestrating Containers; Kubernetes Enterprise Edition]
Spark Architecture
Cluster Manager (Oshinko, Radanalytics): assigns executors to the App.
Driver: executes the Driver App (Main.class) and sends tasks to executors.
Workers: host the Executors, which run the Tasks.
Task = “do something on a data partition”
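The driver, executor and task roles can be mimicked on one machine with a thread pool. In this sketch, which uses only the Java standard library and not Spark itself, the calling method plays the driver, the pool threads play the executors, and each submitted Callable is a task working on one data partition. All names are hypothetical, and numPartitions is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-machine analogy of driver, executors and tasks; not the Spark API.
class DriverSketch {

    // Split the data into contiguous partitions of roughly equal size.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int start = 0; start < data.size(); start += size) {
            int end = Math.min(start + size, data.size());
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    // Driver logic: send one task per partition to the executors,
    // then combine the partial results into the final answer.
    static long countBelow(List<Integer> data, int threshold, int numPartitions, int numExecutors) {
        ExecutorService executors = Executors.newFixedThreadPool(numExecutors);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (List<Integer> p : partition(data, numPartitions)) {
                // Task = "do something on a data partition".
                Callable<Long> task = () -> p.stream().filter(x -> x < threshold).count();
                pending.add(executors.submit(task));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            executors.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countBelow(Arrays.asList(1, 5, 10, 20, 3, 40), 10, 3, 2));
    }
}
```

In a real cluster the executors are separate processes on worker machines, and the cluster manager decides how many of them the application gets.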
Agenda
● A brief history of Big Data
● Data processing models
● Spark on Openshift
● Demo
Demo
You’ll see:
● Apache Spark on Openshift with Oshinko
● Kafka on Openshift (EnMasse)
● Spring Boot + Apache Camel simulator
Sources and instructions are available here:
https://github.com/nicolaferraro/iot-day-napoli-2017-demo
Thanks!
Questions?
@ni_ferraro
