Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
Slides from the Apache Spark Crash Course at DataWorks Summit Munich 2017
1.
Apache Spark Crash Course - DataWorks Summit - Munich 2017
Robert Hryniewicz, Developer Advocate, @RobertH8z
2.
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
“Big Data”
• Internet of Anything (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
• User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
44ZB in 2020
3.
Visualizing 44ZB
100 px -> 1M TB (assumes a 5M pixel resolution screen)
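The scale on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 ZB = 10^9 TB); the object and method names are illustrative and do not come from the deck.

```scala
object ZettabyteScale {
  // Assumption: decimal units, so 1 ZB = 10^21 bytes and 1 TB = 10^12 bytes.
  val TbPerZb: BigInt = BigInt(10).pow(9)
  // The slide's scale: every 100 pixels stand for 1M TB.
  val TbPer100Px: BigInt = BigInt(1000000)

  // Number of pixels needed to draw the given number of zettabytes at that scale.
  def pixelsFor(zettabytes: Int): BigInt =
    BigInt(zettabytes) * TbPerZb / TbPer100Px * 100

  def main(args: Array[String]): Unit =
    println(pixelsFor(44)) // 4400000
}
```

Under that assumption 44ZB needs about 4.4M pixels, which is consistent with the 5M pixel screen the slide mentions.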
5.
The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?
6.
Spark Background
7.
What Is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms
8.
Why Apache Spark?
• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model
– Fast!
– Effective for iterative computations and ML
• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
9.
Spark SQL (Structured Data) | Spark Streaming (Near Real-time) | Spark MLlib (Machine Learning) | GraphX (Graph Analysis)
10.
Spark Basics
11.
SparkSession
What is it?
• Main entry point for Spark functionality
• Allows programming with DataFrame and Dataset APIs
– Fewer concepts and constructs a developer has to juggle while interacting with Spark
• Represented as spark and auto-initialized in the Zeppelin environment
12.
Spark SQL (Structured Data) | Spark Streaming (Near Real-time) | Spark MLlib (Machine Learning) | GraphX (Graph Analysis)
13.
More Flexible
Better Storage and Performance
14.
Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
• Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
15.
DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
[Diagram: a DataFrame with columns Col1, Col2, …, ColN and rows]
Data is described as a DataFrame with rows, columns, and a schema
16.
Sources
[Diagram: CSV, Avro, HIVE, and JSON sources -> Spark SQL -> DataFrame (columns Col1, Col2, …, ColN; rows)]
17.
Create a DataFrame
Example
val path = "examples/flights.json"
val flights = spark.read.json(path)
18.
Register a Temporary View (SQL API)
Example
flights.createOrReplaceTempView("flightsView")
19.
Two API Examples: DataFrame and SQL APIs
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
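The slide shows the SQL text but not how it is submitted. A minimal sketch of one way to run it is `spark.sql(...)` against the registered temporary view. The block assumes Spark is on the classpath. The in-memory rows are hypothetical stand-ins for `examples/flights.json` (the third row is invented so that the filter has something to drop), and the object name `FlightsSqlSketch` is illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FlightsSqlSketch {
  // Registers a tiny stand-in for the flights data and runs the slide's SQL query.
  def delayed(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical rows, not the real flights.json contents.
    val flights = Seq(("IAD", "TPA", 19), ("IND", "BWI", 34), ("XXX", "YYY", 5))
      .toDF("Origin", "Dest", "DepDelay")
    flights.createOrReplaceTempView("flightsView")
    spark.sql(
      "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15 LIMIT 5")
  }

  def main(args: Array[String]): Unit = {
    // In Zeppelin or spark-shell a session named `spark` already exists;
    // a standalone run has to build one.
    val spark = SparkSession.builder()
      .appName("flights-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    delayed(spark).show()
    spark.stop()
  }
}
```

In a Zeppelin note the `main` wrapper is unnecessary: calling `spark.sql(...)` on the auto-initialized session returns a DataFrame, so the SQL and DataFrame APIs can be mixed freely.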
20.
Spark SQL (Structured Data) | Spark Streaming (Near Real-time) | Spark MLlib (Machine Learning) | GraphX (Graph Analysis)
21.
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
22.
Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
23.
Spark Streaming
Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
ZeroMQ, MQTT: no longer supported in Spark 2.x
24.
Spark Streaming
25.
Spark Streaming
Discretized Streams (DStreams)
• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs
26.
Spark Streaming
Example: flatMap operation
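The diagram for this slide did not survive extraction, so here is a hedged sketch of the idea in plain Scala. Ordinary collections stand in for a DStream: each inner `Seq` plays the role of one micro-batch (one RDD), and the sample lines are hypothetical. It does not use the Spark Streaming API; it only illustrates that a flatMap on the stream is applied batch by batch.

```scala
object FlatMapSketch {
  // Stand-in for a DStream of text lines: one inner Seq per micro-batch (hypothetical data).
  val batches: Seq[Seq[String]] = Seq(
    Seq("to be or", "not to be"),
    Seq("that is the question")
  )

  // Applying flatMap to the "stream" means applying it to every batch independently:
  // each line becomes zero or more words, flattened into one collection per batch.
  def splitWords(stream: Seq[Seq[String]]): Seq[Seq[String]] =
    stream.map(batch => batch.flatMap(line => line.split(" ").toSeq))

  def main(args: Array[String]): Unit =
    splitWords(batches).foreach(words => println(words.mkString(", ")))
}
```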
27.
Spark Streaming
Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
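The rolling average mentioned above can be sketched in plain Scala. This is not the Spark Streaming window API: a `Seq` stands in for the stream, and the window size, slide interval, and readings are hypothetical. It only shows what a window length and a slide interval mean.

```scala
object WindowSketch {
  // Rolling average over a sliding window that is `size` readings wide
  // and advances by `slide` readings each step.
  def rollingAverage(values: Seq[Double], size: Int, slide: Int): List[Double] =
    values.sliding(size, slide)
      .filter(_.size == size)      // drop a trailing partial window
      .map(w => w.sum / w.size)
      .toList

  def main(args: Array[String]): Unit = {
    val readings = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0) // hypothetical readings
    println(rollingAverage(readings, size = 3, slide = 1)) // List(2.0, 3.0, 4.0, 5.0)
  }
}
```

With a slide of 1, consecutive windows overlap by two readings. A slide equal to the window size would give non-overlapping (tumbling) windows instead.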
28.
Challenges in Streaming Data
• Consistency
• Fault tolerance
• Out-of-order data
29.
Structured Streaming: Basics
30.
Structured Streaming: Model
31.
Handling late arriving data
32.
Spark SQL (Structured Data) | Spark Streaming (Near Real-time) | Spark MLlib (Machine Learning) | GraphX (Graph Analysis)
33.
AI in Media & Pop Culture
34.
Machine Learning use cases
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
35.
Scatter 2D Data Visualized
scatterData <- DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...
36.
Linear Regression Model Training (one feature)
Training Result
Coefficients: 2.81
Intercept: 3.05
y = 2.81x + 3.05
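To make the training result concrete, the fitted line can be written as a plain Scala function. This is only a sketch of how the reported coefficient and intercept turn a feature value into a prediction. The object and method names are illustrative and the code does not use Spark MLlib.

```scala
object OneFeatureModel {
  // Values reported on the slide's training result.
  val coefficient = 2.81
  val intercept = 3.05

  // Prediction of the fitted line y = 2.81x + 3.05 for a single feature value x.
  def predict(x: Double): Double = coefficient * x + intercept

  // Residual = observed label minus predicted label.
  def residual(label: Double, x: Double): Double = label - predict(x)

  def main(args: Array[String]): Unit = {
    println(predict(0.0))          // 3.05 (the intercept)
    println(residual(-12.0, -4.9)) // first (label, feature) pair from the scatterData table
  }
}
```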
37.
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
38.
Spark API for building ML pipelines
Pipeline stages: Feature transform 1, Feature transform 2, Combine features, Linear Regression
Train: Input DataFrame -> Pipeline -> Pipeline Model
Predict: Input DataFrame -> Pipeline Model -> Output DataFrame
Export Model
39.
Spark SQL (Structured Data) | Spark Streaming (Near Real-time) | Spark MLlib (Machine Learning) | GraphX (Graph Analysis)
41.
• Page Rank
• Topic Modeling (LDA)
• Community Detection
Source: ampcamp.berkeley.edu
42.
Zeppelin & HDP
43.
What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.
44.
Simple line chart
45.
Horizontal plot of three line charts
46.
Streaming data into a line chart
47.
Plotting Iris data features in one plot
48.
Comparing Iris data distributions
49.
What is a Note/Notebook?
• A web-based GUI for small code snippets
• Write code snippets in the browser
• Zeppelin sends code to the backend for execution
• Zeppelin gets data back from the backend
• Zeppelin visualizes data
• Zeppelin Note = set of paragraphs (cells)
• Other features: sharing, collaboration, reports, import/export
50.
How does Zeppelin work?
Notebook Author, Collaborators/Report viewers → Zeppelin → Cluster (Spark | Hive | HBase, any of 30+ back ends)
51.
Big Data Lifecycle
Stages: Collect, ETL / Process, Analysis, Report, Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
All in Zeppelin!
52.
• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
Stack diagram: APIs (Scala, Java, Python, R); Spark Core Engine with Spark SQL, Spark Streaming, MLlib, GraphX; YARN; HDFS (nodes 1 … N)
53.
Access patterns enabled by YARN
Applications: Batch, Interactive, Real-Time
YARN: Data Operating System
HDFS (Hadoop Distributed File System), nodes 1 … N
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
54.
Why Apache Spark on YARN?
• Resource management
• Utilizes existing HDP cluster infrastructure
• Scheduling and queues
Diagram: Spark Driver (Client); Spark Application Master (YARN container); three Spark Executors, each in its own YARN container and running two Tasks
55.
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
Diagram: a logical file is split into blocks 1 to 4, and each block appears three times across the cluster
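The block-and-replica idea above can be sketched as a small simulation. This is a simplified illustration in plain Python, not HDFS code: the helper names, the node names, the 128 MiB block size and the uniform random placement are all assumptions made for the example (real HDFS placement also takes rack topology into account).

```python
import random

MIB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return {
        block: rng.sample(nodes, replication)
        for block in range(1, num_blocks + 1)
    }


# Assumptions: a 500 MiB file, 128 MiB blocks, six hypothetical nodes.
blocks = split_into_blocks(500 * MIB, 128 * MIB)
nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
placement = place_replicas(len(blocks), nodes)

for block, holders in placement.items():
    print("block", block, "->", holders)
```

Because every block lives on three different nodes, losing one node leaves two copies of each affected block, and a task can be scheduled on any node that already holds the block it needs.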
56.
Spark and HDP
57.
HDCloud
58.
Hortonworks Cloud Solutions
Cloud providers: Microsoft, AWS, Google
• Managed: Azure HDInsight
• Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
• Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)
59.
Hortonworks Cloud Solutions: Flexibility and Choice
Offerings: Hortonworks Data Cloud for AWS, Cloudbreak, HDP on Cloud IaaS
Spectrum: from "More Prescriptive, More Ephemeral" to "More Options, More Long Running"
60.
HDP 2.6 and New Cluster Types
• Spark 2.1
• Druid TP
• Interactive Hive
63.
Multitenancy with Zeppelin
64.
Livy
• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service
Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)
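Since Livy is described here as a REST interface, a client only needs to send JSON over HTTP. The sketch below builds the two requests for an interactive session (create a session, then submit a statement) using Python's standard library. The host name `livy-host` is hypothetical, port 8998 is assumed as the Livy default, and the helper names are invented for the example. The requests are constructed but not sent, and a Kerberos/SPNEGO-secured cluster would additionally need authentication that is not shown.

```python
import json
import urllib.request

# Assumption: hypothetical host; 8998 is assumed to be the Livy port.
LIVY_URL = "http://livy-host:8998"


def create_session_request(kind="pyspark"):
    """Build POST /sessions, which asks Livy for a new interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        LIVY_URL + "/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_statement_request(session_id, code):
    """Build POST /sessions/{id}/statements, which runs code in a session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_req = create_session_request()
statement_req = run_statement_request(0, "sc.parallelize(range(10)).sum()")

for req in (session_req, statement_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# To actually send a request against a reachable Livy server:
#   with urllib.request.urlopen(session_req) as resp:
#       print(json.load(resp))
```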
65.
Security Across Zeppelin-Livy-Spark
Diagram labels: Shiro; Ispark Group Interpreter; SPNego: Kerberos; Kerberos; Livy APIs; Spark on YARN; Zeppelin; Driver; LDAP; Livy Server
66.
Reasons to Integrate with Livy
• Bring sessions to Apache Zeppelin
  – Isolation
  – Session sharing
• Enable efficient cluster resource utilization
  – Default Spark interpreter keeps YARN/Spark job running forever
  – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
• Identity propagation
  – Send user identity from Zeppelin > Livy > Spark on YARN
67.
Livy Server: SparkSession Sharing
• Client 1 and Client 2 → Session-1 → SparkSession-1 (SparkContext)
• Client 3 → Session-2 → SparkSession-2 (SparkContext)
68.
Apache Zeppelin Security: Authentication + SSL
Diagram labels: Tommy Callahan; Zeppelin; Spark on YARN; LDAP; SSL; Firewall; steps 1, 2, 3
69.
Apache Zeppelin + Livy End-to-End Security
Diagram labels: Ispark Group Interpreter; SPNego: Kerberos; Kerberos/RPC; Livy APIs; Spark on YARN; Zeppelin; LDAP; Livy Server; Tommy Callahan
Job runs as Tommy Callahan
70.
Sample Architecture
71.
Modern Data Apps
• HDP 2.6 – Batch Processing
• HDF 2.1 – Streaming Apps
Diagram labels: DATA AT REST; DATA IN MOTION; ACTIONABLE INTELLIGENCE; Modern Data Applications
72.
Modern Data Applications
Custom or Off the Shelf
• Real-Time Cyber Security protects systems with superior threat detection
• Smart Manufacturing dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars drive themselves and improve road safety
• Future Farming optimizes soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines match products to preferences in milliseconds
Diagram labels: DATA AT REST (Hortonworks Data Platform); DATA IN MOTION (Hortonworks DataFlow); ACTIONABLE INTELLIGENCE; Modern Data Applications
73.
Managed Dataflow
Diagram labels: SOURCES; REGIONAL INFRASTRUCTURE; CORE INFRASTRUCTURE
75.
High-Level Overview
Diagram labels: IoT Devices (two groups); IoT Edge (single node, two instances); NiFi Hub; Data Broker; Column DB; Data Store; Live Dashboard; Data Center (on prem/cloud); HDFS/S3; HBase/Cassandra
76.
Labs / Tutorials
77.
Future Tutorials
• Deploying Models with Spark Structured Streaming
• Predicting Airline Delays with SparkR
• Sentiment Analysis with Apache Spark (Gradient Boosting)
• Auto Text Classification (Naïve Bayes)
78.
Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
79.
Community Engagement
Participate now at: community.hortonworks.com
• 12,000+ Registered Users
• 35,000+ Answers
• 55,000+ Technical Assets
One Website!
80.
Future of Data Meetups
www.futureofdata.io
81.
FB Sort
• Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.
• Largest real-world Spark job to date!
  – Databricks’ Petabyte sort was on synthetic data.
• Multiple reliability fixes.
“Spark could reliably shuffle and sort 90 TB+ intermediate data and run 250,000 tasks in a single job [...] and it has been running in production for several months.”
82.
• Spark SQL: Structured Data
• Spark Streaming: Near Real-time
• Spark MLlib: Machine Learning
• GraphX: Graph Analysis
Robert Hryniewicz @robertH8z
83.
What’s new in HDP 2.6 – Spark & Zeppelin
Spark
• Spark 1.6.3 GA
• Spark 2.1 GA
• REST API (Livy) GA
• Spark Thrift Server doAs GA
• SparkSQL – Row/Column Security (GA)
• Spark Streaming + Kafka over SSL
• Multi Cluster HBase support for SHC
• Package support in PySpark & SparkR
Zeppelin 0.7.x
• Spark 2.x support
• Improved Livy integration
• No password in clear
• JDBC interpreter improvements
• Smart Sense integration
• Knox proxy Zeppelin UI
84.
Robert Hryniewicz @RobertH8z Thanks!