SPARK MEETS TELEMETRY 
Mozlandia 2014 
Roberto Agostino Vitillo
TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs
• The backend validates pings and stores them on S3
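As a rough illustration of the bullets above, a ping can be pictured as a JSON document that the backend parses and checks before storing it. Only the `info` / `OS` path appears in this deck; the `reason` field and the `validate_ping` helper are hypothetical, and the real validation rules are not described here.

```python
import json

# Hypothetical ping payload: only info.OS is taken from the deck,
# the "reason" field is made up for illustration.
raw_ping = '{"info": {"OS": "Linux", "reason": "saved-session"}}'


def validate_ping(blob):
    """Parse a JSON blob; return the dict if it has info.OS, else None."""
    try:
        ping = json.loads(blob)
    except ValueError:
        # Malformed JSON is rejected.
        return None
    info = ping.get("info") if isinstance(ping, dict) else None
    if not isinstance(info, dict) or "OS" not in info:
        return None
    return ping


ping = validate_ping(raw_ping)       # accepted
rejected = validate_ping("not json")  # rejected, returns None
```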
TELEMETRY MAP-REDUCE 
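import json

def map(k, d, v, cx):
    # v is the raw ping, a JSON string
    j = json.loads(v)
    os = j['info']['OS']
    # emit one count per ping, keyed by operating system
    cx.write(os, 1)

def reduce(k, v, cx):
    # v holds every count emitted for key k
    cx.write(k, sum(v))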
• Processes pings from S3 using a map reduce framework written in Python
• https://github.com/mozilla/telemetry-server
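To see how such a job produces per-OS counts, the sketch below runs the same logic locally over a few in-memory pings. It is not the telemetry-server API: the `Context` class and `run_local` driver are hypothetical stand-ins, the functions are renamed `map_os` and `reduce_sum` to avoid shadowing Python's built-in `map`, and since the deck does not say what `k` and `d` carry, the driver passes placeholders.

```python
import json
from collections import defaultdict


class Context:
    """Hypothetical stand-in for the framework context: collects written pairs."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


def map_os(k, d, v, cx):
    # Same logic as the deck's map(): emit (OS, 1) for each ping.
    cx.write(json.loads(v)["info"]["OS"], 1)


def reduce_sum(k, v, cx):
    # Same logic as the deck's reduce(): sum the counts for one key.
    cx.write(k, sum(v))


def run_local(pings):
    """Run map, group by key, then reduce, all in one process."""
    map_cx = Context()
    for i, blob in enumerate(pings):
        map_os(i, None, blob, map_cx)  # k and d are placeholders
    groups = defaultdict(list)
    for key, value in map_cx.pairs:
        groups[key].append(value)
    reduce_cx = Context()
    for key in sorted(groups):
        reduce_sum(key, groups[key], reduce_cx)
    return dict(reduce_cx.pairs)


pings = [
    '{"info": {"OS": "Linux"}}',
    '{"info": {"OS": "Darwin"}}',
    '{"info": {"OS": "Linux"}}',
]
counts = run_local(pings)
```

Here `counts` maps each operating system to the number of pings that reported it.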
SHORTCOMINGS 
• Not distributed, limited to a single machine 
• Doesn’t support chains of map/reduce ops 
• Doesn’t support SQL-like queries 
• Batch oriented
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK? 
• In-memory cluster computing framework for data analytics (the Spark project claims up to 100x faster than Hadoop MapReduce for in-memory workloads)
• Comes with over 80 distributed operations for grouping, filtering etc.
• Runs standalone or on Hadoop and Mesos, and on TaskCluster in the future (right Jonas?)
WHY DO WE CARE? 
• In-memory caching
• Interactive command line interface for exploratory data analysis (EDA); think of the R command line
• Comes with higher level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
WHY DO WE REALLY CARE? 
The easier we make it to get answers, 
the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK? 
• User creates Resilient Distributed Datasets (RDDs), transforms and executes them
• RDD operations are compiled to a DAG of operators
• DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
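The "transform, then execute" idea in the first bullet can be sketched with a toy class that only records operations until a result is requested. This is plain Python, not Spark: `LazyDataset` is a hypothetical name, and it models a linear chain of operations rather than a general DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records operations, runs them only on collect()."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Transformations return a new dataset with a longer plan.
        return LazyDataset(self.source, self.ops + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.ops + (("filter", fn),))

    def collect(self):
        # The action walks the recorded plan and finally touches the data.
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def double(x):
    calls.append(x)  # record every real invocation
    return x * 2


ds = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_collect = len(calls)  # nothing has run yet
result = ds.collect()
```

Building `ds` does no work; `double` is only called once `collect()` runs the recorded plan.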
RDD
A parallel dataset with partitions
[Figure: a table with columns Var A, Var B and Var C; its observation rows are grouped into partitions]
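A minimal sketch of the partitioning idea, assuming rows are simply split into contiguous chunks. The `partition` helper and the column names are hypothetical; Spark chooses partitions based on the data source and the partitioner in use.

```python
def partition(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


# Four observations with three variables each (made-up values).
observations = [
    {"var_a": 1, "var_b": "x", "var_c": 0.5},
    {"var_a": 2, "var_b": "y", "var_c": 1.5},
    {"var_a": 3, "var_b": "x", "var_c": 2.5},
    {"var_a": 4, "var_b": "z", "var_c": 3.5},
]
parts = partition(observations, 2)
```

Each element of `parts` can then be processed independently of the others.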
DAG
Logical graph of RDD operations
sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
[Figure: partitions P1 to P4 flow through read → map → map → reduceByKey, producing RDD[String], RDD[Array[String]], RDD[(String, Int)] and RDD[(String, Int)]]
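The same pipeline can be imitated in plain Python to show where the data is regrouped. This is a simplified model, not Spark code: the function names are hypothetical, the input lines are made up, and `zlib.crc32` stands in for whatever partitioner Spark would use to spread keys over the 3 output partitions.

```python
import zlib
from collections import defaultdict


def stage1(partition):
    """Both maps run back to back on one input partition."""
    fields = [line.split(",") for line in partition]
    return [(f[0], int(f[1])) for f in fields]


def shuffle(mapped_partitions, num_out):
    """Route every pair to an output partition chosen from its key."""
    out = [[] for _ in range(num_out)]
    for part in mapped_partitions:
        for key, value in part:
            index = zlib.crc32(key.encode("utf-8")) % num_out
            out[index].append((key, value))
    return out


def stage2(partition):
    """reduceByKey(_ + _): sum the values of each key in one partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


input_partitions = [["a,1", "b,2"], ["a,3", "c,4"]]
mapped = [stage1(p) for p in input_partitions]
reduced = [stage2(p) for p in shuffle(mapped, 3)]
totals = {k: v for part in reduced for k, v in part.items()}
```

Because all pairs with the same key land in the same output partition, each key is summed exactly once.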
STAGE
Set of tasks that can run in parallel
[Figure: the graph above split into two stages. Stage 1 reads the input and applies both maps, with one task (T1 to T4) per partition (P1 to P4), and writes shuffle output; Stage 2 runs reduceByKey on the shuffled data.]
• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution
  1. Fetch input
  2. Execute
  3. Output result
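The three steps of task execution can be sketched with a thread pool running one task per partition. This is a local analogy only: `run_task` and the sample partitions are hypothetical, and the serialisation and shipping of tasks to remote workers is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, partition):
    rows = list(partition)   # 1. fetch input
    result = sum(rows)       # 2. execute
    return task_id, result   # 3. output result


# Four partitions, so the stage consists of four tasks.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_task, i + 1, p) for i, p in enumerate(partitions)]
    results = dict(f.result() for f in futures)
```

`results` maps each task number to the output of its partition; the tasks do not depend on each other, which is what lets a stage run them in parallel.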
HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it has started up).
4. Click “Submit”.
5. An Ubuntu machine will be started on Amazon’s EC2 infrastructure.
HANDS-ON 
• Connect to the machine through SSH
• Clone the starter template: 
1. git clone https://github.com/vitillo/mozilla-telemetry-spark.git 
2. cd mozilla-telemetry-spark && source aws/setup.sh 
3. sbt console 
• Open http://bit.ly/1wBHHDH
TUTORIAL