Shortening the
Feedback Loop
How Spotify’s Big Data Ecosystem Has
Evolved to Produce Real-time Insights
Josh Baer (jbx@spotify.com)
Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
Who am I?
• Technical Product Owner at Spotify
• Working with fast processing infrastructure
• Previously built out Spotify’s 2,500-node
Hadoop cluster
@l_phant
In 2008
• Spotify launches
• Instant access to a gigantic
catalog of music
• Click to play: instantaneous!
Behind the Scenes:
Days to Insights
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Operational
Monitoring
To leverage actionable
insights, we need a
faster feedback loop!
What is Spotify?
• Music Streaming Service
• Launched in 2008
• Premium and Free Tiers
• Available in 59 Countries
100+ Million Active
Users
30+ Million Songs
1+ Billion Plays/Day
And we have Data
Hadoop at Spotify
• 2,500 Nodes
• 130 PB Capacity
• 120 TB Memory accessible by jobs
• 20K Jobs/Day
Apache Kafka at Spotify
• 500 Kafka-related machines
• 40TB/day from logs
“Real-Time” at Spotify
• Storm Topologies fed via Kafka
• Powering
✦ Ad Targeting
✦ Real-time recommendations
✦ Real-time stream counts
Migrating to
the Cloud
In the Beginning…
• Spotify was almost completely on-premise/bare
metal
• 2500 node Hadoop cluster, over 10K machines in
production at four globally distributed data centers
• Grew with users: from 1M in 2009 to over 100M in 2016
Why Move to the Cloud?
• Cloud providers have matured: costs have decreased
while reliability and the variety of services
offered have increased
• Owning and operating physical machines is not a
competitive advantage for Spotify
Why Google’s Cloud?
• We believe Google’s industry-leading background
in Big Data technologies will give us a data
processing advantage
Google
Cloud
“Primitives”
BigQuery
• Ad-hoc and interactive querying service for massive datasets
• Like Hive, but without needing to manage Hadoop and servers
• Leverages Google’s internal tech
• Dremel (query execution engine)
• Colossus (distributed storage)
• Borg (distributed compute)
• Juniper (network)
Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
BigQuery vs. Hive
• Example Query: Find the top 10 songs by
popularity in Spain during October
• BigQuery (1.50 TB processed): 108s
• Hive (15.5 TB processed): 2647s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
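To make the example query concrete, here is a minimal Python sketch of the same aggregation (count plays per track for one country and month, then keep the top N). The event layout and the `top_tracks` helper are assumptions for illustration only; the slides do not show the actual BigQuery or Hive query or Spotify's schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical play events: (ISO timestamp, country code, track name).
PLAYS = [
    ("2016-10-01T12:00:00", "ES", "Safari"),
    ("2016-10-02T09:30:00", "ES", "Safari"),
    ("2016-10-02T10:00:00", "ES", "Closer"),
    ("2016-10-03T18:45:00", "SE", "Closer"),
    ("2016-11-01T00:05:00", "ES", "Closer"),
]


def top_tracks(plays, country, year, month, n=10):
    """Return the n most-played tracks for one country and calendar month."""
    counts = Counter()
    for timestamp, play_country, track in plays:
        played_at = datetime.fromisoformat(timestamp)
        in_month = (played_at.year, played_at.month) == (year, month)
        if play_country == country and in_month:
            counts[track] += 1
    # most_common sorts by descending play count.
    return counts.most_common(n)


print(top_tracks(PLAYS, "ES", 2016, 10))
```

A query engine does the same filter, group and sort, only distributed over terabytes of input rather than an in-memory list.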
BigQuery vs. Hive (example #2)
• Example Query: Find the total hours of music
listening in Spain during October
• BigQuery (780 GB processed): 33s
• Hive (15.5 TB processed): 969s
Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on
a ~2500 node Yarn cluster. This is not considered to be a thorough benchmark
Top 10 Tracks in Spain during October 2016
Rank  Artist(s)                                  Track Name
1     J Balvin                                   Safari
2     DJ Snake                                   Let Me Love You
3     Ricky Martin                               Vente Pa' Ca
4     Sebastian Yatra                            Traicionera
5     Zion & Lennox (feat. J Balvin)             Otra Vez
6     Carlos Vives, Shakira                      La Bicicleta
7     The Chainsmokers                           Closer
8     Major Lazer (feat. Justin Bieber & MØ)     Cold Water
9     Sia                                        The Greatest
10    IAmChino (feat. Pitbull, Yandel & Chacal)  Ay Mi Dios
Time Spent Listening to
Spotify by users in Spain
during October
Nearly 10,000 Years!
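For a sense of scale, the sketch below sums per-play durations into hours and converts hours into years. The millisecond duration field and both helper names are assumptions; the slides give only the headline figure. By this arithmetic, 10,000 years corresponds to 87.6 million hours, so "nearly 10,000 years" is somewhat below that.

```python
MS_PER_HOUR = 60 * 60 * 1000
HOURS_PER_YEAR = 24 * 365  # ignores leap years; fine for a rough estimate


def total_hours(ms_played_values):
    """Sum per-play durations (milliseconds) and convert to hours."""
    return sum(ms_played_values) / MS_PER_HOUR


def hours_to_years(hours):
    """Convert a number of hours into years of continuous listening."""
    return hours / HOURS_PER_YEAR


# Three hypothetical plays of 3.5 minutes each.
print(total_hours([210_000, 210_000, 210_000]))
# Hours of listening that would amount to exactly 10,000 years.
print(10_000 * HOURS_PER_YEAR)
```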
BigQuery at Spotify
• Interactive and ad-hoc querying immediately
started to transfer to BigQuery once the data was
available in the cloud
• The pace of learning increases as the friction of asking
questions decreases
Cloud Pub/Sub
• At-least-once, globally distributed message queue
• For high-volume publish/subscribe behavior with a low
number of topics (<10,000)
• Like Kafka, but without needing to operate servers
and supporting services (ZooKeeper)
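At-least-once delivery means a subscriber can receive the same message more than once, so consumers generally need to be idempotent. A minimal sketch of one common approach, deduplicating on a message ID, follows. `DedupingConsumer` is a hypothetical class, not part of the Cloud Pub/Sub client library, and its in-memory set would have to be bounded or persisted in a real system.

```python
class DedupingConsumer:
    """Wraps a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # unbounded here; a real system would expire entries

    def receive(self, message_id, payload):
        """Process a message; return True if handled, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen_ids.add(message_id)
        return True


handled = []
consumer = DedupingConsumer(handled.append)
consumer.receive("m1", "play")
consumer.receive("m1", "play")  # redelivery of the same message is ignored
consumer.receive("m2", "skip")
print(handled)
```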
Cloud Pub/Sub at Spotify
• 800K events/second? No problem
• P99 latency of ingestion into Elasticsearch (ES): 500ms
• Ingestion from globally distributed non-GCP
datacenters is painless
Cloud Dataflow
• Managed service for running batch and streaming jobs
• Unified API for batch and streaming mode
• Inspired by internal Google tools like FlumeJava and
MillWheel
• Programming model open-sourced as Apache Beam
(currently incubating)
Cloud Dataflow (Batch) at Spotify
• Usually run via Scio: https://github.com/spotify/scio
• Scio provides a Scala API for running Dataflow jobs
and provides easy integrations with BigQuery
• New batch processing jobs at Spotify are being
written in Scio/Dataflow
Cloud Dataflow (Streaming) at Spotify
• Exactly-once stream processing framework
• A replacement for Spark/Flink streaming and
Storm workloads at Spotify
• Optimizes for consistency, which can complicate
real-time workloads
Spotify + Google Cloud Timeline
2015 to 2016
• Beginning of Google Cloud evaluation
• BigQuery begins to replace Hive
• Cloud Pub/Sub begins to replace Kafka
• Dataflow (streaming) begins to replace Storm
• Spotify + Google Cloud Announcement
• Dataflow (batch) replacing Map/Reduce
Note: Dates are approximations
Putting It All
Together
The Problem
• We want to detect within minutes if we’ve
introduced a bug in a client release that affects
critical event logging behavior
Before…
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
HOURS TO
INSIGHTS
Introducing “Pulsar”
• An internal name for the system aggregating data
from Access Points and feeding it into Cloud
Pub/Sub
• Replaces the Kafka real-time event feed
Pulsar
Pub/Sub
• Aggregates global event feed from Pulsar
• Makes data available to multiple zones in
milliseconds
Dataflow
• Subscribes to critical event Pub/Sub topics
• Aggregates events into minute windows
• Always running, no need to schedule or wait for
results
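The minute-window aggregation can be sketched in plain Python as a fixed (tumbling) window count. This is only an illustration over an in-memory list with assumed `(epoch_seconds, event_type)` tuples; it is not the Dataflow or Scio API, and it ignores concerns a streaming runner handles, such as late or out-of-order data.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


def count_per_minute(events):
    """Count events per (window_start, event_type) using fixed one-minute windows.

    events: iterable of (epoch_seconds, event_type) tuples.
    """
    counts = defaultdict(int)
    for epoch_seconds, event_type in events:
        # Round the timestamp down to the start of its one-minute window.
        window_start = int(epoch_seconds) - int(epoch_seconds) % WINDOW_SECONDS
        counts[(window_start, event_type)] += 1
    return dict(counts)


# Hypothetical events: two plays and a skip in the first minute, one play later.
EVENTS = [(120, "play"), (150, "play"), (130, "skip"), (185, "play")]
print(count_per_minute(EVENTS))
```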
BigQuery
• Receives aggregates from Dataflow
• Allows for ad-hoc inspection or slicing on different
dimensions
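As an illustration of slicing the aggregates on a dimension, the sketch below totals per-minute counts by client version and flags versions whose volume fell sharply against a baseline. The row layout, the `client_version` dimension and the 50% threshold are all assumptions; the slides do not describe the actual schema or alerting rule.

```python
from collections import defaultdict


def slice_by(rows, dimension):
    """Sum the 'count' column of aggregate rows, grouped by one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row["count"]
    return dict(totals)


def find_drops(baseline, current, ratio=0.5):
    """Return keys whose current total is below ratio * baseline total."""
    return sorted(
        key for key, expected in baseline.items()
        if current.get(key, 0) < ratio * expected
    )


# Hypothetical per-minute aggregates for two client versions.
ROWS = [
    {"minute": 0, "client_version": "1.0", "count": 100},
    {"minute": 0, "client_version": "1.1", "count": 90},
    {"minute": 1, "client_version": "1.0", "count": 95},
    {"minute": 1, "client_version": "1.1", "count": 10},
]

baseline = slice_by([r for r in ROWS if r["minute"] == 0], "client_version")
current = slice_by([r for r in ROWS if r["minute"] == 1], "client_version")
print(find_drops(baseline, current))
```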
Tableau
• Data Visualization Tool that integrates nicely with
BigQuery
• Pulls data from BigQuery periodically and caches for
quick inspection
Putting it all together
Milliseconds
to transfer
Milliseconds
to process
Seconds to
Query
SECONDS TO
INSIGHTS
Faster Insights
to Client
Behavior
Problem
As a developer, I want to be able to instantly explore
data being logged by the clients.
Solution
• Produce a topic for all employee client events
• Store in Elasticsearch
• Visualize in Kibana
Benefits
• Able to understand what’s being sent by the client
as it happens
• Exploring events, visualizing distribution (e.g. does
this field actually get populated?)
• Prototyping analysis based on a sample
• Dashboards for Employee Releases
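The "does this field actually get populated" question comes down to a fill rate over a sample of events. A minimal sketch, assuming events arrive as dictionaries; the field names are hypothetical, and in the setup described above this kind of check would be done through Kibana rather than custom code.

```python
def fill_rate(events, field):
    """Fraction of events in which `field` is present and not None or empty."""
    events = list(events)
    if not events:
        return 0.0
    populated = sum(1 for event in events if event.get(field) not in (None, ""))
    return populated / len(events)


# Hypothetical sample of client events.
SAMPLE = [
    {"track_id": "a", "device": "ios"},
    {"track_id": None, "device": "android"},
    {"device": "desktop"},
    {"track_id": "b", "device": ""},
]

print(fill_rate(SAMPLE, "track_id"))
print(fill_rate(SAMPLE, "device"))
```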
Faster Insights
on New
Features
Problem
The previous dashboard is great for prototyping, but
what if you want all the data?
Solution
Allow developers to funnel feature-specific data to
their own Elasticsearch cluster
Dataflow to the Rescue!
• We created a library that allows teams to build
maps/filters with simple Java code
• Code gets translated into a Dataflow job
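The idea of declaring maps and filters and having a library turn them into a job can be sketched as a small chainable builder. The `Pipeline` class below is purely hypothetical: the library described above is Java and targets Dataflow, and the slides do not show its API. This sketch only runs the steps locally over an iterable.

```python
class Pipeline:
    """Immutable chain of map/filter steps applied to each event in order."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)

    def map(self, fn):
        """Return a new pipeline that also transforms each event with fn."""
        return Pipeline(self._steps + (("map", fn),))

    def filter(self, predicate):
        """Return a new pipeline that also drops events failing predicate."""
        return Pipeline(self._steps + (("filter", predicate),))

    def run(self, events):
        """Lazily yield the events that pass every step, in input order."""
        for event in events:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    event = fn(event)
                elif not fn(event):
                    keep = False
                    break
            if keep:
                yield event


# Hypothetical feature-specific funnel: keep search events, project the user.
search_events = (
    Pipeline()
    .filter(lambda event: event["feature"] == "search")
    .map(lambda event: {"user": event["user"]})
)

INPUT = [
    {"feature": "search", "user": "u1", "query": "abba"},
    {"feature": "radio", "user": "u2"},
]
print(list(search_events.run(INPUT)))
```

Keeping the builder immutable means a team can share a base pipeline and branch feature-specific variants from it without affecting each other.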
Abstract Away the Complexity
No Ops!
• For our users:
• Event-feed managed through Cloud Pub/Sub
• Dataflow managed by Google
• Shared Elasticsearch cluster (managed by an
infra team)
Low Ops :/
• Dataflow is improving, but it’s had some stability
issues with streaming jobs
• Teams may need to set up their own Elasticsearch
cluster if they require a higher SLA than default
Other Uses
Ad Targeting
• Real-time genre targeting
• Session insights — explicit filter
Real-time Recommendations
Live Results for X-Factor
• X-Factor: television music
competition
• Contest songs get loaded onto
Spotify immediately after show
airs
• Listener behavior determines the
order of contestants on the playlist
Review
Behind the Scenes
Minutes to
transfer
Hours to Clean
and Bucket
Hours to Run
Jobs or Ad Hoc
Queries
To leverage actionable
insights, we need a
faster feedback loop!
Real-time
Processing
Batch Processing
(Hadoop, Hive, BigQuery)
“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
Operational
Monitoring
Cloud to the Rescue!
• Spotify has leveled up our ability to gain actionable
insights by leveraging Google Cloud tools, such as
Pub/Sub, Dataflow and BigQuery
The Value of a Fast Feedback Loop
• Detecting problems early in data avoids long backfills or
long term data loss
• Instant insights on newly developed features allow
teams to iterate quicker and take risks
• Providing a quicker ad-hoc querying engine allows teams
to ask more questions and learn faster
Use Anything and Everything
• Open source and other cloud providers offer many
alternatives to the stack we’ve used
• Open source tools, like Elasticsearch/Kibana, and
proprietary solutions, like Tableau, have also been
useful additions
Where Are We Going?
• The real-time mission is in the early stages at
Spotify
Stream Processing First
• The sun never sets on Spotify, why impose
boundaries on our datasets?
• What’s the shortest distance between two lines?
Zero!
• Can we reduce the feedback cycle to zero?
We’re Hiring!
Engineers, Managers, Product Owners
needed in NYC and Stockholm
https://www.spotify.com/jobs
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...
Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...
Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...Big Data Spain
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovBig Data Spain
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Altan Khendup
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark DataWorks Summit/Hadoop Summit
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahDatabricks
 
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation CarrierDisrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation CarrierDataWorks Summit/Hadoop Summit
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics Franco Ucci
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...Big Data Spain
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Big Data Spain
 
Real-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionReal-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionDataWorks Summit
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastDatabricks
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017Rittman Analytics
 

Was ist angesagt? (20)

Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...
Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...
Solving the Industry 4.0 challenges on the logistics domain using Apache Meso...
 
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey KharlamovRUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Zero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using HadoopZero Downtime App Deployment using Hadoop
Zero Downtime App Deployment using Hadoop
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq AbdullahLeveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
Leveraging Spark to Democratize Data for Omni-Commerce with Shafaq Abdullah
 
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation CarrierDisrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Real-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionReal-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in Action
 
Cloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and FastCloud Experience: Data-driven Applications Made Simple and Fast
Cloud Experience: Data-driven Applications Made Simple and Fast
 
Intuit Analytics Cloud 101
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 

Andere mochten auch

My ANTI-Resume Manifesto
My ANTI-Resume ManifestoMy ANTI-Resume Manifesto
My ANTI-Resume ManifestoDavid Crandall
 
Development Applications 2008 05 26
Development Applications 2008 05 26Development Applications 2008 05 26
Development Applications 2008 05 26jgabateman
 
PostgreSQL - Lección 8 - Manipulando Datos y Transacciones
PostgreSQL - Lección 8 - Manipulando Datos y TransaccionesPostgreSQL - Lección 8 - Manipulando Datos y Transacciones
PostgreSQL - Lección 8 - Manipulando Datos y TransaccionesNicola Strappazzon C.
 
0 to 2,500 Customers with No Cold Calls
0 to 2,500 Customers with No Cold Calls0 to 2,500 Customers with No Cold Calls
0 to 2,500 Customers with No Cold CallsHubSpot
 
Electrical Engineering Basics - What Design Engineers Need to Know
Electrical Engineering Basics - What Design Engineers Need to KnowElectrical Engineering Basics - What Design Engineers Need to Know
Electrical Engineering Basics - What Design Engineers Need to Knowmilestoneseng
 
MasterPlus - Sistema Binário
MasterPlus - Sistema BinárioMasterPlus - Sistema Binário
MasterPlus - Sistema BinárioMasterplusBrasil
 
Strip your charts
Strip your chartsStrip your charts
Strip your chartsuwseidl
 
Resultado Final do Concurso de Bom Sucesso
Resultado Final do Concurso de Bom Sucesso Resultado Final do Concurso de Bom Sucesso
Resultado Final do Concurso de Bom Sucesso Joao Rivonaldo Silva
 
Exames médicos valores - União Sindical
Exames médicos   valores - União SindicalExames médicos   valores - União Sindical
Exames médicos valores - União Sindicalsinteimp
 
2500 years of learning theory: The good, the bad & the ugly - Donald Clark
2500 years of learning theory: The good, the bad & the ugly - Donald Clark2500 years of learning theory: The good, the bad & the ugly - Donald Clark
2500 years of learning theory: The good, the bad & the ugly - Donald ClarkLearning Pool Ltd
 
Sarah Palin\'s Shopping Spree
Sarah Palin\'s Shopping SpreeSarah Palin\'s Shopping Spree
Sarah Palin\'s Shopping Spreecoolstuff
 
Labor Market and Salary Survey in Russia
Labor Market and Salary Survey in RussiaLabor Market and Salary Survey in Russia
Labor Market and Salary Survey in RussiaAwara Direct Search
 
The Recipe For Creating a Successful Startup Ecosystem
The Recipe For Creating a Successful Startup EcosystemThe Recipe For Creating a Successful Startup Ecosystem
The Recipe For Creating a Successful Startup EcosystemTzahi (Zack) Weisfeld
 
Lineadeltiempodelacomputacion Iiuac
Lineadeltiempodelacomputacion IiuacLineadeltiempodelacomputacion Iiuac
Lineadeltiempodelacomputacion IiuacOscorp
 

Andere mochten auch (20)

2015 RAM 2500 3500 Details. El Paso - Albuquerque Dealers Jack Key New Mexico...
2015 RAM 2500 3500 Details. El Paso - Albuquerque Dealers Jack Key New Mexico...2015 RAM 2500 3500 Details. El Paso - Albuquerque Dealers Jack Key New Mexico...
2015 RAM 2500 3500 Details. El Paso - Albuquerque Dealers Jack Key New Mexico...
 
Flexible budget
Flexible budgetFlexible budget
Flexible budget
 
My ANTI-Resume Manifesto
My ANTI-Resume ManifestoMy ANTI-Resume Manifesto
My ANTI-Resume Manifesto
 
Development Applications 2008 05 26
Development Applications 2008 05 26Development Applications 2008 05 26
Development Applications 2008 05 26
 
PostgreSQL - Lección 8 - Manipulando Datos y Transacciones
PostgreSQL - Lección 8 - Manipulando Datos y TransaccionesPostgreSQL - Lección 8 - Manipulando Datos y Transacciones
PostgreSQL - Lección 8 - Manipulando Datos y Transacciones
 
Acoples rapidos
Acoples rapidosAcoples rapidos
Acoples rapidos
 
0 to 2,500 Customers with No Cold Calls
0 to 2,500 Customers with No Cold Calls0 to 2,500 Customers with No Cold Calls
0 to 2,500 Customers with No Cold Calls
 
Electrical Engineering Basics - What Design Engineers Need to Know
Electrical Engineering Basics - What Design Engineers Need to KnowElectrical Engineering Basics - What Design Engineers Need to Know
Electrical Engineering Basics - What Design Engineers Need to Know
 
MasterPlus - Sistema Binário
MasterPlus - Sistema BinárioMasterPlus - Sistema Binário
MasterPlus - Sistema Binário
 
Strip your charts
Strip your chartsStrip your charts
Strip your charts
 
Apresentacao
ApresentacaoApresentacao
Apresentacao
 
Resultado Final do Concurso de Bom Sucesso
Resultado Final do Concurso de Bom Sucesso Resultado Final do Concurso de Bom Sucesso
Resultado Final do Concurso de Bom Sucesso
 
Exames médicos valores - União Sindical
Exames médicos   valores - União SindicalExames médicos   valores - União Sindical
Exames médicos valores - União Sindical
 
Option Strategies
Option StrategiesOption Strategies
Option Strategies
 
2500 years of learning theory: The good, the bad & the ugly - Donald Clark
2500 years of learning theory: The good, the bad & the ugly - Donald Clark2500 years of learning theory: The good, the bad & the ugly - Donald Clark
2500 years of learning theory: The good, the bad & the ugly - Donald Clark
 
Sarah Palin\'s Shopping Spree
Sarah Palin\'s Shopping SpreeSarah Palin\'s Shopping Spree
Sarah Palin\'s Shopping Spree
 
Labor Market and Salary Survey in Russia
Labor Market and Salary Survey in RussiaLabor Market and Salary Survey in Russia
Labor Market and Salary Survey in Russia
 
The Recipe For Creating a Successful Startup Ecosystem
The Recipe For Creating a Successful Startup EcosystemThe Recipe For Creating a Successful Startup Ecosystem
The Recipe For Creating a Successful Startup Ecosystem
 
Catálogo de delícias
Catálogo de delíciasCatálogo de delícias
Catálogo de delícias
 
Lineadeltiempodelacomputacion Iiuac
Lineadeltiempodelacomputacion IiuacLineadeltiempodelacomputacion Iiuac
Lineadeltiempodelacomputacion Iiuac
 

Ähnlich wie Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights by Josh Baer

Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loopJosh Baer
 
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCSpotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCJosh Baer
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Puppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsPuppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsNETWAYS
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
State of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCState of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCPuppet
 
Real time monitoring of hadoop and spark workflows
Real time monitoring of hadoop and spark workflowsReal time monitoring of hadoop and spark workflows
Real time monitoring of hadoop and spark workflowsShankar Manian
 
HadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop OverviewHadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop OverviewYafang Chang
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit
 
Apache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachApache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachCalculated Systems
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
CCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysisCCI2018 - Real-time dashboard whatif analysis
CCI2018 - Real-time dashboard whatif analysiswalk2talk srl
 
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018
Serverless and AI: Orit Nissan-Messing, Iguazio, Serverless NYC 2018iguazio
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesQubole
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
The Evolution of Big Data at Spotify
The Evolution of Big Data at SpotifyThe Evolution of Big Data at Spotify
The Evolution of Big Data at SpotifyJosh Baer
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 

Ähnlich wie Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights by Josh Baer (20)

Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loop
 
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYCSpotify in the Cloud - An evolution of data infrastructure - Strata NYC
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Puppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsPuppet Keynote by Ralph Luchs
Puppet Keynote by Ralph Luchs
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
State of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCState of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DC
 
Real time monitoring of hadoop and spark workflows
Real time monitoring of hadoop and spark workflowsReal time monitoring of hadoop and spark workflows
Real time monitoring of hadoop and spark workflows
 
HadoopCon- Trend Micro SPN Hadoop Overview
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to produce real-time insights by Josh Baer

  • 2. Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem Has Evolved to Produce Real-time Insights. Josh Baer (jbx@spotify.com). Note: opinions expressed in these slides are the author’s and not necessarily those of Spotify
  • 3. Who am I? • Technical Product Owner at Spotify • Working with fast processing infrastructure • Previously, building out Spotify’s 2500 node Hadoop cluster @l_phant
  • 4. In 2008 • Spotify Launches • Instant Access to a gigantic catalog of music • Click to play instantaneous!
  • 7. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries
  • 8. “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009
  • 9. Real-time Processing • Batch Processing (Hadoop, Hive, BigQuery) • Operational Monitoring (“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009)
  • 10. To leverage actionable insights, we need a faster feedback loop!
  • 11. What is Spotify? • Music Streaming Service • Launched in 2008 • Premium and Free Tiers • Available in 59 Countries
  • 15. And we have Data
  • 16. Hadoop at Spotify • 2,500 Nodes • 130 PB Capacity • 120 TB Memory accessible by jobs • 20K Jobs/Day
  • 17. Apache Kafka at Spotify • 500 Kafka-related machines • 40 TB/day from logs
  • 18. “Real-Time” at Spotify • Storm Topologies fed via Kafka • Powering ✦ Ad Targeting ✦ Real-time recommendations ✦ Real-time stream counts
  • 20. In the Beginning… • Spotify was almost completely on-premise/bare metal • 2,500-node Hadoop cluster, over 10K machines in production at four globally distributed data centers • Grew with users: from 1M in 2009 to over 100M in 2016
  • 21. Why Move to the Cloud? • Cloud providers have matured, decreasing in cost and increasing in reliability and variety of services offered • Owning and operating physical machines is not a competitive advantage for Spotify
  • 22. Why Google’s Cloud? • We believe Google’s industry-leading background in Big Data technologies will give us a data processing advantage
  • 24. BigQuery • Ad-hoc and interactive querying service for massive datasets • Like Hive, but without needing to manage Hadoop and servers • Leverages Google’s internal tech: Dremel (query execution engine), Colossus (distributed storage), Borg (distributed compute), Jupiter (network). Source: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood
  • 25. BigQuery vs. Hive • Example Query: Find the top 10 songs by popularity in Spain during October • BigQuery (1.50 TB processed): 108s • Hive (15.5 TB processed): 2647s. Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
  • 26. BigQuery vs. Hive (example #2) • Example Query: Find the total hours of music listening in Spain during October • BigQuery (780 GB processed): 33s • Hive (15.5 TB processed): 969s. Note: Hive performance unoptimized. Version used (0.14), input format (Avro), run on a ~2,500-node YARN cluster. This is not considered to be a thorough benchmark
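
The two comparisons above can be reduced to ratios with a few lines of arithmetic. The sketch below only restates the figures quoted on the slides (780 GB is written as 0.78 TB); the dictionary keys and the `ratios` helper are illustrative names, and the slides' own caveat still applies: this is not a thorough benchmark.

```python
# Figures copied from the two comparison slides; nothing here is newly measured.
runs = {
    "top 10 songs": {"bq_s": 108, "hive_s": 2647, "bq_tb": 1.50, "hive_tb": 15.5},
    "total listening hours": {"bq_s": 33, "hive_s": 969, "bq_tb": 0.78, "hive_tb": 15.5},
}


def ratios(run):
    """Return (wall-clock speedup, reduction in data scanned) for one query."""
    return run["hive_s"] / run["bq_s"], run["hive_tb"] / run["bq_tb"]


for name, run in runs.items():
    speedup, scan_reduction = ratios(run)
    print(f"{name}: {speedup:.1f}x faster, {scan_reduction:.1f}x less data scanned")
```

On these numbers BigQuery finished roughly 25 to 29 times sooner while scanning roughly 10 to 20 times less data than the unoptimized Hive runs.
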
  • 27. Top 10 Tracks in Spain during October 2016
        Rank  Artist(s)                                   Track Name
        1     J Balvin                                    Safari
        2     DJ Snake                                    Let Me Love You
        3     Ricky Martin                                Vente Pa' Ca
        4     Sebastian Yatra                             Traicionera
        5     Zion & Lennox (feat. J Balvin)              Otra Vez
        6     Carlos Vives, Shakira                       La Bicicleta
        7     The Chainsmokers                            Closer
        8     Major Lazer (feat. Justin Bieber & MØ)      Cold Water
        9     Sia                                         The Greatest
        10    IAmChino (feat. Pitbull, Yandel & Chacal)   Ay Mi Dios
  • 28. Time Spent Listening to Spotify by users in Spain during October: Nearly 10,000 Years!
  • 29. BigQuery at Spotify • Interactive and ad-hoc querying immediately started to transfer to BQ once the data was available on the cloud • Pace of learning increases as friction to question decreases
  • 30. Cloud Pub/Sub • At-least-once, globally distributed message queue • For high-volume, low-topic-count (<10,000) publish/subscribe behavior • Like Kafka, but without needing to operate servers and supporting services (ZooKeeper)
  • 31. Cloud Pub/Sub at Spotify • 800K events/second? No problem • P99 Latency of ingestions into ES: 500ms • Ingestion from globally distributed non-GCP datacenters is painless
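
At-least-once delivery means a subscriber can see the same message more than once, so consumers are usually written to tolerate redelivery. The following is a minimal sketch of that idea in plain Python, assuming each message carries a unique ID; `IdempotentConsumer` and its methods are hypothetical names for illustration, not part of the Cloud Pub/Sub client library and not a description of Spotify's consumers.

```python
class IdempotentConsumer:
    """Wraps a handler so each message ID is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: an in-memory set is enough for a sketch; a real service
        # would need expiry or durable storage for the IDs it has seen.
        self._seen = set()

    def receive(self, message_id, payload):
        """Return True if the handler ran, False if this was a redelivery."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen.add(message_id)  # mark only after the handler succeeded
        return True


processed = []
consumer = IdempotentConsumer(processed.append)

# "m1" arrives twice, which an at-least-once queue is allowed to do.
for message_id, payload in [("m1", "play"), ("m2", "pause"), ("m1", "play")]:
    consumer.receive(message_id, payload)

print(processed)  # ['play', 'pause']
```
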
  • 32. Cloud Dataflow • Managed service for running batch and streaming jobs • Unified API for batch and streaming mode • Inspired by internal Google tools like FlumeJava and MillWheel • Programming model open-sourced as Apache Beam (currently incubating)
  • 33. Cloud Dataflow (Batch) at Spotify • Usually run via Scio: https://github.com/spotify/scio • Scio provides a Scala API for running Dataflow jobs and provides easy integrations with BigQuery • New batch processing jobs @Spotify are being written in Scio/Dataflow
  • 34. Cloud Dataflow (Streaming) at Spotify • Exactly-once stream processing framework • A replacement for Spark/Flink streaming and Storm workloads at Spotify • Optimizes for consistency, which can complicate real-time workloads
  • 35. Spotify + Google Cloud Timeline (2015, 2016) • Beginning of Google Cloud evaluation • BigQuery begins to replace Hive • Cloud Pub/Sub begins to replace Kafka • Dataflow (streaming) begins to replace Storm • Spotify + Google Cloud Announcement • Dataflow (batch) replacing Map/Reduce. Note: Dates are approximations
  • 37. The Problem • We want to detect within minutes if we’ve introduced a bug in a client release that affects critical event logging behavior
  • 38. Before… Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries HOURS TO INSIGHTS
  • 39. Introducing “Pulsar” • An internal name for the system aggregating data from Access Points and feeding it into Cloud Pub/Sub • Replaces the Kafka real-time event feed
  • 41. Pub/Sub • Aggregates global event feed from Pulsar • Makes data available to multiple zones in milliseconds
  • 42. Dataflow • Subscribes to critical event Pub/Sub topics • Aggregates events into minute windows • Always running, no need to schedule or wait for results
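
Aggregating events into minute windows amounts to flooring each event timestamp to the start of its minute and counting per window. Here is a minimal plain-Python sketch of that fixed-window counting, assuming events arrive as `(epoch_seconds, event_type)` pairs; the function names are illustrative, and a real streaming job would additionally have to deal with late and out-of-order data, which this sketch ignores.

```python
from collections import Counter

WINDOW_SECONDS = 60


def window_start(timestamp_s):
    """Floor an epoch timestamp (in seconds) to the start of its one-minute window."""
    return timestamp_s - (timestamp_s % WINDOW_SECONDS)


def count_per_minute(events):
    """Count events per (window start, event type).

    events: iterable of (timestamp_s, event_type) pairs.
    """
    counts = Counter()
    for timestamp_s, event_type in events:
        counts[(window_start(timestamp_s), event_type)] += 1
    return counts


events = [(0, "play"), (59, "play"), (60, "play"), (61, "skip"), (125, "play")]
counts = count_per_minute(events)

for (start, event_type), n in sorted(counts.items()):
    print(start, event_type, n)
```
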
  • 43. BigQuery • Receives aggregates from Dataflow • Allows for ad-hoc inspection or slicing on different dimensions
  • 44. Tableau • Data visualization tool that integrates nicely with BigQuery • Pulls data from BigQuery periodically and caches for quick inspection
  • 45. Putting it all together
  • 46. Putting it all together Milliseconds to transfer Milliseconds to process Seconds to Query SECONDS TO INSIGHTS
  • 47. Putting it all together
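
Once per-minute counts are available within seconds, spotting a logging regression can be as simple as comparing the latest minute with a recent baseline. The sketch below shows one possible rule, flagging a minute whose count falls below half the median of the preceding minutes; the threshold, the sample numbers, and the `dropped` helper are all assumptions for illustration, not the check Spotify runs.

```python
from statistics import median


def dropped(history, current, min_ratio=0.5):
    """Flag the current minute if its count is below min_ratio times the median of recent minutes."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return current < min_ratio * median(history)


# Hypothetical per-minute counts of one critical event type.
recent_counts = [980, 1010, 1005, 990, 1000]

print(dropped(recent_counts, 995))  # False
print(dropped(recent_counts, 120))  # True
```

A median baseline is used here because a single unusual minute in the history moves it far less than it would move a mean.
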
  • 49. Problem: As a developer, I want to be able to instantly explore data being logged by the clients.
  • 50. Solution • Produce a topic for all employee client events • Store in Elasticsearch • Visualize in Kibana
  • 53. Benefits • Able to understand what’s being sent by the client as it happens • Exploring events, visualizing distribution (e.g. does this field actually get populated?) • Prototyping analysis based on a sample • Dashboards for Employee Releases
  • 55. Problem: The previous dashboard is great for prototyping, but what if you want all the data?
  • 56. Solution: Allow developers to funnel feature-specific data to their own Elasticsearch cluster
  • 57. Dataflow to the Rescue! • We created a library that allows teams to build maps/filters with simple Java code • Code gets translated into a Dataflow job
  • 58. Abstract Away the Complexity
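
The slides do not show the library itself, so the following is only a sketch of the general shape such an abstraction can take: teams declare map and filter steps, and something else decides how to execute them. It is written in plain Python rather than Java, runs the steps in-process instead of translating them into a Dataflow job, and every name in it (`EventPipeline`, the event fields) is hypothetical.

```python
class EventPipeline:
    """Collects map/filter steps and applies them lazily to an event stream."""

    def __init__(self):
        self._steps = []

    def filter(self, predicate):
        """Keep only events for which predicate(event) is true."""
        self._steps.append(("filter", predicate))
        return self

    def map(self, fn):
        """Replace each event with fn(event)."""
        self._steps.append(("map", fn))
        return self

    def run(self, events):
        """Apply the declared steps, in order, to each event."""
        for event in events:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(event):
                        keep = False
                        break
                else:
                    event = fn(event)
            if keep:
                yield event


# A team interested only in (hypothetical) search events declares two steps.
pipeline = (
    EventPipeline()
    .filter(lambda e: e["feature"] == "search")
    .map(lambda e: {"user": e["user"], "query": e["query"].lower()})
)

events = [
    {"feature": "search", "user": "u1", "query": "Safari"},
    {"feature": "radio", "user": "u2", "query": ""},
]
results = list(pipeline.run(events))
print(results)  # [{'user': 'u1', 'query': 'safari'}]
```

Keeping the step list declarative is what makes a translation layer possible: the same list could be handed to a different runner without the team's code changing.
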
  • 60. No Ops! • For our users: • Event-feed managed through Cloud Pub/Sub • Dataflow managed by Google • Shared Elasticsearch cluster (managed by an infra team)
  • 61. Low Ops :/ • Dataflow is improving, but it’s had some stability issues with streaming jobs • Teams may need to set up their own Elasticsearch cluster if they require a higher SLA than default
  • 63. Ad Targeting • Real-time genre targeting • Session insights — explicit filter
  • 65. Live Results for X-Factor • X-Factor: television music competition • Contest songs get loaded onto Spotify immediately after show airs • Listener behavior determines the order of contestants on the playlist
  • 67. Behind the Scenes Minutes to transfer Hours to Clean and Bucket Hours to Run Jobs or Ad Hoc Queries
  • 68. To leverage actionable insights, we need a faster feedback loop!
  • 69. Real-time Processing • Batch Processing (Hadoop, Hive, BigQuery) • Operational Monitoring (“Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkeley, Dec 2009)
  • 70. Cloud to the Rescue! • Spotify has leveled up our ability to gain actionable insights by leveraging Google Cloud tools, such as Pub/Sub, Dataflow and BigQuery
  • 71. The Value of a Fast Feedback Loop • Detecting problems early in data avoids long backfills or long-term data loss • Instant insights on newly developed features allow teams to iterate more quickly and take risks • Providing a quicker ad-hoc querying engine allows teams to ask more questions and learn faster
  • 72. Use Anything and Everything • Open source and other cloud providers offer many alternatives to the stack we’ve used • Open source tools, like Elasticsearch/Kibana, and proprietary solutions, like Tableau, have also been useful additions
  • 73. Where Are We Going? • The real-time mission is in the early stages at Spotify
  • 74. Stream Processing First • The sun never sets on Spotify, why impose boundaries on our datasets? • What’s the shortest distance between two lines? Zero! • Can we reduce the feedback cycle to zero?
  • 75. We’re Hiring! Engineers, Managers, Product Owners needed in NYC and Stockholm https://www.spotify.com/jobs