SlideShare a Scribd company logo
1 of 40
Download to read offline
TheEvolution
ofBigDataat
Spotify
Josh Baer (jbx@spotify.com)
WhoAm I?
‣ Technical Product Owner at
Spotify, responsible for
Hadoop
@l_phant
Overview
• Creating Music Charts in three parts:
• Playing Music
• Collecting Data
• Processing Data
• The Future
Building
MusicCharts
Building Music Charts
Play Music Collect Data Process
PlayingMusic
What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of
Songs,Artists andAlbums
• Bythe end of 2014
• 60 Million Monthly Users
• 15 Million Paid Subscribers
What is Spotify?
• Data Infrastructure
• 1300 Hadoop Nodes
• 42 PB Storage
• 30TB data ingested via Kafka/day
• 400TB generated by Hadoop/day
Powered by Data
• RunningApp
• Matches music to running tempo
• Personalized running playlists in
multiple tempos for millions of
active users
http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery
Powered by Data
• Now Page
• Shows, podcasts and playlists
based on day-parts
• Personalized layout so you always
have the right music forthe right
moment
Building Music Charts
10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/
aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/
aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36”
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/
courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser
HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/
instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/
dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81
Safari/537.36”
10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/
courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36"
• Raw data is complicated
• Often dirty
• Evolving structure
• Duplication all over
• Getting data to a central
processing point is HARD
Collecting
Data
“It’s simple, we just
throw the data into
Hadoop”
A naive data engineer
LogArchiver
• Original method to transport logs fromAPs to HDFS
• Lasted from 2009 - 2013
• Relies on rsynch/scp to move files around
• Regularly scheduled via cron
LogArchiver Fails
• Worked well with small number ofAPs
• Issues with scale
• Manual Processes of adding new hosts
• Frequent dying of hosts or network issues caused massive congestion
• Manual process of overrides
Apache Kafka to the rescue!
• Apublish-subscribe messaging system open sourced by LinkedIn in 2011
• High level overview:
• Topic: Feeds of messages
• Producer:Amessage publisher
• Consumer :Asubscriber oftopics
Apache Kafka to the rescue!
• Log -> HDFS latency reduced from hours to seconds!
• Benefits:
• Community supported
• Division of responsibilities
• Allowed for enhanced streaming use-cases
Processing
Data
Workflow Management Fail!
0	
  *	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  merge_hourly.jar	
  
15	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  aggregate_song_plays.jar	
  
30	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐analytics	
  hadoop	
  jar	
  merge_artist_song.jar	
  
*	
  1	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  daily_aggregate.jar	
  
*	
  2	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  calculate_royalties.jar	
  
*/2	
  22	
  *	
  *	
  *	
  spotify-­‐radio	
  	
  	
  	
  	
  hadoop	
  jar	
  generate_radio.jar	
  
Handles the ‘plumbing’ for Hadoop jobs
https://github.com/spotify/luigi
Luigi - Python Workflow Manager
Easy to get started, no xml like Oozie
Hadoop Availability
• In 2013:
• Hadoop expanded to 200 nodes
• It was business critical
• It was not very reliable :-(
• Created a ‘squad’ with two missions:
• Migrate to a new distribution withYarn
• Make Hadoop reliable
How did we do?
HadoopUptime
90%
92%
94%
96%
98%
100%
Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014 Q1-2015 Q2-2015
Hadoop ownerless Dedicated
squad launches
Upgrade
instability
Continually improving
Going from Python to Crunch
• Most of our jobs were Hadoop (python) streaming
• Lots of failures, slow performance
• Had to find a betterway
Moving from Python to Crunch
• Investigated several frameworks*
• Selected Crunch:
• Real types - compile time error detection, bettertestability
• Higher levelAPI - let the framework optimize foryou
• Better performance #JVM_FTW
*thewit.ch/scalding_crunchy_pig
Play Music Collect Data Process
Data driven features that
allows for new ways to
play and discover music
Lower latency and
enhanced reliability for
passing data from Access
Points to HDFS via Kafka
Increased Hadoop
reliability, Luigi scheduling
and better performance
with Crunch
Improving Charts!
TheFuture
Growth%
0
500
1000
1500
2000
2500
3000
3500
2012 2013 2014 2015
Hadoop Usage Spotify Users
Growth of Hadoop vs. Spotify Users
Explosive Growth
• Increased Spotify Users
• More users listening to more music -> more data -> longer running jobs
• Increased Use Cases
• Beyond simple analytics into Machine Learning, advanced processing
• Increased Engineers
• In 2014, growth of data and machine learning engineers grew rapidly
Scaling Machines: Easy
Scaling People: Hard
User Feedback:
Automate it!
Inviso
Developed by Netflix: https://github.com/Netflix/inviso
Hadoop Report Card
• Contains Statistics
• Guidelines and Best
Practices
• Sent Quarterly
RealTime Use Cases
• Expanding our use of Storm for:
• TargetingAds based on genres
• Visualizing Data
• Quicker recommendations
• More information:
• https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Two takeaways
• Getting data into Hadoop is halfthe challenge.Think early
and often about scale.
• Increasing infrastructure reliability and performance leads to
expanded use.This adds challenges but it’s a good problem
to have.
Join The Band!
Engineers needed in NYC, Stockholm
http://spotify.com/jobs

More Related Content

What's hot

Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )ANKUSH
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Business intelligence architecture
Business intelligence architectureBusiness intelligence architecture
Business intelligence architectureSlava Kokaev
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaDaria Litvinov
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019Timothy Spann
 
Tableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at NetflixTableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at NetflixBlake Irvine
 
Big data & Digital Marketing
Big data & Digital MarketingBig data & Digital Marketing
Big data & Digital MarketingKarthik Bharath
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ SpotifyOscar Carlsson
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookDatabricks
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 

What's hot (20)

Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )NETFLIX (BIG DATA ANALYTICS )
NETFLIX (BIG DATA ANALYTICS )
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Business intelligence architecture
Business intelligence architectureBusiness intelligence architecture
Business intelligence architecture
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Real Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and KafkaReal Time analytics with Druid, Apache Spark and Kafka
Real Time analytics with Druid, Apache Spark and Kafka
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Tableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at NetflixTableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
Tableau Conference 2018: Binging on Data - Enabling Analytics at Netflix
 
Big data & Digital Marketing
Big data & Digital MarketingBig data & Digital Marketing
Big data & Digital Marketing
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ Spotify
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 

Viewers also liked

Measuring team performance at spotify slideshare
Measuring team performance at spotify slideshareMeasuring team performance at spotify slideshare
Measuring team performance at spotify slideshareDanielle Jabin
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015Danielle Jabin
 
Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes TomorrowDanielle Jabin
 
Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify TheFamily
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupAndy Sloane
 
Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loopJosh Baer
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with SparkChris Johnson
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Peter Antman
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at SpotifyErik Bernhardsson
 
The California Community College’s Education Planning Initiative (EPI)
The California Community College’s Education Planning Initiative (EPI)The California Community College’s Education Planning Initiative (EPI)
The California Community College’s Education Planning Initiative (EPI)Hobsons
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingSameera Horawalavithana
 
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015ACTUONDA
 
Music Recommendations at Spotify
Music Recommendations at SpotifyMusic Recommendations at Spotify
Music Recommendations at SpotifyEmily Samuels
 
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...ACTUONDA
 
Trends in the digital/physical world - March 2014
Trends in the digital/physical world - March 2014Trends in the digital/physical world - March 2014
Trends in the digital/physical world - March 2014Humblebee
 

Viewers also liked (18)

Measuring team performance at spotify slideshare
Measuring team performance at spotify slideshareMeasuring team performance at spotify slideshare
Measuring team performance at spotify slideshare
 
Spotify: Data center & Backend buildout
Spotify: Data center & Backend buildoutSpotify: Data center & Backend buildout
Spotify: Data center & Backend buildout
 
The Spotify Playbook
The Spotify Playbook The Spotify Playbook
The Spotify Playbook
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015
 
Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes Tomorrow
 
Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify
 
Machine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data MeetupMachine learning @ Spotify - Madison Big Data Meetup
Machine learning @ Spotify - Madison Big Data Meetup
 
Shortening the feedback loop
Shortening the feedback loopShortening the feedback loop
Shortening the feedback loop
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
The California Community College’s Education Planning Initiative (EPI)
The California Community College’s Education Planning Initiative (EPI)The California Community College’s Education Planning Initiative (EPI)
The California Community College’s Education Planning Initiative (EPI)
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
 
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015
Spotify: Profils d'écoute et data storytelling @ Radio 2.0 2015
 
Music Recommendations at Spotify
Music Recommendations at SpotifyMusic Recommendations at Spotify
Music Recommendations at Spotify
 
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...
Impact et attractivité des concerts des radio pour les auditeurs Etude HyperW...
 
Trends in the digital/physical world - March 2014
Trends in the digital/physical world - March 2014Trends in the digital/physical world - March 2014
Trends in the digital/physical world - March 2014
 

Similar to The Evolution of Big Data at Spotify

Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General SessionCloudera, Inc.
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv larsgeorge
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
A glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataA glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataSaurav Kumar Sinha
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
HadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop OverviewHadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop OverviewYafang Chang
 

Similar to The Evolution of Big Data at Spotify (20)

Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General Session
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
A glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big DataA glimpse into the Future of Hadoop & Big Data
A glimpse into the Future of Hadoop & Big Data
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
HadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop OverviewHadoopCon- Trend Micro SPN Hadoop Overview
HadoopCon- Trend Micro SPN Hadoop Overview
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

The Evolution of Big Data at Spotify

  • 2. WhoAm I? ‣ Technical Product Owner at Spotify, responsible for Hadoop @l_phant
  • 3. Overview • Creating Music Charts in three parts: • Playing Music • Collecting Data • Processing Data • The Future
  • 5.
  • 6. Building Music Charts Play Music Collect Data Process
  • 8. What is Spotify? • Music Streaming Service • Browse and Discover Millions of Songs,Artists andAlbums • Bythe end of 2014 • 60 Million Monthly Users • 15 Million Paid Subscribers
  • 9. What is Spotify? • Data Infrastructure • 1300 Hadoop Nodes • 42 PB Storage • 30TB data ingested via Kafka/day • 400TB generated by Hadoop/day
  • 10. Powered by Data • RunningApp • Matches music to running tempo • Personalized running playlists in multiple tempos for millions of active users http://www.theverge.com/2015/6/1/8696659/spotify-running-is-great-for-discovery
  • 11. Powered by Data • Now Page • Shows, podcasts and playlists based on day-parts • Personalized layout so you always have the right music forthe right moment
  • 12.
  • 13. Building Music Charts 10.123.133.333 - - [Mon, 3 June 2015 11:31:33 GMT] "GET /api/admin/job/ aggregator/status HTTP/1.1" 200 1847 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36" 10.123.133.222 - - [Mon, 3 June 2015 11:31:43 GMT] "GET /api/admin/job/ aggregator/status HTTP/1.1" 200 1984 "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36” 10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/ courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36" 10.321.145.111 - - [Mon, 3 June 2015 11:33:03 GMT] "GET /api/loggedInUser HTTP/1.1" 304 - "https://my.analytics.app/dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36" 10.112.322.111 - - [Mon, 3 June 2015 11:33:03 GMT] "POST /api/ instrumentation/events/new HTTP/1.1" 200 2 "https://my.analytics.app/ dashboard/courses/1291726" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36” 10.123.133.222 - - [Mon, 3 June 2015 11:33:02 GMT] "GET /dashboard/ courses/1291726 HTTP/1.1" 304 - "https://my.analytics.app/admin" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36" • Raw data is complicated • Often dirty • Evolving structure • Duplication all over • Getting data to a central processing point is HARD
  • 15. “It’s simple, we just throw the data into Hadoop” A naive data engineer
  • 16. LogArchiver • Original method to transport logs fromAPs to HDFS • Lasted from 2009 - 2013 • Relies on rsynch/scp to move files around • Regularly scheduled via cron
  • 17.
  • 18. LogArchiver Fails • Worked well with small number ofAPs • Issues with scale • Manual Processes of adding new hosts • Frequent dying of hosts or network issues caused massive congestion • Manual process of overrides
  • 19. Apache Kafka to the rescue! • Apublish-subscribe messaging system open sourced by LinkedIn in 2011 • High level overview: • Topic: Feeds of messages • Producer:Amessage publisher • Consumer :Asubscriber oftopics
  • 20. Apache Kafka to the rescue! • Log -> HDFS latency reduced from hours to seconds! • Benefits: • Community supported • Division of responsibilities • Allowed for enhanced streaming use-cases
  • 21.
  • 23. Workflow Management Fail! 0  *  *  *  *        spotify-­‐core            hadoop  jar  merge_hourly.jar   15  *  *  *  *      spotify-­‐core            hadoop  jar  aggregate_song_plays.jar   30  *  *  *  *      spotify-­‐analytics  hadoop  jar  merge_artist_song.jar   *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar   *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar   */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar  
  • 24. Handles the ‘plumbing’ for Hadoop jobs https://github.com/spotify/luigi Luigi - Python Workflow Manager Easy to get started, no xml like Oozie
  • 25. Hadoop Availability • In 2013: • Hadoop expanded to 200 nodes • It was business critical • It was not very reliable :-( • Created a ‘squad’ with two missions: • Migrate to a new distribution withYarn • Make Hadoop reliable
  • 26. How did we do? HadoopUptime 90% 92% 94% 96% 98% 100% Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013 Q1-2014 Q2-2014 Q3-2014 Q4-2014 Q1-2015 Q2-2015 Hadoop ownerless Dedicated squad launches Upgrade instability Continually improving
  • 27. Going from Python to Crunch • Most of our jobs were Hadoop (python) streaming • Lots of failures, slow performance • Had to find a betterway
  • 28. Moving from Python to Crunch • Investigated several frameworks* • Selected Crunch: • Real types - compile time error detection, bettertestability • Higher levelAPI - let the framework optimize foryou • Better performance #JVM_FTW *thewit.ch/scalding_crunchy_pig
  • 29.
  • 30. Play Music Collect Data Process Data driven features that allows for new ways to play and discover music Lower latency and enhanced reliability for passing data from Access Points to HDFS via Kafka Increased Hadoop reliability, Luigi scheduling and better performance with Crunch Improving Charts!
  • 32. Growth% 0 500 1000 1500 2000 2500 3000 3500 2012 2013 2014 2015 Hadoop Usage Spotify Users Growth of Hadoop vs. Spotify Users
  • 33. Explosive Growth • Increased Spotify Users • More users listening to more music -> more data -> longer running jobs • Increased Use Cases • Beyond simple analytics into Machine Learning, advanced processing • Increased Engineers • In 2014, growth of data and machine learning engineers grew rapidly
  • 36. Inviso Developed by Netflix: https://github.com/Netflix/inviso
  • 37. Hadoop Report Card • Contains Statistics • Guidelines and Best Practices • Sent Quarterly
  • 38. RealTime Use Cases • Expanding our use of Storm for: • TargetingAds based on genres • Visualizing Data • Quicker recommendations • More information: • https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
  • 39. Two takeaways • Getting data into Hadoop is halfthe challenge.Think early and often about scale. • Increasing infrastructure reliability and performance leads to expanded use.This adds challenges but it’s a good problem to have.
  • 40. Join The Band! Engineers needed in NYC, Stockholm http://spotify.com/jobs