SlideShare a Scribd company logo
1 of 34
Download to read offline
Luigi
Batch data processing in
Python

Elias Freider
freider@spotify.com
Background
Why did we build Luigi?


We have a lot of data
 Billions of log messages (several TBs) every day
 Usage and backend stats, debug information
 What we want to do
   AB-testing
   Music recommendations
   Monthly/daily/hourly reporting
   Business metric dashboards
 We experiment a lot – need quick development cycles
How to do it?

Hadoop
  Data storage
  Clean, filter, join and aggregate data
Postgres
  Semi-aggregated data for dashboards
Cassandra
  Time series data
Defining a data flow
A bunch of data processing tasks with inter-dependencies
Example: Artist Toplist
Input is a list of Timestamp, Track, Artist tuples




                                 Artist
      Streams
                              Aggregation




                                  Top 10             Database
Example: Artist Toplist
Naive (non-luigi) approach
Errors will occur
If a failed step leaves behind broken data, we want to clean that up
Avoid duplicate work
The second step fails, you fix it, then you want to resume
Parametrization
To use data flows as command line tools
Repeat!
You want to run the dataflow for a set of similar inputs
Introducing Luigi
A Python framework for data flow definition and execution




Main features

  Dead-simple dependency definitions (think GNU make)
  Emphasis on Hadoop/HDFS integration
  Atomic file operations
  Powerful task templating using OOP
  Data flow visualization
  Command line integration
Luigi - Aggregate Artists
Luigi - Aggregate Artists
Run on the command line:
$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running   AggregateArtists()
INFO: [pid 74375] Done      AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
Task templates and targets
Reduces code-repetition by utilizing Luigi’s object oriented data model




Luigi comes with pre-implemented Tasks for...

  Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
  Running Hive and (soon) Pig queries
  Inserting data sets into Postgres


Writing new ones are as easy as defining an interface and
implementing run()
Hadoop MapReduce
Built-in Hadoop Streaming Python framework




Features

 Very slim interface
 Fetches error logs from Hadoop cluster and displays them to the user
 Class instance variables can be referenced in MapReduce code, which makes it
 easy to supply extra data in dictionaries etc. for map side joins
 Easy to send along Python modules that might not be installed on the cluster
 Basic support for secondary sort
 Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
Hadoop MapReduce
Built-in Hadoop Streaming Python framework
Attach Python modules
So you can use local modules remotely on the cluster
Local state is transferred
Set up local variables for map-side joins etc.
Completing the top list
Top 10 artists - Wrapped arbitrary Python code
Database support
Basic functionality for exporting to Postgres. Cassandra support is in the works
Running it all...
   DEBUG: Checking if ArtistToplistToDatabase() is complete
   INFO: Scheduled ArtistToplistToDatabase()
   DEBUG: Checking if Top10Artists() is complete
   INFO: Scheduled Top10Artists()
   DEBUG: Checking if AggregateArtists() is complete
   INFO: Scheduled AggregateArtists()
   DEBUG: Checking if Streams() is complete
   INFO: Done scheduling tasks
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 3
   INFO: [pid 74811] Running   AggregateArtists()
   INFO: [pid 74811] Done      AggregateArtists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 2
   INFO: [pid 74811] Running   Top10Artists()
   INFO: [pid 74811] Done      Top10Artists()
   DEBUG: Asking scheduler for work...
   DEBUG: Pending tasks: 1
   INFO: [pid 74811] Running   ArtistToplistToDatabase()
   INFO: Done writing, importing at 2013-03-13 15:41:09.407138
   INFO: [pid 74811] Done      ArtistToplistToDatabase()
   DEBUG: Asking scheduler for work...
   INFO: Done
   INFO: There are no more tasks to run at this time
Data flow visualization
Light-weight scheduler daemon with a web interface
The results
Imagine how cool this would be with real data...
Task Parameters
Class variables with some magic




Tasks have implicit __init__




   Generates command line interface with typing and documentation
   $ python dataflow.py AggregateArtists --date 2013-03-05
Task Parameters
Combined usage example
Task Parameters
Makes it really easy to create aggregates over time
Multiple workers
 Basic multi-processing



$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
Large data flows
            (Screenshot from web interface)




                                                                                                                                                                                                                                                                                                                  EndSongCleaned                                                                       EndSongCleaned
                                                                                                                                                                                                                                                                                                                  (date=2013-03-16)                                                                   (date=2013-03-15)




                                                                                                                                                                                                AggregateTracks                                                                                                             AggregateTracks                                                 UserLocationDay                                  UserLocationDay              UserLocationDay        UserLocationDay         UserLocationDay           UserLocationDay               UserLocationDay      UserLocationDay     UserLocationDay     UserLocationDay     UserLocationDay
                                                                                                                                                                                                   (test=False,                MasterMetadata                                                                                    (test=False,                                                    (test=False,                                   (test=False,                (test=False,           (test=False,            (test=False,               (test=False,                    (test=False,      (test=False,        (test=False,        (test=False,        (test=False,
                                                                                                                                                                                                date=2013-03-16,              (date=2013-03-15)                                                                            date=2013-03-15,                                                 date=2013-03-16,                                 date=2013-03-14,            date=2013-03-13,        date=2013-03-08,        date=2013-03-12,          date=2013-03-07,              date=2013-03-11,     date=2013-03-10,    date=2013-03-09,    date=2013-03-15,    date=2013-03-06,
                                                                                                                                                                                                test_users=False)                                                                                                          test_users=False)                                                test_users=False)                                test_users=False)           test_users=False)       test_users=False)       test_users=False)         test_users=False)             test_users=False)    test_users=False)   test_users=False)   test_users=False)   test_users=False)




                                                                                                                                                                                              JoinArtistGids                                 JoinArtistGids                                                                                                                                                                                                                                                                                       UserLocationPeriod             UserLocationPeriod
                                                                                                                                                                                               (test=False,                                     (test=False,                                                                                                                                                                                                                                                                                         (test=False,                     (test=False,
                                                                                                                                                                                            date=2013-03-16,                               date=2013-03-15,                                                                                                                                                                                                                                                                                           period=10,                        period=10,
                                                                                                                                                                                               nshards=12,                                   nshards=12,                                                                                                                                                                                                                                                                                          date=2013-03-17,               date=2013-03-16,
                                                                                                                                                                                            test_users=False)                              test_users=False)                                                                                                                                                                                                                                                                                      test_users=False)              test_users=False)




  AggregateUserMatrices       AggregateUserMatrices       AggregateUserMatrices                    AggregateUserMatrices               AggregateUserMatrices      AggregateUserMatrices
                                                                                                                                                                                            AggregateByArtists      AggregateByArtists                                        AggregateByArtists               AggregateByArtists               AggregateByArtists             AggregateByArtists                AggregateByArtists            AggregateByArtists           AggregateByArtists      AggregateByArtists       AggregateByArtists
       (test=False,                (test=False,                (test=False,                              (test=False,                        (test=False,              (test=False,
                                                                                                                                                                                               (test=False,            (test=False,                ArtistFollows                 (test=False,                     (test=False,                      (test=False,                  (test=False,                        (test=False,                (test=False,                  (test=False,            (test=False,            (test=False,
    date=2013-03-16,            date=2013-03-14,            date=2013-03-13,                          date=2013-03-15,                   date=2013-03-12,           date=2013-03-11,
                                                                                                                                                                                            date=2013-03-16,        date=2013-03-13,            (date=2013-03-14)             date=2013-03-14,                 date=2013-03-09,                 date=2013-03-08,               date=2013-03-15,                  date=2013-03-12,              date=2013-03-11,              date=2013-03-07,        date=2013-03-10,         date=2013-03-06,
index_version=1362433077,   index_version=1362433077,   index_version=1362433077,               index_version=1362433077,           index_version=1362433077,   index_version=1362433077,
                                                                                                                                                                                            test_users=False)       test_users=False)                                         test_users=False)                test_users=False)                test_users=False)              test_users=False)                 test_users=False)             test_users=False)             test_users=False)       test_users=False)       test_users=False)
    test_users=False)           test_users=False)           test_users=False)                         test_users=False)                  test_users=False)          test_users=False)




                                                                              AccumulateUserMatrices                         AccumulateUserMatrices                                                                                                     AccumulateByArtists                                                                                                                                                                                         AccumulateByArtists
                                                                                    (test=False,                                   (test=False,                                                                                                                (test=False,                                                                                                                                                                                             (test=False,
                                                                                                                                                                                                                                                                                            JoinArtistGids                                                                JoinArtistGids                                              JoinArtistGids                                                                                                                 JoinArtistGids
                                                                                 date=2013-03-16,                               date=2013-03-15,                                                                                                         date=2013-03-16,                                                          PlaylistVectors                                                                                                                   date=2013-03-15,
                                                                                                                                                                                                                                                                                                (test=False,                                                               (test=False,                                                (test=False,                                                                                                                   (test=False,
                                                                          index_version=1362433077,                         index_version=1362433077,                                                                                                    test_users=False,                                                           (test=False,                                                    RelatedArtistsTC                                                test_users=False,
                                                                                                                                                                                                                                                                                          date=2013-03-14,                                                              date=2013-03-13,                                             date=2013-03-12,                                                                                                               date=2013-03-11,
                                                                                 test_users=False,                              test_users=False,                                                                                                          max_days=90,                                                           date=2013-03-14,                                                              ()                                                    max_days=90,
                                                                                                                                                                                                                                                                                            nshards=12,                                                                   nshards=12,                                                  nshards=12,                                                                                                                    nshards=12,
                                                                                 decay_factor=0.99,                             decay_factor=0.99,                                                                                                      decay_factor=0.99,                                                index_version=1362433077)                                                                                                                 decay_factor=0.99,
                                                                                                                                                                                                                                                                                         test_users=False)                                                              test_users=False)                                            test_users=False)                                                                                                              test_users=False)
                                                                                      days=5,                                        days=5,                                                                                                                    days=10,                                                                                                                                                                                                 days=10,
                                                                              build_from_scratch=True)                       build_from_scratch=True)                                                                                                build_from_scratch=True)                                                                                                                                                                                    build_from_scratch=True)




                                                                                                                                                                                                                                                                                                                            UserRecs                                                                                 UserRecs
                                                                                                                                                                                                                                                                                                                            (test=False,                                                                         (test=False,
                                                                                                                                                                                                                                                                                                                        date=2013-03-17,                                                                   date=2013-03-16,
                                                                                                                                                                                                                                                                                                                           rec_days=5,                                                                           rec_days=5,
                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                          exp_days=10,                                                                          exp_days=10,
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-15,
                                                                                                                                                                                                                                                                                                                        test_users=False,                                                                  test_users=False,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                       force_updates=False,                                                              force_updates=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
                                                                                                                                                                                                                                                                                                                     build_from_scratch=True,                                                          build_from_scratch=True,
                                                                                                                                                                                                                                                                                                                index_path=/spotify/discover/index,                                                index_path=/spotify/discover/index,
                                                                                                                                                                                                                                                                                                                       index_version=None,                                                                index_version=None,
                                                                                                                                                                                                                                                                                                                     FOLLOWS_SCORE=5.0)                                                                 FOLLOWS_SCORE=5.0)




                                                                                                                                                                                                                                                                                                                                                                      IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                                   (date=2013-03-16,
                                                                                                                                                                                                                                                                                                                                                                    test_users=False,
                                                                                                                                                                                                                                                                                                                                                                        ttl=604800)




                                                                                                                                                                                                                                                                                                                                                     IngestUserRecs
                                                                                                                                                                                                                                                                                                                                                    (date=2013-03-17,
                                                                                                                                                                                                                                                                                                                                                     test_users=False,
                                                                                                                                                                                                                                                                                                                                                        ttl=604800)
Really large...
Visualization needs some work
Error notifications
Great for automated execution
Process Synchronization
Prevents two identical tasks from running simultaneously


                                luigid
                       Simple task synchronization




             Data flow 1                   Data flow 2




                        Common dependency
                             Task
What we use Luigi for
Not just another Hadoop streaming framework!

 Hadoop Streaming
 Java Hadoop MapReduce
 Hive
 Pig
 Local (non-hadoop) data processing
 Import/Export data to/from Postgres
 Insert data into Cassandra
 scp/rsync/ftp data files and reports
 Dump and load databases
Others using it with Scala MapReduce and MRJob as well
Luigi is open source

http://github.com/spotify/luigi
Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012
Thank you!
For more information feel free to reach out at
http://github.com/spotify/luigi

Oh, and we’re hiring – http://spotify.com/jobs



Elias Freider
freider@spotify.com

More Related Content

What's hot

Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXAmir Payberah
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)NTT DATA OSS Professional Services
 
TiDBのトランザクション
TiDBのトランザクションTiDBのトランザクション
TiDBのトランザクションAkio Mitobe
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouseAltinity Ltd
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
iptables 101- bottom-up
iptables 101- bottom-upiptables 101- bottom-up
iptables 101- bottom-upHungWei Chiu
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
Multi cluster, multitenant and hierarchical kafka messaging service   slideshareMulti cluster, multitenant and hierarchical kafka messaging service   slideshare
Multi cluster, multitenant and hierarchical kafka messaging service slideshareAllen (Xiaozhong) Wang
 
さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)Takanori Sejima
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
 

What's hot (20)

Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Graph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphXGraph processing - Powergraph and GraphX
Graph processing - Powergraph and GraphX
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
TiDBのトランザクション
TiDBのトランザクションTiDBのトランザクション
TiDBのトランザクション
 
golang profiling の基礎
golang profiling の基礎golang profiling の基礎
golang profiling の基礎
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouse
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
iptables 101- bottom-up
iptables 101- bottom-upiptables 101- bottom-up
iptables 101- bottom-up
 
Consistent hash
Consistent hashConsistent hash
Consistent hash
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
Multi cluster, multitenant and hierarchical kafka messaging service   slideshareMulti cluster, multitenant and hierarchical kafka messaging service   slideshare
Multi cluster, multitenant and hierarchical kafka messaging service slideshare
 
さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)さいきんの InnoDB Adaptive Flushing (仮)
さいきんの InnoDB Adaptive Flushing (仮)
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Luigi PyData presentation

  • 1. Luigi Batch data processing in Python Elias Freider freider@spotify.com
  • 2. Background Why did we build Luigi? We have a lot of data Billions of log messages (several TBs) every day Usage and backend stats, debug information What we want to do AB-testing Music recommendations Monthly/daily/hourly reporting Business metric dashboards We experiment a lot – need quick development cycles
  • 3. How to do it? Hadoop Data storage Clean, filter, join and aggregate data Postgres Semi-aggregated data for dashboards Cassandra Time series data
  • 4. Defining a data flow A bunch of data processing tasks with inter-dependencies
  • 5. Example: Artist Toplist Input is a list of Timestamp, Track, Artist tuples Artist Streams Aggregation Top 10 Database
  • 6. Example: Artist Toplist Naive (non-luigi) approach
  • 7. Errors will occur If a failed step leaves behind broken data, we want to clean that up
  • 8. Avoid duplicate work The second step fails, you fix it, then you want to resume
  • 9. Parametrization To use data flows as command line tools
  • 10. Repeat! You want to run the dataflow for a set of similar inputs
  • 11. Introducing Luigi A Python framework for data flow definition and execution Main features Dead-simple dependency definitions (think GNU make) Emphasis on Hadoop/HDFS integration Atomic file operations Powerful task templating using OOP Data flow visualization Command line integration
  • 12. Luigi - Aggregate Artists
  • 13. Luigi - Aggregate Artists Run on the command line: $ python dataflow.py AggregateArtists DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74375] Running AggregateArtists() INFO: [pid 74375] Done AggregateArtists() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 14. Task templates and targets Reduces code-repetition by utilizing Luigi’s object oriented data model Luigi comes with pre-implemented Tasks for... Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files Running Hive and (soon) Pig queries Inserting data sets into Postgres Writing new ones are as easy as defining an interface and implementing run()
  • 15. Hadoop MapReduce Built-in Hadoop Streaming Python framework Features Very slim interface Fetches error logs from Hadoop cluster and displays them to the user Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins Easy to send along Python modules that might not be installed on the cluster Basic support for secondary sort Runs on CPython so you can use your favorite libs (numpy, pandas etc.)
  • 16. Hadoop MapReduce Built-in Hadoop Streaming Python framework
  • 17. Attach Python modules So you can use local modules remotely on the cluster
  • 18. Local state is transferred Set up local variables for map-side joins etc.
  • 19. Completing the top list Top 10 artists - Wrapped arbitrary Python code
  • 20. Database support Basic functionality for exporting to Postgres. Cassandra support is in the works
  • 21. Running it all... DEBUG: Checking if ArtistToplistToDatabase() is complete INFO: Scheduled ArtistToplistToDatabase() DEBUG: Checking if Top10Artists() is complete INFO: Scheduled Top10Artists() DEBUG: Checking if AggregateArtists() is complete INFO: Scheduled AggregateArtists() DEBUG: Checking if Streams() is complete INFO: Done scheduling tasks DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 3 INFO: [pid 74811] Running AggregateArtists() INFO: [pid 74811] Done AggregateArtists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 74811] Running Top10Artists() INFO: [pid 74811] Done Top10Artists() DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 74811] Running ArtistToplistToDatabase() INFO: Done writing, importing at 2013-03-13 15:41:09.407138 INFO: [pid 74811] Done ArtistToplistToDatabase() DEBUG: Asking scheduler for work... INFO: Done INFO: There are no more tasks to run at this time
  • 22. Data flow visualization Light-weight scheduler daemon with a web interface
  • 23. The results Imagine how cool this would be with real data...
  • 24. Task Parameters Class variables with some magic Tasks have implicit __init__ Generates command line interface with typing and documentation $ python dataflow.py AggregateArtists --date 2013-03-05
  • 26. Task Parameters Makes it really easy to create aggregates over time
  • 27. Multiple workers Basic multi-processing $ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11
  • 28. Large data flows (Screenshot from web interface) EndSongCleaned EndSongCleaned (date=2013-03-16) (date=2013-03-15) AggregateTracks AggregateTracks UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay UserLocationDay (test=False, MasterMetadata (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, (date=2013-03-15) date=2013-03-15, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-08, date=2013-03-12, date=2013-03-07, date=2013-03-11, date=2013-03-10, date=2013-03-09, date=2013-03-15, date=2013-03-06, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) JoinArtistGids JoinArtistGids UserLocationPeriod UserLocationPeriod (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-15, period=10, period=10, nshards=12, nshards=12, date=2013-03-17, date=2013-03-16, test_users=False) test_users=False) test_users=False) test_users=False) AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateUserMatrices AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists AggregateByArtists (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, ArtistFollows (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, (test=False, date=2013-03-16, date=2013-03-14, date=2013-03-13, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-16, date=2013-03-13, (date=2013-03-14) date=2013-03-14, date=2013-03-09, date=2013-03-08, date=2013-03-15, date=2013-03-12, date=2013-03-11, date=2013-03-07, date=2013-03-10, date=2013-03-06, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, index_version=1362433077, test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) test_users=False) AccumulateUserMatrices AccumulateUserMatrices AccumulateByArtists AccumulateByArtists (test=False, (test=False, (test=False, (test=False, JoinArtistGids JoinArtistGids JoinArtistGids JoinArtistGids date=2013-03-16, date=2013-03-15, date=2013-03-16, PlaylistVectors date=2013-03-15, (test=False, (test=False, (test=False, (test=False, index_version=1362433077, index_version=1362433077, test_users=False, (test=False, RelatedArtistsTC test_users=False, date=2013-03-14, date=2013-03-13, date=2013-03-12, date=2013-03-11, test_users=False, test_users=False, max_days=90, date=2013-03-14, () max_days=90, nshards=12, nshards=12, nshards=12, nshards=12, decay_factor=0.99, decay_factor=0.99, decay_factor=0.99, index_version=1362433077) decay_factor=0.99, test_users=False) test_users=False) test_users=False) test_users=False) days=5, days=5, days=10, days=10, build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) build_from_scratch=True) UserRecs UserRecs (test=False, (test=False, date=2013-03-17, date=2013-03-16, rec_days=5, rec_days=5, IngestUserRecs exp_days=10, exp_days=10, (date=2013-03-15, test_users=False, test_users=False, test_users=False, force_updates=False, force_updates=False, ttl=604800) build_from_scratch=True, build_from_scratch=True, index_path=/spotify/discover/index, index_path=/spotify/discover/index, index_version=None, index_version=None, FOLLOWS_SCORE=5.0) FOLLOWS_SCORE=5.0) IngestUserRecs (date=2013-03-16, test_users=False, ttl=604800) IngestUserRecs (date=2013-03-17, test_users=False, ttl=604800)
  • 30. Error notifications Great for automated execution
  • 31. Process Synchronization Prevents two identical tasks from running simultaneously luigid Simple task synchronization Data flow 1 Data flow 2 Common dependency Task
  • 32. What we use Luigi for Not just another Hadoop streaming framework! Hadoop Streaming Java Hadoop MapReduce Hive Pig Local (non-hadoop) data processing Import/Export data to/from Postgres Insert data into Cassandra scp/rsync/ftp data files and reports Dump and load databases Others using it with Scala MapReduce and MRJob as well
  • 33. Luigi is open source http://github.com/spotify/luigi Originated at Spotify Based on many years of experience with data processing Recent contributions by Foursquare and Bitly Open source since September 2012
  • 34. Thank you! For more information feel free to reach out at http://github.com/spotify/luigi Oh, and we’re hiring – http://spotify.com/jobs Elias Freider freider@spotify.com