A Beginner’s Guide to Building Data Pipelines with Luigi
Where should I focus my outbound sales and marketing efforts to yield the highest possible ROI?
[Diagram: UK Limited Companies + Customer CRM Data → Predictive Model]
With big data comes big responsibility
Hard to maintain, extend, and… look at.
Script Soup
[Slide visual: a tangle of boxes labelled "Code", "More Codes", "omg moar codez"]
if __name__ == '__main__':
    today = datetime.now().isoformat()[:10]  # <- custom date handling
    arg_parser = argparse.ArgumentParser(prog='COMPANIES HOUSE PARSER',
                                         description='Process arguments')
    arg_parser.add_argument('--files', nargs='?', dest='file_names', required=True,
                            help='CSV files to read (supports globstar wildcards)')
    arg_parser.add_argument('--batch', nargs='?', dest='batch_size', type=int,
                            help='Number of rows to save to DB at once')
    arg_parser.add_argument('--date', nargs='?', dest='table_date',
                            help='Date that these data were released')
    arg_parser.add_argument('--log_level', dest='log_level', default='INFO',
                            help='Log level to screen')
    args = arg_parser.parse_args()
The Old Way
Define a command line interface for every task?
log = GLoggingFactory().getLoggerFromPath(
    '/var/log/companies-house-load-csv.log-{}'.format(today))
log.setLogLevel(args.log_level, 'screen')  # <- custom logging
table_date = parse_date(args.table_date, datetime.now())
log.info('Starting Companies House data loader...')
ch_loader = CompaniesHouseLoader(logger=log, col_mapping=col_mapping,
                                 table_date=table_date)
ch_loader.go(args.file_names)  # <- what to do if this fails?
log.info('Loader complete. Starting Companies House updater')
ch_updater = CompaniesHouseUpdater(logger=log, table_date=table_date,
                                   company_status_params=company_status_params)  # <- need to clean up if this fails
ch_updater.go()
The Old Way
Long, processor intensive tasks stacked together
● Open-sourced by the Spotify data team
● Created by Erik Bernhardsson and Elias Freider; maintained by Arash Rouhani
● Abstracts batch processing jobs
● Makes it easy to write modular code and create dependencies between tasks
Luigi to the rescue!
● Task templating
● Dependency graphs
● Resumption of data flows after intermediate failure
● Command line integration
● Error emails
Luigi
Luigi 101 - Counting the number of companies in the UK
[Diagram: companies.csv → input() → Count companies → output() → count.txt]
class CompanyCount(luigi.Task):
    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries("companies.csv")
        with self.output().open("w") as out_file:
            out_file.write(count)
Company Count Job in Luigi code
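count_unique_entries() isn't shown in the talk; a minimal sketch of what such a helper might look like, assuming the CSV has a header row and a company-name column (both the helper body and the column name are illustrative):

import csv

def count_unique_entries(path, column="CompanyName"):
    """Count distinct values in one column of a CSV (hypothetical helper)."""
    with open(path) as f:
        reader = csv.DictReader(f)
        unique = {row[column] for row in reader}
    return str(len(unique))  # a str, so it can be written straight to the Target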
Luigi 101 - Keeping our count up to date
[Diagram: Companies Data Server → Companies Download → output() → companies.csv → input() → Count companies → output() → count.txt, with Count companies requires() Companies Download]
class CompanyCount(luigi.Task):
    def requires(self):
        return CompanyDownload()

    def output(self):
        return luigi.LocalTarget("count.csv")

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Company count with download dependency
- requires(): this task must complete before CompanyCount runs
- self.input(): the output of the required task
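Note that self.input() returns the required task's Target object rather than a filename, so a helper consuming it would open the Target instead of a path. A sketch of the same hypothetical helper adapted accordingly:

import csv

def count_unique_entries(target, column="CompanyName"):
    """Variant of the hypothetical helper that reads from a luigi Target."""
    with target.open("r") as f:  # Target.open() returns a file-like object
        reader = csv.DictReader(f)
        return str(len({row[column] for row in reader}))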
Download task

class CompanyDownload(luigi.Task):
    def output(self):
        return luigi.LocalTarget("companies.csv")

    def run(self):
        data = get_company_download()
        with self.output().open('w') as out_file:
            out_file.write(data)

- output(): the local Target to be picked up by the task that requires this one
- run(): download the data and write it to the output Target
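get_company_download() is also left undefined on the slide; a minimal sketch, assuming the data is served as a single CSV over HTTP (the URL below is a placeholder, not the real Companies House endpoint):

import requests

COMPANIES_URL = "http://example.com/companies.csv"  # placeholder URL

def get_company_download(url=COMPANIES_URL):
    """Fetch the companies CSV and return its contents as text (hypothetical)."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly so Luigi marks the task as FAILED
    return response.text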
$ python company_flow.py CompanyCount --local-scheduler
DEBUG: Checking if CompanyCount() is complete
DEBUG: Checking if CompanyDownload() is complete
INFO: Scheduled CompanyCount() (PENDING)
INFO: Scheduled CompanyDownload() (PENDING)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10076] Worker Worker(...) running CompanyDownload()
INFO: [pid 10076] Worker Worker(...) done CompanyDownload()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10076] Worker Worker(...) running CompanyCount()
INFO: [pid 10076] Worker Worker(...) done CompanyCount()
DEBUG: 1 running tasks, waiting for next task to finish
DEBUG: Asking scheduler for work...
INFO: Done
Time dependent tasks - change in companies
[Diagram: Companies Count Task (Date 1), (Date 2), (Date 3) → input() × 3 → Companies Delta → output() → company_count_delta.txt]
class AnnualCompanyCountDelta(luigi.Task):
    year = luigi.Parameter()

    def requires(self):
        tasks = []
        for month in range(1, 13):
            tasks.append(CompanyCount(dt.datetime.strptime(
                "{}-{}-01".format(self.year, month), "%Y-%m-%d")))
        return tasks

    # not shown: output(), run()
Parameterising Luigi tasks
- year = luigi.Parameter(): define the parameter
- requires(): generate the dependencies, one CompanyCount per month
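The slide omits output() and run(); purely as an illustration, here is one way they might look, assuming the delta is written as one count difference per month boundary (none of this is from the talk):

class AnnualCompanyCountDelta(luigi.Task):
    # ... year parameter and requires() as above ...

    def output(self):
        return luigi.LocalTarget("company_count_delta_{}.txt".format(self.year))

    def run(self):
        counts = []
        for month_target in self.input():  # one Target per CompanyCount dependency
            with month_target.open("r") as f:
                counts.append(int(f.read().strip()))
        deltas = [b - a for a, b in zip(counts, counts[1:])]
        with self.output().open("w") as out_file:
            out_file.write("\n".join(str(d) for d in deltas))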
class CompanyCount(luigi.Task):
    date = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return CompanyDownload(self.date)

    def output(self):
        # include the date so each month gets its own Target
        return luigi.LocalTarget("count_{}.csv".format(self.date))

    def run(self):
        count = count_unique_entries(self.input())
        with self.output().open("w") as out_file:
            out_file.write(count)
Adding the date dependency to Company Count
- the date parameter is added and passed through to CompanyDownload
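With the parameter declared, Luigi exposes it on the command line automatically, so a single month's count can be run directly (the date value here is just an example; DateParameter expects ISO format):

$ python company_flow.py CompanyCount --date 2014-06-01 --local-scheduler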
The central scheduler
$ luigid & # start central scheduler in background
$ python company_flow.py AnnualCompanyCountDelta --year 2014
The web UI is served at localhost:8082 by default.
Persisting our data
[Diagram: Companies Data Server → Companies Download(Date) → output() → companies.csv; Count companies(Date) requires(Date) the download and outputs count.txt; Companies ToMySQL(Date) also requires(Date) the download and outputs to a SQL Database]
class CompaniesToMySQL(luigi.contrib.sqla.CopyToTable):
    date = luigi.DateParameter()
    columns = [(["name", String(100)], {}), ...]
    connection_string = "mysql://localhost/test"  # or something
    table = "companies"  # name of the table to store data

    def requires(self):
        return CompanyDownload(self.date)

    def rows(self):
        for row in self.get_unique_rows():  # uses self.input()
            yield row
Persisting our data
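get_unique_rows() is the authors' own helper, not part of CopyToTable; a minimal sketch of what it might do, assuming the downloaded CSV from self.input() should be de-duplicated before insertion:

import csv

def get_unique_rows(self):
    """Yield de-duplicated rows from the required task's output (hypothetical)."""
    seen = set()
    with self.input().open("r") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                yield row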
My pipes broke
# ./client.cfg
[core]
error-email: dylan@growthintel.com, stuart@growthintel.com
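Besides [core] settings, Luigi also reads per-task sections from the same config file, so parameter values can live in configuration rather than on the command line; a small sketch (the section name must match the task class, and the value shown is illustrative):

# ./client.cfg
[CompanyCount]
date: 2014-06-01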
Things we missed out
There are lots of other task types available that we haven’t mentioned:
● Hadoop
● Spark
● ssh
● Elasticsearch
● Hive
● Pig
● etc.
Check out the luigi.contrib package
class CompanyCount(luigi.contrib.hadoop.JobTask):
    chunks = luigi.Parameter()  # split input in chunks

    def requires(self):
        return [CompanyDownload(chunk) for chunk in self.chunks]

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("companies_count.tsv")  # HDFS target

    def mapper(self, line):  # mapper and reducer methods instead of run()
        yield "count", 1

    def reducer(self, key, values):
        yield key, sum(values)

Counting the companies using Hadoop
● Doesn’t provide a way to trigger flows
● Doesn’t support distributed execution
Luigi Limitations
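Since Luigi doesn’t trigger flows itself, an external scheduler such as cron typically kicks them off; a minimal sketch (the paths and schedule are placeholders):

# crontab entry: run the pipeline daily at 02:00
0 2 * * * cd /opt/pipelines && python company_flow.py CompanyCount >> /var/log/luigi-cron.log 2>&1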
Onwards
● The docs: http://luigi.readthedocs.org/
● The mailing list: https://groups.google.com/forum/#!forum/luigi-user/
● The source: https://github.com/spotify/luigi
● The maintainers are really helpful, responsive, and open to any and all PRs!
Stuart Coleman
@stubacca81 / stuart@growthintel.com
Dylan Barth
@dylan_barth / dylan@growthintel.com
Thanks!
We’re hiring Python data scientists & engineers!
http://www.growthintel.com/careers/
Editor's Notes

  1. How many of you currently manage data pipelines in your day to day? And how many of you use some sort of framework to manage them?
  2. We work for a startup called Growth Intelligence. We use predictive modelling to help generate high-quality leads for our customers, helping them answer: where should I focus my outbound sales and marketing efforts to yield the highest possible ROI? We track all the companies in the UK using a variety of data sources, look at sales data for our customers (positive and negative examples from their CRM), and use that to build a predictive model of which leads will convert for them.
  3. So we work with a fair amount of data, from a lot of different sources. We have data pipelines for: taking in new data to keep our data set current; doing analytics on existing data and model building; and processing or transforming our existing data, e.g. indexing a subset of it into Elasticsearch. And as you all know, the more data you deal with, the messier things can get, and the more of a burden maintenance becomes.
  4. In the past, we used to deal with each data pipeline on an ad hoc, individual basis. For a while, this worked fine. As our data set grew, we realized we were quickly creating script soup: entire repositories with directories of scripts that had a lot of boilerplate and clearly needed some abstraction and cleanup.
  5. We had stuff like this: bespoke command line interfaces for each pipeline. Fine once or twice, but when you have a lot of pipelines, it becomes unwieldy.
  6. We also had processor-intensive, longish-running tasks stacked up against one another in the script. If one failed, how could we re-run without repeating a lot of the work that had already been done? Other challenges included simply keeping things modular and well-structured, especially when different data pipelines may be created at different times by different devs. We also had varying levels of reporting across our tasks, so some had great logging and reporting, others didn't. And the list goes on... Now that you understand a bit of the challenge, Stuart is going to take over for a bit to help you understand how we approached this problem.
  7. Stuart: Fortunately, we aren't the only ones to have this problem. In fact, everyone here probably has at some point! A few years ago, the data team at Spotify open-sourced a Python library called Luigi. Created by Erik Bernhardsson and Elias Freider, it is currently maintained by Arash Rouhani, with a really active and responsive community that is open to pull requests; we integrated it within a week. Luigi basically provides a framework for structuring any batch processing job or data pipeline. It helps to abstract away some of the question marks that we just talked about and will stop you going crazy maintaining your pipes.
  8. Luigi has some awesome abstractions. It provides a Task class, which is a template for a single unit of work and outputs a Target, and it's easy to define dependencies between Tasks. Luigi generates a dependency graph at runtime, so you don't have to worry about running scripts in a particular order; it will figure out the best order. Because each task is a unit of work, if something breaks, you can restart in the middle instead of the beginning: you get a graceful restart. This is really useful if you have a task which depends on both a long-running task which runs infrequently and a short-running task which runs every day, say; there is no need to rerun your long-running task. Luigi in that sense is idempotent. It comes with intuitive command line integration so that you can pass parameters into your tasks without writing boilerplate, and it can notify you when tasks fall over so you don't have to waste time babysitting. And the list goes on! We'll be the first to say that we aren't experts in Luigi just yet, but we have slowly started converting some of our data pipelines to use it, and have been adding new pipelines exclusively in Luigi. We'll go through a couple of simple examples now that demonstrate the power of the framework but are simple enough to follow in a 25min talk. We'll also mention a few of its limitations we've discovered so far, and then open up for questions and discussion.
  9. Let's imagine that we have miraculously been given a csv file that contains data about the limited companies here in the UK. It just has simple metadata like the company's registration number, the company name, incorporation date, and sector. For the purposes of this example, let's imagine we simply want to count the number of unique companies currently operating in the UK and write that count to a file on disk. So the workflow might look something like the above: read the file, count the unique company names, and write the count to a text file.
  10. Here we've defined our task "CompanyCount", and it inherits from the vanilla Luigi task. It has a couple of methods, output and run. Run simply contains the business logic of the task and is executed when the task runs; you can put whatever processing logic you want in here. Output returns a Luigi Target, and valid output can be a lot of things: a location on disk, on a remote server, or a location in a database. In this case, we're simply writing to disk. When we run this from the command line, Luigi executes the code in the run method and finishes by writing the count to the output target. Great! But we obviously can't do much with this data. To make sure we have the latest count, instead of using our local, outdated file, we're going to go and get the latest data from a UK government server.
  11. We can break this flow into two units of work: a download task and a processing task. The Task class has another method, requires, which makes it simple to define dependencies between Tasks. In this case, we simply say that the CompanyCount task requires the CompanyDownload task. Let's see how that changes our CompanyCount task:
  12. Here we’ve made two changes to our CompanyCount task: a “requires” method, specifying that CompanyDownload is required before CompanyCount can complete successfully. we’ve replaced the name of the file with the self.input() method, which returns the Target object that the Task requires. In this case, the LocalTarget returned by CompanyDownload. Now we need to define our CompanyDownload task.
  13. CompanyDownload is a simple task that goes up and gets the company data and downloads it to our directory. The output method returns a target object pointing to a file location on disk. The run method simply downloads the file and writes it to the output location. Note that this output target becomes the input for any task that requires this one (in our case, this is the CompanyCount task). Now, let’s try running this from the command line
  14. To run our company count task from beginning to end, we simply call python company_flow.py CompanyCount. That tells Luigi which task we want to run. It's also worth noting that we told Luigi to use the local scheduler. This tells Luigi not to use the central scheduler, a daemon that comes bundled with Luigi and handles scheduling tasks. We'll talk about what that's good for in a bit, but for now, we just use the local scheduler. When we run this from the command line, Luigi builds up a dependency graph and sees that before it can run CompanyCount, it needs to run CompanyDownload. It establishes this by checking whether each task is complete, which by default means checking whether the Target returned by its output method already exists. If it does, that task is marked as DONE; otherwise it's included in the task queue. So first Luigi runs CompanyDownload, and then, if it executes successfully, it runs CompanyCount and generates a new count for us. So that was the MVP Luigi task, but from these simple building blocks it is possible to build up complicated examples quite quickly.
  15. Dylan: Having a count of all the companies in the UK is pretty cool, but what if we wanted to get way more awesome and visualize how the overall number of companies in the UK has changed over the past year? We can do that pretty easily with our current code, mostly thanks to the fact that our tasks only do one job each. We just need to add a third task that calls out to our previous tasks, getting the company data for each month in a given year, and then outputs something useful: a csv, or a histogram of the counts returned.
  16. A couple of interesting things are happening here. First, we are passing in a year as a parameter. Luigi intelligently accepts defined parameters as command line args, so no boilerplate needed! Second, the task uses the year to dynamically generate its requirements. e.g. for each month in this year, run CompanyCount for that month This triggers a download for that date’s data if we don’t have it already. In order for this to work, we’ll have to add a date parameter to our previous tasks Note that we didn’t include our output or run method here this could be a histogram or a csv, whatever you want to do.
  17. And here you can see how we’ve added a date parameter to the CompanyCount task. We’ll also need to add this to the CompanyDownload task (not shown) Now, we can trigger quite a few subtasks with just one task This can be hard to keep track of Luckily, the luigi central-scheduler comes with a basic visualizer. Remember the first time we ran our task sequence from the CLI with --local-scheduler? Let’s try it again, but this time let’s start a luigi central scheduler.
  18. To start a scheduler, we just run luigid from the command line in the background. Next we run our task, leaving off the --local-scheduler option this time, which tells Luigi to use the central scheduler. Note: this is useful in production because it ensures that two instances of the same task never run simultaneously. It's also awesome in development because you can visualize task dependencies. While this is running, we can visit localhost:8082 and check out the dependency graph that Luigi has created. Our simple command spawned 24 subtasks: a download and a company count task for each month of 2014. The colors represent task status, so all of our previous tasks have run and the delta task is still in progress; if a task fails, it's marked as red. Now that you've seen the central scheduler, let's talk a bit about how Luigi integrates nicely with other tools like MySQL.
  19. In reality, we're going to want to store the companies data in MySQL so we can use it for modeling and ad hoc querying. We define a new task, CompaniesToMySQL. It simply takes in a date param and writes the companies to a table for that month. In this way, we can leverage the download task that we created previously and run this task completely separately from our analytics tasks. Let's look at how we can represent this in code.
  20. You'll notice that this looks very different from the tasks you've seen before. This is because our task isn't inheriting directly from the vanilla Luigi task: we are using the contrib.sqla module. The SQLA CopyToTable task provides powerful abstractions on top of a base Luigi task when working with SQLAlchemy. It assumes the output of the task will be a SQLAlchemy table you can control by specifying the connection string, table, and columns (if the table doesn't already exist). Instead of a run method, we override the rows method, which returns a generator of row tuples. This simplifies things, because you can do all the processing you want in the rows method and let the task deal with batch inserts. When we run this, Luigi first checks to make sure that CompanyDownload has been run for the date we specified, and then it runs the copy-to-table task, inserting the records.
  21. One last thing to think about for now, what happens when something falls over? Luigi handles errors intelligently. Because tasks are individual units of work, if something breaks, you don’t have to re-run everything. You can simply restart the task, and the dependencies that finished will be skipped. We also want to get a notification with a stack trace. we can just add a configuration file specifying the emails to send the stack trace to and here’s an example of an email from Luigi in a case where you try to divide by zero (lolz) It’s worth noting that in addition to system wide luigi settings, you can also specify settings on a per task basis in the config.
  22. Stuart There is a ton of extensibility and integration with other services that Luigi provides abstractions for, and we’ve listed them out here Definitely check out the contrib docs for more info. Here’s a quick example of how you might use the hadoop module
  23. We point our files to HDFS Rather than implementing a run() method, we can have a mapper() and reducer() method
  24. Although Luigi is pretty awesome, there are some limitations worth pointing out. Luigi does not come with built-in triggering, so you still need to rely on something like crontab to trigger workflows periodically. Luigi does not support distributed execution; when you have workers running thousands of jobs daily, this starts to matter, because the worker nodes get overloaded. That's probably OK for a lot of you. The idea was that the API was more important than the architecture. If you are interested in the architecture side, you may want to check out Airflow, which the Airbnb team has just open-sourced.
  25. Definitely check out the docs, join the mailing list (it’s pretty active), and check out the repo. There’s active churn on the issues and the maintainers are super responsive. Docs could use more examples probably, that could be your first contribution!