This document summarizes a presentation about using Python, Celery, and RabbitMQ for data processing. It describes using Celery to efficiently process large amounts of data from multiple sources in parallel and deploy the results to different targets. It provides a practical example of using Celery to parse 500,000 Enron emails and load them into a MySQL database and an Elasticsearch index. The example code demonstrates setting up Celery, defining tasks, and using Fabric to start workers and process files.
1. Data Processing with Python / Celery and RabbitMQ
for the New England Regional Developers (NERD) Summit
Jeff Peck
9/11/2015
2. Introduction
Jeff Peck
Senior Software Engineer
Code Ninja
jpeck@esperdyne.com
www.esperdyne.com
Esperdyne Technologies, LLC
245 Russell Street, Suite 23
Hadley, MA 01035-9558
3. The Goal of this Presentation
● Understand the challenges of real-life data processing scenarios
● Consider the possible solutions
● Describe an approach using Python / Celery and RabbitMQ
● Discover how you can process data with Celery, from scratch, by walking through a real example
4. Agenda
● Background
● The Challenge
● Approaches Considered
● About Celery / Task Queues
● Practical Example: Processing Emails
● Questions
5. Background
● We process data for ~5 million industrial parts each week
● Data comes from different sources
● Some structured / some unstructured
● Multiple deploy targets: MySQL / FAST ESP
● Database deploy of non-item-specific data (e.g. catalog data, taxonomy data, etc.)
● Metadata processing
● Various dependencies before processing and pushing to production
6. Background
[Diagram: structured catalog data, unstructured PDF data, and metadata feed the processing pipeline, which deploys to a database and a search index]
7. The Challenge
● Efficiently process data from multiple sources
● Consider all dependencies
● Deploy to multiple targets in parallel
● Capture the success/failure of each item to be able to generate a report
● Build a process that can be easily triggered to handle all aspects of data processing on a weekly basis
8. Approaches
● Process everything in separate batches
– Fine for a small amount of data
– Lots of manual steps
– Almost no parallel processing
– Would take approximately one week to process all data
● Pypes
– Flow-based programming paradigm
– “Components” and “Packets”
– Lacked flexibility to spawn multiple jobs from a single component
9. “This Calls for Some Celery!”
● Celery: Distributed Task Queue
● Written in Python
● Integrates with RabbitMQ and Redis
● Supports task chaining
● Extremely Flexible
● Distributed
– Can manage multiple queues
● Very active community
– (over 10k downloads per day)
10. Celery
● “Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.”
● http://www.celeryproject.org/
● pip install -U Celery
● Supports callbacks or task chaining
● Ideal for processing data from different sources and deploying to multiple targets, while collecting the status of individual items
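
As a minimal illustration of the task-queue model (not from the slides): a task is a plain function decorated with @app.task, and calling .delay() hands it to a worker via the broker instead of running it inline. The add task below is hypothetical and assumes a Celery app object like the one defined later in this deck.

from proj.celery import app   # the Celery app configured on a later slide

@app.task
def add(x, y):
    """Executed by a worker process, not by the caller."""
    return x + y

result = add.delay(2, 3)        # enqueue; returns an AsyncResult right away
print result.get(timeout=10)    # -> 5, once a worker has picked it up
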
11. What is a Distributed Task Queue?
● A message queue passes, holds, and delivers messages across a system or application
● A task queue is a type of message queue that deals with tasks, such as processing some data
● A distributed task queue combines multiple task queues across systems
12. Workers, Brokers, and Backends
● In Celery, a worker executes tasks that are passed to it from the message broker
● The message broker is the service that sends and receives the messages (i.e. the message queue). Celery is compatible with many different brokers such as Redis, MongoDB, IronMQ, etc. We use RabbitMQ.
● A backend is necessary if you want to store the results of tasks or send the states somewhere (e.g. when executing a “group” of tasks)
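
To make the three roles concrete (an illustrative example, not from the slides): the broker and backend are just URLs in the app configuration, while each worker is a separate OS process started from the command line. Assuming a Celery app module named proj, as in the example that follows:

$ celery worker -A proj -Q parse --concurrency=2 --loglevel=info

The worker connects to the broker, consumes tasks from the parse queue, and writes task results to the backend if one is configured.
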
13. Practical Example: Processing Emails
● 500k emails recovered from Enron
● Goal is to parse each email and load it into Elasticsearch and MySQL
● We could do this manually in stages, but we want to take full advantage of our resources and minimize our interaction with the process
● We will use Celery, RabbitMQ, and Redis
● All of the source code for this example is available here: https://github.com/esperdyne/celery-message-processing
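
The slides that follow import celery, sqlalchemy, elasticsearch, and fabric. As a guess at the environment (including MySQL-python as the MySQL driver behind SQLAlchemy's mysql:// URL), the Python-side dependencies could be installed with:

$ pip install celery sqlalchemy elasticsearch fabric MySQL-python
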
16. Email Processing: Setup
● Create a new directory for the project
● Create the proj directory and put an empty __init__.py file in it
● Download the raw Enron emails
$ mkdir celery-message-processing
$ cd celery-message-processing
$ mkdir proj
$ touch proj/__init__.py
$ wget http://www.cs.cmu.edu/~enron/enron_mail_20150507.tgz
$ tar -xvf enron_mail_20150507.tgz
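
One step not shown on the slide: RabbitMQ (the broker) and Redis (the result backend) must be running before any workers start. How you launch them is platform-dependent; on a typical Linux machine with both installed as services it might look like:

$ sudo service rabbitmq-server start
$ sudo service redis-server start
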
17. Email Processing: The Celery file
● Inside the proj dir, create a file called celery.py and open it with your favorite text editor (e.g. emacs proj/celery.py)
from __future__ import absolute_import
from celery import Celery

app = Celery('proj',
             broker='amqp://',
             backend='redis://localhost',
             include=['proj.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
18. Email Processing: The Tasks File
● Now, create another file inside the proj directory called tasks.py and open it for editing
● Write the following imports:
from __future__ import absolute_import
import email
from sqlalchemy import *
from elasticsearch import Elasticsearch
from celery import Task
from proj.celery import app
19. Email Processing: Tasks File (cont)
class MessagesTask(Task):
    """This is a celery abstract base class that contains all of the logic for
    parsing and deploying content."""
    abstract = True
    _messages_table = None
    _elasticsearch = None

    def _init_database(self):
        """Set up the MySQL database"""
        db = create_engine('mysql://root@localhost/messages')
        metadata = MetaData(db)
        messages_table = Table('messages', metadata,
                               Column('message_id', String(255), primary_key=True),
                               Column('subject', String(255)),
                               Column('to', String(255)),
                               Column('x_to', String(255)),
                               Column('from', String(255)),
                               Column('x_from', String(255)),
                               Column('cc', String(255)),
                               Column('x_cc', String(255)),
                               Column('bcc', String(255)),
                               Column('x_bcc', String(255)),
                               Column('payload', Text()))
        messages_table.create(checkfirst=True)
        self._messages_table = messages_table

    def _init_elasticsearch(self):
        """Set up the ElasticSearch instance"""
        self._elasticsearch = Elasticsearch()

    ...
20. Email Processing: Tasks File (cont)
    ...

    def parse_message_file(self, filename):
        """Parse an email file. Return as dictionary"""
        with open(filename) as f:
            message = email.message_from_file(f)
        return {'subject': message.get("Subject"),
                'to': message.get("To"),
                'x_to': message.get("X-To"),
                'from': message.get("From"),
                'x_from': message.get("X-From"),
                'cc': message.get("Cc"),
                'x_cc': message.get("X-cc"),
                'bcc': message.get("Bcc"),
                'x_bcc': message.get("X-bcc"),
                'message_id': message.get("Message-ID"),
                'payload': message.get_payload()}

    def database_insert(self, message_dict):
        """Insert a message into the MySQL database"""
        if self._messages_table is None:
            self._init_database()
        ins = self._messages_table.insert(values=message_dict)
        ins.execute()

    def elasticsearch_index(self, id, message_dict):
        """Insert a message into the ElasticSearch index"""
        if self._elasticsearch is None:
            self._init_elasticsearch()
        self._elasticsearch.index(index="messages", doc_type="message", id=id, body=message_dict)
21. Email Processing: Tasks File (cont)
@app.task(base=MessagesTask, queue="parse")
def parse(filename):
    """Parse an email file. Return as dictionary"""
    # Call the method in the base task and return the result
    return parse.parse_message_file(filename)


@app.task(base=MessagesTask, queue="db_deploy", ignore_result=True)
def deploy_db(message_dict):
    """Deploys the message dictionary to the MySQL database table"""
    # Call the method in the base task
    deploy_db.database_insert(message_dict)


@app.task(base=MessagesTask, queue="es_deploy", ignore_result=True)
def deploy_es(message_dict):
    """Deploys the message dictionary to the Elastic Search instance"""
    # Call the method in the base task
    deploy_es.elasticsearch_index(message_dict['message_id'], message_dict)
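
Before wiring these into a pipeline, it can help to see how a single task is invoked by hand. This is an illustrative sketch, not from the slides; the file path is hypothetical, and it assumes the broker, backend, and workers are already running:

# Illustrative only: enqueue one file by hand from a Python shell.
from proj.tasks import parse, deploy_db, deploy_es

# .delay() sends the task to the broker and returns an AsyncResult immediately
result = parse.delay("maildir/example-user/inbox/1.")   # hypothetical path
message_dict = result.get(timeout=30)                   # block until the parse worker finishes

# The deploy tasks each take the parsed dictionary
deploy_db.delay(message_dict)
deploy_es.delay(message_dict)

The next slides replace this manual hand-off with a chain, so the deploys are enqueued automatically as soon as the parse result is available.
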
22. Email Processing: Fabric Script
● I use Fabric to start/stop the Celery workers and to pass the raw emails to be processed
● Make a fabfile.py in the base directory and open it for editing
import os
from fabric.api import local
from celery import chain, group
from celery.task.control import inspect
from proj.tasks import parse, deploy_db, deploy_es
23. Email Processing: Fabric (cont)
def workers(action):
    """Issue command to start, restart, or stop celery workers"""
    # Prepare the directories for pids and logs
    local("mkdir -p celery-pids celery-logs")
    # Launch 4 celery workers for 4 queues (parse, db_deploy, es_deploy, and default)
    # Each has a concurrency of 2 except the default which has a concurrency of 1
    # More info on the format of this command can be found here:
    # http://docs.celeryproject.org/en/latest/reference/celery.bin.multi.html
    local("celery multi {} parse db_deploy es_deploy celery "
          "-Q:parse parse -Q:db_deploy db_deploy -Q:es_deploy es_deploy -Q:celery celery "
          "-c 2 -c:celery 1 "
          "-l info -A proj "
          "--pidfile=celery-pids/%n.pid --logfile=celery-logs/%n.log".format(action))
● Start/stop the workers with fabric
Usage example:
$ fab workers:start
$ fab workers:stop
$ fab workers:restart
24. Email Processing: Fabric (cont)
● Task Chaining
def process_one(filename=None):
    """Enqueues a mail file for processing"""
    res = chain(parse.s(filename), group(deploy_db.s(), deploy_es.s()))()
    print "Enqueued mail file for processing: {} ({})".format(filename, res)


def process(path=None):
    """Enqueues a mail file for processing. Optionally, submitting a
    directory will enqueue all files in that directory"""
    if os.path.isfile(path):
        process_one(path)
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                process_one(os.path.join(subpath, name))
25. Email Processing: Usage
● To start a build cycle, this is all that you need to do:
$ fab workers:start
$ fab process:maildir
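
The fabfile imports inspect but the slides never show it used; a hypothetical helper (illustrative only, added here as a sketch) could use it to check how busy the workers are while a build cycle runs:

def status():
    """Hypothetical fab task: show active and reserved task counts per worker."""
    i = inspect()
    active = i.active() or {}
    reserved = i.reserved() or {}
    for worker in sorted(active):
        print "{}: {} active, {} reserved".format(
            worker, len(active[worker]), len(reserved.get(worker, [])))

$ fab status
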
26. Email Processing: What next?
● Implement a “chord”:
– Trigger a task to update an email's status after it has been successfully processed and deployed to MySQL and Elasticsearch (a rough sketch follows below)
● Handle errors:
– Write to a special log file every time an error occurs, with a custom error handler
● Reporting:
– Detect the completion of processing with a scheduled task that confirms that all tasks are complete, and automatically email a report with the number of successful / failed messages
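
A rough sketch of the chord idea, not from the talk: in Celery, a group chained into another task is upgraded to a chord, so the final task runs only after every task in the group has finished. The mark_complete task below is hypothetical (it would have to live in a module the workers import, such as proj/tasks.py), and the deploy tasks would need ignore_result=True removed so the chord can collect their results.

# Hypothetical sketch only; not part of the demo repo.
from celery import chain, group
from proj.celery import app
from proj.tasks import parse, deploy_db, deploy_es

@app.task
def mark_complete(deploy_results, filename):
    """Hypothetical callback: runs once both deploy tasks have finished."""
    print "Finished processing {}: {}".format(filename, deploy_results)

def process_one_with_status(filename):
    # A group chained into a following task is upgraded to a chord, so
    # mark_complete runs only after both deploys have completed.
    chain(parse.s(filename),
          group(deploy_db.s(), deploy_es.s()),
          mark_complete.s(filename))()
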
27. Email Processing: Try it yourself
● All of the source code and instructions for this demo are available here: https://github.com/esperdyne/celery-message-processing
● Can be used as a boilerplate for an unrelated Celery project
● Fork, experiment, ask questions, etc.
28. One More Thing: Celery Flower
● There is a tool that provides real-time monitoring for your Celery instance, called “Flower”: https://github.com/mher/flower
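
Flower is a separate package; one typical way to install it and point it at this project (exact flags may vary by version) is:

$ pip install flower
$ celery flower -A proj --port=5555

It then serves a web dashboard (by default at http://localhost:5555) showing queues, workers, and task history.
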
29. Any Questions?
(Can you spare a guess as to why that question mark isn't made out of celery?)