Data Processing with Python / Celery and RabbitMQ
for the New England Regional Developers (NERD) Summit
Jeff Peck
9/11/2015
Introduction
Jeff Peck
Senior Software Engineer
Code Ninja
jpeck@esperdyne.com
www.esperdyne.com
Esperdyne Technologies, LLC
245 Russell Street, Suite 23
Hadley, MA 01035-9558
The Goal of this Presentation
● Understand the challenges of real-life data processing scenarios
● Consider the possible solutions
● Describe an approach using Python / Celery and RabbitMQ
● Discover how you can process data with Celery, from scratch, by walking through a real example
Agenda
● Background
● The Challenge
● Approaches Considered
● About Celery / Task Queues
● Practical Example: Processing Emails
● Questions
Background
● We process data for ~5 million industrial parts each week
● Data comes from different sources
● Some structured / some unstructured
● Multiple deploy targets: MySQL / FAST ESP
● The database deploy handles non-item-specific data (e.g. catalog data, taxonomy data)
● Metadata processing
● Various dependencies before processing and pushing to production
Background
[Diagram: structured catalog data, unstructured PDF data, and metadata feed into the database and the search index]
The Challenge
● Efficiently process data from multiple sources
● Consider all dependencies
● Deploy to multiple targets in parallel
● Capture the success/failure of each item to be able to generate a report
● Build a process that can be easily triggered to handle all aspects of data processing on a weekly basis
Approaches
● Process everything in separate batches
– Fine for small amounts of data
– Lots of manual steps
– Almost no parallel processing
– Would take approximately one week to process all data
● Pypes
– Flow-based programming paradigm
– “Components” and “Packets”
– Lacked the flexibility to spawn multiple jobs from a single component
“This Calls for Some Celery!”
● Celery: Distributed Task Queue
● Written in Python
● Integrates with RabbitMQ and Redis
● Supports task chaining
● Extremely Flexible
● Distributed
– Can manage multiple queues
● Very active community
– (over 10k downloads per day)
Celery
● “Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.”
● http://www.celeryproject.org/
● pip install -U Celery
● Supports callbacks or task chaining
● Ideal for processing data from different sources, and deploying to multiple targets, while collecting status of individual items
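
As a hedged illustration of the chaining idea (add and mul are hypothetical tasks, not part of this talk's example, and a local RabbitMQ broker is assumed):

from celery import Celery, chain

app = Celery('demo', broker='amqp://')  # assumes a local RabbitMQ broker

@app.task
def add(x, y):
    return x + y

@app.task
def mul(x, y):
    return x * y

# Callback style: mul runs after add, receiving add's result as its first argument
add.apply_async((2, 2), link=mul.s(4))

# Chain style: the same pipeline expressed as composable signatures
chain(add.s(2, 2), mul.s(4))()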
What is a Distributed Task Queue?
● A message queue passes, holds, and delivers messages across a system or application
● A task queue is a type of message queue that deals with tasks, such as processing some data
● A distributed task queue combines multiple task queues across systems
Workers, Brokers, and Backends
● In Celery, a worker executes tasks that are passed to it from the message broker
● The message broker is the service that sends and receives the messages (i.e. the message queue). Celery is compatible with many different brokers, such as Redis, MongoDB, IronMQ, etc. We use RabbitMQ.
● A backend is necessary if you want to store the results of tasks or send the states somewhere (i.e. when executing a “group” of tasks)
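
For reference, a single worker bound to one queue can be started directly; a minimal sketch (the Fabric script later in this deck launches several workers at once):

$ celery worker -A proj -Q parse -l info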
Practical Example: Processing Emails
● 500k emails recovered from Enron
● Goal is to parse each email and load it into ElasticSearch and MySQL
● We could do this manually in stages, but we want to take full advantage of our resources and minimize our interaction with the process
● We will use Celery, RabbitMQ, and Redis
● All of the source code for this example is available here: https://github.com/esperdyne/celery-message-processing
Email Processing
[Diagram: emails flow through a parse stage and out to ElasticSearch and MySQL]
Email Processing: Setup
● Install:
– RabbitMQ
– Redis
– Celery
– Fabric
– MySQL
– ElasticSearch
Install RabbitMQ:
$ sudo apt-get install rabbitmq-server
Install Redis:
$ sudo apt-get install redis-server
$ sudo pip install redis
Install Celery:
$ sudo pip install celery
Install Fabric:
$ sudo pip install fabric
Install ElasticSearch:
$ sudo apt-get install openjdk-7-jre
$ wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb http://packages.elastic.co/elasticsearch/1.7/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-1.7.list
$ sudo apt-get update && sudo apt-get install elasticsearch
$ sudo update-rc.d elasticsearch defaults 95 10
$ sudo pip install elasticsearch
$ sudo service elasticsearch start
Install MySQL:
$ sudo apt-get install mysql-server
$ sudo apt-get build-dep python-mysqldb
$ sudo pip install MySQL_python
$ sudo pip install sqlalchemy
Make “messages” database:
$ mysql -u root -e "CREATE DATABASE messages"
Email Processing: Setup
● Create a new directory for the project
● Create the proj directory and put an empty __init__.py file in it.
● Download the raw Enron emails
$ mkdir celery-message-processing
$ cd celery-message-processing
$ mkdir proj
$ touch proj/__init__.py
$ wget http://www.cs.cmu.edu/~enron/enron_mail_20150507.tgz
$ tar -xvf enron_mail_20150507.tgz
Email Processing: The Celery file
● Inside the proj dir, create a file called celery.py and open it with your favorite text editor (e.g. emacs proj/celery.py)

from __future__ import absolute_import

from celery import Celery

app = Celery('proj',
             broker='amqp://',
             backend='redis://localhost',
             include=['proj.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
Email Processing: The Tasks File
● Now, create another file inside the proj directory called tasks.py and open it for editing.
● Write the following imports:
from __future__ import absolute_import
import email
from sqlalchemy import *
from elasticsearch import Elasticsearch
from celery import Task
from proj.celery import app
Email Processing: Tasks File (cont)

class MessagesTask(Task):
    """This is a celery abstract base class that contains all of the logic
    for parsing and deploying content."""
    abstract = True
    _messages_table = None
    _elasticsearch = None

    def _init_database(self):
        """Set up the MySQL database"""
        db = create_engine('mysql://root@localhost/messages')
        metadata = MetaData(db)
        messages_table = Table('messages', metadata,
                               Column('message_id', String(255), primary_key=True),
                               Column('subject', String(255)),
                               Column('to', String(255)),
                               Column('x_to', String(255)),
                               Column('from', String(255)),
                               Column('x_from', String(255)),
                               Column('cc', String(255)),
                               Column('x_cc', String(255)),
                               Column('bcc', String(255)),
                               Column('x_bcc', String(255)),
                               Column('payload', Text()))
        messages_table.create(checkfirst=True)
        self._messages_table = messages_table

    def _init_elasticsearch(self):
        """Set up the ElasticSearch instance"""
        self._elasticsearch = Elasticsearch()

    ...
Email Processing: Tasks File (cont)

    ...

    def parse_message_file(self, filename):
        """Parse an email file. Return as dictionary"""
        with open(filename) as f:
            message = email.message_from_file(f)
        return {'subject': message.get("Subject"),
                'to': message.get("To"),
                'x_to': message.get("X-To"),
                'from': message.get("From"),
                'x_from': message.get("X-From"),
                'cc': message.get("Cc"),
                'x_cc': message.get("X-cc"),
                'bcc': message.get("Bcc"),
                'x_bcc': message.get("X-bcc"),
                'message_id': message.get("Message-ID"),
                'payload': message.get_payload()}

    def database_insert(self, message_dict):
        """Insert a message into the MySQL database"""
        if self._messages_table is None:
            self._init_database()
        ins = self._messages_table.insert(values=message_dict)
        ins.execute()

    def elasticsearch_index(self, id, message_dict):
        """Insert a message into the ElasticSearch index"""
        if self._elasticsearch is None:
            self._init_elasticsearch()
        self._elasticsearch.index(index="messages", doc_type="message", id=id, body=message_dict)
Email Processing: Tasks File (cont)
@app.task(base=MessagesTask, queue="parse")
def parse(filename):
    """Parse an email file. Return as dictionary"""
    # Call the method in the base task and return the result
    return parse.parse_message_file(filename)

@app.task(base=MessagesTask, queue="db_deploy", ignore_result=True)
def deploy_db(message_dict):
    """Deploys the message dictionary to the MySQL database table"""
    # Call the method in the base task
    deploy_db.database_insert(message_dict)

@app.task(base=MessagesTask, queue="es_deploy", ignore_result=True)
def deploy_es(message_dict):
    """Deploys the message dictionary to the Elastic Search instance"""
    # Call the method in the base task
    deploy_es.elasticsearch_index(message_dict['message_id'], message_dict)
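
With the tasks defined, a quick smoke test is possible from a Python shell; a sketch, assuming the workers and services above are running (the mail file path is hypothetical):

from proj.tasks import parse, deploy_db

# Enqueue a parse task and wait for its result via the Redis backend
result = parse.delay("maildir/example-user/inbox/1.")  # hypothetical path
message_dict = result.get(timeout=30)

# Feed the parsed dictionary to one of the deploy tasks
deploy_db.delay(message_dict)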
Email Processing: Fabric Script
● I use Fabric to start/stop the Celery workers and to pass the raw emails to be processed
● Make a fabfile.py in the base directory and open it for editing
import os
from fabric.api import local
from celery import chain, group
from celery.task.control import inspect
from proj.tasks import parse, deploy_db, deploy_es
Email Processing: Fabric (cont)
def workers(action):
    """Issue command to start, restart, or stop celery workers"""
    # Prepare the directories for pids and logs
    local("mkdir -p celery-pids celery-logs")
    # Launch 4 celery workers for 4 queues (parse, db_deploy, es_deploy, and default)
    # Each has a concurrency of 2 except the default, which has a concurrency of 1
    # More info on the format of this command can be found here:
    # http://docs.celeryproject.org/en/latest/reference/celery.bin.multi.html
    local("celery multi {} parse db_deploy es_deploy celery "
          "-Q:parse parse -Q:db_deploy db_deploy -Q:es_deploy es_deploy -Q:celery celery "
          "-c 2 -c:celery 1 "
          "-l info -A proj "
          "--pidfile=celery-pids/%n.pid --logfile=celery-logs/%n.log".format(action))
● Start/stop the workers with fabric
Usage example:
$ fab workers:start
$ fab workers:stop
$ fab workers:restart
Email Processing: Fabric (cont)
● Task Chaining
def process_one(filename=None):
    """Enqueues a mail file for processing"""
    res = chain(parse.s(filename), group(deploy_db.s(), deploy_es.s()))()
    print "Enqueued mail file for processing: {} ({})".format(filename, res)

def process(path=None):
    """Enqueues a mail file for processing. Optionally, submitting a
    directory will enqueue all files in that directory"""
    if os.path.isfile(path):
        process_one(path)
    elif os.path.isdir(path):
        for subpath, subdirs, files in os.walk(path):
            for name in files:
                process_one(os.path.join(subpath, name))
Email Processing: Usage
● To start a build cycle, this is all that you need to do:
$ fab workers:start
$ fab process:maildir
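
While the build cycle runs, you can watch the queues drain on the broker side; one way to check, assuming rabbitmqctl is available on the host:

$ sudo rabbitmqctl list_queues name messages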
Email Processing: What next?
● Implement a “chord”:
– Trigger a task to update an email's status after it has been successfully processed and deployed to MySQL and ElasticSearch (see the sketch after this list)
● Handle errors:
– Write to a special log file every time an error occurs, using a custom error handler
● Reporting:
– Detect the completion of processing with a scheduled task that confirms that all tasks are complete, and automatically email a report with the number of successful / failed messages
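
A minimal sketch of the chord idea, with a hypothetical mark_complete callback task (note that the tasks in a chord's header must not ignore their results, so deploy_db and deploy_es would need ignore_result removed):

from celery import chord
from proj.celery import app
from proj.tasks import deploy_db, deploy_es

@app.task
def mark_complete(results, message_id):
    """Hypothetical callback: record that both deploys finished for one message."""
    print "Message {} fully deployed".format(message_id)

def deploy_with_status(message_dict):
    # The header group runs in parallel; mark_complete fires once
    # every task in the group has succeeded
    chord([deploy_db.s(message_dict), deploy_es.s(message_dict)])(
        mark_complete.s(message_dict['message_id']))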
Email Processing: Try it yourself
● All of the source code and instructions for this demo are available here: https://github.com/esperdyne/celery-message-processing
● Can be used as a boilerplate for an unrelated Celery project
● Fork, experiment, ask questions, etc.
One More Thing: Celery Flower
● There is a tool that provides real-time monitoring for your Celery instance, called “Flower”: https://github.com/mher/flower
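
A sketch of running Flower against this project (assuming pip and the default port):

$ sudo pip install flower
$ celery flower -A proj --port=5555

Then browse to http://localhost:5555 to watch tasks and queues in real time.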
Any Questions?
(Can you spare a guess as to why that question mark isn't made out of celery?)