Reporting and analysis systems rely on coherent and reliable data, often drawn from disparate sources. To that end, a set of well-established data warehousing practices has emerged for extracting data and producing a consistent data store.
This talk will look at some options for composing workflows using Python. In particular, we'll explore beyond Celery's asynchronous task processing functionality into its workflow (aka Canvas) system and how it can be used in conjunction with SQLAlchemy's architecture to provide the building blocks for data stream processing.
Building data flows with Celery and SQLAlchemy
1. Building data flows with Celery and SQLAlchemy
PyCon Australia 2013
Roger Barnes
@mindsocket
roger@mindsocket.com.au
http://slideshare.net/mindsocket
2. Coming up
● Data warehousing
– AKA data integration
● Processing data flows
– SQLAlchemy
– Celery
● Tying it all together
3. About me
● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
– This talk based on a real reporting system
10. Python can help!
● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
– data flow management
– data processing
– business logic
11. Existing solutions
● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
– Framework for Python 3
– "Focus on the process, not the data technology"
12. Ways to move data around
● Flat files
● NOSQL data stores
● RDBMS
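For the RDBMS case, the extract-and-load step can be done with SQLAlchemy Core alone. This is a minimal sketch, not code from the talk: the `sales` table and the use of in-memory SQLite for both ends are assumptions standing in for real source and target databases.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

# In-memory SQLite stands in for any source/target RDBMS here.
source = create_engine("sqlite:///:memory:")
target = create_engine("sqlite:///:memory:")

metadata = MetaData()
# Hypothetical table; a real flow would reflect or define its own schema.
sales = Table("sales", metadata,
              Column("id", Integer, primary_key=True),
              Column("region", String),
              Column("amount", Integer))
metadata.create_all(source)
metadata.create_all(target)

# Seed the source with some example rows.
with source.begin() as conn:
    conn.execute(sales.insert(), [
        {"region": "APAC", "amount": 100},
        {"region": "EMEA", "amount": 250},
    ])

# Extract from the source and load into the target in one pass.
with source.connect() as src, target.begin() as dst:
    rows = [dict(row._mapping) for row in src.execute(select(sales))]
    dst.execute(sales.insert(), rows)

with target.connect() as conn:
    total = sum(row.amount for row in conn.execute(select(sales)))
```

The same Core expressions work against any dialect SQLAlchemy supports, which is what makes it a useful glue layer between otherwise incompatible stores.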
33. A Processor Task
class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform

DeriveFooTask().apply_async()  # Run it!
34. Canvas: Designing Workflows
● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways
>>> new_user_workflow = (create_user.s() | group(
...     import_contacts.s(),
...     send_welcome_email.s()))
>>> new_user_workflow.delay(username='artv',
...     first='Art',
...     last='Vandelay',
...     email='art@vandelay.com')
35. Sample Data Processing Flow
● In parallel:
– Extract sales, then copy sales to transform
– Extract customers, then copy customers to transform
– Extract products, then copy products to transform
● Join tables
● Then, in parallel:
– Aggregate sales by customer, then normalise currency
– Aggregate sales by region
– Customer data exception report
39. Turning it up to 11
● A requires/depends structure
● Incremental data loads
● Parameterised flows
● Tracking flow history
● Hooking into other libraries
– NLTK
– SciPy/NumPy
– ...
40. Summary
● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas