PyconJP
2016-09-22
Fabian Dubois
Building a data preparation pipeline with Pandas and AWS Lambda

PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda


Building a data preparation pipeline with Pandas and AWS Lambda

What data preparation is and why it is required.
How to prepare data with pandas.
How to set up a pipeline with AWS Lambda.

https://youtu.be/pc0Xn0uAm34?t=9m15s

Published in: Software


  1. PyconJP, 2016-09-22. Fabian Dubois. Building a data preparation pipeline with Pandas and AWS Lambda
  2. What Will You Learn?
     ▸ What data preparation is and why it is required
     ▸ How to prepare data with pandas
     ▸ How to set up a pipeline with AWS Lambda
  3. About Me
     ▸ Based in Tokyo
     ▸ Using Python with data for 6 years
     ▸ Freelance Data Products Developer and Consultant (data visualization, machine learning)
     ▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization)
     ▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
  4. Why Data Preparation?
  5. So you have got data, now what?
     ▸ Showing it to an audience:
       ▸ a report from a survey?
       ▸ a news article with charts?
       ▸ a sales dashboard?
  6. But a lot of available data is messy
     ▸ incomplete or missing data
     ▸ mis-formatted, mis-typed data
     ▸ wrong / corrupted values
  7. It has every reason to be messy
     ▸ non-availability
     ▸ no appropriate means of collection
     ▸ lack of validation
     ▸ human errors
  8. And this can have very bad consequences
     ▸ Crash in your report generator
     ▸ Incomplete reports
     ▸ Report reaches wrong conclusions
     ▸ Ultimately, if your data is really bad, you cannot trust any conclusion drawn from it
  9. It is not just about quality (ETL)
     ▸ Enriching the data
     ▸ Aggregating
     ▸ Classification (ML)
     ▸ Predictions (ML)
     [Diagram: input 1 and input 2 are each cleaned, then aggregated, classified, … into an output, which is visualized]
  10. Example: data journalism & interactive visualization
      ▸ Often manually gathered data in spreadsheets
      ▸ Data cleaning required
      ▸ Data aggregation / preprocessing required
      ▸ Data may be updated on a weekly basis
  11. If it is a product, it needs to deal with data updates
      [Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
      ▸ Who is going to run the script?
      ▸ Needs to be automated (the pipeline)
  12. What does it apply to?
      [Chart: axes "data quality" (low to high) and "data update frequency" (once, monthly, daily, real-time). Applications shown: ad hoc data analysis; data journalism; interactive reports, email reports; dashboards, data products. Solutions shown: Jupyter notebook; prototype; automated preparation pipeline (batch); micro-batch or real-time processing pipeline. One region is marked "our focus".]
  13. How to prepare data?
  14. Common operations
      ▸ Date parsing
      ▸ Deciding on a strategy for null or non-parseable values
      ▸ Enforcing value ranges
      ▸ Sanitising strings
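The four operations above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column names, the sample values, and the 0 to 120 age range are invented here and do not come from the talk.

```python
import pandas as pd

# Hypothetical raw input: column names and values are invented for illustration.
raw = pd.DataFrame({
    "date": ["2016-09-22", "not a date", "2016-09-24"],
    "age": [34, 41, 250],
    "name": ["  alice ", "BOB", "Carol\t"],
})

clean = raw.copy()

# Date parsing: unparseable values become NaT instead of raising.
clean["date"] = pd.to_datetime(clean["date"], format="%Y-%m-%d", errors="coerce")

# Enforce a value range: ages outside 0..120 (assumed bound) become missing.
clean["age"] = clean["age"].where(clean["age"].between(0, 120))

# Sanitise strings: trim surrounding whitespace and normalise capitalisation.
clean["name"] = clean["name"].str.strip().str.title()

# Null strategy (one of several possible): drop rows without a usable date.
clean = clean.dropna(subset=["date"])
```

Whether to drop, fill, or flag a missing value depends on the target report; dropping is used here only because it is the shortest to show.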
  15. Existing tools
      ▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
      ▸ great tools to check data quality and define transformations
  16. So why custom solutions with Python and Pandas?
      ▸ With Python, you can do anything!
      ▸ It is not that difficult
      ▸ Pandas is a versatile tool for manipulating DataFrames
      ▸ Easy to specify transformations
      ▸ Not limited to Pandas: the whole Python ecosystem is available, such as scikit-learn
  17. Example from a Jupyter notebook
      ▸ load a simple file with a list of names and ages of different people
  18. Example: statistics on groups (names)
      ▸ Is there a relationship between name length and median age?
      ▸ Chain operations
      ▸ Plot the length of name vs age for each name
      [Plot annotation: Warning, outlier]
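A chained version of this group statistic might look like the sketch below. The data frame is a made-up stand-in for the notebook's file, and the column names `name` and `age` are assumptions.

```python
import pandas as pd

# Hypothetical data standing in for the names-and-ages file.
people = pd.DataFrame({
    "name": ["Ann", "Ann", "Robert", "Robert", "Jo"],
    "age": [30, 40, 50, 70, 20],
})

# Chain operations: median age per name, then attach the name length.
stats = (
    people
    .groupby("name")["age"]
    .median()
    .reset_index(name="median_age")
    .assign(name_length=lambda d: d["name"].str.len())
    .sort_values("name_length")
    .reset_index(drop=True)
)
```

With matplotlib installed, `stats.plot.scatter(x="name_length", y="median_age")` would draw the name length against the median age.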
  19. Something is wrong
      ▸ null values
      ▸ label issues
  20. Let’s fix this
      ▸ deal with missing values with `dropna` or `fillna`
      ▸ clean names
      ▸ reject outliers
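The three fixes could be combined roughly as follows. The sample data, the choice of the median as fill value, and the 0 to 120 plausibility bound are assumptions made for this sketch, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical data showing the three problems: nulls, label issues, an outlier.
people = pd.DataFrame({
    "name": ["ann", " Ann", "Bob", None, "Bob", "Cid"],
    "age": [30.0, None, 41.0, 25.0, 410.0, 38.0],
})

fixed = (
    people
    .dropna(subset=["name"])  # rows without a name cannot be grouped
    .assign(name=lambda d: d["name"].str.strip().str.title())  # clean labels
)

# Reject outliers first so that they do not distort the fill value.
plausible = fixed["age"].isna() | fixed["age"].between(0, 120)
fixed = fixed[plausible].copy()

# Fill the remaining missing ages with the median of the known ages.
fixed["age"] = fixed["age"].fillna(fixed["age"].median())
fixed = fixed.reset_index(drop=True)
```

The order matters: filling before rejecting outliers would let the value 410 pull the median upwards.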
  21. Close the loop to improve the data entry/acquisition
      ▸ Many errors can be avoided during data collection:
        ▸ form / column validation
        ▸ drop down selections for categories
      ▸ Report rejected rows to improve collection process
      [Diagram: data → preparation script → list of issues → improve forms…]
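One way to produce the "list of issues" is to split the input into accepted and rejected rows and attach a reason to each rejection. The helper `split_rejected`, the column names, and the rules below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_rejected(df):
    """Return (accepted, rejected); rejected rows carry a 'reason' column."""
    reason = pd.Series([None] * len(df), index=df.index, dtype="object")
    # Later rules overwrite earlier ones when a row breaks several of them.
    reason[df["age"].isna()] = "missing age"
    reason[df["age"] > 120] = "age out of range"
    reason[df["name"].isna()] = "missing name"
    is_rejected = reason.notna()
    rejected = df[is_rejected].assign(reason=reason[is_rejected])
    accepted = df[~is_rejected]
    return accepted, rejected

# Hypothetical input rows.
rows = pd.DataFrame({
    "name": ["Ann", None, "Bob", "Cid"],
    "age": [30.0, 25.0, 410.0, None],
})
accepted, rejected = split_rejected(rows)
```

The `rejected` frame can then be sent back to whoever maintains the form or spreadsheet.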
  22. Testing your preparation
      ▸ Unit tests
      ▸ Test for anticipated edge cases (defensive programming)
      ▸ Property based testing (http://hypothesis.works/)
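As a small illustration of the first two points, the sketch below tests a hypothetical `clean_name` helper, including edge cases. The last test checks a property (idempotence) over a hand-picked list of inputs; a property-based testing library such as Hypothesis would generate such inputs automatically, which is not shown here.

```python
def clean_name(value):
    """Normalise a name label; return None for empty or non-string input."""
    if not isinstance(value, str):
        return None
    cleaned = " ".join(value.split()).title()
    return cleaned or None

def test_clean_name_normalises_whitespace_and_case():
    assert clean_name("  aNN  marie ") == "Ann Marie"

def test_clean_name_rejects_unusable_input():
    # Anticipated edge cases: nulls, blanks, and wrongly typed cells.
    for bad in (None, "", "   ", 42, float("nan")):
        assert clean_name(bad) is None

def test_clean_name_is_idempotent():
    # Property: cleaning an already clean value changes nothing.
    for raw in ("bob", " Bob", "BOB\t", "jean  luc"):
        once = clean_name(raw)
        assert clean_name(once) == once
```

The functions follow the `test_` naming convention, so a runner such as pytest would collect them without further setup.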
  23. More references for data cleaning
      ▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
      ▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
      ▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
  24. Setting up a pipeline with AWS Lambda
  25. Some challenges
      ▸ Don’t let users run scripts
      ▸ Automating is part of a quality process
      ▸ Keeping things simple…
      ▸ and cheap
  26. What is AWS Lambda: a serverless solution
      ▸ Serverless offering by AWS
      ▸ No lifecycle or shared state to manage => resilient
      ▸ Auto-scaling
      ▸ Pay for actual running time: low cost
      ▸ No server or infrastructure management: reduced dev / devops cost
      [Diagram: … events → lambda function → output …]
  27. Creating a function
      ▸ just a Python function
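For readers without the slide's screenshot, a minimal handler might look like the sketch below. The `(event, context)` signature is the one Lambda passes to a Python handler; the `rows` field and the shape of the return value are assumptions chosen for illustration, since the real event depends on the trigger.

```python
import json

def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes with the trigger's event payload."""
    # 'rows' is a hypothetical field; the real event shape depends on the trigger.
    rows = event.get("rows", [])
    return {"statusCode": 200, "body": json.dumps({"row_count": len(rows)})}
```

Because it is a plain function, it can be called locally with a dictionary and `None` as the context before uploading it.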
  28. Creating a function: options
  29. Creating an “architecture” with triggers
  30. Batch processing at regular intervals
      ▸ cron scheduling
      ▸ let your function get some data and process it at regular intervals
  31. An API / webhook
      ▸ on API call
      ▸ can be triggered from a Google spreadsheet
  32. Setting up AWS Lambda for Pandas
      Pandas and its dependencies need to be compiled for Amazon Linux x86_64.
      1. Launch an EC2 instance and connect to it
      2. Install pandas in a virtualenv
      3. Zip the installed libraries

      shell:

          # install compilation environment
          sudo yum -y update
          sudo yum -y upgrade
          sudo yum groupinstall "Development Tools"
          sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel

          # create and activate virtual env
          virtualenv pdenv
          source pdenv/bin/activate

          # install pandas
          pip install pandas

          # zip the environment content (pattern quoted so the shell does not expand it)
          cd ~/pdenv/lib/python2.7/site-packages/
          zip -r ~/pdenv.zip . --exclude "*.pyc"
          cd ~/pdenv/lib64/python2.7/site-packages/
          zip -r ~/pdenv.zip . --exclude "*.pyc"

          # add the supporting libraries
          cd ~/
          mkdir -p libs
          cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
          zip -r ~/pdenv.zip libs
  33. Using pandas from a lambda function
      ▸ The lambda process needs to access those binaries
      ▸ Set up env variables
      ▸ Call a subprocess
      ▸ And pickle the function input
      ▸ AWS will call `lambda_function.lambda_handler`

      python: lambda_function.py

          import os
          import subprocess
          import cPickle as pickle  # Python 2.7

          # directory of the shared libraries, zipped under libs/ in the packaging step
          LIBS = os.path.join(os.getcwd(), 'libs')

          def handler(filename):
              def handle(event, context):
                  # pass the event to the subprocess through a pickle file
                  with open('/tmp/event.p', 'wb') as f:
                      pickle.dump(event, f)
                  # make the bundled shared libraries visible to the subprocess
                  env = os.environ.copy()
                  env.update(LD_LIBRARY_PATH=LIBS)
                  proc = subprocess.Popen(
                      ('python', filename),
                      env=env,
                      stdout=subprocess.PIPE)
                  # communicate() drains stdout while waiting, so a full pipe cannot block
                  out, _ = proc.communicate()
                  return out
              return handle

          lambda_handler = handler('my_function.py')
  34. The actual function
      ▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server
      ▸ Clean it
      ▸ Copy it somewhere

      python: my_function.py

          import pandas as pd
          import pickle
          import requests
          from StringIO import StringIO  # Python 2.7

          def run():
              # get the lambda call arguments
              with open('/tmp/event.p', 'rb') as f:
                  event = pickle.load(f)
              # load some data from a google spreadsheet
              # ({sheet_id} and {page_id} are placeholders to fill in)
              r = requests.get('https://docs.google.com/spreadsheets' +
                               '/d/{sheet_id}/export?format=csv&gid={page_id}')
              data = r.content.decode('utf-8')
              df = pd.read_csv(StringIO(data))
              # Do something
              # save as file
              file_ = StringIO()
              df.to_csv(file_, encoding='utf-8')
              # copy the result somewhere

          if __name__ == '__main__':
              run()
  35. Upload and test
      ▸ add your lambda function code to the environment zip
      ▸ upload your function
  36. Caveat 1: Python 2.7
      ▸ officially, only Python 2.7 is supported
      ▸ but Python 3 is available and can be called as a subprocess
      ▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
  37. Caveat 2: max process memory (1.5GB) and execution time
      ▸ need to split the dataset if too large
      ▸ loop over it in your lambda call:
        ▸ may exceed the timeout
      ▸ map to multiple lambda calls:
        ▸ need to merge the dataset at the end
      ▸ Lambda functions should be simple, chain if required
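The split, map, and merge steps can be tried locally before involving Lambda at all. In the sketch below, `split_frame` and `prepare` are hypothetical helpers, and the list comprehension stands in for what would be one Lambda invocation per chunk; how the chunks are actually dispatched and collected is not covered.

```python
import pandas as pd

def split_frame(df, chunk_size):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

def prepare(chunk):
    """Stand-in for the per-chunk work a single Lambda call would do."""
    return chunk.assign(value=chunk["value"] * 2)

# Hypothetical dataset of ten rows, processed four rows at a time.
data = pd.DataFrame({"value": range(10)})
parts = [prepare(chunk) for chunk in split_frame(data, chunk_size=4)]

# Merge step: stitch the processed chunks back into one dataset.
merged = pd.concat(parts, ignore_index=True)
```

This only works as written when rows can be prepared independently; aggregations across the whole dataset need a separate merge step.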
  38. Takeaways
  39. Takeaways
      ▸ Know your data and your target
      ▸ Pandas can solve many issues
      ▸ Defensive programming and closing the loop
      ▸ AWS Lambda is a powerful and flexible tool for time and resource constrained teams
  40. Thanks. Questions?
      @fabian_dubois
      fabian@datamaplab.com
      check denryoku.io
