PyconJP
2016-09-22
Fabian Dubois
Building a data preparation pipeline with Pandas and AWS Lambda
What Will You Learn?
▸ What is data preparation and why it is required.
▸ How to prepare data with pandas.
▸ How to set up a pipeline with AWS Lambda
About Me
▸ Based in Tokyo
▸ Using Python with data for 6 years
▸ Freelance Data Products Developer and Consultant (data visualization, machine learning)
▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization)
▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction.
Why Data Preparation?
So you have got data, now what?
▸ Showing it to an audience:
▸ a report from a survey?
▸ a news article with charts?
▸ a sales dashboard?
But a lot of available data is messy
▸ incomplete or missing data
▸ mis-formatted, mis-typed data
▸ wrong / corrupted values
It has every reason to be messy
▸ unavailability
▸ no appropriate means of collection
▸ lack of validation
▸ human errors
And this can have very bad consequences
▸ Crash in your report generator
▸ incomplete reports
▸ report reaches wrong conclusions
▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
It is not just about quality (ETL)
▸ Enriching the data
▸ Aggregating
▸ Classification (ML)
▸ Predictions (ML)
[Diagram: input 1 and input 2 are each cleaned, then combined (aggregate, classify, …) into an output that is visualized]
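The clean, enrich and aggregate flow can be sketched with pandas as follows. This is only an illustration: the two inputs (a sales log and a store lookup table) and their column names are made up for the example and do not come from the talk.

```python
import pandas as pd

# Hypothetical inputs: a sales log (input 1) and a store lookup table (input 2).
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, None],
    "amount": [10.0, 15.0, 7.5, None, 3.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2],
    "city": ["Tokyo", "Osaka"],
})

# Clean: drop rows that lack a store or an amount.
clean_sales = sales.dropna(subset=["store_id", "amount"]).copy()
clean_sales["store_id"] = clean_sales["store_id"].astype(int)

# Enrich: attach the city of each store.
enriched = clean_sales.merge(stores, on="store_id", how="left")

# Aggregate: total amount per city, ready to visualize.
output = enriched.groupby("city")["amount"].sum()
print(output)
```

A classification or prediction step would slot in between the enrich and aggregate steps in the same way.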
Example: data journalism & interactive visualization
▸ Often manually gathered data in spreadsheets
▸ Data cleaning required
▸ Data aggregation/preprocessing required
▸ Data may be updated on a weekly basis
If it is a product, it needs to deal with data updates
[Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
▸ Who is going to run the script?
Needs to be automated (the pipeline)
What does it apply to?
[Chart: applications and solutions placed along two axes, data quality (low to high) and data update frequency (once, monthly, daily, real-time)]
▸ Applications: ad hoc data analysis; data journalism, interactive reports, email reports; dashboards, data products
▸ Solutions: Jupyter notebook prototype; automated preparation pipeline (batch); micro-batch or real-time processing pipeline
▸ One region of the chart is marked "our focus"
How to prepare data?
common operations
▸ Date parsing
▸ Deciding on a strategy for null or non parseable values
▸ Enforce value ranges
▸ Sanitise strings
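A minimal pandas sketch of these four operations, assuming a small made-up table with `date`, `name` and `age` columns. The 0 to 120 age range and the choice to turn bad values into missing values are assumptions for illustration; other strategies (rejecting the row, filling a default) are just as valid.

```python
import pandas as pd

# Hypothetical raw records with typical problems.
raw = pd.DataFrame({
    "date": ["2016-09-22", "not a date", "2016-09-23"],
    "name": ["  alice ", "BOB", None],
    "age": ["34", "abc", "250"],
})

prepared = raw.copy()

# Date parsing: unparseable values become NaT instead of raising.
prepared["date"] = pd.to_datetime(prepared["date"], errors="coerce")

# Null / non-parseable strategy: coerce bad numbers to NaN.
prepared["age"] = pd.to_numeric(prepared["age"], errors="coerce")

# Enforce value ranges: ages outside 0..120 are treated as missing.
prepared.loc[~prepared["age"].between(0, 120), "age"] = float("nan")

# Sanitise strings: trim whitespace and normalise case.
prepared["name"] = prepared["name"].str.strip().str.title()

print(prepared)
```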
Existing tools
▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
▸ great tools to check data quality and define transformations
So why custom solutions with Python and Pandas?
▸ With Python, you can do anything!
▸ It is not that difficult
▸ Pandas is a versatile tool that manipulates DataFrames
▸ Easy to specify transformations
▸ Not limited to Pandas: the whole Python ecosystem is available, like scikit-learn
Example from a Jupyter notebook
▸ load a simple file with a list of names and ages of different people
Example: statistics on groups (names)
▸ Is there a relationship between name length and median age?
▸ Chain operations
▸ plot the length of name vs age for each name
[Plot annotations: Warning, Outlier]
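The notebook cells themselves are not part of this transcript, so here is a rough sketch of what such a chained computation might look like. The inline CSV text and the column names `name` and `age` are assumptions standing in for the file used in the talk.

```python
import io
import pandas as pd

# Hypothetical file content standing in for the file loaded in the notebook.
csv_text = """name,age
Anna,31
Anna,35
Bob,42
Charlotte,28
Charlotte,30
"""
people = pd.read_csv(io.StringIO(csv_text))

# Chain operations: group by name, take the median age, add the name length.
stats = (
    people
    .groupby("name")["age"]
    .median()
    .reset_index(name="median_age")
    .assign(name_length=lambda d: d["name"].str.len())
)
print(stats)
# stats.plot.scatter(x="name_length", y="median_age") would draw the plot
# (this needs matplotlib installed).
```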
Something is wrong
▸ null values
▸ label issues
Let’s fix this
▸ deal with missing values with `dropna` or `fillna`
▸ clean names
▸ reject outliers
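One possible way to apply these three fixes, as a sketch on made-up data. Filling missing ages with the median and treating ages outside 0 to 120 as outliers are assumptions chosen for the example, not rules from the talk.

```python
import pandas as pd

# Hypothetical messy input: a missing name, a missing age, label issues, an outlier.
people = pd.DataFrame({
    "name": ["anna", " Anna", "BOB", None, "bob"],
    "age": [31.0, 35.0, None, 50.0, 420.0],
})

cleaned = (
    people
    .dropna(subset=["name"])  # rows without a name are unusable here
    .assign(name=lambda d: d["name"].str.strip().str.title())  # clean names
)

# Reject outliers first, so they do not distort the fill value below.
plausible = cleaned["age"].isna() | cleaned["age"].between(0, 120)
cleaned = cleaned[plausible].copy()

# Fill the remaining missing ages with the median of the known ages.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

print(cleaned)
```

The order matters: filling before rejecting outliers would let the value 420 pull the median up.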
Close the loop to improve the data entry/acquisition
▸ Many errors can be avoided during data collection:
▸ form / column validation
▸ drop down selections for categories
▸ Report rejected rows to improve collection process
[Diagram: data → preparation script → list of issues → improve forms…]
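Reporting rejected rows could look like the sketch below: the preparation step returns both the accepted rows and a list of issues that can be sent back to whoever maintains the data entry form. The function name `prepare`, the checks and the column names are hypothetical.

```python
import pandas as pd

def prepare(df):
    """Split df into accepted rows and a table of issues for the data owners."""
    age = pd.to_numeric(df["age"], errors="coerce")
    checks = {
        "missing name": df["name"].isna(),
        "age missing or not a number": age.isna(),
        "age out of range": age.notna() & ~age.between(0, 120),
    }
    issues = []
    rejected = pd.Series(False, index=df.index)
    for reason, mask in checks.items():
        # Record one issue per failing row, then mark the row as rejected.
        for row in mask[mask].index:
            issues.append({"row": row, "reason": reason})
        rejected |= mask
    accepted = df[~rejected].assign(age=age[~rejected])
    return accepted, pd.DataFrame(issues, columns=["row", "reason"])

raw = pd.DataFrame({"name": ["Anna", None, "Bob"], "age": ["31", "40", "abc"]})
accepted, issues = prepare(raw)
print(issues)
```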
Testing your preparation
▸ Unit tests
▸ Test for anticipated edge cases (defensive programming)
▸ Property based testing (http://hypothesis.works/)
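A small sketch of what unit tests for one preparation step might look like, using only the standard `unittest` module. The `clean_names` function is a hypothetical step invented for the example. The last test checks a property (idempotence) on fixed inputs; a property-based tool such as Hypothesis could check the same property on generated inputs.

```python
import unittest
import pandas as pd

def clean_names(names):
    """Hypothetical preparation step: trim whitespace and normalise case."""
    return names.str.strip().str.title()

class CleanNamesTest(unittest.TestCase):
    def test_typical_values(self):
        result = clean_names(pd.Series([" anna", "BOB "]))
        self.assertEqual(result.tolist(), ["Anna", "Bob"])

    def test_missing_values_stay_missing(self):
        # Anticipated edge case: a null must not crash the step.
        result = clean_names(pd.Series(["anna", None]))
        self.assertTrue(pd.isna(result[1]))

    def test_idempotent(self):
        # Property: cleaning an already clean column changes nothing.
        once = clean_names(pd.Series(["  aNNa  ", "bob"]))
        self.assertEqual(clean_names(once).tolist(), once.tolist())

# Run the tests explicitly so the snippet also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanNamesTest)
outcome = unittest.TextTestRunner().run(suite)
```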
More references for data cleaning
▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
Setting up a pipeline with AWS Lambda
Some challenges
▸ Don’t let users run scripts
▸ Automating is part of a quality process
▸ Keeping things simple…
▸ and cheap
What is AWS Lambda: a serverless solution
▸ Serverless offering by AWS
▸ No lifecycle to manage or shared state => resilient
▸ Auto-scaling
▸ Pay for actual running time: low cost
▸ No server or infrastructure management: reduced dev / devops cost
[Diagram: events … → lambda function → output …]
Creating a function
just a python function
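The slide's screenshot is not in this transcript, so here is a minimal sketch of such a function. AWS Lambda calls the handler with an `event` (the trigger's payload) and a `context` object; the `rows` field read below is a hypothetical input chosen for illustration.

```python
def lambda_handler(event, context):
    """Entry point called by AWS Lambda with the trigger's event and a context object."""
    # 'rows' is a hypothetical field of the event, used here for illustration.
    rows = event.get("rows", [])
    return {"row_count": len(rows)}

# Local check: call it like any other function (context is unused here).
print(lambda_handler({"rows": [1, 2, 3]}, None))
```

Because the handler is a plain function, it can be unit tested locally before being uploaded.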
Creating a function: options
Creating an “architecture” with triggers
Batch processing at regular interval
▸ cron scheduling
▸ let your function get some data and process it at regular interval
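A sketch of a handler for a scheduled trigger. It assumes the function is attached to a schedule rule (for instance the expression `cron(0 3 * * ? *)`, which fires daily at 03:00 UTC) and that each run should process the previous day's data. That policy and the `fetch_and_prepare` helper mentioned in the comment are assumptions for illustration.

```python
from datetime import datetime, timedelta

def lambda_handler(event, context):
    """Handle a scheduled invocation and pick the day of data to process."""
    # Scheduled events carry an ISO 8601 UTC timestamp in 'time'.
    fired_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    # Assumption: each run processes the previous day's data.
    day = (fired_at - timedelta(days=1)).date()
    # A call such as fetch_and_prepare(day) would go here (hypothetical helper).
    return {"processed_day": day.isoformat()}

# Trimmed-down sample of a scheduled event, for a local check.
sample_event = {
    "source": "aws.events",
    "detail-type": "Scheduled Event",
    "time": "2016-09-22T03:00:00Z",
}
print(lambda_handler(sample_event, None))
```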
An API / webhook
▸ on API call
▸ Can be triggered from a google spreadsheet
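A sketch of a webhook-style handler, assuming the function sits behind an API Gateway proxy integration, where the request body arrives as a string in `event["body"]` and the response is a dict with `statusCode` and `body`. The `sheet_id` parameter is hypothetical; it stands for whatever the caller (for example a script attached to a spreadsheet) sends to identify the data to process.

```python
import json

def lambda_handler(event, context):
    """Webhook-style handler, assuming an API Gateway proxy integration."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except ValueError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}
    # 'sheet_id' is a hypothetical parameter naming the spreadsheet to process.
    sheet_id = payload.get("sheet_id")
    if not sheet_id:
        return {"statusCode": 400, "body": json.dumps({"error": "sheet_id is required"})}
    # The preparation itself would run here.
    return {
        "statusCode": 200,
        "body": json.dumps({"status": "accepted", "sheet_id": sheet_id}),
    }

print(lambda_handler({"body": json.dumps({"sheet_id": "abc"})}, None))
```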
Setting up AWS Lambda for Pandas
Pandas and its dependencies need to be compiled for Amazon Linux x86_64:
1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries
shell
    # install compilation environment
    sudo yum -y update
    sudo yum -y upgrade
    sudo yum groupinstall "Development Tools"
    sudo yum install blas blas-devel lapack \
        lapack-devel Cython --enablerepo=epel
    # create and activate virtual env
    virtualenv pdenv
    source pdenv/bin/activate
    # install pandas
    pip install pandas
    # zip the environment content (the pattern is quoted so the shell does not expand it)
    cd ~/pdenv/lib/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude "*.pyc"
    cd ~/pdenv/lib64/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude "*.pyc"
    # add the supporting libraries
    cd ~/
    mkdir -p libs
    cp /usr/lib64/liblapack.so.3 \
        /usr/lib64/libblas.so.3 \
        /usr/lib64/libgfortran.so.3 \
        /usr/lib64/libquadmath.so.0 \
        libs/
    zip -r ~/pdenv.zip libs
Using pandas from a lambda function
▸ The lambda process needs to access those binaries
▸ Set up env variables
▸ Call a subprocess
▸ And pickle the function input
▸ AWS will call `lambda_function.lambda_handler`
python: lambda_function.py
    import os, subprocess
    import cPickle as pickle  # Python 2.7

    # directory holding the bundled shared libraries; it must match the zip layout
    LIBS = os.path.join(os.getcwd(), 'local', 'lib')

    def handler(filename):
        def handle(event, context):
            # pass the event to the subprocess through a pickle file
            with open('/tmp/event.p', 'wb') as f:
                pickle.dump(event, f)
            env = os.environ.copy()
            env.update(LD_LIBRARY_PATH=LIBS)
            proc = subprocess.Popen(
                ('python', filename),
                env=env,
                stdout=subprocess.PIPE)
            # communicate() reads stdout while waiting, so a large output cannot block the child
            out, _ = proc.communicate()
            return out
        return handle

    lambda_handler = handler('my_function.py')
The actual function
▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server
▸ Clean it
▸ Copy it somewhere
python: my_function.py
    import pandas as pd
    import pickle
    import requests
    from StringIO import StringIO  # Python 2.7

    def run():
        # get the lambda call arguments
        with open('/tmp/event.p', 'rb') as f:
            event = pickle.load(f)
        # load some data from a google spreadsheet
        # ({sheet_id} and {page_id} are placeholders to fill in)
        r = requests.get('https://docs.google.com/spreadsheets'
                         + '/d/{sheet_id}/export?format=csv&gid={page_id}')
        data = r.content.decode('utf-8')
        df = pd.read_csv(StringIO(data))
        # Do something
        # save as file
        file_ = StringIO()
        df.to_csv(file_, encoding='utf-8')
        # copy the result somewhere

    if __name__ == '__main__':
        run()
upload and test
▸ add your lambda function code to the environment zip.
▸ upload your function
caveat 1: python 2.7
▸ officially, only Python 2.7 is supported (as of this talk, September 2016)
▸ But Python 3 is available and can be called as a subprocess
▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
caveat 2: max process memory (1.5GB) and execution time
▸ need to split the dataset if it is too large
▸ loop over the chunks in your lambda call: may exceed the timeout
▸ map to multiple lambda calls: need to merge the dataset at the end
▸ Lambda functions should be simple, chain if required
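The split-and-merge idea can be sketched locally with pandas' `chunksize` option. The tiny inline dataset and the per-city sum are made up for the example. In a real setup each chunk might be handled by its own lambda call, with a final call doing the merge.

```python
import io
import pandas as pd

# Hypothetical dataset standing in for a file too large to load at once.
records = [("Tokyo", 10), ("Osaka", 5), ("Tokyo", 20), ("Osaka", 1), ("Tokyo", 2)]
csv_text = "city,amount\n" + "\n".join(
    "{},{}".format(city, amount) for city, amount in records
)

def process_chunk(chunk):
    """Work done per chunk (or per lambda call): a partial sum per city."""
    return chunk.groupby("city")["amount"].sum()

# Loop over chunks of 2 rows; each partial result could come from its own call.
partials = [
    process_chunk(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Merge step: combine the partial sums into the final result.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This merge works because sums can be combined from partial sums. Statistics such as a median cannot be merged this way and need a different strategy.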
Takeaways
▸ Know your data and your target
▸ Pandas can solve many issues
▸ Defensive programming and closing the loop
▸ AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams
Thanks
Questions?
@fabian_dubois
fabian@datamaplab.com
check denryoku.io

Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 

PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda

  • 1. PyconJP 2016-09-22 Fabian Dubois Building a data preparation pipeline with Pandas and AWS Lambda
  • 2. What Will You Learn? ▸ What data preparation is and why it is required ▸ How to prepare data with pandas ▸ How to set up a pipeline with AWS Lambda
  • 3. About Me ▸ Based in Tokyo ▸ Using Python with data for 6 years ▸ Freelance Data Products Developer and Consultant (data visualization, machine learning) ▸ Formerly at Orange Labs and Locarise (connected sensors data processing and visualization) ▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
  • 5. So you have got data, now what? ▸ Showing it to an audience: ▸ a report from a survey? ▸ a news article with charts? ▸ a sales dashboard?
  • 6. But a lot of available data is messy ▸ incomplete or missing data ▸ mis-formatted, mis-typed data ▸ wrong / corrupted values
  • 7. It has every reason to be messy ▸ data not available ▸ no appropriate means of collection ▸ lack of validation ▸ human errors
  • 8. And this can have very bad consequences ▸ a crash in your report generator ▸ incomplete reports ▸ reports that reach wrong conclusions ▸ Ultimately, if your data is really bad, you cannot trust any conclusion drawn from it
  • 9. It is not just about quality (ETL) ▸ Enriching the data ▸ Aggregating ▸ Classification (ML) ▸ Predictions (ML) [Diagram: input 1 and input 2 are each cleaned, then aggregated, classified, … into an output that is visualized]
  • 10. Example: data journalism & interactive visualization ▸ Data is often gathered manually in spreadsheets ▸ Data cleaning required ▸ Data aggregation / preprocessing required ▸ Data may be updated on a weekly basis
  • 11. If it is a product, it needs to deal with data updates ▸ Who is going to run the script? ▸ It needs to be automated (the pipeline) [Diagram: current data and new data → preparation script → visualisation-ready data → visualisation]
  • 12. What does it apply to? [Chart: data quality (low to high) against data update frequency (once, monthly, daily, real-time). Applications shown: ad hoc data analysis; data journalism, interactive reports, email reports; dashboards, data products. Solutions shown: Jupyter notebook prototype; automated preparation pipeline (batch); micro-batch or real-time processing pipeline. One region is marked “our focus”.]
  • 14. Common operations ▸ Date parsing ▸ Deciding on a strategy for null or non-parseable values ▸ Enforcing value ranges ▸ Sanitising strings
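The four operations on this slide could look roughly like the sketch below. It is only an illustration: the column names, the sample values, and the 0 to 120 age range are assumptions made up for the example, and the chosen strategy (turn anything unparseable into a missing value) is just one of several possible ones.

```python
import pandas as pd

# Hypothetical raw input; column names and values are invented for illustration.
raw = pd.DataFrame({
    "name": ["  Alice ", "BOB", "carol", None],
    "joined": ["2016-01-15", "not a date", "2016-03-02", "2016-04-01"],
    "age": ["34", "abc", "230", "41"],
})


def prepare(df):
    out = df.copy()
    # Date parsing: values that do not match the format become NaT instead of raising.
    out["joined"] = pd.to_datetime(out["joined"], format="%Y-%m-%d", errors="coerce")
    # Strategy for non-parseable numbers: convert them to NaN.
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Enforce a value range: ages outside 0..120 (an assumed range) are treated as missing.
    out.loc[~out["age"].between(0, 120), "age"] = float("nan")
    # Sanitise strings: trim whitespace and normalise capitalisation.
    out["name"] = out["name"].str.strip().str.title()
    return out


clean = prepare(raw)
print(clean)
```

Coercing to missing values first keeps every row, so the decision to drop or fill can be made later in one place.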
  • 15. Existing tools ▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine ▸ great tools to check data quality and define transformations
  • 16. So why custom solutions with Python and Pandas? ▸ With Python, you can do anything! ▸ It is not that difficult ▸ Pandas is a versatile tool for manipulating DataFrames ▸ Easy to specify transformations ▸ Not limited to Pandas: the whole Python ecosystem is available, such as scikit-learn
  • 17. Example from a Jupyter notebook ▸ load a simple file with a list of names and ages of different persons
  • 18. Example: statistics on groups (names) ▸ Is there a relationship between name length and median age? ▸ Chain operations ▸ plot the length of name vs age for each name [Plot annotation: Warning, outlier]
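The notebook cell itself is not part of the transcript, so the following is only a guess at what a chained version of this analysis might look like. The `people` frame and its `name` and `age` columns are invented sample data.

```python
import pandas as pd

# Invented sample data standing in for the file loaded in the notebook.
people = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Robert", "Robert", "Jo"],
    "age": [30, 40, 50, 20, 24, 61],
})

# One chain: group by name, take the median age, then add the name length.
stats = (
    people
    .groupby("name")["age"]
    .median()
    .rename("median_age")
    .reset_index()
    .assign(name_length=lambda d: d["name"].str.len())
    .sort_values("name_length")
    .reset_index(drop=True)
)
print(stats)

# With matplotlib installed, the chart could then be drawn with:
# stats.plot.scatter(x="name_length", y="median_age")
```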
  • 19. Something is wrong [Screenshot annotations: null values, label issues]
  • 20. Let’s fix this ▸ deal with missing values with `dropna` or `fillna` ▸ clean names ▸ reject outliers
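A possible version of these three fixes is sketched below on invented data. The choices made here are assumptions rather than the speaker's: rows without a name are dropped, ages outside an assumed 0 to 120 range are rejected, and remaining missing ages are filled with the median. Outliers are removed before filling so that they do not distort the median.

```python
import pandas as pd

# Invented sample data with the three kinds of problems named on the slide.
people = pd.DataFrame({
    "name": ["Ann", " ann", "ROBERT", None, "Jo", "Jo"],
    "age": [30, 40, None, 25, 61, 640],
})

fixed = (
    people
    # Missing values: a row without a name cannot be grouped, so drop it.
    .dropna(subset=["name"])
    # Clean names: trim whitespace and normalise capitalisation.
    .assign(name=lambda d: d["name"].str.strip().str.title())
    # Reject outliers: keep missing ages for now, drop ages outside 0..120.
    .loc[lambda d: d["age"].isna() | d["age"].between(0, 120)]
    # Missing values: fill the remaining gaps with the median of the valid ages.
    .assign(age=lambda d: d["age"].fillna(d["age"].median()))
    .reset_index(drop=True)
)
print(fixed)
```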
  • 21. Close the loop to improve the data entry/acquisition ▸ Many errors can be avoided during data collection: ▸ form / column validation ▸ drop-down selections for categories ▸ Report rejected rows to improve the collection process [Diagram: data → preparation script → list of issues → improve forms…]
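One way to produce the “list of issues” is to have the preparation step return the rejected rows together with a reason, instead of dropping them silently. The sketch below assumes `name` and `age` columns and two made-up validation rules; `split_valid` is a hypothetical helper, not something from the talk.

```python
import pandas as pd


def split_valid(df):
    """Return (accepted rows, rejected rows with an added 'reason' column)."""
    reason = pd.Series("", index=df.index)
    reason[df["name"].isna()] = "missing name"
    # between() is False for missing values, so they are flagged as well.
    bad_age = ~df["age"].between(0, 120)
    reason[bad_age & (reason == "")] = "age missing or out of range"
    rejected = df[reason != ""].assign(reason=reason[reason != ""])
    accepted = df[reason == ""]
    return accepted, rejected


# Invented sample data.
entries = pd.DataFrame({"name": ["Ann", None, "Jo"], "age": [30, 25, 640]})
accepted, rejected = split_valid(entries)
print(rejected)
```

The `rejected` frame can then be sent back to whoever maintains the form or spreadsheet.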
  • 22. Testing your preparation ▸ Unit tests ▸ Test for anticipated edge cases (defensive programming) ▸ Property-based testing (http://hypothesis.works/)
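To make the difference between the two test styles concrete, here is a small sketch that uses only the standard library. `clean_name` is a hypothetical cleaning step. The second test is a hand-rolled randomized check in the spirit of property-based testing; Hypothesis automates input generation and shrinking of failing cases, which this sketch does not attempt.

```python
import random
import string


def clean_name(raw):
    """Hypothetical cleaning step: trim, collapse inner whitespace, title-case."""
    return " ".join(raw.split()).title()


def test_known_edge_cases():
    # Unit tests for inputs we already expect to see.
    assert clean_name("  alice ") == "Alice"
    assert clean_name("BOB\tsmith") == "Bob Smith"
    assert clean_name("") == ""


def test_properties(trials=500, seed=0):
    # Randomized check: these properties should hold for any generated input.
    rng = random.Random(seed)
    alphabet = string.ascii_letters + " \t"
    for _ in range(trials):
        raw = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        cleaned = clean_name(raw)
        assert clean_name(cleaned) == cleaned   # cleaning twice changes nothing
        assert cleaned == cleaned.strip()       # no leading or trailing whitespace
        assert "  " not in cleaned              # no doubled spaces


test_known_edge_cases()
test_properties()
```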
  • 23. More references for data cleaning ▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ ▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/ ▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
  • 24. Setting up a pipeline with AWS Lambda
  • 25. Some challenges ▸ Don’t let users run scripts ▸ Automating is part of a quality process ▸ Keeping things simple… ▸ and cheap
  • 26. What is AWS Lambda: a serverless solution ▸ Serverless offering by AWS ▸ No lifecycle to manage or shared state => resilient ▸ Auto-scaling ▸ Pay for actual running time: low cost ▸ No server or infrastructure management: reduced dev / devops cost [Diagram: events → lambda function → output]
  • 27. Creating a function: just a Python function
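The slide's screenshot is not in the transcript, so as a stand-in, a minimal handler might look like this. Lambda calls the function with the trigger payload (`event`) and a runtime `context` object; the `name` key and the returned message are invented for the example.

```python
def lambda_handler(event, context):
    """Entry point: receives the trigger payload and a runtime context object."""
    # 'name' is a hypothetical key; real payloads depend on the trigger used.
    name = event.get("name", "world")
    return {"message": "Hello, {}!".format(name)}


# The handler is an ordinary function, so it can be exercised locally.
result = lambda_handler({"name": "PyCon JP"}, None)
print(result)
```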
  • 28. Creating a function: options
  • 29. Creating an “architecture” with triggers
  • 30. Batch processing at regular intervals ▸ cron scheduling ▸ let your function fetch some data and process it at regular intervals
  • 31. An API / webhook ▸ on API call ▸ Can be triggered from a Google spreadsheet
  • 32. Setting up AWS Lambda for Pandas ▸ Pandas and its dependencies need to be compiled for Amazon Linux x86_64 ▸ 1. Launch an EC2 instance and connect to it ▸ 2. Install pandas in a virtualenv ▸ 3. Zip the installed libraries
    shell:
        # install compilation environment
        sudo yum -y update
        sudo yum -y upgrade
        sudo yum groupinstall "Development Tools"
        sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel
        # create and activate virtual env (run from the home directory)
        virtualenv pdenv
        source pdenv/bin/activate
        # install pandas
        pip install pandas
        # zip the environment content (the pattern is quoted so the shell does not expand it)
        cd ~/pdenv/lib/python2.7/site-packages/
        zip -r ~/pdenv.zip . --exclude "*.pyc"
        cd ~/pdenv/lib64/python2.7/site-packages/
        zip -r ~/pdenv.zip . --exclude "*.pyc"
        # add the supporting libraries
        cd ~/
        mkdir -p libs
        cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
        zip -r ~/pdenv.zip libs
  • 33. Using pandas from a lambda function ▸ The lambda process needs to access those binaries ▸ Set up env variables ▸ Call a subprocess ▸ And pickle the function input ▸ AWS will call `lambda_function.lambda_handler`
    python: lambda_function.py
        import os
        import subprocess
        import cPickle as pickle  # Python 2.7

        # Folder holding the shared libraries. It must match the folder added to the
        # zip in the packaging step (libs/); the slide originally pointed at local/lib.
        LIBS = os.path.join(os.getcwd(), 'libs')

        def handler(filename):
            def handle(event, context):
                # hand the event over to the subprocess through a pickle file
                with open('/tmp/event.p', 'wb') as f:
                    pickle.dump(event, f)
                env = os.environ.copy()
                env.update(LD_LIBRARY_PATH=LIBS)
                proc = subprocess.Popen(
                    ('python', filename), env=env, stdout=subprocess.PIPE)
                # communicate() reads stdout while waiting, so a full pipe cannot block the child
                out, _ = proc.communicate()
                return out
            return handle

        lambda_handler = handler('my_function.py')
  • 34. The actual function ▸ Get the input data from a Google spreadsheet, a CSV file on S3, an FTP server ▸ Clean it ▸ Copy it somewhere
    python: my_function.py
        import pickle
        from StringIO import StringIO  # Python 2.7

        import pandas as pd
        import requests

        def run():
            # get the lambda call arguments
            with open('/tmp/event.p', 'rb') as f:
                event = pickle.load(f)
            # load some data from a Google spreadsheet
            # ({sheet_id} and {page_id} are placeholders to replace with real values)
            r = requests.get('https://docs.google.com/spreadsheets'
                             + '/d/{sheet_id}/export?format=csv&gid={page_id}')
            data = r.content.decode('utf-8')
            df = pd.read_csv(StringIO(data))
            # Do something
            # save as file
            file_ = StringIO()
            df.to_csv(file_, encoding='utf-8')
            # copy the result somewhere

        if __name__ == '__main__':
            run()
  • 35. Upload and test ▸ add your lambda function code to the environment zip ▸ upload your function
  • 36. Caveat 1: Python 2.7 ▸ officially, only Python 2.7 is supported ▸ but Python 3 is available and can be called as a subprocess ▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
  • 37. Caveat 2: max process memory (1.5GB) and execution time ▸ need to split the dataset if it is too large ▸ loop over it in your lambda call: ▸ may exceed the timeout ▸ map to multiple lambda calls: ▸ need to merge the dataset at the end ▸ Lambda functions should be simple; chain them if required
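The split, map and merge idea can be tried locally before involving Lambda at all. In the sketch below, each chunk stands for the work of one invocation, and plain function calls stand in for the actual Lambda calls. The CSV content and function names are invented. Partial sums and counts are returned instead of partial means, because means of chunks cannot simply be averaged.

```python
import io

import pandas as pd

# Invented sample data; in practice this would be a file too large for one call.
CSV = "name,age\nAnn,30\nJo,61\nAnn,40\nJo,59\nAnn,50\n"


def process_chunk(chunk):
    """Work one invocation could do: a partial sum and count of ages per name."""
    return chunk.groupby("name")["age"].agg(["sum", "count"])


def merge(partials):
    """Final step: add up the partial results and derive the mean per name."""
    total = pd.concat(partials).groupby(level=0).sum()
    return (total["sum"] / total["count"]).rename("mean_age")


# chunksize makes read_csv yield the data two rows at a time.
partials = [process_chunk(c) for c in pd.read_csv(io.StringIO(CSV), chunksize=2)]
mean_age = merge(partials)
print(mean_age)
```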
  • 39. Takeaways ▸ Know your data and your target ▸ Pandas can solve many issues ▸ Defensive programming and closing the loop ▸ AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams