© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serverless Data Prep with AWS Glue
ANT313
Moataz Anany
Solutions Architect
AWS
Nitin Wagh
Solutions Architect
AWS
Workshop map
You are a data engineer at AnyCompany*
Create a dataset for reporting and visualization
• Cleanse
• Transform
• Optimize for reporting queries
Help solve a machine learning problem
• Data scientists at AnyCompany need to understand passenger tipping behavior
* a completely made-up ride-hailing startup
Your dataset: NYC taxi trips
Yellow and green taxi trips
• Pick-up and drop-off dates/times
• Pick-up and drop-off locations
• Trip distances
• Itemized fares
• Tip amount
• Driver-reported passenger counts
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
For-hire vehicle (FHV)
• Pick-up and drop-off dates/times
• Pick-up and drop-off locations
• Dispatching base
Your dataset: NYC taxi trips
Original raw dataset
• Green and yellow taxi + FHV
• Years 2009 to 2018
• ~1.6Bn rows
• 215 files
• 253GB total
Simplified raw dataset
• Only yellow taxi + few look-ups
• Jan to March 2017
• ~2M rows
• 3 files
• 2.5GB+ uncompressed
Ready in a publicly-accessible Amazon Simple Storage Service (Amazon S3) bucket
Navigate to the ANT313 workshop website
Use Firefox or Chrome. Keep the website open in a separate tab
https://bit.ly/ant313-workshop
Hands-on
I.2: Get your AWS account ready
Key resources AWS CloudFormation will create
1. A new data lake S3 bucket
2. Necessary AWS Identity and Access Management (IAM) policies and
roles for AWS Glue, Amazon Athena, and Amazon SageMaker
3. An AWS Glue development endpoint
4. A number of named queries in Athena
Finally, the NYC taxi trips raw dataset is copied into your S3 bucket
AWS Glue automates the undifferentiated heavy lifting of ETL
Discover: Automatically discover and categorize your data, making it immediately searchable and queryable across data sources
Develop: Generate code to clean, enrich, and reliably move data between various data sources
Deploy: Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to manage
AWS Glue components
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Athena, Amazon Redshift Spectrum
Job execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring, and alerting
Job authoring
• Auto-generates ETL code
• Build on open frameworks – Python and Spark
• Developer-centric – editing, debugging, sharing
AWS Glue Crawlers and the AWS Glue Data Catalog
A crawler . . .
• Samples and classifies your data
• Extracts metadata
• Infers schema and partitioning format
• Creates tables in your account’s AWS Glue Data Catalog
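A crawler like the one described above can also be defined programmatically. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured when `create_and_start_crawler` is called. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, not values from the workshop.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler, then start it (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = crawler_request(
    name="nyc-taxi-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="nyctaxi",
    s3_path="s3://example-data-lake-bucket/raw/",
)
```

Once the crawler run finishes, the tables it infers should appear under the named database in the Data Catalog.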
Why query with Amazon Athena?
• Interactive query service
• Makes it easy to analyze data in Amazon S3 using
standard SQL
• Out-of-the-box integrated with AWS Glue Data Catalog
• Serverless
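As an illustration of querying cataloged data with standard SQL, here is a minimal sketch that builds a query and submits it through the Athena API. It assumes boto3 and AWS credentials are available when `run_query` is called. The table name and the `passenger_count` column are assumptions for illustration; `tip_amount` is the target label named later in the deck.

```python
def tip_summary_query(table):
    """Return a standard-SQL query that averages tips per passenger count."""
    return (
        "SELECT passenger_count, AVG(tip_amount) AS avg_tip "
        f"FROM {table} "
        "GROUP BY passenger_count "
        "ORDER BY passenger_count"
    )


def run_query(sql, database, output_location):
    """Submit the query to Athena and return its execution ID."""
    import boto3  # lazy import; needs AWS credentials at call time

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


sql = tip_summary_query("yellow_trips_raw")  # hypothetical table name
```

Athena runs queries asynchronously, so a caller would poll the returned execution ID before fetching results.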
Concepts in action
1. Crawler crawls raw dataset in Amazon S3 bucket
2. Crawler writes metadata into AWS Glue Data Catalog
3. You query raw dataset in Athena
4. Athena uses schema definition to read raw dataset from S3 and returns results
Diagram labels: You, AWS Glue crawler, AWS Glue Data Catalog, NYC taxi raw dataset
Hands-on
II.1.1: Catalog raw data with an
AWS Glue crawler
Hands-on
II.1.2: Explore table schema and
metadata in
AWS Glue Data Catalog
Hands-on
II.1.3: Query raw data with Amazon
Athena
Interactive ETL development with AWS Glue and Amazon SageMaker
What is Amazon SageMaker?
• A fully managed ML platform
• Enables you to build, train, and deploy machine
learning models at any scale
• Provides a Jupyter notebook environment
What is an AWS Glue development endpoint?
Environment to interactively develop, debug, and test ETL code
Connect your IDE to an AWS Glue development endpoint
Diagram labels: AWS Glue Spark environment, interpreter server, remote interpreter
Interactive ETL development with AWS Glue and Amazon SageMaker
AWS Glue enables you to
• Create an Amazon SageMaker notebook environment
• Connect it to an AWS Glue dev endpoint
• And, finally, access your Jupyter notebook
All without leaving the AWS Glue console
Concepts in action
1. You author ETL code in an Amazon SageMaker notebook
2. ETL code runs on AWS Glue dev endpoint
3. ETL code…
a. reads raw dataset
b. applies transformations
c. writes optimized dataset back to your Amazon S3 bucket
4. You use your ETL code in notebook to create a script file
Diagram labels: You, Amazon SageMaker notebook, AWS Glue Data Catalog, NYC taxi raw dataset, NYC taxi optimized dataset
Hands-on
II.2.1: Create an Amazon
SageMaker notebook instance
Hands-on
II.2.2: Interactively author and run
ETL code in Jupyter
Transformations we’ll apply to raw data
• Join NYC trips dataset with look-up tables (denormalization)
• Create new timestamp columns for pick-up & drop-off
• Drop unnecessary columns
• Partition the dataset
• Convert to a columnar file format (Parquet)
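The logic of these transformations can be sketched in plain Python on a single in-memory record. This is an illustration only, not the workshop's ETL script: the field names, the look-up table contents, and the year/month partitioning scheme are all assumptions, and the Parquet write itself is not shown because it needs a library outside the standard library.

```python
from datetime import datetime

# Hypothetical look-up table: location ID -> zone name.
ZONES = {1: "Newark Airport", 4: "Alphabet City"}
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def transform(trip, zones=ZONES):
    """Apply the listed transformations to one raw trip record (a dict)."""
    pickup = datetime.strptime(trip["pickup_datetime"], TS_FORMAT)
    dropoff = datetime.strptime(trip["dropoff_datetime"], TS_FORMAT)
    return {
        # Denormalize: replace location IDs with names from the look-up table.
        "pickup_zone": zones.get(trip["pu_location_id"], "Unknown"),
        "dropoff_zone": zones.get(trip["do_location_id"], "Unknown"),
        # New timestamp columns parsed from the raw strings.
        "pickup_ts": pickup,
        "dropoff_ts": dropoff,
        "tip_amount": float(trip["tip_amount"]),
        # Partition keys derived from the pick-up time (assumed scheme).
        "year": pickup.year,
        "month": pickup.month,
        # Fields not copied here (e.g. "store_and_fwd_flag") are dropped.
    }


def partition_prefix(row):
    """Return the Hive-style partition prefix a row would be written under."""
    return f"year={row['year']}/month={row['month']}/"


raw = {
    "pickup_datetime": "2017-01-15 08:30:00",
    "dropoff_datetime": "2017-01-15 08:52:10",
    "pu_location_id": 4,
    "do_location_id": 1,
    "tip_amount": "3.50",
    "store_and_fwd_flag": "N",
}
row = transform(raw)
print(partition_prefix(row), row["pickup_zone"], "->", row["dropoff_zone"])
```

On Spark, the same steps would typically map to DataFrame `join`, `withColumn`, `drop`, and `write.partitionBy(...).parquet(...)`.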
Concepts in action
1. Crawler crawls optimized dataset in S3 bucket
2. Crawler writes metadata into AWS Glue Data Catalog
3. You query optimized dataset in Athena
4. Athena uses schema definition to read data from Amazon S3 and returns results
Diagram labels: You, AWS Glue crawler, AWS Glue Data Catalog, NYC taxi optimized dataset
Hands-on
II.3.1: Catalog optimized data with
an AWS Glue crawler
Hands-on
II.3.2: Query optimized data with
Amazon Athena
What is an AWS Glue job?
An AWS Glue job encapsulates the business logic that performs
extract, transform, and load (ETL) work
• A core building block in your production ETL pipeline
• Provide your PySpark ETL script, or have one auto-generated
• Supports a rich set of built-in AWS Glue transformations
• Jobs can be started, stopped, monitored
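To make the job concept concrete, here is a minimal sketch of defining and starting a job through the Glue API, assuming boto3 and AWS credentials are available when `create_and_run` is called. The job name, role ARN, and script location are hypothetical. The `--job-bookmark-option` argument is included as one example of a default argument.

```python
def job_request(name, role_arn, script_location, enable_bookmarks=False):
    """Build keyword arguments for the Glue CreateJob API call."""
    bookmark = "job-bookmark-enable" if enable_bookmarks else "job-bookmark-disable"
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL job that runs the given script.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "DefaultArguments": {"--job-bookmark-option": bookmark},
    }


def create_and_run(request):
    """Create the job and start one run (needs AWS credentials)."""
    import boto3  # lazy import so job_request works without boto3

    glue = boto3.client("glue")
    glue.create_job(**request)
    return glue.start_job_run(JobName=request["Name"])["JobRunId"]


# Hypothetical values for illustration only.
request = job_request(
    name="optimize-trips",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-data-lake-bucket/scripts/optimize_trips.py",
    enable_bookmarks=True,
)
```

The returned run ID can be passed to `get_job_run` to monitor the run's state.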
What is an AWS Glue trigger?
Triggers are the “glue” in your AWS Glue ETL pipeline
Triggers . . .
• Can be used to chain multiple AWS Glue jobs in a series
• Can start multiple jobs at once
• Can be scheduled, on-demand, or based on job events
• Can pass unique parameters to customize AWS Glue job runs
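The trigger types above can be sketched as request payloads for the Glue CreateTrigger API. This is a minimal illustration, assuming boto3 and AWS credentials are available when `create_trigger` is called. The trigger names, job names, cron expression, and the `--run_date` argument are hypothetical.

```python
def scheduled_trigger(name, cron, job_name, arguments=None):
    """Trigger that starts a job on a schedule, optionally with parameters."""
    action = {"JobName": job_name}
    if arguments:
        action["Arguments"] = arguments
    return {"Name": name, "Type": "SCHEDULED", "Schedule": cron, "Actions": [action]}


def conditional_trigger(name, upstream_job, downstream_job):
    """Trigger that chains two jobs: run downstream after upstream succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
    }


def create_trigger(request):
    """Register the trigger with AWS Glue (needs AWS credentials)."""
    import boto3  # lazy import so the builders work without boto3

    boto3.client("glue").create_trigger(**request)


# Hypothetical names and schedule for illustration only.
nightly = scheduled_trigger(
    "nightly-optimize", "cron(0 2 * * ? *)", "optimize-trips",
    arguments={"--run_date": "2017-01-15"},
)
chained = conditional_trigger("after-optimize", "optimize-trips", "aggregate-trips")
```

A trigger's `Actions` list may hold several entries, which is how one trigger can start multiple jobs at once.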
Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
Schedule-driven AWS Glue ETL pipeline
We work our way backwards from a daily SLA deadline
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
State machine–driven AWS Glue ETL pipeline
Let AWS Step Functions drive the pipeline
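One way a state machine could drive a sequence of Glue jobs is sketched below as an Amazon States Language definition built in Python. It assumes the Step Functions service integration for AWS Glue (`glue:startJobRun.sync`) is available in your account and Region; the session itself may have used a different mechanism. The job names are hypothetical.

```python
import json


def glue_pipeline_definition(job_names):
    """Return a state machine definition that runs Glue jobs in order."""
    states = {}
    for index, job in enumerate(job_names):
        state = {
            "Type": "Task",
            # ".sync" makes the state wait until the job run completes.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
        }
        if index + 1 < len(job_names):
            state["Next"] = f"Run-{job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run-{job}"] = state
    return {"StartAt": f"Run-{job_names[0]}", "States": states}


# Hypothetical job names for illustration only.
definition = glue_pipeline_definition(["optimize-trips", "aggregate-trips"])
print(json.dumps(definition, indent=2))
```

The printed JSON could be supplied as the definition when creating a state machine.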
Concepts in action
We’ll build a schedule-driven pipeline.
1. Scheduled crawler runs on raw dataset and updates AWS Glue Data Catalog
2. AWS Glue job starts based on trigger(s)
3. Job reads raw dataset, applies transformations, and writes optimized dataset to your S3 bucket
4. Scheduled crawler runs on optimized dataset and updates Data Catalog
Diagram labels: AWS Glue crawler for raw dataset, NYC taxi raw dataset, AWS Glue Data Catalog, trigger, AWS Glue job to create optimized dataset, NYC taxi optimized dataset, AWS Glue crawler for optimized dataset
Hands-on
II.4.1: Schedule AWS Glue crawlers
Hands-on
II.4.2: Create an AWS Glue job
Hands-on
II.4.3: Create an AWS Glue trigger to
run jobs
Hands-on
Advanced: Try out AWS Glue job
bookmarks
ML problem definition
Tips given to taxi drivers are a substantial part of their income
They can also serve as a loose metric for service quality, which can be useful for taxi companies
What influences higher tips (for example, trips starting and ending in richer parts of the city)?
We present an analysis of factors affecting tips and use these factors to predict tips
ML architecture
Diagram labels: Amazon SageMaker; notebook instance (Glue Spark); feature engineering; data splitting; XGBoost regression; fit(); training job; model creation; endpoint configuration; endpoint; test data; transform(); results
Concepts in action
1. Launch notebook
2. Read optimized dataset
3. Train the ML model
4. Host the trained model and perform prediction
Diagram labels: Amazon SageMaker notebook, AWS Glue development endpoint, AWS Glue Data Catalog, NYC taxi optimized dataset
Hands-on
II.5.1: Interactively develop a
machine learning model in Jupyter
Run ML notebook interactively
• Initialize variables and import Spark libraries
• Retrieve the optimized NYC taxi trips dataset
• Observe features against the target label (tip_amount)
• Perform feature engineering
• Split the feature-engineered dataset into train and test sets
• Launch Spark XGBoostEstimator (observe hyperparameters)
• Perform prediction and observe accuracy
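The split-and-evaluate steps can be illustrated without Spark or XGBoost. The sketch below is a plain-Python stand-in, not the notebook's code: it splits synthetic rows into train and test sets and scores a naive mean-tip baseline with RMSE, which gives a reference point for judging a trained model. The data and the 15% tip rule are invented for illustration.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic rows for illustration only: (fare_amount, tip_amount).
rows = [(float(fare), round(0.15 * fare, 2)) for fare in range(5, 55)]
train, test = train_test_split(rows)

# Baseline "model": always predict the mean tip seen in training.
mean_tip = sum(tip for _, tip in train) / len(train)
baseline_rmse = rmse([tip for _, tip in test], [mean_tip] * len(test))
```

A regression model is only useful here if its test RMSE is clearly lower than this baseline.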
You made it!
Before you go . . .
• Visit the “Account clean-up” section for instructions on cleaning up your AWS account
• Tell us how we did: Please fill out the survey
Thank you!

Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Serverless Data Prep with AWS Glue Moataz Anany Solutions Architect AWS A N T 3 1 3 Nitin Wagh Solutions Architect AWS
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Workshop map
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Workshop map
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Workshop map
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. You are a data engineer at AnyCompany* Create a dataset for reporting and visualization Cleanse Transform Optimize for reporting queries Help solve a machine learning problem Data scientists at AnyCompany need to understand passenger tipping behavior * a completely made-up ride-hailing startup
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Your dataset: NYC taxi trips Yellow and green taxi trips • Pick-up and drop-off dates/times • Pick-up and drop-off locations • Trip distances • Itemized fares • Tip amount • Driver-reported passenger counts http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml For-hire vehicle (FHV) • Pick-up and drop-off dates/times • Pick-up and drop-off locations • Dispatching base
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Your dataset: NYC taxi trips Original raw dataset Green and yellow taxi + FHV Years 2009 to 2018 ~1.6Bn rows 215 files 253GB total Simplified raw dataset Only yellow taxi + few look-ups Jan to March 2017 ~2M rows 3 files 2.5GB+ uncompressed Ready in a publicly-accessible Amazon Simple Storage Service (Amazon S3) bucket
  • 9. Workshop map
  • 10. Navigate to the ANT313 workshop website. Use Firefox or Chrome. Keep the website open in a separate tab https://bit.ly/ant313-workshop
  • 11. Hands-on I.2: Get your AWS account ready
  • 12. Key resources AWS CloudFormation will create: 1. A new data lake S3 bucket 2. Necessary AWS Identity and Access Management (IAM) policies and roles for AWS Glue, Amazon Athena, and Amazon SageMaker 3. An AWS Glue development endpoint 4. A number of named queries in Athena. Finally, the NYC taxi trips raw dataset is copied into your S3 bucket
  • 13. Workshop map
  • 14. AWS Glue automates the undifferentiated heavy lifting of ETL. Discover: automatically discover and categorize your data, making it immediately searchable and queryable across data sources. Develop: generate code to clean, enrich, and reliably move data between various data sources. Deploy: run your jobs on a serverless, fully managed, scale-out environment, with no compute resources to manage
  • 15. AWS Glue components. Data Catalog: • Hive Metastore compatible with enhanced functionality • Crawlers automatically extract metadata and create tables • Integrated with Athena, Amazon Redshift Spectrum. Job authoring: • Auto-generates ETL code • Build on open frameworks (Python and Spark) • Developer-centric (editing, debugging, sharing). Job execution: • Run jobs on a serverless Spark platform • Provides flexible scheduling • Handles dependency resolution, monitoring, and alerting
  • 16. Workshop map
  • 17. AWS Glue crawlers and the AWS Glue Data Catalog. A crawler: • Samples and classifies your data • Extracts metadata • Infers schema and partitioning format • Creates tables in your account’s AWS Glue Data Catalog
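The workshop sets up its crawler in the console. As a rough sketch of the same idea in code, the snippet below builds the arguments for the boto3 Glue client's `create_crawler` call and then starts the crawler. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders, not values from the workshop.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,                      # IAM role the crawler assumes
        "DatabaseName": database,              # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, request):
    """Create the crawler, then kick off its first run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="nyctaxi-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="nyctaxi",
    s3_path="s3://example-data-lake-bucket/data/raw/nyctaxi/",
)
```

With AWS credentials configured, `create_and_start_crawler(boto3.client("glue"), request)` would submit the request; the client is passed in so the request can be inspected without touching AWS.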
  • 18. Why query with Amazon Athena? • Interactive query service • Makes it easy to analyze data in Amazon S3 using standard SQL • Integrated out of the box with the AWS Glue Data Catalog • Serverless
  • 19. Concepts in action (diagram: you, AWS Glue crawler, AWS Glue Data Catalog, NYC taxi raw dataset): 1. Crawler crawls raw dataset in Amazon S3 bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query raw dataset in Athena 4. Athena uses schema definition to read raw dataset from S3 and returns results
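Steps 3 and 4 above can also be driven from code rather than the Athena console. The sketch below submits a query through the boto3 Athena client, polls until it finishes, and returns the first page of results. The database, table, and output location in the usage note are assumptions, not names from the workshop.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a SQL query in Athena and return the first page of rows as lists of strings."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT COUNT(*) FROM yellow", "nyctaxi", "s3://example-bucket/athena-results/")`, where the table `yellow` and both locations are hypothetical.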
  • 20. Hands-on II.1.1: Catalog raw data with an AWS Glue crawler
  • 21. Hands-on II.1.2: Explore table schema and metadata in AWS Glue Data Catalog
  • 22. Hands-on II.1.3: Query raw data with Amazon Athena
  • 23. Workshop map
  • 24. Interactive ETL development with AWS Glue and Amazon SageMaker. What is Amazon SageMaker? • A fully managed ML platform • Enables you to build, train, and deploy machine learning models at any scale • Provides a Jupyter notebook environment
  • 25. What is an AWS Glue development endpoint? An environment to interactively develop, debug, and test ETL code. Connect your IDE to an AWS Glue development endpoint (diagram: remote interpreter, interpreter server, AWS Glue Spark environment)
  • 26. Interactive ETL development with AWS Glue and Amazon SageMaker. AWS Glue enables you to • Create an Amazon SageMaker notebook environment • Connect it to an AWS Glue dev endpoint • And, finally, access your Jupyter notebook. All without leaving the AWS Glue console
  • 27. Concepts in action (diagram: you, Amazon SageMaker notebook, AWS Glue Data Catalog, NYC taxi raw dataset, NYC taxi optimized dataset): 1. You author ETL code in an Amazon SageMaker notebook 2. ETL code runs on AWS Glue dev endpoint 3. ETL code a. reads raw dataset b. applies transformations c. writes optimized dataset back to your Amazon S3 bucket 4. You use your ETL code in notebook to create a script file
  • 28. Hands-on II.2.1: Create an Amazon SageMaker notebook instance
  • 29. Hands-on II.2.2: Interactively author and run ETL code in Jupyter
  • 30. Transformations we’ll apply to raw data • Join NYC trips dataset with look-up tables (denormalization) • Create new timestamp columns for pick-up and drop-off • Drop unnecessary columns • Partition the dataset • Convert to a columnar file format (Parquet)
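In the workshop these transformations run as PySpark code on the AWS Glue dev endpoint. The plain-Python sketch below only illustrates the row-level logic on a single record: a look-up join, new timestamp columns, dropped columns, and a partition path. The look-up values, the choice of dropped columns, the new column names, and the partition layout are all assumptions; the Parquet conversion is not shown.

```python
from datetime import datetime

# Hypothetical look-up table: payment type code -> description.
PAYMENT_TYPES = {"1": "Credit card", "2": "Cash"}

# Columns assumed to be unnecessary after the transformation.
DROP_COLUMNS = {"store_and_fwd_flag", "payment_type"}

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def transform_trip(raw):
    """Denormalize one raw trip record, add timestamp columns, and drop unused columns."""
    trip = {key: value for key, value in raw.items() if key not in DROP_COLUMNS}
    # Join with the look-up table (denormalization).
    trip["payment_type_name"] = PAYMENT_TYPES.get(raw["payment_type"], "Unknown")
    # New timestamp columns parsed from the raw string columns.
    trip["pickup_ts"] = datetime.strptime(raw["tpep_pickup_datetime"], TIMESTAMP_FORMAT)
    trip["dropoff_ts"] = datetime.strptime(raw["tpep_dropoff_datetime"], TIMESTAMP_FORMAT)
    return trip


def partition_path(trip, prefix="optimized/yellow"):
    """Hive-style partition path by pick-up year and month (an assumed layout)."""
    ts = trip["pickup_ts"]
    return f"{prefix}/pu_year={ts.year}/pu_month={ts.month:02d}/"


raw_trip = {
    "tpep_pickup_datetime": "2017-01-15 08:30:00",
    "tpep_dropoff_datetime": "2017-01-15 08:45:10",
    "payment_type": "1",
    "store_and_fwd_flag": "N",
    "tip_amount": "2.50",
}
trip = transform_trip(raw_trip)
print(partition_path(trip))
```

Partitioning by a column that reporting queries filter on lets a query engine such as Athena skip whole prefixes instead of scanning every file.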
  • 31. Workshop map
  • 32. Concepts in action (diagram: you, AWS Glue crawler, AWS Glue Data Catalog, NYC taxi optimized dataset): 1. Crawler crawls optimized dataset in S3 bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query optimized dataset in Athena 4. Athena uses schema definition to read data from Amazon S3 and returns results
  • 33. Hands-on II.3.1: Catalog optimized data with an AWS Glue crawler
  • 34. Hands-on II.3.2: Query optimized data with Amazon Athena
  • 35. Workshop map
  • 36. What is an AWS Glue job? An AWS Glue job encapsulates the business logic that performs extract, transform, and load (ETL) work • A core building block in your production ETL pipeline • Provide your PySpark ETL script, or have one auto-generated • Supports a rich set of built-in AWS Glue transformations • Jobs can be started, stopped, and monitored
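As a minimal sketch of what defining, starting, and monitoring a job can look like outside the console, the snippet below uses the boto3 Glue client's `create_job`, `start_job_run`, and `get_job_run` calls. The job name, role ARN, and script location are hypothetical; the `--job-bookmark-option` default argument is included only to show where a job bookmark setting would go.

```python
def build_job_request(name, role_arn, script_location, default_arguments=None):
    """Build keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" is the command name for a Spark ETL job.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "DefaultArguments": dict(default_arguments or {}),
    }


def start_job(glue_client, job_name, arguments=None):
    """Start a job run and return its run ID."""
    response = glue_client.start_job_run(
        JobName=job_name, Arguments=dict(arguments or {})
    )
    return response["JobRunId"]


def job_state(glue_client, job_name, run_id):
    """Return the current state of a job run, for example RUNNING or SUCCEEDED."""
    run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
    return run["JobRun"]["JobRunState"]


# Hypothetical values, for illustration only.
job_request = build_job_request(
    name="nyctaxi-create-optimized",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueJobRole",
    script_location="s3://example-data-lake-bucket/scripts/create_optimized.py",
    default_arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```

With credentials in place, `boto3.client("glue").create_job(**job_request)` would register the job, after which `start_job` and `job_state` can be used to run and watch it.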
  • 37. What is an AWS Glue trigger? Triggers are the “glue” in your AWS Glue ETL pipeline. Triggers . . . • Can be used to chain multiple AWS Glue jobs in a series • Can start multiple jobs at once • Can be scheduled, on-demand, or based on job events • Can pass unique parameters to customize AWS Glue job runs
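The trigger types listed above map onto the `Type` field of the boto3 Glue client's `create_trigger` call. The sketch below builds two request shapes: a scheduled trigger that passes parameters to a job, and a conditional trigger that chains jobs on success. All trigger names, job names, and argument keys are hypothetical.

```python
def scheduled_trigger(name, cron, job_name, arguments=None):
    """Request for a trigger that starts one job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        # Per-trigger arguments customize the job run it starts.
        "Actions": [{"JobName": job_name, "Arguments": dict(arguments or {})}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, upstream_jobs, downstream_jobs):
    """Request for a trigger that starts jobs once all upstream jobs have succeeded."""
    predicate = {
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
            for job in upstream_jobs
        ]
    }
    if len(upstream_jobs) > 1:
        predicate["Logical"] = "AND"  # every upstream job must succeed
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": predicate,
        # Several actions let one trigger start multiple jobs at once.
        "Actions": [{"JobName": job} for job in downstream_jobs],
        "StartOnCreation": True,
    }


# Hypothetical names, for illustration only.
nightly = scheduled_trigger(
    "nightly-optimize", "0 2 * * ? *", "nyctaxi-create-optimized",
    arguments={"--source_prefix": "data/raw/nyctaxi/"},
)
chained = conditional_trigger(
    "after-optimize", ["nyctaxi-create-optimized"], ["report-a", "report-b"]
)
```

Either dictionary could then be passed as `boto3.client("glue").create_trigger(**nightly)`.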
  • 38. Three ways to set up an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State machine-driven
  • 39. Schedule-driven AWS Glue ETL pipeline. We work our way backwards from a daily SLA deadline
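Working backwards from a deadline is simple arithmetic: start from the SLA time and subtract each step's expected duration plus a safety buffer, last step first. The sketch below does this for three steps; the durations, buffer, and deadline are made-up numbers, and the cron helper assumes the six-field `cron(...)` syntax that AWS Glue schedules use, evaluated in UTC.

```python
from datetime import datetime, timedelta


def plan_backwards(deadline, steps, buffer=timedelta(minutes=15)):
    """Return (step name, start time) pairs so the last step ends by the deadline.

    steps is a list of (name, expected duration) in execution order.
    """
    plan = []
    finish_by = deadline
    for name, duration in reversed(steps):
        start = finish_by - duration - buffer
        plan.append((name, start))
        finish_by = start  # the previous step must be done before this one starts
    return list(reversed(plan))


def daily_cron(start):
    """Daily schedule expression for the given start time."""
    return f"cron({start.minute} {start.hour} * * ? *)"


# Hypothetical durations and a 06:00 UTC deadline.
steps = [
    ("crawl raw dataset", timedelta(minutes=10)),
    ("run ETL job", timedelta(minutes=60)),
    ("crawl optimized dataset", timedelta(minutes=10)),
]
plan = plan_backwards(datetime(2018, 11, 27, 6, 0), steps)
for name, start in plan:
    print(f"{name}: {start:%H:%M} -> {daily_cron(start)}")
```

With these numbers the raw crawler would be scheduled for 03:55, the job for 04:20, and the optimized crawler for 05:35. The weakness of this approach is that the schedule is only as good as the duration estimates, which is what the event-driven and state machine-driven options avoid.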
  • 40. Event-driven AWS Glue ETL pipeline. Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
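One way this could look: a CloudWatch Events rule forwards "Glue Job State Change" events to a Lambda function, and the function starts the next job when the previous one succeeds. The sketch below is an assumption about how such a handler might be written, not the workshop's code; the job names in `NEXT_JOB` are hypothetical.

```python
# Hypothetical mapping from a finished job to the job that should run next.
NEXT_JOB = {"nyctaxi-create-optimized": "nyctaxi-build-report"}


def handle_event(event, glue_client):
    """Start the follow-up job for a successful Glue job run; return its run ID or None."""
    if event.get("source") != "aws.glue":
        return None
    if event.get("detail-type") != "Glue Job State Change":
        return None

    detail = event.get("detail", {})
    if detail.get("state") != "SUCCEEDED":
        return None  # failures and timeouts are ignored in this sketch

    next_job = NEXT_JOB.get(detail.get("jobName"))
    if next_job is None:
        return None  # end of the pipeline, or a job we do not manage

    response = glue_client.start_job_run(JobName=next_job)
    return response["JobRunId"]


def lambda_handler(event, context):
    """AWS Lambda entry point; boto3 is provided by the Lambda Python runtime."""
    import boto3
    return handle_event(event, boto3.client("glue"))
```

Keeping the logic in `handle_event` with an injected client makes it easy to exercise locally with a stub before deploying.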
  • 41. State machine-driven AWS Glue ETL pipeline. Let AWS Step Functions drive the pipeline
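A rough sketch of what such a state machine definition might contain, assuming the Step Functions service integration for AWS Glue (the `glue:startJobRun.sync` task resource, which waits for the job run to finish). The snippet builds an Amazon States Language definition that runs a list of jobs in sequence; the job names are hypothetical and are reused as state names.

```python
import json


def glue_task(job_name, next_state=None):
    """One Task state that starts a Glue job and waits for it to complete."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
    }
    if next_state is not None:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_state_machine(job_names):
    """Chain the given jobs into a sequential state machine definition."""
    states = {}
    for index, job in enumerate(job_names):
        following = job_names[index + 1] if index + 1 < len(job_names) else None
        states[job] = glue_task(job, following)
    return {
        "Comment": "Sequential AWS Glue ETL pipeline (illustrative)",
        "StartAt": job_names[0],
        "States": states,
    }


# Hypothetical job names.
definition = build_state_machine(["nyctaxi-create-optimized", "nyctaxi-build-report"])
print(json.dumps(definition, indent=2))
```

The resulting JSON is what would be supplied as the state machine definition; retries and error handling are omitted here.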
  • 42. Concepts in action. We’ll build a schedule-driven pipeline (diagram: AWS Glue crawler for raw dataset, NYC taxi raw dataset, trigger, AWS Glue job to create optimized dataset, NYC taxi optimized dataset, AWS Glue crawler for optimized dataset, AWS Glue Data Catalog): 1. Scheduled crawler runs on raw dataset and updates AWS Glue Data Catalog 2. AWS Glue job starts based on trigger(s) 3. Job reads raw dataset, applies transformations, and writes optimized dataset to your S3 bucket 4. Scheduled crawler runs on optimized dataset and updates Data Catalog
  • 43. Hands-on II.4.1: Schedule AWS Glue crawlers
  • 44. Hands-on II.4.2: Create an AWS Glue job
  • 45. Hands-on II.4.3: Create an AWS Glue trigger to run jobs
  • 46. Hands-on Advanced: Try out AWS Glue job bookmarks
  • 47. Workshop map
  • 48. ML problem definition. Tips given to taxi drivers are a substantial part of their income. They can also serve as a loose metric for service quality that can be useful for taxi companies. What influences higher tips (for example, trips starting and ending in richer parts of the city)? We present an analysis of factors affecting tips and use these factors to predict tips
  • 49. ML architecture (diagram labels: Amazon SageMaker notebook instance, Glue Spark, feature engineering, data splitting, XGBoost regression, fit(), training job, model creation, endpoint configuration, endpoint, test data, transform(), results)
  • 50. Concepts in action (diagram: NYC taxi optimized, AWS Glue Data Catalog, AWS Glue development endpoint, Amazon SageMaker notebook): 1. Launch notebook 2. Read optimized dataset 3. Train the ML model 4. Host the trained model and perform prediction
  • 51. Hands-on II.5.1: Interactively develop a machine learning model in Jupyter
  • 52. Run ML notebook interactively • Initialize variables and import Spark libraries • Retrieve optimized NYC taxi trips dataset • Observe features against target label (tip_amount) • Perform feature engineering • Split feature-engineered dataset into train and test • Launch Spark XGBoostEstimator (observe hyperparameters) • Perform prediction and observe accuracy
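The notebook itself trains an XGBoost regressor on Spark. The plain-Python sketch below does not reproduce that; it only illustrates two of the steps listed above, splitting rows into train and test sets and scoring predictions of `tip_amount`, using a trivial mean-tip baseline in place of a real model. The sample tips, the 80/20 split, and the choice of RMSE as the metric are assumptions.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up tip amounts standing in for the tip_amount label.
tips = [0.0, 1.5, 2.0, 2.5, 3.0, 0.0, 4.0, 1.0, 2.2, 5.0]
train, test = train_test_split(tips)

# Baseline "model": always predict the mean tip seen in training.
mean_tip = sum(train) / len(train)
baseline_error = rmse(test, [mean_tip] * len(test))
print(f"baseline RMSE: {baseline_error:.2f}")
```

A trained regressor is only useful if its error on the test set is clearly lower than a baseline like this one.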
  • 53. You made it!
  • 54. Before you go . . . • Visit the “Account clean-up” section for instructions on cleaning up your AWS account • Tell us how we did: please fill out the survey. Thank you
  • 55. Thank you! Moataz Anany, Solutions Architect, AWS; Nitin Wagh, Solutions Architect, AWS