© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Performing serverless analytics in AWS Glue
Mehul A. Shah
GM, AWS Glue and AWS Lake Formation
ADB202
Agenda
What is serverless?
AWS Glue overview
Serverless data discovery
Serverless data science, including data prep, analytics, and profiling
Serverless orchestration
A new cloud programming paradigm
Submit code
Run on your behalf
Auto-scale
Pay on invocation
Early use cases
Analytics is harder
AWS Glue
Fully managed, serverless data integration service
For developers and data scientists
We’ve been busy …
Access policies for AWS Glue Data Catalog
Amazon SageMaker notebooks
Encryption at rest
Canada (Central) Region
Crawler combine compatible schemas
ETL job metrics
Amazon DynamoDB integration
Job delay notification
London Region
Seoul Region
Crawler merge new columns
Mumbai Region
Support Apache Spark 2.2.1
ETL job timeout
Singapore Region
Sydney Region
Readers support JSONPath expressions
New job event types
Support for Scala scripts
Tokyo Region
Crawler CWE notifications
XML support
AWS CloudTrail support
AWS CloudFormation templates
Crawler exclusion patterns
Per-second billing
DynamicFrame filter and map
GDPR, HIPAA, and BAA compliance
Ireland, Oregon, Ohio Region
Frankfurt Region
Select AWS Glue customers
What customers are saying
“… data lake … Amazon EMR in Hive and SparkSQL …”
Ram Kumar Regnaswamy, CTO, Beeswax
“… data lake to our Redshift warehouse is just one of use case examples of AWS Glue. … Being cost-effective is essential. … AWS Glue has enabled our small team of data engineers to run the whole data infrastructure in our …”
Umang Rustagi, Co-founder and COO, FinAccel
“… 200% faster than traditional ETL tools with no operational overhead due to the serverless nature of Glue ETL and at a fraction of the cost”
Miki Hardisty, CTO, Jack in the Box
AWS Glue components
Data Catalog (Discover): Automatic crawling; Apache Hive Metastore compatible; Integrated with Amazon Web Services (AWS) analytic services
Serverless engine (Develop): Apache Spark; Python shell; Interactive and batch jobs
Orchestration (Deploy): Flexible scheduling; Monitoring and alerting; External integrations
Under the hood
Serverless Apache Spark with essential extras!
[Figure: layered stack. Apache Spark Core (RDDs); Apache Spark DataFrames alongside AWS Glue DynamicFrames; SparkSQL alongside AWS Glue ETL]
Beyond data integration: serverless data science and exploration
Public GitHub timeline
40+ event types
githubarchive.org
Unique payload per event type
Example analytics use case
[Figure: a source S3 bucket of JSON files and a target S3 bucket of Parquet files, both laid out in Apache Hive-style partitions (year, month, day, hour; e.g. 2018, months 11 and 12, days 21 and 22); a transform step converts the JSON to Parquet]
Crawler discovers structure
Handles complex, nested fields
Detects Hive-style partitions
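Hive-style partitions encode column values in the object key as name=value path segments. The sketch below shows how such a key could be decoded; the key layout and file name are hypothetical, and this is an illustration of the convention, not the crawler's actual implementation.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract name=value partition segments from an object key."""
    partitions = {}
    # Every path segment except the last (the file name) may be a partition.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key, for illustration only.
example_key = "github/year=2018/month=11/day=21/hour=0/events.json"
print(parse_hive_partitions(example_key))
```

Keys without name=value segments yield an empty mapping, which is roughly the case where no partition columns can be inferred from the path.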
Crawler performance
90M+ files per day
Millions of partitions
Your mileage may vary with partition size and complexity
Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell
Connect Amazon SageMaker notebooks
Explore your data
Analyze and experiment
Architecture for interactive data science
[Figure: interpreter server, remote interpreter, Glue Apache Spark environment]
Deploy to production
Push scripts to Amazon S3
Register as job
Serverless execution
Auto-configure VPC & role-based access; security & isolation preserved
Customers can specify capacity (DPU)
Automatically scale resources
Only pay for the resources you consume; per-second billing (10-minute min.)
No need to provision, configure, or manage servers
[Figure: compute instances connected to customer VPCs]
Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell
Under the hood: Apache Spark and AWS Glue libraries
Apache Spark is a distributed data-processing engine for complex analytics
AWS Glue builds on the Apache Spark to offer ETL-specific functionality
[Figure: layered stack. Apache Spark Core (RDDs); Apache Spark DataFrames alongside AWS Glue DynamicFrames; SparkSQL alongside AWS Glue ETL]
DataFrames and DynamicFrames
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema up front
Each row has same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data (e.g., JSON, Avro, Apache logs)
Public GitHub timeline
40+ event types
Semi-structured
Payload structure and size vary by event type
DynamicFrame internals
Schema per-record, no up-front schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in single pass
[Figure: dynamic records, each carrying its own fields (id, type), alongside the DynamicFrame schema]
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
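One way to picture a per-record schema is to infer each record's field types independently and only then union them into a frame-level schema. The sketch below does this in plain Python; the function names and the placeholder payload values are hypothetical, and it illustrates the idea rather than how DynamicFrames are actually implemented.

```python
def infer_type(value):
    """Return a simple type name for a JSON-like value."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def merge_schemas(records):
    """Union the per-record schemas: field name -> sorted list of observed types."""
    observed = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, set()).add(infer_type(value))
    return {field: sorted(types) for field, types in observed.items()}


# Records modeled on the slide; the payload values are placeholders.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "a"}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "b"}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "c"}},
]
schema = merge_schemas(records)
print(schema)
```

Here the id field is observed both as a string and as an int, so the merged schema keeps both candidates instead of forcing one type up front.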
Glue Parquet writer
We built a custom Parquet writer to provide schema flexibility
Standard Parquet writer:
Set schema -> write row group(s)
Glue Parquet writer:
1. Start writing columns, adding fields as necessary
2. Close first row group and write schema
Additional schema changes trigger new file
[Figure: Parquet file layout with row group 1 and row group 2, each holding column 1, column 2, …, followed by row group metadata, including schema]
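The two steps above can be modeled with a toy in-memory writer: fields accumulate while the first row group is open, the schema is fixed when that row group closes, and a later schema change starts a new file. The class name, the rows_per_group parameter, and the dict-based "files" are all hypothetical; this is a sketch of the described behavior, not the Glue writer.

```python
class FlexibleSchemaWriter:
    """Toy model of a writer whose schema is fixed only after the first row group."""

    def __init__(self, rows_per_group=2):
        self.rows_per_group = rows_per_group
        self.files = []          # each file: {"schema": [...], "row_groups": [...]}
        self._start_file()

    def _start_file(self):
        self.fields = []         # column names seen so far in the current file
        self.frozen = False      # True once the first row group is closed
        self.buffer = []         # rows of the row group being written
        self.current = {"schema": None, "row_groups": []}
        self.files.append(self.current)

    def write(self, row):
        new_fields = [f for f in row if f not in self.fields]
        if new_fields and self.frozen:
            # Schema already written: a further change triggers a new file.
            self._flush()
            self._start_file()
        for f in row:
            if f not in self.fields:
                self.fields.append(f)
        self.buffer.append(row)
        if len(self.buffer) == self.rows_per_group:
            self._flush()

    def _flush(self):
        if not self.buffer:
            return
        self.current["row_groups"].append(self.buffer)
        self.buffer = []
        if not self.frozen:
            # Closing the first row group fixes the schema for this file.
            self.current["schema"] = list(self.fields)
            self.frozen = True

    def close(self):
        self._flush()
        return self.files


writer = FlexibleSchemaWriter(rows_per_group=2)
writer.write({"id": 1})
writer.write({"id": 2, "type": "PushEvent"})   # new field before the schema is fixed
writer.write({"id": 3, "type": "ForkEvent"})
writer.write({"id": 4, "actor": "x"})          # new field after the schema is fixed
files = writer.close()
print([f["schema"] for f in files])
```

The second write adds a field without penalty because the first row group is still open; the fourth write arrives after the schema was fixed, so it lands in a second file.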
DynamicFrame performance
[Figure: bar chart of time in seconds (axis 0 to 2000, lower is better) for DynamicFrame vs. DataFrame at three data sizes: Day (24 files), Month (744 files), Year (8699 files)]
Configuration: 10 DPUs, Apache Spark 2.2
Workload: JSON to Parquet; Filter for Fork events; DynamicFrame w/ custom Parquet; SQL GroupBy query
Conversion to Parquet
Parquet output    Time (sec.)
DynamicFrame      78
DataFrame         195
Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell
AWS Glue execution model
Apache Spark and AWS Glue are data parallel.
Data is divided into partitions (shards) that are processed concurrently.
Jobs are divided into stages
1 stage x 1 partition = 1 task
Driver schedules tasks on executors
2 executors per DPU
Overall throughput is limited by the number of partitions (shards)
[Figure: a driver scheduling tasks on executors]
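The task arithmetic on this slide can be worked through in a few lines. The sketch below assumes, for simplicity, that each executor runs one task at a time and that every DPU hosts executors; tasks_per_executor is a hypothetical knob, not a Glue setting.

```python
import math

EXECUTORS_PER_DPU = 2  # from the slide: 2 executors per DPU


def task_count(stages: int, partitions: int) -> int:
    """1 stage x 1 partition = 1 task."""
    return stages * partitions


def waves_per_stage(partitions: int, dpus: int, tasks_per_executor: int = 1) -> int:
    """Rounds of tasks needed to process every partition of one stage."""
    slots = dpus * EXECUTORS_PER_DPU * tasks_per_executor
    return math.ceil(partitions / slots)


print(task_count(stages=3, partitions=24))
print(waves_per_stage(partitions=24, dpus=10))
print(waves_per_stage(partitions=8, dpus=10))
```

With only 8 partitions on 10 DPUs, most task slots sit idle, which is the sense in which the number of partitions caps throughput.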
AWS Glue job metrics
Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
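As a rough illustration, the parameter can be carried in a job definition's default arguments. The sketch below only builds the dictionary; the job name, role, and script path are placeholders, and the empty string used as the value is an assumption (the slide only says the key is passed).

```python
def job_arguments(extra=None):
    """Build a DefaultArguments-style mapping with job metrics switched on."""
    # "--enable-metrics" is passed as a key; the empty value is an assumption here.
    args = {"--enable-metrics": ""}
    args.update(extra or {})
    return args


# Hypothetical job definition; the name, role, and script path are placeholders.
job_definition = {
    "Name": "example-job",
    "Role": "ExampleGlueRole",
    "Command": {"Name": "glueetl", "ScriptLocation": "s3://example-bucket/scripts/job.py"},
    "DefaultArguments": job_arguments({"--job-language": "python"}),
}
print(job_definition["DefaultArguments"])
```

A dictionary of this shape could then be handed to an SDK call such as boto3's Glue `create_job(**job_definition)`; that call is left out here so the example runs without AWS credentials.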
Profile jobs using Glue metrics
Derived from the underlying Apache Spark metrics
Driver and per executor
Aggregates and instantaneous
Reports to Amazon CloudWatch metrics every 30 sec.
Metrics: Memory usage, bytes read and written, CPU load, bytes shuffled, needed executors, and more
Example: Profiling memory usage
overwhelms
Example: AWS Glue small-file handling
Driver memory remains below 50% for the entire duration of execution
DynamicFrames automatically group files into fewer tasks
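The effect of grouping can be illustrated with a simple greedy packer: many small files are combined until a target group size is reached, and each group becomes one task. The function name, the target size, and the file list are hypothetical; this sketches the idea and is not Glue's grouping algorithm.

```python
def group_files(file_sizes, target_group_bytes):
    """Greedily pack (name, size) pairs into groups of roughly target_group_bytes."""
    groups, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        if current and current_bytes + size > target_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# 1,000 hypothetical 100 KB files packed into groups of about 1 MB.
files = [(f"part-{i:04d}.json", 100 * 1024) for i in range(1000)]
groups = group_files(files, target_group_bytes=1024 * 1024)
print(len(files), "files ->", len(groups), "tasks")
```

Fewer tasks means less per-task bookkeeping on the driver, which is consistent with the lower driver memory reported on this slide.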
New worker types
Worker maps to 1 DPU
Standard – 2 executors/worker: 16 GB
More memory per executor
G.1X – 1 executor/worker: 16 GB
G.2X – 1 executor/worker: 32 GB
Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell
Python shell job type
A cost-effective primitive for small to medium tasks
Python
shell
SQL-based analytics
Medium-size ML tasks
AWS Glue Python shell specs
Python 2.7 environment with boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, and so on
Cold spin-up: < 20 sec., no runtime limit
Network addressable, support for VPCs, 10GB local storage
Sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB)
Pricing: $0.44 per DPU-hour, 1-min. minimum, per-second billing
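The pricing line translates into a short calculation: per-second billing at a DPU-hour rate, with a minimum billed duration. The sketch below uses the rate and 1-minute minimum quoted on this slide as defaults; the function name is hypothetical and actual billing may differ in details such as rounding.

```python
def billed_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44, minimum_seconds=60):
    """Per-second billing with a minimum billed duration."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * price_per_dpu_hour * billed_seconds / 3600


# One hour on a full DPU, and a 10-second run on the 1/16 DPU size
# (billed as one minute because of the minimum).
print(round(billed_cost(1, 3600), 4))
print(round(billed_cost(1 / 16, 10), 6))
```

Passing minimum_seconds=600 models the 10-minute minimum quoted earlier in the deck for Apache Spark jobs.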
Python shell collaborative filtering example
Amazon customer reviews dataset (s3://amazon-reviews-pds), Video category
Compute low-rank approx. of (Customer x Product) ratings using SVD; uses scipy sparse matrix and SVD library
Step                    Time (sec)
Amazon Redshift COPY    13
Extract ratings         5
Generate matrix         1552
SVD (k=1000)            2575
Total                   4145
Configuration: 1 DPU; matrix: 217K x 384K; SVD rank = 1000; runtime: 69 min.; estimated cost: $0.60
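The same technique can be tried at toy scale with scipy's sparse matrix type and truncated SVD routine. The ratings below are made-up triples standing in for the extracted data, and k=2 replaces the rank of 1000 used in the talk; this is a sketch of the approach, not the speaker's script.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical (customer, product, rating) triples standing in for extracted ratings.
ratings = [
    (0, 0, 5.0), (0, 2, 3.0),
    (1, 0, 4.0), (1, 1, 1.0),
    (2, 1, 2.0), (2, 3, 5.0),
    (3, 2, 4.0), (3, 3, 4.0),
    (4, 0, 1.0), (4, 3, 2.0),
]
rows, cols, vals = zip(*ratings)

# Sparse (customer x product) matrix; unrated pairs are implicit zeros.
matrix = csr_matrix((vals, (rows, cols)), shape=(5, 4), dtype=np.float64)

# Truncated SVD: keep only the k largest singular values.
k = 2
u, s, vt = svds(matrix, k=k)

# Rank-k approximation of the ratings matrix.
approx = u @ np.diag(s) @ vt
print(approx.shape, s)
```

The dense product at the end is only practical for a toy matrix; at the 217K x 384K scale quoted above one would keep the factors u, s, and vt and compute individual predictions on demand.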
Orchestration
Marketing: Ad spend by customer segment
Event based
AWS Lambda trigger
Sales: Revenue by customer segment
Schedule
Central: ROI by customer segment
Weekly sales
Compose jobs globally with event-based dependencies
Orchestration building blocks
Entities: Crawlers, Jobs, Triggers
Dependencies: Schedule, External, Events
Control: Conditions, Timeout, Retries
Example event-driven workflow
New raw data arrives -> Crawl raw dataset -> Run “optimize” job -> Crawl optimized dataset -> Ready for reporting (SLA deadline)
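An event-driven chain like this one can be modeled as steps that fire once their prerequisites have succeeded. The sketch below is a small in-process stand-in for triggers; the step names and the run_workflow helper are hypothetical, and real crawlers and jobs would be started where the placeholder actions are called.

```python
def run_workflow(steps, actions):
    """Run each step once every step it depends on has succeeded."""
    done, order = set(), []
    pending = dict(steps)  # step name -> list of prerequisite step names
    while pending:
        ready = [name for name, deps in pending.items() if all(d in done for d in deps)]
        if not ready:
            raise ValueError("unsatisfiable dependencies: " + ", ".join(pending))
        for name in ready:
            actions[name]()        # e.g. start a crawler or a job
            done.add(name)
            order.append(name)
            del pending[name]
    return order


steps = {
    "crawl_raw": [],
    "optimize_job": ["crawl_raw"],
    "crawl_optimized": ["optimize_job"],
}
log = []
actions = {name: (lambda n=name: log.append(n)) for name in steps}
order = run_workflow(steps, actions)
print(order)
```

Conditions, timeouts, and retries from the building-blocks slide are left out to keep the sketch short.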
Conclusion
AWS Glue supports Apache Spark and Python shell “functions” for data science and analytics
Serverless is “Function-as-a-Service”
End-to-end serverless analytics with Data Catalog, crawlers, notebooks, and orchestration
Thank you!
Mehul A. Shah
glue-pm@amazon.com

Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit

  • 1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Performing serverless analytics in AWS Glue Mehul A. Shah GM, AWS Glue and AWS Lake Formation A D B 2 0 2
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Agenda What is serverless? AWS Glue overview Serverless data discovery Serverless data science, including data prep, analytics, and profiling Serverless orchestration
  • 3. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T A new cloud programming paradigm Submit code Run on your behalf Auto-scale Pay on invocation
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Early use cases
  • 6. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Analytics is harder
  • 7. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Fully managed, serverless data integration service For developers and data scientists
  • 9. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T We’ve been busy … Access policies for AWS Glue Data Catalog Amazon SageMaker notebooks Encryption at rest Canada central Region Crawler combine compatible schemas ETL job metrics Amazon DynamoDB integration Job delay notification London Region Seoul Region Crawler merge new columns Mumbai Region Support Apache Spark 2.2.1 ETL job timeout Singapore Region Sydney Region Readers support JSONPath expressions New job events types Support for Scala scripts Tokyo Region Crawler CWE notifications XML support AWS CloudTrail support AWS CloudFormation templates Crawler exclusion patterns Per-second billing DynamicFrame filter and map GDPR, HIPAA, and BAA compliance Ireland, Oregon, Ohio Region Frankfurt Region
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Select AWS Glue customers
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T What customers are saying data lake Amazon EMR in Hive and SparkSQL …” Ram Kumar Regnaswamy, CTO, Beeswax data lake to our Redshift warehouse is just one of use case examples of AWS Glue. … Being cost-effective is essential. … AWS Glue has enabled our small team of data engineers to run the whole data infrastructure in our Umang Rustagi, Co-founder and COO, FinAccel 200% faster than traditional ETL tools with no operational overhead due to the serverless nature of Glue ETL and at a fraction of the cost Miki Hardisty, CTO, Jack in the Box
  • 12. AWS Glue components. Data Catalog (Discover): automatic crawling; Apache Hive Metastore compatible; integrated with Amazon Web Services (AWS) analytic services. Serverless engine (Develop): Apache Spark; Python shell; interactive and batch jobs. Orchestration (Deploy): flexible scheduling; monitoring and alerting; external integrations.
  • 13. Under the hood: Serverless Apache Spark with essential extras! Stack: Apache Spark Core: RDDs; Apache Spark DataFrames; AWS Glue DynamicFrames; SparkSQL; AWS Glue ETL.
  • 14. Beyond data integration: serverless data science and exploration
  • 16. Public GitHub timeline (githubarchive.org): 40+ event types; unique payload per event type.
  • 17. Example analytics use case: transform JSON in a source S3 bucket into Parquet in a target S3 bucket. Both sides use Apache Hive-style partitions by year, month, day, and hour (e.g., year 2018, month 11, day 12, hours 21 and 22).
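A minimal sketch of what a Hive-style partition layout looks like as an S3 key prefix, assuming the usual `key=value` naming for each partition level. The bucket name and the helper `hive_partition_prefix` are hypothetical and are not taken from the talk.

```python
from datetime import datetime


def hive_partition_prefix(base: str, ts: datetime) -> str:
    """Build a Hive-style partition prefix made of key=value path segments."""
    return (
        f"{base.rstrip('/')}/year={ts.year}/month={ts.month:02d}"
        f"/day={ts.day:02d}/hour={ts.hour:02d}/"
    )


# Hypothetical target location for the hour shown on the slide.
prefix = hive_partition_prefix(
    "s3://example-target-bucket/events", datetime(2018, 11, 12, 21)
)
print(prefix)
```

Because the partition keys are encoded in the path, a query engine can prune whole prefixes when a filter names a year, month, day, or hour.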
  • 18. Crawler discovers structure: handles complex, nested fields; detects Hive-style partitions.
  • 19. Crawler performance: 90M+ files per day; millions of partitions; YMMV with partition size and complexity.
  • 21. Data science and analytics backends: Apache Spark; data analytics; data preparation; profiling; Python shell.
  • 22. Connect Amazon SageMaker notebooks: explore your data.
  • 23. Analyze and experiment
  • 24. Architecture for interactive data science: Glue Apache Spark environment; interpreter server; remote interpreter. Deploy to production: push scripts to Amazon S3; register as job.
  • 25. Serverless execution: No need to provision, configure, or manage servers. Auto-configure VPC and role-based access; security and isolation preserved. Customers can specify capacity (DPU). Automatically scale resources. Only pay for the resources you consume: per-second billing (10-minute min.). Diagram labels: compute instances; customer VPC.
  • 26. Data science and analytics backends: Apache Spark; data analytics; data preparation; profiling; Python shell.
  • 27. Under the hood: Apache Spark and AWS Glue libraries. Apache Spark is a distributed data-processing engine for complex analytics. AWS Glue builds on Apache Spark to offer ETL-specific functionality. Stack: Apache Spark Core: RDDs; Apache Spark DataFrames; AWS Glue DynamicFrames; SparkSQL; AWS Glue ETL.
  • 28. DataFrames and DynamicFrames. DataFrames: core data structure for SparkSQL; like structured tables; need schema up front; each row has same structure; suited for SQL-like analytics. DynamicFrames: like DataFrames for ETL; designed for processing semi-structured data (e.g., JSON, Avro, Apache logs).
  • 29. Public GitHub timeline: 40+ event types; semi-structured; payload structure and size vary by event type.
  • 30. DynamicFrame internals: schema per record, no up-front schema needed; easy to restructure, tag, modify; can be more compact than DataFrame rows; many flows can be done in a single pass. Dynamic records: {"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}; {"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}; {"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}. DynamicFrame schema: id, type.
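The idea of a schema per record can be illustrated in plain Python. This is only a sketch of the concept, not how AWS Glue implements DynamicFrames: the helpers `record_schema` and `merged_schema` are hypothetical, and the `"..."` payload values are placeholders for content the slide elides.

```python
from collections import defaultdict


def record_schema(record: dict) -> dict:
    """Schema of a single record: field name -> Python type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def merged_schema(records: list) -> dict:
    """Union of per-record schemas: field name -> sorted list of observed types."""
    observed = defaultdict(set)
    for record in records:
        for field, type_name in record_schema(record).items():
            observed[field].add(type_name)
    return {field: sorted(types) for field, types in observed.items()}


# The three records from the slide; note that "id" is a string in two of them
# and an integer in the other.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "..."}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "..."}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "..."}},
]

schema = merged_schema(records)
print(schema)
```

No schema is declared up front here: each record describes itself, and the frame-level schema is computed only when needed, which is where a field such as `id` shows up with more than one type.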
  • 31. Glue Parquet writer: We built a custom Parquet writer to provide schema flexibility. Standard Parquet writer: set schema -> write row group(s). Glue Parquet writer: (1) start writing columns, adding fields as necessary; (2) close first row group and write schema. Additional schema changes trigger a new file. File layout: row group 1 (column 1, column 2, …); row group 2 (column 1, column 2, …); row group metadata, including schema.
  • 32. DynamicFrame performance. Configuration: 10 DPUs; Apache Spark 2.2. Workload: JSON to Parquet; filter for Fork events; DynamicFrame w/ custom Parquet; SQL GroupBy query; Parquet output. Chart: time (sec., lower is better) for DynamicFrame and DataFrame by data size (# files): day (24), month (744), year (8699). Conversion to Parquet, time (sec.): DynamicFrame 78; DataFrame 195.
  • 33. Data science and analytics backends: Apache Spark; data analytics; data preparation; profiling; Python shell.
  • 34. AWS Glue execution model: Apache Spark and AWS Glue are data parallel. Data is divided into partitions (shards) that are processed concurrently. Jobs are divided into stages; 1 stage x 1 partition = 1 task. The driver schedules tasks on executors; 2 executors per DPU. Overall throughput is limited by the number of partitions (shards).
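The arithmetic behind this model can be sketched as follows. It is a back-of-the-envelope estimate only: the `slots_per_executor` parameter is a hypothetical stand-in for how many tasks an executor runs at once, and the sketch ignores any capacity reserved for the driver.

```python
import math


def task_count(stages: int, partitions_per_stage: int) -> int:
    """1 stage x 1 partition = 1 task."""
    return stages * partitions_per_stage


def executor_count(dpus: int, executors_per_dpu: int = 2) -> int:
    """The slide states 2 executors per DPU."""
    return dpus * executors_per_dpu


def scheduling_waves(tasks: int, executors: int, slots_per_executor: int = 1) -> int:
    """Rounds of tasks needed when at most executors * slots run concurrently."""
    return math.ceil(tasks / (executors * slots_per_executor))


tasks = task_count(stages=1, partitions_per_stage=744)  # e.g. one partition per file
executors = executor_count(dpus=10)
print(tasks, executors, scheduling_waves(tasks, executors))
```

With fewer partitions than available task slots, some executors sit idle, which is why throughput is bounded by the number of partitions.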
  • 35. AWS Glue job metrics: Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
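A hedged sketch of passing that key through the SDK, assuming boto3 and its Glue `start_job_run` call. The helper names are hypothetical, the job name is supplied by the caller, and the value `"true"` is an assumption: per the slide, it is the presence of the key that matters.

```python
def metrics_arguments(extra=None):
    """Job arguments with the --enable-metrics key set.

    The value "true" is an assumption; the slide only requires the key.
    """
    arguments = {"--enable-metrics": "true"}
    if extra:
        arguments.update(extra)
    return arguments


def start_run_with_metrics(job_name, extra=None):
    """Start a run of an existing Glue job with metrics enabled.

    Requires boto3 and AWS credentials; it is not invoked in this sketch.
    """
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name, Arguments=metrics_arguments(extra)
    )
    return response["JobRunId"]


print(metrics_arguments({"--input_path": "s3://example-bucket/raw/"}))
```

The `--input_path` argument and the bucket are made up for illustration; only `--enable-metrics` comes from the slide.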
  • 36. Profile jobs using Glue metrics: derived from the underlying Apache Spark metrics; driver and per executor; aggregates and instantaneous; reports to Amazon CloudWatch metrics every 30 sec. Metrics: memory usage, bytes read and written, CPU load, bytes shuffled, needed executors, and more.
  • 37. Example: Profiling memory usage overwhelms
  • 38. Example: AWS Glue small-file handling: driver memory remains below 50% for the entire duration of execution; DynamicFrames automatically group files into fewer tasks.
  • 39. New worker types: worker maps to 1 DPU. Standard: 2 executors/worker, 16 GB. More memory per executor: G.1X: 1 executor/worker, 16 GB; G.2X: 1 executor/worker, 32 GB.
  • 40. Data science and analytics backends: Apache Spark; data analytics; data preparation; profiling; Python shell.
  • 41. Python shell job type: a cost-effective primitive for small to medium tasks. Python shell: SQL-based analytics; medium-size ML tasks.
  • 42. AWS Glue Python shell specs: Python 2.7 environment with boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, and so on. Cold spin-up: < 20 sec.; no runtime limit. Network addressable; support for VPCs; 10 GB local storage. Sizes: 1 DPU (includes 16 GB) and 1/16 DPU (includes 1 GB). Pricing: $0.44 per DPU-hour, 1-min. minimum, per-second billing.
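The pricing line can be turned into a small estimator. This is a sketch under the rates quoted on the slide ($0.44 per DPU-hour, 1-minute minimum, per-second billing); the function name is hypothetical and current AWS pricing may differ.

```python
def python_shell_cost(runtime_sec: float, dpus: float = 1.0,
                      price_per_dpu_hour: float = 0.44,
                      minimum_sec: float = 60.0) -> float:
    """Estimated cost in USD: per-second billing with a minimum billed duration."""
    billed_sec = max(runtime_sec, minimum_sec)
    return dpus * price_per_dpu_hour * billed_sec / 3600.0


# A 30-second run on the 1/16 DPU size is billed as one minute.
small_run = python_shell_cost(30, dpus=1 / 16)
# A one-hour run on 1 DPU costs the full hourly rate.
hour_run = python_shell_cost(3600, dpus=1)
print(round(small_run, 6), round(hour_run, 2))
```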
  • 43. Python shell collaborative filtering example: Amazon customer reviews dataset (s3://amazon-reviews-pds), Video category. Compute low-rank approx. of (Customer x Product) ratings using SVD; uses scipy sparse matrix and SVD library. Setup: 1 DPU; matrix: 217K x 384K; SVD rank = 1000. Step times (sec.): Amazon Redshift COPY 13; extract ratings 5; generate matrix 1552; SVD (k=1000) 2575; total 4145. Runtime: 69 min.; estimated cost: $0.60.
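A toy version of the same technique, assuming `scipy.sparse.csr_matrix` and `scipy.sparse.linalg.svds` as the "scipy sparse matrix and SVD library" the slide mentions. The ratings below are made up for illustration; the talk's matrix is 217K x 384K with rank 1000.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Made-up (customer, product, star rating) triples.
customers = np.array([0, 0, 1, 2, 2, 3])
products = np.array([0, 1, 1, 2, 3, 3])
stars = np.array([5.0, 3.0, 4.0, 1.0, 5.0, 2.0])

# Sparse Customer x Product ratings matrix; unrated pairs stay implicit zeros.
ratings = csr_matrix((stars, (customers, products)), shape=(4, 4))

# Truncated SVD keeping k singular values (k must be below the smaller dimension).
u, s, vt = svds(ratings, k=2)

# Rank-k approximation of the ratings matrix.
approx = (u * s) @ vt
print(approx.shape)
```

The sparse representation is what keeps a matrix of this shape within the memory of a single Python shell job: only observed ratings are stored.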
  • 45. Orchestration: compose jobs globally with event-based dependencies. Marketing: ad spend by customer segment. Sales: revenue by customer segment. Central: ROI by customer segment. Diagram labels: event based; AWS Lambda trigger; schedule; weekly sales.
  • 46. Orchestration building blocks. Entities: crawlers, jobs, triggers. Dependencies: schedule, external events, conditions. Control: timeout, retries.
  • 47. Example event-driven workflow: new raw data arrives -> crawl raw dataset -> run "optimize" job -> crawl optimized dataset -> ready for reporting (SLA deadline).
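The dependency chain in this workflow can be sketched in plain Python. This is an illustration of event-based dependencies only, not the AWS Glue trigger API: the step names and the `run_order` helper are hypothetical.

```python
# Each step lists the steps that must finish before it can start.
workflow = {
    "crawl_raw": [],                      # starts when new raw data arrives
    "optimize_job": ["crawl_raw"],
    "crawl_optimized": ["optimize_job"],  # after this, data is ready for reporting
}


def run_order(dependencies: dict) -> list:
    """Return steps in an order that respects every dependency."""
    done, order = set(), []
    while len(order) < len(dependencies):
        ready = [
            step for step, needs in dependencies.items()
            if step not in done and all(n in done for n in needs)
        ]
        if not ready:
            raise ValueError("dependency cycle detected")
        for step in ready:
            done.add(step)
            order.append(step)
    return order


order = run_order(workflow)
print(order)
```

Each step fires on the completion event of the one before it, so no fixed schedule has to guess how long crawling or optimizing will take.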
  • 48. Conclusion: AWS Glue supports Apache Spark and Python shell "functions" for data science and analytics. Serverless is "Function-as-a-Service". End-to-end serverless analytics with Data Catalog, crawlers, notebooks, and orchestration.
  • 49. Thank you! Mehul A. Shah, glue-pm@amazon.com