Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Performing serverless analytics in AWS Glue
Mehul A. Shah
GM, AWS Glue and AWS Lake Formation
ADB202

Agenda
What is serverless?
AWS Glue overview
Serverless data discovery
Serverless data science, including data prep, analytics, and profiling
Serverless orchestration

A new cloud programming paradigm
Submit code
Run on your behalf
Auto-scale
Pay on invocation

Early use cases

Analytics is harder

AWS Glue
Fully managed, serverless data integration service
For developers and data scientists

We’ve been busy …
Access policies for AWS Glue Data Catalog
Amazon SageMaker notebooks
Encryption at rest
Canada (Central) Region
Crawler combine compatible schemas
ETL job metrics
Amazon DynamoDB integration
Job delay notification
London Region
Seoul Region
Crawler merge new columns
Mumbai Region
Support for Apache Spark 2.2.1
ETL job timeout
Singapore Region
Sydney Region
Readers support JSONPath expressions
New job event types
Support for Scala scripts
Tokyo Region
Crawler CWE notifications
XML support
AWS CloudTrail support
AWS CloudFormation templates
Crawler exclusion patterns
Per-second billing
DynamicFrame filter and map
GDPR, HIPAA, and BAA compliance
Ireland, Oregon, and Ohio Regions
Frankfurt Region

Select AWS Glue customers

What customers are saying
“… data lake … Amazon EMR in … Hive and SparkSQL …”
Ram Kumar Regnaswamy, CTO, Beeswax
“… data lake to our Redshift warehouse is just one of use case examples of AWS Glue. … Being cost-effective is essential. … AWS Glue has enabled our small team of data engineers to run the whole data infrastructure in our …”
Umang Rustagi, Co-founder and COO, FinAccel
“… 200% faster than traditional ETL tools with no operational overhead due to the serverless nature of Glue ETL and at a fraction of the cost …”
Miki Hardisty, CTO, Jack in the Box

AWS Glue components
Discover: Data Catalog
Automatic crawling
Apache Hive Metastore compatible
Integrated with Amazon Web Services (AWS) analytic services
Develop: Serverless engine
Apache Spark
Python shell
Interactive and batch jobs
Deploy: Orchestration
Flexible scheduling
Monitoring and alerting
External integrations

Under the hood
Serverless Apache Spark with essential extras!
Apache Spark Core: RDDs
Apache Spark DataFrames; AWS Glue DynamicFrames
SparkSQL; AWS Glue ETL

Beyond data integration: serverless data science and exploration

Public GitHub timeline
githubarchive.org
40+ event types
Unique payload per event type

Example analytics use case
Source S3 bucket: JSON in Apache Hive-style partitions (year / month / day / hour; the diagram shows 2018 / 11, 12 / 21, 22 / …)
transform
Target S3 bucket: Parquet in the same Apache Hive-style partitions

Crawler discovers structure
Handles complex, nested fields
Detects Hive-style partitions
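
To make "Hive-style partitions" concrete, here is a minimal sketch of how partition columns can be read out of an object key. It assumes the common name=value directory convention; the key and the function name parse_hive_partitions are hypothetical, and this is not the crawler's actual implementation.

```python
from urllib.parse import unquote


def parse_hive_partitions(key: str) -> dict:
    """Return the partition columns encoded as name=value path segments."""
    partitions = {}
    # The last segment is the object (file) name, so only directories are scanned.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = unquote(value)
    return partitions


# Hypothetical object key; the layout is an assumption for illustration only.
example_key = "github/year=2018/month=11/day=21/hour=05/events.json"
print(parse_hive_partitions(example_key))
```

A catalog can then expose year, month, day, and hour as table columns, so that queries filtering on them read only the matching prefixes.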

Crawler performance
90M+ files per day
Millions of partitions
Your mileage may vary (YMMV) with partition size and complexity

Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell

Connect Amazon SageMaker notebooks
Explore your data

Analyze and experiment

Architecture for interactive data science
Glue Apache Spark environment
Interpreter server
Remote interpreter
Deploy to production
Push scripts to Amazon S3
Register as job

Serverless execution
Auto-configure VPC and role-based access; security and isolation preserved
Customers can specify capacity (DPU)
Automatically scale resources
Only pay for the resources you consume: per-second billing (10-minute min.)
No need to provision, configure, or manage servers
[Diagram: compute instances attached to customer VPCs]

Under the hood: Apache Spark and AWS Glue libraries
Apache Spark is a distributed data-processing engine for complex analytics
AWS Glue builds on Apache Spark to offer ETL-specific functionality
Apache Spark Core: RDDs
Apache Spark DataFrames; AWS Glue DynamicFrames
SparkSQL; AWS Glue ETL

DataFrames and DynamicFrames
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema up front
Each row has same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data (e.g., JSON, Avro, Apache logs)

Public GitHub timeline
40+ event types
Semi-structured
Payload structure and size vary by event type

DynamicFrame internals
Schema per-record, no up-front schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in a single pass
Dynamic records
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
DynamicFrame schema: type, id
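
The idea of a per-record schema that is merged into a frame-level schema can be sketched in plain Python. This is an illustration only, not AWS Glue code: the helper names are hypothetical, and the payload values are placeholders because the slide elides them.

```python
def type_name(value) -> str:
    """Map a Python value to a coarse type label."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return type(value).__name__


def merge_schemas(records) -> dict:
    """Union the per-record schemas: field name -> set of observed types."""
    merged = {}
    for record in records:
        for field, value in record.items():
            merged.setdefault(field, set()).add(type_name(value))
    return merged


# Records modeled on the slide; payload contents are placeholders (None).
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": None}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": None}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": None}},
]

schema = merge_schemas(records)
# Fields observed with more than one type need to be resolved before SQL-style use.
ambiguous = sorted(field for field, types in schema.items() if len(types) > 1)
print(schema)
print(ambiguous)
```

Here "id" appears both as a string and as an integer, which is the kind of inconsistency a fixed up-front schema would reject and a per-record schema can carry until it is resolved.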

Glue Parquet writer
We built a custom Parquet writer to provide schema flexibility
Standard Parquet writer:
Set schema -> write row group(s)
Glue Parquet writer:
1. Start writing columns, adding fields as necessary
2. Close first row group and write schema
Additional schema changes trigger new file
[Diagram: file layout with row group 1 and row group 2, each holding column 1, column 2, …, followed by row group metadata, including schema]
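
The two steps above can be mimicked with a small in-memory simulation. It writes no real Parquet, and the class name, the rows_per_group parameter, and the data layout are assumptions made for illustration; it only shows the control flow of "grow the schema during the first row group, then roll to a new file on later schema changes".

```python
class FlexibleWriterSketch:
    """In-memory stand-in for a writer whose schema is fixed after the first row group."""

    def __init__(self, rows_per_group: int = 2):
        self.rows_per_group = rows_per_group
        self.files = []  # finished files: {"schema": [...], "row_groups": [...]}
        self._start_file()

    def _start_file(self):
        self.schema = []      # field names in first-seen order
        self.frozen = False   # becomes True once the first row group is closed
        self.row_groups = []
        self.current = []

    def _close_group(self):
        if self.current:
            self.row_groups.append(self.current)
            self.current = []
            self.frozen = True

    def _finish_file(self):
        self._close_group()
        if self.row_groups:
            self.files.append({"schema": list(self.schema), "row_groups": self.row_groups})
        self._start_file()

    def write(self, row: dict):
        new_fields = [field for field in row if field not in self.schema]
        if new_fields and self.frozen:
            # A schema change after the first row group triggers a new file.
            self._finish_file()
            new_fields = list(row)
        self.schema.extend(new_fields)
        self.current.append(row)
        if len(self.current) >= self.rows_per_group:
            self._close_group()

    def close(self):
        self._finish_file()
        return self.files


writer = FlexibleWriterSketch(rows_per_group=2)
for row in [{"id": 1}, {"id": 2, "type": "a"}, {"id": 3}, {"id": 4, "payload": "x"}]:
    writer.write(row)
files = writer.close()
print([f["schema"] for f in files])
```

The first two rows extend the schema freely; the fourth row introduces a field after the schema is fixed, so it starts a second file.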

DynamicFrame performance
Configuration
10 DPUs
Apache Spark 2.2
Workload
JSON to Parquet
Filter for Fork events
DynamicFrame w/ custom Parquet
SQL GroupBy query
[Chart: time (sec.), 0 to 2000, vs. data size (# files) for Day (24), Month (744), and Year (8699); series: DynamicFrame, DataFrame; lower is better]
Conversion to Parquet
Parquet output    Time (sec.)
DynamicFrame      78
DataFrame         195

AWS Glue execution model
Apache Spark and AWS Glue are data parallel.
Data is divided into partitions (shards) that are processed concurrently.
Jobs are divided into stages
1 stage x 1 partition = 1 task
Driver schedules tasks on executors
2 executors per DPU
Overall throughput is limited by the number of partitions (shards)
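
The arithmetic behind this model can be sketched as follows. Only "2 executors per DPU" and "1 stage x 1 partition = 1 task" come from the slide; the slots_per_executor parameter and the simplification that every DPU hosts executors (ignoring the driver) are assumptions.

```python
import math

EXECUTORS_PER_DPU = 2  # from the slide


def task_count(stages: int, partitions_per_stage: int) -> int:
    """1 stage x 1 partition = 1 task."""
    return stages * partitions_per_stage


def total_slots(dpus: int, slots_per_executor: int = 1) -> int:
    """Tasks that can run at once (slots_per_executor is an assumed knob)."""
    return dpus * EXECUTORS_PER_DPU * slots_per_executor


def scheduling_waves(partitions: int, dpus: int, slots_per_executor: int = 1) -> int:
    """Rounds needed to run one stage when every slot takes one task per round."""
    return math.ceil(partitions / total_slots(dpus, slots_per_executor))


def busy_slots(partitions: int, dpus: int, slots_per_executor: int = 1) -> int:
    """Parallelism is capped by the number of partitions."""
    return min(partitions, total_slots(dpus, slots_per_executor))


print(task_count(stages=3, partitions_per_stage=24))
print(scheduling_waves(partitions=24, dpus=10))
print(busy_slots(partitions=8, dpus=10))
```

With 10 DPUs and only 8 partitions, 12 of the 20 assumed slots sit idle, which is the sense in which throughput is limited by the partition count.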

AWS Glue job metrics
Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
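
A minimal sketch of what that request might look like when built for the SDK. The job name is hypothetical, and the empty string value is an assumption based on the slide's wording that the parameter key is what enables metrics.

```python
def build_job_run_request(job_name: str, extra_arguments=None) -> dict:
    """Build keyword arguments for a job run with metrics enabled."""
    arguments = dict(extra_arguments or {})
    # Only the presence of the key is assumed to matter; the value is left empty.
    arguments["--enable-metrics"] = ""
    return {"JobName": job_name, "Arguments": arguments}


# "github-json-to-parquet" is a hypothetical job name used for illustration.
request = build_job_run_request("github-json-to-parquet")
print(request)
```

With boto3 installed and credentials configured, this dictionary could be passed as boto3.client("glue").start_job_run(**request).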

Profile jobs using Glue metrics
Derived from the underlying Apache Spark metrics
Driver and per executor
Aggregates and instantaneous
Reports to Amazon CloudWatch metrics every 30 sec.
Metrics: Memory usage, bytes read and written, CPU load, bytes shuffled, needed executors, and more

Example: Profiling memory usage
overwhelms

Example: AWS Glue small-file handling
Driver memory remains below 50% for the entire duration of execution
DynamicFrames automatically group files into fewer tasks
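
Why grouping helps can be seen in a simple greedy sketch: many small files are packed into a few groups, so the driver tracks a few tasks rather than one per file. This is not AWS Glue's actual grouping algorithm; the function name, the file names, and the target size are assumptions.

```python
def group_files(file_sizes, target_group_bytes: int):
    """Greedily pack (name, size) pairs into groups of about target_group_bytes."""
    groups, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        # Start a new group when adding this file would exceed the target.
        if current and current_bytes + size > target_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Ten hypothetical 1 MB files packed into groups of at most 4 MB.
files = [(f"part-{i:04d}.json", 1_000_000) for i in range(10)]
groups = group_files(files, target_group_bytes=4_000_000)
print(len(files), "files ->", len(groups), "tasks")
```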

New worker types
Worker maps to 1 DPU
Standard – 2 executors/worker: 16 GB
More memory per executor
G.1X – 1 executor/worker: 16 GB
G.2X – 1 executor/worker: 32 GB

Python shell job type
A cost-effective primitive for small to medium tasks
SQL-based analytics
Medium-size ML tasks

AWS Glue Python shell specs
Python 2.7 environment with
boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, and so on
Cold spin-up: < 20 sec., no runtime limit
Network addressable, support for VPCs, 10GB local storage
Sizes: 1 DPU (includes 16GB), and 1/16 DPU (includes 1GB)
Pricing: $0.44 per DPU-hour, 1-min. minimum, per-second billing
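
The pricing line translates into a short calculation. The rate, the 1-minute minimum, and the two sizes are taken from the slide; rounding the runtime up to whole seconds is an assumption about how per-second billing is applied.

```python
import math

PRICE_PER_DPU_HOUR = 0.44    # USD, from the slide
MINIMUM_BILLED_SECONDS = 60  # 1-minute minimum, from the slide


def python_shell_cost(runtime_seconds: float, dpu: float = 1.0) -> float:
    """Estimated cost in USD for one Python shell run (dpu is 1 or 1/16)."""
    billed_seconds = max(math.ceil(runtime_seconds), MINIMUM_BILLED_SECONDS)
    return billed_seconds / 3600 * dpu * PRICE_PER_DPU_HOUR


# A 20-second run on the small size is billed as a full minute.
print(round(python_shell_cost(20, dpu=1 / 16), 6))
# A one-hour run on 1 DPU costs the hourly rate.
print(round(python_shell_cost(3600, dpu=1.0), 2))
```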

Python shell collaborative filtering example
Amazon customer reviews dataset (s3://amazon-reviews-pds)
Video category
Compute low-rank approx. of (Customer x Product) ratings using SVD
uses scipy sparse matrix and SVD library
Step Time (sec)
Amazon Redshift COPY 13
Extract ratings 5
Generate matrix 1552
SVD (k=1000) 2575
Total 4145
1 DPU
matrix: 217K x 384K
SVD -- rank = 1000
runtime: 69 min.
estimated cost: $0.60
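
The "generate matrix" and "SVD" steps might look roughly like the sketch below, which uses scipy's sparse matrix and svds routine on a tiny made-up ratings set. The triples, the matrix size, and k=2 are assumptions chosen so the example runs instantly; the talk's run used a 217K x 384K matrix with k=1000.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds


def low_rank_factors(triples, n_customers: int, n_products: int, k: int):
    """Build a sparse (customer x product) matrix and return its top-k SVD factors."""
    rows, cols, vals = zip(*triples)
    ratings = coo_matrix(
        (vals, (rows, cols)), shape=(n_customers, n_products), dtype=np.float64
    ).tocsr()
    u, s, vt = svds(ratings, k=k)  # requires k < min(n_customers, n_products)
    # Sort explicitly so the largest singular value comes first.
    order = np.argsort(s)[::-1]
    return u[:, order], s[order], vt[order, :]


# Made-up (customer, product, rating) triples for illustration only.
triples = [
    (0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
    (2, 3, 5.0), (2, 4, 4.0), (3, 3, 4.0), (3, 4, 5.0),
]
u, s, vt = low_rank_factors(triples, n_customers=4, n_products=5, k=2)
approx = (u * s) @ vt  # dense rank-2 approximation of the ratings matrix
print(approx.shape)
```

The rows of u scaled by s can be read as customer embeddings and the columns of vt as product embeddings, so a predicted rating is their dot product.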

Orchestration
Marketing: Ad spend by customer segment
Event based
AWS Lambda trigger
Sales: Revenue by customer segment
Schedule
Central: ROI by customer segment
Weekly sales
Compose jobs globally with event-based dependencies

Orchestration building blocks
Entities: Crawlers, Jobs, Triggers
Dependencies: Schedule, External, Events
Control: Conditions, Timeout, Retries

Example event-driven workflow
New raw data arrives
Crawl raw dataset
Run “optimize” job
Crawl optimized dataset
Ready for reporting
SLA deadline
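
The event-driven chain can be modeled as a small dependency map in which finishing one step fires the next. This is a plain-Python sketch rather than the AWS Glue triggers API; the step names mirror the slide, and the SLA deadline is not modeled.

```python
from collections import deque

# Completion of the key fires the listed steps (event-based dependencies).
WORKFLOW = {
    "new raw data arrives": ["crawl raw dataset"],
    "crawl raw dataset": ["run optimize job"],
    "run optimize job": ["crawl optimized dataset"],
    "crawl optimized dataset": ["ready for reporting"],
}


def run_workflow(start_event: str, triggers: dict) -> list:
    """Return the steps in the order they fire, starting from start_event."""
    order, queue = [], deque([start_event])
    while queue:
        step = queue.popleft()
        order.append(step)
        # Each finished step enqueues whatever depends on it.
        queue.extend(triggers.get(step, []))
    return order


steps = run_workflow("new raw data arrives", WORKFLOW)
print(" -> ".join(steps))
```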

Conclusion
Serverless is “Function-as-a-Service”
AWS Glue supports Apache Spark and Python shell “functions” for data science and analytics
End-to-end serverless analytics with Data Catalog, crawlers, notebooks, and orchestration
