Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT
Performing serverless analytics in
AWS Glue
Mehul A. Shah
GM, AWS Glue and AWS Lake Formation
ADB202
2.
Agenda
What is serverless?
AWS Glue overview
Serverless data discovery
Serverless data science, including data prep, analytics, and profiling
Serverless orchestration
4.
A new cloud programming paradigm
Submit code
Run on your behalf
Auto-scale
Pay on invocation
5.
Early use cases
6.
Analytics is harder
8.
AWS Glue
Fully managed, serverless data integration service
For developers and data scientists
9.
We’ve been busy …
Access policies for AWS Glue Data Catalog
Amazon SageMaker notebooks
Encryption at rest
Canada (Central) Region
Crawler combine compatible schemas
ETL job metrics
Amazon DynamoDB integration
Job delay notification
London Region
Seoul Region
Crawler merge new columns
Mumbai Region
Support for Apache Spark 2.2.1
ETL job timeout
Singapore Region
Sydney Region
Readers support JSONPath expressions
New job event types
Support for Scala scripts
Tokyo Region
Crawler CloudWatch Events (CWE) notifications
XML support
AWS CloudTrail support
AWS CloudFormation templates
Crawler exclusion patterns
Per-second billing
DynamicFrame filter and map
GDPR, HIPAA, and BAA compliance
Ireland, Oregon, and Ohio Regions
Frankfurt Region
10.
Select AWS Glue customers
11.
What customers are saying
“… data lake … Amazon EMR in … Hive and SparkSQL …”
Ram Kumar Regnaswamy, CTO, Beeswax
“… data lake to our Redshift warehouse is just one of use case examples of AWS Glue. … Being cost-effective is essential. … AWS Glue has enabled our small team of data engineers to run the whole data infrastructure in our …”
Umang Rustagi, Co-founder and COO, FinAccel
“… 200% faster than traditional ETL tools with no operational overhead due to the serverless nature of Glue ETL and at a fraction of the cost …”
Miki Hardisty, CTO, Jack in the Box
12.
AWS Glue components
Data Catalog (Discover)
Automatic crawling
Apache Hive Metastore compatible
Integrated with Amazon Web Services (AWS) analytic services
Serverless engine (Develop)
Apache Spark
Python shell
Interactive and batch jobs
Orchestration (Deploy)
Flexible scheduling
Monitoring and alerting
External integrations
13.
Under the hood
Serverless Apache Spark with essential extras!
Stack diagram, bottom to top: Apache Spark Core (RDDs); Apache Spark DataFrames and AWS Glue DynamicFrames; SparkSQL and AWS Glue ETL
14.
Beyond data integration: serverless data science and exploration
16.
Public GitHub timeline
40+ event types
githubarchive.org
Unique payload per event type
17.
Example analytics use case
Source S3 bucket: JSON files in Apache Hive-style partitions by year (2018), month, day, and hour
transform
Target S3 bucket: Parquet files in Apache Hive-style partitions by year (2018), month, day, and hour
18.
Crawler discovers structure
Handles complex, nested fields
Detects Hive-style partitions
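To make "Hive-style partitions" concrete: in this layout each directory level is a key=value pair, so partition columns can be read straight off the object key. The following is a minimal sketch of that idea, not the crawler's actual detection logic (which the talk does not describe). The object key, prefix, and file name are hypothetical.

```python
from typing import Dict


def parse_hive_partitions(key: str) -> Dict[str, str]:
    """Return the partition columns encoded as key=value path segments."""
    partitions: Dict[str, str] = {}
    # Every segment except the last one (the file name) is a directory.
    for segment in key.strip("/").split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical object key, for illustration only.
example_key = "events/year=2018/month=11/day=21/hour=22/part-0000.json"
print(parse_hive_partitions(example_key))
```

Segments without an "=" (such as the `events` prefix here) are ignored, so only the four partition columns are returned.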
19.
Crawler performance
90M+ files per day
Millions of partitions
Your mileage may vary with partition size and complexity
21.
Data science and analytics backends
Apache Spark
Data analytics
Data preparation
Profiling
Python shell
22.
Connect Amazon SageMaker notebooks
Explore your data
23.
Analyze and experiment
24.
Architecture for interactive data science
Glue Apache Spark environment
Interpreter server
Remote interpreter
Deploy to production
Push scripts to Amazon S3
Register as job
25.
Serverless execution
Auto-configure VPC & role-based access: security & isolation preserved
Customers can specify capacity (DPU)
Automatically scale resources
Only pay for the resources you consume: per-second billing (10-minute min.)
No need to provision, configure, or manage servers
Diagram: compute instances attached to customer VPCs
27.
Under the hood: Apache Spark and AWS Glue libraries
Apache Spark is a distributed data-processing engine for complex analytics
AWS Glue builds on Apache Spark to offer ETL-specific functionality
Stack diagram, bottom to top: Apache Spark Core (RDDs); Apache Spark DataFrames and AWS Glue DynamicFrames; SparkSQL and AWS Glue ETL
28.
DataFrames and DynamicFrames
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema up front
Each row has same structure
Suited for SQL-like analytics
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data (e.g., JSON, Avro, Apache logs)
29.
Public GitHub timeline
40+ event types
Semi-structured
Payload structure and size vary by event type
30.
DynamicFrame internals
Schema per-record, no up-front schema needed
Easy to restructure, tag, modify
Can be more compact than DataFrame rows
Many flows can be done in single pass
Dynamic records (each tagged with its own schema: id, type):
{"id":"2489", "type": "CreateEvent", "payload": {"creator":…}, …}
{"id":4391, "type": "PullEvent", "payload": {"assets":…}, …}
{"id":"6510", "type": "PushEvent", "payload": {"pusher":…}, …}
DynamicFrame schema: id, type
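The three records above differ in more than payload: `id` is a string in two of them and a number in the third. A rough way to see why a per-record schema helps is to collect the types observed for each field before committing to a single table schema. The sketch below does this in plain Python. It is not the DynamicFrame implementation, and the payload values are made-up stand-ins because the slide elides them.

```python
import json
from typing import Dict, Iterable, Set


def infer_field_types(records: Iterable[dict]) -> Dict[str, Set[str]]:
    """Collect, per top-level field, every Python type name seen across records."""
    seen: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return seen


# Simplified stand-ins for the three events on the slide (payloads are invented).
raw_lines = [
    '{"id": "2489", "type": "CreateEvent", "payload": {"creator": "a"}}',
    '{"id": 4391, "type": "PullEvent", "payload": {"assets": []}}',
    '{"id": "6510", "type": "PushEvent", "payload": {"pusher": "b"}}',
]
records = [json.loads(line) for line in raw_lines]
field_types = infer_field_types(records)

# Fields seen with more than one type must be resolved before a fixed-schema write.
ambiguous = {name for name, types in field_types.items() if len(types) > 1}
print(sorted(ambiguous))  # → ['id']
```

In AWS Glue, conflicts of this kind are what the DynamicFrame `resolveChoice` transform is meant to settle.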
31.
Glue Parquet writer
We built a custom Parquet writer to provide schema flexibility
Standard Parquet writer:
Set schema -> write row group(s)
Glue Parquet writer:
1. Start writing columns, adding fields as necessary
2. Close first row group and write schema
Additional schema changes trigger new file
File layout (diagram): Row group 1 (Column 1, Column 2, …), Row group 2 (Column 1, Column 2, …), then row group metadata, including schema
32.
DynamicFrame performance
Configuration: 10 DPUs, Apache Spark 2.2
Workload: JSON to Parquet, filter for Fork events
[Chart: Conversion to Parquet, DynamicFrame vs. DataFrame. Y-axis: time (sec.), 0 to 2000, lower is better. X-axis: data size (# files): Day (24), Month (744), Year (8699).]
DynamicFrame w/ custom Parquet
SQL GroupBy query
Parquet output    Time (sec.)
DynamicFrame      78
DataFrame         195
34.
AWS Glue execution model
Apache Spark and AWS Glue are data parallel.
Data is divided into partitions (shards) that are processed concurrently.
Jobs are divided into stages
1 stage x 1 partition = 1 task
Driver schedules tasks on executors
2 executors per DPU
Overall throughput is limited by the number of partitions (shards)
Diagram: driver and executors
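The arithmetic on this slide can be sketched as follows. One task per partition and two executors per DPU come from the slide. Running one task at a time per executor is an assumption made here for simplicity, and the sketch ignores any capacity reserved for the driver.

```python
import math


def plan_stage(partitions: int, dpus: int, executors_per_dpu: int = 2,
               tasks_per_executor: int = 1) -> dict:
    """Estimate tasks and scheduling waves for one stage."""
    executors = dpus * executors_per_dpu
    slots = executors * tasks_per_executor  # tasks_per_executor is an assumption
    tasks = partitions  # 1 stage x 1 partition = 1 task
    return {
        "tasks": tasks,
        "executors": executors,
        "busy_slots": min(tasks, slots),
        "waves": math.ceil(tasks / slots) if tasks else 0,
    }


print(plan_stage(partitions=24, dpus=10))
print(plan_stage(partitions=8, dpus=10))
```

With 8 partitions on 10 DPUs only 8 of the 20 executors have work, which is the sense in which throughput is limited by the number of partitions.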
35.
AWS Glue job metrics
Metrics can be enabled in the CLI/SDK by passing --enable-metrics as a job parameter key.
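A minimal sketch of passing that key from the Python SDK, assuming boto3 is installed and AWS credentials are configured. The empty-string value, the job name, and the `--target_path` argument are illustrative assumptions; the slide only states that the key itself is what enables metrics.

```python
from typing import Dict, Optional


def build_job_arguments(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Job-run arguments with metrics enabled; the key alone turns metrics on."""
    arguments = {"--enable-metrics": ""}
    arguments.update(extra or {})
    return arguments


def start_run_with_metrics(job_name: str) -> str:
    """Start a job run with metrics enabled (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(JobName=job_name, Arguments=build_job_arguments())
    return response["JobRunId"]


# Hypothetical extra argument, for illustration only.
print(build_job_arguments({"--target_path": "s3://example-bucket/out/"}))
```

Calling `start_run_with_metrics("my-etl-job")` (a hypothetical job name) would start the run; only the argument-building helper is exercised here.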
36.
Profile jobs using Glue metrics
Derived from the underlying Apache Spark metrics
Driver and per executor
Aggregates and instantaneous
Reports to Amazon CloudWatch metrics every 30 sec.
Metrics: Memory usage, bytes read and written, CPU load, bytes shuffled, needed executors, and more
37.
Example: Profiling memory usage
overwhelms
38.
Example: AWS Glue small-file handling
Driver memory remains below 50% for the entire duration of execution
DynamicFrames automatically group files into fewer tasks
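The idea of grouping many small files into fewer tasks can be illustrated with a simple greedy packer. This is only a sketch of the general technique, since the talk does not describe Glue's actual grouping logic. The file names, sizes, and the 64 MB target are hypothetical.

```python
from typing import List, Tuple


def group_files(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (name, size) pairs into groups of roughly target_bytes each."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current group once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1,000 hypothetical 1 MB files packed into groups of at most 64 MB.
small_files = [(f"part-{i:04d}.json", 1_000_000) for i in range(1000)]
groups = group_files(small_files, target_bytes=64_000_000)
print(len(small_files), "files ->", len(groups), "tasks")  # → 1000 files -> 16 tasks
```

Fewer, larger tasks mean the driver tracks far less per-task state, which is consistent with the lower driver memory shown on the slide.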
39.
New worker types
Worker maps to 1 DPU
Standard – 2 executors/worker: 16 GB
More memory per executor
G.1X – 1 executor/worker: 16 GB
G.2X – 1 executor/worker: 32 GB
41.
Python shell job type
A cost-effective primitive for small to medium tasks
Python shell
SQL-based analytics
Medium-size ML tasks
42.
AWS Glue Python shell specs
Python 2.7 environment with boto3, awscli, numpy, scipy, pandas, scikit-learn, PyGreSQL, and so on
Cold spin-up: < 20 sec., no runtime limit
Network addressable, support for VPCs, 10 GB local storage
Sizes: 1 DPU (includes 16 GB), and 1/16 DPU (includes 1 GB)
Pricing: $0.44 per DPU-hour, 1-min. minimum, per-second billing
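The pricing rule (per-second billing with a minimum billed duration) can be worked through with a small helper. This is a sketch based on the numbers on the slides. Applying the same $0.44 rate to an Apache Spark job with the 10-minute minimum is an assumption, as the talk states the rate only for Python shell.

```python
import math


def job_cost(runtime_sec: float, dpus: float, price_per_dpu_hour: float = 0.44,
             minimum_sec: int = 60) -> float:
    """Cost in dollars with per-second billing and a minimum billed duration."""
    billed_sec = max(math.ceil(runtime_sec), minimum_sec)
    return billed_sec / 3600 * dpus * price_per_dpu_hour


# A 20-second Python shell run on 1 DPU is billed as 60 seconds.
print(round(job_cost(20, dpus=1), 4))
# A 30-minute run on the 1/16 DPU size.
print(round(job_cost(1800, dpus=1 / 16), 4))
# Assumed Spark case: a 5-minute run on 10 DPUs, billed at the 10-minute minimum.
print(round(job_cost(300, dpus=10, minimum_sec=600), 4))
```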
43.
Python shell collaborative filtering example
Amazon customer reviews dataset (s3://amazon-reviews-pds), Video category
Compute low-rank approx. of (Customer x Product) ratings using SVD
Uses scipy sparse matrix and SVD library
Step                    Time (sec.)
Amazon Redshift COPY    13
Extract ratings         5
Generate matrix         1552
SVD (k=1000)            2575
Total                   4145
1 DPU
matrix: 217K x 384K
SVD: rank = 1000
runtime: 69 min.
estimated cost: $0.60
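For readers unfamiliar with low-rank approximation, here is a toy sketch of the underlying idea. It uses plain power iteration in pure Python to extract only the top singular triplet (rank 1) of a sparse ratings matrix. It stands in for, and is much simpler than, the scipy sparse SVD with k=1000 used in the talk. The tiny ratings matrix is made up and must be non-empty.

```python
import math
from typing import Dict, List, Tuple

Ratings = Dict[Tuple[int, int], float]  # (customer index, product index) -> rating


def top_singular_triplet(ratings: Ratings, n_rows: int, n_cols: int,
                         iterations: int = 100) -> Tuple[float, List[float], List[float]]:
    """Power iteration for the largest singular value and its two vectors."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    sigma, u = 0.0, [0.0] * n_rows
    for _ in range(iterations):
        # u = A v, touching only the stored (non-zero) entries
        u = [0.0] * n_rows
        for (i, j), r in ratings.items():
            u[i] += r * v[j]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v = A^T u; its norm converges to the largest singular value
        v = [0.0] * n_cols
        for (i, j), r in ratings.items():
            v[j] += r * u[i]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


# Made-up 2 x 2 ratings matrix [[3, 4], [6, 8]], which is exactly rank 1.
toy: Ratings = {(0, 0): 3.0, (0, 1): 4.0, (1, 0): 6.0, (1, 1): 8.0}
sigma, u, v = top_singular_triplet(toy, n_rows=2, n_cols=2)
approx = [[sigma * u[i] * v[j] for j in range(2)] for i in range(2)]
print(round(sigma, 4), [[round(x, 4) for x in row] for row in approx])
```

Because the toy matrix has rank 1, the rank-1 approximation reproduces it; on real ratings data the approximation keeps only the dominant customer-product pattern.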
45.
Orchestration
Marketing: Ad spend by customer segment
Event based
AWS Lambda trigger
Sales: Revenue by customer segment
Schedule
Central: ROI by customer segment
Weekly sales
Compose jobs globally with event-based dependencies
46.
Orchestration building blocks
Entities: Crawlers, Jobs, Triggers
Dependencies: Schedule, External, Events
Control: Conditions, Timeout, Retries
47.
Example event-driven workflow
New raw data arrives -> Crawl raw dataset -> Run “optimize” job -> Crawl optimized dataset -> Ready for reporting
SLA deadline
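The chain above can be simulated in a few lines to show how event-based dependencies behave: each step fires only when the step before it has completed. This is an in-memory sketch, not the AWS Glue trigger API. The step names are hypothetical labels taken from the boxes on the slide, and the SLA deadline is not modeled.

```python
from typing import Callable, Dict, List

# Each trigger fires its action when the named upstream event completes.
WORKFLOW: Dict[str, str] = {
    "new_raw_data": "crawl_raw_dataset",
    "crawl_raw_dataset": "run_optimize_job",
    "run_optimize_job": "crawl_optimized_dataset",
    "crawl_optimized_dataset": "ready_for_reporting",
}


def run_workflow(start_event: str, actions: Dict[str, Callable[[], bool]]) -> List[str]:
    """Follow triggers from start_event; stop at the first action that fails."""
    completed: List[str] = []
    event = start_event
    while event in WORKFLOW:
        step = WORKFLOW[event]
        # Steps without a registered action are treated as succeeding.
        succeeded = actions.get(step, lambda: True)()
        if not succeeded:
            break
        completed.append(step)
        event = step
    return completed


print(run_workflow("new_raw_data", actions={}))
print(run_workflow("new_raw_data", actions={"run_optimize_job": lambda: False}))
```

In the second call the "optimize" step fails, so the downstream crawl and the reporting step never fire.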
48.
Conclusion
AWS Glue supports Apache Spark and Python shell “functions” for data science and analytics
Serverless is “Function-as-a-Service”
End-to-end serverless analytics with Data Catalog, crawlers, notebooks, and orchestration
49.
Thank you!
Mehul A. Shah
glue-pm@amazon.com