1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
HC Lo, Solutions Architect
Data Catalog & ETL - Glue & Athena
September 12, 2019
2.
What is AWS Glue Data Catalog?
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3…with support for more coming!
• Get a single view into your data, no matter where it is stored
• Automatically classify your data in one central list that is searchable
• Track data evolution using schema versioning
• Query your data using Amazon Athena or Amazon Redshift Spectrum
• Hive metastore compatible; can be used as an external Hive Metastore for
applications running on Amazon EMR
3.
What is a Data Lake?
Architectural pattern enabling:
• Ubiquitous storage at any scale
• Consolidated data processing
• Collaboration on and analysis of data in different ways, leading to better, faster decision making
4.
Most comprehensive
Broadest and deepest portfolio, purpose-built for builders
• Data Movement: Migration & Streaming Services
• Data Lake Infrastructure & Management: Infrastructure, Data Catalog & ETL, Security & Management
• Analytics: Data Warehousing, Big Data Processing, Interactive Query, Operational Analytics, Real-time Analytics, Serverless Data Processing
• Visualization, Engagement, & Machine Learning: Dashboards, Predictive Analytics, Digital User Engagement
5.
Most comprehensive
Broadest and deepest portfolio, purpose-built for builders
• Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Kafka
• Data Lake Infrastructure & Management: S3/Glacier, Glue, Lake Formation
• Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, Glue (Spark & Python)
• Visualization, Engagement, & Machine Learning: QuickSight, Pinpoint, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Transcribe, + 11 more
6.
Popular Customer Use Cases
7.
Data Lake on AWS
[Diagram]
• Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data
• Components: AWS Glue ETL, Amazon QuickSight, Amazon SageMaker
8.
Log Aggregation
[Diagram]
• Sources: AWS service logs, web application logs, server logs
• Components: S3, Lambda, S3, Glue Data Catalog, Athena
• Labeled steps: new file trigger; create partition on S3; copy to new partition; update table partition; query data
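As a rough illustration of the Lambda step on the Log Aggregation slide, the sketch below turns an S3 "new file" notification into the Athena DDL that would register the matching table partition. The key layout `logs/<source>/<yyyy>/<mm>/<dd>/<file>`, the table name `logs`, and both function names are assumptions made for this example, not something the slide specifies. Submitting the statement (for instance through the Athena API) and copying the object into the partition prefix are left out.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical key layout: logs/<source>/<yyyy>/<mm>/<dd>/<file>
KEY_PATTERN = re.compile(
    r"^logs/(?P<source>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/[^/]+$"
)


def partition_ddl(bucket, key, table="logs"):
    """Build an ADD PARTITION statement for one object key, or None if it does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    p = match.groupdict()
    spec = (
        f"source='{p['source']}', year='{p['year']}', "
        f"month='{p['month']}', day='{p['day']}'"
    )
    location = f"s3://{bucket}/logs/{p['source']}/{p['year']}/{p['month']}/{p['day']}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


def statements_for_event(event, table="logs"):
    """Collect one statement per distinct partition touched by an S3 notification event."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        ddl = partition_ddl(bucket, key, table)
        if ddl is not None and ddl not in statements:
            statements.append(ddl)
    return statements
```

`ADD IF NOT EXISTS` keeps the statement safe to repeat, which matters because many files usually land in the same partition.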
9.
Log Aggregation with ETL
[Diagram]
• Sources: AWS service logs, web application logs, server logs
• Components: S3, Glue Crawler, Glue ETL, S3, Glue Data Catalog, Athena
• Labeled steps: create partition on S3; update table partition; query data
10.
Real-Time Data Collection
[Diagram]
• Components: Kinesis, S3, Glue ETL, Glue Data Catalog, Athena
• Labeled steps: real-time events; store partitioned in S3; trigger job; update table partition; query data
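To make "store partitioned in S3" on the Real-Time Data Collection slide a little more concrete, here is a minimal sketch that derives a Hive-style partition prefix from an event timestamp and groups events by that prefix. The `events` base prefix, the `timestamp` field (epoch seconds, UTC), and the hourly granularity are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_prefix(epoch_seconds, base="events"):
    """Return a key prefix such as events/year=2019/month=09/day=12/hour=13/."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


def group_by_partition(events, base="events"):
    """Group event dicts by the partition prefix they would be written under."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_prefix(event["timestamp"], base)].append(event)
    return dict(groups)
```

Because the prefix uses `name=value` path segments, partition-discovery tools such as a Glue crawler or `MSCK REPAIR TABLE` can typically map each segment to a partition column.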
11.
Data Export
[Diagram]
• Components: Database Migration Service, S3, Glue ETL, Glue Data Catalog, Athena
• Labeled steps: database migration; exported tables in S3; trigger job; update table partition; query data
12.
SaaS Model
[Diagram]
• Components: S3, Athena, Glue Data Catalog
• Labels: application request; hot data; warm & cold data; query data
13.
Data Science
[Diagram]
• Components: S3, Glue ETL, Athena, SageMaker, EMR, Glue Data Catalog
• Labels: application data; enrichment; feature store
14.
Data Enrichment – Amazon Comprehend
[Diagram, steps numbered 1 to 5]
• Source: Amazon Reviews Dataset
• Components: S3, AWS Glue ETL, Comprehend, S3, Glue Crawler, Glue Data Catalog, Athena, QuickSight
15.
Data Ingest in Parquet Format
[Diagram, steps numbered 1 to 5]
• Components: Amazon Connect, Kinesis Data Streams, Kinesis Data Firehose, S3, AWS Glue Data Catalog, Athena, Redshift Spectrum
• Labels: agent events; Firehose output schema; Parquet
16.
Analytics Reporting
[Diagram]
• Components: Glue Data Catalog, Athena, Redshift Spectrum, EMR, API, QuickSight
17.
Amazon Athena is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL.
18.
Why Amazon Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM for authentication; Encryption at rest & in transit
• Standards-compliant, open storage file formats
• Built on powerful community supported OSS solutions
19.
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
  (e.g., SELECT * FROM tableName)
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
  (e.g., CREATE TABLE, ALTER TABLE, MSCK REPAIR)
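The DDL side mentioned above (complex data types, partitioning, `CREATE TABLE`, `MSCK REPAIR`) can be sketched as a pair of small string builders. The helper names, table name, columns, and bucket are hypothetical, and Parquet is picked only as one of the supported formats; the generated text follows the Hive-style DDL that Athena accepts.

```python
def create_table_ddl(table, columns, partitions, location):
    """Build a CREATE EXTERNAL TABLE statement for a partitioned Parquet table.

    columns and partitions are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {col_type}" for name, col_type in columns)
    parts = ", ".join(f"`{name}` {col_type}" for name, col_type in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


def repair_table_ddl(table):
    """Build the statement that loads Hive-style partitions already present in S3."""
    return f"MSCK REPAIR TABLE {table}"


if __name__ == "__main__":
    print(create_table_ddl(
        "web_logs",
        columns=[
            ("request_id", "string"),
            ("headers", "map<string,string>"),  # complex type: map
            ("tags", "array<string>"),          # complex type: array
        ],
        partitions=[("year", "string"), ("month", "string")],
        location="s3://example-bucket/web_logs/",  # hypothetical bucket
    ))
    print(repair_table_ddl("web_logs"))
```

Once the table exists, ordinary queries such as `SELECT * FROM web_logs` run on the SQL side.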
20.
Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window
functions
• Complex data types (arrays, structs, maps)
• Presto built-in functions
• File Formats: CSV, JSON, RegEx, Parquet, Avro,
ORC, CloudTrail
• Compression: GZIP, Zlib, LZO, Snappy
• Integrated with AWS Glue Data Catalog
21.
A Better Model
Old Methodology
• Analyst asks for a report
• Developer writes code
• Code executes on shared
cluster for several hours
• Analyst reviews report
• Analyst asks for more…
With Amazon Athena
• Analyst creates table
• Analyst iterates
• Generate final report
Simple, Quick and No Infrastructure or Developer to Manage
22.
Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
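A quick worked example of the "data scanned" line item, using the $5 / TB rate from the slide. Treating 1 TB as 1024^4 bytes is an assumption of this sketch, and it deliberately ignores any per-query minimums or rounding, as well as the separate S3 charges.

```python
PRICE_PER_TB_USD = 5.0      # rate quoted on the slide
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def scan_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the data-scanned charge for one query."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # Scanning 100 GB at this rate comes to roughly $0.49.
    print(f"${scan_cost_usd(100 * 1024 ** 3):.2f}")
```

Since the charge depends only on bytes scanned, anything that reduces scanning (partitioning, compression, columnar formats such as Parquet) lowers the bill for the same query.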
23.
Security and Access Control
• Encryption – SSE, SSE-KMS, CSE-KMS
  • Auto-detect source bucket KMS key
  • Destination bucket may use a separate key
• Access Control
  • IAM
  • S3 ACL
  • S3 bucket policies
• Coming… Authorization with Glue Data Catalog
  • Database level
  • Table level
24.
Cost Monitoring
• Billing console provides spend per account
• Athena APIs are logged in CloudTrail
• Combine CloudTrail logs with Athena API data to derive cost per IAM user
• More cost controls to come…
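One way to read the CloudTrail-plus-Athena bullet: CloudTrail tells you which identity started each query, and the Athena API reports how many bytes each query scanned, so joining the two on the query execution ID gives a per-user estimate. The sketch below assumes both inputs have already been fetched and flattened into plain Python structures; the field names `userArn` and `queryExecutionId`, the $5 / TB rate, and the 1024^4-byte terabyte are assumptions of this example.

```python
from collections import defaultdict

BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def cost_per_user(trail_records, scanned_bytes_by_query, price_per_tb=5.0):
    """Sum the estimated scan cost per IAM identity.

    trail_records: dicts with 'userArn' and 'queryExecutionId'
                   (simplified from CloudTrail query-start events).
    scanned_bytes_by_query: query execution ID -> bytes scanned
                   (as reported by the Athena API).
    """
    totals = defaultdict(float)
    for record in trail_records:
        # Queries with no recorded statistics count as zero bytes.
        scanned = scanned_bytes_by_query.get(record["queryExecutionId"], 0)
        totals[record["userArn"]] += scanned / BYTES_PER_TB * price_per_tb
    return dict(totals)
```

For example, a user who ran one 1 TB query and one 0.5 TB query would be attributed $7.50 under these assumptions.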
25.
LAB 2 - Guide
http://bit.ly/2md1R9z
26.
AWS 亞馬遜雲端聚落 (AWS Cloud Community)
Want more?
Add us as a LINE friend now >> Stay on top of the latest AWS news!
Thank you!