AWS Summit Seoul 2023 | Real-Time CDC Data Processing! Building a Modern Transactional Data Lake
1. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
SEOUL | MAY 4, 2023
2.
Real-Time CDC Data Processing!
Building a Modern Transactional Data Lake
AWS
3.
Agenda
• Append-Only
• CDC-based UPSERT
▪ View
▪ Open Table Formats – Apache Iceberg, Hudi, Delta Lake
• Modern Transactional Data Lake Architecture
4.
CRM
IoT
WEB
Messages
CDC*
Event Streams
* CDC: Change Data Capture
RDBMS Data Insights
5.
RDBMS Scalability
RDBMS
(Replica)
RDBMS
(Primary)
Query
Engine
(1)
Storage
Query
Engine
(2)
Query
Engine
(3)
Storage
interface
Scale-Out
Scale-Out
Primary-Replica Cluster
RDBMS
(Primary)
Scale-Up
RDBMS
(Replica)
Scale-Out
Replica
Primary
Distributed File System
RDBMS
6.
DFS*
Stream
Storage
Data Lake
Data
Mart
AI/ML
CRM
IoT
WEB
Messages
CDC
Event Streams
Data Lake
* DFS: Distributed File System
Data Warehouse
Stream
Delivery
7.
CRM
IoT
WEB
Messages
CDC
Event Streams
Data Lake
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Amazon Athena
Amazon S3
Data Lake
Amazon QuickSight
8.
IMMUTABLE Objects
Distributed
CAN NOT Update/Delete In-Place
Insert (Append)-Only
interface (HTTPS, SDK APIs)
Transactional (X)
MUTABLE Records
Files per tables
Update/Delete In-Place
Insert/Update/Delete
table1
table2
table3
RDBMS
Transactional (O)
RDBMS vs. S3 (≈ Distributed Object Storage)
File
System
File
System
File
System
Amazon S3
9.
RDBMS
CDC
CDC Update/Delete ?
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Amazon Athena
Amazon S3
AWS DMS
datalake/
year=2023/month=05/day=03/hour=01/
obj1.parquet
obj2.parquet
…
year=2023/month=05/day=03/hour=02/
updated-obj1.parquet
…
Data Lake
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
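The hourly partition layout on this slide can be sketched as follows. This is a minimal illustration, assuming each change record carries a UTC timestamp in epoch seconds; the function name `partition_prefix` and the sample times are hypothetical. Because S3 objects are immutable, an update to `pk1` lands as a new object in a later partition rather than modifying the object that holds the original insert.

```python
from datetime import datetime, timezone


def partition_prefix(epoch_seconds: int, root: str = "datalake") -> str:
    """Build the Hive-style hourly prefix a change record would be written under."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


# Hypothetical change times: the insert at 01:10 UTC, the update one hour later.
insert_ts = int(datetime(2023, 5, 3, 1, 10, tzinfo=timezone.utc).timestamp())
update_ts = int(datetime(2023, 5, 3, 2, 10, tzinfo=timezone.utc).timestamp())

print(partition_prefix(insert_ts))
print(partition_prefix(update_ts))
```

Both versions of the row now exist in the lake, so a plain scan returns the stale insert as well as the update.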
10.
View UPSERT : Merge-On-Read
RDBMS
Updated/
Deleted
Data
Inserted Data
View Table
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
I, pk0, c1, c2, t0
D, pk0, c1, c2, t3
I, pk0, c1, c2, t0
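The merge-on-read idea behind this view can be sketched in a few lines. This is an illustration only, assuming every change record carries an operation code, a primary key and a comparable timestamp as shown on the slide; the `Change` type, the function name and the column values are hypothetical.

```python
from typing import NamedTuple


class Change(NamedTuple):
    op: str   # "I" insert, "U" update, "D" delete
    pk: str
    c1: str
    c2: str
    ts: int   # change timestamp; larger means later


def merge_on_read(changes: list[Change]) -> dict[str, Change]:
    """Return the current rows: keep the latest change per key, then drop deletes."""
    latest: dict[str, Change] = {}
    for change in sorted(changes, key=lambda c: c.ts):
        latest[change.pk] = change
    return {pk: c for pk, c in latest.items() if c.op != "D"}


log = [
    Change("I", "pk0", "a", "b", 0),
    Change("I", "pk1", "a", "b", 1),
    Change("U", "pk1", "a2", "b2", 2),
    Change("D", "pk0", "a", "b", 3),
]
view = merge_on_read(log)
print(view)
```

Only `pk1` survives, in its updated form: `pk0` was deleted at `t3`, and the original insert of `pk1` is superseded by the update at `t2`.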
11.
View UPSERT : Merge-On-Read
RDBMS
Updated/Deleted
Data
Inserted Data
View Table
Amazon S3
Amazon Athena
Amazon Redshift
Logical View
Materialized
View
CDC
12.
Logical View vs. Materialized View
CREATE VIEW view_tbl AS
SELECT *
FROM org_tbl, delta_tbl
SELECT *
FROM view_tbl
SELECT *
FROM (
SELECT *
FROM org_tbl, delta_tbl
)
SELECT *
FROM view_tbl
Materialized View
Logical View
org_tbl
Amazon S3
view_tbl
+
delta_tbl
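The difference between the two kinds of view can be tried out locally. The sketch below uses SQLite through Python's `sqlite3` module, under two stated assumptions: the slide's `FROM org_tbl, delta_tbl` is read as shorthand for combining the two tables (written here as `UNION ALL`), and since SQLite has no materialized views, a table created with `CREATE TABLE ... AS SELECT` stands in for one. The name `mview_tbl` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE org_tbl   (pk TEXT, c1 TEXT, ts INTEGER);
    CREATE TABLE delta_tbl (pk TEXT, c1 TEXT, ts INTEGER);
    INSERT INTO org_tbl   VALUES ('pk1', 'v1', 1);
    INSERT INTO delta_tbl VALUES ('pk1', 'v2', 2);

    -- Logical view: only the query text is stored; it runs again on every read.
    CREATE VIEW view_tbl AS
        SELECT * FROM org_tbl UNION ALL SELECT * FROM delta_tbl;

    -- Stand-in for a materialized view: the query result itself is stored.
    CREATE TABLE mview_tbl AS SELECT * FROM view_tbl;
    """
)

# A new change arrives after the stored result was computed.
conn.execute("INSERT INTO delta_tbl VALUES ('pk2', 'v1', 3)")

logical_rows = conn.execute("SELECT COUNT(*) FROM view_tbl").fetchone()[0]
stored_rows = conn.execute("SELECT COUNT(*) FROM mview_tbl").fetchone()[0]
print(logical_rows, stored_rows)
```

The logical view sees the new row immediately because it re-reads both tables, while the stored result stays at its old contents until it is refreshed.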
13.
Amazon Redshift Materialized Views
14.
Amazon Kinesis
Data Streams
Amazon Redshift / Redshift Serverless
Permanent
Tables
Real-time
Materialized
View
Streaming
Table
…
…
Amazon
QuickSight
Amazon MSK
Amazon Redshift Streaming Ingestion
MATERIALIZED VIEW
Auto Refresh
Data Source
15.
t1 t2
Inserted Data
(t1)
Amazon S3
Inserted Data
(t2)
+
+ a b c d e f
Merge & Compaction
time
Data Size
Updated/
Deleted Data
(t1)
Updated/
Deleted Data
(t2)
16.
year=2022/month=01/day=01/hour=00/
p1.parquet
p2.parquet
year=2022/month=02/day=01/hour=00/
...
year=2022/month=12/day=01/hour=00/
...
year=2023/month=01/day=02/hour=00/
p1.parquet
p2.parquet
year=2023/month=01/day=02/hour=01/
p1.parquet
p2.parquet
S3 Glacier
Deep
Archive
S3
Standard
Logical View
Update/
Delete
View
Merge-On-Read
17.
Logical View
• – Read ,
•
• = Merge & Compaction +
•
18.
Real-time
Materialized View
org_tbl
delta_tbl
Auto Refresh
Streaming
Table
Permanent
Table
Materialized View
Amazon Redshift
Data
Volume
Data
Volume
Data
Volume
t1
tN time
t2
Data Size Unlimited Data Volume
.....
19.
Real-time
Materialized View
org_tbl
delta_tbl
Auto Refresh
Table
data files commit log
Merge-On-Read
Streaming
Table
Permanent
Table
Amazon S3
Materialized View S3 ?
Amazon Redshift
20.
Table
data files commit log
Merge-On-Read
Amazon S3
“Table Format” = Layout of Files in Table
commit_log
date=2023-01-01
21.
Amazon S3 RDBMS
RDBMS
Index
Field1
(v1, t1)
Files
binlog
Read
Field1
(v2, t2)
my_table/
date=2023-01-01/
file-1.parquet
......
file-2.parquet
......
commit_log/
00000.json
00001.json
......
Amazon S3
Write
t1 t2 time
Table
data files
Merge-On-Read
commit log
Insert file-1.parquet
Insert file-2.parquet
Delete file-1.parquet
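How a reader uses such a commit log can be sketched as a simple replay. This is a simplified model rather than the format of any particular table format, assuming each log entry is an ordered `(action, file)` pair like the three entries on the slide; the function name `live_files` is hypothetical.

```python
def live_files(commit_log: list[tuple[str, str]]) -> set[str]:
    """Replay (action, file) entries in order and return the files a reader should scan."""
    files: set[str] = set()
    for action, name in commit_log:
        if action == "insert":
            files.add(name)
        elif action == "delete":
            files.discard(name)
        else:
            raise ValueError(f"unknown action: {action}")
    return files


commit_log = [
    ("insert", "file-1.parquet"),
    ("insert", "file-2.parquet"),
    ("delete", "file-1.parquet"),
]
print(live_files(commit_log))
```

No object is rewritten in place: `file-1.parquet` may still exist in S3, but the log tells readers to ignore it, which is how immutable storage can present mutable table semantics.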
22.
“Table Format” = Layout of Files in Table
OPEN TABLE FORMATS
23.
Apache Hudi
© hudi.apache.org
24.
Apache Hudi
© hudi.apache.org
25.
Apache Iceberg
s0
Data
Snapshots
t0 t1
Partition
File
Location
Schema
Format
Stats
Write & Commit
time
Snapshots: State of table at some time
s1
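The snapshot idea ("state of table at some time") can be modelled in a few lines. This is a conceptual sketch only: real Iceberg tracks snapshots through metadata files, manifest lists and manifests, and none of the names below (`Snapshot`, `commit`, `as_of`) are Iceberg APIs. It assumes each commit produces a new immutable snapshot and leaves earlier ones untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    committed_at: int           # commit time, e.g. t0, t1
    data_files: frozenset[str]  # complete set of files visible in this snapshot


def commit(history: list[Snapshot], snapshot_id: str, committed_at: int,
           added: set[str], removed: set[str]) -> list[Snapshot]:
    """Append a new snapshot derived from the latest one; older snapshots stay untouched."""
    current = history[-1].data_files if history else frozenset()
    new_files = (current - removed) | added
    return history + [Snapshot(snapshot_id, committed_at, frozenset(new_files))]


def as_of(history: list[Snapshot], ts: int) -> Snapshot:
    """Time travel: return the most recent snapshot committed at or before ts."""
    candidates = [s for s in history if s.committed_at <= ts]
    if not candidates:
        raise LookupError("no snapshot exists at that time")
    return max(candidates, key=lambda s: s.committed_at)


history: list[Snapshot] = []
history = commit(history, "s0", 0, added={"file-1.parquet"}, removed=set())
history = commit(history, "s1", 1, added={"file-2.parquet"}, removed={"file-1.parquet"})
print(as_of(history, 0).data_files, as_of(history, 1).data_files)
```

Reading as of `t0` still returns the files of `s0` even after `s1` has been committed, which is the basis for time travel and for readers that are never disturbed by concurrent writers.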
26.
Apache Iceberg
METADATA FILES TO TRACK DATA
schema, partitions, snapshots
list of files and mappings to snapshots
tracks data files and statistics
© iceberg.apache.org
27.
Apache Iceberg
METADATA FILES TO TRACK DATA
my_table/
├── metadata/
│ ├── 00000.metadata.json
│ ├── 00001.metadata.json
│ ├── 00002.metadata.json
│ .......
│ ├── a39f-e190-b871-ac8e5b-m0.avro
│ ├── a39f-e190-b871-ac8e5b-m1.avro
│ ├── a39f-e190-b871-ac8e5b-m2.avro
│ .......
│ ├── snap-1954-1-2e934.avro
│ ├── snap-4381-1-255b.avro
│ ├── snap-4866-1-8bf57.avro
└── data/
├── date=2023-01-01
│ └── file-1.parquet
└── date=2023-01-02
└── file-2.parquet
© iceberg.apache.org
28.
Delta Lake
my_table/
├── _delta_log
│ ├── 00000.json
│ ├── 00001.json
│ ├── 00002.json
│ .......
│ ├── 00010.json
│ └── 00010.checkpoint.parquet
├── date=2023-01-01
│ └── file-1.parquet
└── date=2023-01-02
└── file-2.parquet
Transaction Log
Single commits
Checkpoint Files
(Optional) Partition Directories
Data Files
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
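The role of the checkpoint file can be illustrated with a small replay model. This is a simplified sketch, not Delta Lake's actual log schema: it assumes each of the five actions on the slide is its own commit (versions 0 to 4) and that a checkpoint stores the complete live file set as of one version, so a reader can start from it instead of replaying every earlier commit. All names in the code are hypothetical.

```python
from typing import Optional

Action = tuple[str, str]  # ("add" or "remove", file name)


def apply_commit(state: set[str], actions: list[Action]) -> set[str]:
    """Apply one commit's add/remove actions to a copy of the table state."""
    result = set(state)
    for action, name in actions:
        if action == "add":
            result.add(name)
        elif action == "remove":
            result.discard(name)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


def read_table(commits: dict[int, list[Action]],
               checkpoint: Optional[tuple[int, set[str]]] = None) -> set[str]:
    """Rebuild the live file set, starting from a checkpoint when one is available."""
    start_version, state = checkpoint if checkpoint is not None else (-1, set())
    for version in sorted(v for v in commits if v > start_version):
        state = apply_commit(state, commits[version])
    return state


commits = {
    0: [("add", "1.parquet")],
    1: [("add", "2.parquet")],
    2: [("remove", "1.parquet")],
    3: [("remove", "2.parquet")],
    4: [("add", "3.parquet")],
}
checkpoint = (2, {"2.parquet"})  # table state after versions 0..2

print(read_table(commits), read_table(commits, checkpoint))
```

Both reads arrive at the same state; the checkpointed read only has to process the two commits after version 2.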
29.
Open Table Formats – Iceberg, Hudi, Delta Lake
Feature | Apache Iceberg | Hudi | Delta Lake
ACID | Yes | Yes | Yes
Partition Evolution | Yes | No | No
Schema Evolution | Yes | Partial | Limited
Time Travel | Yes | Yes | Yes
Merge | Yes | Yes | Yes
Compaction | API based | Manual | Automated
Data Format | Parquet, Avro, ORC, CSV | Parquet, ORC | Parquet
Current Pointer | Metastore, File system with version File | Timeline commit | Transaction log
Conflict Resolution | Optimistic | Optimistic | Optimistic
Programming Language | Java & Python | Scala, Java & Python | Java & Python
30.
Modern Transactional Data Lake
31.
Typical Data Pipeline & Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS
Payments
• : Insert
• : Update
• : Delete
• :
Append Only
Amazon Kinesis
Data Firehose
Data Source Data Pipeline Data Lake
User Profile
32.
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS Amazon Kinesis
Data Firehose
S3
User Profile iceberg
Payments
parquet, orc, avro
iceberg, hudi, delta lake
Athena | Hudi | Iceberg | Delta Lake
Insert | X | O | X
Delete | X | O | X
Select | O | O | O
(O: supported, X: not supported)
33.
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS
S3
User Profile iceberg
Payments
parquet, orc, avro
iceberg, hudi, delta lake
Athena | Hudi | Iceberg | Delta Lake
Insert | X | O | X
Delete | X | O | X
Select | O | O | O
(O: supported, X: not supported)
AWS Glue
Flink /
Spark
Amazon EMR
Open Source
Serverless Fully Managed
34.
CDC-based UPSERT Data Lake
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS AWS Glue
Streaming
Operation
Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3
CDC
{ JSON }
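Turning one of these JSON messages into the `I`/`U`/`D` form shown above might look like the sketch below. It assumes the envelope that AWS DMS documents for Kinesis targets, with a `data` object holding the row and a `metadata` object holding `record-type`, `operation` and `timestamp`; check the field names against your own stream before relying on them. The function name and the sample schema, table and column values are hypothetical.

```python
import json

# Map the operation names in the message metadata to the slide's one-letter codes.
OP_CODES = {"load": "I", "insert": "I", "update": "U", "delete": "D"}


def parse_change(raw: str):
    """Turn one JSON message into (op, row, timestamp), or None for non-data records."""
    message = json.loads(raw)
    meta = message.get("metadata", {})
    if meta.get("record-type") != "data":
        return None  # e.g. control records, which carry no row data
    op = OP_CODES.get(meta.get("operation"))
    if op is None:
        return None
    return op, message.get("data", {}), meta.get("timestamp")


sample = json.dumps({
    "data": {"pk": "pk1", "c1": "a", "c2": "b"},
    "metadata": {
        "timestamp": "2023-05-03T01:10:00.000000Z",
        "record-type": "data",
        "operation": "update",
        "schema-name": "demo",         # hypothetical schema name
        "table-name": "user_profile",  # hypothetical table name
    },
})
print(parse_change(sample))
```

Full-load rows are treated as inserts here; that is a design choice for the sketch, made so that the initial copy and ongoing changes flow through the same path.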
35.
Transactional Data Lake
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Amazon RDS
AWS DMS Amazon Kinesis
Data Streams
Amazon Athena
Amazon S3
Amazon RDS Amazon Kinesis
Data Firehose
{JSON}
{JSON}
36.
Demo
37.
Reference Architecture
https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue
38.
Spark + Glue Context
Kinesis Data Streams
Apache Iceberg
Insert/Update/Delete
1
2
3
Glue Streaming Job Code
39.
Glue Streaming Job Code
40.
5
Glue Streaming
Upsert
Delete
1
2
3
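The job code itself is not reproduced in this transcript, so the following is only a sketch of how an upsert-and-delete step against an Iceberg table could be expressed, not the code from the demo. It builds a Spark SQL `MERGE INTO` statement, which Iceberg supports when its Spark SQL extensions are enabled. The table name, view name, column names and the `op` column are all hypothetical, and the sketch assumes the micro-batch has already been reduced to the latest change per key.

```python
def build_merge_sql(target: str, source_view: str, key: str,
                    columns: list[str], op_column: str = "op") -> str:
    """Build a MERGE statement that applies I/U/D change rows to the target table."""
    assignments = ", ".join(f"t.{c} = s.{c}" for c in columns)
    col_list = ", ".join(columns)
    val_list = ", ".join(f"s.{c}" for c in columns)
    return f"""
MERGE INTO {target} AS t
USING {source_view} AS s
ON t.{key} = s.{key}
WHEN MATCHED AND s.{op_column} = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET {assignments}
WHEN NOT MATCHED AND s.{op_column} <> 'D' THEN INSERT ({col_list}) VALUES ({val_list})
""".strip()


# Hypothetical names; in a streaming job this string would be passed to spark.sql(...)
# once per micro-batch, after registering the batch as the temporary view "cdc_batch".
merge_sql = build_merge_sql(
    target="glue_catalog.demo_db.user_profile",
    source_view="cdc_batch",
    key="pk",
    columns=["pk", "c1", "c2"],
)
print(merge_sql)
```

Putting the delete clause first matters: clauses are evaluated in order, so a matched row whose latest change is `D` is removed instead of being updated, and a delete for a key that never reached the table is simply skipped.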
41.
Summary
42.
“Table Format” = Layout of Files in Table
OPEN TABLE FORMATS
Amazon S3
Update/Delete In-Place
table1
table2
table3
RDBMS
Transactional
Data Lake RDBMS
43.
Transactional Data Lake:
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
ETL
Amazon Athena
Amazon S3
Amazon RDS
(Apache Iceberg,
Hudi, Delta Lake)
Amazon S3
Amazon Kinesis
Data Firehose
Raw Zone Curated Zone
44.
Transactional Data Lake: +
LAMBDA ARCHITECTURE
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
ETL
Amazon Athena
Amazon S3
Amazon RDS
Amazon Redshift / Redshift Serverless
Real-Time
Materialized
View
Streaming
Table
Permanent
Tables
(Apache Iceberg,
Hudi, Delta Lake)
Amazon S3
Amazon Kinesis
Data Firehose
Raw Zone Curated Zone
Batch Layer
Speed Layer
45.
Transactional Data Lake:
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Amazon RDS
(Apache Iceberg,
Hudi, Delta Lake)
Amazon Redshift / Redshift Serverless
Real-Time
Materialized
View
Streaming
Table
Permanent
Tables
46.
On-Premise Transactional Data Lake
Generic
database
Corporate
data center
Long Time-to-build High Cost in TCO
Deep Expertise
Required
Security
HDFS
Kafka
Connect
Connect
Hive /
Presto
Flink /
Spark
Streaming
47.
Generic
database
AWS DMS Amazon Kinesis
Data Streams
AWS Glue
Streaming
Amazon Athena
Amazon S3
Corporate
data center
AWS Cloud
Streaming Migrations for Analytics on
Generic
database
Corporate
data center
HDFS
Hive /
Presto
Kafka
Connect
Connect
(Apache Iceberg,
Hudi, Delta Lake)
(Apache Iceberg,
Hudi, Delta Lake)
Flink /
Spark
S
48.
Data Lake
49.
Thank you