© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Greg Brandt, Liyin Tang (Airbnb)
December 2, 2016
Streaming ETL
For Amazon RDS and Amazon DynamoDB
DAT315
What to Expect from the Session
• Database Change Data Capture (CDC)
• Improving ETL to Data Warehouse
SpinalTap (CDC)
Architectural Evolution
From monolithic Rails app
Too many specialized
services/data stores
New Challenges
• Co-processing logic breaks down out of process/transaction context
• Primary tables/indices on many machines, not single RDBMS
• Specialized systems needed for certain use cases (analytics, search,
etc.)
Architectural Tenets
• Build for production
• Plan for the future, build for today
• Prefer existing solutions and patterns that we have
experience with in production
• Services should own their data and not share their
storage
• Mutations to data should be propagated via
standardized events
Change Data Capture (CDC)
Goal: Provide streams of data mutations
• In near real time
• With timeline consistency
To keep all these systems in sync
Option 1: Application-Driven Dual Writes
• Consistency hard
• (2PC/consensus needed)
• Data model easy
• (Schema controlled by application)
• Development easy
• Use a queue (e.g. Kafka, RabbitMQ) in addition to the RDBMS
Option 2: Database Log Mining
• Consistency easy
• (Leverage commit log semantics)
• Parsing/Data model hard
• (Database’s internal commit log)
We Chose Database Log Mining
• Parsing is easier than consensus
• Many libraries/APIs exist to make parsing easy
• Consuming stream of commits gives timeline
consistency by default
Data Ecosystem
Requirements
• Timeline consistency with at-least-once message
delivery
• Easily add new sources to consume (new machines if
necessary)
• Support low latency and high throughput use cases
• High availability with automatic failover
• Heterogeneous data sources (MySQL, Amazon
DynamoDB)
MySQL Commit Log
• Java library for binary log parsing
• https://github.com/shyiko/mysql-binlog-connector-java/
• Emit mutation events
• (Write_rows, Update_rows, Delete_rows)
• Logical clock determined from binlog
file/offset
• (Single-master, Multi-AZ setup)
• Leverage XidEvent for transaction
boundary metadata/checkpointing
• (InnoDB implementation detail)
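The "logical clock determined from binlog file/offset" idea can be sketched as follows. This is a minimal illustration, not the SpinalTap implementation: it assumes binlog file names end in a numeric suffix (as in `mysql-bin.000007`), and the names `BinlogPosition` and `parse_binlog_position` are hypothetical.

```python
import re
from typing import NamedTuple


class BinlogPosition(NamedTuple):
    """Logical clock value; tuples compare field by field, file number first."""
    file_number: int  # numeric suffix of the binlog file name
    offset: int       # byte offset within that file


def parse_binlog_position(filename: str, offset: int) -> BinlogPosition:
    # Assumption: the file name looks like "<basename>.<digits>".
    match = re.fullmatch(r".+\.(\d+)", filename)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {filename!r}")
    return BinlogPosition(int(match.group(1)), offset)


earlier = parse_binlog_position("mysql-bin.000007", 4096)
later_same_file = parse_binlog_position("mysql-bin.000007", 8192)
next_file = parse_binlog_position("mysql-bin.000008", 4)
```

Because a later file always sorts after an earlier one, the pair orders events within a single master without relying on wall-clock time.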
DynamoDB Streams
• Using DynamoDB Streams Kinesis
Adapter
• Guarantees
• Each stream record appears exactly once
in the stream.
• Stream records appear in the same
sequence as the actual modifications to
the item
• Monotonically increasing logical clock
is hard
• Need to incorporate shard id, parent/child
splitting semantics
• SequenceNumber is not global
Abstract Mutation
• Provide monotonically increasing* id
from logical clock
• Source-specific metadata (e.g. MySQL
binlog filename/offset)
• The beforeImage of the row in DB
(possibly null)
• The afterImage of the row in DB
(possibly null)
• Encode this using source-agnostic
format (e.g. Thrift)
• Write this object to message bus (e.g.
Kafka)
{
id: Long,
opCode: [
INSERT,
UPDATE,
DELETE
],
metadata: Map<String, String>,
beforeImage: Record,
afterImage: Record
}
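The record above could be modeled roughly like this. It is a sketch only: the slides name Thrift as the encoding, whereas this uses a plain Python dataclass, and the rule that INSERT carries no beforeImage and DELETE no afterImage is an assumption drawn from the "possibly null" notes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class OpCode(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


# Assumption: a Record is a simple column-name -> value mapping.
Record = Dict[str, object]


@dataclass(frozen=True)
class Mutation:
    id: int                                   # from the logical clock
    op_code: OpCode
    metadata: Dict[str, str] = field(default_factory=dict)
    before_image: Optional[Record] = None     # row before the change
    after_image: Optional[Record] = None      # row after the change

    def __post_init__(self) -> None:
        # Assumed invariants, not stated on the slide.
        if self.op_code is OpCode.INSERT and self.before_image is not None:
            raise ValueError("INSERT should not carry a beforeImage")
        if self.op_code is OpCode.DELETE and self.after_image is not None:
            raise ValueError("DELETE should not carry an afterImage")


update = Mutation(
    id=42,
    op_code=OpCode.UPDATE,
    metadata={"binlog_file": "mysql-bin.000007", "binlog_pos": "8192"},
    before_image={"id": 101, "city": "New York"},
    after_image={"id": 101, "city": "San Francisco"},
)
```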
Clustering/Configuration
• LEADER/STANDBY state model
• Each machine is LEADER for a subset of
sources
• Workload distributed evenly
• Use ZooKeeper-based Apache Helix
framework for cluster management
• http://helix.apache.org/
• Dynamic source configuration changes
• Helix Instance group tags to separate
MySQL/DynamoDB nodes
Fault Tolerance
• Controller handles node failure/elects
new LEADER for sources
• Maintain leader_epoch counter in Helix
ZooKeeper property store
• Prefix generated ids with leader_epoch
for monotonicity
• E.g. (leader_epoch, binlog_file,
binlog_pos)
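One way to read the `(leader_epoch, binlog_file, binlog_pos)` example: if a new LEADER resumes from an earlier checkpoint, its ids must still sort after everything the old LEADER emitted. The sketch below shows that with plain tuples. The `EpochStore` class is a hypothetical in-memory stand-in for the Helix ZooKeeper property store and does not use any Helix API.

```python
from typing import NamedTuple


class MutationId(NamedTuple):
    leader_epoch: int  # bumped on every leader election
    binlog_file: int   # numeric suffix of the binlog file
    binlog_pos: int    # offset within the file


class EpochStore:
    """Hypothetical stand-in for a durable, shared epoch counter."""

    def __init__(self) -> None:
        self._epoch = 0

    def next_epoch(self) -> int:
        self._epoch += 1
        return self._epoch


store = EpochStore()

old_epoch = store.next_epoch()
old_leader_last = MutationId(old_epoch, binlog_file=7, binlog_pos=8192)

# Failover: the new leader restarts from an earlier checkpoint.
new_epoch = store.next_epoch()
new_leader_first = MutationId(new_epoch, binlog_file=7, binlog_pos=4096)
```

Even though the new leader replays an earlier binlog offset, its id compares greater because the epoch is the leading field.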
Pub/Sub
• Produce mutations to Kafka with
durable configuration*
• Async coprocessors consume
messages, produce new streams
• Model streaming library allows
encapsulation of DB table schema
• Service controls both API endpoint and
streaming view of data
• Keep 24 hours of MySQL binlog
• Alert / rewind on failures in this tier
Online Validation
• Download binlog after it is flushed/immutable
• Check for holes/ordering violations by consuming stream from Kafka
• Allows us to maintain low latency with confidence in consistency of stream
• Auto-healing
• Reset binlog position to an earlier one if there are too many failures
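The hole/ordering check could look roughly like the sketch below. It assumes positions are `(file number, offset)` pairs, that `expected` comes from the downloaded immutable binlog, and that `consumed` is what was read back from Kafka. Treating a repeated position as a harmless duplicate is an assumption based on the at-least-once delivery requirement. The function name is hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple

Position = Tuple[int, int]  # (binlog file number, offset)


def validate_stream(
    expected: Sequence[Position], consumed: Iterable[Position]
) -> Tuple[List[Position], List[Position]]:
    """Return (holes, ordering_violations) for a consumed stream."""
    seen: Set[Position] = set()
    violations: List[Position] = []
    high_water: Optional[Position] = None
    for pos in consumed:
        if pos in seen:
            continue  # duplicate delivery, allowed under at-least-once
        if high_water is not None and pos < high_water:
            violations.append(pos)  # new position arrived behind a later one
        else:
            high_water = pos
        seen.add(pos)
    holes = [pos for pos in expected if pos not in seen]
    return holes, violations


expected = [(7, 100), (7, 200), (7, 300), (7, 400)]
consumed = [(7, 100), (7, 100), (7, 300), (7, 200)]
holes, violations = validate_stream(expected, consumed)
```

A non-empty result for either list would be the signal to alert or to rewind the binlog position.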
Production Lessons
• Need schema history store for regions of commit log to support rewind
• E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain
range/schema mapping
• Be careful about table encodings! (latin1, utf8...)
• request.required.acks = all can potentially hit every broker…
• (Group produce requests by broker to avoid hitting too many)
• Per-source produce buffer size
• (Tune for throughput/latency)
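The range/schema mapping mentioned in the first lesson can be pictured as a sorted list of "schema valid from this position" entries. The sketch below is an assumption about how such a store might be queried during a rewind, not a description of the production system. `SchemaHistory` and its methods are hypothetical names, and a schema is reduced to a list of column names.

```python
import bisect
from typing import List, Tuple

Position = Tuple[int, int]  # (binlog file number, offset)


class SchemaHistory:
    """Maps regions of the commit log to the schema in effect there."""

    def __init__(self) -> None:
        self._starts: List[Position] = []
        self._schemas: List[List[str]] = []

    def record_ddl(self, position: Position, columns: List[str]) -> None:
        # Assumption: DDL events are recorded in increasing log order.
        if self._starts and position <= self._starts[-1]:
            raise ValueError("DDL positions must be strictly increasing")
        self._starts.append(position)
        self._schemas.append(list(columns))

    def schema_at(self, position: Position) -> List[str]:
        # Find the last DDL at or before the requested position.
        i = bisect.bisect_right(self._starts, position) - 1
        if i < 0:
            raise KeyError(f"no schema known before {position}")
        return self._schemas[i]


history = SchemaHistory()
history.record_ddl((7, 0), ["id", "city"])
history.record_ddl((7, 5000), ["id", "city", "country"])
```

When the reader rewinds to `(7, 4096)`, `schema_at` returns the two-column schema that was in effect there rather than the current one.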
Data Ecosystem
Streaming DB Exports
Batch Infrastructure
[Diagram labels: Airflow Scheduling; Events Log; DB Mutation; Batch Ingestion; Gold; Silver; Query Engines: Hive/Presto/Spark; RDS; EC2]
Growing Pain
[Diagram labels: Airflow Scheduling; Events Log; DB Mutation; Batch Ingestion; Gold; Silver; Query Engines: Hive/Presto/Spark; RDS; EC2]
Point-in-Time Restore based DB Export
• Pros:
• Simple
• Especially for schema change
• Consistent
• Cons:
• No SLA for RDS PITR restoration time
• No near real time ad hoc query
• No hourly snapshot
• High storage cost
Overviews
Real-Time Ingestion on HBase
[Diagram labels: RDS; SpinalTap; Spark Streaming; HBase; HDFS; snapshot; Real time query; Batch query; Query Engines: Hive/Presto/Spark]
Access Data in HBase
• Unified view on real time data
• Streaming: Spark
• Interactive Query: Presto
• Batch Job: Hive/Spark
[Diagram labels: HBase; HDFS; snapshot]
Snapshot & Reseed
[Diagram labels: HBase; HDFS; Snapshot (HFile Links); Bulk upload (Reseed)]
Onboard New Tables
[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]
Disaster Recovery - Checkpoint
[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]
Disaster Recovery - Rewind
[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]
Disaster Recovery - Reseed
[Diagram labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]
HBase Schema
Key Space Design
• Multiplex all DB tables on Single HBase Table
• Fast point look up based on primary keys
• Efficient sequential scans for one table
• Load balance
HBase Row Keys – Primary Keys
• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2
• Fast point lookup based on primary keys
• Efficient sequential scan for all the keys in same
DB/Table
• Balanced based on hash key
Hash DB_TABLE PK1=v1 PK2=v2
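The primary-key row key layout can be sketched as below. The slides do not specify how fields are serialized, so the NUL separator, the UTF-8 encoding, and the use of the raw 16-byte MD5 digest are all assumptions. The table name `db1.users` and the function names are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

SEP = b"\x00"  # hypothetical field separator, not from the slides


def _key_body(db_table: str, primary_key: Sequence[Tuple[str, object]]) -> bytes:
    # DB_TABLE followed by each "PK=value" pair, in primary-key order.
    parts = [db_table.encode("utf-8")]
    parts += [f"{name}={value}".encode("utf-8") for name, value in primary_key]
    return SEP.join(parts)


def primary_row_key(db_table: str, primary_key: Sequence[Tuple[str, object]]) -> bytes:
    body = _key_body(db_table, primary_key)
    hash_key = hashlib.md5(body).digest()  # spreads keys across the key space
    return hash_key + body


row_key = primary_row_key("db1.users", [("id", 101)])
other_key = primary_row_key("db1.users", [("id", 102)])
```

Because the key is a pure function of the table name and primary key values, a point lookup only needs to recompute it. The hash prefix is what balances load across regions.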
HBase Row Keys – Secondary Keys
• Hash Key = md5(DB_TABLE, Index_1=v1)
• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1
• Prefix scan for a given secondary index
Hash DB_TABLE Index=v1 PK1=vpk1
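For secondary keys the hash covers only the table and the indexed value, so every row with the same indexed value shares one prefix. The sketch below illustrates that property. As before, the NUL separator, the encoding, and all names (`db1.users`, `secondary_scan_prefix`, `secondary_row_key`) are assumptions made for illustration.

```python
import hashlib
from typing import Sequence, Tuple

SEP = b"\x00"  # hypothetical field separator, not from the slides


def _field(name: str, value: object) -> bytes:
    return f"{name}={value}".encode("utf-8")


def secondary_scan_prefix(db_table: str, index: Tuple[str, object]) -> bytes:
    # Hash Key + DB_TABLE + Index_1=v1: shared by all rows with this value.
    indexed = db_table.encode("utf-8") + SEP + _field(*index)
    return hashlib.md5(indexed).digest() + indexed + SEP


def secondary_row_key(
    db_table: str,
    index: Tuple[str, object],
    primary_key: Sequence[Tuple[str, object]],
) -> bytes:
    # Appending the primary key keeps rows with the same index value distinct.
    pk = SEP.join(_field(name, value) for name, value in primary_key)
    return secondary_scan_prefix(db_table, index) + pk


prefix = secondary_scan_prefix("db1.users", ("city", "SF"))
k1 = secondary_row_key("db1.users", ("city", "SF"), [("id", 101)])
k2 = secondary_row_key("db1.users", ("city", "SF"), [("id", 202)])
k3 = secondary_row_key("db1.users", ("city", "NYC"), [("id", 303)])
```

A prefix scan starting at `prefix` would return `k1` and `k2` but not `k3`.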
HBase Versioning
Rows                               CF:Columns   Version                    Value
<ShardKey><DB_TABLE_#1><PK_a=A>    id           Fri May 19 00:33:19 2016   101
<ShardKey><DB_TABLE_#1><PK_a=A>    city         Fri May 19 00:33:19 2016   San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A>    city         Fri May 10 00:34:19 2016   New York
<ShardKey><DB_TABLE_#2><PK_a=A’>   id           Fri May 19 00:33:19 2016   1
Version by Timestamp
Binlog Order:
Transaction   COMMIT_TS
TXN 1         101
TXN 2         102
TXN 3         103
…
TXN N         N’
Version by Timestamp
Binlog Order:
Transaction   COMMIT_TS   Binlog position
TXN 1         T1          mysql-bin.00000:100
TXN 2         T3          mysql-bin.00000:101
TXN 3         T2          mysql-bin.00000:102
…
TXN N         N’          mysql-bin.00000:N
[Diagram label: NTP]
HBase Versioning
Rows                               CF:Columns   Version               CommitTS
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:100   T0
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:101   T1
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:102   T3
<ShardKey><DB_TABLE_#1><PK_a=A>    id           mysql-bin.00000:103   T2
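The difference between versioning by commit timestamp and versioning by binlog position can be shown with the four versions above. This is an illustrative sketch: the cell values `v0` to `v3` are made up, timestamps T0 to T3 are represented as the integers 0 to 3, and the function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class CellVersion(NamedTuple):
    binlog_pos: Tuple[int, int]  # (binlog file number, offset)
    commit_ts: int               # wall-clock commit time, may be out of order
    value: str                   # made-up cell value for illustration


versions: List[CellVersion] = [
    CellVersion((0, 100), 0, "v0"),
    CellVersion((0, 101), 1, "v1"),
    CellVersion((0, 102), 3, "v2"),  # committed earlier in the log, later clock
    CellVersion((0, 103), 2, "v3"),  # last write in binlog order
]


def latest_by_commit_ts(cells: List[CellVersion]) -> CellVersion:
    return max(cells, key=lambda c: c.commit_ts)


def latest_by_binlog(cells: List[CellVersion]) -> CellVersion:
    return max(cells, key=lambda c: c.binlog_pos)
```

With timestamp versions the store would report `v2` as current, while binlog order says the last write was `v3`. Using the binlog position as the HBase version avoids depending on clock ordering.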
PITR Semantics
Binlog Order:
Transaction   COMMIT_TS
TXN 1         101
TXN 2         103
TXN 3         102
…
TXN N         N’
[Diagram label: NTP]
PITR Semantics: Binlog Commit Time Index
Rows                                            Version (LogicalOffset)   Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100>     100                       mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101>     101                       mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103>     103                       mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102>     102                       mysql-bin.00000:102
Annotations: "The last mutation before PITR"; "First mutation across PITR"
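One possible reading of the commit time index: because rows are keyed by commit hour before logical offset, a snapshot "as of" a point in time can pick mutations by commit time even when binlog order disagrees. The sketch below assumes hour buckets formatted as `YYYY-MM-DD HH`, which sort chronologically as strings, and a PITR cut at `2016-05-24 00`. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

# (commit hour bucket, logical offset) -> binlog position
IndexKey = Tuple[str, int]

commit_time_index: Dict[IndexKey, str] = {
    ("2016-05-23 23", 100): "mysql-bin.00000:100",
    ("2016-05-23 23", 101): "mysql-bin.00000:101",
    ("2016-05-23 23", 103): "mysql-bin.00000:103",
    ("2016-05-24 00", 102): "mysql-bin.00000:102",
}


def split_at_pitr(
    index: Dict[IndexKey, str], cutoff_bucket: str
) -> Tuple[List[int], List[int]]:
    """Return logical offsets committed before / at-or-after the cutoff."""
    before = sorted(off for (bucket, off) in index if bucket < cutoff_bucket)
    after = sorted(off for (bucket, off) in index if bucket >= cutoff_bucket)
    return before, after


before, after = split_at_pitr(commit_time_index, "2016-05-24 00")
```

Under this reading, offset 103 belongs to the snapshot and offset 102 does not, even though 102 precedes 103 in the binlog.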
Streaming DB Export
• Pros:
• Consistent
• High SLA for the daily snapshot
• Consistent as PITR semantics
• Near real time ad hoc query
• Hive/Spark compatible
• Hourly snapshot view
• Low storage cost
• Cons:
• Schema change
Thank you!
Remember to complete
your evaluations!

AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greg Brandt, Liyin Tang (Airbnb) December 2, 2016 Streaming ETL For Amazon RDS and Amazon DynamoDB DAT315
  • 2. What to Expect from the Session • Database Change Data Capture (CDC) • Improving ETL to Data Warehouse
  • 4. Architectural Evolution From monolithic Rails app Too many specialized services/data stores
  • 5. New Challenges • Co-processing logic breaks down out of process/transaction context • Primary tables/indices on many machines, not single RDBMS • Specialized systems needed for certain use cases (analytics, search, etc.)
  • 6. Architectural Tenants • Build for production • Plan for the future, build for today • Prefer existing solutions and patterns that we have experience with in production • Services should own their data and not share their storage • Mutations to data should be propagated via standardized events
  • 7. Change Data Capture (CDC) Goal: Provide streams of data mutations • In near real time • With timeline consistency To keep all these systems in sync
  • 8. Option 1: Application-Driven Dual Writes • Consistency hard • (2PC/consensus needed) • Data model easy • (Schema controlled by application) • Development easy • Use queue e.g. Kafka, RabbitMQ in addition to RDBMS
  • 9. Option 2: Database Log Mining • Consistency easy • (Leverage commit log semantics) • Parsing/Data model hard • (Database’s internal commit log)
  • 10. We Chose Database Log Mining • Parsing is easier than consensus • Many libraries/APIs exist to make parsing easy • Consuming stream of commits gives timeline consistency by default
  • 12. Requirements • Timeline consistency with at-least-once message delivery • Easily add new sources to consume (new machines if necessary) • Support low latency and high throughput use cases • High availability with automatic failover • Heterogeneous data sources (MySQL, Amazon DynamoDB)
  • 13. MySQL Commit Log • Java library for binary log parsing • https://github.com/shyiko/mysql-binlog- connector-java/ • Emit mutation events • (Write_rows, Update_rows, Delete_rows) • Logical clock determined from binlog file/offset • (Single-master, Multi-AZ setup) • Leverage XidEvent for transaction boundary metadata/checkpointing • (InnoDB implementation detail)
  • 14. DynamoDB Streams • Using DynamoDB Streams Kinesis Adapter • Guarantees • Each stream record appears exactly once in the stream. • Stream records appear in the same sequence as the actual modifications to the item • Monotonically increasing logical clock is hard • Need to incorporate shard id, parent/child splitting semantics • SequenceNumber is not global
  • 15. Abstract Mutation • Provide monotonically increasing* id from logical clock • Source-specific metadata (e.g. MySQL binlog filename/offset) • The beforeImage of the row in DB (possibly null) • The afterImage of the row in DB (possibly null) • Encode this using source-agnostic format (e.g. Thrift) • Write this object to message bus (e.g. Kafka) { id: Long, opCode: [ INSERT, UPDATE, DELETE ], metadata: Map<String, String>, beforeImage: Record, afterImage: Record }
  • 16. Clustering/Configuration • LEADER/STANDBY state model • Each machine is LEADER for a subset of sources • Workload distributed evenly • Use ZooKeeper-based Apache Helix framework for cluster management • http://helix.apache.org/ • Dynamic source configuration changes • Helix Instance group tags to separate MySQL/DynamoDB nodes
  • 17. Fault Tolerance • Controller handles node failure/elects new LEADER for sources • Maintain leader_epoch counter in Helix ZooKeeper property store • Prefix generated ids with leader_epoch for monotonicity • E.g. (leader_epoch, binlog_file, binlog_pos)
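One way to picture the `(leader_epoch, binlog_file, binlog_pos)` id is as a tuple compared field by field. The sketch below assumes a scenario the slide does not spell out: after failover the new LEADER resumes from an earlier checkpoint, so binlog positions repeat. The `MutationId` name is hypothetical.

```python
from typing import NamedTuple


class MutationId(NamedTuple):
    leader_epoch: int  # incremented each time a new LEADER is elected for the source
    binlog_file: int
    binlog_pos: int


# Last id emitted by the old leader before it failed.
before_failover = MutationId(leader_epoch=3, binlog_file=42, binlog_pos=9000)

# Assumed scenario: the new leader restarts from an earlier checkpoint, so the
# binlog position goes backwards while the epoch goes forwards.
after_failover = MutationId(leader_epoch=4, binlog_file=42, binlog_pos=8000)

# The epoch prefix dominates the comparison, so ids keep increasing.
assert after_failover > before_failover
```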
  • 18. Pub/Sub • Produce mutations to Kafka with durable configuration* • Async coprocessors consume messages, produce new streams • Model streaming library allows encapsulation of DB table schema • Service controls both API endpoint and streaming view of data • Keep 24 hours of MySQL binlog • Alert / rewind on failures in this tier
  • 19. Online Validation • Download binlog after it is flushed/immutable • Check for holes/ordering violations by consuming stream from Kafka • Allows us to maintain low latency with confidence in consistency of stream • Auto-healing • Reset binlog position to an earlier one if there are too many failures
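The hole/ordering check might look something like the sketch below. It assumes positions are `(file index, offset)` tuples and that duplicates in the stream are acceptable under at-least-once delivery; the `validate` function is a hypothetical illustration, not the validator described in the talk.

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple

Position = Tuple[int, int]  # assumed shape: (binlog file index, offset)


def validate(expected: Sequence[Position], streamed: Iterable[Position]) -> List[str]:
    """Compare positions from the immutable binlog with those seen in the stream.

    Returns a list of problems; an empty list means no holes or ordering violations.
    """
    problems: List[str] = []
    seen: Set[Position] = set()
    high_water: Optional[Position] = None
    for pos in streamed:
        if pos in seen:
            continue  # replayed message: allowed under at-least-once delivery
        if high_water is not None and pos < high_water:
            problems.append(f"ordering violation: {pos} after {high_water}")
        seen.add(pos)
        high_water = pos if high_water is None else max(high_water, pos)
    for pos in expected:
        if pos not in seen:
            problems.append(f"hole: {pos} missing from stream")
    return problems


binlog = [(0, 1), (0, 2), (0, 3)]
print(validate(binlog, [(0, 1), (0, 2), (0, 2), (0, 3)]))  # duplicate is tolerated
print(validate(binlog, [(0, 1), (0, 3)]))                  # reports a hole at (0, 2)
```

A failure count derived from the returned list could then drive the auto-healing rewind.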
  • 20. Production Lessons • Need schema history store for regions of commit log to support rewind • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping • Be careful about table encodings! (latin1, utf8...) • request.required.acks = all can potentially hit every broker… • (Group produce requests by broker to avoid hitting too many) • Per-source produce buffer size • (Tune for throughput/latency)
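A schema history store can be thought of as a sorted map from the binlog position of each DDL statement to the schema in effect from that point on. The sketch below is an in-memory illustration with hypothetical names (`SchemaHistory`, `record_ddl`, `schema_at`); the talk does not describe how the real store is implemented.

```python
import bisect
from typing import List, Sequence, Tuple

Position = Tuple[int, int]  # assumed shape: (binlog file index, offset)


class SchemaHistory:
    """Maps ranges of the commit log to the table schema valid in that range."""

    def __init__(self) -> None:
        self._starts: List[Position] = []    # position of each DDL, in binlog order
        self._schemas: List[List[str]] = []  # column list in effect from that position

    def record_ddl(self, position: Position, columns: Sequence[str]) -> None:
        if self._starts and position <= self._starts[-1]:
            raise ValueError("DDL must be recorded in binlog order")
        self._starts.append(position)
        self._schemas.append(list(columns))

    def schema_at(self, position: Position) -> List[str]:
        # Find the last DDL at or before the requested position.
        i = bisect.bisect_right(self._starts, position) - 1
        if i < 0:
            raise KeyError(f"no schema known at {position}")
        return self._schemas[i]


history = SchemaHistory()
history.record_ddl((0, 4), ["id", "city"])
history.record_ddl((2, 100), ["id", "city", "country"])

# After a rewind to file 1, rows are decoded with the schema that was valid then.
assert history.schema_at((1, 500)) == ["id", "city"]
```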
  • 23. Batch Infrastructure • (Diagram labels: Airflow Scheduling; Events Log; DB Mutation; Gold; Silver; Batch Ingestion; Query Engines: Hive/Presto/Spark; RDS; EC2)
  • 24. Growing Pain • (Diagram labels: Airflow Scheduling; Events Log; DB Mutation; Gold; Silver; Batch Ingestion; Query Engines: Hive/Presto/Spark; RDS; EC2)
  • 25. Point-in-Time Restore based DB Export • Pros: • Simple • Especially for schema change • Consistent • Cons: • No SLA for RDS PITR restoration time • No near real time ad hoc query • No hourly snapshot • High storage cost
  • 27. Real-Time Ingestion on HBase • (Diagram labels: RDS; SpinalTap; Spark Streaming; HBase; HDFS; snapshot; Real time query; Batch query; Query Engines: Hive/Presto/Spark)
  • 28. Access Data in HBase • (Diagram labels: HBase; HDFS; snapshot; Unified view on real time data; Streaming: Spark; Interactive Query: Presto; Batch Job: Hive/Spark)
  • 29. Snapshot & Reseed • (Diagram labels: HBase; HDFS; Snapshot (HFile Links); Bulk upload (Reseed))
  • 30. Onboard New Tables • (Diagram labels: HBase; RDS; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest)
  • 31. Disaster Recovery - Checkpoint • (Diagram labels: HBase; RDS; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest)
  • 32. Disaster Recovery - Rewind • (Diagram labels: HBase; RDS; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest)
  • 33. Disaster Recovery - Reseed • (Diagram labels: HBase; RDS; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest)
  • 35. Key Space Design • Multiplex all DB tables onto a single HBase table • Fast point lookup based on primary keys • Efficient sequential scans for one table • Load balance
  • 36. HBase Row Keys – Primary Keys • Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2) • Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 • Fast point lookup based on primary keys • Efficient sequential scan for all the keys in same DB/Table • Balanced based on hash key • Key layout: Hash | DB_TABLE | PK1=v1 | PK2=v2
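The primary-key row key formula can be sketched as below. The field separator, the use of the full 16-byte MD5 digest, and the `primary_row_key` name are assumptions; the slide only gives the overall layout.

```python
import hashlib
from typing import Sequence, Tuple

SEP = b"\x00"  # assumption: a separator byte between key components


def primary_row_key(db_table: str, pk: Sequence[Tuple[str, object]]) -> bytes:
    """Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 (layout from the slide)."""
    parts = [db_table.encode("utf-8")]
    parts += [f"{name}={value}".encode("utf-8") for name, value in pk]
    body = SEP.join(parts)
    hash_key = hashlib.md5(body).digest()  # Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)
    return hash_key + body


key = primary_row_key("airbed.users", [("id", 101), ("region", "us")])

# The key is deterministic, so a point lookup can rebuild it from the primary key alone.
assert key == primary_row_key("airbed.users", [("id", 101), ("region", "us")])
```

Because the hash prefix is derived from the key contents, writes spread across HBase regions rather than concentrating on one table's key range. The table name `airbed.users` is made up for the example.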
  • 37. HBase Row Keys – Secondary Keys • Hash Key = md5(DB_TABLE, Index_1=v1) • Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1 • Prefix scan for a given secondary index • Key layout: Hash | DB_TABLE | Index_1=v1 | PK1=vpk1
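For secondary keys, the hash covers only the table and the indexed value, so every row sharing that value shares a key prefix. The sketch below simulates the prefix scan over an in-memory list instead of HBase; the separator byte and all function names are assumptions for illustration.

```python
import hashlib
from typing import List

SEP = b"\x00"  # assumption: a separator byte between key components


def index_prefix(db_table: str, index_name: str, index_value: object) -> bytes:
    """Hash Key + DB_TABLE + Index_1=v1, shared by all rows with this index value."""
    body = SEP.join([db_table.encode("utf-8"), f"{index_name}={index_value}".encode("utf-8")])
    # The trailing separator stops city=SF from also matching city=SFO.
    return hashlib.md5(body).digest() + body + SEP


def index_row_key(db_table: str, index_name: str, index_value: object,
                  pk_name: str, pk_value: object) -> bytes:
    return index_prefix(db_table, index_name, index_value) + f"{pk_name}={pk_value}".encode("utf-8")


def prefix_scan(keys: List[bytes], prefix: bytes) -> List[bytes]:
    # Stand-in for an HBase prefix scan over lexicographically sorted row keys.
    return [k for k in sorted(keys) if k.startswith(prefix)]


keys = [
    index_row_key("db.users", "city", "SF", "id", 1),
    index_row_key("db.users", "city", "SF", "id", 2),
    index_row_key("db.users", "city", "NY", "id", 3),
]
matches = prefix_scan(keys, index_prefix("db.users", "city", "SF"))
assert len(matches) == 2
```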
  • 38. HBase Versioning • Columns: Rows | CF:Columns | Version | Value • <ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101 • <ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco • <ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York • <ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1
  • 39. Version by Timestamp • Binlog order: TXN 1 (COMMIT_TS: 101), TXN 2 (COMMIT_TS: 102), TXN 3 (COMMIT_TS: 103), …, TXN N (COMMIT_TS: N’)
  • 40. Version by Timestamp • Binlog order: TXN 1 (COMMIT_TS: T1), TXN 2 (COMMIT_TS: T3), TXN 3 (COMMIT_TS: T2), …, TXN N (COMMIT_TS: N’) • Binlog positions: mysql-bin.00000:100, mysql-bin.00000:101, mysql-bin.00000:102, …, mysql-bin.00000:N • NTP
  • 41. HBase Versioning • Columns: Rows | CF:Columns | Version | CommitTS • <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0 • <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1 • <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3 • <ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
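The versioning slides suggest that commit timestamps do not always follow binlog order (T3 is logged before T2), which is a problem if the timestamp is used as the cell version. The toy model below, with made-up values, treats a cell as a map from version to value where the highest version wins; it is an illustration of the idea, not HBase client code.

```python
from typing import Dict, List, Tuple

# (binlog offset, commit timestamp, value) in binlog order; values are made up.
# The third write has a commit timestamp lower than the second one.
writes: List[Tuple[int, int, str]] = [
    (100, 1001, "New York"),
    (101, 1003, "Seattle"),
    (102, 1002, "San Francisco"),
]


def latest(cell_versions: Dict[int, str]) -> str:
    # A read of the newest version returns the value with the highest version number.
    return cell_versions[max(cell_versions)]


by_timestamp = {ts: value for _, ts, value in writes}
by_offset = {offset: value for offset, _, value in writes}

# Versioning by commit timestamp surfaces a stale value...
assert latest(by_timestamp) == "Seattle"
# ...while versioning by binlog offset returns the last write in commit-log order.
assert latest(by_offset) == "San Francisco"
```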
  • 42. PITR Semantics • Binlog order: TXN 1 (COMMIT_TS: 101), TXN 2 (COMMIT_TS: 103), TXN 3 (COMMIT_TS: 102), …, TXN N (COMMIT_TS: N’) • NTP
  • 43. PITR Semantics: Binlog Commit Time Index • Columns: Rows | Version (LogicalOffset) | Value • <ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100 • <ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101 • <ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103 • <ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102 • Annotations: First mutation across PITR; The last mutation before PITR
  • 44. Streaming DB Export • Pros: • High SLA for the daily snapshot • Same consistency as PITR semantics • Near real time ad hoc query • Hive/Spark compatible • Hourly snapshot view • Low storage cost • Cons: • Schema change