SlideShare a Scribd company logo
1 of 43
Download to read offline
The Evolution of Netflix’s S3
Data Warehouse
Ryan Blue & Dan Weeks
September 2018 - Strata NY
Overview
● Netflix Architecture
● S3 Data Warehouse
● Iceberg Tables
● What’s Next
Netflix Architecture
September 2018 - Strata NY
Cloud native data warehouse
● Separate Compute and Storage
● Isolate Different Workloads
● Single Source of Truth
Architectural Principles
● S3 as storage layer
○ Metadata in Hive Metastore
● EC2 as compute layer
○ Hadoop + YARN
● Spark, Presto (and a little Hive and Pig)
Tech Stack
S3 Data Warehouse
September 2018 - Strata NY
Hadoop file system
compatibility with S3
S3HDFS
create()
open()
listStatus()
delete()
rename()
REST.PUT.OBJECT
REST.GET.OBJECT
REST.GET.BUCKET
REST.DELETE.OBJECT
REST.COPY.OBJECT +
REST.DELETE.OBJECT
S3 as a File System
What about performance?
● Performance
○ Individual operations take longer
○ Some operations do not map cleanly
○ Break contracts to optimize
● Commit Path
○ Relies on expensive rename
○ Creates multiple copies with versioning
Performance & Compatibility
● The “batch” pattern
○ Never delete data as part of a job
○ Always write data to new paths
○ Atomically swap data locations
● S3 Committer
○ Use features like multi-part upload
○ Allows for “append” support
Optimizing Commits
What about consistency?
● Overlay a consistent view of metadata
○ Track file system metadata externally
○ Expire old metadata and rely on S3
● Check listings against consistent system
○ Fail or delay until view is consistent
○ Manually resolve collisions
Consistent Listing (s3mper)
● Maintenance Cost is High
○ Custom changes per execution engine
○ Never implemented in Presto or Hive
○ Behaviors differ slightly by implementation
● Platform issues are surfaced to users
○ Append is not atomic
○ Automatic overwrite
○ Table operations can be inconsistent
Challenges
● File System
○ Works around differences in behaviors
○ Trades correctness for fewer S3 calls
● s3mper
○ Works around S3 prefix-listing inconsistency
● S3 committers and Batch Pattern
○ Works around lack of atomic changes to file listings
○ Works around lack of cheap rename in S3
○ Needed to avoid using S3 file system for silly operations
Common Threads
Maybe the problem is using
S3 as a file system?
Why are we using S3 this way?
Iceberg
September 2018 - Strata NY
● Key idea: organize data in a directory tree
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
|- ...
Hive Table Design
● Filter by directories as columns
SELECT ... WHERE date = '20180513' AND hour = 19
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
|- ...
Hive Table Design
● Table state is stored in two places
○ Partitions in the Hive Metastore
○ Files in a FS with no transaction support
● Still requires directory listing to plan jobs
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
● Requires elaborate locking for “correctness”
○ Nothing respects the locking scheme
Design Problems
● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
S1 S2 S3 ...
● Snapshot isolation without locking
○ Readers use a current snapshot
○ Writers produce new snapshots in isolation, then commit
● Any change to the file list is an atomic operation
○ Append data across partitions
○ Merge or rewrite files
Snapshot Design Benefits
S1 S2 S3 ...
R W
Design Benefits
● No expensive or eventually-consistent FS operations:
○ No directory or prefix listing
○ No rename: data files written in place
● Reads and writes are isolated and all changes are atomic
● Faster scan planning, distributed across the cluster
○ O(1) manifest reads, not O(n) partition list calls
○ Upper and lower bounds used to eliminate files
● Reliable CBO metrics
Iceberg replaces s3mper, batch
pattern, and S3 committers
Want more specifics?
Come to the Iceberg talk!
At 5:25 today in 1E09
What’s next?
September 2018 - Strata NY
● New to Hadoop? Big data is great! Just remember . . .
○ You need to know the physical layout of tables you read
○ Make sure you don’t write too many files – or too few
○ Appends are actually overwrites, except in Presto
○ Don’t write from Presto (but nothing will stop you)
○ You shouldn’t use timestamps or nested types
○ You can’t drop columns in CSV tables
○ And by CSV, we don’t really mean CSV
○ You can’t rename columns in JSON tables
○ If you rename columns in Parquet,
either Presto or Spark will work, but not both
○ . . .
Today: A narrow paved path
● Hidden partitioning
○ Partition filters derived from data filters
○ No more accidental full table scans
● Full schema evolution
○ Supports add, drop, and rename columns
● Reliable support for types
○ date, time, timestamp, and decimal
○ struct, list, map, and mixed nesting
While we’re fixing tables . . .
● Queries are not broken by layout changes
● Physical layout can evolve without painful migration
○ Mistakes can be fixed
○ Prototypes can move to production faster
○ Tables can change as volume grows over time
● Data Platform can transparently fix table layout
Table Layout is Hidden
● Any write is atomic – either complete or invisible
○ Rewrite files instead of partitions
○ Tables never have partially committed data
● Simple, built-in change detection
○ Cache and materialized view maintenance
○ Incremental processing
● Data Platform can monitor and fix data files
○ Compact small files
○ Repartition to a new layout
Snapshot-based Tables
● Common implementation for table operations
○ Write settings are per table, like row group size
○ Read defaults are set in one place, like split combination
● Simple data gathering
○ Log scan predicates and projection to Kafka
○ Recommend optimizations based on actual use
● Data Platform can automate tuning recommendations
○ Test file format tuning settings per table
○ Update table to affect all writes
Table Format Library
Questions?
September 2018 - Strata NY
Additional Iceberg Slides
September 2018 - Strata NY
● Historical Atlas data:
○ Time-series metrics from Netflix runtime systems
○ 1 month: 2.7 million files in 2,688 partitions
○ Problem: cannot process more than a few days of data
● Sample query:
select distinct tags['type'] as type
from iceberg.atlas
where
name = 'metric-name' and
date > 20180222 and date <= 20180228
order by type;
Case Study: Atlas
● Hive table – with Parquet filters:
○ 400k+ splits per day, not combined
○ EXPLAIN query: 9.6 min (planning wall time)
● Iceberg table – partition data filtering:
○ 15,218 splits, combined
○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)
● Iceberg table – partition and min/max filtering:
○ 412 splits
○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning)
Case Study: Atlas Performance
● Implementation of snapshot-based tracking
○ Adds table schema, partition layout, string properties
○ Tracks old snapshots for eventual garbage collection
● Table metadata is immutable and always moves forward
● The current snapshot (pointer) can be rolled back
Iceberg Metadata
v1.json
S1 S2
v2.json
S1 S2 S3
v3.json
S2 S3
● Snapshots are split across one or more manifest files
○ Manifests store partition data for each data file
○ Reused to avoid high write volume
Manifest Files
v1.json
S1 S2
v2.json
S1 S2 S3
v3.json
S2 S3
m0.avro m1.avro m2.avro
● Basic data file info:
○ File location and format
○ Iceberg tracking data
● Values to filter files for a scan:
○ Partition data values
○ Per-column lower and upper bounds
● Metrics for cost-based optimization:
○ File-level: row count, size
○ Column-level: value count, null count, size
Manifest File Contents
● To commit, a writer must:
○ Note the current metadata version – the base version
○ Create new metadata and manifest files
○ Atomically swap the base version for the new version
● This atomic swap ensures a linear history
● Atomic swap is implemented by:
○ A custom metastore implementation
○ Atomic rename for HDFS or local tables
Commits
● Writers optimistically write new versions:
○ Assume that no other writer is operating
○ On conflict, retry based on the latest metadata
● To support retry, operations are structured as:
○ Assumptions about the current table state
○ Pending changes to the current table state
● Changes are safe if the assumptions are all true
Commits: Conflict Resolution
● Use case: safely merge small files
○ Merge input: file1.avro, file2.avro
○ Merge output: merge1.parquet
● Rewrite operation:
○ Assumption: file1.avro and file2.avro are still present
○ Pending changes:
Remove file1.avro and file2.avro
Add merge1.parquet
● Deleting file1.avro or file2.avro will cause a commit failure
Commits: Resolution Example

More Related Content

What's hot

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfAltinity Ltd
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...DataWorks Summit
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 

What's hot (20)

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 

Similar to The evolution of Netflix's S3 data warehouse (Strata NY 2018)

Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixStefan Krawczyk
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxXinliShang1
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBJason Terpko
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixStefan Krawczyk
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataJihoon Son
 
Hadoop and cassandra
Hadoop and cassandraHadoop and cassandra
Hadoop and cassandraChristina Yu
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBAntonios Giannopoulos
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldJihoon Son
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Databricks
 
When is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesWhen is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesAlkin Tezuysal
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016Nikhil Shekhar
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 

Similar to The evolution of Netflix's S3 data warehouse (Strata NY 2018) (20)

Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Data Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch FixData Day Texas 2017: Scaling Data Science at Stitch Fix
Data Day Texas 2017: Scaling Data Science at Stitch Fix
 
ApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptxApacheCon 2022_ Large scale unification of file format.pptx
ApacheCon 2022_ Large scale unification of file format.pptx
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch FixData Day Seattle 2017: Scaling Data Science at Stitch Fix
Data Day Seattle 2017: Scaling Data Science at Stitch Fix
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Hadoop and cassandra
Hadoop and cassandraHadoop and cassandra
Hadoop and cassandra
 
Managing data and operation distribution in MongoDB
Managing data and operation distribution in MongoDBManaging data and operation distribution in MongoDB
Managing data and operation distribution in MongoDB
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2Improving Apache Spark's Reliability with DataSourceV2
Improving Apache Spark's Reliability with DataSourceV2
 
When is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar SeriesWhen is Myrocks good? 2020 Webinar Series
When is Myrocks good? 2020 Webinar Series
 
Fluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at ScaleFluent Bit: Log Forwarding at Scale
Fluent Bit: Log Forwarding at Scale
 
Nikhil summer internship 2016
Nikhil   summer internship 2016Nikhil   summer internship 2016
Nikhil summer internship 2016
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 

The evolution of Netflix's S3 data warehouse (Strata NY 2018)

  • 1. The Evolution of Netflix’s S3 Data Warehouse Ryan Blue & Dan Weeks September 2018 - Strata NY
  • 2. Overview ● Netflix Architecture ● S3 Data Warehouse ● Iceberg Tables ● What’s Next
  • 4. Cloud native data warehouse
  • 5. ● Separate Compute and Storage ● Isolate Different Workloads ● Single Source of Truth Architectural Principles
  • 6. ● S3 as storage layer ○ Metadata in Hive Metastore ● EC2 as compute layer ○ Hadoop + YARN ● Spark, Presto (and a little Hive and Pig) Tech Stack
  • 7. S3 Data Warehouse September 2018 - Strata NY
  • 11. ● Performance ○ Individual operations take longer ○ Some operations do not map cleanly ○ Break contracts to optimize ● Commit Path ○ Relies on expensive rename ○ Creates multiple copies with versioning Performance & Compatibility
  • 12. ● The “batch” pattern ○ Never delete data as part of a job ○ Always write data to new paths ○ Atomically swap data locations ● S3 Committer ○ Use features like multi-part upload ○ Allows for “append” support Optimizing Commits
  • 14. ● Overlay a consistent view of metadata ○ Track file system metadata externally ○ Expire old metadata and rely on S3 ● Check listings against consistent system ○ Fail or delay until view is consistent ○ Manually resolve collisions Consistent Listing (s3mper)
  • 15. ● Maintenance Cost is High ○ Custom changes per execution engine ○ Never implemented in Presto or Hive ○ Behaviors differ slightly by implementation ● Platform issues are surfaced to users ○ Append is not atomic ○ Automatic overwrite ○ Table operations can be inconsistent Challenges
  • 16. ● File System ○ Works around differences in behaviors ○ Trades correctness for fewer S3 calls ● s3mper ○ Works around S3 prefix-listing inconsistency ● S3 committers and Batch Pattern ○ Works around lack of atomic changes to file listings ○ Works around lack of cheap rename in S3 ○ Needed to avoid using S3 file system for silly operations Common Threads
  • 17. Maybe the problem is using S3 as a file system?
  • 18. Why are we using S3 this way?
  • 20. ● Key idea: organize data in a directory tree date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... |- ... Hive Table Design
  • 21. ● Filter by directories as columns SELECT ... WHERE date = '20180513' AND hour = 19 date=20180513/ |- hour=18/ | |- ... |- hour=19/ | |- part-000.parquet | |- ... | |- part-031.parquet |- hour=20/ | |- ... |- ... Hive Table Design
  • 22. ● Table state is stored in two places ○ Partitions in the Hive Metastore ○ Files in a FS with no transaction support ● Still requires directory listing to plan jobs ○ O(n) listing calls, n = # matching partitions ○ Eventual consistency breaks correctness ● Requires elaborate locking for “correctness” ○ Nothing respects the locking scheme Design Problems
  • 23. ● Key idea: track all files in a table over time ○ A snapshot is a complete list of files in a table ○ Each write produces and commits a new snapshot Iceberg’s Design S1 S2 S3 ...
  • 24. ● Snapshot isolation without locking ○ Readers use a current snapshot ○ Writers produce new snapshots in isolation, then commit ● Any change to the file list is an atomic operation ○ Append data across partitions ○ Merge or rewrite files Snapshot Design Benefits S1 S2 S3 ... R W
  • 25. Design Benefits ● No expensive or eventually-consistent FS operations: ○ No directory or prefix listing ○ No rename: data files written in place ● Reads and writes are isolated and all changes are atomic ● Faster scan planning, distributed across the cluster ○ O(1) manifest reads, not O(n) partition list calls ○ Upper and lower bounds used to eliminate files ● Reliable CBO metrics
  • 26. Iceberg replaces s3mper, batch pattern, and S3 committers
  • 27. Want more specifics? Come to the Iceberg talk! At 5:25 today in 1E09
  • 29. ● New to Hadoop? Big data is great! Just remember . . . ○ You need to know the physical layout of tables you read ○ Make sure you don’t write too many files – or too few ○ Appends are actually overwrites, except in Presto ○ Don’t write from Presto (but nothing will stop you) ○ You shouldn’t use timestamps or nested types ○ You can’t drop columns in CSV tables ○ And by CSV, we don’t really mean CSV ○ You can’t rename columns in JSON tables ○ If you rename columns in Parquet, either Presto or Spark will work, but not both ○ . . . Today: A narrow paved path
  • 30. ● Hidden partitioning ○ Partition filters derived from data filters ○ No more accidental full table scans ● Full schema evolution ○ Supports add, drop, and rename columns ● Reliable support for types ○ date, time, timestamp, and decimal ○ struct, list, map, and mixed nesting While we’re fixing tables . . .
  • 31. ● Queries are not broken by layout changes ● Physical layout can evolve without painful migration ○ Mistakes can be fixed ○ Prototypes can move to production faster ○ Tables can change as volume grows over time ● Data Platform can transparently fix table layout Table Layout is Hidden
  • 32. ● Any write is atomic – either complete or invisible ○ Rewrite files instead of partitions ○ Tables never have partially committed data ● Simple, built-in change detection ○ Cache and materialized view maintenance ○ Incremental processing ● Data Platform can monitor and fix data files ○ Compact small files ○ Repartition to a new layout Snapshot-based Tables
  • 33. ● Common implementation for table operations ○ Write settings are per table, like row group size ○ Read defaults are set in one place, like split combination ● Simple data gathering ○ Log scan predicates and projection to Kafka ○ Recommend optimizations based on actual use ● Data Platform can automate tuning recommendations ○ Test file format tuning settings per table ○ Update table to affect all writes Table Format Library
  • 36. ● Historical Atlas data: ○ Time-series metrics from Netflix runtime systems ○ 1 month: 2.7 million files in 2,688 partitions ○ Problem: cannot process more than a few days of data ● Sample query: select distinct tags['type'] as type from iceberg.atlas where name = 'metric-name' and date > 20180222 and date <= 20180228 order by type; Case Study: Atlas
  • 37. ● Hive table – with Parquet filters: ○ 400k+ splits per day, not combined ○ EXPLAIN query: 9.6 min (planning wall time) ● Iceberg table – partition data filtering: ○ 15,218 splits, combined ○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning) ● Iceberg table – partition and min/max filtering: ○ 412 splits ○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning) Case Study: Atlas Performance
  • 38. ● Implementation of snapshot-based tracking ○ Adds table schema, partition layout, string properties ○ Tracks old snapshots for eventual garbage collection ● Table metadata is immutable and always moves forward ● The current snapshot (pointer) can be rolled back Iceberg Metadata v1.json S1 S2 v2.json S1 S2 S3 v3.json S2 S3
  • 39. ● Snapshots are split across one or more manifest files ○ Manifests store partition data for each data file ○ Reused to avoid high write volume Manifest Files v1.json S1 S2 v2.json S1 S2 S3 v3.json S2 S3 m0.avro m1.avro m2.avro
  • 40. ● Basic data file info: ○ File location and format ○ Iceberg tracking data ● Values to filter files for a scan: ○ Partition data values ○ Per-column lower and upper bounds ● Metrics for cost-based optimization: ○ File-level: row count, size ○ Column-level: value count, null count, size Manifest File Contents
  • 41. ● To commit, a writer must: ○ Note the current metadata version – the base version ○ Create new metadata and manifest files ○ Atomically swap the base version for the new version ● This atomic swap ensures a linear history ● Atomic swap is implemented by: ○ A custom metastore implementation ○ Atomic rename for HDFS or local tables Commits
  • 42. ● Writers optimistically write new versions: ○ Assume that no other writer is operating ○ On conflict, retry based on the latest metadata ● To support retry, operations are structured as: ○ Assumptions about the current table state ○ Pending changes to the current table state ● Changes are safe if the assumptions are all true Commits: Conflict Resolution
  • 43. ● Use case: safely merge small files ○ Merge input: file1.avro, file2.avro ○ Merge output: merge1.parquet ● Rewrite operation: ○ Assumption: file1.avro and file2.avro are still present ○ Pending changes: Remove file1.avro and file2.avro Add merge1.parquet ● Deleting file1.avro or file2.avro will cause a commit failure Commits: Resolution Example