Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us with the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores/engines like Teradata, Redshift, and Druid, as well as for exporting data to reporting tools like MicroStrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases where S3 smoothly addresses an important data infrastructure need. We will also share solutions and methodologies for building your own S3 big data hub.
2. What to Expect from the Session
How we use Amazon S3 as our centralized data hub
Our big data ecosystem on AWS
- Big data processing engines
- Architecture
- Tools and services
6. Why is it Intuitive?
It is a cloud-native service! (free engineering)
‘Practically infinitely’ scalable
99.999999999% durable
99.99% available
Decouple compute and storage
7. Why is it Counterintuitive?
Eventual consistency?
Performance?
18. Our Data Processing Engines
[Diagram: S3 data hub feeding our Hadoop YARN clusters]
- ~250-400 r3.4xl nodes
- ~3,500 d2.4xl nodes
19. Looking at It from the Scalability Angle
1 d2.4xl has 24 TB of local storage
60 PB / 24 TB = 2,560 machines
To achieve 3-way replication for redundancy in one zone, we would need 7,680 machines!
The data we have is beyond what we could fit into our clusters!
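For reference, the same back-of-the-envelope math as a quick script (instance numbers as stated above):

DATA_PB = 60          # total data in S3
TB_PER_PB = 1024
NODE_TB = 24          # local storage on one d2.4xl
REPLICATION = 3       # HDFS-style 3-way replication

nodes = DATA_PB * TB_PER_PB / NODE_TB    # 2,560 machines for one copy
nodes_replicated = nodes * REPLICATION   # 7,680 machines with replication

print(f"{nodes:.0f} machines for one copy, "
      f"{nodes_replicated:.0f} with {REPLICATION}-way replication")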
21. What are the tradeoffs?
Eventual consistency
Performance
22. Eventual Consistency
Updates (overwrite PUTs)
- Always write new files under new keys when updating data, then delete the old files (see the sketch after this list)
List
- We need to know when a listing missed something
- Keep a prefix manifest in S3mper (or EMRFS) to detect incomplete listings
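A minimal sketch of the write-new-key-then-delete pattern, assuming boto3; the bucket and key names are hypothetical:

import uuid
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'   # hypothetical bucket

# Write the updated data under a brand-new key instead of overwriting
# in place, so readers never hit a stale overwrite PUT.
new_key = f'warehouse/my_table/dateint=20160101/part-{uuid.uuid4().hex}.parquet'
s3.upload_file('part-00000.parquet', bucket, new_key)

# Only after the new file is in place, delete the old one.
old_key = 'warehouse/my_table/dateint=20160101/part-old.parquet'
s3.delete_object(Bucket=bucket, Key=old_key)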
23. Parquet
Majority of our data is in Parquet file format
Supported across Hive, Pig, Presto, Spark
Performance benefits on the read path (see the sketch after this list)
- Column projection
- Predicate pushdown
- Vectorized read
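As an illustration of column projection and predicate pushdown, a PySpark sketch; the S3 path and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-read-demo').getOrCreate()

# Predicate pushdown: the filter is checked against Parquet row-group
# statistics, so non-matching row groups are skipped entirely.
# Column projection: only the selected columns are read from the files.
df = (spark.read.parquet('s3://my-bucket/warehouse/my_table/')
          .filter('dateint = 20160101')
          .select('user_id', 'title_id'))

df.show(10)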
24. Performance Impact
Read
- Penalty: Throughput and latency
- Impact depends on amount of data read
- Improvement: I/O manager in Parquet
Write
- Penalty: Writing to local disk before upload to S3
- Improvement: Direct write via multipart uploads
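A minimal sketch of the direct-write idea using S3 multipart uploads via boto3; the bucket, key, and chunk contents are hypothetical:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-warehouse-bucket', 'warehouse/my_table/part-00000.parquet'

# Stand-in for output chunks a writer produces incrementally
# (every part except the last must be at least 5 MB).
chunks = [b'x' * (5 * 1024 * 1024), b'y' * (5 * 1024 * 1024), b'tail']

# Stream parts as they are produced instead of staging the whole
# file on local disk and uploading it at the end.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=mpu['UploadId'], Body=chunk)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})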
25. Performance Impact
List
- Penalty: Listing thousands of partitions for split calculation
- Each partition is an S3 prefix
- Improvement: Track files instead of prefixes
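A hedged illustration of the difference: listing every partition prefix versus reading a pre-built file manifest; the manifest layout shown is an assumption, not our actual format:

import json
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'   # hypothetical bucket

# Slow path: one or more LIST calls per partition prefix at split-calculation time.
def list_partition(prefix):
    paginator = s3.get_paginator('list_objects_v2')
    return [obj['Key']
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get('Contents', [])]

# Faster path: read one manifest object that already tracks the files per
# partition, avoiding thousands of LIST calls.
def files_from_manifest(manifest_key='warehouse/my_table/_manifest.json'):
    body = s3.get_object(Bucket=bucket, Key=manifest_key)['Body'].read()
    return json.loads(body)   # e.g. {"dateint=20160101": ["part-0000.parquet", ...]}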
26. Performance Impact – some good news
ETL jobs:
- Mostly CPU bound, not network bound
- Performance converges w/ volume and complexity
Interactive queries:
- % impact is higher … but they run fast
Benefits still outweigh the cost!
28. For Users
Should I run my job on my laptop?
Where can I find the right version of the tools?
Which cluster should I run my high-priority ETL job on?
Where can I see all my jobs that ran yesterday?
29. For Admins
How do I manage different versions of tools in different clusters?
How can I upgrade or swap clusters with no downtime for users?
30. Genie – Job and Cluster Mgmt Service
Users:
- Discovery: find the right cluster to run each job
- Gateway: a single entry point for running different kinds of jobs
- Orchestration: and the one place to find all jobs!
Admins:
- Config mgmt: multiple versions of multiple tools
- Deployment: cluster swap/updates with no downtime
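A hedged sketch of what submitting a job through Genie's REST gateway can look like; the endpoint and payload field names are assumptions based on the open-source Genie 3 API and may differ from the version described in this talk:

import requests

# Assumed Genie-3-style endpoint and payload; field names are assumptions.
job = {
    'name': 'daily_agg',
    'user': 'dataeng',
    'version': '1.0',
    'clusterCriterias': [{'tags': ['sched:sla', 'type:yarn']}],  # pick a cluster by tags
    'commandCriteria': ['type:sparksubmit'],                     # pick the tool by tags
    'commandArgs': '--class com.example.DailyAgg daily-agg.jar 20160101',
}
resp = requests.post('https://genie.example.com/api/v3/jobs', json=job)
resp.raise_for_status()
print('Submitted job:', resp.headers.get('Location'))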
38. Metacat
Common APIs for our applications and tools; Thrift APIs for interoperability.
Metadata discovery across data sources
Additional business context
- Lifecycle policy (TTL) per table
- Table owner, description, tags
- User-defined custom metrics
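A hedged sketch of looking up a table's metadata through Metacat's REST API; the URL shape and response fields shown are assumptions and may not match the deployed version:

import requests

# Assumed Metacat-style REST path; the exact URL shape is an assumption.
base = 'https://metacat.example.com/mds/v1'
resp = requests.get(f'{base}/catalog/prodhive/database/default/table/my_table')
resp.raise_for_status()

table = resp.json()
# Business context such as owner, tags, and lifecycle (TTL) would live
# alongside the schema in the returned table metadata.
print(table.get('definitionMetadata', {}))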
39. Data Lifecycle Management
Janitor tools
- Delete ‘dangling’ data after 60 days
- Delete data obsoleted by ‘data updates’ after 3 days
- Delete partitions based on table TTL
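A minimal sketch of the 'dangling data' janitor idea, assuming boto3; the bucket and prefix are hypothetical, and only the 60-day window comes from the bullet above:

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'
cutoff = datetime.now(timezone.utc) - timedelta(days=60)

# Walk a prefix of candidate data and collect objects older than the window.
paginator = s3.get_paginator('list_objects_v2')
expired = [obj['Key']
           for page in paginator.paginate(Bucket=bucket, Prefix='warehouse/tmp/')
           for obj in page.get('Contents', [])
           if obj['LastModified'] < cutoff]

# In practice the keys would be handed to the deletion service (next slide)
# rather than deleted directly.
for key in expired:
    s3.delete_object(Bucket=bucket, Key=key)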
41. Deletion Service
Centralized service to handle errors, retries, and backoffs of S3 deletes
Cool-down period: data is actually deleted only after a few days
Store history and statistics
Allow easy recovery based on time and tags
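A conceptual sketch of the cool-down and recovery flow; this is not the actual service, just an illustration of the idea using an in-memory request store:

from datetime import datetime, timedelta, timezone

COOL_DOWN = timedelta(days=3)   # wait before the S3 deletes actually run
pending = []                    # stand-in for a durable store of delete requests

def request_delete(keys, tag):
    # Record the request; nothing is deleted yet, so it can still be recovered.
    pending.append({'keys': keys, 'tag': tag,
                    'requested_at': datetime.now(timezone.utc)})

def run_pending(s3, bucket):
    now = datetime.now(timezone.utc)
    for req in pending:
        if now - req['requested_at'] >= COOL_DOWN:
            for key in req['keys']:
                s3.delete_object(Bucket=bucket, Key=key)   # retries/backoff omitted

def recover(tag):
    # Recovery here means cancelling a still-pending request by tag.
    pending[:] = [r for r in pending if r['tag'] != tag]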
64. Big Data API (aka Kragle)
import kragle as kg

# Move a table from the prod Hive warehouse to Redshift, addressing
# both ends through their Metacat URIs.
trans_info = (kg.transport.Transporter()
              .source('metacat://prodhive/default/my_table')
              .target('metacat://redshift/test/demo_table')
              .execute())
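Because both endpoints are addressed as metacat:// URIs, the same transport call presumably works for any source/target pair registered in Metacat.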