Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us with the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores/engines like Teradata, Redshift, and Druid, as well as for exporting data to reporting tools like MicroStrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases where S3 smoothly addresses an important data infrastructure need. We will also share solutions and methodologies for building your own S3 big data hub.
2. What to Expect from the Session
How we use Amazon S3 as our centralized data hub
Our big data ecosystem on AWS
- Big data processing engines
- Architecture
- Tools and services
6. Why is it Intuitive?
It is a cloud-native service! (free engineering)
‘Practically infinitely’ scalable
99.999999999% durable
99.99% available
Decouple compute and storage
7. Why is it Counterintuitive?
Eventual consistency?
Performance?
18. Our Data Processing Engines
[Diagram: S3 data hub feeding our Hadoop YARN clusters]
- ~250-400 r3.4xl nodes
- ~3,500 d2.4xl nodes
19. Looking at It from the Scalability Angle
1 d2.4xl has 24 TB of local storage
60 PB / 24 TB = 2,560 machines
To achieve 3-way replication for redundancy in one zone, we would need 7,680 machines!
The data we have is beyond what we could fit into our clusters!
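For reference, the same back-of-the-envelope math as a quick script (instance numbers as stated above):

DATA_PB = 60          # total data in S3
TB_PER_PB = 1024
NODE_TB = 24          # local storage on one d2.4xl
REPLICATION = 3       # HDFS-style 3-way replication

nodes = DATA_PB * TB_PER_PB / NODE_TB    # 2,560 machines for one copy
nodes_replicated = nodes * REPLICATION   # 7,680 machines with replication

print(f"{nodes:.0f} machines for one copy, "
      f"{nodes_replicated:.0f} with {REPLICATION}-way replication")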
21. What are the tradeoffs?
Eventual consistency
Performance
22. Eventual Consistency
Updates (overwrite PUTs)
- Always write new files under new keys when updating data, then delete the old files (see the sketch after this list)
List
- We need to know when a listing missed something
- Keep a prefix manifest in S3mper (or EMRFS) to detect incomplete listings
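A minimal sketch of the write-new-key-then-delete pattern, assuming boto3; the bucket and key names are hypothetical:

import uuid
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'   # hypothetical bucket

# Write the updated data under a brand-new key instead of overwriting
# in place, so readers never hit a stale overwrite PUT.
new_key = f'warehouse/my_table/dateint=20160101/part-{uuid.uuid4().hex}.parquet'
s3.upload_file('part-00000.parquet', bucket, new_key)

# Only after the new file is in place, delete the old one.
old_key = 'warehouse/my_table/dateint=20160101/part-old.parquet'
s3.delete_object(Bucket=bucket, Key=old_key)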
23. Parquet
Majority of our data is in Parquet file format
Supported across Hive, Pig, Presto, Spark
Performance benefits on the read path (see the sketch after this list)
- Column projection
- Predicate pushdown
- Vectorized read
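As an illustration of column projection and predicate pushdown, a PySpark sketch; the S3 path and column names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parquet-read-demo').getOrCreate()

# Predicate pushdown: the filter is checked against Parquet row-group
# statistics, so non-matching row groups are skipped entirely.
# Column projection: only the selected columns are read from the files.
df = (spark.read.parquet('s3://my-bucket/warehouse/my_table/')
          .filter('dateint = 20160101')
          .select('user_id', 'title_id'))

df.show(10)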
24. Performance Impact
Read
- Penalty: Throughput and latency
- Impact depends on amount of data read
- Improvement: I/O manager in Parquet
Write
- Penalty: Writing to local disk before upload to S3
- Improvement: Direct write via multipart uploads
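A minimal sketch of the direct-write idea using S3 multipart uploads via boto3; the bucket, key, and chunk contents are hypothetical:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-warehouse-bucket', 'warehouse/my_table/part-00000.parquet'

# Stand-in for output chunks a writer produces incrementally
# (every part except the last must be at least 5 MB).
chunks = [b'x' * (5 * 1024 * 1024), b'y' * (5 * 1024 * 1024), b'tail']

# Stream parts as they are produced instead of staging the whole
# file on local disk and uploading it at the end.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=mpu['UploadId'], Body=chunk)
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})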
25. Performance Impact
List
- Penalty: Listing thousands of partitions for split calculation
- Each partition is an S3 prefix
- Improvement: Track files instead of prefixes
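A hedged illustration of the difference: listing every partition prefix versus reading a pre-built file manifest; the manifest layout shown is an assumption, not our actual format:

import json
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'   # hypothetical bucket

# Slow path: one or more LIST calls per partition prefix at split-calculation time.
def list_partition(prefix):
    paginator = s3.get_paginator('list_objects_v2')
    return [obj['Key']
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get('Contents', [])]

# Faster path: read one manifest object that already tracks the files per
# partition, avoiding thousands of LIST calls.
def files_from_manifest(manifest_key='warehouse/my_table/_manifest.json'):
    body = s3.get_object(Bucket=bucket, Key=manifest_key)['Body'].read()
    return json.loads(body)   # e.g. {"dateint=20160101": ["part-0000.parquet", ...]}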
26. Performance Impact – some good news
ETL jobs:
- Mostly CPU bound, not network bound
- Performance converges w/ volume and complexity
Interactive queries:
- % impact is higher … but they run fast
Benefits still outweigh the cost!
28. For Users
Should I run my job on my laptop?
Where can I find the right version of the tools?
Which cluster should I run my high-priority ETL job on?
Where can I see all my jobs that ran yesterday?
29. For Admins
How do I manage different versions of tools in different clusters?
How can I upgrade or swap clusters with no downtime for users?
30. Genie – Job and Cluster Mgmt Service
Users:
- Discovery: find the right cluster to run each job
- Gateway: a single entry point for running different kinds of jobs
- Orchestration: and the one place to find all jobs!
Admins:
- Config mgmt: multiple versions of multiple tools
- Deployment: cluster swap/updates with no downtime
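A hedged sketch of what submitting a job through Genie's REST gateway can look like; the endpoint and payload field names are assumptions based on the open-source Genie 3 API and may differ from the version described in this talk:

import requests

# Assumed Genie-3-style endpoint and payload; field names are assumptions.
job = {
    'name': 'daily_agg',
    'user': 'dataeng',
    'version': '1.0',
    'clusterCriterias': [{'tags': ['sched:sla', 'type:yarn']}],  # pick a cluster by tags
    'commandCriteria': ['type:sparksubmit'],                     # pick the tool by tags
    'commandArgs': '--class com.example.DailyAgg daily-agg.jar 20160101',
}
resp = requests.post('https://genie.example.com/api/v3/jobs', json=job)
resp.raise_for_status()
print('Submitted job:', resp.headers.get('Location'))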
38. Metacat
Common APIs for our applications and tools; Thrift APIs for interoperability.
Metadata discovery across data sources
Additional business context
- Lifecycle policy (TTL) per table
- Table owner, description, tags
- User-defined custom metrics
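A hedged sketch of looking up a table's metadata through Metacat's REST API; the URL shape and response fields shown are assumptions and may not match the deployed version:

import requests

# Assumed Metacat-style REST path; the exact URL shape is an assumption.
base = 'https://metacat.example.com/mds/v1'
resp = requests.get(f'{base}/catalog/prodhive/database/default/table/my_table')
resp.raise_for_status()

table = resp.json()
# Business context such as owner, tags, and lifecycle (TTL) would live
# alongside the schema in the returned table metadata.
print(table.get('definitionMetadata', {}))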
39. Data Lifecycle Management
Janitor tools
- Delete ‘dangling’ data after 60 days
- Delete data obsoleted by ‘data updates’ after 3 days
- Delete partitions based on table TTL
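A minimal sketch of the 'dangling data' janitor idea, assuming boto3; the bucket and prefix are hypothetical, and only the 60-day window comes from the bullet above:

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client('s3')
bucket = 'my-warehouse-bucket'
cutoff = datetime.now(timezone.utc) - timedelta(days=60)

# Walk a prefix of candidate data and collect objects older than the window.
paginator = s3.get_paginator('list_objects_v2')
expired = [obj['Key']
           for page in paginator.paginate(Bucket=bucket, Prefix='warehouse/tmp/')
           for obj in page.get('Contents', [])
           if obj['LastModified'] < cutoff]

# In practice the keys would be handed to the deletion service (next slide)
# rather than deleted directly.
for key in expired:
    s3.delete_object(Bucket=bucket, Key=key)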
41. Deletion Service
Centralized service to handle errors, retries, and backoffs of S3 deletes
Cool-down period: data is actually deleted only after a few days
Store history and statistics
Allow easy recovery based on time and tags
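A conceptual sketch of the cool-down and recovery flow; this is not the actual service, just an illustration of the idea using an in-memory request store:

from datetime import datetime, timedelta, timezone

COOL_DOWN = timedelta(days=3)   # wait before the S3 deletes actually run
pending = []                    # stand-in for a durable store of delete requests

def request_delete(keys, tag):
    # Record the request; nothing is deleted yet, so it can still be recovered.
    pending.append({'keys': keys, 'tag': tag,
                    'requested_at': datetime.now(timezone.utc)})

def run_pending(s3, bucket):
    now = datetime.now(timezone.utc)
    for req in pending:
        if now - req['requested_at'] >= COOL_DOWN:
            for key in req['keys']:
                s3.delete_object(Bucket=bucket, Key=key)   # retries/backoff omitted

def recover(tag):
    # Recovery here means cancelling a still-pending request by tag.
    pending[:] = [r for r in pending if r['tag'] != tag]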
64. Big Data API (aka Kragle)
import kragle as kg

# Move a table from the prod Hive warehouse to Redshift, addressing
# both ends through their Metacat URIs.
trans_info = (kg.transport.Transporter()
              .source('metacat://prodhive/default/my_table')
              .target('metacat://redshift/test/demo_table')
              .execute())
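Because both endpoints are addressed as metacat:// URIs, the same transport call presumably works for any source/target pair registered in Metacat.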