At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether through in-person meetings with the largest service teams or by manually flipping reservations. Over time, we realized that these manual processes do not scale as the business continues to grow. Therefore, in the past year, we have focused on building tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten-thousand-foot view to understand the unique business and cloud problems that drove us to create these products, then discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
13. Success Criteria [qual]
Feedback from Engineering teams:
• Regular use of our tools and insights
• Raised awareness of their impact on efficiency
• Proactive engagement on efficiency projects
16. THE EFFICIENCY HIERARCHY OF NEEDS
Transparency
What do you need to know before you can even ask about efficiency?
17. Transparency
• Cost and usage: AWS DBR or CUR file(s)
• System data: S3 Inventory, AWS CloudTrail, …
• Metadata: AWS tags, org structure, …
• Undocumented facts (a.k.a. tribal knowledge)
“If you cannot measure it, you cannot improve it.”
– Lord Kelvin
18. Transparency
“That which is measured improves. That which is measured and reported improves exponentially.”
– Karl Pearson (or Thomas Monson)
Transparency through dashboards, with a few important rules:
1. Tailor views to specific use cases
2. Add business context
3. When possible, co-locate with existing tools / workflows
20. Transparency: Cloud Cost Dashboard
1. Enrich cost and usage data with internal metadata (org, platforms…)
2. Add business context
3. Tailor views to users
• Data: Billing + Metadata
• Tech: Spark + Tableau
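The enrichment step (item 1) can be pictured as a left join between billing line items and an internal metadata table. The following is a minimal sketch in plain Python, not the Spark pipeline the slide refers to; the field names (`app_tag`, `cost_usd`), the `APP_METADATA` lookup, and the org and platform values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical billing line items (in practice parsed from the AWS DBR/CUR files).
billing_rows = [
    {"app_tag": "api", "product": "EC2", "cost_usd": 120.0},
    {"app_tag": "api", "product": "S3", "cost_usd": 30.0},
    {"app_tag": "playback", "product": "EC2", "cost_usd": 80.0},
    {"app_tag": "unknown-app", "product": "EC2", "cost_usd": 10.0},
]

# Hypothetical internal metadata: which org and platform own each app tag.
APP_METADATA = {
    "api": {"org": "Edge", "platform": "JVM"},
    "playback": {"org": "Playback", "platform": "JVM"},
}

# Fallback so that untagged spend stays visible instead of being dropped.
UNATTRIBUTED = {"org": "unattributed", "platform": "unknown"}


def enrich(rows, metadata):
    """Left-join billing rows with metadata; unmatched rows get the fallback."""
    return [{**row, **metadata.get(row["app_tag"], UNATTRIBUTED)} for row in rows]


def cost_by(rows, key):
    """Sum cost per distinct value of `key` (for example, per org)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return dict(totals)


enriched = enrich(billing_rows, APP_METADATA)
org_costs = cost_by(enriched, "org")
print(org_costs)
```

Keeping unmatched rows under an explicit "unattributed" bucket is one possible design choice: it makes gaps in tagging show up on the dashboard rather than silently shrinking the total.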
22. Transparency: Libra
1. Visualize reserved and used instances across zones and instance types
2. Rebalance as necessary
3. Built-in retry logic
• Data: AWS Reservations API
• Tech: JavaScript
23. Deep Dives
What story do you need to tell to make your efficiency goal a reality?
24. Deep Dives
Sometimes it takes more than a few dashboards to know how and where to invest your cloud efficiency efforts.
• Tell a story (e.g., write a memo) showing the potential impact of your efficiency project to generate buy-in from your organization.
• Connect the components of complex architectures to show the bigger picture.
25. Deep Dives
Darwin QL: new UI engine for TVs about to roll out. How will it impact our cloud efficiency?
[Chart: relative change in demand per service (demand = # requests × duration)]
Services shown: sequitur_service-prod, session_logs, evcache_yellow2, evcache_map_lt, evcache_ciners, evcache_chunk_vhs, evcache_ab, nccp-bladerunner, licenseaccounting-bladerunner, evcache_sub, evcache_pnp, playready, evcache_pbc_si, playback_history, api-prod, evcache_playlist, evcache_cineps
Data: Billing + Request Tracing
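The demand metric on this slide (number of requests times duration) and its relative change can be sketched as follows. The traffic numbers are invented for illustration and the function names are assumptions; only the definition of demand comes from the slide.

```python
def demand(requests, avg_duration_s):
    """Demand as defined on the slide: number of requests times duration."""
    return requests * avg_duration_s


def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    if before == 0:
        raise ValueError("baseline demand must be non-zero")
    return (after - before) / before


# Hypothetical per-service traffic: (requests, average duration in seconds).
baseline = {"api-prod": (1000, 0.20), "playback_history": (500, 0.10)}
candidate = {"api-prod": (1100, 0.25), "playback_history": (500, 0.08)}

changes = {
    service: relative_change(demand(*baseline[service]), demand(*candidate[service]))
    for service in baseline
}

# Largest increase first, mirroring a sorted bar chart.
for service, change in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {change:+.1%}")
```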
26. Deep Dives
[Diagram: Zuul, µ-Service A, µ-Service Z, Keystone (Kafka), Kafka, Cass, S3, ES, EMR, BDAS, …, Mantis, Spark, …]
Data is a big piece of our cloud costs, but tracking and attributing that cost to a team is complex. How do you holistically optimize a distributed system?
Cradle to Grave (C2G): track the end-to-end cost of ingesting, storing, and processing data at Netflix to identify efficiency opportunities.
27. Deep Dives
• Each system (red boxes in the architecture diagram) tracks its own resource apportioning at the topic / table / job level.
• We add some logic and tribal knowledge to link topics / tables / jobs from one system to the next.
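One way to picture the linking described above is as a graph walk: each system reports the cost of its own entities, a link table connects an entity to its downstream entities, and the end-to-end cost of a data stream is the sum over everything reachable from its origin. This is a hedged sketch; the entity names, costs, and the `links` table are hypothetical, and the slides do not describe how C2G is actually implemented.

```python
# Hypothetical per-system cost reports, keyed by (system, entity).
costs = {
    ("keystone", "topic:playback_events"): 40.0,
    ("s3", "table:playback_events"): 25.0,
    ("spark", "job:daily_playback_agg"): 15.0,
    ("es", "index:playback_events"): 10.0,
}

# Hypothetical links (the "logic and tribal knowledge"): entity -> downstream entities.
links = {
    ("keystone", "topic:playback_events"): [
        ("s3", "table:playback_events"),
        ("es", "index:playback_events"),
    ],
    ("s3", "table:playback_events"): [("spark", "job:daily_playback_agg")],
}


def end_to_end_cost(root, costs, links):
    """Sum the cost of `root` and of every entity reachable downstream of it."""
    seen, stack, total = set(), [root], 0.0
    while stack:
        node = stack.pop()
        if node in seen:  # guard against cycles and shared downstream entities
            continue
        seen.add(node)
        total += costs.get(node, 0.0)
        stack.extend(links.get(node, []))
    return total


total = end_to_end_cost(("keystone", "topic:playback_events"), costs, links)
print(total)
```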
28. Actionable Insights
What do you need to know, and when do you want it?
29. Actionable Insights
Strive to minimize the cognitive load for your target audience.
• Targeted messages: send alerts only when something actionable occurs (regressions, anomalous metrics), and provide quick links to supporting evidence.
• Insights Digest: provide summaries that quickly get to the important message (typically per use case).
30. Actionable Insights: Efficiency Score Cards (email)
• 2 core efficiency metrics (system and business context).
• Monitor changes in magnitude and trend over weeks (non-operational).
• Link each card to a detailed dashboard.
Data: System Monitoring (Atlas)
32. Actionable Insights: EC2 Alerts (Picsou)
• Compute reservation shortages across all dimensions (accounts × zones × instance families)
• List in descending order of cost
• Attribute to top growing apps
• Also sent as a digest email linked back to Picsou
Data: Billing + Tribal Knowledge + Metadata
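The first two bullets (compute shortages per dimension, then list them by cost) can be sketched as below. This is an illustration under stated assumptions, not Picsou's logic: the usage and reservation counts, the account and zone names, and the `on_demand_premium_usd` rates are all hypothetical.

```python
# Hypothetical snapshot: (account, zone, instance family) -> instance counts.
used = {
    ("prod", "us-east-1a", "m4"): 120,
    ("prod", "us-east-1b", "m4"): 80,
    ("test", "us-east-1a", "r4"): 30,
}
reserved = {
    ("prod", "us-east-1a", "m4"): 100,
    ("prod", "us-east-1b", "m4"): 90,
    ("test", "us-east-1a", "r4"): 10,
}
# Hypothetical extra hourly cost of running on demand instead of reserved, per family.
on_demand_premium_usd = {"m4": 0.05, "r4": 0.10}


def shortages(used, reserved, premium):
    """Return (dimension, missing instances, hourly cost), most expensive first."""
    result = []
    for dim, count in used.items():
        missing = count - reserved.get(dim, 0)
        if missing > 0:  # only dimensions where usage exceeds reservations
            family = dim[2]
            result.append((dim, missing, missing * premium[family]))
    return sorted(result, key=lambda item: item[2], reverse=True)


alerts = shortages(used, reserved, on_demand_premium_usd)
for dim, missing, cost in alerts:
    print(dim, missing, f"${cost:.2f}/h")
```

Sorting by cost rather than by instance count puts the shortage that is most worth fixing at the top of the digest.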
33. Automation
Let the machines take over, what could go wrong?
34. Automation
Let the machines take over, what could go wrong?
Safely automate repetitive or complex tasks.
• Start simple: rules engine, optimization, …
• Graduate your Actionable Insights
• Show your work
35. Automation: S3 Storage Class Optimization
• Very similar to the AWS S3 Analytics product.
• In fact, we use AWS S3 access analysis data, but make our own recommendations.
36. Automation: S3 Storage Class Optimization
• Every recommendation can be explained from the very same dashboard.
Data: AWS S3 Analytics + Tribal Knowledge
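A storage class recommendation that can be explained typically carries the numbers behind the decision. The sketch below shows one simple rule, comparing the estimated monthly cost of two storage classes given a bucket's size and retrieval volume. The prices are hypothetical placeholders rather than AWS pricing, the rule ignores details such as minimum storage duration and request fees, and nothing here is taken from Netflix's actual recommendation logic.

```python
# Hypothetical placeholder prices in USD per GB (not actual AWS pricing).
STANDARD_STORAGE_PER_GB = 0.023
IA_STORAGE_PER_GB = 0.0125
IA_RETRIEVAL_PER_GB = 0.01


def recommend(size_gb, retrieved_gb_per_month):
    """Pick the cheaper class and keep the numbers that explain the choice."""
    standard = size_gb * STANDARD_STORAGE_PER_GB
    ia = size_gb * IA_STORAGE_PER_GB + retrieved_gb_per_month * IA_RETRIEVAL_PER_GB
    choice = "STANDARD_IA" if ia < standard else "STANDARD"
    return {
        "recommendation": choice,
        "standard_cost_usd": round(standard, 2),
        "ia_cost_usd": round(ia, 2),
    }


# A rarely read bucket versus a heavily read one of the same size.
cold = recommend(size_gb=1000, retrieved_gb_per_month=50)
hot = recommend(size_gb=1000, retrieved_gb_per_month=5000)
print(cold)
print(hot)
```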
38. RI management
Picsou (today)
- Explore cost and usage
- Notify of RI shortages
Picsou RI Recommendation (Q1’2018)
- Ingest output from shortage analysis (EC2 Alerts)
- Use Linear Programming to compute optimal RI modification/purchase
• Email recommendations to our finance partners for sanity checking and execution
• Collect feedback to improve optimization
• Define a “recommendation score” for monitoring
• Once we gain enough confidence in the recommendations, execute them automatically
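To make the optimization step concrete, here is a toy version of the kind of objective an RI purchase recommendation might minimize: pay for reserved capacity every hour, and pay the on-demand rate for any usage above it. For a single instance type this can be solved by simple enumeration, which is what the sketch does; it is a stand-in for, not an example of, the Linear Programming formulation mentioned on the slide. The usage series and hourly rates are hypothetical.

```python
def total_cost(reserved, hourly_usage, ri_rate, on_demand_rate):
    """Reserved instances are paid every hour; overflow runs on demand."""
    ri_cost = reserved * ri_rate * len(hourly_usage)
    od_cost = sum(max(0, used - reserved) for used in hourly_usage) * on_demand_rate
    return ri_cost + od_cost


def best_reservation(hourly_usage, ri_rate, on_demand_rate):
    """Enumerate every candidate reservation count and keep the cheapest."""
    candidates = range(max(hourly_usage) + 1)
    return min(
        candidates,
        key=lambda r: total_cost(r, hourly_usage, ri_rate, on_demand_rate),
    )


# Hypothetical usage (instances per hour) and hourly rates in USD.
usage = [10, 10, 12, 20, 20, 12, 10, 10]
best = best_reservation(usage, ri_rate=0.06, on_demand_rate=0.10)
print(best, total_cost(best, usage, 0.06, 0.10))
```

With many accounts, zones, and instance families, and with modifications as well as purchases, enumeration stops being practical, which is presumably where a proper solver comes in.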
39. Self-Service C2G
Give data producers, consumers, and caretakers the ability to manage their own efficiency:
• Identify all involved parties along a data topic
• Apportion data-infrastructure cost to all relevant teams
• Quickly notice low-usage data topics
• Estimate data-replication or large sinks-to-users ratios
Long term: enable data-platform owners to use this tool or its underlying data to add some automation.
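The apportioning bullet above can be illustrated with a proportional split: a topic's infrastructure cost is divided among teams according to their share of usage. This is a minimal sketch under assumptions; the team names, the usage measure, and the proportional rule itself are hypothetical, since the slides do not say how costs are apportioned.

```python
def apportion(total_cost, usage_by_team):
    """Split `total_cost` across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        raise ValueError("no usage recorded; cannot apportion")
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical topic cost (USD) and usage units per consuming team.
shares = apportion(900.0, {"team_a": 600, "team_b": 300, "team_c": 100})
print(shares)
```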
40. Device-Cloud Efficiency
Expose the impact of Device/UI features on efficiency:
• Provide the relative cost change for A/B test cells
• Attribute micro-services growth and cost to each Device/UI family
41. KEY TAKEAWAYS
1. Netflix culture, scale, architecture, and priorities require efficiency to be championed by a central team, but enforced by all engineers.
2. This is achieved by implementing the successive layers of our efficiency hierarchy of needs:
2a. Transparency to get context,
2b. Deep Dives to tell compelling stories and assemble puzzles,
2c. Actionable Insights to reduce the cognitive load on your organization,
2d. Automation to scale the impact of efficiency efforts.
43. Netflix Talks @ re:Invent
Monday
10:45am ARC208: Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian)
12:15pm SID206: Best Practices for Managing Security on AWS (MGM)
Tuesday
10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian)
11:30am CMP204: How Netflix Tunes EC2 Instances for Performance (Venetian)
Wednesday
11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian)
12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian)
1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian)
1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM)
1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian)
4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM)
Thursday
12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo)
12:15pm DAT308: Codex: Conditional Modules Strike Back (Venetian)
12:55pm CMP309: How Netflix Encodes at Scale (Venetian)
5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria)
Friday
8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria)
10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
44. Netflix Talks @ re:Invent
Architecture
Mon 10:45am ARC208: Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian)
Tue 10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian)
Wed 1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian)
Compute
Tue 11:30am CMP204: How Netflix Tunes EC2 Instances for Performance (Venetian)
Thu 12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo)
Thu 12:55pm CMP309: How Netflix Encodes at Scale (Venetian)
Security, Compliance, and Identity
Mon 12:15pm SID206: Best Practices for Managing Security on AWS (MGM)
Wed 1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM)
Wed 4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM)
Machine Learning
Wed 11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian)
Networking
Wed 12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian)
Developer Community
Wed 1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian)
Databases
Thu 12:15pm DAT308: Codex: Conditional Modules Strike Back (Venetian)
Analytics & Big Data
Thu 5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria)
Fri 8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria)
Fri 10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
47. Credits
All the tools featured in this presentation were designed and built by
members of the Cloud Capacity Planning teams over the past 2 years,
specifically, Torio Risianto, Rajan Mittal, and Qian Li.