Show Me The Money!
Cost & Resource Tracking for Hadoop & Storm
Hadoop Summit
June 30, 2016
Kendall Thrapp

Hadoop @ Yahoo Scale
• 3000+ grid users
• ~600 distinct projects
• Running 1.2M+ apps/day
… all focused on meeting their own SLAs, but not necessarily on how their grid usage impacts other tenants.
Tracking resource usage and cost is critical to manage capacity and ensure fairness.
Image by b k @ https://flic.kr/p/4EjNgb (CC BY-SA 2.0)
Why Care About Resource Utilization?
• Capacity Planning: see trends over time and predict future shortfalls
• Operational Efficiency: provide justification for engineering more efficient code
• Profitability & ROI: include Hadoop platform usage cost in overall project cost
• Grid Efficiency: move projects between clusters to maximize efficiency
• Transparency: see resource usage and cost of all grid tenants
Three Year Mission…
But tracking resource usage in Hadoop was hard… really hard. So three years ago, we set out on a mission to show:
• Resource usage for any YARN app
• Resource usage over time for clusters, queues, users, and projects
• Cost for any resource usage
Image derived from https://flic.kr/p/dN895J by JD Hancock (CC BY 2.0)
The Language of Grid Resource Usage
Resource Usage = amount allocated × time allocated
One 2GB mapper running for 5 hours = 10 GB-Hours
Five 2GB mappers running for 1 hour = 10 GB-Hours
Example units per resource:
• RAM: GB-Hour or MB-Second
• CPU: vCore-Hour or vCore-Second
Image by Casey Fleser @ https://flic.kr/p/6ACfUz (CC BY 2.0)
Introducing YARN-415
Capture aggregate resource allocation at the app level in MB-secs & vCore-secs:
• 28 months from JIRA to full deployment
• First time getting resource usage for non-MR applications, like Spark, TEZ, or Slider
• Available through the Hadoop UI, even while the app is still running
• Stored long term by the Grid UI team and made available through a REST API (see the sketch below)
• Can benchmark apps to see how code & config changes affect resource usage
• Can convert this to a $ cost using the TCO method described later
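For reference, the per-app aggregates that YARN-415 added are exposed as `memorySeconds` and `vcoreSeconds` in the ResourceManager's REST API; a minimal sketch of reading them (the RM host below is a placeholder):

```python
# Sketch: fetch the YARN-415 aggregates for one application from the
# ResourceManager REST API. The host below is a placeholder.
import requests

RM = "http://resourcemanager.example.com:8088"

def app_resource_usage(app_id):
    """Return (GB-Hours, vCore-Hours) for a running or finished app."""
    app = requests.get(f"{RM}/ws/v1/cluster/apps/{app_id}").json()["app"]
    mb_secs = app["memorySeconds"]    # aggregate MB-seconds allocated
    vcore_secs = app["vcoreSeconds"]  # aggregate vCore-seconds allocated
    return mb_secs / 1024.0 / 3600.0, vcore_secs / 3600.0

print(app_resource_usage("application_1467072000000_0001"))
```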
Resource Utilization Over Time
YARN-415 only gives us half the story…
• Sample cluster-, queue-, and user-level compute resource utilization every minute across all clusters
• Make available via the Grid Utilization Dashboard and REST API
• Further aggregate by project and time at hourly, daily, and monthly intervals (see the sketch below)
• Projects can see a rolling one-year history of their compute and storage usage on Doppler
Image from Grid Utilization Dashboard
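A sketch of what that roll-up step could look like, assuming the minute-level samples land in a CSV with hypothetical `ts`, `project`, `ram_gb`, and `vcores` columns (the deck does not describe the actual pipeline):

```python
# Sketch: roll minute-level utilization samples up to hourly, daily, and
# monthly per-project averages. File and column names are hypothetical.
import pandas as pd

samples = pd.read_csv("utilization_samples.csv", parse_dates=["ts"]).set_index("ts")

rollups = {
    name: samples.groupby("project").resample(freq)[["ram_gb", "vcores"]].mean()
    for name, freq in [("hourly", "h"), ("daily", "D"), ("monthly", "MS")]
}
print(rollups["daily"].head())
```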
Viewing Project Compute Utilization
In the Doppler web application:
• Monthly average RAM & CPU usage for the current month and past three months, as well as quotas
• Zoom by time window or date range
• Rolling one-year historical charts for RAM & CPU:
  ● Central solid line is daily average
  ● Inner (darker) band is average ± 1 SD
  ● Outer (lighter) band is daily min/max
  ● Dashed line is approved quota
• Hover over chart to see exact values for dates
• When zoomed in, use scrollbar to see other dates
• Flags to indicate major events, like upgrade to Hadoop 2.6
• Click name in legend to show or hide series; chart axes will dynamically resize to maximize detail
• Webpage has additional panels like this for each queue ever used by the project
Viewing Project Storage Utilization
In the Doppler web application:
• Rolling one-year historical charts for disk and namespace usage:
  ● Blue area is daily average
  ● Dashed orange line is actual quota
• Show current utilization and quota both before and after replication
• Webpage has additional panels like this for each project directory used by the project
• Gauges showing latest observed disk and namespace usage; they gradually turn from green to red as utilization approaches 100%
• Hover over chart to see exact values for dates
Show Me the Money!
• A Total Cost of Ownership (TCO) initiative in 2015 began computing a $ cost for all compute and storage utilization by projects on Hadoop.
• In June 2015, we added a TCO panel to all Hadoop project and project environment pages in the Doppler web application showing historical monthly TCO cost.
How is Project TCO Calculated?
Total Hadoop TCO is divided across four resource types: RAM, CPU, Disk, and Namespace (25% each initially).
1. Compute total Hadoop TCO
   a. Comprised of many different sources of cost, not just hardware (see next slide)
2. Divide total TCO amongst resource types
   a. Even distribution chosen initially
   b. Distribution can be adjusted (monthly) to allow scarce resources to be priced more expensively
3. Compute project resource TCO as a fraction of total resource TCO (sketched in code below):
   (Project Resource Usage / Total Resource Usage) × Total Resource TCO = Project Resource TCO
4. Total project TCO is the sum of all individual project resource TCOs.
This distributes overhead/unused capacity costs across projects proportional to their grid usage.
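A minimal sketch of steps 2 through 4, with made-up usage numbers purely for illustration:

```python
# Sketch of the project TCO calculation. All figures are made up.
total_tco = 8_100_000  # illustrative monthly total (see next slide)

# Step 2: split total TCO evenly across the four resource types.
resource_tco = {r: 0.25 * total_tco for r in ("ram", "cpu", "disk", "namespace")}

# Steps 3-4: charge each project its usage share of each resource's TCO,
# then sum across resources.
def project_tco(project_usage, total_usage):
    return sum(
        (project_usage[r] / total_usage[r]) * resource_tco[r]
        for r in resource_tco
    )

total = {"ram": 100.0, "cpu": 100.0, "disk": 100.0, "namespace": 100.0}
usage = {"ram": 2.0, "cpu": 1.0, "disk": 5.0, "namespace": 0.5}  # % shares
print(f"${project_tco(usage, total):,.0f}")  # $172,125
```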
Total Hadoop TCO Makeup
Illustrative monthly TCO: $8.1M, distributed 60%, 12%, 10%, 7%, 6%, 3%, and 2% across the seven components below.
TCO Components:
1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D HC: headcount for platform software development, quality, and release engineering
3. Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
5. Acquisition/Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers
TCO Dashboard
In the Doppler web application. The TCO Dashboard (yo/grid-tco) allows users to view and sum TCO information along a variety of dimensions:
• Filter TCO data on: date range, project name, business unit, cluster name, cluster type
• Search on anything in the table
• Sort on any column or multiple columns
• Export to CSV for offline analysis
• One row in the table per project environment and month
• Resource and cost totals for all filtered results are shown
Note: Cost data is for illustrative purposes only (not real unit costs)
Results!
• Unmasked hidden issues, like:
  – Projects using far more compute resources than they were ever approved for
  – Projects requesting more resources when they were underutilizing what they already had
  – Projects launching apps in queues they weren’t supposed to be using
  – Zombie projects that were cancelled/retired but continuing to consume grid resources
• Helped teams verify a significant reduction in their compute usage after some major efficiency improvements
Beyond Hadoop: Storm Project Compute Utilization
In the Doppler web application:
• Sample assigned RAM & CPU per topology every minute across all clusters using Nimbus’ topology summary REST API (see the sketch below)
• Aggregate by user and by project
• Make available via Doppler UI and REST API
• Coming soon: compare assigned memory/CPU vs. actual usage
• Convert to monthly $ cost via TCO model
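As an illustration of that sampling loop, a rough sketch against the Storm UI's topology summary endpoint; the host is a placeholder, and the `assignedTotalMem`/`assignedCpu` field names are an assumption that varies by Storm version:

```python
# Sketch: sample assigned memory & CPU per topology once per minute.
# Host is a placeholder; the assigned* field names vary by Storm version.
import time
import requests

STORM_UI = "http://storm-ui.example.com:8080"

while True:
    summary = requests.get(f"{STORM_UI}/api/v1/topology/summary").json()
    for topo in summary["topologies"]:
        print(
            topo["name"],
            topo.get("assignedTotalMem"),  # MB assigned across workers
            topo.get("assignedCpu"),       # CPU shares assigned
        )
    time.sleep(60)
```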
Recap
Resource and cost tracking for Hadoop & Storm:
● Get compute resource usage for all Hadoop apps through YARN-415
● Store historical Hadoop resource utilization at the cluster, queue, user, and project levels
● Store historical Storm resource utilization at the topology, user, and project levels
● Developed a cost model and applied it to compute monthly cost for all Hadoop and Storm projects
● Make utilization and cost data and charts available via web apps and REST APIs
The mission continues…
• Visibility and cost for NameNode operations
• Visibility and cost for network utilization in Storm
• Identify waste when there are large gaps between allocated and peak used container memory (Downsizer)
• Move to an OPEX model where teams just pay for what they use
Image by Reinhard Kuchenbäcker @ https://flic.kr/p/naFkFH (CC BY 2.0)
Q&A
Authors:
• Kendall Thrapp
• Shawna Martell
• Alessandro Bellina
• Eric Payne
• Sumeet Singh
