Costing Your Big Data Operations
Presented by Sumeet Singh, Amrit Lal | June 5, 2014
2014 Hadoop Summit, San Jose, California
Introduction
Sumeet Singh
 Manages the Hadoop products team at Yahoo!
 Responsible for Product Management, Strategy, and Customer Engagements
 Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
 MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
Amrit Lal
 Product Manager at Yahoo engaged in building high-quality, robust Hadoop infrastructure services
 Eight years of experience across HSBC, Oracle, and Google developing products and platforms for high-growth enterprises
 MBA from Carnegie Mellon University
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
Amrit Lal
Product Manager
Hadoop and Big Data Platforms
Cloud Engineering Group
701 First Avenue,
Sunnyvale, CA 94089 USA
@amritasshwar
Agenda
1. Total Cost of Ownership (TCO) Models
2. Deeper Understanding of (Resource) Usage
3. P&L, Metering and Billing Provisions
4. Benchmark Costs
5. Improve Utilization and ROI
Why do Costing?
Profitability: understanding the data services costs (an element of your total project cost) to determine how profitable the project is
ROI: investment decisions both at the platform and at the app / project level
Operational Efficiency: benchmark and improve ops by focusing on average utilization, increasing the number of hosted apps, storage efficiencies, job performance, etc.
Planning: capital planning and budgeting, product improvements
Cost Transparency: metering / usage metrics, billing, chargeback / showback, P&L
Costing is Relevant Irrespective of the Service Model
Private Cloud
 Fixed costs that favor scale and 24x7 operations
 Centralized operations
 Multi-tenant clusters with security and data sharing
 Cost a function of desired SLA
 Utilization and the number of hosted apps a primary lever
 Tenants often tend to ignore costs

Public Cloud
 Variable with usage; favors a run-and-done model
 Decentralized operations; ops / headcount costs still relevant
 Dedicated virtual clusters
 Monthly bills!
 Releasing cluster instances when not needed is a wise idea
 Users often overlook the peripheral costs
Important with Multi-tenancy and Scale

[Chart: number of servers (up to ~45,000) and raw HDFS storage (up to ~500 PB) by year, 2006-2014]

Milestones:
 Research workloads in Search and Advertising
 Yahoo! commits to scaling Hadoop for production use
 Production (modeling) with machine learning & WebMap
 Open sourced with Apache
 Revenue systems with security, multi-tenancy, and SLAs
 Hortonworks spinoff for enterprise hardening
 Next-gen Hadoop (H 0.23 YARN)
 New services (HBase, Storm, Hive, etc.)
 Increased user base with partitioned namespaces
 Apache H 2.x (low latency, utilization, HA, etc.)
Hosted Apps Growth on Apache Hadoop

[Chart: new customer apps on-boarded per quarter, Q1-11 through Q1-14; cumulative hosted apps grew from 272 to 525]
58 projects in 2011, 52 projects in 2012, 113 projects in 2013
Multi-tenant Apache HBase Growth
[Chart: region servers (reaching 1,140) and data stored (reaching 33.6 PB), Q1-13 through Q1-14]
Zero to “20” Use Cases (60,000 Regions) in a Year
Multi-tenant Apache Storm Growth

[Chart: supervisors (reaching 760) and production topologies (reaching 175), Q1-13 through Q1-14; growth accelerated after the multi-tenancy release]
Zero to “175” Production Topologies in a Year
Capital Deployment for Big Data Infrastructure
[Diagram: cluster components, including the NameNode and ResourceManager with DataNode/NodeManager pools, HBase Master with RegionServers, Nimbus with Supervisors, ZooKeeper, HTTP/HDFS/GDM load proxies, Oozie server, HS2/HCat, data feeds and data stores for applications, administration/management/monitoring, and the network backplane]
Big Data Platforms Technology Stack at Yahoo
Compute: YARN, MapReduce, Tez, Storm, Spark
Services: Pig, Hive, HCatalog, Oozie, HDFS Proxy, GDM
Storage: HDFS, HBase, ZooKeeper
Infrastructure: Messaging Service, Monitoring, Starling, Support Shop
Resources Consumed in Big Data Operations
[Diagram: clusters in datacenters (colos and racks) and the server resources within them]
Elements of a TCO Model
[Pie chart: illustrative monthly TCO of $2.1 M split across the seven components below (60%, 12%, 10%, 7%, 6%, 3%, 2%)]

TCO Components:
1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D HC: headcount for platform software development, quality, and release engineering
3. Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips, etc.
5. Acquisition/Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving, etc.
6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers

ILLUSTRATIVE
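As a sketch, the roll-up of components into the monthly TCO can be expressed in a few lines of Python. The dollar amounts below are hypothetical placeholders chosen to sum to the slide's illustrative $2.1 M total, and the assignment of percentages to specific components is an assumption, not the deck's actual split.

```python
# Illustrative monthly TCO roll-up; component amounts are hypothetical
# placeholders, and the mapping of shares to components is assumed.
tco_components = {
    "cluster_hardware": 1_260_000,     # 1: data/name nodes, gateways, proxies
    "rnd_headcount": 252_000,          # 2: platform development, QA, release
    "active_use_ops": 210_000,         # 3: power, space, facility maintenance
    "network_hardware": 147_000,       # 4: switches, wiring, power strips
    "acquisition_install": 126_000,    # 5: one-time labor, shipping, upgrades
    "operations_engineering": 63_000,  # 6: service engineering, data ops
    "network_bandwidth": 42_000,       # 7: cross-colo data transfers
}

monthly_tco = sum(tco_components.values())
shares = {name: cost / monthly_tco for name, cost in tco_components.items()}

print(f"Monthly TCO: ${monthly_tco / 1e6:.1f} M")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"  {name:24s} {share:5.1%}")
```

The same structure works for any component breakdown; only the dictionary changes when a new cost line (e.g., a new hardware class) is added.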
Understanding Apache Hadoop Resources
[Diagram: the NameNode and ResourceManager coordinate DataNodes and NodeManagers; storage is consumed as DFS blocks on DataNodes, while compute (MapReduce and memory) is consumed as MR tasks running in containers on NodeManagers]
Unit Costs for Hadoop Operations
Compute: containers where apps can perform computation and access HDFS if needed
  Unit: $ / GB-Hour (H 0.23/2.0), i.e., GBs of memory available for an hour
  Unit Cost: Monthly Compute Cost / Avail. Compute Capacity

Storage: HDFS (usable) space needed by an app with the default replication factor of three
  Unit: $ / GB stored (usable storage space, less replication and overheads)
  Unit Cost: Monthly Storage Cost / Avail. Usable Storage

Network: bandwidth needed to move data into/out of the clusters by the app
  Unit: $ / GB for inter-region data transfers
  Total Capacity: inter-region (peak) link capacity
  Cost: [Monthly GB In + Out] x $ / GB

Namespace: files and directories used by the apps, tracked to understand/limit the load on the NameNode (no unit cost: N/A)
Working Through A Hadoop Example
Compute:
  Monthly Cost: Monthly TCO (less bw.) = $2 M; Compute @ 50% = $1 M
  Monthly Capacity: 315 TB memory = 315 TB x 24 x 30 = 227 M GB-Hours
  Unit Cost: $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month

Storage:
  Monthly Cost: Monthly TCO (less bw.) = $2 M; Storage @ 50% = $1 M
  Monthly Capacity: Raw HDFS = 200 PB; Usable HDFS = [200 x 0.8 (20% overhead)] / 3 = 53.3 PB
  Unit Cost: $1 M / 53.3 PB = $0.019 / GB / Month

Network:
  Monthly Cost: Monthly Charges = $0.1 M
  Monthly Capacity: Total Data In + Out = 5 PB
  Unit Cost: $0.1 M / 5 PB = $0.02 / GB transferred

ILLUSTRATIVE
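The worked example reduces to simple arithmetic. The following Python sketch reproduces it with the deck's illustrative figures; the function names and structure are mine, not part of the deck.

```python
# Unit-cost derivations from the illustrative Hadoop example.
HOURS_PER_MONTH = 24 * 30

def compute_unit_cost(monthly_cost_usd, memory_tb):
    """$/GB-Hour: monthly compute cost over available memory GB-Hours."""
    gb_hours = memory_tb * 1000 * HOURS_PER_MONTH
    return monthly_cost_usd / gb_hours

def storage_unit_cost(monthly_cost_usd, raw_pb, overhead=0.2, replication=3):
    """$/GB: monthly storage cost over usable GBs (less overhead and replication)."""
    usable_gb = raw_pb * 1e6 * (1 - overhead) / replication
    return monthly_cost_usd / usable_gb

def bandwidth_unit_cost(monthly_charges_usd, transferred_pb):
    """$/GB transferred in + out."""
    return monthly_charges_usd / (transferred_pb * 1e6)

print(f"${compute_unit_cost(1_000_000, 315):.4f} / GB-Hour")   # deck rounds to $0.004
print(f"${storage_unit_cost(1_000_000, 200):.3f} / GB stored")
print(f"${bandwidth_unit_cost(100_000, 5):.3f} / GB transferred")
```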
Measuring Hadoop Resource Consumption
Compute:
  Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + …
  Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + …
  Cost = (M + R) GB-Hours x $0.004 / GB-Hour / Month = $ for the Job / Month
  Monthly Roll-ups: (M + R) GB-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform

Storage:
  /project (app) directory quota in GB (peak monthly storage used)
  /user directory quota in GB (peak monthly storage used)
  /data is accounted for per user, with each user accountable for their portion of use, e.g., GB Read (U1) / [GB Read (U1) + GB Read (U2) + …]
  Roll-ups through the relationship among user, file ownership, app, and their BU

Bandwidth:
  Measured at the cluster level and divided among select apps and users of data based on average volume In/Out
  Roll-ups through the relationship among user, app, and their BU
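The job-level metering formula can be sketched as follows; the task sizes, runtimes, and helper name are hypothetical, and the $0.004 rate is the deck's illustrative figure.

```python
# Sum container-GB x runtime-hours over map and reduce tasks, then price
# at the illustrative $0.004 / GB-Hour rate.
RATE_PER_GB_HOUR = 0.004

def job_gb_hours(tasks):
    """tasks: list of (container_gb, runtime_hours) pairs."""
    return sum(gb * hours for gb, hours in tasks)

map_tasks = [(2.0, 0.5), (2.0, 0.75), (4.0, 0.25)]  # GB(M1) x T(M1) + ...
reduce_tasks = [(3.0, 1.0), (3.0, 0.5)]             # GB(R1) x T(R1) + ...

total = job_gb_hours(map_tasks) + job_gb_hours(reduce_tasks)
monthly_cost = total * RATE_PER_GB_HOUR
print(f"{total} GB-Hours -> ${monthly_cost:.3f} for the job this month")
```

Summing `total` over all jobs for a user, app, or BU gives the monthly roll-up described above.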
Measuring Hadoop Resource Consumption
[Chart: per-queue compute utilization across 11 cluster queues]
Measuring Hadoop Resource Consumption
[Screenshot: SLA Dashboard on the Hadoop Analytics Warehouse]
Putting it Together for Hadoop Services
Hadoop Services Billing Rate Card [Monthly Rates]:
  HDFS (Storage): unit GB; monthly peak storage used; $0.019 / GB
  Compute: unit Map-Reduce GB-Hours; number of GBs used by mappers and reducers and the hours they ran for; $0.004 / GB-Hour
  Network Bandwidth: unit GB; monthly total in/out; $0.02 / GB

Monthly Bill for May 2014 (per BU: HDFS used / effective used / cost; compute used / cost; network transferred / cost; total):
  BU1: 15 PB / 3.45 PB / $0.065 M; 12.5 M GB-hours / $0.05 M; 1.25 PB / $0.025 M; total $0.15 M
  BU2: 10 PB / 2.65 PB / $0.05 M; 6.25 M GB-hours / $0.025 M; 0.5 PB / $0.01 M; total $0.085 M
  …
  Total: 148 PB / 39.5 PB / $0.75 M; 125 M GB-hours / $0.5 M; 5 PB / $0.1 M; total $1.35 M

ILLUSTRATIVE
Multi-Tenant Deployment For Apache HBase
[Diagram: shared Region Servers 1..N, each hosting one region from each of projects X, Y, and Z (X:Table:Region i, Y:Table:Region i, Z:Table:Region i), coordinated by the HMaster and ZooKeeper; Region Server JVMs perform the HDFS reads/writes]
Understanding Apache HBase Resources
[Diagram: region-level reads/writes hit table regions (X:Table:Region 1 … Z:Table:Region N) in the RegionServer JVM (heap), which are backed by HFiles in HDFS storage (disk)]
Reads: Total Reads @ RS = Reads(Table X: Reg 1) + Reads(Table X: Reg 2) + … + Reads(Table Z: Reg N); Read Share (X) = Total Table X reads / Total reads across Tables (X, Y, Z)
Writes: Total Writes @ RS = Writes(Table X: Reg 1) + Writes(Table X: Reg 2) + … + Writes(Table Z: Reg N); Write Share (X) = Total Table X writes / Total writes across Tables (X, Y, Z)
Data Stored: Total Table Data @ RS = Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N
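The share computation at a Region Server can be sketched as follows; the per-region operation counts and the helper name are hypothetical. The same function works unchanged for writes or bytes stored.

```python
# Fraction of a Region Server's ops attributable to one table (project),
# per the Read Share / Write Share formulas.
def share(per_table_ops, table):
    """per_table_ops: dict of table -> list of per-region op counts."""
    total = sum(sum(regions) for regions in per_table_ops.values())
    return sum(per_table_ops[table]) / total

reads = {                 # ops per region hosted on this Region Server
    "X": [1200, 800],     # Table X: Reg 1, Reg 2
    "Y": [500],
    "Z": [1500],
}

print(f"Read Share (X) = {share(reads, 'X'):.2f}")  # 2000 / 4000
```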
Unit Costs for HBase Operations
Write Operations: performed on a Region Server while writing to individual table regions
  Unit: $ / 1000 Writes
  Total Capacity: total write operations across Region Servers
  Unit Cost: Monthly Write TCO / Total Write Ops (K)

Read Operations: performed on a Region Server while reading from individual table regions
  Unit: $ / 1000 Reads
  Total Capacity: total read operations across Region Servers
  Unit Cost: Monthly Read TCO / Total Read Ops (K)

Storage: HDFS (usable) space needed by a table region's HFiles with the default replication factor
  Unit: $ / GB stored (usable storage space, less replication and overheads)
  Unit Cost: Monthly Storage Cost / Avail. Usable Storage

Network: bandwidth needed to move data into/out of the clusters by clients
  Unit: $ / GB for inter-region data transfers
  Total Capacity: inter-region (peak) link capacity
  Cost: Monthly GB [In + Out] x $ / GB
Working Through An HBase Example
Writes:
  Monthly Cost: Monthly TCO (less bw.) = $60 K; Write Serving @ 25% = $15 K
  Monthly Capacity: Total write operations across Region Servers = 100 M
  Unit Cost: $15 K / 100 M = $0.15 per 1000 writes per month

Reads:
  Monthly Cost: Monthly TCO (less bw.) = $60 K; Read Serving @ 25% = $15 K
  Monthly Capacity: Total read operations across Region Servers = 200 M
  Unit Cost: $15 K / 200 M = $0.075 per 1000 reads per month

Storage:
  Monthly Cost: Monthly TCO (less bw.) = $60 K; Storage @ 50% = $30 K
  Monthly Capacity: Raw HDFS = 10 PB; Usable HDFS = [10 x 0.8 (20% overhead)] / 3 = 2.67 PB
  Unit Cost: $30 K / 2.67 PB = $0.011 / GB / Month

Network:
  Monthly Cost: Monthly Charges = $5 K
  Monthly Capacity: Total Data In + Out = 0.25 PB
  Unit Cost: $5 K / 0.25 PB = $0.02 / GB transferred

ILLUSTRATIVE
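The same arithmetic in Python, using the deck's illustrative numbers; the 25% / 25% / 50% TCO allocation across writes, reads, and storage comes from the example, while the helper names are mine.

```python
# HBase unit costs from the illustrative example: $60 K monthly TCO
# (less bandwidth) split across write serving, read serving, and storage.
MONTHLY_TCO = 60_000

def per_1000_ops(tco_share, total_ops):
    """$ per 1000 operations for the TCO slice allocated to this op type."""
    return (MONTHLY_TCO * tco_share) / total_ops * 1000

write_rate = per_1000_ops(0.25, 100_000_000)  # $ per 1000 writes
read_rate = per_1000_ops(0.25, 200_000_000)   # $ per 1000 reads

usable_pb = 10 * (1 - 0.2) / 3                # raw 10 PB -> 2.67 PB usable
storage_rate = (MONTHLY_TCO * 0.5) / (usable_pb * 1e6)  # $ / GB / month

print(f"${write_rate:.2f} / 1000 writes, ${read_rate:.3f} / 1000 reads, "
      f"${storage_rate:.3f} / GB")
```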
Measuring HBase Resource Consumption
Monthly HBase Project Cost:
  Writes: Write Ops per Region Server per Table Region = #W(R1:RS1) + #W(R2:RS1) + …; Cost = Total Writes x $0.15 / 1000 writes / month = $ for the Table/RS/Month
  Reads: Read Ops per Region Server per Table Region = #R(R1:RS1) + #R(R2:RS1) + …; Cost = Total Reads x $0.075 / 1000 reads / month = $ for the Table/RS/Month
  Storage: HDFS size of regions under hbase/table/<regions> in GBs; Cost = Total HDFS size x $0.011 / GB / Month = $ for the Table/Month
  Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume In/Out

Monthly Roll-ups:
  Write Ops cost, Read Ops cost, and total HDFS size for all tables across all region servers for a user, app, BU, or the platform
  Bandwidth roll-ups through the relationship among user, app, and their BU
Putting it Together for HBase Services
HBase Services Billing Rate Card [Monthly Rates]:
  Write Operations: count of operations; monthly total write operations across regions of a table; $0.15 / 1000 Writes
  Read Operations: count of operations; monthly total read operations across regions of a table; $0.075 / 1000 Reads
  HDFS (Storage): GB; monthly peak storage used; $0.011 / GB
  Network Bandwidth: GB; monthly total in/out; $0.02 / GB

Monthly Bill for May 2014 (per BU: writes count / cost; reads count / cost; HDFS used / effective used / cost; network transferred / cost; total):
  BU1: 30 M / $4.5 K; 20 M / $1.5 K; 3 PB / 0.8 PB / $8.80 K; 1.25 PB / $0.025 K; total $14.82 K
  BU2: 10 M / $1.5 K; 60 M / $4.5 K; 1 PB / 0.27 PB / $2.93 K; 0.5 PB / $0.01 K; total $8.94 K
  …
  Total: 100 M / $15 K; 200 M / $15 K; 10 PB / 2.67 PB / $29.4 K; 0.25 PB / $5 K; total $64.4 K

ILLUSTRATIVE
Multi-Tenant Deployment For Apache Storm
[Diagram: shared Supervisors 1..N, each running one worker from each of topologies X, Y, and Z, coordinated by Nimbus and ZooKeeper]
Understanding Apache Storm Resources
[Diagram: a Supervisor with fixed worker slots, running workers for Topology A and Topology B, each worker running multiple tasks]

 A Supervisor runs one or more worker processes for one or more topologies
 Each Supervisor has a fixed number of worker slots
 A worker process belongs to a specific topology
 Workers from topologies are distributed randomly across supervisors
 Tasks perform the actual data processing
Unit Costs for Storm Operations

Compute: worker slots where topology workers execute the actual logic / tasks of spouts and bolts in parallel
  Unit: $ / Slot-Hour
  Total Capacity: total number of slots
  Unit Cost: Monthly Compute Cost / Avail. Slot-Hours

Network: bandwidth needed to move data into/out of the clusters by topologies
  Unit: $ / GB for inter-region data transfers
  Total Capacity: inter-region (peak) link capacity
  Cost: [Monthly GB In + Out] x $ / GB
Working Through a Storm Example

Compute:
  Monthly Cost: Monthly TCO (less bw.) = $30 K; 24 slots per Supervisor @ 100% = $30 K
  Monthly Capacity: 19.2 K slots = 19.2 K x 24 x 30 = 13.8 M Slot-Hours
  Unit Cost: $30 K / 13.8 M Slot-Hours = $0.002 / Slot-Hour / Month

Network:
  Monthly Cost: Monthly Charges = $2.5 K
  Monthly Capacity: Total Data In + Out = 0.12 PB
  Unit Cost: $2.5 K / 0.12 PB = $0.02 / GB transferred

ILLUSTRATIVE
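The slot-hour rate derivation can be sketched with the deck's illustrative figures; the function name is mine.

```python
# Storm slot-hour unit cost: monthly compute TCO over available slot-hours.
HOURS_PER_MONTH = 24 * 30

def slot_hour_rate(monthly_tco_usd, total_slots):
    slot_hours = total_slots * HOURS_PER_MONTH   # 19.2 K slots -> ~13.8 M
    return monthly_tco_usd / slot_hours

rate = slot_hour_rate(30_000, 19_200)
print(f"${rate:.4f} / Slot-Hour")                # deck rounds to $0.002
```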
Measuring Storm Resource Consumption

Monthly Cost:
  Worker Slot-Hours for Topologies = #W(TP1) x T(TP1) + #W(TP2) x T(TP2) + …
  Cost = Worker Slot-Hours x $0.002 / Slot-Hour / Month = $ for the Topology / Month
  Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out

Monthly Roll-ups:
  Worker Slot-Hours for all topologies can be summed up for the month for a user, app, BU, or the entire platform
  Roll-ups through the relationship among user, app, and their BU

ILLUSTRATIVE
Putting it Together for Storm Services
Storm Services Billing Rate Card [Monthly Rates]:
  Compute: worker slot hours; number of slots used by topology workers and the hours they ran for; $0.002 / Slot-Hour
  Network Bandwidth: GB; monthly total in/out; $0.02 / GB

Monthly Bill for May 2014 (per BU: compute used / cost; network transferred / cost; total):
  BU1: 2.5 M slot-hours / $5 K; 0.02 PB / $0.4 K; total $5.4 K
  BU2: 1.25 M slot-hours / $2.5 K; 0.04 PB / $0.8 K; total $3.3 K
  …
  Total: 10 M slot-hours / $20 K; 0.12 PB / $2.4 K; total $22.4 K

ILLUSTRATIVE
Project Based Costing for Grid Services
Project Summary Period Cost (K)
Grid Services Cost May 2014 $ 165.5 K
Project Usage Details (Data Center DC1) Usage Cost (K)
Apache Hadoop Services $ 126 K
Compute (Map & Reduce GB-Hours consumed @ $0.004/GB-Hour) 12.5 M $ 50 K
Storage (GBs of peak storage used @ $ 0.019/GB) 3.45 PB $ 66 K
Network (GBs In/Out @ $0.02/GB) 0.5 PB $ 10 K
Apache HBase Services $ 34.1 K
Reads (Number of Read Operations @ $0.075/1000 Reads) 30 M $ 2.2 K
Writes (Number of Write Operations @ $0.15/1000 Writes) 20 M $ 3.0 K
Storage (GBs of peak storage used @ $ 0.011/GB) 2.45 PB $26.9 K
Network (GBs In/Out @ $0.02/GB) 0.1 PB $2 K
Apache Storm Services $ 5.4 K
Compute (Slot Hours consumed @ $ 0.002/Slot-Hour) 2.5 M $ 5 K
Network (GBs In/Out @ $0.02/GB) 0.02 PB $ 0.4 K
ILLUSTRATIVE
Platform P&L
Line Item Q4’12 Q1’13 Q2’13 Q3 ’13 Total Total %
Y! Gross Revenues
Cost of revenues (less Grid CapEx)
Gross Profit
Grid OpEx
R&D Headcount
SE&O Headcount
Acquisition/Install
Active Use/ Ops
Network Bandwidth
Total Grid OpEx
Grid CapEx
Grid Services
Total Grid CapEx
Contribution Margin
Indirect Costs
G&A
Sales and Marketing
ILLUSTRATIVE
Hadoop Cost Benchmarking – An Approach
Monthly on-premise usage (used + unused = total) vs. a public-pricing or terms-based equivalent:
  M/R: 71.4 M + 61.6 M = 133 M GB-Hours; equivalent compute instances (normalized time, RAM, 32/64 ops, I/O, etc.): 1,000 instances/hr.
  HDFS: 148 PB + 52 PB = 200 PB; equivalent storage (accounting for 3x repl., job/app space): 30 PB/month
  Avg. data processed: 75 PB total; instance storage: 2.5 PB daily

Cost equivalent:
  M/R: $0.50 M + $0.50 M = $1 M; cloud: 1,000 x $0.70/instance/hr. x 24 x 30 = $0.5 M
  HDFS: $0.75 M + $0.25 M = $1 M; cloud: 30 PB x $0.04/GB/month = $1.2 M
  Other costs (if any) such as reads, writes, data services/hour, etc.: $0.25 M
  Total*: $1.25 M + $0.75 M = $2 M; cloud total: $1.95 M

* Ignored bandwidth, assumed equivalent

ILLUSTRATIVE
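The cost-equivalent comparison can be sketched as follows; the rates and volumes are the slide's illustrative numbers, and the function signature is mine.

```python
# Price the on-premise workload at public-cloud-equivalent rates and
# compare against the on-premise monthly TCO.
def cloud_equivalent(instance_rate_hr, instances, storage_pb, storage_rate_gb,
                     other_costs):
    compute = instances * instance_rate_hr * 24 * 30  # 1,000 inst @ $0.70/hr
    storage = storage_pb * 1e6 * storage_rate_gb      # 30 PB @ $0.04/GB/month
    return compute + storage + other_costs

on_prem_tco = 2_000_000                               # monthly, bandwidth ignored
cloud = cloud_equivalent(0.70, 1000, 30, 0.04, 250_000)

print(f"On-premise: ${on_prem_tco / 1e6:.2f} M, "
      f"cloud equivalent: ${cloud / 1e6:.2f} M")
```

Varying the utilization inputs (the "used" vs. "unused" split) turns this into the sensitivity analysis suggested later in the deck.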
HBase and Storm Cost Benchmarking
Quantity equivalents (public pricing or terms-based, used on-premise equivalent):
  Reads: peak concurrent reads for a given record size, 300 MB/s; reads on chosen instances (benchmarks 45 MB/s): 300/45 = 7 instances
  Writes: peak concurrent writes for a given record size, 160 MB/s; writes on chosen instances (benchmarks 10 MB/s): 160/10 = 16 instances
  Storage: data storage in tables (incl. replication), 1.6 TB; data served per instance (benchmarks 0.5 TB incl. repl.): 1.6/0.5 = 3 instances
  HBase instances required based on throughput and storage needs: 16 instances/hour; cost calculations stay the same as Hadoop
  Slot-Hours (Storm): 2.5 M slot hours per month; instance hours based on memory and CPU requirements (12 slots/instance): 0.21 M instance hours; cost calculations stay the same as Hadoop

* Ignored bandwidth, assumed equivalent

ILLUSTRATIVE
Improving Utilization favors on-premise setup
[Chart: cost ($) vs. utilization/consumption (compute and storage); on-premise Hadoop as a Service has a high starting cost but scales up cheaply, while on-demand and terms-based public cloud services grow with usage; below the crossover points the public cloud service is favored, above them on-premise Hadoop as a Service is favored]

Sensitivity analysis on costs based on current, expected, or target utilization can provide further insights into your operations and cost competitiveness.
Improving Utilization improves ROI
[Chart: cost amortized over apps ($) vs. time, falling from C at time t (Phase I, 2012-2013, H 0.23) to C' at time t' (2014 & future) as the number of apps on the platform continues to grow]

At time t, BU profits are R(t) - C(t) = π(t). The platform's goal is to continue to increase the ROI while supporting new technology and services: R(t') - C(t') = π(t'), where C(t') < C(t) and π(t') > π(t) for the same or bigger revenues.
Going Forward
Hadoop:
 CPU as a resource
 Pre-emption and priority
 Long-running jobs
 Other potential resources such as disk, network, GPUs, etc.
 Tez as the execution engine / container reuse

HBase:
 Multiple Region Servers per node
 Larger JVMs / GC improvements
 HBase-on-YARN
 cgroup profiles

Storm:
 Storm-on-YARN
 Resource-aware scheduling (memory, CPU, network)
 cgroup profiles
 More experience with multi-tenancy
Thank You
@sumeetksingh
@amritasshwar
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
 
How pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architectureHow pig and hadoop fit in data processing architecture
How pig and hadoop fit in data processing architecture
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
hadoop exp
hadoop exphadoop exp
hadoop exp
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Eliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside HadoopEliminating the Challenges of Big Data Management Inside Hadoop
Eliminating the Challenges of Big Data Management Inside Hadoop
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Big Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and StoringBig Data with Hadoop – For Data Management, Processing and Storing
Big Data with Hadoop – For Data Management, Processing and Storing
 
SoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in UtahSoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in Utah
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Resume - Narasimha Rao B V (TCS)
Resume - Narasimha  Rao B V (TCS)Resume - Narasimha  Rao B V (TCS)
Resume - Narasimha Rao B V (TCS)
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Kürzlich hochgeladen

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Kürzlich hochgeladen (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Costing Your Big Data Operations

  • 1. Costing Your Big Data Operations. Presented by Sumeet Singh, Amrit Lal | June 5, 2014. 2014 Hadoop Summit, San Jose, California
  • 2. Introduction. Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA, @sumeetksingh: Product Manager at Yahoo engaged in building high-class and robust Hadoop infrastructure services; eight years of experience across HSBC, Oracle and Google developing products and platforms for high-growth enterprises; MBA from Carnegie Mellon University. Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA, @amritasshwar: manages the Hadoop products team at Yahoo!; responsible for product management, strategy and customer engagements; managed the Cloud Services products team and headed strategy functions for the Cloud Platform Group at Yahoo; MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI). 2014 Hadoop Summit, San Jose, California
  • 3. Agenda: (1) Total Cost of Ownership (TCO) Models; (2) Deeper Understanding of (Resource) Usage; (3) P&L, Metering and Billing Provisions; (4) Benchmark Costs; (5) Improve Utilization and ROI. 2014 Hadoop Summit, San Jose, California
  • 4. Why Do Costing? Profitability: understanding the data services costs (an element of your total project cost) to determine how profitable the project is. ROI: investment decisions both at the platform and app/project level. Operational Efficiency: benchmark and improve ops by focusing on avg. utilization, increasing the number of hosted apps, storage efficiencies, job performance etc. Planning: capital planning and budgeting, product improvements. Cost Transparency: metering/usage metrics, billing, chargeback/showback, P&L. 2014 Hadoop Summit, San Jose, California
  • 5. Costing is Relevant Irrespective of the Service Model. Private Cloud: fixed costs that favor scale and 24x7 operations; centralized operations; multi-tenant clusters with security and data sharing; cost a function of desired SLA; utilization and number of hosted apps a primary lever; tenants often tend to ignore costs. Public Cloud: variable with usage and favors a run-and-done model; decentralized operations, with ops/headcount costs still relevant; dedicated virtual clusters; monthly bills; releasing cluster instances when not needed is a wise idea; users often overlook the peripheral costs. 2014 Hadoop Summit, San Jose, California
  • 6. Important with Multi-tenancy and Scale. [Chart: number of servers and raw HDFS storage (in PB) by year, 2006-2014, annotated with milestones] Yahoo! commits to scaling Hadoop for production use: research workloads in Search and Advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; next-gen Hadoop (H 0.23 YARN); new services (HBase, Storm, Hive etc.); increased user base with partitioned namespaces; Apache H 2.x (low latency, utilization, HA etc.). 2014 Hadoop Summit, San Jose, California
  • 7. Hosted Apps Growth on Apache Hadoop. [Chart: new customer apps on-boarded per quarter, Q1-11 through Q1-14, with cumulative projects growing from 272 to 525] 58 projects in 2011, 52 projects in 2012, 113 projects in 2013. 2014 Hadoop Summit, San Jose, California
  • 8. Multi-tenant Apache HBase Growth. [Chart: region servers and data stored (in PB), Q1-13 through Q1-14, reaching 1,140 region servers and 33.6 PB] Zero to "20" use cases (60,000 regions) in a year. 2014 Hadoop Summit, San Jose, California
  • 9. Multi-tenant Apache Storm Growth. [Chart: supervisors and topologies, Q1-13 through Q1-14, reaching 760 supervisors and 175 topologies, with the multi-tenancy release marked] Zero to "175" production topologies in a year. 2014 Hadoop Summit, San Jose, California
  • 10. Capital Deployment for Big Data Infrastructure. [Architecture diagram: DataNode/NodeManager nodes under a NameNode and RM; DataNodes/RegionServers under a NameNode and HBase Master; Nimbus with Supervisors; administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; applications and data; data feeds; data stores; Oozie server; HS2/HCat; network backplane] 2014 Hadoop Summit, San Jose, California
  • 11. Big Data Platforms Technology Stack at Yahoo. [Stack diagram spanning compute services, storage, and infrastructure services: Pig, Hive, Oozie, HCatalog, GDM, HDFS Proxy, YARN, MapReduce, Tez, Spark, Storm, HDFS, HBase, ZooKeeper, Support Shop, Monitoring, Starling, Messaging Service] 2014 Hadoop Summit, San Jose, California
  • 12. Resources Consumed in Big Data Operations. [Diagram: clusters in datacenters (Colo 1 with Rack 1 through Rack N) and the server resources within them] 2014 Hadoop Summit, San Jose, California
  • 13. Elements of a TCO Model (ILLUSTRATIVE). Monthly TCO: $2.1 M, split across seven components (shares of 60%, 12%, 10%, 7%, 6%, 3%, and 2%). (1) Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers. (2) R&D HC: headcount for platform software development, quality, and release engineering. (3) Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance). (4) Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips etc. (5) Acquisition/Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping/receiving etc. (6) Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support. (7) Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers. 2014 Hadoop Summit, San Jose, California
  • 14. Understanding Apache Hadoop Resources. [Diagram: storage and compute (NameNode with DataNodes holding DFS blocks) alongside MapReduce and memory (Resource Manager with Node Managers running MR containers, Tasks 1 through 3)] 2014 Hadoop Summit, San Jose, California
  • 15. Unit Costs for Hadoop Operations. Compute (containers where apps can perform computation and access HDFS if needed): unit is $ / GB-Hour (H 0.23/2.0); total capacity is the GBs of memory available for an hour; unit cost = monthly compute cost / available compute capacity. Storage (HDFS usable space needed by an app, with the default replication factor of three): unit is $ / GB stored; total capacity is usable storage space (less replication and overheads); unit cost = monthly storage cost / available usable storage. Network bandwidth (needed to move data into/out of the clusters by the app): unit is $ / GB for inter-region data transfers; capacity is the inter-region (peak) link capacity; cost = [monthly GB in + out] x $ / GB. Namespace (files and directories used by the apps, tracked to understand/limit the load on the NN): no unit cost (N/A). 2014 Hadoop Summit, San Jose, California
  • 16. Working Through a Hadoop Example (ILLUSTRATIVE)
   Compute: monthly TCO (less bw.) = $2 M; compute @ 50% = $1 M; 315 TB memory = 315 TB x 24 x 30 = 227 M GB-Hours; $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month
   Storage: monthly TCO (less bw.) = $2 M; storage @ 50% = $1 M; raw HDFS = 200 PB; usable HDFS = [200 x 0.8 (20% overhead)] / 3 = 53.3 PB; $1 M / 53.3 PB = $0.019 / GB / Month
   Bandwidth: monthly charges = $0.1 M; total data in + out = 5 PB; $0.1 M / 5 PB = $0.02 / GB transferred
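The unit-cost arithmetic on this slide is simple enough to script. A minimal Python sketch using the slide's illustrative figures (the $2 M TCO split, 315 TB of memory, 200 PB raw HDFS; TB and PB taken as decimal units):

```python
# Illustrative Hadoop unit costs, per the worked example on this slide.
MONTHLY_TCO_LESS_BW = 2_000_000            # $2 M monthly TCO, excluding bandwidth

# Compute: 50% of TCO spread over available memory GB-Hours for the month.
compute_cost = 0.50 * MONTHLY_TCO_LESS_BW
memory_gb = 315_000                        # 315 TB of cluster memory, in GB
gb_hours = memory_gb * 24 * 30             # ~227 M GB-Hours per month
cost_per_gb_hour = compute_cost / gb_hours

# Storage: 50% of TCO spread over usable HDFS (20% overhead, 3x replication).
storage_cost = 0.50 * MONTHLY_TCO_LESS_BW
usable_gb = 200_000_000 * 0.8 / 3          # 200 PB raw -> ~53.3 PB usable, in GB
cost_per_gb = storage_cost / usable_gb

# Bandwidth: metered monthly charges over data moved in and out.
cost_per_gb_transferred = 100_000 / 5_000_000   # $0.1 M over 5 PB (in GB)

print(round(cost_per_gb_hour, 3))          # ~0.004  ($/GB-Hour/Month)
print(round(cost_per_gb, 3))               # ~0.019  ($/GB/Month)
print(cost_per_gb_transferred)             # 0.02    ($/GB transferred)
```

The same three divisions reappear for HBase and Storm later in the deck; only the capacity denominators change.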
  • 17. Measuring Hadoop Resource Consumption
   Jobs: Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + …; Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + …; Cost = (M + R) GB-Hours x $0.004 / GB-Hour / Month = $ for the job / month. (M + R) GB-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform.
   Storage: /project (app) directory quota in GB (peak monthly storage used); /user directory quota in GB (peak monthly storage used); /data is accounted for with each user accountable for their portion of use, e.g. User 1's share = GB Read (U1) / [GB Read (U1) + GB Read (U2) + …]. Roll-ups through the relationship among user, file ownership, app, and their BU.
   Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in/out. Roll-ups through the relationship among user, app, and their BU.
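The per-job GB-Hours formula translates directly into code. A hedged sketch in Python: the task sizes and durations below are hypothetical, and the $0.004/GB-Hour rate comes from the worked example:

```python
def gb_hours(tasks):
    """Sum of container memory (GB) x wall-clock runtime (hours) over tasks."""
    return sum(gb * hours for gb, hours in tasks)

# Hypothetical job: 100 mappers at 2 GB for 30 min, 10 reducers at 4 GB for 1 hr.
map_tasks = [(2.0, 0.5)] * 100
reduce_tasks = [(4.0, 1.0)] * 10

job_gb_hours = gb_hours(map_tasks) + gb_hours(reduce_tasks)  # 100 + 40 = 140
job_cost = job_gb_hours * 0.004                              # $0.56 for this job

# Summing job_cost over a month, keyed by user/app/BU, gives the roll-ups
# the slide describes.
```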
  • 18. Measuring Hadoop Resource Consumption: capacity scheduler queues across the cluster (queue 1 … queue 11).
  • 19. Measuring Hadoop Resource Consumption: SLA Dashboard on the Hadoop Analytics Warehouse.
  • 20. Putting it Together for Hadoop Services (ILLUSTRATIVE)
  Hadoop Services Billing Rate Card [Monthly Rates]:
   HDFS (Storage): unit GB; measured monthly as peak storage used; $0.019/GB
   Compute (Map-Reduce): unit GB-Hours; measured as the number of GBs used by mappers and reducers and the hours they ran for; $0.004/GB-Hour
   Network Bandwidth: unit GB; measured monthly, total in/out; $0.02/GB
  Monthly Bill for May 2014, per BU (HDFS used / effective used / cost; compute used / cost; bandwidth transferred / cost; total):
   BU1: 15 PB / 3.45 PB / $0.065 M; 12.5 M GB-hours / $0.05 M; 1.25 PB / $0.025 M; total $0.15 M
   BU2: 10 PB / 2.65 PB / $0.05 M; 6.25 M GB-hours / $0.025 M; 0.5 PB / $0.01 M; total $0.085 M
   … (BU N rows elided)
   Total: 148 PB / 39.5 PB / $0.75 M; 125 M GB-hours / $0.5 M; 5 PB / $0.1 M; total $1.35 M
  • 21. Multi-Tenant Deployment for Apache HBase: projects X, Y & Z share region servers. HMaster and ZooKeeper coordinate Region Servers 1 … N, each hosting a region per project's tables (X:Table:Region i, Y:Table:Region i, …, Z:Table:Region i), with reads/writes flowing through each RegionServer JVM to HDFS.
  • 22. Understanding Apache HBase Resources: region-level reads/writes pass through the RegionServer JVM (heap) to HFiles in HDFS storage (disk).
   Reads: Total Reads @ RS = Reads (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Read Share (X) = Total Table X / Total Tables (X, Y, Z)
   Writes: Total Writes @ RS = Writes (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Write Share (X) = Total Table X / Total Tables (X, Y, Z)
   Data Stored: Total Table Data @ RS = Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N
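The read/write/storage shares above are plain ratios per region server. A small sketch, where the per-table counts are hypothetical:

```python
def tenant_share(per_table_totals, table):
    """A table's fraction of one region server's reads, writes, or stored bytes,
    computed as this table's total over the total across all tables (X, Y, Z)."""
    return per_table_totals[table] / sum(per_table_totals.values())

# Hypothetical read counts at one region server, already summed across each
# table's regions on that server.
reads_at_rs = {"X": 600_000, "Y": 300_000, "Z": 100_000}
x_read_share = tenant_share(reads_at_rs, "X")   # X carries 0.6 of the read load
```

The same helper applies unchanged to write counts and to per-table HDFS bytes.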
  • 23. Unit Costs for HBase Operations
   Writes: operations performed on the Region Server while writing to individual table regions. Unit: $ / 1000 Writes; total capacity: total write operations across Region Servers; unit cost = Monthly Write TCO / Total Write Ops (K)
   Reads: operations performed on the Region Server while reading from individual table regions. Unit: $ / 1000 Reads; total capacity: total read operations across Region Servers; unit cost = Monthly Read TCO / Total Read Ops (K)
   Storage: HDFS (usable) space needed by a table region's HFiles, with the default replication factor. Unit: $ / GB Stored; total capacity: usable storage space (less replication and overheads); unit cost = Monthly Storage Cost / Avail. Usable Storage
   Bandwidth: network bandwidth needed to move data into/out of the clusters by clients. Unit: $ / GB for inter-region data transfers; total capacity: inter-region (peak) link capacity; unit cost = [Monthly GB In + Out] x $ / GB
  • 24. Working Through an HBase Example (ILLUSTRATIVE)
   Writes: monthly TCO (less bw.) = $60 K; write serving @ 25% = $15 K; total write operations across Region Servers = 100 M; $15 K / 100 M = $0.15 per 1000 writes per month
   Reads: monthly TCO (less bw.) = $60 K; read serving @ 25% = $15 K; total read operations across Region Servers = 200 M; $15 K / 200 M = $0.075 per 1000 reads per month
   Storage: monthly TCO (less bw.) = $60 K; storage @ 50% = $30 K; raw HDFS = 10 PB; usable HDFS = [10 x 0.8 (20% overhead)] / 3 = 2.67 PB; $30 K / 2.67 PB = $0.011 / GB / Month
   Bandwidth: monthly charges = $5 K; total data in + out = 0.25 PB; $5 K / 0.25 PB = $0.02 / GB transferred
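As with the Hadoop example, the HBase unit costs fall out of a few divisions. A sketch using the slide's illustrative figures:

```python
# Illustrative HBase unit costs, per the worked example on this slide.
MONTHLY_TCO_LESS_BW = 60_000               # $60 K monthly TCO, excluding bandwidth

# Writes and reads each carry 25% of the TCO, priced per 1000 operations.
write_unit = (0.25 * MONTHLY_TCO_LESS_BW) / (100e6 / 1000)   # $/1000 writes
read_unit = (0.25 * MONTHLY_TCO_LESS_BW) / (200e6 / 1000)    # $/1000 reads

# Storage carries 50% of the TCO over usable HDFS (20% overhead, 3x replication).
usable_gb = 10_000_000 * 0.8 / 3           # 10 PB raw -> ~2.67 PB usable, in GB
storage_unit = (0.50 * MONTHLY_TCO_LESS_BW) / usable_gb      # ~$0.011 / GB

# Bandwidth: $5 K of monthly charges over 0.25 PB (in GB) moved in and out.
bw_unit = 5_000 / 250_000                  # $0.02 / GB transferred
```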
  • 25. Measuring HBase Resource Consumption
   Writes: write ops per Region Server per table region = #W(R1:RS1) + #W(R2:RS1) + …; Cost = Total Writes x $0.15 / 1000 writes / month = $ for the table/RS/month. Write ops cost for all tables across all region servers rolls up per user, app, BU, or the platform.
   Reads: read ops per Region Server per table region = #R(R1:RS1) + #R(R2:RS1) + …; Cost = Total Reads x $0.075 / 1000 reads / month = $ for the table/RS/month. Read ops cost for all tables across all region servers rolls up per user, app, BU, or the platform.
   Storage: HDFS size of regions under hbase/table/<regions> in GBs; Cost = Total HDFS size x $0.011 / GB / Month = $ for the table/month. Total HDFS size for all tables across all region servers rolls up per user, app, BU, or the platform.
   Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in/out. Roll-ups through the relationship among user, app, and their BU.
  • 26. Putting it Together for HBase Services (ILLUSTRATIVE)
  HBase Services Billing Rate Card [Monthly Rates]:
   Write Operations: unit is a count of operations; measured monthly as total write operations across a table's regions; $0.15 / 1000 writes
   Read Operations: unit is a count of operations; measured monthly as total read operations across a table's regions; $0.075 / 1000 reads
   HDFS (Storage): unit GB; measured monthly as peak storage used; $0.011 / GB
   Network Bandwidth: unit GB; measured monthly, total in/out; $0.02 / GB
  Monthly Bill for May 2014, per BU (writes count / cost; reads count / cost; storage used / effective used / cost; bandwidth transferred / cost; total):
   BU1: 30 M / $4.5 K; 20 M / $1.5 K; 3 PB / 0.8 PB / $8.80 K; 1.25 PB / $0.025 K; total $14.82 K
   BU2: 10 M / $1.5 K; 60 M / $4.5 K; 1 PB / 0.27 PB / $2.93 K; 0.5 PB / $0.01 K; total $8.94 K
   … (BU N rows elided)
   Total: 100 M / $15 K; 200 M / $15 K; 10 PB / 2.67 PB / $29.4 K; 0.25 PB / $5 K; total $64.4 K
  • 27. Multi-Tenant Deployment for Apache Storm: topologies X, Y & Z share supervisors. Nimbus and ZooKeeper coordinate Supervisors 1 … N, each running a worker per topology (X: Worker i, Y: Worker i, …, Z: Worker i).
  • 28. Understanding Apache Storm Resources
   A Supervisor runs one or more worker processes for one or more topologies
   Each Supervisor has a fixed number of worker slots
   A worker process belongs to a specific topology
   Workers from topologies are distributed randomly across supervisors
   Tasks within a worker perform the actual data processing
  • 29. Unit Costs for Storm Operations
   Compute: worker slots where topology workers execute the actual logic/tasks of spouts and bolts in parallel. Unit: $ / Slot-Hour; total capacity: total number of slots; unit cost = Monthly Compute Cost / Avail. Slot-Hours
   Bandwidth: network bandwidth needed to move data into/out of the clusters by topologies. Unit: $ / GB for inter-region data transfers; total capacity: inter-region (peak) link capacity; unit cost = [Monthly GB In + Out] x $ / GB
  • 30. Working Through a Storm Example (ILLUSTRATIVE)
   Compute: monthly TCO (less bw.) = $30 K; 24 slots per Supervisor @ 100% = $30 K; 19.2 K slots = 19.2 K x 24 x 30 = 13.8 M Slot-Hours; $30 K / 13.8 M Slot-Hours = $0.002 / Slot-Hour / Month
   Bandwidth: monthly charges = $2.5 K; total data in + out = 0.12 PB; $2.5 K / 0.12 PB = $0.02 / GB transferred
  • 31. Measuring Storm Resource Consumption (ILLUSTRATIVE)
   Compute: Worker Slot-Hours for topologies = #W(TP1) x T(TP1) + #W(TP2) x T(TP2) + …; Cost = Worker Slot-Hours x $0.002 / Slot-Hour / Month = $ for the topology / month. Worker Slot-Hours for all topologies can be summed up for the month for a user, app, BU, or the entire platform.
   Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in/out. Roll-ups through the relationship among user, app, and their BU.
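The Slot-Hour formula above is a one-line sum. A sketch in Python, where the topology sizes and runtimes are hypothetical and the $0.002/Slot-Hour rate comes from the worked example:

```python
RATE_PER_SLOT_HOUR = 0.002   # $/Slot-Hour, from the worked Storm example

def topology_slot_hours(runs):
    """runs: (worker_slots, hours) pairs, i.e. #W(TPi) x T(TPi), summed."""
    return sum(workers * hours for workers, hours in runs)

# Hypothetical month: one topology with 50 workers running all month (720 h),
# another with 10 workers running for half the month (360 h).
slot_hours = topology_slot_hours([(50, 720), (10, 360)])  # 39,600 Slot-Hours
monthly_cost = slot_hours * RATE_PER_SLOT_HOUR            # $79.20 for the month
```

Keyed by user, app, or BU, the same sum produces the monthly roll-ups.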
  • 32. Putting it Together for Storm Services (ILLUSTRATIVE)
  Storm Services Billing Rate Card [Monthly Rates]:
   Compute: unit Worker Slot-Hours; measured as the number of slots used by topology workers and the hours they ran for; $0.002/Slot-Hour
   Network Bandwidth: unit GB; measured monthly, total in/out; $0.02/GB
  Monthly Bill for May 2014, per BU (compute used / cost; bandwidth transferred / cost; total):
   BU1: 2.5 M slot-hours / $5 K; 0.02 PB / $0.4 K; total $5.4 K
   BU2: 1.25 M slot-hours / $2.5 K; 0.04 PB / $0.8 K; total $3.3 K
   … (BU N rows elided)
   Total: 10 M slot-hours / $20 K; 0.12 PB / $2.4 K; total $22.4 K
  • 33. Project-Based Costing for Grid Services (ILLUSTRATIVE)
  Project summary: Grid Services cost for May 2014 = $165.5 K
  Project usage details (Data Center DC1):
  Apache Hadoop Services, $126 K:
   Compute (Map & Reduce GB-Hours consumed @ $0.004/GB-Hour): 12.5 M, $50 K
   Storage (GBs of peak storage used @ $0.019/GB): 3.45 PB, $66 K
   Network (GBs in/out @ $0.02/GB): 0.5 PB, $10 K
  Apache HBase Services, $34.1 K:
   Reads (number of read operations @ $0.075/1000 reads): 30 M, $2.2 K
   Writes (number of write operations @ $0.15/1000 writes): 20 M, $3.0 K
   Storage (GBs of peak storage used @ $0.011/GB): 2.45 PB, $26.9 K
   Network (GBs in/out @ $0.02/GB): 0.1 PB, $2 K
  Apache Storm Services, $5.4 K:
   Compute (Slot-Hours consumed @ $0.002/Slot-Hour): 2.5 M, $5 K
   Network (GBs in/out @ $0.02/GB): 0.02 PB, $0.4 K
  • 34. Platform P&L (ILLUSTRATIVE; figures left blank on purpose). Line items by quarter (Q4'12, Q1'13, Q2'13, Q3'13, Total, Total % Y!): Gross Revenues; Cost of Revenues (less Grid CapEx); Gross Profit; Grid OpEx (R&D Headcount, SE&O Headcount, Acquisition/Install, Active Use/Ops, Network Bandwidth, Total Grid OpEx); Grid CapEx (Grid Services, Total Grid CapEx); Contribution Margin; Indirect Costs (G&A, Sales and Marketing).
  • 35. Hadoop Cost Benchmarking: An Approach (ILLUSTRATIVE)
  Quantity equivalent (monthly used / unused / total, with the public-pricing or terms-based on-premise equivalent):
   M/R: 71.4 M / 61.6 M / 133 M; equivalent compute instances (normalized for time, RAM, 32/64-bit ops, I/O etc.): 1,000 instances/hr.
   HDFS: 148 PB / 52 PB / 200 PB; equivalent storage (accounting for 3x repl., job/app space): 30 PB/month
   Avg. data processed: 75 PB total; instance storage: 2.5 PB daily
  Cost equivalent:
   M/R: $0.50 M used + $0.50 M unused = $1 M; cloud: 1,000 x $0.70/instance/hr. x 24 x 30 = $0.5 M
   HDFS: $0.75 M used + $0.25 M unused = $1 M; cloud: 30 PB x $0.04/GB/month = $1.2 M
   Other costs (if any) such as reads, writes, data services/hour etc.: $0.25 M
   Total*: $1.25 M used + $0.75 M unused = $2 M on-premise; cloud total: $1.95 M
  * Ignored bandwidth, assumed equivalent
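The quantity-equivalent comparison is just a few multiplications. A hedged sketch with the slide's figures; the cloud unit prices ($0.70/instance-hr, $0.04/GB-month) are the slide's illustrative numbers, not any provider's actual rate card:

```python
# Cloud-equivalent monthly cost for the used portion of on-premise capacity.
cloud_compute = 1_000 * 0.70 * 24 * 30    # 1,000 instances all month: ~$504 K
cloud_storage = 30_000_000 * 0.04         # 30 PB/month at $0.04/GB: $1.2 M
cloud_other = 250_000                     # reads, writes, data services etc.
cloud_total = cloud_compute + cloud_storage + cloud_other  # ~$1.954 M
# (the slide shows $1.95 M after rounding compute to $0.5 M)

onprem_used = 1_250_000                   # used share of the fixed $2 M TCO

# Comparing onprem_used (~$1.25 M) against cloud_total (~$1.95 M) is the
# fixed-cost vs. pay-for-use comparison described in the speaker notes.
```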
  • 36. HBase and Storm Cost Benchmarking (ILLUSTRATIVE)
  Quantity equivalent (HBase), against public pricing or terms-based on-premise equivalents:
   Reads: peak concurrent reads for a given record size, 300 MB/s; reads on chosen instances (benchmarks 45 MB/s): 300/45 = 7 instances
   Writes: peak concurrent writes for a given record size, 160 MB/s; writes on chosen instances (benchmarks 10 MB/s): 160/10 = 16 instances
   Storage: data storage in tables (incl. replication), 1.6 TB; data served per instance (benchmarks 0.5 TB incl. repl.): 1.6/0.5 ≈ 3 instances
   Instances required based on throughput and storage needs: 16 instances/hour; cost calculations stay the same as Hadoop
  Quantity equivalent (Storm):
   Slot-Hours: 2.5 M slot-hours per month; instance hours based on memory and CPU requirements (12 slots/instance): 0.21 M instance-hours; cost calculations stay the same as Hadoop
  * Ignored bandwidth, assumed equivalent
  • 37. Improving Utilization Favors an On-Premise Setup
  Chart: cost ($) vs. utilization/consumption (compute and storage) for on-premise Hadoop as a Service, an on-demand public cloud service, and a terms-based public cloud service. On-premise has a high starting cost, but as utilization scales up the curves cross: low utilization favors the public cloud services, high utilization favors on-premise Hadoop as a Service.
  Sensitivity analysis on costs, based on current and expected (or target) utilization, can provide further insights into your operations and cost competitiveness.
  • 38. Improving Utilization Improves ROI
  Chart: cost amortized over apps ($) vs. time, from Phase I (2012 to 2013, Hadoop 0.23) into 2014 and beyond. As the number of apps on the platform continues to grow, the amortized cost falls: Cost(t) = C, Cost(t') = C', with C' < C.
  At time t, BU profits are R(t) - C(t) = π(t). The platform's goal is to keep increasing ROI while supporting new technology and services: R(t') - C(t') = π(t'), where C(t') < C(t) and π(t') > π(t) for the same or bigger revenues.
  • 39. Going Forward
  Hadoop:  CPU as a resource  Pre-emption and priority  Long-running jobs  Other potential resources such as disk, network, GPUs etc.  Tez as the execution engine / container reuse
  HBase:  Multiple Region Servers per node  Larger JVMs / GC improvements  HBase-on-YARN  cgroup profiles
  Storm:  Storm-on-YARN  Resource-aware scheduling (memory, CPU, network)  cgroup profiles  More experience with multi-tenancy
  • 40. Thank You @sumeetksingh @amritasshwar We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.

Editor's notes

  1. (3 min) You can also benchmark your costs. There are two approaches here. One is practical where you take your workloads to a public cloud and see what the bill was. It wasn’t possible for me, so I took a theoretical approach. It is not accurate, but works well for the purposes of understanding your costs. Take the used portion of your monthly capacity and calculate the equivalent quantity of compute and storage needed. For the quantity equivalent, calculate your cost based on unit pricing from the cloud provider. Now, you should be able to compare your fixed cost with the pay-for-use pricing of a cloud.