Reference Architecture - A Validated & Tested Approach to Network Design
1. Hadoop - Validated Network Architecture and Reference Deployment in Enterprise
Nimish Desai – nidesai@cisco.com
Technical Leader, Data Center Group
Cisco Systems Inc.
2. Session Objectives & Takeaways
Goal 1: Provide Reference Network
Architecture for Hadoop in Enterprise
Goal 2: Characterize the Hadoop Application's
Behavior on the Network
Goal 3: Network Validation Results with
Hadoop Workload
4. Validated 96 Node Hadoop Cluster
Two topologies validated side by side:
- Traditional DC Design (Nexus 55xx/2248): Nexus 7000 pair at the distribution layer, Nexus 5548 pair, 2248TP-E fabric extenders at the top of rack
- Nexus 7K-N3K based Topology: Nexus 7000 pair at the distribution layer, Nexus 3000 at the top of rack
In both topologies: Name Nodes and Data Nodes 1-48 / 49-96 on Cisco UCS C200, single NIC.
§ Hadoop Framework
  Apache 0.20.2
  Linux 6.2
  Slots - 10 Maps & 2 Reducers per node
§ Compute - UCS C200 M2
  Cores: 12
  Processor: 2 x Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
  Disk: 4 x 2TB (7.2K RPM)
  Network: 1G: LOM, 10G: Cisco UCS P81E
§ Network
  Three racks, each with 32 nodes
  Distribution Layer - Nexus 7000 or Nexus 5000
  ToR - FEX or Nexus 3000, 2 FEX per rack
  Each rack with either 32 single or dual attached hosts
6. Big Data Application Realm - Web 2.0 & Social/Community Networks
§ Data lives/dies in Internet-only entities
§ Data Domain - partially private data, UI service, data store
§ Homogeneous Data Life Cycle
  Mostly unstructured
  Web centric, user driven
  Unified workload - few processes & owners
  Typically non-virtualized
§ Scaling & Integration Dynamics
  Purpose-driven apps
  Thousands of nodes
  Hundreds of PB and growing exponentially
7. Big Data Application Realm - Enterprise
§ Data lives in a confined zone of the enterprise repository
§ Long lived, regulatory and compliance driven
§ Heterogeneous Data Life Cycle
§ Many data models
§ Diverse data - structured and unstructured
§ Diverse data sources - subscriber based
§ Diverse workload from many sources/groups/processes/technologies
§ Virtualized and non-virtualized, mostly SAN/NAS based
§ Scaling & Integration Dynamics are different
  § Data Warehousing (structured) with diverse repositories + unstructured data
  § Few hundred to a thousand nodes, few PB
  § Integration, policy & security challenges
  § Each app/group/technology limited in
    § data generation
    § consumption
    § servicing confined domains
[Diagram: typical enterprise application mix - Call Center, Sales Pipeline, ERP Module A/B, Doc Mgmt A/B, Records Mgmt, Data Services, Social Media, Office Apps, Video Conf, Collab, Product Catalog, Exec Reports, VOIP Data, Customer DB (Oracle/SAP)]
8. Big Data Framework Application Comparison
Relational Database
• Structured data - rows oriented
• Optimized for OLTP/OLAP
• Rigid schema applied to data on insert/update
• Read and write (insert, update) many times
• Non-linear scaling
• Most transactions and queries involve a small subset of the data set
• Transactional - scaling to thousands of queries
• GB to TBs in size
Batch-oriented Big Data (Hadoop)
• Unstructured data - files, logs, web clicks
• Data format is abstracted to the higher-level application programming
• Schema-less, flexible for later re-use
• Write once, read many
• Data never dies
• Linear scaling
• Entire data set at play for a given query
• Multi PB
Real-time Big Data (NoSQL)
• HBase, Cassandra, Oracle
• Structured and unstructured data
• Sparse column-family data storage or key-value pair
• Not an RDBMS, though with some schema
• Random read and write
• Modeled after Google's BigTable
• High transaction - real-time scaling to millions
• Not suited for ad-hoc analysis
• More suited for ~1 PB
9. Data Sources
Big Data: machine logs, sensor data, call data records, web click-stream data, satellite feeds, GPS data, sales tracking data, blogs, emails, pictures, video
Enterprise Application: sales, products, process, inventory, finance, payroll, shipping, authorization, customers, profile
Both feed the column store for Business Intelligence.
10. Big Data Building Blocks into the Enterprise
[Diagram: Big Data application sources - click streams, event data, social media, mobility trends, sensor logs - feed over the Cisco Unified Fabric (virtualized, bare metal and cloud) into three blocks:]
• "Big Data" NoSQL - real-time capture, store and analyze
• "Big Data" Storage - Hadoop
• Traditional Database - RDBMS, read and update operations, SAN and NAS
11. Infinite Use Cases
§ Web & E-Commerce
Faster User Response
Customer Behaviors & Pricing Models
Ad Targeting
§ Retail
Customer Churn & Integration of brick &
mortar with .com business models
PoS Transactional Analysis
§ Insurance & Finance
Risk Management
User Behavior & Incentive Management
Trade Surveillance for Financials
§ Network Analytics – Splunk
Text Mining
Fault Prediction
§ Security & Threat Defense
13. Hadoop Components and Operations
Hadoop Distributed File System
§ Data is not centrally located; data is stored across all data nodes in the cluster
§ Scalable & fault tolerant
§ Data is divided into multiple large blocks - 64MB default, 128MB typical
§ Blocks are not related to disk geometry
§ Data is stored reliably; each block is replicated 3 times
§ Types of functions
  § Name Node (Master) - manages the cluster
  § Data Node (Map and Reducer) - carries blocks
[Diagram: a file split into blocks 1-6, distributed across data nodes 1-15 in three racks behind ToR FEX/switches]
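To make the block and replica arithmetic concrete, here is a minimal sketch (illustrative, not from the deck) of how a file's block count and raw storage footprint fall out of the defaults above; it matches the 7,813 map tasks quoted later on slide 23, since Hadoop runs one map task per block:

import math

def hdfs_footprint(file_size_gb, block_mb=128, replication=3):
    """Estimate block count and raw storage for one HDFS file.
    Assumptions (illustrative): decimal GB/MB, and the default
    replication factor of 3 stated on the slide."""
    blocks = math.ceil(file_size_gb * 1000 / block_mb)
    raw_storage_gb = file_size_gb * replication
    return blocks, raw_storage_gb

blocks, raw = hdfs_footprint(1000)          # a 1 TB file
print(f"{blocks} blocks, {raw} GB stored")  # 7813 blocks, 3000 GB stored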
14. Hadoop Components and Operations
§ Name Node
  Runs a scheduler - the Job Tracker
  Manages all data nodes, in memory
  Secondary Name Node - snapshot of the metadata of the HDFS cluster
  Typically all three JVMs can run on a single node
§ Data Node
  Task Tracker receives job info from the Job Tracker (Name Node)
  Map & Reduce tasks are managed by the Task Tracker
  Configurable ratio of Map & Reduce tasks for various workloads, per node/CPU/core
  Data Locality - if data is not available where the map task is assigned, the missing block will be copied over the network
[Diagram: Name Node managing data nodes 1-15 in three racks behind ToR FEX/switches]
15. Characteristics that Affect Hadoop Clusters
§ Cluster Size
  Number of data nodes
§ Data Model & Mapper/Reducer Ratio
  MapReduce functions
§ Input Data Size
  Total starting dataset
§ Data Locality in HDFS
  Ability to process data where it is already located
§ Background Activity
  Number of jobs running, type of jobs, importing and exporting
§ Characteristics of Data Node
  ‒ I/O, CPU, memory, etc.
§ Networking Characteristics
  ‒ Availability
  ‒ Buffering
  ‒ Data node speed (1G vs. 10G)
  ‒ Oversubscription
  ‒ Latency
http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/
16. Hadoop Components and Operations
Hadoop Distributed File System - Unstructured Data
§ Data Ingest & Replication
  External connectivity
  East-west traffic (replication of data blocks)
§ Map Phase - raw data is analyzed and converted to name/value pairs
  Workload translates to multiple batches of Map tasks
  Reducers can start the reduce phase ONLY after the entire Map set is complete
§ Mostly an IO/compute function
[Diagram: Map tasks feed the Shuffle Phase, where name/value pairs are grouped by key (Key 1 ... Key 4) and handed to Reduce tasks, producing the Result/Output]
17. Hadoop Components and Operations
Hadoop Distributed File System - Unstructured Data
§ Shuffle Phase - all name/value pairs are sorted and grouped by their keys
  § Mappers send the data to Reducers
  § High network activity
§ Reduce Phase - all values associated with a key are processed for results, in three phases:
  Copy - get intermediate results from each data node's local disk
  Merge - reduce the number of files
  Reduce method
§ Output Replication Phase - Reducers replicate results to multiple nodes
  Highest network activity
§ Network activity is dependent on workload behavior
[Diagram: Map tasks feed the Shuffle Phase (keys grouped per reducer) and Reduce tasks produce the Result/Output]
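The map -> shuffle (group by key) -> reduce flow these two slides describe can be sketched in a few lines of Python (a toy illustration, not Hadoop code):

from collections import defaultdict

docs = ["to be or not to be", "the network is the computer"]

# Map phase: raw data converted to name/value pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: pairs grouped by key - in a real cluster this is
# the step that crosses the network between mappers and reducers
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: all values associated with a key are processed
result = {key: sum(values) for key, values in groups.items()}
print(result)   # {'to': 2, 'be': 2, 'or': 1, ...}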
18. MapReduce Data Model
ETL & BI Workload Benchmark
The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.
Yahoo TeraSort - ETL Workload - Most Network Intensive
[Timeline: Map Start, Reducers Start, Map Finish, Job Finish]
• Input, shuffle and output data size is the same - e.g. a 10 TB data set in all phases
• Yahoo Terasort has more balanced Map vs. Reduce functions - linear compute and IO
Shakespeare WordCount - BI Workload
[Timeline: Map Start, Reducers Start, Map Finish, Job Finish]
• Data set size varies in the various phases - varying impact on the network, e.g. 1TB input, 10MB shuffle, 1MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data
19. ETL Workload (1TB Yahoo Terasort)
Network Graph of all Traffic Received on a Single Node (80 Node Run)
Shortly after the Reducers start, Map tasks are finishing and data is being shuffled to the reducers. As the Maps completely finish, the network is no longer used, as the Reducers have all the data they need to finish the job.
[Graph: the red line is the total amount of traffic received by hpc064; the other symbols represent individual nodes sending traffic to HPC064. Timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete]
20. ETL Workload (1TB Yahoo Terasort)
Network Activity of all Traffic Received on a Single Node (80 Node Run)
If output replication is enabled, then at the end of the terasort additional copies must be stored. For a 1TB sort, 2TB will need to be replicated across the network.
Output Data Replication Enabled
§ Replication of 3 enabled (1 copy stored locally, 2 stored remotely)
§ Each reduce output is now replicated, instead of just stored locally
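The 2TB figure follows directly from the replication factor; a quick sketch of the arithmetic (illustrative only):

def replication_traffic_tb(output_tb, replication=3):
    """Network traffic generated by output replication: one copy is
    written locally, the remaining (replication - 1) copies cross the
    network - per the slide, a 1TB output with replication 3 sends
    2TB over the fabric."""
    return output_tb * (replication - 1)

print(replication_traffic_tb(1))  # 2 (TB)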
21. BI Workload
Network Graph of all Traffic Received on a Single Node (80 Node Run)
WordCount on 200K copies of the complete works of Shakespeare
Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilized throughout the job, but only by a limited amount.
[Graph: the red line is the total amount of traffic received by hpc064; the other symbols represent individual nodes sending traffic to HPC064. Timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete]
22. Data Locality in HDFS
Data Locality - the ability to process data where it is locally stored.
Observations
§ Notice the initial spike in RX traffic comes before the Reducers kick in.
§ It represents data each map task needs that is not local.
§ Looking at the spike, it is mainly data from only a few nodes.
Note: During the Map Phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor: more replicas tend to create a higher probability of data locality.
[Graph timeline: Maps Start, Reducers Start, Maps Finish, Job Complete. Map tasks: initial spike for non-local data - sometimes a task is scheduled on a node that does not have the data available locally.]
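As a rough intuition for why more replicas raise the locality probability, consider a toy model (an assumption for illustration, not a measurement from the deck) in which map tasks land on nodes uniformly at random:

def p_local(replicas, nodes):
    """Probability that a randomly placed map task lands on a node
    holding a replica of its block (toy uniform-placement model; the
    real scheduler actively prefers local nodes, so actual locality
    is far higher - this only shows the trend)."""
    return min(replicas / nodes, 1.0)

# On the 80-node test cluster, each added replica linearly raises
# the chance that any given node already holds the block:
for r in (1, 2, 3, 5):
    print(r, p_local(r, 80))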
23. Map to Reducer Ratio Impact on Job Completion
§ A 1 TB file with 128 MB blocks == 7,813 Map tasks
§ The job completion time is directly related to the number of reducers
§ Average network buffer usage falls as the number of reducers gets lower (see hidden slides), and vice versa
[Charts: Job Completion Time in seconds vs. number of reducers - one chart for 192/96/48 reducers on a 0-800 s scale, and the full range 192, 96, 48, 24, 12, 6 on a 0-30,000 s scale; completion time grows steeply as the reducer count drops]
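Why fewer reducers stretch the job: in TeraSort the shuffle size roughly equals the input size, so each reducer must pull its share of the whole data set across the network. The 192 figure matches the cluster's capacity from slide 4 (96 nodes x 2 reduce slots per node); the per-reducer arithmetic below is an illustrative sketch:

def data_per_reducer_gb(input_gb, reducers):
    """With TeraSort, shuffle bytes ~= input bytes, so each reducer
    pulls input_gb / reducers over the network."""
    return input_gb / reducers

for r in (192, 96, 48, 24, 12, 6):
    print(r, round(data_per_reducer_gb(1000, r), 1), "GB")
# 192 -> 5.2 GB ... 6 -> 166.7 GB: fewer reducers means far more
# data (and time) per reducer, stretching job completion accordingly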
27. Network Characteristics
The relative impact of various network characteristics on Hadoop clusters*:
  Availability
  Buffering
  Oversubscription
  Data Node Speed
  Latency
* Not scaled or measured data
32. Hadoop Network Topologies - Reference
Unified Fabric & ToR DC Design
§ Integration with the Enterprise architecture - essential pathway for data flow
  Integration
  Consistency
  Management
  Risk-assurance
  Enterprise grade features
§ Consistent operational model
  NX-OS, CLI, fault behavior and management
§ Though higher BW east-west compared to traditional transactional networks
§ Over time it will take on multi-user, multi-workload behavior
  Needs enterprise centric features
  Security, SLA, QoS etc.
Reference topologies:
§ 1Gbps Attached Server
  § Nexus 7000/5000 with 2248TP-E
  § Nexus 7000 and 3048
§ NIC Teaming - 1Gbps Attached
  § Nexus 7000/5000 with 2248TP-E
  § Nexus 7000 and 3048
§ 10 Gbps Attached Server
  § Nexus 7000/5000 with 2232PP
  § Nexus 7000 and 3064
§ NIC Teaming - 10 Gbps Attached Server
  § Nexus 7000/5000 with 2232PP
  § Nexus 7000 & 3064
33. Validated Reference Network Topology
(Same two validated topologies and cluster configuration as slide 4: Traditional DC Design with Nexus 55xx/2248 and the Nexus 7K-N3K based topology - Apache 0.20.2 on 96 UCS C200 M2 single-NIC nodes across three racks.)
35. High Availability Switching Design
Common High Availability Engineering Principles
§ The core high availability design principles are common across all network systems designs
§ Understand the causes of network outages
  Component failures
  Network anomalies
§ Understand the engineering foundations of systems-level availability
  Device and network level MTBF
  Understanding hierarchical and modular design
  Understand the HW and SW interaction in the system
§ Enhanced vPC allows such topologies and is ideally suited for Big Data applications
§ In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port
System high availability is a function of topology and component-level high availability.
[Diagram: L3 - dual node, full mesh; L2 - dual node, full mesh; ToR - dual node; NIC teaming at the host - single NIC, dual NIC 802.3ad, dual NIC active/standby]
36. Availability with Single Attached Server
1G or 10G
§ Important to evaluate the overall availability of the system
  Network failures can span many nodes in the system, causing rebalancing and decreased overall resources. Typically multiple TB of data transfer occurs for a single ToR or FEX failure.
  Load sharing, ease of management and a consistent SLA are important to enterprise operation
§ Failure Domain Impact on Job Completion
  § A 1 TB Terasort typically takes ~4.20-4.30 minutes
  § A failure of a SINGLE NODE (either NIC or server component) results in roughly doubling of the job completion time
§ Key observation: the failure impact is dependent on the type of workload being run on the cluster
  Short-lived interactive vs. short-lived batch
  Long jobs - ETL, normalization, joins
[Topology: single NIC, 32 nodes per ToR]
37. Single Node Failure Job Completion Time
§ The Map tasks are executed in parallel, so the unit time for each Map task/node remains the same and all nodes complete their work at roughly the same time.
§ However, during a failure, a set of Map tasks remains pending (since the other nodes in the cluster are still completing their own tasks) until ALL the nodes finish their assigned tasks.
§ Once all the nodes finish their Map tasks, the leftover Map tasks are reassigned by the name node. The unit time to finish those sets of Map tasks remains the same (linear) as the time it took to finish the other Maps - they just happen NOT to be done in parallel, which is why the job completion time can roughly double. This is the worst-case scenario with Terasort; other workloads may have variable completion times.
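A back-of-the-envelope model of that worst case (a sketch under stated assumptions - 7,813 maps from the 1TB/128MB example, 96 nodes, 10 map slots per node - not the deck's measurement method):

import math

def waves(tasks, nodes, slots):
    """Parallel map 'waves': all slots run concurrently, so the job
    needs ceil(tasks / (nodes * slots)) rounds."""
    return math.ceil(tasks / (nodes * slots))

normal = waves(7813, 96, 10)        # ~9 parallel waves for the whole job
orphaned = math.ceil(7813 / 96)     # ~82 maps lost with one failed node
# If the orphaned maps are re-run serially on one node's worth of slots
# instead of spread in parallel, they add about as many waves as the
# entire normal run - consistent with the observed rough doubling:
extra = math.ceil(orphaned / 10)    # ~9 additional serial waves
print(normal, extra)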
40. Availability with Dual Attached Server
1G and 10G - Server NIC Teaming Topologies
§ Dual homing (active-active) network connections from the server allow
  Reduced replication and data movement during failure
  Optimal load-sharing
§ Dual homing the FEX avoids a single point of failure
§ Enhanced vPC allows such topologies and is ideally suited for Big Data applications
§ In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port
  § Supported with Nexus 5500 only
§ Alternatively, Nexus 3000 vPC allows host-level redundancy with ToR ECMP
[Topologies: single NIC, dual NIC 802.3ad, dual NIC active/standby]
41. Availability
Single Attached vs. Dual Attached Node
§ No single point of failure from the network viewpoint; no impact on job completion time
§ NIC bonding configured in Linux - with the LACP mode of bonding
§ Effective load-sharing of traffic flows across the two NICs
§ Recommended to change the hashing to src-dst-ip-port (both on the network and for NIC bonding in Linux) for optimal load-sharing
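Why the src-dst-ip-port hash balances better: flows between the same pair of nodes still differ in ports, so they can spread across both links. A minimal sketch of such a hash decision (illustrative only, not the switch's or the bonding driver's actual algorithm):

import zlib

def pick_link(src_ip, dst_ip, src_port, dst_port, links=2):
    """Hash the 4-tuple and pick one of the bonded links: any single
    flow sticks to one link (no packet reordering), while distinct
    flows spread across both."""
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    return zlib.crc32(key) % links

# Two flows between the same pair of nodes can land on different
# links - something a src-dst-ip-only hash would never allow:
print(pick_link("10.0.0.5", "10.0.0.9", 50010, 35001))
print(pick_link("10.0.0.5", "10.0.0.9", 50010, 35002))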
42. Availability Network Failure Result - 1TB Terasort - ETL
§ Failure of various components (96 nodes, 2 FEX per rack, FEX/ToR A and B per rack, racks 1-3)
§ Failures introduced at 33%, 66% and 99% of reducer completion
§ A singly attached NIC server & rack failure has a bigger impact on job completion time than any other failure
§ A FEX failure is a RACK failure for the 1G topology

Job Completion Time (sec) with Various Failures
Failure Point           | 1G Single Attached | 2G Dual Attached
Peer Link 5000          | 301                | 258
FEX *                   | 1137               | 259
Rack *                  | 1137               | 1017
1 Port - Single Attached| See previous slide | See previous slide
1 Port - Dual Attached  | See previous slide | See previous slide
* Variance in run time with % reducer completed
44. Cluster Scaling
Nexus 7K/5K & FEX - 2248TP-E or 2232
§ 1G Based - Nexus 2248TP-E
  48 1G host ports and up to 4 uplinks bundled into a single port channel
§ 10G Based - Nexus 2232
  32 10G host ports and up to 8 uplinks bundled into a single port channel
Nexus 2248TP-E and 2232 support both local port channels and vPC for distributed port channels.
[Diagram: FEX uplinks and host interfaces; hosts attached via 802.3ad & vPC or single attached]
45. Oversubscription Design
§ Hadoop is a parallel, batch-job oriented framework
§ The primary benefit of Hadoop is the reduction in job completion time for work that would otherwise take longer with traditional techniques, e.g. large ETL, log analysis, join-only-Map jobs
§ Typically more oversubscription occurs with 10G server access than at 1G
§ A non-blocking network is NOT needed; however, the degree of oversubscription matters for
  Job completion time
  Replication of results
  Oversubscription during rack or FEX failure
§ Static vs. actual oversubscription
  How much data a single node can push is often IO bound and a function of the number of disks configured

Uplinks | Theoretical Oversubscription (16 Servers) | Measured
8       | 2:1                                       | Next slides
4       | 4:1                                       | Next slides
2       | 8:1                                       | Next slides
1       | 16:1                                      | Next slides
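The theoretical column is just the ratio of host-facing bandwidth to uplink bandwidth. A quick sketch (assuming 16 10G-attached servers behind a FEX with 10G uplinks, matching the table):

def oversubscription(servers=16, host_gbps=10, uplinks=4, uplink_gbps=10):
    """Static (theoretical) oversubscription = host bandwidth / uplink
    bandwidth. Actual oversubscription is usually lower, because disk
    I/O bounds how much each node can really push."""
    return (servers * host_gbps) / (uplinks * uplink_gbps)

for up in (8, 4, 2, 1):
    print(f"{up} uplinks -> {oversubscription(uplinks=up):g}:1")
# 8 -> 2:1, 4 -> 4:1, 2 -> 8:1, 1 -> 16:1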
46. Network Oversubscriptions
§ Steady state
§ Result replication with 1, 2, 4 & 8 uplinks
§ Rack failure with 1, 2, 4 & 8 uplinks
47. Data Node Speed Differences
1G vs. 10G TCPDUMP of Reducers TX
• Generally 1G is used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
• Reduced spikes with 10G and smoother job completion time
• Multiple 1G or 10G links can be bonded together to not only increase bandwidth, but also increase resiliency
48. 1GE vs. 10GE Buffer Usage
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer.
[Chart: buffer cell usage vs. job completion, plotting 1G Buffer Used, 10G Buffer Used, 1G Map %, 1G Reduce %, 10G Map %, 10G Reduce %]
By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, as the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to the limits of I/O and compute capabilities.
50. Multi-use Cluster Characteristics
Hadoop clusters are generally multi-use. The effect of background use can affect any single job's completion.
[Example view of 24-hour cluster use: a given cluster running many different types of jobs, importing into HDFS, etc. A large ETL job overlaps with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs and purple lines are BI jobs)]
51. 100 Jobs each with 10GB Data Set
Stable, Node & Rack Failure
• Almost all jobs are impacted by a single node failure
• With multiple jobs running concurrently, the node failure impact is as significant as a rack failure
53. Burst Handling and Queue Depth
• Several HDFS operations and phases of MapReduce jobs are very bursty in nature
• The extent of the bursts largely depends on the type of job (ETL vs. BI)
• Bursty phases include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase
Optimal Buffering
• A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb bursts
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior) have been proposed, rather than huge-buffer switches
  http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
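For intuition on why shuffle traffic is an incast problem, consider a single reducer pulling intermediate data from every mapper at once (the chunk size below is a made-up illustration, not a measured value):

def incast_burst_mb(senders, chunk_kb=64):
    """If `senders` mappers each put one chunk on the wire to the same
    reducer simultaneously, the reducer's egress port must queue
    whatever exceeds the line rate for that instant."""
    return senders * chunk_kb / 1024

# 95 other data nodes answering one reducer with 64KB chunks:
print(incast_burst_mb(95))   # ~5.9 MB arriving at once at one port,
# to be absorbed by shared buffer such as the 2248TP-E's 32MB (next slide)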
54. Nexus 2248TP-E Buffer Monitoring
§ The Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
§ Hadoop, NAS and AVID are examples of bursty applications
§ You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network)
§ Extensive drop counters
  § Provides drop counters for both directions, network-to-host and host-to-network, on a per host interface basis
  § Drop counters for different reasons:
    • Out of buffer drop, no credit drop, queue limit drop (tail drop), MAC error drop, truncation drop, multicast drop
§ Buffer occupancy counter
  § How much buffer is being used - one key indicator of congestion or bursty traffic

N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
fex-110# show platform software qosctrl asic 0 0
55. Buffer Monitoring
switch# attach fex 110
Attaching to FEX 110 ...
To exit type 'exit', to abort type '$.'
fex-110# show platform software qosctrl asic 0 0
number of arguments 4: show asic 0 0
----------------------------------------
QoSCtrl internal info {mod 0x0 asic 0}
mod 0 asic 0:
port type: CIF [0], total: 1, used: 1
port type: BIF [1], total: 1, used: 0
port type: NIF [2], total: 4, used: 4
port type: HIF [3], total: 48, used: 48
bound NIF ports: 2
N2H cells: 14752
H2N cells: 50784
----Programmed Buffers---------
Fixed Cells : 14752
Shared Cells : 50784   <-- Allocated buffer in terms of cells (512 bytes each)
----Free Buffer Statistics-----
Total Cells : 65374
Fixed Cells : 14590
Shared Cells : 50784   <-- Number of free cells to be monitored
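To relate these cell counters to the 32MB shared buffer quoted on the previous slide, convert cells to bytes (512-byte cells, per the output annotation above):

def cells_to_mb(cells, cell_bytes=512):
    """Convert the qosctrl cell counters to megabytes."""
    return cells * cell_bytes / (1024 * 1024)

print(round(cells_to_mb(50784), 1))  # shared cells ~= 24.8 MB
print(round(cells_to_mb(65374), 1))  # total cells  ~= 31.9 MB, i.e. ~32MB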
57. Buffer depth monitoring: interface
§ Real time command displaying the status of the shared buffer
§ XML support will be added in the maintenance release
§ Counters are displayed in cell count; a cell is approximately 208 bytes
show hardware internal buffer info pkt-stats [brief|clear|detail]
[Output callouts: buffer usage, free buffer space on the platform, total buffer, max buffer usage since clear]
58. TeraSort (ETL) N3K Buffer Analysis (10TB)
• Optimized buffer sizes are required to avoid packet loss leading to slower job completion times
• Buffer utilization is highest during the shuffle and output replication phases
• The aggregation switch buffer remained flat, as the bursts were absorbed at the Top of Rack layer
[Chart: ToR buffer usage over time for a 10TB TeraSort, with Map and Reduce progress curves; buffer usage peaks during the shuffle phase and again during output replication]
59. Network Latency
Generally, network latency - while consistent latency is important - does not represent a significant factor for Hadoop clusters.
Note: There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
[Chart: completion time (sec) vs. data set size (1TB, 5TB, 10TB) on an 80 node cluster, N3K topology vs. 5K/2K topology]
60. Summary
§ Extensive validation of the Hadoop workload
§ Reference architecture
  Makes it easy for the Enterprise
  Demystifies the network for Hadoop deployment
  Integration with Enterprise design, with efficient choices of network topology/devices
§ 10G and/or dual attached servers provide consistent job completion time & better buffer utilization
§ 10G provides reduced burst at the access layer
§ A single attached node failure has considerable impact on job completion time
§ A dual attached server design is recommended - 1G or 10G; 10G for future proofing
§ Rack failure has the biggest impact on job completion time
§ Does not require a non-blocking network
§ The degree of oversubscription does impact job completion time
§ Latency does not matter much for the Hadoop workload
61. Big Data @ Cisco
Cisco.com Big Data: www.cisco.com/go/bigdata
128 Node / 1PB test cluster
Certifications and solutions with UCS C-Series and Nexus 5500+22xx:
• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief
Multi-month network and compute analysis testing (in conjunction with Cloudera):
• Network/Compute Considerations Whitepaper
• Presented analysis at Hadoop World
62. THANK YOU FOR
LISTENING
Nimish Desai – nidesai@cisco.com
Technical Leader, Data Center Group
Cisco Systems Inc.
63. Break!
Break takes place in the Community Showcase (Hall 2)
Sessions will resume at 3:35pm