Reference Architecture - A Validated & Tested Approach to Network Design
1. Hadoop - Validated Network Architecture and Reference Deployment in Enterprise
Nimish Desai – nidesai@cisco.com
Technical Leader, Data Center Group
Cisco Systems Inc.
2. Session Objectives & Takeaways
Goal 1: Provide Reference Network
Architecture for Hadoop in Enterprise
Goal 2: Characterize the Hadoop Application's
Behavior on the Network
Goal 3: Network Validation Results with
Hadoop Workload
4. Validated 96 Node Hadoop Cluster
Two topologies validated side by side:
- Traditional DC Design (Nexus 55xx/2248): Nexus 7000 pair at the distribution layer, Nexus 5548 pair, 2248TP-E fabric extenders at the top of rack
- Nexus 7K-N3K based Topology: Nexus 7000 pair at the distribution layer, Nexus 3000 at the top of rack
In both topologies: Name Nodes and Data Nodes 1-48 / 49-96 on Cisco UCS C200, single NIC.
§ Hadoop Framework
  Apache 0.20.2
  Linux 6.2
  Slots - 10 Maps & 2 Reducers per node
§ Compute - UCS C200 M2
  Cores: 12
  Processor: 2 x Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
  Disk: 4 x 2TB (7.2K RPM)
  Network: 1G: LOM, 10G: Cisco UCS P81E
§ Network
  Three racks, each with 32 nodes
  Distribution Layer - Nexus 7000 or Nexus 5000
  ToR - FEX or Nexus 3000, 2 FEX per rack
  Each rack with either 32 single or dual attached hosts
6. Big Data Application Realm - Web 2.0 & Social/Community Networks
§ Data lives/dies in Internet-only entities
§ Data Domain - partially private data, UI service, data store
§ Homogeneous Data Life Cycle
  Mostly unstructured
  Web centric, user driven
  Unified workload - few processes & owners
  Typically non-virtualized
§ Scaling & Integration Dynamics
  Purpose-driven apps
  Thousands of nodes
  Hundreds of PB and growing exponentially
7. Big Data Application Realm - Enterprise
§ Data lives in a confined zone of the enterprise repository
§ Long lived, regulatory and compliance driven
§ Heterogeneous Data Life Cycle
§ Many data models
§ Diverse data - structured and unstructured
§ Diverse data sources - subscriber based
§ Diverse workload from many sources/groups/processes/technologies
§ Virtualized and non-virtualized, mostly SAN/NAS based
§ Scaling & Integration Dynamics are different
  § Data Warehousing (structured) with diverse repositories + unstructured data
  § Few hundred to a thousand nodes, few PB
  § Integration, policy & security challenges
  § Each app/group/technology limited in
    § data generation
    § consumption
    § servicing confined domains
[Diagram: typical enterprise application mix - Call Center, Sales Pipeline, ERP Module A/B, Doc Mgmt A/B, Records Mgmt, Data Services, Social Media, Office Apps, Video Conf, Collab, Product Catalog, Exec Reports, VOIP Data, Customer DB (Oracle/SAP)]
8. Big Data Framework Application Comparison
Relational Database
• Structured data - rows oriented
• Optimized for OLTP/OLAP
• Rigid schema applied to data on insert/update
• Read and write (insert, update) many times
• Non-linear scaling
• Most transactions and queries involve a small subset of the data set
• Transactional - scaling to thousands of queries
• GB to TBs in size
Batch-oriented Big Data (Hadoop)
• Unstructured data - files, logs, web clicks
• Data format is abstracted to the higher-level application programming
• Schema-less, flexible for later re-use
• Write once, read many
• Data never dies
• Linear scaling
• Entire data set at play for a given query
• Multi PB
Real-time Big Data (NoSQL)
• HBase, Cassandra, Oracle
• Structured and unstructured data
• Sparse column-family data storage or key-value pair
• Not an RDBMS, though with some schema
• Random read and write
• Modeled after Google's BigTable
• High transaction - real-time scaling to millions
• Not suited for ad-hoc analysis
• More suited for ~1 PB
9. Data Sources
Big Data: machine logs, sensor data, call data records, web click-stream data, satellite feeds, GPS data, sales tracking data, blogs, emails, pictures, video
Enterprise Application: sales, products, process, inventory, finance, payroll, shipping, authorization, customers, profile
Both feed the column store for Business Intelligence.
10. Big Data Building Blocks into the Enterprise
[Diagram: Big Data application sources - click streams, event data, social media, mobility trends, sensor logs - feed over the Cisco Unified Fabric (virtualized, bare metal and cloud) into three blocks:]
• "Big Data" NoSQL - real-time capture, store and analyze
• "Big Data" Storage - Hadoop
• Traditional Database - RDBMS, read and update operations, SAN and NAS
11. Infinite Use Cases
§ Web & E-Commerce
Faster User Response
Customer Behaviors & Pricing Models
Ad Targeting
§ Retail
Customer Churn & Integration of brick &
mortar with .com business models
PoS Transactional Analysis
§ Insurance & Finance
Risk Management
User Behavior & Incentive Management
Trade Surveillance for Financials
§ Network Analytics – Splunk
Text Mining
Fault Prediction
§ Security & Threat Defense
13. Hadoop Components and Operations
Hadoop Distributed File System
§ Data is not centrally located; data is stored across all data nodes in the cluster
§ Scalable & fault tolerant
§ Data is divided into multiple large blocks - 64MB default, 128MB typical
§ Blocks are not related to disk geometry
§ Data is stored reliably; each block is replicated 3 times
§ Types of functions
  § Name Node (Master) - manages the cluster
  § Data Node (Map and Reducer) - carries blocks
[Diagram: a file split into blocks 1-6, distributed across data nodes 1-15 in three racks behind ToR FEX/switches]
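To make the block and replica arithmetic concrete, here is a minimal sketch (illustrative, not from the deck) of how a file's block count and raw storage footprint fall out of the defaults above; it matches the 7,813 map tasks quoted later on slide 23, since Hadoop runs one map task per block:

import math

def hdfs_footprint(file_size_gb, block_mb=128, replication=3):
    """Estimate block count and raw storage for one HDFS file.
    Assumptions (illustrative): decimal GB/MB, and the default
    replication factor of 3 stated on the slide."""
    blocks = math.ceil(file_size_gb * 1000 / block_mb)
    raw_storage_gb = file_size_gb * replication
    return blocks, raw_storage_gb

blocks, raw = hdfs_footprint(1000)          # a 1 TB file
print(f"{blocks} blocks, {raw} GB stored")  # 7813 blocks, 3000 GB stored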
14. Hadoop Components and Operations
§ Name Node
  Runs a scheduler - the Job Tracker
  Manages all data nodes, in memory
  Secondary Name Node - snapshot of the metadata of the HDFS cluster
  Typically all three JVMs can run on a single node
§ Data Node
  Task Tracker receives job info from the Job Tracker (Name Node)
  Map & Reduce tasks are managed by the Task Tracker
  Configurable ratio of Map & Reduce tasks for various workloads, per node/CPU/core
  Data Locality - if data is not available where the map task is assigned, the missing block will be copied over the network
[Diagram: Name Node managing data nodes 1-15 in three racks behind ToR FEX/switches]
15. Characteristics that Affect Hadoop Clusters
§ Cluster Size
  Number of data nodes
§ Data Model & Mapper/Reducer Ratio
  MapReduce functions
§ Input Data Size
  Total starting dataset
§ Data Locality in HDFS
  Ability to process data where it is already located
§ Background Activity
  Number of jobs running, type of jobs, importing and exporting
§ Characteristics of Data Node
  ‒ I/O, CPU, memory, etc.
§ Networking Characteristics
  ‒ Availability
  ‒ Buffering
  ‒ Data node speed (1G vs. 10G)
  ‒ Oversubscription
  ‒ Latency
http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/
16. Hadoop Components and Operations
Hadoop Distributed File System - Unstructured Data
§ Data Ingest & Replication
  External connectivity
  East-west traffic (replication of data blocks)
§ Map Phase - raw data is analyzed and converted to name/value pairs
  Workload translates to multiple batches of Map tasks
  Reducers can start the reduce phase ONLY after the entire Map set is complete
§ Mostly an IO/compute function
[Diagram: Map tasks feed the Shuffle Phase, where name/value pairs are grouped by key (Key 1 ... Key 4) and handed to Reduce tasks, producing the Result/Output]
17. Hadoop Components and Operations
Hadoop Distributed File System - Unstructured Data
§ Shuffle Phase - all name/value pairs are sorted and grouped by their keys
  § Mappers send the data to Reducers
  § High network activity
§ Reduce Phase - all values associated with a key are processed for results, in three phases:
  Copy - get intermediate results from each data node's local disk
  Merge - reduce the number of files
  Reduce method
§ Output Replication Phase - Reducers replicate results to multiple nodes
  Highest network activity
§ Network activity is dependent on workload behavior
[Diagram: Map tasks feed the Shuffle Phase (keys grouped per reducer) and Reduce tasks produce the Result/Output]
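The map -> shuffle (group by key) -> reduce flow these two slides describe can be sketched in a few lines of Python (a toy illustration, not Hadoop code):

from collections import defaultdict

docs = ["to be or not to be", "the network is the computer"]

# Map phase: raw data converted to name/value pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: pairs grouped by key - in a real cluster this is
# the step that crosses the network between mappers and reducers
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: all values associated with a key are processed
result = {key: sum(values) for key, values in groups.items()}
print(result)   # {'to': 2, 'be': 2, 'or': 1, ...}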
18. MapReduce Data Model
ETL & BI Workload Benchmark
The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.
Yahoo TeraSort - ETL Workload - Most Network Intensive
[Timeline: Map Start, Reducers Start, Map Finish, Job Finish]
• Input, shuffle and output data size is the same - e.g. a 10 TB data set in all phases
• Yahoo Terasort has more balanced Map vs. Reduce functions - linear compute and IO
Shakespeare WordCount - BI Workload
[Timeline: Map Start, Reducers Start, Map Finish, Job Finish]
• Data set size varies in the various phases - varying impact on the network, e.g. 1TB input, 10MB shuffle, 1MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data
19. ETL Workload (1TB Yahoo Terasort)
Network Graph of all Traffic Received on a Single Node (80 Node Run)
Shortly after the Reducers start, Map tasks are finishing and data is being shuffled to the reducers. As the Maps completely finish, the network is no longer used, as the Reducers have all the data they need to finish the job.
[Graph: the red line is the total amount of traffic received by hpc064; the other symbols represent individual nodes sending traffic to HPC064. Timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete]
20. ETL Workload (1TB Yahoo Terasort)
Network Activity of all Traffic Received on a Single Node (80 Node Run)
If output replication is enabled, then at the end of the terasort additional copies must be stored. For a 1TB sort, 2TB will need to be replicated across the network.
Output Data Replication Enabled
§ Replication of 3 enabled (1 copy stored locally, 2 stored remotely)
§ Each reduce output is now replicated, instead of just stored locally
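The 2TB figure follows directly from the replication factor; a quick sketch of the arithmetic (illustrative only):

def replication_traffic_tb(output_tb, replication=3):
    """Network traffic generated by output replication: one copy is
    written locally, the remaining (replication - 1) copies cross the
    network - per the slide, a 1TB output with replication 3 sends
    2TB over the fabric."""
    return output_tb * (replication - 1)

print(replication_traffic_tb(1))  # 2 (TB)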
21. BI Workload
Network Graph of all Traffic Received on a Single Node (80 Node Run)
WordCount on 200K copies of the complete works of Shakespeare
Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilized throughout the job, but only by a limited amount.
[Graph: the red line is the total amount of traffic received by hpc064; the other symbols represent individual nodes sending traffic to HPC064. Timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete]
22. Data Locality in HDFS
Data Locality - the ability to process data where it is locally stored.
Observations
§ Notice the initial spike in RX traffic comes before the Reducers kick in.
§ It represents data each map task needs that is not local.
§ Looking at the spike, it is mainly data from only a few nodes.
Note: During the Map Phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. This is a consideration when choosing the replication factor: more replicas tend to create a higher probability of data locality.
[Graph timeline: Maps Start, Reducers Start, Maps Finish, Job Complete. Map tasks: initial spike for non-local data - sometimes a task is scheduled on a node that does not have the data available locally.]
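As a rough intuition for why more replicas raise the locality probability, consider a toy model (an assumption for illustration, not a measurement from the deck) in which map tasks land on nodes uniformly at random:

def p_local(replicas, nodes):
    """Probability that a randomly placed map task lands on a node
    holding a replica of its block (toy uniform-placement model; the
    real scheduler actively prefers local nodes, so actual locality
    is far higher - this only shows the trend)."""
    return min(replicas / nodes, 1.0)

# On the 80-node test cluster, each added replica linearly raises
# the chance that any given node already holds the block:
for r in (1, 2, 3, 5):
    print(r, p_local(r, 80))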
23. Map to Reducer Ratio Impact on Job Completion
§ A 1 TB file with 128 MB blocks == 7,813 Map tasks
§ The job completion time is directly related to the number of reducers
§ Average network buffer usage falls as the number of reducers gets lower (see hidden slides), and vice versa
[Charts: Job Completion Time in seconds vs. number of reducers - one chart for 192/96/48 reducers on a 0-800 s scale, and the full range 192, 96, 48, 24, 12, 6 on a 0-30,000 s scale; completion time grows steeply as the reducer count drops]
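Why fewer reducers stretch the job: in TeraSort the shuffle size roughly equals the input size, so each reducer must pull its share of the whole data set across the network. The 192 figure matches the cluster's capacity from slide 4 (96 nodes x 2 reduce slots per node); the per-reducer arithmetic below is an illustrative sketch:

def data_per_reducer_gb(input_gb, reducers):
    """With TeraSort, shuffle bytes ~= input bytes, so each reducer
    pulls input_gb / reducers over the network."""
    return input_gb / reducers

for r in (192, 96, 48, 24, 12, 6):
    print(r, round(data_per_reducer_gb(1000, r), 1), "GB")
# 192 -> 5.2 GB ... 6 -> 166.7 GB: fewer reducers means far more
# data (and time) per reducer, stretching job completion accordingly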
27. Network Characteristics
The relative impact of various network characteristics on Hadoop clusters*:
  Availability
  Buffering
  Oversubscription
  Data Node Speed
  Latency
* Not scaled or measured data
32. Hadoop Network Topologies - Reference
Unified Fabric & ToR DC Design
§ Integration with the Enterprise architecture - essential pathway for data flow
  Integration
  Consistency
  Management
  Risk-assurance
  Enterprise grade features
§ Consistent operational model
  NX-OS, CLI, fault behavior and management
§ Though higher BW east-west compared to traditional transactional networks
§ Over time it will take on multi-user, multi-workload behavior
  Needs enterprise centric features
  Security, SLA, QoS etc.
Reference topologies:
§ 1Gbps Attached Server
  § Nexus 7000/5000 with 2248TP-E
  § Nexus 7000 and 3048
§ NIC Teaming - 1Gbps Attached
  § Nexus 7000/5000 with 2248TP-E
  § Nexus 7000 and 3048
§ 10 Gbps Attached Server
  § Nexus 7000/5000 with 2232PP
  § Nexus 7000 and 3064
§ NIC Teaming - 10 Gbps Attached Server
  § Nexus 7000/5000 with 2232PP
  § Nexus 7000 & 3064
33. Validated Reference Network Topology
(Same two validated topologies and cluster configuration as slide 4: Traditional DC Design with Nexus 55xx/2248 and the Nexus 7K-N3K based topology - Apache 0.20.2 on 96 UCS C200 M2 single-NIC nodes across three racks.)
35. High Availability Switching Design
Common High Availability Engineering Principles
§ The core high availability design principles are common across all network systems designs
§ Understand the causes of network outages
  Component failures
  Network anomalies
§ Understand the engineering foundations of systems-level availability
  Device and network level MTBF
  Understanding hierarchical and modular design
  Understand the HW and SW interaction in the system
§ Enhanced vPC allows such topologies and is ideally suited for Big Data applications
§ In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port
System high availability is a function of topology and component-level high availability.
[Diagram: L3 - dual node, full mesh; L2 - dual node, full mesh; ToR - dual node; NIC teaming at the host - single NIC, dual NIC 802.3ad, dual NIC active/standby]
36. Availability with Single Attached Server
1G or 10G
§ Important to evaluate the overall availability of the system
  Network failures can span many nodes in the system, causing rebalancing and decreased overall resources. Typically multiple TB of data transfer occurs for a single ToR or FEX failure.
  Load sharing, ease of management and a consistent SLA are important to enterprise operation
§ Failure Domain Impact on Job Completion
  § A 1 TB Terasort typically takes ~4.20-4.30 minutes
  § A failure of a SINGLE NODE (either NIC or server component) results in roughly doubling of the job completion time
§ Key observation: the failure impact is dependent on the type of workload being run on the cluster
  Short-lived interactive vs. short-lived batch
  Long jobs - ETL, normalization, joins
[Topology: single NIC, 32 nodes per ToR]
37. Single Node Failure Job Completion Time
§ The Map tasks are executed in parallel, so the unit time for each Map task/node remains the same and all nodes complete their work at roughly the same time.
§ However, during a failure, a set of Map tasks remains pending (since the other nodes in the cluster are still completing their own tasks) until ALL the nodes finish their assigned tasks.
§ Once all the nodes finish their Map tasks, the leftover Map tasks are reassigned by the name node. The unit time to finish those sets of Map tasks remains the same (linear) as the time it took to finish the other Maps - they just happen NOT to be done in parallel, which is why the job completion time can roughly double. This is the worst-case scenario with Terasort; other workloads may have variable completion times.
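A back-of-the-envelope model of that worst case (a sketch under stated assumptions - 7,813 maps from the 1TB/128MB example, 96 nodes, 10 map slots per node - not the deck's measurement method):

import math

def waves(tasks, nodes, slots):
    """Parallel map 'waves': all slots run concurrently, so the job
    needs ceil(tasks / (nodes * slots)) rounds."""
    return math.ceil(tasks / (nodes * slots))

normal = waves(7813, 96, 10)        # ~9 parallel waves for the whole job
orphaned = math.ceil(7813 / 96)     # ~82 maps lost with one failed node
# If the orphaned maps are re-run serially on one node's worth of slots
# instead of spread in parallel, they add about as many waves as the
# entire normal run - consistent with the observed rough doubling:
extra = math.ceil(orphaned / 10)    # ~9 additional serial waves
print(normal, extra)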
40. Availability with Dual Attached Server
1G and 10G - Server NIC Teaming Topologies
§ Dual homing (active-active) network connections from the server allow
  Reduced replication and data movement during failure
  Optimal load-sharing
§ Dual homing the FEX avoids a single point of failure
§ Enhanced vPC allows such topologies and is ideally suited for Big Data applications
§ In an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port
  § Supported with Nexus 5500 only
§ Alternatively, Nexus 3000 vPC allows host-level redundancy with ToR ECMP
[Topologies: single NIC, dual NIC 802.3ad, dual NIC active/standby]
41. Availability
Single Attached vs. Dual Attached Node
§ No single point of failure from the network viewpoint; no impact on job completion time
§ NIC bonding configured in Linux - with the LACP mode of bonding
§ Effective load-sharing of traffic flows across the two NICs
§ Recommended to change the hashing to src-dst-ip-port (both on the network and for NIC bonding in Linux) for optimal load-sharing
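Why the src-dst-ip-port hash balances better: flows between the same pair of nodes still differ in ports, so they can spread across both links. A minimal sketch of such a hash decision (illustrative only, not the switch's or the bonding driver's actual algorithm):

import zlib

def pick_link(src_ip, dst_ip, src_port, dst_port, links=2):
    """Hash the 4-tuple and pick one of the bonded links: any single
    flow sticks to one link (no packet reordering), while distinct
    flows spread across both."""
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    return zlib.crc32(key) % links

# Two flows between the same pair of nodes can land on different
# links - something a src-dst-ip-only hash would never allow:
print(pick_link("10.0.0.5", "10.0.0.9", 50010, 35001))
print(pick_link("10.0.0.5", "10.0.0.9", 50010, 35002))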
42. Availability Network Failure Result - 1TB Terasort - ETL
§ Failure of various components (96 nodes, 2 FEX per rack, FEX/ToR A and B per rack, racks 1-3)
§ Failures introduced at 33%, 66% and 99% of reducer completion
§ A singly attached NIC server & rack failure has a bigger impact on job completion time than any other failure
§ A FEX failure is a RACK failure for the 1G topology

Job Completion Time (sec) with Various Failures
Failure Point           | 1G Single Attached | 2G Dual Attached
Peer Link 5000          | 301                | 258
FEX *                   | 1137               | 259
Rack *                  | 1137               | 1017
1 Port - Single Attached| See previous slide | See previous slide
1 Port - Dual Attached  | See previous slide | See previous slide
* Variance in run time with % reducer completed
44. Cluster Scaling
Nexus 7K/5K & FEX - 2248TP-E or 2232
§ 1G Based - Nexus 2248TP-E
  48 1G host ports and up to 4 uplinks bundled into a single port channel
§ 10G Based - Nexus 2232
  32 10G host ports and up to 8 uplinks bundled into a single port channel
Nexus 2248TP-E and 2232 support both local port channels and vPC for distributed port channels.
[Diagram: FEX uplinks and host interfaces; hosts attached via 802.3ad & vPC or single attached]
45. Oversubscription Design
§ Hadoop is a parallel, batch-job oriented framework
§ The primary benefit of Hadoop is the reduction in job completion time for work that would otherwise take longer with traditional techniques, e.g. large ETL, log analysis, join-only-Map jobs
§ Typically more oversubscription occurs with 10G server access than at 1G
§ A non-blocking network is NOT needed; however, the degree of oversubscription matters for
  Job completion time
  Replication of results
  Oversubscription during rack or FEX failure
§ Static vs. actual oversubscription
  How much data a single node can push is often IO bound and a function of the number of disks configured

Uplinks | Theoretical Oversubscription (16 Servers) | Measured
8       | 2:1                                       | Next slides
4       | 4:1                                       | Next slides
2       | 8:1                                       | Next slides
1       | 16:1                                      | Next slides
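The theoretical column is just the ratio of host-facing bandwidth to uplink bandwidth. A quick sketch (assuming 16 10G-attached servers behind a FEX with 10G uplinks, matching the table):

def oversubscription(servers=16, host_gbps=10, uplinks=4, uplink_gbps=10):
    """Static (theoretical) oversubscription = host bandwidth / uplink
    bandwidth. Actual oversubscription is usually lower, because disk
    I/O bounds how much each node can really push."""
    return (servers * host_gbps) / (uplinks * uplink_gbps)

for up in (8, 4, 2, 1):
    print(f"{up} uplinks -> {oversubscription(uplinks=up):g}:1")
# 8 -> 2:1, 4 -> 4:1, 2 -> 8:1, 1 -> 16:1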
46. Network Oversubscriptions
§ Steady state
§ Result replication with 1, 2, 4 & 8 uplinks
§ Rack failure with 1, 2, 4 & 8 uplinks
47. Data Node Speed Differences
1G vs. 10G TCPDUMP of Reducers TX
• Generally 1G is used, largely due to cost/performance trade-offs, though 10GE can provide benefits depending on the workload
• Reduced spikes with 10G and smoother job completion time
• Multiple 1G or 10G links can be bonded together to not only increase bandwidth, but also increase resiliency
48. 1GE vs. 10GE Buffer Usage
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer.
[Chart: buffer cell usage vs. job completion, plotting 1G Buffer Used, 10G Buffer Used, 1G Map %, 1G Reduce %, 10G Map %, 10G Reduce %]
By moving to 10GE, the data node has a wider pipe to receive data, lessening the need for buffers on the network, as the total aggregate transfer rate and amount of data do not increase substantially. This is due, in part, to the limits of I/O and compute capabilities.
50. Multi-use Cluster Characteristics
Hadoop clusters are generally multi-use. The effect of background use can affect any single job's completion.
[Example view of 24-hour cluster use: a given cluster running many different types of jobs, importing into HDFS, etc. A large ETL job overlaps with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs and purple lines are BI jobs)]
51. 100 Jobs each with 10GB Data Set
Stable, Node & Rack Failure
• Almost all jobs are impacted by a single node failure
• With multiple jobs running concurrently, the node failure impact is as significant as a rack failure
53. Burst Handling and Queue Depth
• Several HDFS operations and phases of MapReduce jobs are very bursty in nature
• The extent of the bursts largely depends on the type of job (ETL vs. BI)
• Bursty phases include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase
Optimal Buffering
• A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb bursts
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior) have been proposed, rather than huge-buffer switches
  http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
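For intuition on why shuffle traffic is an incast problem, consider a single reducer pulling intermediate data from every mapper at once (the chunk size below is a made-up illustration, not a measured value):

def incast_burst_mb(senders, chunk_kb=64):
    """If `senders` mappers each put one chunk on the wire to the same
    reducer simultaneously, the reducer's egress port must queue
    whatever exceeds the line rate for that instant."""
    return senders * chunk_kb / 1024

# 95 other data nodes answering one reducer with 64KB chunks:
print(incast_burst_mb(95))   # ~5.9 MB arriving at once at one port,
# to be absorbed by shared buffer such as the 2248TP-E's 32MB (next slide)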
54. Nexus 2248TP-E Buffer Monitoring
§ The Nexus 2248TP-E utilizes a 32MB shared buffer to handle larger traffic bursts
§ Hadoop, NAS and AVID are examples of bursty applications
§ You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network)
§ Extensive drop counters
  § Provides drop counters for both directions, network-to-host and host-to-network, on a per host interface basis
  § Drop counters for different reasons:
    • Out of buffer drop, no credit drop, queue limit drop (tail drop), MAC error drop, truncation drop, multicast drop
§ Buffer occupancy counter
  § How much buffer is being used - one key indicator of congestion or bursty traffic

N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
fex-110# show platform software qosctrl asic 0 0
55. Buffer Monitoring
switch# attach fex 110
Attaching to FEX 110 ...
To exit type 'exit', to abort type '$.'
fex-110# show platform software qosctrl asic 0 0
number of arguments 4: show asic 0 0
----------------------------------------
QoSCtrl internal info {mod 0x0 asic 0}
mod 0 asic 0:
port type: CIF [0], total: 1, used: 1
port type: BIF [1], total: 1, used: 0
port type: NIF [2], total: 4, used: 4
port type: HIF [3], total: 48, used: 48
bound NIF ports: 2
N2H cells: 14752
H2N cells: 50784
----Programmed Buffers---------
Fixed Cells : 14752
Shared Cells : 50784   <-- Allocated buffer in terms of cells (512 bytes each)
----Free Buffer Statistics-----
Total Cells : 65374
Fixed Cells : 14590
Shared Cells : 50784   <-- Number of free cells to be monitored
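To relate these cell counters to the 32MB shared buffer quoted on the previous slide, convert cells to bytes (512-byte cells, per the output annotation above):

def cells_to_mb(cells, cell_bytes=512):
    """Convert the qosctrl cell counters to megabytes."""
    return cells * cell_bytes / (1024 * 1024)

print(round(cells_to_mb(50784), 1))  # shared cells ~= 24.8 MB
print(round(cells_to_mb(65374), 1))  # total cells  ~= 31.9 MB, i.e. ~32MB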
57. Buffer depth monitoring: interface
§ Real time command displaying the status of the shared buffer
§ XML support will be added in the maintenance release
§ Counters are displayed in cell count; a cell is approximately 208 bytes
show hardware internal buffer info pkt-stats [brief|clear|detail]
[Output callouts: buffer usage, free buffer space on the platform, total buffer, max buffer usage since clear]
58. TeraSort (ETL) N3K Buffer Analysis (10TB)
• Optimized buffer sizes are required to avoid packet loss leading to slower job completion times
• Buffer utilization is highest during the shuffle and output replication phases
• The aggregation switch buffer remained flat, as the bursts were absorbed at the Top of Rack layer
[Chart: ToR buffer usage over time for a 10TB TeraSort, with Map and Reduce progress curves; buffer usage peaks during the shuffle phase and again during output replication]
59. Network Latency
Generally, network latency - while consistent latency is important - does not represent a significant factor for Hadoop clusters.
Note: There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
[Chart: completion time (sec) vs. data set size (1TB, 5TB, 10TB) on an 80 node cluster, N3K topology vs. 5K/2K topology]
60. Summary
§ Extensive validation of the Hadoop workload
§ Reference architecture
  Makes it easy for the Enterprise
  Demystifies the network for Hadoop deployment
  Integration with Enterprise design, with efficient choices of network topology/devices
§ 10G and/or dual attached servers provide consistent job completion time & better buffer utilization
§ 10G provides reduced burst at the access layer
§ A single attached node failure has considerable impact on job completion time
§ A dual attached server design is recommended - 1G or 10G; 10G for future proofing
§ Rack failure has the biggest impact on job completion time
§ Does not require a non-blocking network
§ The degree of oversubscription does impact job completion time
§ Latency does not matter much for the Hadoop workload
61. Big Data @ Cisco
Cisco.com Big Data: www.cisco.com/go/bigdata
128 Node / 1PB test cluster
Certifications and solutions with UCS C-Series and Nexus 5500+22xx:
• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief
Multi-month network and compute analysis testing (in conjunction with Cloudera):
• Network/Compute Considerations Whitepaper
• Presented analysis at Hadoop World
62. THANK YOU FOR
LISTENING
Nimish Desai – nidesai@cisco.com
Technical Leader, Data Center Group
Cisco Systems Inc.
63. Break!
Break takes place in the Community Showcase (Hall 2)
Sessions will resume at 3:35pm