2. Application Architect and Networking
Traditionally, the application architect's foray into networking dealt with solving the server I/O bottleneck
and offloading the CPU. Architects focused on the I/O bottleneck in order to minimize wasted
CPU cycles: technologies such as RSS, LSO and TSO were incorporated into intelligent NICs to
load-balance traffic across the multiple cores in a server and thereby avoid CPU starvation.
A parallel focus, driven by the cost savings of converging storage and Ethernet traffic, was
the converged NIC (CNIC), which carried storage and Ethernet traffic on a single wire.
Virtualization did not shift the focus away from the I/O bottleneck. PCIe innovations such
as SR-IOV and MR-IOV were incorporated into CNICs; IOV technologies enabled vNICs and
VM-specific offload services such as hypervisor bypass.
The scale of applications in the Web 1.0 world did not require application architects to
focus on network topology, segmentation or control-plane protocols; a 3-tiered datacenter
network was sufficient.
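The RSS mechanism mentioned above can be sketched in miniature. This is a hedged illustration, not a NIC implementation: real NICs use a Toeplitz hash with a driver-supplied key, and the queue count and flow tuples below are invented for the example. The sketch only shows how hashing spreads flows across receive queues instead of all packets landing on core 0:

```python
import hashlib

def rss_queue(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_queues: int) -> int:
    """Pick a receive queue for a flow by hashing its tuple.

    Every packet of a flow hashes identically, so a flow stays on
    one core (preserving cache locality) while different flows
    spread across all cores.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_queues

# Hypothetical traffic: 64 client flows hitting one server port,
# spread over 4 receive queues (one per core).
queues = [rss_queue(f"192.0.2.{i}", "198.51.100.1", 40000 + i, 80, 4)
          for i in range(64)]
```

With a single receive queue, every one of these flows would interrupt core 0; with four queues the interrupt load spreads across the cores.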
[Figure: Networking focus of the application architect in Web 1.0 — an intelligent NIC providing TCP offload, a converged wire, RSS/LSO and flow classification.]
3. Why Care About Network Topology
Today, the network plays a critical role in distributed application execution. Two key service assurances
– latency and bandwidth – are influenced by the network.
• Today's programming frameworks widely
use asynchronous I/O, latency shifting
(caching) and message-based
communication. These frameworks enable
application logic and data to be distributed
among tens of thousands of servers across
multiple tiers. The nodes within a tier and
across tiers communicate synchronously or
asynchronously over a routed IP network.
• A distributed application execution
environment has to arbitrate the tradeoffs
between latency and bandwidth, both of
which are greatly influenced by the underlying
network topology and routing control plane.
[Figure: Distributed system latency hierarchy — CPU/L1/L2 caches, DRAM and disk per server, aggregated into racks and clusters. Local system: memory 80 ns, disk 10 ms. Local rack: memory 200 µs, disk 28 ms. Remote rack: memory 500 µs, disk 30 ms.]
4. Network Topology – Graph Model
Network layout is the combination of a chosen topology (a design decision) and a chosen technology
(an architecture decision). A graph is a concise and precise notation for describing a network topology.
• Crossbar
• Good for small input/output counts
• Complexity is N^2, where N is the number of
inputs/outputs
• Number of switches required is N^2 –
a problem when N is large
• Fat-tree Clos
• Can be non-blocking (1:1) or blocking (x:1)
• Characterized as Clos(m, n, r)
• Complexity is (2n + r) × rn switches
• Torus
• Blocking network, but great at scale
• Optimized for data locality
• Good for growth and hybrid networks
• Complexity increases with switch port
count: roughly 2 × log_(k/2)(N), where k = port
count and N = number of servers
• High-port-count switches are better suited to
Clos than to tori
• Direct and Indirect Topologies
• Crossbar and fat-tree Clos are indirect networks,
i.e. the nodes are not part of the network
topology. A torus is a direct network.
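The switch-count arithmetic above can be checked with a short sketch. The port counts in the example are illustrative assumptions, not from the source; the counts follow the slide's formulas, N^2 for a crossbar and (2n + r) × rn for a non-blocking Clos with m = n:

```python
def crossbar_complexity(n_ports: int) -> int:
    """Crosspoints in a single crossbar: N^2 for N ports."""
    return n_ports ** 2

def clos_complexity(n: int, r: int) -> int:
    """Three-stage non-blocking Clos(m, n, r) with m = n:
    2rn*m ingress/egress crosspoints plus m*r^2 middle-stage
    crosspoints, which simplifies to (2n + r) * r * n.
    The total port count is N = r * n."""
    return (2 * n + r) * r * n

# Hypothetical 1024-port fabric: 32 leaves (r) of 32 ports (n) each.
n, r = 32, 32
ports = n * r                      # 1024
print(crossbar_complexity(ports))  # 1048576
print(clos_complexity(n, r))       # 98304
```

At 1024 ports, the Clos needs roughly a tenth of the crossbar's switching elements, which is exactly why the crossbar is only "good for small input/output".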
[Figure: An n×n crossbar; a 3-stage fat-tree Clos(m, n, r) with r ingress and r egress switches of n ports each and m middle-stage switches; a 2D torus.]
5. Characterizing Network Performance
Latency =
Sending overhead +
T_LinkProp × (d + 1) +
(T_r + T_s + T_a) × d +
PacketSize/BW × (d + 1) +
Receiving overhead
where
d = number of hops
T_r = switch routing delay
T_a = switch arbitration delay
T_s = switch switching delay (pin-to-pin)
T_LinkProp = per-link propagation delay
Effective Bandwidth = min of (
N × BW_Ingress, s × N,
r × (BW_Bisection / g),
s × N × BW_Egress )
where
s is the fraction of traffic that is accepted
r is the network efficiency
g is the fraction of traffic that crosses the bisection
Network Performance
• Port buffers directly affect s. Port buffers
sized to the length of the link optimize s,
and s = 1 can then be assumed.
• g is directly correlated with the application
traffic pattern. A well-distributed application
will max out BW_Bisection.
• r, the network efficiency, is a function of
multiple factors; the most prominent are link
and routing efficiency, i.e. the control plane.
• Effective bandwidth is the bandwidth
between user and application, i.e. north-south.
Bisection bandwidth is the minimum bandwidth
across any cut that divides the network into
two halves, i.e. east-west.
Network topology determines the hop count, i.e. the paths through the network, and therefore both
bisection bandwidth and latency. Application traffic patterns drive the remaining performance metrics.
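The latency equation above can be turned into a small calculator. The numbers in the example are illustrative assumptions (a 3-hop path of 10 Gb/s links, 100 ns per switch stage), not measurements from the source:

```python
def packet_latency(d: int, t_link_prop: float, t_r: float, t_s: float,
                   t_a: float, packet_bits: int, bw_bps: float,
                   send_overhead: float = 0.0,
                   recv_overhead: float = 0.0) -> float:
    """End-to-end latency in seconds, per the formula above.

    d switch hops means d + 1 links to propagate and serialize
    over, and d switches each adding routing (T_r), switching
    (T_s) and arbitration (T_a) delay.
    """
    propagation = t_link_prop * (d + 1)
    switching = (t_r + t_s + t_a) * d
    serialization = (packet_bits / bw_bps) * (d + 1)
    return send_overhead + propagation + switching + serialization + recv_overhead

# A 1500-byte packet over 3 hops of 10 Gb/s links.
t = packet_latency(d=3, t_link_prop=5e-9, t_r=1e-7, t_s=1e-7, t_a=1e-7,
                   packet_bits=1500 * 8, bw_bps=10e9)
```

With these inputs, serialization (4.8 µs of the ~5.7 µs total) dominates, which is why hop count matters: each extra hop adds another full serialization of the packet.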
6. Traditional Datacenter Network
Traditionally, datacenter networks were optimized to remove bottlenecks in north-south traffic,
i.e. to maximize effective bandwidth. That architecture, however, is not suitable for a distributed
application whose dominant flows traverse east-west.
Main Issues with this Architecture
• The topology is a single-rooted tree with a single
span/path between source and destination, which causes
bisection bandwidth to be much lower than effective
bandwidth, i.e. there are no multiple paths
• Traffic among servers is 4x or more the traffic
in/out of the datacenter
• Not optimized for small flows: observed flows inside
the datacenter are short, with 10–20 flows per server
• Adaptive routing is not fast enough; optimization
requires complex L2/L3 configuration
• The ratio of memory/disk-to-CPU bandwidth to
server-to-server bandwidth is at an all-time high,
which hurts distributed computing that depends on
inter-server bandwidth
[Figure: Traditional datacenter — access, aggregation and core tiers, with a single L2 domain below the L3 boundary.]
7. Changing Traffic Pattern in Datacenter
The ratio of north-south traffic coming into a web application to the traffic generated
inside the datacenter to serve that incoming session is observed to be 1:80 or higher.
[Figure: North-south traffic (1:80) enters a web app (GUI layer, business-logic layer, session cache), which fans out east-west over http-rpc or JMS calls to profile, messenger, groups, news and search services, a public-profile web app, an external ad server and an internal private cloud; replicated read-only/read-write databases front a core DB, with an update server applying graph and profile updates from the DB server over JDBC.]
8. Datacenter Fabric
The industry took two approaches to scaling the datacenter network: overlays and
interconnects.
• Issues that overlays address
• Multi-tenant scalability
• VM mobility
• Virtual network scalability
• VM placement
• Virtual-to-physical and virtual-to-virtual
communication scalability
• Asymmetry of network innovation between the
physical and virtual worlds
• What overlays do not address
• A standard way to terminate a tunnel on the
hypervisor and the physical switch
• The mapping between virtual and physical
addresses (who fills that table at the
border gateway?)
• Network flooding (ARP and L2 multicast)
• Topology-unaware and unoptimized
• Compatibility with ECMP
• Inter-datacenter traffic mobility
• Traffic tromboning, due to the L2 focus of overlays
• Future-proofing with SDN
Overlays should address the challenges
presented by
a. Highly distributed virtual applications such as
Hadoop/big data, where an application can span
multiple physical and virtual switches. Any overlay
tunnel should support both virtual and physical
endpoints
b. Sparse and intermittent connectivity of virtual
machines: the access switch may drop in and out of
participation in the virtual network
c. The dynamic nature of VMs: creation, deletion and
suspend/resume cycles present a challenge for
the network
d. Existing physical switches: overlays should work
without a software upgrade; only the first hop that
adds/removes packet markings should require a
new purchase
e. Failure containment: failure domains should be
limited to tunnel endpoints
f. The need to define multiple administrative domains
9. Datacenter Overlay Landscape
Overlay Technology | Adjacency | Pros | Cons
FabricPath | L2 | vPC support; ECMP up to 256 ways; faster convergence; multiple L2 VLANs | No inter-DC; needs ASIC support; not VM-aware; no support for FCoE
TRILL | L2 | Unlimited ECMP; SPF delivery of unicast; fast convergence | No inter-DC; needs ASIC support; new OA&M tools; not VM-aware
Shortest Path Bridging (802.1aq) | L2 | Supports existing Ethernet data-plane standards (.ah and .ad); unicast/multicast; faster convergence | 16-way ECMP only; limited market traction; not VM-aware
VXLAN | L2 | MAC-in-UDP with a 24-bit VNI; scalable; enables virtual L2 segments | Lacks an explicit control plane; requires IP multicast; needs ASIC support; virtual tunnel endpoints only
NVGRE (Microsoft) | L2 | GRE tunnels; most ASICs already support GRE | Does not use UDP, so outer packet headers cannot be leveraged
OTV/LISP | L2 | Datacenter interconnect | Limited platform support
VPN4DC | L3 | Proposed by service providers | Not much vendor support
There are multiple competing standards for overlays, i.e. for using L3 network infrastructure
to solve L2 scalability problems.
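The VXLAN row above hinges on an 8-byte header carrying a 24-bit VNI (RFC 7348). Below is a minimal sketch of packing just that header; it is illustrative only — a real VTEP also prepends outer Ethernet, IP and UDP headers around it:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header from RFC 7348:
    a flags byte of 0x08 (VNI present), 24 reserved bits,
    the 24-bit VNI, then 8 more reserved bits."""
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI must fit in 24 bits")
    # Two big-endian 32-bit words: flags in the top byte of the
    # first word, VNI in the top three bytes of the second.
    return struct.pack("!II", 0x08 << 24, vni << 8)

hdr = vxlan_header(5000)                  # header for segment 5000
vni = int.from_bytes(hdr[4:7], "big")     # decode the VNI back out
```

The 24-bit VNI is what lifts the segment count from 4096 VLANs (12-bit IDs) to about 16 million virtual L2 segments.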
10. Datacenter Fabric – Programmatic View
The management plane offers DevOps the opportunity to influence the path of their
application data over the network. It is also the plane used by cloud controllers to
provision resources along that path.
• Thus far, applications adapted to the network. With
the new management plane, the network can
adapt to the application.
• Intelligence shifts to the edge of the network.
Applications can use APIs to probe the network and
alter their consumption and constraints.
• Policy definition points can analyze network data
to find patterns that drive policy-creation
tools, e.g. triangulating a privacy zone, sampling at
100 Gb/s rates, etc.
• The network comes under pressure to scale
up/down to application needs. All the datacenter
fabric technologies aim to enable this elasticity in
the network.
[Figure: DevOps and cloud controllers drive OpenStack compute, storage and network service APIs; the network service programs the virtual switch in each server and the physical switches, where network virtualization technologies such as FabricPath, TRILL, VXLAN, NVGRE and SPB play.]
11. Virtual Networking
The industry has a few competing virtualization stacks. The components may differ,
but the networking issues are similar for DevOps.
Component | Embedded networking functionality DevOps needs to be aware of
Hypervisor | Implements the v-switch; examples of virtual switches include Cisco N1Kv, Open vSwitch, etc. Initiates vMotion, which requires L2 adjacency, i.e. within a VLAN. Challenges in scaling L2 across datacenters (DCI).
Virtual switches | VLAN-capable; port groups are associated with VLANs; the host processor does the packet processing. Challenges include trunking of links between switch and server, and mapping server VLANs (in the hypervisor) to physical-switch VLANs. VLAN scale is increasingly an issue, being resolved through encapsulation of L2 frames (VXLAN, NVGRE, FabricPath, TRILL).
Virtual NICs | Increasingly intelligent, with hardware-assisted vNICs; offloading to reduce TCP latency; teaming to increase bandwidth into the server; multi-tenancy with FEX (adapter and VM).
Cloud orchestration directors | What changed is scalability and integration with external orchestration systems. Distributed virtual switches (spanning servers) presented coordination challenges; the single control points are called directors. Each hypervisor in a cluster continues to switch at L2 independently, i.e. data paths are not centralized.
[Figure: Virtual networking basics — VMs and virtual service appliances (vFW, vSLB, vWAAS) attach to a virtual switch inside virtual servers, which connects to the physical network and is administered from a management center.]
12. Software Defined Networking
[Figure: SDN stack — orchestration and topology directors sit above a host-based centralized controller, which programs the physical network.]
Component | Description
Directors | Directors for orchestration and topology need to scale; the topology graph must scale to MSDC datacenters. What is the storage model (asset inventory, configuration, etc.)? No explicit DevOps support, i.e. no server and tooling for developers.
Controller | A centralized controller is yet to be proven for datacenter-class deployment; issues remain around scalability, redundancy, security, etc. Theoretically good for large-scale tables, but does not solve per-device table overflow. Programmability comes at the cost of configuration latency.
Physical network | The existing network, with support for OpenFlow.
[Figure: Three switches, each with a management plane, control plane and data plane (features and forwarding); SDN pulls the control plane out of each switch into the centralized controller.]
SDN decouples the control plane from the data plane, on the yet-to-be-proven assumption
that the economics of the two planes are distinct.
Note: a software-defined network is different from a software-driven
network. The latter is applications using available
APIs to provision network services for higher-level SLAs
such as reservation, security, etc.
13. Hyperscale Datacenter
HSDC addresses the scale-out networking requirements of very large datacenters with 100K+
hosts. Innovations target four key areas.
• Topology
- To overcome the limitations of the traditional tree, folded-Clos-
inspired topologies are used.
- Some topologies use the ToR as the leaf node, while
others, such as BCube, use host-based software switches as
leaves.
• CPU vs. ASIC
- Switch microarchitectures based on merchant silicon
implement a Clos inside the switch; InfiniBand started this
trend in the early 2000s.
- MSDC is biased towards merchant silicon, even though no
compelling feature has been identified.
• Layer 2 vs. Layer 3
- FabricPath and TRILL scale the layer-2 network by
encapsulating the original MAC frame inside a new transport
header. Other protocols prefer IP-inside-IP to scale the
network, e.g. Cisco's Vinci.
• Multipath Forwarding
- ECMP's static, hash-based load balancing has
increased TCP-layer latency. New proposals to introduce
dynamic traffic engineering are being discussed.
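The static-hash problem in the last bullet is easy to demonstrate. This is a hedged sketch with hypothetical flows: real switches hash the 5-tuple in hardware, typically with a CRC rather than SHA, but the failure mode is the same — the hash never looks at link utilization:

```python
import hashlib

def ecmp_link(flow: tuple, n_links: int) -> int:
    """Static ECMP: hash the flow 5-tuple once; every packet of
    the flow then takes the same equal-cost link, regardless of
    how loaded that link already is."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# 16 hypothetical flows (src, dst, proto, sport, dport) spread
# over 4 equal-cost uplinks.
flows = [("10.0.0.%d" % i, "10.0.1.1", 6, 33000 + i, 80)
         for i in range(16)]
load = [0] * 4
for f in flows:
    load[ecmp_link(f, 4)] += 1
```

With only a handful of flows, the per-link counts in `load` are typically uneven; two elephant flows landing on the same uplink while others sit idle is exactly the skew that motivates the dynamic traffic-engineering proposals above.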
Speaker notes
SR-IOV is a specification that allows a PCIe device to appear as multiple separate physical PCIe devices to a single server. MR stands for multi-root, so MR-IOV does the same for multiple servers.
RSS, receive-side scaling, spreads incoming packets across the available cores/CPUs. Traditionally, core 0 received all incoming traffic and therefore became the bottleneck.
LSO, Large Send Offload, lets the host hand the NIC a buffer larger than the MTU; the NIC segments it into frames, offloading that work from the CPU.
If you look at a distributed system as a hierarchy of stores (SRAM, DRAM and disk), then the art or science of distributing a network application boils down to managing the latency and bandwidth offered to an executing thread at different points in the distributed application. For example, an executing thread has to trade off a locally cached line against a fetch from memory; that tradeoff is arbitrated by the local operating system. Similarly, the tradeoff between reading/writing a local disk and a remote disk is arbitrated by a network operating system in concert with the application infrastructure manager. This is where the topology can serve as a catalyst or as an inhibitor. Knowing the topology, and therefore the capabilities or biases of the network operating system, enables a DevOps engineer or application architect to design better systems.
Network layout – that which is actually deployed – is a combination of chosen topology (design decision) plus chosen technology (architecture decision). The best way to analyze or design a topology is using graph theory.
CLOS
Classic paper on Clos: BlackWidow: High-Radix Clos Networks, S. Scott, D. Abts, J. Kim, W.J. Dally
rn inputs, rn outputs. (Note: in a switch the ports are bidirectional, so the graph looks like it has an even number of stages; a Clos always has an odd number of stages, i.e. 3, 5, 7, ….)
So rn = total switch port count.
2rnm + mr^2 switches (this is less than r^2 n^2, the complexity of a crossbar).
Let m = n (non-blocking); then you have rn inputs and
2rn^2 + nr^2 switches = (2n + r)rn
(a crossbar on the same ports would be (rn)^2 switches).
Optimal choice of n and r? It depends. For the N3064, n = 32 and r = number of leaves = number of spines (assuming m = n).
The proof for Clos is through mathematical induction, i.e. it is true for all n if it is true for n = 1, n − 1 and n + 1. When n = 1 and r = 1, i.e. C(1, 1, r), the Clos trivializes to a crossbar. For higher stage counts, i.e. r > 1, we have the following:
C(1) = N^2 switches (crossbar)
C(3) = 6N^(3/2) − 3N
C(5) = 16N^(4/3) − 14N + 3N^(2/3)
C(7) = 36N^(5/4) − 46N + 20N^(3/4) − 3N^(1/2)
C(9) = 76N^(6/5) − 130N + 86N^(4/5) − 26N^(3/5) + 3N^(2/5)
This says we need more stages to scale the network higher, i.e. it is a bad idea to just increase N (port count); m = spine width.
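The stage-count expressions above can be evaluated directly; the crossover illustrates the point that higher stage counts win at scale. The port counts in the example are illustrative:

```python
def clos_crosspoints(stages: int, N: float) -> float:
    """Crosspoint count of a symmetric s-stage Clos network on N
    ports, per the closed forms listed above (Clos, 1953)."""
    formulas = {
        1: lambda N: N**2,
        3: lambda N: 6*N**1.5 - 3*N,
        5: lambda N: 16*N**(4/3) - 14*N + 3*N**(2/3),
        7: lambda N: 36*N**(5/4) - 46*N + 20*N**(3/4) - 3*N**0.5,
        9: lambda N: 76*N**(6/5) - 130*N + 86*N**(4/5)
                     - 26*N**(3/5) + 3*N**(2/5),
    }
    return formulas[stages](N)

# At 100 ports, 3 stages are cheapest; by 10,000 ports,
# 5 stages already beat 3 — more stages to scale higher.
small_3, small_5 = clos_crosspoints(3, 100), clos_crosspoints(5, 100)
big_3, big_5 = clos_crosspoints(3, 1e4), clos_crosspoints(5, 1e4)
```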
The switch microarchitecture is optimized to improve s, r and g.
To make s = 1, buffer organizations mitigate HOL blocking. r is optimized through the design of pipelining, queuing, routing and arbitration within the switch boundary, and is calculated as
r = r_L × r_R × r_A × r_S × r_mArch × …
Traffic-matrix analysis research shows that the patterns are difficult to summarize, non-repeating and unpredictable, i.e. difficult to optimize for.
Failure-analysis research shows that failures are mostly small in size but long in duration.
Failures are mostly small in size (50% involve < 4 devices; 95% involve < 20 devices).
Downtimes can be significant: 95% < 1 min, 98% < 1 hr, 99.6% < 1 day, 0.09% > 10 days.
With 1:1 redundancy, 0.3% of failures hit all redundant components.
Use n:m redundancy.
Study of ARP in a datacenter http://www.nanog.org/meetings/nanog52/presentations/Tuesday/Karir-4-ARP-Study-Merit%20Network.pdf
TRILL vs FP http://www.networkworld.com/community/blog/full-tilt-boogie-networking-cisco’s-fabricpat
http://tools.ietf.org/html/draft-sridharan-virtualization-nvgre-00
http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00
Both VXLAN and NVGRE use a scalable L3 network infrastructure to scale L2. However, they do not solve the security issues: one can still poison the ARP cache, and on the physical network one can spoof the network. Also, there is no support for external physical endpoints.
VPN4DC lets VPN clients connect to their leased or purchased computing resources in public datacenters via their own VPNs.
http://tools.ietf.org/html/draft-so-vpn4dc-00
There is L2-LISP, but it is unclear how anything LISP-related qualifies as an overlay.
Juniper's QFabric and Brocade's VCS are not mentioned, as they are neither on a standards path nor Cisco's.
Virtual networking depends heavily on the hypervisor selection, but the end goal is the same: connect the virtual machines to each other and to the outside world using virtual plus physical networking. Removing the disparity between the network services consumed by a physical server and those consumed by a virtual server was the initial focus of innovation in the virtual networking space. Recently the focus has shifted to (a) scaling the virtual network and (b) enabling hybrid networks where physical and virtual resources co-exist in a policy domain.
SDN decouples the control plane from the data plane under the assumption that a separate control plane will follow a different economic curve from the data plane: more specifically, that the control plane will follow the curve of server economics while the data plane follows that of commodity low-end networking gear. The faults are already surfacing when we discuss the scalability of overlay protocols like VXLAN and NVGRE; both require ASIC support, i.e. they cannot simply run at speed on existing merchant silicon.