LinkedIn's Approach to Programmable Data Center

LinkedIn’s Approach to Programmable Data Center
Shawn Zandi
Principal Architect
Infrastructure Engineering

LinkedIn Infrastructure
• Infrastructure architecture based on application’s behavior & requirements
• Pre-planned static topology
• Single operator
• Single tenant with many applications
• As oppose to multi-tenant with different (or unknown) needs
• 34% infrastructure growth on annual basis and close to half a billion users

Edge Network to Eyeballs
Backbone Network
Bare Metal
Operating System
Container
Application
End to End Network Design
Data Center Network
End to end control enables us to tackle problems at different parts of the stack
From application code, os, network or client software to solve by architecture…
Bare Metal
Operating System
Container
Application

Traffic Demands
• High intra and inter-DC bandwidth demand due to organic growth
• Every single byte of member activity, creates thousands bytes of east-west traffic inside
data center:
• Application Call Graph
• Metrics, Analytics and Tracking via Kafka
• Hadoop and Offline Jobs
• Machine Learning
• Data Replications
• Search and Indexing
• Ads, recruiting solutions, etc.

Scaling Out Data Centers Network - Hardware
• White-box Switches (ODM)
• Vendor Switches Based (OEM)
• Based on Merchant Silicon
• Big Chassis Switches
• Designed around robustness (NSR, ISSU, etc.)
• Feature-rich but mostly irrelevant to LinkedIn needs
Project Falco

Data center were designed by redundant chassis at core
controlling and forwarding east-west and north-south traffic

PodW
SpineSpineSpine
LeafLeafLeafLeaf
Spine
PodX
SpineSpineSpine
LeafLeafLeafLeaf
Spine
PodY
SpineSpineSpine
LeafLeafLeafLeaf
Spine
PodZ
SpineSpineSpine
LeafLeafLeafLeaf
Spine
Fabric 2
Spine Spine Spine
Fabric 4
Spine Spine Spine
Fabric 1
Spine Spine Spine
4,096 x100G ports
Non-Blocking
Scale-out
Spine Spine Spine
Fabric 3
Chassis Free Data Center

Why No Chassis?
• Robust-yet-Fragile
• Complex due to NSR, ISSU, feature-sets, etc.
• Larger fault domain, Fail-over/Fail-back
• Indeterministic boot up process and long upgrade procedures
• Moved complexity from big boxes to our advantage, where we can manage and control!
• Better control and visibility to internals by removing black-box abstraction!
• Same Switch SKU on ToR, Leaf and Spine (Entire DC)
• Single chipset uniform IO design (same bandwidth, latency and buffering)
• True 5-Staged Clos Topology! with deterministic latency
• Dedicated control plane, OAM and CPU for each ASIC

W X Y Z
W X Y Z
W X Y Z
Distributed Control Plane Complexity
Pod 1
2 32…1
Pod 11
322 352…321
Pod 21
642 672…641
Pod 31
962 992…961
W X Y Z
2171217021692168213121302129212820912090208920882051205020492048
2339233823372336 2368 2369 2370 2371 2400 2401 2402 24032307230623052304

“Fabric wide visibility and telemetry”
The wider the fabric, flow tracking and fault isolation becomes more difficult
Problem 1

“Fabric wide traffic distribution and packet scheduling!”
Forwarding is different than routing, and out of scope for routing protocols.
Problem 2
We need a robust and scalable control protocol designed for a data center fabric

Control Plane :: Routing
• Routing protocols provide destination-based reachability information
• Routing protocols are not traffic aware.
• Best path selection is elementary.
• Network graph is built based on series of ECMP groups,
“Routing protocols are more about the destination than the journey”

ECMP forwarding simply does not cut it!
Problem #3

ECMP is not really equal!
• Elephants and mice issue
• ECMP Hashing is not bandwidth aware. Devices use an algorithm to
distribute traffic amongst links regardless of load.
• Traffic is routed using shortest path, not all the available paths,
hence not maximizing all the available capacity. Some links may
suffer while the other may be underutilized.
• Flows stick to a certain path, as hashing is performed per flow. An
established socket cannot be moved to a different path easily!

“We need a robust and scalable fabric-wide forwarding policy”
Problem #4

Lack of Centralized Policy and Control
• The more parallel links you add, forwarding decision becomes more
random.
• Devices were configured and maintained individually
• Routing/Forwarding policy management tasks are performed
individually and hop by hop.
• Know when/where to centralize or distribute to scale out!

“End to End Path Selection & Control”
No application, protocol or packet can dictate a path
Centralized flow based routing does not scale!
Problem #5

“Using the same familiar, robust and well-known solutions brings along the same
restrictions when they were originally designed”
Problem #6

Hardware
Network
Transport
Application
BGP (1990s)
Clos Topology (1950s)
Ethernet & IP (1980s)

IP Routing History
• IP routing is defined hop-by-hop
• BGP is “the” IDR designed to work between different autonomous
system, to provide policy and control between different routing
domains to select a best path.
• True: BGP can scale and is extensible. BGP has many policy knobs.
• A datacenter fabric operated under a single administrative domain
instead of series of individual routers with different policies and
decision process.

Forwarding traffic based on demands & patterns:
• Application
• Latency
• Loss
• Bandwidth (Throughput)
Programmable Data Center
A data center fabric that distributes traffic amongst all available links efficiently and
effectively while maintaining lowest latency and providing the most possible bandwidth to
different applications based on different needs and priorities.

Program forwarding tables individually on all switches from a centralized location
Approach #1

Flow x > Port 1
Flow x > Port 3
Flow x > Port 2
Forwarding and Control Element Separation

Encode path information into packet header
Approach #2

Distributed control plane for topology discovery and reachability information
+
Use a controller software for forwarding policy and optimizations
Approach #3

Scale: No state or flow information required to be stored on every box
Network can choose and move flows dynamically
Application can choose and move flows dynamically
Works with existing data plane (merchant silicon support)
Supports ECMP with fallback to IP routing
Automatic Local Repair / LFA

Hardware
Routing
Policy
Applications
Link Selection and Scheduling
Topology Discovery and Network Graph
Control
Telemetry/Visibility, Machine Learning, Prediction Engine, Self Healing, etc.
Forwarding
Merchant Silicon
Rethinking The Network Stack

Network Element
Management
Plane
SNMP, Syslog, etc.
System &
Environmental
Data
Packet & Flow
Data
Network Operating System
Kafka Network Agent
ASIC
System
Drivers
Reducing Protocols

Network ElementNetwork ElementNetwork ElementNetwork Element
Management
Plane
SNMP, Syslog, etc.
System &
Environmental
Data
Packet & Flow
Data
Network Operating System
Kafka Agent
Monitoring and Management System
Kafka Broker
Machine Learning & Data Processing
Alert
Processor
Log Retention
Data Store
Event
Correlation
Kafka Pub/Sub Pipeline
Record, Process and Replay Network State

Open19
OpenFabric
ASIC ODM
RIB / Forwarding Abstraction Layer
FALCO
Apps
Linux OS
Hardware
Physical Layer
Hardware Abstraction Layer
Metrics & Analytics
Machine Learning
Self Healing
etc. (API to Infrastructure)
Policy & Control
Operating System
Base Networking
LinkedIn Infrastructure Strategy

• Unified Architecture
• We used a single SKU (hardware and software) for all switches while procuring
hardware from multiple ODM channels (multi-homing)
• One Software: Base Networking on Merchant Silicon with minimum req. features
• No Overlay - For the infrastructure, the application is stateless
• No Middle-box (Firewall, Load-balancer, etc.) Moved to application
• Network is only a set of intermediate boxes running linux
Simplified Infrastructure to Own

• To control and own your architecture:
• End to end stack (app, operating system, network and architecture.)
• Ultimate sophistication: Simplicity
• In house support as far as possible
• Move complexity to your comfort zone!
Stay in Control

• SDN is nor a protocol or a tool or product off the shelf
• SDN is the whole network stack and architecture that enables
applications to meet and interact with infrastructure to:
SDN for LinkedIn
• Discover and Learn
• Provision
• Manage
• Control
• Monitor

Project Altair: The Evolution of LinkedIn’s Data Center Network
Project Falco: Decoupling Switching Hardware and Software
Open19: A New Vision for the Data Center

LinkedIn's Approach to Programmable Data Center

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (7)

Ähnlich wie LinkedIn's Approach to Programmable Data Center

Ähnlich wie LinkedIn's Approach to Programmable Data Center (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

LinkedIn's Approach to Programmable Data Center

Hinweis der Redaktion