6. Controllability and Observability
The basic concept is simple. A stable feedback control system requires:
1. ability to influence all important system states (controllable)
2. ability to monitor all important system states (observable)
Thursday, June 13, 13
7. Controllability and Observability driving example
States: location, speed, direction, ...
Observability: It’s hard to stay on the road if you can’t see the road, or keep to the speed limit without a speedometer
Controllability: It’s hard to stay on the road or maintain speed if your brakes, engine or steering fail
Photo: Dense Tule fog in Bakersfield, CA
8. Effect of delay on stability
Components of loop delay (over time, from Disturbance to Effect):
- Measurement delay
- Planning delay
- Configuration delay
- Response delay
Loop delay example: DDoS launched → Identify target, attacker → Black hole, mark, re-route? → Switch CLI commands → Route propagation → Traffic dropped
e.g. Slow reaction time causes a tired / drunk / distracted driver to weave; with a very slow reaction time they leave the road
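The effect of loop delay can be illustrated with a toy model. The sketch below is not from the slides: it assumes a simple proportional controller that corrects an error using a measurement that is `delay` steps old, with an arbitrarily chosen gain of 0.5.

```python
def simulate(gain, delay, steps=200, disturbance=1.0):
    """Return the error history of x[k+1] = x[k] - gain * x[k - delay].

    Toy model (an assumption for illustration): the controller reacts to a
    measurement that is `delay` steps stale.
    """
    history = [disturbance] * (delay + 1)
    for _ in range(steps):
        stale = history[-(delay + 1)]          # oldest value the controller can see
        history.append(history[-1] - gain * stale)
    return history


prompt_loop = simulate(gain=0.5, delay=0)   # acts on the current measurement
slow_loop = simulate(gain=0.5, delay=5)     # acts on a measurement 5 steps old
```

With the same gain, `prompt_loop` settles toward zero while `slow_loop` oscillates with growing amplitude, which is the weaving driver in numeric form.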
9. Observability using sFlow standard
“In God we trust. All others bring data.”
Dr. Edwards Deming
11. Open source agents for hosts, hypervisors and applications
Host sFlow project (http://host-sflow.sourceforge.net) is the center of an ecosystem of related open source projects embedding sFlow in popular operating systems and applications
13. Scalable push protocol
Simple
- standard structures - densely packed blocks of counters
- extensible (tag, length, value)
- RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode/decode
- unicast UDP transport
Minimal configuration
- collector address
- polling interval
Cloud friendly
- flat, two tier architecture: many embedded agents → central “smart” collector
- sFlow agents automatically start sending metrics on startup, automatically discovered
- eliminates complexity of maintaining polling daemons (and associated configurations)
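As a rough illustration of the "tag, length, value" and XDR points, the sketch below packs a block of counters as big-endian, 4-byte aligned fields with Python's `struct` module. The tag value 1001 and the three-counter layout are invented for the example; they are not the actual sFlow structure definitions.

```python
import struct


def xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length opaque: 4-byte length, data, zero padding to a 4-byte boundary."""
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def encode_record(tag: int, value: bytes) -> bytes:
    """Tag, length, value: a decoder can skip any tag it does not understand."""
    return struct.pack(">I", tag) + xdr_opaque(value)


def decode_record(buf: bytes, offset: int = 0):
    """Return (tag, value, next_offset) for the record starting at offset."""
    tag, length = struct.unpack_from(">II", buf, offset)
    start = offset + 8
    value = buf[start:start + length]
    next_offset = start + length + (-length) % 4
    return tag, value, next_offset


# Hypothetical counter block: two 32-bit counters and one 64-bit counter.
HYPOTHETICAL_TAG = 1001
counters = struct.pack(">IIQ", 42, 7, 10**12)
datagram = encode_record(HYPOTHETICAL_TAG, counters)
```

Because every record carries its own length, a collector that does not recognise a tag can jump straight to `next_offset`, which is what makes the format extensible.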
14. Counters aren’t enough
• Counters tell you there is a problem, but not why.
• Counters summarize performance by dropping high cardinality attributes:
- IP addresses
- URLs
- Memcache keys
• Need to be able to efficiently disaggregate counters by attributes in order to understand the root cause of performance problems.
• How do you get this data when there are millions of transactions per second?
Chart: Why the spike in traffic? (100Gbit link carrying 14,000,000 packets/second)
15. sFlow also exports random samples
• Random sampling is lightweight
• Critical path costs roughly as much as maintaining one counter:
if(--skip == 0) sample();
• Sampling is easy to distribute among modules, threads, processes without any synchronization
• Minimal resources required to capture attributes of sampled transactions
• Easily identify top keys, connections, clients, servers, URLs etc.
• Unbiased results with known accuracy
Chart: Break out traffic by client, server and port (graph based on samples from 100Gbit link carrying 14,000,000 packets/second)
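The skip-counter idea can be sketched in a few lines. This is a minimal illustration, not the code of any sFlow agent: the class name, the uniform skip distribution and the error-bound helper are assumptions. The bound uses the usual approximation that a count of c samples has a 95% relative error of about 196 * sqrt(1/c) percent.

```python
import math
import random
from collections import Counter


class RandomSampler:
    """Sample roughly 1 in mean_skip transactions using a countdown."""

    def __init__(self, mean_skip, seed=None):
        self.mean_skip = mean_skip
        self.rng = random.Random(seed)
        self.samples = Counter()
        self.skip = self._next_skip()

    def _next_skip(self):
        # Uniform on [1, 2*mean_skip - 1], so the average skip equals mean_skip.
        return self.rng.randint(1, 2 * self.mean_skip - 1)

    def observe(self, key):
        # Critical path: one decrement and one comparison per transaction.
        self.skip -= 1
        if self.skip == 0:
            self.samples[key] += 1          # capture attributes of the sampled item
            self.skip = self._next_skip()

    def estimate(self, key):
        """Scale the sample count back up to an estimated total."""
        return self.samples[key] * self.mean_skip

    def top(self, n):
        """Top n keys with their estimated totals."""
        return [(k, c * self.mean_skip) for k, c in self.samples.most_common(n)]


def percent_error_bound(sample_count):
    """Approximate 95% confidence bound on relative error, in percent."""
    return 196.0 * math.sqrt(1.0 / sample_count)
```

Each module or thread can own its own `RandomSampler`, since the only state is a private countdown; the per-key sample counts can be merged later.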
16. Integrated data model
NETWORK: Sampled Packet Headers (Packet Header with Source and Destination MAC Address and TCP/UDP Socket), I/F Counters, Power, Temp.
HOST: CPU, Memory, I/O, Power, Temp., Adapter MACs
APPLICATION: Sampled Transactions, Transaction Counters, TCP/UDP Socket
Independent agents; sFlow analyzer joins data for integrated view
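One way to picture the join is to use the adapter MACs reported by hosts as the key that ties sampled packet headers back to a host. The sketch below assumes hypothetical record layouts (`hostname`, `adapter_macs`, `src_mac`, `dst_mac`); these field names are illustrative, not sFlow structure names.

```python
def join_by_mac(host_records, packet_samples):
    """Attach each sampled packet to the hosts owning its source and destination MACs."""
    mac_to_host = {}
    for host in host_records:
        for mac in host["adapter_macs"]:
            mac_to_host[mac.lower()] = host["hostname"]

    joined = []
    for sample in packet_samples:
        joined.append({
            **sample,
            # None means the MAC belongs to something no host agent reported.
            "src_host": mac_to_host.get(sample["src_mac"].lower()),
            "dst_host": mac_to_host.get(sample["dst_mac"].lower()),
        })
    return joined


hosts = [{"hostname": "web1", "adapter_macs": ["00:11:22:33:44:55"]}]
samples = [{"src_mac": "00:11:22:33:44:55", "dst_mac": "66:77:88:99:AA:BB", "bytes": 1500}]
view = join_by_mac(hosts, samples)
```

The same pattern would apply to the TCP/UDP socket shared between application transactions and packet headers.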
19. Feedback control loop with sFlow and OpenFlow
Diagram labels: Monitor, SDN application, low measurement delay, low planning delay, low configuration delay
Together, sFlow and OpenFlow provide the observability and controllability to enable SDN applications targeting low latency control problems like load balancing and DDoS mitigation
20. sFlow adds actionable visibility to SDN stack
Actionable = complete + timely
Diagram labels: Application (×3), Open APIs, Network OS, Control Plane, Open APIs (Configuration, Forwarding, Visibility), NETCONF/OF-Config, Data Plane, Hosts
21. SDN feedback control applications
Diagram labels: SDN Applications (Load Balancer, DDoS Protection; REST Applications), Open “Northbound” APIs, Control Plane (InMon sFlow-RT with REST API for Metrics, Flow Definitions, Thresholds; OpenFlow Controller with REST API), Open “Southbound” APIs, Data Plane, Hosts
22. Connect switches to central control plane
e.g. connect Open vSwitch to OpenFlow controller
ovs-vsctl set-controller br0 tcp:10.0.0.1:6633
e.g. connect Open vSwitch to sFlow analyzer
ovs-vsctl -- --id=@sflow create sflow agent=eth0 \
  target=\"10.0.0.1:6343\" sampling=1000 polling=20 \
  -- set bridge br0 sflow=@sflow
Minimal configuration to connect switches to controllers; intelligence resides in external software
23. Components of a DDoS flood attack
1. Command to attack target sent over control network
2. Large number of compromised hosts start sending traffic to target
3. Traffic converges on access link, overwhelming capacity and denying access
24. REST operation flow chart
Diagram labels: Define flow keys; DDoS Protection; sFlow-RT REST API; OpenFlow Controller REST API; arrows numbered 1 to 8
DDoS Protection:
define address groups
define flows
define thresholds
while(running) {
  receive threshold event
  monitor flow
  deploy control
  monitor flow
  release control
}
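The loop in the flow chart could be structured roughly as follows. This is a sketch under stated assumptions: every callable (`flow_rate`, `deploy_control`, `release_control`, `next_event`, `running`) is a hypothetical hook standing in for a REST call, and the `flowKey` field name in the event is assumed rather than taken from the slides.

```python
import time


def handle_event(event, flow_rate, deploy_control, release_control,
                 threshold, poll_interval=1.0):
    """Handle one threshold event; return True if a control was deployed."""
    flow = event["flowKey"]                    # assumed field name
    if flow_rate(flow) < threshold:            # monitor flow: is it still large?
        return False
    deploy_control(flow)                       # deploy control
    while flow_rate(flow) >= threshold:        # monitor flow until it subsides
        time.sleep(poll_interval)
    release_control(flow)                      # release control
    return True


def run(next_event, running, **hooks):
    """Outer loop: receive threshold events while the application is running."""
    while running():
        event = next_event()                   # receive threshold event
        if event is not None:                  # None models a long-poll timeout
            handle_event(event, **hooks)
```

Keeping the hooks injectable means the same loop can be exercised with fakes before it is pointed at a live analyzer and controller.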
25. REST API calls
1. Define address groups
curl -H "Content-Type:application/json" -X PUT \
  --data "{external:['0.0.0.0/0'], internal:['10.0.0.0/8']}" \
  http://localhost:8008/group/json
2. Define flows
curl -H "Content-Type:application/json" -X PUT \
  --data "{keys:'ipsource,ipdestination', value:'frames', filter:'sourcegroup=external&destinationgroup=internal'}" \
  http://localhost:8008/flow/incoming/json
3. Define thresholds
curl -H "Content-Type:application/json" -X PUT \
  --data "{metric:'incoming', value:1000}" \
  http://localhost:8008/threshold/incoming/json
4. Receive threshold events
curl "http://localhost:8008/events/json?eventID=4&timeout=60"
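The same four calls can be scripted. The sketch below only builds the requests with Python's standard `urllib`; the URLs and payload values are copied from the curl commands, while the helper names are invented here. Note that `json.dumps` emits strict JSON (quoted keys), and it is an assumption that the server accepts it in place of the relaxed form used with curl.

```python
import json
import urllib.request

SFLOW_RT = "http://localhost:8008"   # base URL taken from the curl examples


def put_json(path, payload):
    """Build (but do not send) a PUT request carrying a JSON body."""
    return urllib.request.Request(
        SFLOW_RT + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


setup_requests = [
    # 1. Define address groups
    put_json("/group/json", {"external": ["0.0.0.0/0"], "internal": ["10.0.0.0/8"]}),
    # 2. Define flows
    put_json("/flow/incoming/json", {
        "keys": "ipsource,ipdestination",
        "value": "frames",
        "filter": "sourcegroup=external&destinationgroup=internal",
    }),
    # 3. Define thresholds
    put_json("/threshold/incoming/json", {"metric": "incoming", "value": 1000}),
]


def events_url(last_event_id, timeout=60):
    """4. Receive threshold events: long-poll for events after last_event_id."""
    return "%s/events/json?eventID=%d&timeout=%d" % (SFLOW_RT, last_event_id, timeout)


def send(request):
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `send(r)` for each entry in `setup_requests` would perform steps 1 to 3 against a running instance; `events_url(4)` reproduces the URL from step 4.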
27. DDoS mitigation results
Charts (Before and After, packets/second), annotated: threshold, attack starts, detected, control implemented, attack eliminated
http://blog.sflow.com/2013/03/ddos.html
sustained 6M packets/second attack (30 Gigabits/second)
http://packetpushers.net/openflow-1-0-actual-use-case-rtbh-of-ddos-traffic-while-keeping-the-target-online/
Also: http://blog.sflow.com/2013/05/controlling-large-flows-with-openflow.html
28. ECMP/LAG multi-path traffic distribution
http://static.usenix.org/event/nsdi10/tech/full_papers/al-fares.pdf
index = hash(packet fields) % linkgroup.size
selected_link = linkgroup[index]
Hash collisions reduce effective cross sectional bandwidth
A 1:1 subscription ratio doesn’t eliminate blocking: collision probabilities are high, even with large numbers of paths
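The two-line selection rule can be tried out directly. In this sketch CRC32 is a stand-in for the switch hash (real devices use vendor-specific functions), and the link names and 5-tuples are made up for illustration.

```python
import zlib
from collections import Counter


def select_link(flow, link_group):
    """index = hash(packet fields) % linkgroup.size, using CRC32 as a stand-in hash."""
    key = "|".join(str(field) for field in flow).encode("ascii")
    index = zlib.crc32(key) % len(link_group)
    return link_group[index]


# Hypothetical link group and flows: (src ip, dst ip, protocol, src port, dst port).
links = ["uplink0", "uplink1", "uplink2", "uplink3"]
flows = [("10.0.0.%d" % i, "10.0.1.1", 6, 40000 + i, 80) for i in range(1, 5)]

# How many flows landed on each link; any count above 1 is a collision.
placement = Counter(select_link(flow, links) for flow in flows)
```

Every packet of a flow hashes to the same link, so two large flows that collide share one link for their whole lifetime regardless of how idle the other links are.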
Thursday, June 13, 13
29. Birthday Paradox
What is the chance that at least two people in a room will share a birthday?
There is a 50/50 chance with 23 people and virtual certainty with 60 people. This is a “paradox” because the probability seems remarkably high considering that there are 365 possible birthdays (366 if you include Feb 29): 23 people represent just over 6% of the theoretical maximum and 60 people only 16%.
http://en.wikipedia.org/wiki/Birthday_problem
ECMP/LAG/MLAG collision probabilities are surprisingly high
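The birthday calculation carries over to link selection if each flow is assumed to pick a path independently and uniformly at random, which is an idealization of a real hash. A minimal sketch:

```python
def collision_probability(items, slots):
    """Probability that at least two of `items` independent uniform choices share a slot."""
    if items > slots:
        return 1.0
    p_unique = 1.0
    for i in range(items):
        p_unique *= (slots - i) / slots    # the i-th item must avoid i occupied slots
    return 1.0 - p_unique


birthday_23 = collision_probability(23, 365)   # about 0.507
birthday_60 = collision_probability(60, 365)   # about 0.994
ecmp_4_of_16 = collision_probability(4, 16)    # about 0.33
```

Under these assumptions, just 4 flows spread over 16 equal-cost paths have roughly a one in three chance that at least two of them share a path.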
32. Memcached hot keys
http://codeascraft.com/2012/12/13/mctop-a-tool-for-analyzing-memcache-get-traffic/
Diagram: Memcache clients (Client 1, Client 2, Client 3, Client 4), Memcache cluster (Server A, Server B, Server C), overloaded server link
http://blog.sflow.com/2013/01/memcache-hot-keys-and-cluster-load.html
• Interesting parallel with ECMP/LAG hash collisions
• Demonstrates linkage between network and application performance
• Can monitor cluster wide Memcached hot/missed keys with sFlow
• Possible SDN use cases:
- Server placement informed by visibility into network topology, loads etc.
- Use SDN to shorten paths, reducing latency and packet loss
- Avoid packet loss by steering packets around congested link
- Extension of OpenFlow to optical circuit switches allows network to be rewired for actual demand
33. Next steps
Organizational: break down networking silo
- learn more about networking
- integrate networking into DevOps team
- think about observability and controllability when purchasing equipment and architecting services
Strategic: engage developer communities
- share operational expertise with SDN community
- help specify northbound APIs so that they deliver functionality needed to integrate networking into DevOps stack
36. Programmatic control of switches
• OpenFlow
• Hybrid combined OpenFlow (using NORMAL action)
• Puppet
• BGP policy
• RESTful API to switches
• NETCONF
• Optical circuit switching
- OpenFlow extensions
- SOCM (Service-Based Optical Connection Management)
http://www.nanog.org/sites/default/files/wed.general.brainslug.lapukhov.20.pdf
http://blog.sflow.com/2013/03/pragmatic-software-defined-networking.html
http://www.nanog.org/sites/default/files/wed.general.socm_.samberg.6.pdf
37. sFlow and NetFlow/IPFIX in a switch
NetFlow/IPFIX: packets → sample → decode → hash → flow cache → flush → send
sFlow: packets → sample → send; i/f counters → poll → send
• sFlow exports packet samples immediately
• sFlow also exports interface counters
• NetFlow exports flow data on end of flow, active-timeout or inactive-timeout
• NetFlow data generation requires significant resources on switch that can be better applied to increase size of forwarding table(s)
• OpenFlow metering has similar architecture to NetFlow and similar limitations
38. Rapid detection of large flows
Chart labels: Open vSwitch, InMon sFlow-RT, NetFlow, SolarWinds Real-Time NetFlow Analyzer, active timeout (×2)
• sFlow does not use a flow cache, so real-time charts more accurately reflect the traffic trend
• NetFlow spikes are caused by the flow cache active-timeout for long running connections
Flow cache active timeout delays large flow detection and limits the value of the signal for real-time control applications
http://blog.sflow.com/2013/01/rapidly-detecting-large-flows-sflow-vs.html
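The effect of the active timeout can be shown with an idealized model. The sketch below assumes a single long-lived flow at a constant rate, a 60 second active timeout (a hypothetical setting) and a sampled export with no sampling noise; none of these numbers come from the slides.

```python
def flow_cache_reports(rate, duration, active_timeout):
    """Bytes reported per second for one long-lived flow exported via a flow cache."""
    reports, cached = [], 0
    for second in range(1, duration + 1):
        cached += rate
        if second % active_timeout == 0:
            reports.append(cached)      # flush the accumulated total in one record
            cached = 0
        else:
            reports.append(0)           # nothing leaves the switch yet
    return reports


def sampled_reports(rate, duration):
    """Idealized sampled export: the scaled estimate arrives every second."""
    return [rate] * duration


def detection_delay(reports, threshold):
    """Seconds until a report first reaches the threshold, or None if it never does."""
    for second, value in enumerate(reports, start=1):
        if value >= threshold:
            return second
    return None


cache_view = flow_cache_reports(rate=1000, duration=120, active_timeout=60)
sample_view = sampled_reports(rate=1000, duration=120)
```

Both views account for the same total bytes, but the flow cache delivers them as two spikes at 60 and 120 seconds, so a threshold on `cache_view` cannot fire before the first timeout expires.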