SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Performance Aware SDN
http://www.meetup.com/SF-Bay-Area-Large-Scale-Production-Engineering/
Peter Phaal
InMon Corp.
June 2013
Thursday, June 13, 13
Why monitor performance?
“If you can’t measure it, you can’t improve it”
Lord Kelvin
Thursday, June 13, 13
BalancerLoad ServerServerWeb
DatabaseApplication
Network
MemcacheMemcache
ServerServerWeb
Server
Application
Server
Database
BalancerLoad Balancer
Networking in large, scale-out, multi-tiered sites
• Large number of servers in each pool
• Servers constantly added/removed
• Network performance is critical
• scale-out applications dependent on network performance
• potential for propagating failures between tiers
Thursday, June 13, 13
http://www.osa.org/osaorg/media/osa.media/CorporateGateway/ExecutiveForum2013/2013_Executive_Forum_Keynote_Najam_Ahmad.pdf
Najam Ahmad, Vice President, Infrastructure, Facebook
OSA 2013 keynote
SDN brings network under software control
Extends DevOps tool stack to include network visibility and control
Thursday, June 13, 13
Feedback control
Measure
Control
System
desired
output
measured
output
Thursday, June 13, 13
Controllability and Observability
Basic concept is simple, a stable feedback control system requires:
1. ability to influence all important system states (controllable)
2. ability to monitor all important system states (observable)
Thursday, June 13, 13
It’s hard to stay on the road if you can’t see the
road, or keep to the speed limit without a
speedometer
It’s hard to stay on the road or maintain
speed if your brakes, engine or steering fail
Controllability and Observability driving example
Observability
Controllability
States location, speed, direction, ...
Dense Tule fog in Bakersfield, CA
Thursday, June 13, 13
Effect of delay on stability
Measurement delay Planning delay
Time
Configuration delayDisturbance Response delay
EffectLoop delay
DDoS launched Identify target, attacker Black hole, mark, re-route? Switch CLI commands Route propagation Traffic dropped
Components of loop delay
e.g. Slow reaction time causes
tired / drunk / distracted
driver to weave, very slow
reaction time and they leave
the road
Thursday, June 13, 13
Observability using sFlow standard
“In God we trust. All others bring data.”
Dr. Edwards Deming
Thursday, June 13, 13
Industry standard measurement technology integrated in switches
http://www.sflow.org
Thursday, June 13, 13
Open source agents for hosts, hypervisors and applications
Host sFlow project (http://host-sflow.sourceforge.net) is center
of an ecosystem of related open source projects embedding
sFlow in popular operating systems and applications
Thursday, June 13, 13
Network (maintained by hardware in network devices)
- MIB-2 ifTable: ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifUnkownProtos,
ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors
Host (maintained by operating system kernel)
- CPU: load_one, load_five, load_fifteen, proc_run, proc_total, cpu_num, cpu_speed, uptime, cpu_user, cpu_nice,
cpu_system, cpu_idle, cpu_wio, cpu_intr, cpu_sintr, interupts, contexts
- Memory: mem_total, mem_free, mem_shared, mem_buffers, mem_cached, swap_total, swap_free, page_in, page_out,
swap_in, swap_out
- Disk IO: disk_total, disk_free, part_max_used, reads, bytes_read, read_time, writes, bytes_written, write_time
- Network IO: bytes_in, packets_in, errs_in, drops_in, bytes_out, packet_out, errs_out, drops_out
Application (maintained by application)
- HTTP: method_option_count, method_get_count, method_head_count, method_post_count, method_put_count,
method_delete_count, method_trace_count, method_connect_count, method_other_count, status_1xx_count,
status_2xx_count, status_3xx_count, status_4xx_count, status_5xx_count, status_other_count
- Memcache: cmd_set, cmd_touch, cmd_flush, get_hits, get_misses, delete_hits, delete_misses, incr_hits, incr_misses,
decr_hists, decr_misses, cas_hits, cas_misses, cas_badval, auth_cmds, auth_errors, threads, con_yields,
listen_disabled_num, curr_connections, rejected_connections, total_connections, connection_structures, evictions,
reclaimed, curr_items, total_items, bytes_read, bytes_written, bytes, limit_maxbytes
Standard counters
Thursday, June 13, 13
Simple
- standard structures - densely packed blocks of counters
- extensible (tag, length, value)
- RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode/decode
- unicast UDP transport
Minimal configuration
- collector address
- polling interval
Cloud friendly
- flat, two tier architecture: many embedded agents → central “smart” collector
- sFlow agents automatically start sending metrics on startup, automatically discovered
- eliminates complexity of maintaining polling daemons (and associated configurations)
Scaleable push protocol
Thursday, June 13, 13
• Counters tell you there is a
problem, but not why.
• Counters summarize
performance by dropping high
cardinality attributes:
- IP addresses
- URLs
- Memcache keys
• Need to be able to efficiently
disaggregate counter by
attributes in order to
understand root cause of
performance problems.
• How do you get this data
when there are millions of
transactions per second?
Counters aren’t enough
Why the spike in traffic?
(100Gbit link carrying 14,000,000 packets/second)
Thursday, June 13, 13
• Random sampling is lightweight
• Critical path roughly cost of
maintaining one counter:
if(--skip == 0) sample();
• Sampling is easy to distribute
among modules, threads,
processes without any
synchronization
• Minimal resources required to
capture attributes of sampled
transactions
• Easily identify top keys,
connections, clients, servers,
URLs etc.
• Unbiased results with known
accuracy
Break out traffic by client, server and port
(graph based on samples from100Gbit link carrying 14,000,000 packets/second)
sFlow also exports random samples
Thursday, June 13, 13
Integrated data model
Packet HeaderPacket Header
Source Destination
TCP/UDP Socket TCP/UDP Socket
MAC Address MAC Address
Sampled Packet Headers
I/F Counters
Power, Temp.
NETWORK
HOST
CPU
Memory
I/O
Power, Temp.
Adapter MACs
APPLICATION
Sampled Transactions
Transaction Counters
TCP/UDP Socket
Independent agents sFlow analyzer joins data for integrated view
Thursday, June 13, 13
Virtual Servers
Applications
Apache/PHP
Tomcat/Java
Memcached
Virtual Network
Servers
Network
Embedded monitoring of all
switches, all servers, all
applications, all the time
Consistent measurements
shared between multiple
management tools
Comprehensive visibility
Thursday, June 13, 13
Software Defined Networking
“You can’t control what you can’t measure”
Tom DeMarco
Thursday, June 13, 13
Monitor
Feedback control loop with sFlow and OpenFlow
low configuration delay
low measurement delay
Together, sFlow and OpenFlow provide the observability and
controllability to enable SDN applications targeting low latency
control problems like load balancing and DDoS mitigation
low planning delay
SDN application
Thursday, June 13, 13
Network OSApplication
Open APIsApplication
Application
Data Plane
Control Plane
Configuration Forwarding Visibility
NETCONF/OF-Config
Open APIs
Hosts
sFlow adds actionable visibility to SDN stack
Actionable = complete + timely
Thursday, June 13, 13
REST API
Metrics
Flow Definitions
Thresholds
InMonsFlow-RT
REST API
OpenFlowController
Load Balancer DDoS Protection
REST Applications
Open “Southbound” APIs
Data Plane
Control Plane
Hosts
Open “Northbound” APIs
SDN Applications
SDN feedback control applications
Thursday, June 13, 13
ovs-vsctl set-controller br0 tcp:10.0.0.1:6633
ovs-vsctl -- --id=@sflow create sflow agent=eth0 
target=”10.0.0.1:6343” sampling=1000 polling=20 
-- -- set bridge br0 sflow=@sflow
Connect switches to central control plane
e.g connect Open vSwitch to OpenFlow controller
e.g. connect Open vSwitch to sFlow analyzer
Minimal configuration to connect switches to
controllers, intelligence resides in external software
Thursday, June 13, 13
Components of a DDoS flood attack
1. Command to attack target sent over
control network
2. Large number of compromised hosts
start sending traffic to target
3.Traffic converges on access link,
overwhelming capacity and denying
access
Thursday, June 13, 13
Define flow keys
DDoS Protection
define address groups
define flows
define thresholds
while(running) {
receive threshold event
monitor flow
deploy control
monitor flow
release control
}
OpenFlow
Controller
REST API
sFlow-RT
REST API
1
2
3
4
6
5
8
7
REST operation flow chart
Thursday, June 13, 13
curl -H "Content-Type:application/json" -X PUT 
--data "{external:['0.0.0.0/0'], internal:['10.0.0.0/8']}" 
http://localhost:8008/group/json
1. Define address groups
curl -H "Content-Type:application/json" -X PUT 
--data "{keys:'ipsource,ipdestination', value:'frames', 
filter:'sourcegroup=external&destinationgroup=internal'}" 
http://localhost:8008/flow/incoming/json
2. Define flows
curl -H "Content-Type:application/json" -X PUT 
--data "{metric:'incoming', value:1000}" 
http://localhost:8008/threshold/incoming/json
3. Define thresholds
curl "http://localhost:8008/events/json?eventID=4&timeout=60"
4. Receive threshold events
Thursday, June 13, 13
5. Monitor flow
curl http://localhost:8008/metric/10.0.0.16/4.incoming/json
[{
"agent": "10.0.0.16",
"dataSource": "4",
"metricName": "incoming",
"metricValue": 1582.93965044338071,
"topKeys": [
{
"key": "192.168.1.1,10.0.0.151",
"updateTime": 1357169662500,
"value": 1582.93965044338071
},
{
"key": "192.168.1.4,10.0.0.151",
"updateTime": 1357169665500,
"value": 46.552918457198984
}
],
"updateTime": 1357169665500
}]
6. Deploy control
curl -d '{"switch": "00:00:00:00:00:00:00:01",
"name":"ddos-1", "cookie":"0", "priority":"32768",
"ingress-port":"4","active":"true"}'
http://localhost:8080/wm/staticflowentrypusher/json
Thursday, June 13, 13
threshold
attack starts
detected
control implemented attack eliminated
http://blog.sflow.com/2013/03/ddos.html
Before
After
DDoS mitigation results
packets/secondpackets/second
sustained 6M packets/second attack
(30 Gigabits/second)
http://packetpushers.net/openflow-1-0-actual-use-case-rtbh-of-ddos-traffic-while-keeping-the-target-online/
Also: http://blog.sflow.com/2013/05/controlling-large-flows-with-openflow.html
Thursday, June 13, 13
ECMP/LAG multi-path traffic distribution
http://static.usenix.org/event/nsdi10/tech/full_papers/al-fares.pdf
index = hash(packet fields) % linkgroup.size
selected_link = linkgroup[index]
Hash collisions reduce effective cross sectional bandwidth
1:1 subscription ratio doesn’t eliminate blocking, collision
probabilities are high, even with large numbers of paths
Thursday, June 13, 13
Birthday Paradox
What is the chance that at least two people in a room will share a birthday?
50/50 chance with 23 people, virtual certainty with the 60 people.This is a
“paradox” because the probability seems remarkably high considering that there
are 365 possible birthdays (366 if you include Feb 29) and 23 people represents
just over 6% of the theoretical maximum and 60 people is only 16%.
http://en.wikipedia.org/wiki/Birthday_problem
ECMP/LAG/MLAG collision probabilities are surprisingly high
Thursday, June 13, 13
http://research.microsoft.com/en-us/UM/people/srikanth/data/imc09_dcTraffic.pdf
Small number of long lived large flows responsible for bulk of load
Use SDN load balancing applications to detect and
eliminate collisions by adjusting forwarding paths
Load balancing large flows
Thursday, June 13, 13
colliding flows
http://blog.sflow.com/2013/01/load-balancing-lagecmp-groups.html
http://blog.sflow.com/2013/03/ecmp-load-balancing.html
http://blog.sflow.com/2013/02/sdn-and-large-flows.html
https://datatracker.ietf.org/doc/draft-ietf-opsawg-large-flow-load-balancing/
Not just ECMP, also LAG/MLAG and WAN etc.
non-colliding flows
http://blog.sflow.com/2013/06/large-flow-detection.html
http://blog.sflow.com/2013/06/flow-collisions.html
Thursday, June 13, 13
Memcached hot keys
http://codeascraft.com/2012/12/13/mctop-a-tool-for-analyzing-memcache-get-traffic/
Server B
Server C
Server A
Client 1
Client 2
Client 3
Client 4
Memcache clusterMemcache clients
overloaded server link
http://blog.sflow.com/2013/01/memcache-hot-keys-and-cluster-load.html
• Interesting parallel with ECMP/LAG hash collisions
• Demonstrates linkage between network and application performance
• Can monitor cluster wide Memcached hot/missed keys with sFlow
• Possible SDN use cases:
• Server placement informed by visibility into network topology, loads etc.
• Use SDN to shorten paths, reducing latency and packet loss
• Avoid packet loss by steering packets around congested link
• Extension of OpenFlow to optical circuit switches allows network to be
rewired for actual demand
Thursday, June 13, 13
Next steps
Organizational: break down networking silo
- learn more about networking
- integrate networking into DevOps team
- think about observability and controllability when
purchasing equipment and architecting services
Strategic: Engage developer communities
- share operational expertise with SDN community
- help specify northbound APIs so that they deliver
functionality needed to integrate networking into
DevOps stack
Thursday, June 13, 13
Questions?
Thursday, June 13, 13
Backup slides
Thursday, June 13, 13
• OpenFlow
• Hybrid combined OpenFlow (using NORMAL action)
• Puppet
• BGP policy
• RESTful API to switches
• NETCONF
• Optical circuit switching
- OpenFlow extensions
- SOCM (Service-Based Optical Connection Management)
Programatic control of switches
http://www.nanog.org/sites/default/files/wed.general.brainslug.lapukhov.20.pdf
http://blog.sflow.com/2013/03/pragmatic-software-defined-networking.html
http://www.nanog.org/sites/default/files/wed.general.socm_.samberg.6.pdf
Thursday, June 13, 13
packets
decode hash sendflow cache flushsample
NetFlow/IPFIX
send
polli/f counters
sample
• sFlow exports packet samples immediately
• sFlow also exports interface counters
• NetFlow exports flow data on end of flow, active-timeout or inactive-timeout
• NetFlow data generation requires significant resources on switch that can
be better applied to increase size of forwarding table(s)
• OpenFlow metering has similar architecture to NetFlow and similar
limitations
sFlow and NetFlow/IPFIX in a switch
Thursday, June 13, 13
InMon sFlow-RT
active timeout active timeout
NetFlow
Open
vSwitch
SolarWinds Real-Time NetFlow Analyzer
• sFlow does not use flow cache, so realtime charts more accurately reflect traffic trend
• NetFlow spikes caused by flow cache active-timeout for long running connections
Rapid detection of large flows
Flow cache active timeout delays large flow detection,
limits value of signal for real-time control applications
http://blog.sflow.com/2013/01/rapidly-detecting-large-flows-sflow-vs.html
Thursday, June 13, 13

Weitere ähnliche Inhalte

Ähnlich wie Performance Aware SDN, LSPE talk

Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Prolifics
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetrypphaal
 
Naveen nimmu sdn future of networking
Naveen nimmu sdn   future of networkingNaveen nimmu sdn   future of networking
Naveen nimmu sdn future of networkingOpenSourceIndia
 
Naveen nimmu sdn future of networking
Naveen nimmu sdn   future of networkingNaveen nimmu sdn   future of networking
Naveen nimmu sdn future of networkingsuniltomar04
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopMichael Zhang
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinTechWell
 
SDN and NFV: Facts, Extensions, and Carrier Opportunities
SDN and NFV: Facts, Extensions, and Carrier OpportunitiesSDN and NFV: Facts, Extensions, and Carrier Opportunities
SDN and NFV: Facts, Extensions, and Carrier Opportunitiesrjain51
 
Performance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cPerformance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cAjith Narayanan
 
Building a Highly Available Data Infrastructure
Building a Highly Available Data InfrastructureBuilding a Highly Available Data Infrastructure
Building a Highly Available Data InfrastructureDataCore Software
 
How to Configure the CA Workload Automation System Agent agentparm.txt File
How to Configure the CA Workload Automation System Agent agentparm.txt FileHow to Configure the CA Workload Automation System Agent agentparm.txt File
How to Configure the CA Workload Automation System Agent agentparm.txt FileCA Technologies
 
Troubleshooting for Intent-based Networking
Troubleshooting for Intent-based NetworkingTroubleshooting for Intent-based Networking
Troubleshooting for Intent-based NetworkingOpen Networking Summit
 
IT Infrastructure Through The Public Network Challenges And Solutions
IT Infrastructure Through The Public Network   Challenges And SolutionsIT Infrastructure Through The Public Network   Challenges And Solutions
IT Infrastructure Through The Public Network Challenges And SolutionsMartin Jackson
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)rjain51
 
Faster, Higher, Stronger – Accelerating Fault Management to the Next Level
Faster, Higher, Stronger – Accelerating Fault Management to the Next LevelFaster, Higher, Stronger – Accelerating Fault Management to the Next Level
Faster, Higher, Stronger – Accelerating Fault Management to the Next LevelOPNFV
 
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...Alexandru Ersenie
 
OpenFlow: What is it Good For?
OpenFlow: What is it Good For? OpenFlow: What is it Good For?
OpenFlow: What is it Good For? APNIC
 
ML in Production at FunTech Meetup (Feb 2019)
ML in Production at FunTech Meetup (Feb 2019)ML in Production at FunTech Meetup (Feb 2019)
ML in Production at FunTech Meetup (Feb 2019)Mark Andreev
 

Ähnlich wie Performance Aware SDN, LSPE talk (20)

Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
 
Banv
BanvBanv
Banv
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
Naveen nimmu sdn future of networking
Naveen nimmu sdn   future of networkingNaveen nimmu sdn   future of networking
Naveen nimmu sdn future of networking
 
Naveen nimmu sdn future of networking
Naveen nimmu sdn   future of networkingNaveen nimmu sdn   future of networking
Naveen nimmu sdn future of networking
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Nonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the CoinNonfunctional Testing: Examine the Other Side of the Coin
Nonfunctional Testing: Examine the Other Side of the Coin
 
SDN and NFV: Facts, Extensions, and Carrier Opportunities
SDN and NFV: Facts, Extensions, and Carrier OpportunitiesSDN and NFV: Facts, Extensions, and Carrier Opportunities
SDN and NFV: Facts, Extensions, and Carrier Opportunities
 
Performance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12cPerformance Tuning Oracle Weblogic Server 12c
Performance Tuning Oracle Weblogic Server 12c
 
Building a Highly Available Data Infrastructure
Building a Highly Available Data InfrastructureBuilding a Highly Available Data Infrastructure
Building a Highly Available Data Infrastructure
 
How to Configure the CA Workload Automation System Agent agentparm.txt File
How to Configure the CA Workload Automation System Agent agentparm.txt FileHow to Configure the CA Workload Automation System Agent agentparm.txt File
How to Configure the CA Workload Automation System Agent agentparm.txt File
 
Troubleshooting for Intent-based Networking
Troubleshooting for Intent-based NetworkingTroubleshooting for Intent-based Networking
Troubleshooting for Intent-based Networking
 
IT Infrastructure Through The Public Network Challenges And Solutions
IT Infrastructure Through The Public Network   Challenges And SolutionsIT Infrastructure Through The Public Network   Challenges And Solutions
IT Infrastructure Through The Public Network Challenges And Solutions
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)Introduction to Software Defined Networking (SDN)
Introduction to Software Defined Networking (SDN)
 
Faster, Higher, Stronger – Accelerating Fault Management to the Next Level
Faster, Higher, Stronger – Accelerating Fault Management to the Next LevelFaster, Higher, Stronger – Accelerating Fault Management to the Next Level
Faster, Higher, Stronger – Accelerating Fault Management to the Next Level
 
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...
Load and Performance Testing for J2EE - Testing, monitoring and reporting usi...
 
Alfredo paganophd 3y
Alfredo paganophd 3yAlfredo paganophd 3y
Alfredo paganophd 3y
 
OpenFlow: What is it Good For?
OpenFlow: What is it Good For? OpenFlow: What is it Good For?
OpenFlow: What is it Good For?
 
ML in Production at FunTech Meetup (Feb 2019)
ML in Production at FunTech Meetup (Feb 2019)ML in Production at FunTech Meetup (Feb 2019)
ML in Production at FunTech Meetup (Feb 2019)
 

Kürzlich hochgeladen

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Kürzlich hochgeladen (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Performance Aware SDN, LSPE talk

  • 2. Why monitor performance? “If you can’t measure it, you can’t improve it” Lord Kelvin Thursday, June 13, 13
  • 3. BalancerLoad ServerServerWeb DatabaseApplication Network MemcacheMemcache ServerServerWeb Server Application Server Database BalancerLoad Balancer Networking in large, scale-out, multi-tiered sites • Large number of servers in each pool • Servers constantly added/removed • Network performance is critical • scale-out applications dependent on network performance • potential for propagating failures between tiers Thursday, June 13, 13
  • 4. http://www.osa.org/osaorg/media/osa.media/CorporateGateway/ExecutiveForum2013/2013_Executive_Forum_Keynote_Najam_Ahmad.pdf Najam Ahmad, Vice President, Infrastructure, Facebook OSA 2013 keynote SDN brings network under software control Extends DevOps tool stack to include network visibility and control Thursday, June 13, 13
  • 6. Controllability and Observability Basic concept is simple, a stable feedback control system requires: 1. ability to influence all important system states (controllable) 2. ability to monitor all important system states (observable) Thursday, June 13, 13
  • 7. It’s hard to stay on the road if you can’t see the road, or keep to the speed limit without a speedometer It’s hard to stay on the road or maintain speed if your brakes, engine or steering fail Controllability and Observability driving example Observability Controllability States location, speed, direction, ... Dense Tule fog in Bakersfield, CA Thursday, June 13, 13
  • 8. Effect of delay on stability Measurement delay Planning delay Time Configuration delayDisturbance Response delay EffectLoop delay DDoS launched Identify target, attacker Black hole, mark, re-route? Switch CLI commands Route propagation Traffic dropped Components of loop delay e.g. Slow reaction time causes tired / drunk / distracted driver to weave, very slow reaction time and they leave the road Thursday, June 13, 13
  • 9. Observability using sFlow standard “In God we trust. All others bring data.” Dr. Edwards Deming Thursday, June 13, 13
  • 10. Industry standard measurement technology integrated in switches http://www.sflow.org Thursday, June 13, 13
  • 11. Open source agents for hosts, hypervisors and applications Host sFlow project (http://host-sflow.sourceforge.net) is center of an ecosystem of related open source projects embedding sFlow in popular operating systems and applications Thursday, June 13, 13
  • 12. Network (maintained by hardware in network devices) - MIB-2 ifTable: ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifUnkownProtos, ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors Host (maintained by operating system kernel) - CPU: load_one, load_five, load_fifteen, proc_run, proc_total, cpu_num, cpu_speed, uptime, cpu_user, cpu_nice, cpu_system, cpu_idle, cpu_wio, cpu_intr, cpu_sintr, interupts, contexts - Memory: mem_total, mem_free, mem_shared, mem_buffers, mem_cached, swap_total, swap_free, page_in, page_out, swap_in, swap_out - Disk IO: disk_total, disk_free, part_max_used, reads, bytes_read, read_time, writes, bytes_written, write_time - Network IO: bytes_in, packets_in, errs_in, drops_in, bytes_out, packet_out, errs_out, drops_out Application (maintained by application) - HTTP: method_option_count, method_get_count, method_head_count, method_post_count, method_put_count, method_delete_count, method_trace_count, method_connect_count, method_other_count, status_1xx_count, status_2xx_count, status_3xx_count, status_4xx_count, status_5xx_count, status_other_count - Memcache: cmd_set, cmd_touch, cmd_flush, get_hits, get_misses, delete_hits, delete_misses, incr_hits, incr_misses, decr_hists, decr_misses, cas_hits, cas_misses, cas_badval, auth_cmds, auth_errors, threads, con_yields, listen_disabled_num, curr_connections, rejected_connections, total_connections, connection_structures, evictions, reclaimed, curr_items, total_items, bytes_read, bytes_written, bytes, limit_maxbytes Standard counters Thursday, June 13, 13
  • 13. Simple - standard structures - densely packed blocks of counters - extensible (tag, length, value) - RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode/decode - unicast UDP transport Minimal configuration - collector address - polling interval Cloud friendly - flat, two tier architecture: many embedded agents → central “smart” collector - sFlow agents automatically start sending metrics on startup, automatically discovered - eliminates complexity of maintaining polling daemons (and associated configurations) Scaleable push protocol Thursday, June 13, 13
  • 14. • Counters tell you there is a problem, but not why. • Counters summarize performance by dropping high cardinality attributes: - IP addresses - URLs - Memcache keys • Need to be able to efficiently disaggregate counter by attributes in order to understand root cause of performance problems. • How do you get this data when there are millions of transactions per second? Counters aren’t enough Why the spike in traffic? (100Gbit link carrying 14,000,000 packets/second) Thursday, June 13, 13
  • 15. • Random sampling is lightweight • Critical path roughly cost of maintaining one counter: if(--skip == 0) sample(); • Sampling is easy to distribute among modules, threads, processes without any synchronization • Minimal resources required to capture attributes of sampled transactions • Easily identify top keys, connections, clients, servers, URLs etc. • Unbiased results with known accuracy Break out traffic by client, server and port (graph based on samples from100Gbit link carrying 14,000,000 packets/second) sFlow also exports random samples Thursday, June 13, 13
  • 16. Integrated data model Packet HeaderPacket Header Source Destination TCP/UDP Socket TCP/UDP Socket MAC Address MAC Address Sampled Packet Headers I/F Counters Power, Temp. NETWORK HOST CPU Memory I/O Power, Temp. Adapter MACs APPLICATION Sampled Transactions Transaction Counters TCP/UDP Socket Independent agents sFlow analyzer joins data for integrated view Thursday, June 13, 13
  • 17. Virtual Servers Applications Apache/PHP Tomcat/Java Memcached Virtual Network Servers Network Embedded monitoring of all switches, all servers, all applications, all the time Consistent measurements shared between multiple management tools Comprehensive visibility Thursday, June 13, 13
  • 18. Software Defined Networking “You can’t control what you can’t measure” Tom DeMarco Thursday, June 13, 13
  • 19. Monitor Feedback control loop with sFlow and OpenFlow low configuration delay low measurement delay Together, sFlow and OpenFlow provide the observability and controllability to enable SDN applications targeting low latency control problems like load balancing and DDoS mitigation low planning delay SDN application Thursday, June 13, 13
  • 20. Network OSApplication Open APIsApplication Application Data Plane Control Plane Configuration Forwarding Visibility NETCONF/OF-Config Open APIs Hosts sFlow adds actionable visibility to SDN stack Actionable = complete + timely Thursday, June 13, 13
  • 21. REST API Metrics Flow Definitions Thresholds InMonsFlow-RT REST API OpenFlowController Load Balancer DDoS Protection REST Applications Open “Southbound” APIs Data Plane Control Plane Hosts Open “Northbound” APIs SDN Applications SDN feedback control applications Thursday, June 13, 13
  • 22. ovs-vsctl set-controller br0 tcp:10.0.0.1:6633 ovs-vsctl -- --id=@sflow create sflow agent=eth0 target=”10.0.0.1:6343” sampling=1000 polling=20 -- -- set bridge br0 sflow=@sflow Connect switches to central control plane e.g connect Open vSwitch to OpenFlow controller e.g. connect Open vSwitch to sFlow analyzer Minimal configuration to connect switches to controllers, intelligence resides in external software Thursday, June 13, 13
  • 23. Components of a DDoS flood attack 1. Command to attack target sent over control network 2. Large number of compromised hosts start sending traffic to target 3.Traffic converges on access link, overwhelming capacity and denying access Thursday, June 13, 13
  • 24. Define flow keys DDoS Protection define address groups define flows define thresholds while(running) { receive threshold event monitor flow deploy control monitor flow release control } OpenFlow Controller REST API sFlow-RT REST API 1 2 3 4 6 5 8 7 REST operation flow chart Thursday, June 13, 13
  • 25. curl -H "Content-Type:application/json" -X PUT --data "{external:['0.0.0.0/0'], internal:['10.0.0.0/8']}" http://localhost:8008/group/json 1. Define address groups curl -H "Content-Type:application/json" -X PUT --data "{keys:'ipsource,ipdestination', value:'frames', filter:'sourcegroup=external&destinationgroup=internal'}" http://localhost:8008/flow/incoming/json 2. Define flows curl -H "Content-Type:application/json" -X PUT --data "{metric:'incoming', value:1000}" http://localhost:8008/threshold/incoming/json 3. Define thresholds curl "http://localhost:8008/events/json?eventID=4&timeout=60" 4. Receive threshold events Thursday, June 13, 13
  • 26. 5. Monitor flow curl http://localhost:8008/metric/10.0.0.16/4.incoming/json [{ "agent": "10.0.0.16", "dataSource": "4", "metricName": "incoming", "metricValue": 1582.93965044338071, "topKeys": [ { "key": "192.168.1.1,10.0.0.151", "updateTime": 1357169662500, "value": 1582.93965044338071 }, { "key": "192.168.1.4,10.0.0.151", "updateTime": 1357169665500, "value": 46.552918457198984 } ], "updateTime": 1357169665500 }] 6. Deploy control curl -d '{"switch": "00:00:00:00:00:00:00:01", "name":"ddos-1", "cookie":"0", "priority":"32768", "ingress-port":"4","active":"true"}' http://localhost:8080/wm/staticflowentrypusher/json Thursday, June 13, 13
  • 27. threshold attack starts detected control implemented attack eliminated http://blog.sflow.com/2013/03/ddos.html Before After DDoS mitigation results packets/secondpackets/second sustained 6M packets/second attack (30 Gigabits/second) http://packetpushers.net/openflow-1-0-actual-use-case-rtbh-of-ddos-traffic-while-keeping-the-target-online/ Also: http://blog.sflow.com/2013/05/controlling-large-flows-with-openflow.html Thursday, June 13, 13
  • 28. ECMP/LAG multi-path traffic distribution http://static.usenix.org/event/nsdi10/tech/full_papers/al-fares.pdf index = hash(packet fields) % linkgroup.size selected_link = linkgroup[index] Hash collisions reduce effective cross sectional bandwidth 1:1 subscription ratio doesn’t eliminate blocking, collision probabilities are high, even with large numbers of paths Thursday, June 13, 13
  • 29. Birthday Paradox What is the chance that at least two people in a room will share a birthday? 50/50 chance with 23 people, virtual certainty with the 60 people.This is a “paradox” because the probability seems remarkably high considering that there are 365 possible birthdays (366 if you include Feb 29) and 23 people represents just over 6% of the theoretical maximum and 60 people is only 16%. http://en.wikipedia.org/wiki/Birthday_problem ECMP/LAG/MLAG collision probabilities are surprisingly high Thursday, June 13, 13
  • 30. http://research.microsoft.com/en-us/UM/people/srikanth/data/imc09_dcTraffic.pdf Small number of long lived large flows responsible for bulk of load Use SDN load balancing applications to detect and eliminate collisions by adjusting forwarding paths Load balancing large flows Thursday, June 13, 13
  • 31. colliding flows http://blog.sflow.com/2013/01/load-balancing-lagecmp-groups.html http://blog.sflow.com/2013/03/ecmp-load-balancing.html http://blog.sflow.com/2013/02/sdn-and-large-flows.html https://datatracker.ietf.org/doc/draft-ietf-opsawg-large-flow-load-balancing/ Not just ECMP, also LAG/MLAG and WAN etc. non-colliding flows http://blog.sflow.com/2013/06/large-flow-detection.html http://blog.sflow.com/2013/06/flow-collisions.html Thursday, June 13, 13
  • 32. Memcached hot keys http://codeascraft.com/2012/12/13/mctop-a-tool-for-analyzing-memcache-get-traffic/ Server B Server C Server A Client 1 Client 2 Client 3 Client 4 Memcache clusterMemcache clients overloaded server link http://blog.sflow.com/2013/01/memcache-hot-keys-and-cluster-load.html • Interesting parallel with ECMP/LAG hash collisions • Demonstrates linkage between network and application performance • Can monitor cluster wide Memcached hot/missed keys with sFlow • Possible SDN use cases: • Server placement informed by visibility into network topology, loads etc. • Use SDN to shorten paths, reducing latency and packet loss • Avoid packet loss by steering packets around congested link • Extension of OpenFlow to optical circuit switches allows network to be rewired for actual demand Thursday, June 13, 13
  • 33. Next steps Organizational: break down networking silo - learn more about networking - integrate networking into DevOps team - think about observability and controllability when purchasing equipment and architecting services Strategic: Engage developer communities - share operational expertise with SDN community - help specify northbound APIs so that they deliver functionality needed to integrate networking into DevOps stack Thursday, June 13, 13
  • 36. • OpenFlow • Hybrid combined OpenFlow (using NORMAL action) • Puppet • BGP policy • RESTful API to switches • NETCONF • Optical circuit switching - OpenFlow extensions - SOCM (Service-Based Optical Connection Management) Programatic control of switches http://www.nanog.org/sites/default/files/wed.general.brainslug.lapukhov.20.pdf http://blog.sflow.com/2013/03/pragmatic-software-defined-networking.html http://www.nanog.org/sites/default/files/wed.general.socm_.samberg.6.pdf Thursday, June 13, 13
  • 37. packets decode hash sendflow cache flushsample NetFlow/IPFIX send polli/f counters sample • sFlow exports packet samples immediately • sFlow also exports interface counters • NetFlow exports flow data on end of flow, active-timeout or inactive-timeout • NetFlow data generation requires significant resources on switch that can be better applied to increase size of forwarding table(s) • OpenFlow metering has similar architecture to NetFlow and similar limitations sFlow and NetFlow/IPFIX in a switch Thursday, June 13, 13
  • 38. InMon sFlow-RT active timeout active timeout NetFlow Open vSwitch SolarWinds Real-Time NetFlow Analyzer • sFlow does not use flow cache, so realtime charts more accurately reflect traffic trend • NetFlow spikes caused by flow cache active-timeout for long running connections Rapid detection of large flows Flow cache active timeout delays large flow detection, limits value of signal for real-time control applications http://blog.sflow.com/2013/01/rapidly-detecting-large-flows-sflow-vs.html Thursday, June 13, 13