Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.
10. INTRO
What could possibly go wrong:
Client app release/bug/change
Client OS release
DNS issue
Last mile network issue
Internet Congestion
Route leak
AWS outage
AWS microservice release
...
Diagram labels: Client, DNS, CDN, API Acceleration, Video, Small assets, Private Backbone
11. INTRO
How do I know that the issue is network related?
And how do I fix it?
35. MEASURING NETWORK COMPONENTS
Probe Stats at Netflix
$1M+ per year for a vendor solution
6K+ probes per second
14 recipes
1K+ devices
100M+ locations
37. MEASURING NETWORK COMPONENTS
Detecting Network Issues
How to use Probes to detect and triage network issues
1. Measure
2. Troubleshoot
3. Remediate
4. Prevent
40. DETECTING NETWORK ISSUES
Reachability Test Setup
Diagram labels: ISP, IX, CDN, Cloud
type: HTTP GET
name: reachability test
targets:
isp.test.me/probe
ix.test.me/probe
cloud.test.me/probe
Results:
isp: OK / FAIL
ix: OK / FAIL
cloud: OK / FAIL
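A minimal sketch of how a client could execute the reachability recipe above and produce the per-target OK / FAIL report. The `http://` scheme, the `RECIPE` dictionary layout, and the function names `http_get_ok` and `run_recipe` are assumptions made for illustration; they are not taken from Probnik itself.

```python
import http.client
import urllib.request
from typing import Callable, Dict

# Hypothetical in-memory form of the slide's recipe; the scheme is assumed.
RECIPE = {
    "type": "HTTP GET",
    "name": "reachability test",
    "targets": {
        "isp": "http://isp.test.me/probe",
        "ix": "http://ix.test.me/probe",
        "cloud": "http://cloud.test.me/probe",
    },
}


def http_get_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` completes with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


def run_recipe(recipe: dict, fetch: Callable[[str], bool] = http_get_ok) -> Dict[str, str]:
    """Probe every target once and report OK / FAIL per target label."""
    return {
        label: ("OK" if fetch(url) else "FAIL")
        for label, url in recipe["targets"].items()
    }
```

Passing a fake `fetch` function makes the logic easy to exercise without touching the network, for example `run_recipe(RECIPE, fetch=lambda url: "ix" not in url)`.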
46. DETECTING NETWORK ISSUES
Can Drill Down to Various Dimensions of User Connectivity
Diagram labels: ISP, IX, CDN, Cloud. Per-ISP drill-down: ISP1, ISP2, ISP3 and ISP1 CDN, ISP2 CDN, ISP3 CDN
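One way to picture the drill-down is as a group-by over individual probe reports. The sketch below assumes each report is a dictionary with hypothetical fields `isp`, `target` and `ok`; the field names and the sample values are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by(reports: Iterable[dict], *dims: str) -> Dict[Tuple[str, ...], float]:
    """Group probe reports by the given dimensions and return the failure rate per group."""
    totals: Dict[Tuple[str, ...], int] = defaultdict(int)
    fails: Dict[Tuple[str, ...], int] = defaultdict(int)
    for report in reports:
        key = tuple(report[d] for d in dims)
        totals[key] += 1
        if not report["ok"]:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}


# Made-up reports: ISP2 clients cannot reach the cloud target.
reports = [
    {"isp": "ISP1", "target": "cdn", "ok": True},
    {"isp": "ISP1", "target": "cloud", "ok": True},
    {"isp": "ISP2", "target": "cdn", "ok": True},
    {"isp": "ISP2", "target": "cloud", "ok": False},
    {"isp": "ISP2", "target": "cloud", "ok": False},
    {"isp": "ISP3", "target": "cloud", "ok": True},
]

by_isp = failure_rate_by(reports, "isp")
by_isp_and_target = failure_rate_by(reports, "isp", "target")
```

Grouping by `isp` alone shows that ISP2 is unhealthy; adding `target` narrows the failure to the ISP2-to-cloud path while ISP2-to-CDN stays healthy.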
49. DETECTING NETWORK ISSUES
Beyond HTTP Reachability: Auth DNS Availability
Auth DNS A: probe.dnsA.me -> 1.2.3.4
Auth DNS B: probe.dnsB.me -> 1.2.3.4
Auth DNS C: probe.dnsC.me -> 1.2.3.4
Cloud: 1.2.3.4
50. DETECTING NETWORK ISSUES
Beyond HTTP Reachability: Auth DNS Availability
Auth DNS A: probe.dnsA.me -> 1.2.3.4
Auth DNS B: probe.dnsB.me -> 1.2.3.4
Auth DNS C: probe.dnsC.me -> 1.2.3.4
Cloud: 1.2.3.4
type: HTTP GET
name: DNS test
targets:
probe.dnsA.me/probe
probe.dnsB.me/probe
probe.dnsC.me/probe
Results:
Auth DNS A: OK / FAIL
Auth DNS B: OK / FAIL
Auth DNS C: OK / FAIL
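Because all three hostnames resolve to the same address (1.2.3.4), the only thing that differs between the targets is the authoritative DNS provider that answers for each name. A rough sketch of that reasoning follows; the function name and the rule of refusing to attribute blame when every target fails are assumptions added for illustration.

```python
from typing import Dict, List


def failing_dns_providers(results: Dict[str, bool]) -> List[str]:
    """Given provider -> probe success, return providers that look unavailable.

    All probe hostnames point at the same server, so if at least one probe
    succeeded the server is reachable and the failures are attributed to DNS.
    If every probe failed, the server or the client network may be down, so
    nothing is attributed to DNS.
    """
    if not any(results.values()):
        return []
    return sorted(provider for provider, ok in results.items() if not ok)


one_provider_down = failing_dns_providers(
    {"Auth DNS A": True, "Auth DNS B": False, "Auth DNS C": True}
)
everything_down = failing_dns_providers(
    {"Auth DNS A": False, "Auth DNS B": False, "Auth DNS C": False}
)
```

A single client's result can be noisy, so in practice a check like this would presumably be applied to results aggregated over many clients.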
59. REMEDIATION
Probe for Reachability
Diagram labels: ISP, IX, CDN, Cloud, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD
type: HTTP GET
name: steering test
targets:
cloud.test.me/probe
ix-cloud.test.me/probe
isp-cloud.test.me/probe
isp-ix-cloud.test.me/probe
Results:
cloud: OK / FAIL
ix-cloud: OK / FAIL
isp-cloud: OK / FAIL
isp-ix-cloud: OK / FAIL
60. REMEDIATION
Remediation for Broken Path
cloud: FAIL
ix-cloud: OK
isp-cloud: FAIL
isp-ix-cloud: OK
What’s broken?
- ISP’s connection to AWS
Can we fix it?
- YES - Move traffic via the IX CDN server
Diagram labels: ISP, IX, CDN, Cloud, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD
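The remediation step can be sketched as choosing the first path whose probe succeeded. The preference order below (direct path first, IX paths as fallback) is an assumption made for this example, as is the function name `pick_path`.

```python
from typing import Dict, Optional, Sequence

# Assumed preference order: direct to cloud first, then via CDN servers.
PREFERENCE = ("cloud", "isp-cloud", "ix-cloud", "isp-ix-cloud")


def pick_path(results: Dict[str, str], preference: Sequence[str] = PREFERENCE) -> Optional[str]:
    """Return the most preferred path whose probe reported OK, or None if none did."""
    for path in preference:
        if results.get(path) == "OK":
            return path
    return None


# Probe results from the broken-path slide.
broken_path = {
    "cloud": "FAIL",
    "ix-cloud": "OK",
    "isp-cloud": "FAIL",
    "isp-ix-cloud": "OK",
}
chosen = pick_path(broken_path)
```

With the results from the slide, `chosen` is `"ix-cloud"`, which corresponds to moving traffic via the IX CDN server.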
61. REMEDIATION
Remediation for Full Isolation
cloud: FAIL
ix-cloud: FAIL
isp-cloud: FAIL
isp-ix-cloud: FAIL
What’s broken?
- ISP outage or client last mile
Can we fix it?
- NO (we don’t have a routable path)
Diagram labels: ISP, IX, CDN, Cloud, Private Backbone. Paths: CLOUD, IX-CLOUD, ISP-CLOUD, ISP-IX-CLOUD
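The two triage outcomes shown on the remediation slides can be written down as a small lookup on which paths succeeded. This sketch encodes only those two cases plus the all-OK case; any other combination is left unclassified rather than guessed. The grouping of paths into `DIRECT` and `VIA_IX` is inferred from the path names and is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DIRECT = {"cloud", "isp-cloud"}        # assumed: paths that do not traverse the IX
VIA_IX = {"ix-cloud", "isp-ix-cloud"}  # assumed: paths that go through the IX


@dataclass(frozen=True)
class Diagnosis:
    broken: Optional[str]    # what appears to be broken, None if nothing is
    fixable: Optional[bool]  # None when the case is not covered by the slides


def diagnose(results: Dict[str, str]) -> Diagnosis:
    """Map steering-test results to the diagnoses given on the slides."""
    ok = {path for path, status in results.items() if status == "OK"}
    if ok == DIRECT | VIA_IX:
        return Diagnosis(None, None)
    if not ok:
        # No routable path at all.
        return Diagnosis("ISP outage or client last mile", False)
    if ok == VIA_IX:
        # Direct paths fail, IX paths work: traffic can move via the IX CDN server.
        return Diagnosis("ISP's connection to AWS", True)
    return Diagnosis("unclassified", None)


full_isolation = diagnose(
    {"cloud": "FAIL", "ix-cloud": "FAIL", "isp-cloud": "FAIL", "isp-ix-cloud": "FAIL"}
)
```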
70. PREVENTION
Testing DNS Changes with Probes
Prod Auth DNS: probe.prodDNS.test -> 1.2.3.4
Auth DNS A: probe.dnsA.test -> 1.2.3.4
Cloud: 1.2.3.4
Test DNS change on Probe traffic before applying to PROD
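One possible way to turn this into a release gate is to compare the probe success rate through the changed DNS against the production baseline. The function name, the `max_drop` tolerance and the sample counts below are hypothetical; the slides do not specify how the comparison is made.

```python
def safe_to_promote(
    candidate_ok: int,
    candidate_total: int,
    prod_ok: int,
    prod_total: int,
    max_drop: float = 0.01,
) -> bool:
    """Return True if the candidate DNS success rate is within `max_drop` of prod.

    Refuses to promote when either side has no probe data.
    """
    if candidate_total == 0 or prod_total == 0:
        return False
    candidate_rate = candidate_ok / candidate_total
    prod_rate = prod_ok / prod_total
    return candidate_rate >= prod_rate - max_drop


# Made-up counts: probes via the changed DNS vs. probes via the prod DNS.
healthy_change = safe_to_promote(990, 1000, 995, 1000)
harmful_change = safe_to_promote(900, 1000, 995, 1000)
```

Since only probe traffic resolves through the changed DNS, a harmful change shows up in the probe data without affecting production requests.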
80. PREVENTION
Netflix Example: Provisioning the Backbone
Diagram labels: Internet, IX (Site 1, Site 2, Site 3, ..., Site N), IX-Cloud, Private Backbone, AWS, AWS Cloud
type: HTTP GET
name: IX Steering
targets:
policy.ixaws.me/probe
Results:
site1: % probes
site2: % probes
site3: % probes
...
siteN: % probes
Prod Traffic:
- RPS
- Gbs In
- Gbs Out
81. PREVENTION
Netflix Example: Provisioning the Backbone
Diagram labels: Internet, IX (Site 1, Site 2, Site 3, ..., Site N), IX-Cloud, Private Backbone, AWS, AWS Cloud, Client-IX Steering Policy, AWS Region Steering Policy
type: HTTP GET
name: IX Steering
targets:
policy.ixaws.me/probe
Results:
site1: % probes
site2: % probes
site3: % probes
...
siteN: % probes
Prod Traffic:
- RPS
- Gbs In
- Gbs Out
82. PREVENTION
From Probes to Traffic Estimates
Site % probes x PROD Traffic (RPS, Gbs In, Gbs Out) = Site RPS, Site Gbs In, Site Gbs Out
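The estimate is a straightforward scaling of production totals by each site's share of probes. A small sketch, with made-up shares and traffic numbers and hypothetical metric keys (`rps`, `gbs_in`, `gbs_out`):

```python
from typing import Dict


def estimate_site_traffic(
    probe_share: Dict[str, float],
    prod_traffic: Dict[str, float],
) -> Dict[str, Dict[str, float]]:
    """Scale each production metric by the fraction of probes that landed on each site."""
    return {
        site: {metric: share * value for metric, value in prod_traffic.items()}
        for site, share in probe_share.items()
    }


# Illustrative inputs only: fraction of probes per site and total prod traffic.
probe_share = {"site1": 0.5, "site2": 0.3, "site3": 0.2}
prod_traffic = {"rps": 1000.0, "gbs_in": 10.0, "gbs_out": 40.0}

estimates = estimate_site_traffic(probe_share, prod_traffic)
```

Here `estimates["site1"]` is 500 RPS, 5 Gbs in and 20 Gbs out, which is half of each production total because site1 received half of the probes.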
83. PREVENTION
From Probes to Traffic Estimates
Input: Probes + Prod Traffic
- Probe Site %
- Backbone Topology
- Prod: RPS, Gbs In, Gbs Out
Variations:
- Client to IX Site Steering Policy
- IX to AWS Region Steering Policy
Objective(s):
- min latency
- min cost
- min risk
- ...
Backbone link -> <traffic>
link1: <gbs>
link2: <gbs>
link3: <gbs>
...
linkN: <gbs>
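A simplified sketch of how per-site traffic estimates, an IX-to-region steering policy and a backbone topology could be combined into a per-link load. The data layout, the region names and the assumption that each (site, region) pair maps to one fixed list of links are all hypothetical; the slides do not describe the actual model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def link_loads(
    site_gbs: Dict[str, float],
    ix_to_region: Dict[str, Dict[str, float]],
    paths: Dict[Tuple[str, str], List[str]],
) -> Dict[str, float]:
    """Sum estimated traffic over every backbone link it would traverse."""
    loads: Dict[str, float] = defaultdict(float)
    for site, gbs in site_gbs.items():
        for region, fraction in ix_to_region[site].items():
            for link in paths[(site, region)]:
                loads[link] += gbs * fraction
    return dict(loads)


# Illustrative inputs only.
site_gbs = {"site1": 10.0, "site2": 4.0}
ix_to_region = {
    "site1": {"regionA": 0.5, "regionB": 0.5},
    "site2": {"regionA": 1.0},
}
paths = {
    ("site1", "regionA"): ["link1"],
    ("site1", "regionB"): ["link1", "link2"],
    ("site2", "regionA"): ["link3"],
}

loads = link_loads(site_gbs, ix_to_region, paths)
```

Re-running the calculation with different steering policies gives candidate link loads that can then be compared against the stated objectives, such as minimum latency, cost or risk.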
84. PREVENTION
Summary
● Leverage your clients
● Sophisticated analysis instead of tooling
● Probe design is important
● Rich insights with basic measurements
● Applications beyond monitoring
1. Measure
2. Troubleshoot
3. Remediate
4. Prevent