What Nature Can Tell Us About IoT Security at Scale
RISQ Conference – November 2018
Geoff Sullivan – Head of Marketing, Americas International, Juniper Networks
The deck that I’m about to go through, Kireeti Kompella presented at NGMN in Vancouver in November 2018. And shout out to Steve Kohalmi for the overall flocks of birds analogy. And the file you’ll see here is the lightweight version. We have some really cool videos embedded in the heavy version that we have up on SAVO (which I know is going to be replaced by another tool soon).
It’s been 2 or years now since Kireeti first developed this concept of The Self-Driving Network as an analogy to self-driving cars.
The premise is that networks today are too complex, difficult, and expensive to manage and operate. They’re too brittle and not adequately protected from threats. We need networks that are operationally efficient, secure, reliable, and resilient.
And while technologies such as SDN & NFV have been incredibly useful, they haven’t fully addressed the fundamental challenges of operating a network. And we’ve joked that the Self-Driving Network is the new “SDN.”
And this is a quick review, everyone should already be familiar with this - The Self-Driving Network is a framework around the progression of several technologies, including: Real-time Telemetry, Automation, Programmability, Intent-driven programming, Multi-modal views of the network, Rules-Based systems vs. Machine learning, and more.
The Self-Driving Network: self-configures, self-corrects, . . . And getting more toward the topic of today, it Self-Defends. BTW, if you’re keeping score at home this is yet another “SDN”.
And there is a lot at stake. You’ve all heard the scary stories about security breaches. Some of them real, some of them imagined. A couple of months ago, the big news was the discovery of malicious chips that had been inserted directly onto motherboards of servers sold by Supermicro. At the origin, at the factory in China where they’re assembled.
And what we’re hearing is that threat researchers were able to discover this based on unusual communication behaviour from the motherboard. This is an important point. Security policies did not catch the problem; behavioural analysis did.
Today, I wanted to introduce another analogy to help us think about IoT security. And that is the idea of IoT devices as flocks of birds.
Most IoT devices are flawed but in ways not like humans. They’re simple, dumb, and often single-minded, with a defined purpose. But they’re predictable. And I guess you can apply those adjectives to some humans too.
But anyway, IoT devices often hide behind gateways and roaming agreements, their traffic can be encrypted, they may not have well-known, established signatures or trusted IDs. Visibility is an issue. We have device profilers but they can be fooled. And with “massive IoT,” one of the 3 principal use cases for 5G, the name says it all – there are too many devices (and too many TYPES of devices) on your network to manage manually. And the scale and complexity of the problem overwhelm current approaches that are only If-Then or Rule-Based.
So, all of this points to Behavioral analysis as a key part of solving the IoT Security problem. Network operators should operate with a zero trust mentality when it comes to new devices joining the network. Like flocks of birds, IoT devices of the same type (thermostats, cars, baby monitors, elevators, etc.) behave in similar ways. They rarely deviate, but when they do, usually all of that category does.
We can use Cluster Analysis, which is a form of unsupervised machine learning. Again, if you’re playing buzzword bingo along with me, yes, this is machine learning. Plus we can also use Classification (a type of weak, supervised machine learning) to determine these different “flocks” or groups, categories and then determine normal behavior vs. abnormal behaviour. And the math and many of the techniques behind this are not new, but we do have work to do to apply it to solving IoT Security problems. The scale & size of the problem present a challenge, but the amount of data gives us the opportunity for more accurate clustering.
Here’s how we see this working. Let’s call it ‘The Behavioral Analysis 5-Step. . .’
Step 1. Observe.
In principle, IoT security is simple, each device connects to its application, running on servers in the cloud. They mutually authenticate one another and protect all their communications with integrity checking and encryption. But this completely ignores the reality that there is a network in-between. In many cases, not a single network, but several. Look at a car connecting to its manufacturer. It first connects to the cellular service provider, actually two of them if the car is roaming, and then on to the Internet which again could be made of multiple service providers, before reaching the car manufacturer’s data center.
These networks in the middle have very little, most likely no visibility into the identity of the IoT devices, the IoT applications, and their intended uses and behaviors. And most IoT traffic is encrypted anyway.
So, we need to monitor device BEHAVIOR. Who is the device communicating to? How often? How much? The packet and flow characteristics.
And most of this learning should be done centrally to maximize the amount of data that we’re feeding into algorithms.
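To make the "observe" step concrete, here is a minimal sketch of turning raw flow records into per-device behavioral features that answer the questions above (who, how often, how much, packet characteristics). The `Flow` record, its field names, and the `behavior_features` function are assumptions for illustration; they are not a Juniper telemetry format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical flow record as a network in the middle might see it."""
    device: str      # opaque identifier for the device (assumed to be available)
    peer: str        # remote endpoint the device talked to
    bytes_sent: int  # bytes carried by the flow
    packets: int     # packets carried by the flow


def behavior_features(flows, window_hours):
    """Summarise each device's flows over one observation window."""
    per_device = defaultdict(list)
    for flow in flows:
        per_device[flow.device].append(flow)

    features = {}
    for device, device_flows in per_device.items():
        total_bytes = sum(f.bytes_sent for f in device_flows)
        total_packets = sum(f.packets for f in device_flows)
        features[device] = {
            # Who is the device communicating with?
            "distinct_peers": len({f.peer for f in device_flows}),
            # How often?
            "flows_per_hour": len(device_flows) / window_hours,
            # How much?
            "bytes_per_hour": total_bytes / window_hours,
            # Packet characteristics
            "mean_packet_size": total_bytes / total_packets if total_packets else 0.0,
        }
    return features


flows = [
    Flow("cam-1", "video.example", 9_000_000, 6_000),
    Flow("cam-1", "video.example", 9_000_000, 6_000),
    Flow("sensor-7", "telemetry.example", 200, 2),
]
features = behavior_features(flows, window_hours=1.0)
```

None of these features needs the payload, so the same summary works when the traffic is encrypted.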
Step 2. Cluster similar devices.
Cluster analysis and other unsupervised Machine Learning techniques reveal the “flocks”, the categories of IoT devices that share the same behavioral/dominant characteristics. This establishes baseline, normal behavior. Cluster analysis doesn’t tell us what the devices are – domain expertise is required here (again humans are not being replaced, just augmented). Domain experts will develop rules that can be used to label the data (and we need classification algorithms to classify the types of IoT devices). Even a small data set labelled with these rules can train a model, which can be used to label more data (more IoT devices).
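As a rough illustration of how the flocks can "pop out" of behavioral data, here is a small k-means sketch in plain Python. K-means is just one clustering technique chosen for the example; the talk does not name a specific algorithm. The two features (log of bytes per hour, flows per hour) and the sample values are made up, and in practice you would more likely reach for a library such as scikit-learn.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means: returns (labels, centroids) for a list of numeric tuples."""
    # Deterministic farthest-point initialisation so the example is repeatable.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    def assign():
        # Each point joins the cluster of its nearest centroid.
        return [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]

    for _ in range(iterations):
        labels = assign()
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its members.
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(), centroids


# Illustrative feature vectors: (log10 of bytes per hour, flows per hour).
devices = [
    (7.2, 2.0), (7.3, 2.0), (7.1, 3.0),   # behave like video cameras
    (2.3, 1.0), (2.0, 0.5), (2.5, 1.0),   # behave like low-rate sensors
]
labels, centroids = kmeans(devices, k=2)
```

The cluster numbers themselves carry no meaning: the algorithm only says "these devices behave alike", which is exactly why the labelling step with domain experts is still needed.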
An example – The observed behaviour from a device is that it sits idle most of the time (from a network traffic perspective). Gets an OS patch download from a public Cloud about once every two months. An end user checks it over a web interface once a week. --> Maybe this is a thermostat (or another consumer device)
Other examples - Video cameras stream pretty regular, high rate traffic; sensors transmit small amounts of data, many of them do this very infrequently; a laptop will have many, many sessions, exhibiting different traffic patterns; and on and on. And Fitbits, HVAC systems, Elevators, Parking Meters all exhibit different behaviour.
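The idea of expert rules labelling a small data set, which then trains a model that labels more devices, can be sketched like this. Everything here is an assumption for illustration: the thresholds in `rule_label` are invented, the feature names are hypothetical, and the nearest-centroid classifier simply stands in for whatever classification algorithm would actually be used.

```python
import math


def rule_label(dev):
    """Hypothetical expert rules; the thresholds are illustrative, not measured."""
    if dev["mbytes_per_day"] > 1000 and dev["flows_per_day"] >= 1:
        return "video_camera"       # regular, high-rate traffic
    if dev["mbytes_per_day"] < 1 and dev["flows_per_day"] < 1:
        return "thermostat_like"    # idle most of the time
    return None                     # the rules do not cover this device


def vector(dev):
    """Log-scale the features so volume differences don't swamp everything else."""
    return (math.log10(dev["mbytes_per_day"] + 1e-3),
            math.log10(dev["flows_per_day"] + 1e-3))


def train_centroids(devices):
    """Average the feature vectors of the devices the rules could label."""
    groups = {}
    for dev in devices:
        label = rule_label(dev)
        if label is not None:
            groups.setdefault(label, []).append(vector(dev))
    return {label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for label, vecs in groups.items()}


def classify(dev, centroids):
    """Label a device with the class of the nearest centroid."""
    v = vector(dev)
    return min(centroids, key=lambda label: math.dist(v, centroids[label]))


seed = [
    {"mbytes_per_day": 5000, "flows_per_day": 24},
    {"mbytes_per_day": 8000, "flows_per_day": 48},
    {"mbytes_per_day": 0.05, "flows_per_day": 0.2},
    {"mbytes_per_day": 0.1, "flows_per_day": 0.15},
]
centroids = train_centroids(seed)

unknown = {"mbytes_per_day": 900, "flows_per_day": 20}  # falls outside the rules
guess = classify(unknown, centroids)
```

The rules only have to cover the clear-cut cases; the trained model handles devices that sit near, but not inside, the rule boundaries.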
Step 3. Detect Anomalies
Once a device is a part of a known flock, cluster analysis reveals normal vs. deviant behavior.
I saw this ostrich doing a crazy dance at an ostrich farm a few months ago – but it turns out that it’s actually normal behavior. You can google this and see what I’m talking about. But if you saw a seagull do this you would know something has gone wrong.
Now, a fundamental reality is that the behavior of a network does not remain the same for long. This is different from many other machine learning use cases that you hear about. For example, new breeds of dogs don’t pop up that frequently that are going to stump an image recognition algorithm. Sure a few new words might pop up here and there but overall languages don’t change that much.
But new IoT devices or groups of devices do appear and disappear regularly; existing ones can change their behavior completely from one day to another. Think of a software update enabling the microphone in a refrigerator to, not only monitor the noise of its compressor, but also to listen for voice commands. This new functionality will likely alter its behavior on the network using a different session for the continuous audio stream for voice recognition.
And it’s not possible to train for errors and anomalies. The number of negative examples is too low compared to the positive ones. And it’s impossible to know and cover all the potential problems. The only viable strategy is to look for outliers and determine their causes. Now there are, however, feature engineering techniques to make this process easier. It’s important to select the behavioral characteristics (in ML speak – “Features”) that matter - not only to identify a group, distinguish it from another, but also to select the features by the correct, expected behavior of the group. For example, a temperature sensor is unlikely to connect to a large number of servers or to make repeated DNS requests. Or a parking meter is unlikely to have a Facebook account. Selecting the right features makes anomaly detection much more effective.
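Looking for outliers inside a flock, rather than training on known attacks, could look something like the sketch below. It uses the median and the median absolute deviation (the "modified z-score"), which is one common outlier test picked for this example and not necessarily what a production system would use. The feature (distinct servers contacted per hour by temperature sensors) and the numbers are invented.

```python
from statistics import median


def flock_outliers(values, threshold=3.5):
    """Return devices whose value is far from the flock's median.

    values: dict mapping device name -> one behavioural feature value.
    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # Every typical device is identical, so any difference stands out.
        return [d for d, v in values.items() if v != med]
    return [d for d, v in values.items()
            if 0.6745 * abs(v - med) / mad > threshold]


# Distinct servers contacted per hour by a flock of temperature sensors.
servers_contacted = {"t1": 1, "t2": 1, "t3": 2, "t4": 1, "t5": 2, "t6": 40}
suspects = flock_outliers(servers_contacted)
```

Median-based statistics are used here because a single badly behaved device barely moves them, whereas it would drag a mean and standard deviation toward itself.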
Step 4. Identify root cause
We need to determine whether this anomaly is a security problem. Is it accidental? Is it malicious?
Going back to the animal analogy: So, the animal is acting strangely? - Is a tsunami coming or just a car driving by? Is it just a wounded bird we’re dealing with or is it a cuckoo that’s trying to fool you? This is about managing the false positives that overwhelm a lot of security operations today.
The solution is:
To zoom-in on the problem for granular analysis. So much of what we talk about today is local vs. central processing. For Juniper, our ATP appliance can help out in the case of zooming in and local processing. But we also need to aggregate/correlate data over time, geography, different network domains, etc. to get the full picture.
And to some extent it’s up to the human experts to figure out whether an unpredicted event is an anomaly or just a change in behavior of existing devices or a whole new group of, class of, devices. So, an expert has to perform root-cause analysis, look back in history if there was something unusual; something that’s gone under the radar, but in light of this new behavior it’s now recognized. And you decide whether to flag this as a security event or to retrain the model, for example to include SW updates.
So, if the anomaly turns out to be NOT malicious? We stop here - Update the baseline behavior of this category of IoT devices. Tune/re-train the model
And if the anomaly IS malicious or something to worry about, then we go to step 5.
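One input the expert could use for that decision is whether a single device has wandered off or the whole flock has moved together, as it would after a software update. Here is a minimal sketch under that assumption. The function names, the 50% change tolerance, and the 80% "flock-wide" cut-off are all hypothetical, and the result is only a hint: a flock-wide shift can still be malicious, so a human still makes the call.

```python
def changed_devices(baseline, current, tolerance=0.5):
    """Devices whose feature moved by more than `tolerance` (relative) from baseline."""
    changed = []
    for device, before in baseline.items():
        after = current[device]
        scale = max(abs(before), 1e-9)  # avoid dividing by zero
        if abs(after - before) / scale > tolerance:
            changed.append(device)
    return changed


def triage_hint(baseline, current, flock_wide_fraction=0.8):
    """Suggest whether a change looks individual or flock-wide."""
    changed = changed_devices(baseline, current)
    if not changed:
        return "no_change", changed
    share = len(changed) / len(baseline)
    if share >= flock_wide_fraction:
        # Candidate for retraining the baseline, pending expert review.
        return "flock_wide_shift", changed
    # Candidate security event: zoom in on these devices.
    return "individual_deviation", changed


# Sessions per hour for a flock of refrigerators.
before = {"f1": 1, "f2": 1, "f3": 1, "f4": 1, "f5": 1}
after_update = {"f1": 3, "f2": 3, "f3": 3, "f4": 3, "f5": 3}
one_odd_bird = {"f1": 1, "f2": 1, "f3": 1, "f4": 1, "f5": 9}

update_hint, _ = triage_hint(before, after_update)
odd_hint, odd_devices = triage_hint(before, one_odd_bird)
```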
And Step 5 is: Remediate
And this can come in different forms. From a more manual intervention to something more automated like calling a Python script to: rate limit, block a port, instantiate a new policy, re-boot a line card. And you can see an evolution where machine learning could create the scripts and apply them to particular situations.
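A remediation script of that kind might be organised as a small dispatcher, sketched below. The action functions are hypothetical placeholders that only build a description of what should be done; they do not call any real Juniper or vendor API, and the event kinds are invented for the example.

```python
def rate_limit(device, kbps=64):
    """Placeholder: describe a rate-limit action (no real device call is made)."""
    return {"action": "rate_limit", "device": device, "kbps": kbps}


def block_port(device, port):
    """Placeholder: describe a port-block action (no real device call is made)."""
    return {"action": "block_port", "device": device, "port": port}


def escalate(device, reason):
    """Fallback: hand the event to a human operator."""
    return {"action": "escalate_to_operator", "device": device, "reason": reason}


def remediate(event):
    """Map a confirmed security event to a remediation plan."""
    kind = event["kind"]
    if kind == "traffic_spike":
        return rate_limit(event["device"])
    if kind == "unexpected_port":
        return block_port(event["device"], event["port"])
    # Anything the playbook does not recognise goes to a person.
    return escalate(event["device"], reason=kind)


plan = remediate({"kind": "unexpected_port", "device": "meter-12", "port": 23})
```

Keeping a manual escalation path as the default reflects the range described above, from manual intervention to full automation.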
And when the problem is fixed you do need to zoom-out to preserve resources and limit the amount of information, network traffic you’re monitoring.
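Zooming in and back out can be pictured as a polling interval that tightens when something looks wrong and relaxes once things have been quiet for a while. This is only a sketch: the `AdaptiveMonitor` class, the 300-second and 10-second intervals, and the three-quiet-rounds rule are assumptions chosen for illustration.

```python
class AdaptiveMonitor:
    """Tighten the telemetry polling interval on anomalies, relax it when calm."""

    def __init__(self, base_interval_s=300, zoom_interval_s=10, calm_rounds=3):
        self.base = base_interval_s      # normal, zoomed-out interval
        self.zoom = zoom_interval_s      # detailed, zoomed-in interval
        self.calm_rounds = calm_rounds   # quiet rounds needed before zooming out
        self.interval = base_interval_s
        self._calm = 0

    def observe(self, anomalous):
        """Record one observation and return the interval to use next."""
        if anomalous:
            self.interval = self.zoom    # zoom in immediately
            self._calm = 0
        elif self.interval == self.zoom:
            self._calm += 1
            if self._calm >= self.calm_rounds:
                self.interval = self.base  # zoom back out to save resources
                self._calm = 0
        return self.interval


monitor = AdaptiveMonitor()
history = [monitor.observe(flag) for flag in (False, True, False, False, False)]
```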
I’ve laid out, in some ways, the ideal state that we want to get to. But there are certainly some ongoing challenges for Juniper and the industry in general. . .
Telemetry/ Visibility
IoT data encrypted (NB-IoT, SigFox, etc.)
Devices hidden behind firewalls, IoT gateways, non-IP LANs
--> This is why behavioral analysis is critical
Need more telemetry/visibility than firewalls can provide [jab at Palo Alto]. Telemetry from Routers, all network elements.
OpenNTI??
Encrypted content
Assume IoT traffic is encrypted. The “network in the middle” is in the dark --> thus, the importance of behavioral analysis
Data
Networks produce mountains of data. Too much data?
Balance between Local/Global processing – Solution is BOTH (pre-processing and wider correlation). Edge & core processing.
Local processing (eg, in IoT G/W) saves bandwidth, faster, isolates threats
Global views, processing aggregated data provides context, better answers
Unlike many machine learning problems today, most network data is “structured” (time-stamped, IP address, etc.)
But data needs to be “Labelled” – tie an outcome or root cause to the observed behavior. And labelling for Classification. Clustering separates out the different devices, but it doesn’t tell you what they are. Need classification (supervised, weak learning).
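On the local/global balance above, one way to picture "BOTH" is for each gateway to pre-process its own traffic into a compact summary and for a central system to merge those summaries. The sketch below assumes a summary of count, sum, and sum of squares, which is enough to recover a global mean and variance without shipping raw samples. The function names and the sample values are hypothetical.

```python
def local_summary(values):
    """Run at the edge (e.g. an IoT gateway): compress raw samples to three numbers."""
    return {
        "n": len(values),
        "total": sum(values),
        "total_sq": sum(v * v for v in values),
    }


def merge(summaries):
    """Run centrally: combine the per-gateway summaries by simple addition."""
    return {
        "n": sum(s["n"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
        "total_sq": sum(s["total_sq"] for s in summaries),
    }


def mean_and_variance(summary):
    """Global mean and population variance recovered from a merged summary."""
    mean = summary["total"] / summary["n"]
    variance = summary["total_sq"] / summary["n"] - mean * mean
    return mean, variance


# Kilobytes per hour seen for one device type at two different gateways.
gateway_a = local_summary([10, 12, 14])
gateway_b = local_summary([100, 20])
global_mean, global_variance = mean_and_variance(merge([gateway_a, gateway_b]))
```

Only three numbers per gateway cross the network, which saves bandwidth, while the central view still gets the wider context for correlation.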
Arms race
Bad guys learn to blend in with the flock. Learn to fool the algorithms.
Machine learning
Skills needed. Programming, Data Scientists (collect & prepare data), ML (create new algorithms)
Cluster Analysis actually doesn’t require deep domain (ie, networking) expertise (??). And that’s the power of it. Just gather the data, analyze it, and the clusters (flocks of birds) pop out.
But for prediction and closed-loop, automated remediation, we need additional skills
“Feature engineering” for the algorithms. In other words, what device behavioral characteristics matter? The cluster/behavioral analysis will help us with this.
Merging Networking with data science
Juniper
Building foundational security technologies. Option to process from customer premises through to centralized clouds
On-prem (J-ATP)
Security natively integrated with Contrail Edge Cloud Solution
Security natively integrated w/ SD-WAN
Edge/Core (MX as a F/W)
New network INTERFACES must be secured - IoT G/Ws. Edge/MEC. 5G mobile core is a set of functions in different VMs. vSRX, cSRX.
Working with cloud hyper-scalers (AWS, etc.) to integrate their IoT solutions with our infrastructure (eg, AWS Greengrass client is easily ported onto NFX)
Working w/ application vendors to better identify IoT devices (Forescout, Fingerbank) – (it’s not ALL behavioural analysis. . .)
cSRX deployed on an NXP ARM platform for connected vehicles – we will announce this at CES in January
Wrap up - The network has been seen as a liability with respect to security, but we need to view it as an asset. Pervasive security, integrated with the network. We don’t see a distinction between “networking” and “security” anymore. Need telemetry from all network elements to solve the IoT security at scale problem. Need more telemetry/visibility than only firewalls can provide.
OK, thanks for listening. And please feel free to use, re-use, re-purpose these slides for anything you like.