Constantly Learning: Every packet, every flow, every app; machine learning to optimize; benchmark each
Constantly Adapting: Run-time provisioning; on-demand capacity; multicloud app mobility
Constantly Protecting: Micro-segmentation-based isolation; securing data in transit; zero-trust model
Cisco Tetration is an analytics platform that provides a turnkey solution for data center security use cases. This removes any requirement for in-house data scientists or other programming expertise to deploy, operate, and realize the benefits of the platform. These features are supported independent of the infrastructure you have and where your applications are running.
In the Cisco Tetration platform, we start by collecting rich telemetry from both servers and the network, use unsupervised machine learning and other algorithmic approaches to baseline the behavior of the workloads, and apply statistical models on top. The telemetry collected includes metadata from every packet header (monitoring every packet, every flow) within the data center, plus process-level details from the servers such as process name, user, process execution details, and the process binary hash. The platform correlates network traffic to the process on a server.
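To make the baselining idea concrete, here is a minimal sketch of unsupervised behavior baselining, assuming invented per-workload flow features and a simple k-means model from scikit-learn. It illustrates the general approach, not Tetration's actual algorithms.

```python
# Hypothetical sketch: baseline per-workload flow behavior with unsupervised
# clustering, then flag new observations that sit far from every cluster.
# Feature names and the threshold are illustrative, not Tetration's model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: [flows_per_min, mean_pkt_size, unique_dst_ports, bytes_out_ratio]
baseline = np.array([
    [120, 640, 3, 0.40],
    [115, 655, 3, 0.42],
    [130, 630, 4, 0.38],
    [118, 650, 3, 0.41],
])

scaler = StandardScaler().fit(baseline)
model = KMeans(n_clusters=1, n_init=10).fit(scaler.transform(baseline))

def is_deviation(sample, threshold=3.0):
    """Flag a telemetry sample whose distance to the nearest cluster
    center exceeds the threshold (in standardized units)."""
    dist = model.transform(scaler.transform([sample])).min()
    return dist > threshold

print(is_deviation([119, 645, 3, 0.40]))   # False: matches the baseline
print(is_deviation([900, 80, 45, 0.95]))   # True: scan-like behavior
```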
The application insight use case is designed to provide a clear and accurate view of application communications and their dependencies. It also provides information about which communications pass through L4-7 services such as load balancers, which other infrastructure services the applications depend on, and which external entities access the application.
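As an illustration of how communication telemetry can be turned into such a dependency view, here is a small sketch that builds a directed consumer-to-provider graph with networkx. The flow records, component names, and roles are all invented for the example.

```python
# Illustrative sketch: derive an application dependency view from flow
# telemetry by building a directed consumer -> provider graph.
import networkx as nx

flows = [
    ("web-01", "app-01", 8443),   # (consumer, provider, dst_port)
    ("web-02", "app-01", 8443),
    ("app-01", "db-01",  3306),
    ("app-01", "lb-ext", 443),    # traffic via an L4-7 load balancer
]

g = nx.DiGraph()
for consumer, provider, port in flows:
    g.add_edge(consumer, provider, port=port)

# Service dependencies of one component:
print(sorted(g.successors("app-01")))     # ['db-01', 'lb-ext']
# Who consumes that component:
print(sorted(g.predecessors("app-01")))   # ['web-01', 'web-02']
```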
Building on the insight foundation, Cisco Tetration extends the capability into security by providing the ability to auto-generate a consistent whitelist policy based on application behavior, along with a near-real-time approach to keeping the policy up to date as the application behavior changes. You can perform policy simulation and impact analysis using historical or real-time data to determine what traffic will be allowed and what will be dropped if that policy is enforced.
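A hedged sketch of that idea: derive an allow-list from flows observed during a learning window, then replay later traffic against it to see what would be allowed or dropped before enforcing anything. The flow tuples are invented.

```python
# Sketch only: whitelist generation plus policy simulation/impact analysis.
observed = {("web", "app", 8443), ("app", "db", 3306)}

# Policy = exactly the behavior seen during the learning window.
policy = set(observed)

def simulate(policy, traffic):
    """Classify each flow as allowed or dropped under the candidate policy."""
    allowed = [f for f in traffic if f in policy]
    dropped = [f for f in traffic if f not in policy]
    return allowed, dropped

later_traffic = [("web", "app", 8443), ("web", "db", 3306)]  # 2nd is new
allowed, dropped = simulate(policy, later_traffic)
print("allowed:", allowed)
print("dropped:", dropped)   # surfaces the impact before enforcement
```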
Application segmentation: With Tetration Analytics, we extend the platform to take action based on the whitelist generated as part of application insight. This policy can then be enforced to realize consistent segmentation. Implementing granular policies using operating-system capabilities such as iptables or Windows Firewall on the servers provides massive scale and consistent enforcement, whether the workloads are virtualized, bare-metal, or containers. This model also ensures that the policy stays intact even when a workload moves. Once the policy is enforced, the platform continuously monitors for compliance. If there is any policy deviation, a notification is sent to northbound systems, enabling proactive security operations.
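For illustration only, here is how an allow-list policy could be rendered as host-level iptables rules with a default-drop stance. The addresses and ports are invented, and this is a sketch of the concept, not the agent's actual enforcement mechanism.

```python
# Illustrative only: render an allow-list policy as iptables rules for a
# host, default-dropping everything else (the whitelist model).
policy = [
    {"src": "10.0.1.0/24", "dport": 8443},
    {"src": "10.0.2.5/32", "dport": 3306},
]

rules = ["iptables -P INPUT DROP"]  # whitelist model: drop by default
for p in policy:
    rules.append(
        f"iptables -A INPUT -p tcp -s {p['src']} --dport {p['dport']} -j ACCEPT"
    )
print("\n".join(rules))
```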
Process behavior deviation: The next step is to baseline the behavior of the processes running on the servers, then apply models to identify any behavior deviations and compare them with known “bad” patterns exhibited by those processes. When these indicators of compromise are detected early, you can minimize the impact through faster remediation.
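A minimal sketch of the baseline-then-compare idea, assuming each process is summarized as a (name, user, binary hash) tuple; the tuples and hashes are placeholders.

```python
# Sketch under assumptions: baseline a host's processes as
# (name, user, binary_hash) tuples and flag anything unseen, e.g. a known
# binary launched by an unexpected user or with a changed hash.
baseline = {
    ("sshd",   "root",  "ab12..."),
    ("mysqld", "mysql", "cd34..."),
}

def deviations(current):
    """Return process observations that fall outside the baseline."""
    return [p for p in current if p not in baseline]

snapshot = [
    ("sshd",   "root", "ab12..."),
    ("mysqld", "root", "cd34..."),  # unexpected user: flagged
]
print(deviations(snapshot))
```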
Software vulnerability detection: Last but not least, keeping an accurate inventory of the software packages and versions installed on the servers, and identifying the common vulnerabilities and exposures (CVEs) associated with them, is also key. This enables you to significantly reduce risk and threat exposure from known vulnerabilities.
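As a sketch of the matching step, here is a lookup of an installed-package inventory against a CVE feed keyed by package and version. The feed is reduced to two well-known CVEs purely for illustration.

```python
# Hypothetical sketch: match an installed-package inventory against a CVE
# feed keyed by (package, version).
cve_feed = {
    ("openssl", "1.0.1f"): ["CVE-2014-0160"],   # Heartbleed
    ("bash",    "4.2"):    ["CVE-2014-6271"],   # Shellshock
}

inventory = [("openssl", "1.0.1f"), ("nginx", "1.14.0")]

for pkg in inventory:
    for cve in cve_feed.get(pkg, []):
        print(f"{pkg[0]} {pkg[1]} is exposed to {cve}")
```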
In order to implement effective and efficient application segmentation, it is critical to understand how application components communicate with each other, which infrastructure services they depend on, and how component clusters are grouped together. Tetration uses rich telemetry and unsupervised machine learning to achieve this. This application insight and dependency map forms the basis for the segmentation policy. To get started, the customer deploys the Tetration Analytics platform and the agents. Customers with Nexus 9300-EX switches can turn on hardware sensors in addition to the software sensors.
Application administrators can perform ADM (application dependency mapping) runs and create application workspaces. An application workspace allows users and administrators to:
Group endpoint hosts and application clusters together to create application views
Accurately understand the relationship of consumers and providers based on communication patterns
Understand the service dependencies for each component
Associate labels and tags with endpoints for easy understanding
[Mention this only when the audience is more technical and can understand hierarchy]
An application can have multiple workspaces in a hierarchy. For example, at the lowest level might be the application team, such as HRMS. HRMS may use a database, and that database could be part of five other applications. Higher still would be security ops or network ops teams defining their policies. When the Tetration platform generates the policy, it automatically merges the higher-level policies with the application-specific policies. Examples of higher-level policies: production workloads should not talk to non-production workloads; database servers cannot connect to the internet.
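A minimal sketch of such a merge, assuming higher-scope rules take precedence simply by being evaluated first; the rule format is invented for the example.

```python
# Sketch: hierarchical policy merge where security/network ops rules are
# ordered ahead of application-level rules, so they win on first match.
global_policy = [
    {"action": "deny", "src": "prod", "dst": "non-prod"},
    {"action": "deny", "src": "db",   "dst": "internet"},
]
app_policy = [
    {"action": "allow", "src": "web", "dst": "app"},
    {"action": "allow", "src": "app", "dst": "db"},
]

# Higher-level rules take precedence: they are simply ordered first.
merged = global_policy + app_policy
for rule in merged:
    print(rule)
```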
Application administrators will be able to view the higher-level predefined policies but will not be able to modify them.
The whitelist policy for segmentation is available from the application workspace.
That said, the pervasive problem that we all face is that our operational paradigms in data centers are fundamentally reactive.
Narrative:
We have operational issues, and end up with long troubleshooting cycles and war rooms
We suffer a breach, and then look for where we left a hole in our policies and do forensics after the fact
We are often not compliant with business intent, and failing audits initially is not uncommon
And how often do we make changes, only to roll back because we made mistakes? It is almost the norm, not the exception
We fundamentally lack the ability to ASSURE INTENT PROACTIVELY
What we need is the ability to assure intent: the guarantee, the confidence, that the infrastructure is doing exactly what you intended it to do
That your changes and config are correct and consistent
That the forwarding state has not drifted to something bad
That VM deployment and movement haven't broken your reachability intent
That your security policies are achieving the segmentation goals per your intent
That they are always compliant with business rules and you can pass audits easily
That’s what we are bringing to the market with Cisco Network Assurance Engine.
It's a whole new way of solving this problem
It starts with building mathematically precise models of the network. For instance, we take all your security contracts and represent them in a software model. Now you can ask that model all sorts of questions: can A talk to B? Is A isolated? Do we have any conflicting policies among the thousands or millions of policies? And so on. We build models spanning security, forwarding, endpoint configs, hardware resource utilization, policies, and more. This is the most comprehensive model of the network.
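To illustrate the "ask the model" idea, here is a toy policy model over (source, destination) contracts that answers reachability questions and scans for conflicting rules. It is a deliberately simplified stand-in for the real formal model, with invented EPG names.

```python
# Toy policy model: security contracts as ordered rules over (src, dst).
rules = [
    ("web-epg", "app-epg", "permit"),
    ("app-epg", "db-epg",  "permit"),
    ("web-epg", "db-epg",  "deny"),
    ("web-epg", "db-epg",  "permit"),   # conflicts with the rule above
]

def can_talk(src, dst):
    """First matching rule wins, mirroring ordered policy evaluation."""
    for s, d, action in rules:
        if (s, d) == (src, dst):
            return action == "permit"
    return False  # whitelist default: no rule means no reachability

def conflicts():
    """Report (src, dst) pairs that carry contradictory actions."""
    seen, out = {}, []
    for s, d, action in rules:
        if (s, d) in seen and seen[(s, d)] != action:
            out.append((s, d))
        seen[(s, d)] = action
    return out

print(can_talk("web-epg", "app-epg"))  # True
print(conflicts())                     # [('web-epg', 'db-epg')]
```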
We didn't stop there. We then codified thousands of failure scenarios, right out of the box, that run continuously against these real-time models to verify and validate the entire network. These checks are based on our experience of how networks should correctly operate, best design practices from our AS teams, and the collective knowledge we have across TAC cases from thousands of customers.
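As an example of what one codified check might look like, here is a sketch that detects forwarding loops in an invented next-hop graph using networkx; real checks would run against the full model described above.

```python
# Illustrative check from a hypothetical library of codified failure
# scenarios: detect a forwarding loop in a next-hop graph.
import networkx as nx

next_hops = {
    "leaf-1":  "spine-1",
    "spine-1": "leaf-2",
    "leaf-2":  "spine-1",   # leaf-2 -> spine-1 -> leaf-2: a loop
}

g = nx.DiGraph(next_hops.items())

def check_no_forwarding_loops(graph):
    """PASS if the forwarding graph is loop-free, FAIL with the cycles."""
    loops = list(nx.simple_cycles(graph))
    return ("PASS", []) if not loops else ("FAIL", loops)

print(check_no_forwarding_loops(g))
# ('FAIL', [['spine-1', 'leaf-2']]) (or an equivalent rotation of the cycle)
```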
That's what gives operators the confidence that the network is indeed operating consistent with their intent. And here's the key point: we can do this without needing to look at any packets; we build our models from all the configurations and dynamic state. That makes it fundamentally proactive, before any data traffic even enters the network.
The product is AVAILABLE NOW. It is delivered entirely as a software form factor.
The core idea here is that networks are deterministic. Every switch, router, and firewall in the network essentially reads the packet header, makes a decision on whether to forward the packet, where, at what priority, and so on, and rewrites the packet header. Essentially, if you can infer this “network transfer function,” you can predict and model the behavior of every device in response to any change or any incoming data packet. Tie these models together across the data center, and you have a mathematical model of the entire DC network.
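A sketch of the transfer-function idea, with invented devices and match logic: each device is a pure function from a packet header to (action, next device, rewritten header), and composing those functions predicts the fate of any header without sending a single packet.

```python
# Sketch: per-device "transfer functions" composed into an end-to-end model.
def leaf1(hdr):
    if hdr["dst"].startswith("10.2."):
        return ("forward", "spine1", {**hdr, "vlan": 200})  # rewrite header
    return ("drop", None, hdr)

def spine1(hdr):
    return ("forward", "leaf2", hdr)

def leaf2(hdr):
    return ("deliver", None, hdr)

devices = {"leaf1": leaf1, "spine1": spine1, "leaf2": leaf2}

def predict(entry, hdr, max_hops=16):
    """Compose device functions to predict a header's fate proactively."""
    node = entry
    for _ in range(max_hops):
        action, nxt, hdr = devices[node](hdr)
        if action != "forward":
            return action, node
        node = nxt
    return "loop", node

print(predict("leaf1", {"dst": "10.2.0.5",    "vlan": 100}))  # ('deliver', 'leaf2')
print(predict("leaf1", {"dst": "192.168.1.1", "vlan": 100}))  # ('drop', 'leaf1')
```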
We do this using a class of techniques called “formal techniques,” which is just an academic term for techniques that can reason rigorously about the behavior of the network.
Let’s double-click to see how it works.
1. Starting from the left: what data do we collect? Candid goes to every leaf and every spine in the network and collects all the configurations, control-plane state, data-plane state, and even hardware state such as TCAM tables, VLAN tables, and so on. From the controller, we pick up the entire policy and configs and a representation of the intent. In addition, we have the implicit intent based on expected network behavior.
2. With all this, we now build the comprehensive network model: underlay, overlay, and tenancy layers.
3. Against this model, we run checks based on 30+ years of Cisco operational domain experience. These checks are based on three things: i) our expertise in how networks and our hardware should correctly operate: there should be no routing loops, no overlapping subnets in a VRF, no duplicate IPs, and so on; ii) best design practices that we learn from our AS teams: if you want a subnet to talk externally, what are all the BD and L3Out configs required, or all the access policies required to correctly deploy an EPG; iii) finally, our TAC cases: the 10% of failure scenarios that cause 90% of failures in the field. We bring this collective knowledge to all our customers.
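As a concrete example of a check from category (i), here is a sketch of an overlapping-subnets-in-a-VRF check built on Python's ipaddress module; the VRF contents are invented.

```python
# Example codified check: flag overlapping subnets inside a VRF.
from ipaddress import ip_network
from itertools import combinations

vrf_subnets = {
    "vrf-prod": ["10.0.0.0/24", "10.0.0.128/25", "10.1.0.0/24"],
}

for vrf, subnets in vrf_subnets.items():
    nets = [ip_network(s) for s in subnets]
    for a, b in combinations(nets, 2):
        if a.overlaps(b):
            print(f"{vrf}: {a} overlaps {b}")
# vrf-prod: 10.0.0.0/24 overlaps 10.0.0.128/25
```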
Every 15 minutes or so, the engine builds the most real-time model of the network and runs these checks against that model, like an intelligent robot watching your back, always checking the network for correctness.
The first story is about a lurking human error in the config space. A heavy-equipment manufacturer in the US, with over 100 leafs across two production fabrics and a mainframe device in a DR data center. An innocent configuration error by an operator: there was no contract to the WAN interface, which basically means traffic arriving from or destined to the mainframe subnet would be dropped. In a DR event, this would prevent applications from failing over from the production to the DR data center. A single error in tens of thousands of lines of configuration. [This is a company that counts thousands of dollars per minute of downtime for some of their applications.] This was a potential multimillion-dollar outage in case of a failover event, which we were able to avoid proactively.
The second example is related to analysis of network-wide dynamic state. This was a government organisation in Europe. Users there were experiencing intermittent Skype traffic, with intermittency in their VoIP and video communication. They eventually created a major ticket and were troubleshooting for days, at which point they brought in Candid to look at their network. Literally in 15 minutes, we found that they had a contract between two VRFs, leaking subnets that happened to overlap and leading to this issue. This is a classic example: had Candid been in their production network ahead of time, we'd have caught the issue the moment the contract was created, avoiding days of downtime.
The third one shows the true power of the formal modeling approach. This was a European service provider with a multi-tenant network. Over the last couple of years they had seen huge policy sprawl, with 100K+ security policies. They reached a point where 20% of their leafs were running at max TCAM capacity, and they were unable to push any configs or policies to the network reliably. We got pulled in at that point. Literally in a few hours of analysis, Candid was able to identify that 20% of their policies were redundant: duplicate intent, basically opening the same ports in multiple policies. Further, by looking at hit counters, we were able to get granular insight into up to another 50% of policies that had never been used, giving them the visibility to have the conversation with their security teams on tightening their security aperture and optimizing TCAM utilization.
Narrative: Discuss smart events; discuss drilling down into human-readable suggested next steps. The “Assurance Engine” talks to you…