The document discusses Cisco Network Assurance Engine, which provides intent assurance for data center networks. It builds mathematical models of network behavior to continuously verify and validate the entire network. It identifies issues proactively by analyzing configuration, routing, security policies and more using over 5,000 built-in scenarios. It is deployed non-intrusively and provides a dashboard and smart events to help users understand and address problems.
That said, the pervasive problem we all face is that our operational paradigms in data centers are fundamentally reactive.
Narrative:
We have operational issues, and end up with long troubleshooting cycles and war rooms
We suffer a breach, and then look for where we left a hole in our policies, doing forensics after the fact
We are often not compliant with business intent, and failing audits initially is not uncommon
And how often do we make changes, only to roll back because we made mistakes? It's almost the norm, not the exception
We fundamentally have the inability to ASSURE INTENT PROACTIVELY
This is creating a major assurance gap in the industry.
As an operator, given the power, the questions you’d much rather ask the network are much more fundamental.
First, I'm making a set of changes to the network: how do I know that I haven't introduced some blatant misconfiguration or error that will bring down an application a couple of weeks from now? Maybe the security policy I programmed conflicts with an existing deny policy I don't know about, or I am programming a subnet that overlaps with an existing subnet. Or I am migrating 500 VLANs from the legacy network to my new fabric and fat-finger 5 subnets. Likely I'll only know about it weeks from now, when some apps are not accessible, and it will take days to debug the errors. Wouldn't it be nice if you had a system that PROACTIVELY analyzed all your policies and configurations for correctness and consistency and told you if you are making mistakes?
Second, suppose I have programmed the network correctly. But these are dynamic, complex distributed systems. The fabric is learning prefixes from the outside world: what if a routing loop was created in the forwarding tables, or we learned a more specific route from a branch so that traffic meant for an internal app gets diverted outside? Or when a VM moves from Leaf 1 to Leaf 45, what if one time in a million the default gateway doesn't get correctly programmed due to some connectivity issue, or the leaf ran out of TCAM space and not all the policies got correctly programmed? We are now sitting with transient vulnerabilities. Or what if the VMware admin programmed the port groups inconsistently with the APIC, creating a config mismatch? These are all extremely hard issues to find, and even harder to reason about. But it's critical to identify them, because they can expose us to potential outages or vulnerabilities we have no clue about. Wouldn't it be nice if you had a system that continuously analyzed your entire network's dynamic state (forwarding state, end-point configs, and so on) to ensure it is always consistent with your intent?
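To make the "more specific route" scenario above concrete, here is a minimal, purely illustrative sketch (the prefixes and names are invented, not from any real deployment): routers prefer the longest matching prefix, so a /24 leaked from a branch silently wins over the fabric's own /16 for addresses inside that /24.

```python
# Hypothetical sketch: why a more-specific route silently diverts traffic.
# Longest-prefix match means a leaked /24 beats the fabric's /16.
import ipaddress

routes = {
    ipaddress.ip_network("10.1.0.0/16"): "internal-fabric",
    ipaddress.ip_network("10.1.5.0/24"): "branch-wan",  # more specific, leaked in
}

def next_hop(dst: str) -> str:
    """Return the next hop for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]

# Traffic for an internal app inside 10.1.5.0/24 is now diverted to the WAN:
print(next_hop("10.1.5.10"))  # -> branch-wan
print(next_hop("10.1.9.10"))  # -> internal-fabric
```

Nothing in the configuration is "wrong" in isolation, which is why this class of drift is so hard to spot without continuously checking the combined forwarding state against intent.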
Third and finally, I am now programming the network in an abstracted language, a new policy language with tenants, app profiles, and EPGs, but as networking folks we need to understand the bottom-up view of the network. Where are my BDs and VLANs sitting? Where are my EPGs deployed? How is connectivity being established between A and B? Reconstructing this bottom-up network state is 80% of the challenge when I need to troubleshoot actual issues. Wouldn't it be nice if you had a system that reconstructed the bottom-up state of the network and correlated it to the policy, enabling you to troubleshoot issues an order of magnitude faster?
What we need is the ability to assure intent. It is a guarantee, the confidence that the infrastructure is doing exactly what you intended it to do
That your changes and config are correct and consistent
That the forwarding state has not drifted to something bad
That VM deployment and movement haven't broken your reachability intent
That your security policies are achieving the segmentation goals per your intent
That they are always compliant with business rules and you can pass audits easily
That’s what we are bringing to the market with Cisco Network Assurance Engine.
It's a whole new way of solving this problem.
It starts with building mathematically precise models of the network. For instance, we take all your security contracts and represent them in a software model. Now you can ask that model all sorts of questions: can A talk to B, is A isolated, do we have any conflicting policies out of 1000s or millions of policies, and so on. We build models spanning security, forwarding, end-point configs, hardware resource utilization, policies, etc. This is the most comprehensive model of the network.
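As a toy illustration of the idea (not the product's actual model; the EPG names, rule format, and first-match semantics here are all assumptions made for the sketch), security contracts can be represented as data and then queried for reachability and conflicts:

```python
# Illustrative sketch: contracts as a queryable software model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    src: str      # source EPG (names are hypothetical)
    dst: str      # destination EPG
    port: int
    action: str   # "permit" or "deny"

contracts = [
    Contract("web", "app", 443, "permit"),
    Contract("web", "app", 443, "deny"),   # conflicts with the rule above
    Contract("app", "db", 5432, "permit"),
]

def can_talk(src, dst, port):
    """Can src reach dst on port? First matching rule wins; default deny."""
    for c in contracts:
        if (c.src, c.dst, c.port) == (src, dst, port):
            return c.action == "permit"
    return False

def conflicts():
    """Pairs of rules on the same flow with opposite actions."""
    return [(a, b) for i, a in enumerate(contracts)
            for b in contracts[i + 1:]
            if (a.src, a.dst, a.port) == (b.src, b.dst, b.port)
            and a.action != b.action]

print(can_talk("app", "db", 5432))  # -> True
print(len(conflicts()))             # -> 1
```

A real engine reasons over header spaces and rule priorities rather than exact tuples, but the principle is the same: once policy is a model, questions about it become computations.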
We didn't stop there. We then codified 1000s of failure scenarios right out of the box that run against these models to continuously verify and validate the entire network. These checks are based on our experience of how networks should correctly operate, best design practices from our AS teams, and the collective knowledge we have across TAC cases from 1000s of customers. These failure scenarios run against the real-time models, continuously checking the network for correctness.
That's what gives operators the confidence that the network is indeed operating consistently with their intent. And here's the key point: we can do this without needing to look at any packets, because we build our models from all the configurations and dynamic state! And that makes it fundamentally proactive, before any data traffic even enters the network.
The product is AVAILABLE NOW. It is delivered in an entirely software form factor.
The core idea here is that networks are deterministic. Every switch, router, and firewall in the network essentially reads the packet header, decides whether to forward the packet, where, and at what priority, and changes the packet header. If you can infer this "network transfer function," you can predict and model the behavior of every device in response to any change or any incoming data packet. Tie these models together across the data center and you have a mathematical model of the entire DC network.
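The "transfer function" idea can be sketched in a few lines. This is a toy model under invented assumptions (the device names, the VLAN rewrite, and the prefix rule are all illustrative): each device is a pure function from a header to a (next hop, new header) pair, and a path through the network is just function composition.

```python
# Toy sketch of the "network transfer function" idea: every device maps an
# incoming header to (next hop, possibly rewritten header).
def leaf1(hdr):
    return "spine", dict(hdr, vlan=100)  # rewrite: tag traffic with VLAN 100

def spine(hdr):
    # forward by destination prefix (illustrative rule)
    return ("leaf2", hdr) if hdr["dst"].startswith("10.2.") else ("drop", hdr)

def leaf2(hdr):
    # deliver only if the expected VLAN tag survived the trip
    return ("deliver", hdr) if hdr.get("vlan") == 100 else ("drop", hdr)

devices = {"leaf1": leaf1, "spine": spine, "leaf2": leaf2}

def trace(start, hdr):
    """Compose per-device transfer functions until the packet exits the model."""
    node = start
    while node in devices:
        node, hdr = devices[node](hdr)
    return node

print(trace("leaf1", {"dst": "10.2.0.5"}))     # -> deliver
print(trace("leaf1", {"dst": "192.168.1.1"}))  # -> drop
```

Because each function is deterministic, outcomes for whole classes of packets can be reasoned about symbolically instead of by injecting test traffic, which is the essence of the formal approach described next.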
We do this using a class of techniques called "formal methods," which is just an academic term for techniques that can intelligently reason about the behavior of the network…
Fortunately these ideas are NOT new; the concept comes from academia. Researchers at Stanford and UIUC rekindled an idea that was first talked about almost a decade ago…
They have been used extensively in other domains like chip design for instance.
1) These chips with billions of transistors are actually more complex than the networks we build. Yet when you send a chip out for fabrication, it comes back and actually works most of the time, because chip designers have a set of tools built around formal methods. When a designer builds an adder, a multiplier, or a data pipeline in a high-level language like Verilog, he can check that given any 64-bit inputs the adder will always compute A + B correctly, without having to feed every possible input stream into a simulation, which would be computationally prohibitive. You can check that this adder, this finite state machine, will not go into some funky state under any input, and that it will always complete its operation within one clock cycle. And when the system translates that adder into the physical design, the gates and metal lines, it can check that the two are exactly equivalent, giving you confidence that the chip will come back and work correctly.
2) The same holds in the software world: developers have a whole set of tools for dynamic testing, memory profiling, checking pointers, checking variables, and so on. As a result, they're able to catch 99% of issues before the code is put into production, and then use monitoring tools like AppDynamics to catch production traffic-related issues. We don't have that luxury in networks: unfortunately, we've been asked to always make changes in production with no tools, and to ensure zero downtime. Kind of impossible, but that's what formal methods help you assure, and what Candid is bringing to this market…
With formal methods you can assure intent
Let’s double-click to see how it works.
1. Starting from the left: what data do we collect? Candid goes to every leaf and every spine in the network and collects all the configurations and control-plane state, data-plane state, and even hardware state like TCAM tables, VLAN tables, etc. From the controller we pick up the entire policy and configs, a representation of the intent. In addition, we have the implicit intent based on expected network behavior.
2. With all this we now build the comprehensive network model – underlay, overlay, and tenancy layers.
3. Against this model we run checks based on 30+ years of Cisco operational domain experience. These checks are based on three things: i) our expertise in how networks and our hardware should correctly operate: there should be no routing loops, no overlapping subnets in a VRF, no duplicate IPs, and so on; ii) best design practices that we learn from our AS teams: if you want a subnet to talk externally, what are all the BD and L3Out configs required, or all the access policies required to correctly deploy an EPG; iii) finally, our TAC cases: the 10% of failure scenarios that cause 90% of failures in the field. We bring this collective knowledge to all our customers.
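As a minimal sketch of what one such correctness check might look like (the VRF name and subnets are invented for illustration), here is an overlapping-subnet check using Python's standard ipaddress module:

```python
# Minimal sketch of an "overlapping subnets in a VRF" check.
import ipaddress
from itertools import combinations

def overlapping_subnets(vrf_subnets):
    """Return pairs of configured subnets in one VRF that overlap."""
    nets = [ipaddress.ip_network(s) for s in vrf_subnets]
    return [(str(a), str(b)) for a, b in combinations(nets, 2)
            if a.overlaps(b)]

# Hypothetical VRF configuration: the /24 sits inside the /16.
vrf_prod = ["10.1.0.0/16", "10.1.5.0/24", "192.168.0.0/24"]
print(overlapping_subnets(vrf_prod))  # -> [('10.1.0.0/16', '10.1.5.0/24')]
```

Real checks run against the full network model, of course, but each one reduces to a computation of this flavor over collected state.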
Every 15 minutes or so, the engine builds the most real-time model of the network and runs these checks against that model, like an intelligent robot watching your back, always checking the network for correctness.
The first story is about a lurking human error in the config space. A heavy equipment manufacturer in the US, with over 100 leafs across 2 production fabrics and a mainframe device in a DR data center. An innocent configuration error by an operator: there was no contract to the WAN interface, which basically means traffic arriving from or destined to the mainframe subnet would be dropped. In a DR, this would prevent applications from failing over from the production to the DR data center. A single error in tens of thousands of lines of configuration. [This is a company that counts thousands of dollars per minute of downtime for some of their applications…] This was a potential multi-million-dollar outage in case of a fail-over event, which we were able to avoid proactively.
The second example relates to analysis of network-wide dynamic state. This was a government organisation in Europe. Users there were experiencing intermittent Skype traffic, with intermittent VoIP and video communication. They eventually created a major ticket and were troubleshooting for days, at which point they brought in Candid to look at their network. Literally in 15 minutes, we found that they had a contract between 2 VRFs leaking subnets that happened to overlap, leading to this issue. This is a classic example: had Candid been in their production network ahead of time, we'd have caught the issue the moment the contract was created, avoiding days of downtime.
The third one shows the true power of the formal modeling approach. This was a European service provider with a multi-tenant network. Over the last couple of years they had huge policy sprawl, with 100K+ security policies. They reached a point where 20% of their leafs were running at max TCAM capacity, and they were unable to reliably push any configs or policies to the network. We got pulled in at that point. Literally within a few hours of analysis, Candid was able to identify that 20% of their policies were redundant: duplicate intent, basically opening the same ports in multiple policies. Further, by looking at hit counters, we were able to get granular insight into another 50% of policies that had never been used, giving them the visibility to have a conversation with their security teams about tightening their security aperture and optimizing TCAM utilization.
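The redundancy analysis in this story can be sketched in a few lines. This is a hypothetical illustration (the policy names, flow tuples, and hit counts are invented): two policies are duplicates if they open the same flow, and hit counters separate policies that actually match traffic from ones that never have.

```python
# Hypothetical sketch of redundant- and unused-policy detection.
from collections import defaultdict

policies = [
    {"name": "p1", "flow": ("web", "app", 443), "hits": 120},
    {"name": "p2", "flow": ("web", "app", 443), "hits": 0},   # duplicate of p1
    {"name": "p3", "flow": ("app", "db", 5432), "hits": 0},   # never matched
]

# Group policies by the flow they permit; any group > 1 is duplicate intent.
by_flow = defaultdict(list)
for p in policies:
    by_flow[p["flow"]].append(p["name"])

duplicates = {flow: names for flow, names in by_flow.items() if len(names) > 1}
unused = [p["name"] for p in policies if p["hits"] == 0]

print(duplicates)  # -> {('web', 'app', 443): ['p1', 'p2']}
print(unused)      # -> ['p2', 'p3']
```

In practice, rules cover ranges and priorities rather than exact tuples, so real redundancy analysis needs the header-space reasoning of the formal model; but even this crude grouping shows why duplicate intent translates directly into wasted TCAM entries.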
Narrative: Discuss smart events, and drilling down into human-readable suggested next steps. The "Assurance Engine" talks to you…