Controls - Updated
I wrote the first version of this post nearly 3 years ago. It is interesting that since then much of it remains true. Oddly, it also still seems surprising to people that security breaches are often the result of unintended control lapses rather than innovative attacks or risk blind spots.
There are some notable exceptions, of course, especially exploits of zero-day vulnerabilities and whole classes of attacks where the control flow logic of applications or API calls is manipulated.
But the vast majority of attacks that are well known (or that I have otherwise become aware of) still share a common pattern: they are not the result of some awesome attacker capability to exploit a hitherto unknown vulnerability, or to realize a risk from some combination of control weaknesses nobody had contemplated. Rather, the control or controls that would have stopped the attack (or otherwise detected/contained it) were thought to be present and operational but for some reason were actually not - just when they were most needed.
There are many reasons for these failures to sustain the correct implementation of important controls, for example: a bad change, an operational error, a new implementation that was incomplete, other faults/breakage, some wider change that nullified the in-situ control, an ongoing lack of configuration discipline and many other operational factors. In fact, any issue that can drive a system error can also negate or disable a control.
The incident/after action reviews no doubt have the same moments of people exclaiming: “But didn’t we have a [ process | tool | component ] to stop that happening?” Sometimes the exclamation is: “But didn’t our [ Board | Risk Committee | CISO ] mandate that this be fixed some time ago?”
Incidentally, we talk about attacks but a similar pattern exists for control failures that lead to other types of incidents across the full spectrum of enterprise risk domains. I’ve seen plenty of examples of runaway system reboots, access revocations, duplicated transactions or errant algorithms where the circuit-breaker or other control harness (designed to be the independent safety check) failed because it was not tested regularly enough.
So, what to do? Treat controls as first-class objects, like other parts of a system's function.
Catalog controls. Build a catalog of key controls using a well-formed ontology (the overall FAIR controls ontology is very good). OSCAL and an evolving ecosystem of continuous controls validation are also making great progress.
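To make the idea of a catalog concrete, here is a minimal sketch in Python. It is not the FAIR ontology or an OSCAL model: the field names, the three coarse control functions and the example entries are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ControlFunction(Enum):
    # Hypothetical, deliberately coarse categories; a real ontology is richer.
    PREVENT = "prevent"
    DETECT = "detect"
    RESPOND = "respond"


@dataclass(frozen=True)
class Control:
    control_id: str                # stable identifier shared by tests, monitors and incidents
    name: str
    function: ControlFunction
    owner: str                     # team accountable for keeping the control working
    evidence_sources: tuple = ()   # data feeds that show the control is operating


class ControlCatalog:
    """In-memory catalog; a real one would be persisted and versioned."""

    def __init__(self):
        self._controls = {}

    def register(self, control):
        # Refuse duplicates so an identifier always means exactly one control.
        if control.control_id in self._controls:
            raise ValueError(f"duplicate control id: {control.control_id}")
        self._controls[control.control_id] = control

    def get(self, control_id):
        return self._controls[control_id]

    def by_function(self, function):
        return [c for c in self._controls.values() if c.function is function]


# Illustrative entries only; the identifiers and owners are made up.
catalog = ControlCatalog()
catalog.register(Control("NET-IDS-01", "Network intrusion detection",
                         ControlFunction.DETECT, "network-security",
                         ("ids_event_stream",)))
catalog.register(Control("IAM-REV-02", "Leaver access revocation",
                         ControlFunction.PREVENT, "identity"))
```

The point of the stable control_id is that everything else (tests, monitors, incident records) can refer back to the same catalog entry.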
Control design reviews. Conduct independent assurance / design reviews for key controls. These don't have to be fully independent, but they should at least be a peer review in whatever development methodology / style you operate.
Controls as code. Treat controls (especially security controls) as automation / code. Build tests / coverage for control correctness as you would with other code.
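As a small illustration of testing a control for correctness, the sketch below treats an egress allow-list as code and tests the cases where such a control typically goes wrong. The function name, the allow-list contents and the test names are hypothetical; the tests are plain functions in the style a runner such as pytest would collect.

```python
def is_egress_allowed(host: str, allowed_domains: frozenset) -> bool:
    """Return True only if host is an allowed domain or a subdomain of one."""
    normalized = host.strip().lower().rstrip(".")
    if not normalized:
        return False
    # Match on a label boundary: a naive endswith(domain) would also
    # accept look-alikes such as "evilexample.com".
    return any(
        normalized == domain or normalized.endswith("." + domain)
        for domain in allowed_domains
    )


# Hypothetical allow-list used only by the tests below.
ALLOWED = frozenset({"example.com", "payments.example.net"})


def test_allows_listed_domains_and_subdomains():
    assert is_egress_allowed("example.com", ALLOWED)
    assert is_egress_allowed("api.example.com", ALLOWED)
    assert is_egress_allowed("API.Example.COM.", ALLOWED)  # case and trailing dot


def test_rejects_lookalike_domains():
    assert not is_egress_allowed("evilexample.com", ALLOWED)
    assert not is_egress_allowed("example.com.attacker.net", ALLOWED)
    assert not is_egress_allowed("example.net", ALLOWED)  # parent of a listed subdomain


def test_rejects_empty_input():
    assert not is_egress_allowed("", ALLOWED)
    assert not is_egress_allowed("   ", ALLOWED)
```

The negative tests matter most here: a control that allows what it should allow but fails to reject a near miss is exactly the kind of quiet failure that tends to go unnoticed.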
Build-time tests. Test for the presence and integration of controls at build time in integration / deploy suites. Different styles of test will be needed depending on the nature of the controls (component, software, hardware, etc.).
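One possible shape for a build-time presence check is sketched below. It assumes a deployment manifest already parsed into a dictionary; the manifest layout, the control names and the function names are invented for this example and do not correspond to any particular deployment tool.

```python
# Hypothetical baseline: controls every deployed service must have enabled.
REQUIRED_CONTROLS = frozenset({"tls_termination", "authn_middleware", "audit_logging"})


def missing_controls(manifest: dict) -> dict:
    """Map each service to the sorted list of required controls it lacks."""
    gaps = {}
    for service, spec in manifest.get("services", {}).items():
        enabled = {
            name
            for name, config in spec.get("controls", {}).items()
            if config.get("enabled") is True  # declared but disabled counts as missing
        }
        missing = sorted(REQUIRED_CONTROLS - enabled)
        if missing:
            gaps[service] = missing
    return gaps


def enforce_controls(manifest: dict) -> None:
    """Fail the build or deploy step if any service is missing a required control."""
    gaps = missing_controls(manifest)
    if gaps:
        raise RuntimeError(f"required controls missing: {gaps}")


# Illustrative manifest: 'payments' declares audit logging but has it switched off.
example_manifest = {
    "services": {
        "orders": {"controls": {
            "tls_termination": {"enabled": True},
            "authn_middleware": {"enabled": True},
            "audit_logging": {"enabled": True},
        }},
        "payments": {"controls": {
            "tls_termination": {"enabled": True},
            "authn_middleware": {"enabled": True},
            "audit_logging": {"enabled": False},
        }},
    }
}
```

Calling enforce_controls(example_manifest) in a deploy suite would stop the release, because the 'payments' service has audit logging disabled. A check like this only covers declared configuration; hardware or component controls would need a different style of test.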
Continuous control monitoring. Perform continuous control monitoring to assure the continued correct operation of controls at run time and the sustained completeness of their deployment. Minimize the time between a potential control failure and the detection/correction of that failure. This can be done by collecting data from the control’s operation, but failures can also be detected by injecting synthetic events to test the control's operation and its liveness / effectiveness.
I remember, a number of years ago, a team of mine in a prior organization stumbled across some network intrusion detection sensors that had been reading “zero”, i.e. no events, for a few days. They correctly concluded that it was odd for there to be no events, not even false positives (which were pretty common for N-IDS in those days). It turned out all the devices, which were connected to network switch span ports, had been isolated by an ineffectively tested network change. After that we developed (and patented) a technology (internally called “Phantom Recon”) to inject synthetic events into controls to test that they were always working. This pre-dated so-called attack simulation technologies.
As an aside, this was in part inspired by the real-world situation of radiation-hardened environments, which are designed to stop radiation from escaping. This hardening also blocks background radiation, so radiation detectors always read zero and hence you don’t know if they’re working. So some contain small (harmless) radiation sources to make the sensors always read something, so that a casual glance can spot a sensor reading zero and hence observe that it is broken.
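The two detection approaches described above (noticing suspicious silence, and injecting a synthetic event to see whether the control reacts) can be sketched roughly as follows. This is not the patented technology mentioned above; every name here, including the FakeSensor stand-in, is hypothetical, and a real probe would need to wait or poll for the alert rather than check immediately.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Iterable


def is_suspiciously_silent(event_counts: Iterable[int], min_expected_total: int = 1) -> bool:
    """Flag a sensor whose total events over the window fall below what is normal."""
    return sum(event_counts) < min_expected_total


@dataclass(frozen=True)
class ProbeResult:
    control_id: str
    marker: str
    detected: bool


def probe_control(control_id: str,
                  inject: Callable[[str], None],
                  fetch_alerts: Callable[[], Iterable[str]]) -> ProbeResult:
    """Inject a uniquely marked synthetic event and check that the control reports it."""
    marker = f"synthetic-{uuid.uuid4()}"  # unique, so stale alerts cannot match
    inject(marker)
    detected = any(marker in alert for alert in fetch_alerts())
    return ProbeResult(control_id, marker, detected)


class FakeSensor:
    """Stand-in for a detection sensor; 'connected' mimics a working span port."""

    def __init__(self, connected: bool):
        self.connected = connected
        self.alerts = []

    def observe(self, payload: str) -> None:
        # An isolated sensor never sees traffic, so it raises no alerts at all.
        if self.connected:
            self.alerts.append(f"ALERT match={payload}")

    def recent_alerts(self):
        return list(self.alerts)


healthy = FakeSensor(connected=True)
isolated = FakeSensor(connected=False)
healthy_result = probe_control("NET-IDS-01", healthy.observe, healthy.recent_alerts)
isolated_result = probe_control("NET-IDS-02", isolated.observe, isolated.recent_alerts)
```

Here healthy_result.detected is True and isolated_result.detected is False, which is the signal that would otherwise have taken days of "zero" readings to notice. The silence check is the cheaper of the two but depends on knowing what a normal event rate looks like; the synthetic probe works even for controls that are legitimately quiet.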
Uninstrumented controls are bad controls. Declare any control that doesn’t prove amenable to such assurance or doesn’t emit the data needed for continuous monitoring to be a control in need of improvement / replacement. This is irrespective of whether, otherwise, it is an effective control.
Record control incidents. When a control (or instance of a control) is detected as having failed, declare a "control incident" and handle it as if a security incident has occurred (as it might well actually become if not attended to quickly enough).
Manage control incidents. Treat control incidents as first-class objects alongside security incidents (reporting, escalation, after action review, thematic analysis) whether or not the control incident actually resulted in a security incident. Consider close calls as well as actual incidents.
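A control incident record can be very simple. The sketch below assumes a handful of fields (all hypothetical) that are enough to measure how long a control was down before anyone noticed, to separate close calls from failures that led to a security incident, and to do a basic thematic count by cause. The example records are made up.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ControlIncident:
    control_id: str
    cause: str                  # e.g. "bad change", "operational error"
    failed_at: datetime         # best estimate of when the control stopped working
    detected_at: datetime
    led_to_security_incident: bool = False

    @property
    def exposure(self) -> timedelta:
        """Time the control was believed to be working but was not."""
        return self.detected_at - self.failed_at


def close_calls(incidents):
    """Control failures that were caught before causing a security incident."""
    return [i for i in incidents if not i.led_to_security_incident]


def themes(incidents) -> Counter:
    """Count control incidents by cause to support thematic analysis."""
    return Counter(i.cause for i in incidents)


# Illustrative records only.
incidents = [
    ControlIncident("NET-IDS-01", "bad change",
                    datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 4, 9, 0)),
    ControlIncident("IAM-REV-02", "operational error",
                    datetime(2024, 2, 10, 12, 0), datetime(2024, 2, 10, 18, 0),
                    led_to_security_incident=True),
    ControlIncident("NET-IDS-03", "bad change",
                    datetime(2024, 3, 5, 0, 0), datetime(2024, 3, 5, 1, 0)),
]
```

With records like these, the exposure of the first incident is three days, and themes(incidents) shows "bad change" as the most frequent cause, which is the kind of pattern an after action review would want to surface.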
Bottom line: many incidents are not due to a lack of conception of controls but due to failures of expected controls. Hence the need to conduct continuous control monitoring. Treat control incidents as first-class events like security incidents. Validate continuously.