Fighting Security Entropy
Force 4: Entropy is King
Central Idea: Adopting a control reliability engineering mindset through continuous control monitoring is essential to counter the inevitable decay of control effectiveness.
Continuing our theme of exploring the 6 fundamental forces that shape information security risk, we will now look at Force 4: Entropy is King. As we did in the last post, we can move from treating the symptoms to getting to grips with the underlying force itself.
First, a reminder of how we state Force 4: Unchecked controls fail over time, untested resilience fades gradually and then suddenly. Constant counterbalance is needed. Everything degrades unless countered with a force to keep it in place. This is why continuous control monitoring and, in effect, “control reliability engineering” is so essential.
This post might seem familiar as it is basically covering ideas we’ve explored in prior posts on continuous control monitoring but it is worth a revisit.
I’ve spent a lot of time thinking about security entropy over the years and I’m still surprised that it is not more widely discussed. I’m also puzzled that people, admittedly outside of security, are not aware that many breaches are the result of unintended control lapses rather than innovative attacks or true risk blind spots. There are some notable exceptions, of course, especially exploits of zero-day vulnerabilities and whole classes of attacks where the control flow logic of application or API calls is manipulated.
So, the vast majority of attacks that are either well known (or that I have otherwise become aware of) still seem to share a common pattern: they are not the result of some awesome attacker capability to exploit a hitherto unknown vulnerability, or to realize a risk from some combination of control weaknesses not contemplated. Rather, the control or controls that would have stopped the attack (or otherwise detected/contained it) were thought to be present and operational but for some reason were actually not - just when they were most needed.
There are many reasons for these failures to sustain the correct implementation of important controls, for example: a bad change, operational error, a new implementation that was incomplete, other faults/breakage, some wider change that nullified the in-situ control, ongoing lack of configuration discipline and many other operational factors. In fact any issue that drives any type of system error can be an issue that negates or disables a control. The post mortems no doubt have the same moments of people exclaiming: “But didn’t we have a [ process | tool | component ] to stop that happening?” Also, sometimes the exclamation is: “But didn’t our [ Board | Risk Committee | CISO ] mandate that this be fixed some time ago?” The answers to which of course are some variant of “Yes, but it got undone as part of last quarter's system upgrade”, “The last patch from our operating system vendor undid a prior control”, or “This new system was rushed to launch and didn’t follow the regular review processes”.
We talk about attacks but a similar pattern exists for control failures that lead to other types of incidents across the full spectrum of enterprise risk domains. I’ve seen plenty of examples of runaway system reboots, access revocations, duplicated transactions or errant algorithms where the circuit-breaker or other control harness (designed to be the independent safety check) failed because it was not regularly probed and tested. So, what to do? Treat controls as first class objects like other parts of a system's function and obsess over countering every part of the decay that comes with the inevitable entropy of large scale distributed systems.
Treating Symptoms - dealing with what the force does
Detecting and responding to the symptoms of control failure mostly comes down to observing the consequent vulnerability or driving humans to look for such failures in manual or semi-automated ways. The key difference between treating symptoms and treating root causes is simple: handling the symptoms means observing the system state from the outside and hoping to spot lapses that can then be fixed. Getting to the root cause means making controls a core part of system design and then continuously validating their presence and effectiveness at every stage of system specification, design, build and ongoing maintenance.
Risk and Control Self Assessments. Periodic, human-driven risk identification, assessment of control completeness (does it mitigate the risks) and control effectiveness (is it working). In my experience this is partially useful but almost always fails to identify actual issues, as the people completing the assessments don’t have visibility into objective control performance measurements and so answer with intuition or partial data. I know some organizations, on their journey to continuous control monitoring, present objective measurement of controls during assessments and have seen, unsurprisingly, a significant rise in assessor-raised issues as assessors deal with the reality that confronts them.
Audits. Many control deficiencies (coverage or performance) are found during periodic audits. Some audit teams have instituted their own flavor of continuous auditing based on data collection but without a full operational cadence these are just a different flavor of manual assessment.
Penetration Testing. From a purely security perspective, a penetration test can reveal vulnerabilities that are the result of missing controls or lapses of expected controls.
Vulnerability Scanning. Similar to penetration testing, vulnerability scanning discovers apparent vulnerabilities through various techniques. In the form often called posture management or configuration validation, it can also identify policy-defined misconfigurations that are at least a proxy for control failures.
Red Teams. Red team exercises can not only identify all the elements from other forms of assessments, scanning or configuration validation but can additionally look for the seams between control structures that may identify deeper issues.
Treating Causes - dealing with the force itself
Catalog Controls. Build a catalog of key controls using a well-formed ontology - the overall FAIR controls ontology is very good. OSCAL and an evolving ecosystem of continuous controls validation are also making great progress.
Control Design Reviews. Conduct independent assurance / design reviews for key controls. This doesn't have to be fully independent - but at least a peer review in whatever development methodology / style you operate.
Controls as Code. Treat controls (especially security controls) as automation / code. Build tests / coverage for control correctness as you would with other code.
Build-time Tests. Test for the presence and integration of controls at build time in integration / deploy suites. Different styles of test will be needed depending on the nature of the controls (component, software, hardware etc.)
Software Defined Infrastructure. Adopt declarative system configurations as much as possible and embed controls at multiple layers in the declarative specification, so that the specification can be analyzed for control conformance and the deployment / orchestration system can help monitor the deployed reality for continued adherence.
Continuous Control Monitoring. Perform continuous control monitoring to assure the continued correct operation of controls at run time and assure sustained completeness of deployment. Minimize the time between a potential control failure and the detection/correction of that failure. This can be done by collecting data from the control’s operation, but a failure can also be detected by injecting synthetic events to test the control's operation and its liveness / effectiveness. I remember, a number of years ago, a team of mine in a prior organization stumbled across some network intrusion detection sensors that had been reading “zero”, i.e. no events, for a few days. They correctly concluded that it was odd for there to be no events at all, not even false positives (which were pretty common for N-IDS in those days). It turned out all the devices, which were connected to network switch span ports, had been isolated due to an ineffectively tested network change. After that we developed (and patented) a technology (internally called “Phantom Recon”) to inject synthetic events into controls to test they were always working. This pre-dated so-called attack simulation technologies. As an aside, this was in part inspired by the real-world situation of radiation hardened environments, which are designed to keep radiation from escaping. This hardening also keeps background radiation out, so radiation detectors inside always read zero and you don’t know if they’re working. So, some contain small (harmless) radiation sources to make the sensors always read something. A casual glance can then spot a sensor that is reading zero and conclude that it is broken.
Flag Uninstrumented Controls. Declare any control that doesn’t prove amenable to such assurance or doesn’t emit the data needed for continuous monitoring to be a control in need of improvement / replacement. This is irrespective of whether, otherwise, it is an effective control.
Record Control Incidents. When a control (or instance of a control) is detected as having failed then declare a "control incident" and handle as if a security incident has occurred (as it might well actually become if not attended to quickly enough).
Manage Control Incidents. Treat control incidents as first class objects alongside security incidents (reporting, escalation, after action review, thematic analysis) whether or not the control incident actually resulted in a security incident [consider close-calls as well as actual incidents].
Process Reengineering. When business processes are being adjusted there is an opportunity to improve a variety of end to end controls, not just those in the underlying technology implementation. Here is where strong partnership with business line operating risk, compliance or other teams involved in other risk mitigation activities is crucial, since there may be common controls to mitigate multiple risks.
Customer / Employee Experience Changes. A subset of process reengineering, but one worth highlighting, is to seize any opportunity where customer or user experiences are being adjusted. In a lot of organizations the limiting factor for security enhancements is the real (or perceived) impact on customer experience (employee experience and productivity impact is a significant matter as well). If such an experience is being reimagined then that is a great time to address necessary improvements around issues like authentication strength, anomaly detection, delegated authorization and credential resets.
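To make the Continuous Control Monitoring and Record Control Incidents items above a little more concrete, here is a minimal Python sketch of a liveness check for a detective control. It is an illustration only and not the patented technique mentioned earlier: the names FakeSensor, ControlIncident and check_control, the one hour silence threshold, and the assumption that detection is synchronous are all hypothetical. A real sensor would report asynchronously and the check would need a detection timeout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional, Set
import uuid


@dataclass
class ControlIncident:
    """A detected control failure, to be handled like a security incident."""
    control_id: str
    reason: str
    detected_at: datetime


@dataclass
class FakeSensor:
    """Hypothetical in-memory stand-in for a detective control (e.g. an IDS sensor)."""
    control_id: str
    connected: bool = True  # False models a sensor isolated by a bad change
    seen: Set[str] = field(default_factory=set)
    last_event_at: Optional[datetime] = None

    def inject(self, marker: str, now: datetime) -> None:
        # A harmless synthetic event only reaches the sensor if it is connected.
        if self.connected:
            self.seen.add(marker)
            self.last_event_at = now


def check_control(
    sensor: FakeSensor,
    now: datetime,
    max_silence: timedelta = timedelta(hours=1),  # assumed threshold
) -> Optional[ControlIncident]:
    """Return a ControlIncident if the control looks dead, otherwise None."""
    # Passive check: a sensor that has been silent for too long is suspicious,
    # even if nothing else appears to be wrong.
    if sensor.last_event_at is not None and now - sensor.last_event_at > max_silence:
        return ControlIncident(sensor.control_id, "no events within max_silence", now)

    # Active check: inject a uniquely tagged synthetic event and confirm the
    # control actually reports it (assumes synchronous detection).
    marker = f"canary-{uuid.uuid4()}"
    sensor.inject(marker, now)
    if marker not in sensor.seen:
        return ControlIncident(sensor.control_id, "synthetic event not detected", now)
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for sensor in (FakeSensor("nids-1"), FakeSensor("nids-2", connected=False)):
        incident = check_control(sensor, now)
        print(sensor.control_id, "OK" if incident is None else incident.reason)
```

The active check plays the role of the small radiation source in the aside above: a healthy control should never read zero, so silence becomes a signal rather than an ambiguity. Any returned ControlIncident would then feed the same reporting and escalation path as a security incident.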
Bottom line: many incidents are not due to a lack of conception of controls but due to failures of expected controls. Hence the need to conduct continuous control monitoring. Treat control incidents as first class events like security incidents. Validate continuously to keep entropy-driven decay at bay.