Force 5 : Complex Systems break in Unpredictable Ways
Central Idea: While component level simplicity is vital, seeking to eliminate complexity at a systems level is often futile. Develop techniques to manage, or even harness, system-wide complexity.
Continuing our theme of exploring the 6 fundamental forces that shape information security risk, we will now look at Force 5: Complex Systems break in Unpredictable Ways. As we did in the last post, we can move from treating the symptoms to getting to grips with the underlying force itself.
First, a reminder of how we state Force 5: Complex Systems break in Unpredictable Ways - simple systems that work may still collectively fail when composed together - and will often do so in unpredictable and volatile ways.
This phenomenon is well studied in systems-thinking and control theory literature. John Gall’s Systemantics (The Systems Bible) is a great summary of this, applied to human and societal “systems” as well as technical systems.
In my view this is one of the essential books for risk and cybersecurity professionals. Some of my favorite snippets from the book are:
A large system, produced by expanding the dimensions of a smaller system, does not behave like the smaller system.
Systems tend to malfunction conspicuously just after their greatest triumph.
A temporary patch will likely be permanent.
People in systems do not do what the system says they are doing.
The system itself does not do what it says it is doing.
Any large system is going to be operating most of the time in failure mode.
In setting up a new system, tread softly. You may be disturbing another system that is actually working.
The meaning of a communication is the behavior that results.
In a closed system, information tends to decrease and hallucination tends to increase.
A system that ignores feedback has already begun the process of terminal instability.
If your problem seems unsolvable, consider that you may have a meta-problem.
In order to remain unchanged, the system must change.
Cherish your bugs. Study them.
Treating Symptoms - dealing with what the force does
One of the many pieces of accepted wisdom in information security and cybersecurity is that complexity is the enemy of security. The logical next step would then be to declare all complexity bad and adopt a fanatical approach to eliminating it. But is all complexity bad? You certainly should not deliberately introduce complexity in the name of security, although I’m sure we have all seen approaches and products that seem to do exactly that. At some levels of abstraction, for example at the component level, it’s crucial to strive for simplicity, but at the systems level it can be futile. In other words, even if we do assert that complexity is the enemy of security, simplicity cannot be the answer, for the basic reason that all systems must be complex to be useful. It is just a question of where the complexity is.
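Tesler’s so-called Law of the Conservation of Complexity sums this up nicely: every application has an inherent amount of complexity that cannot be removed, only moved from one place to another. The law was originally directed at UX matters, but I think it holds elsewhere.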
In reasoning about complexity and security we also have to deal with the fact that security is not something you can necessarily design into complex systems. Rather, security is an emergent property of a complex system. You don’t do security. You do things to get security.
Even what we might describe as a simple system can still be quite hard to secure. Simple systems can be just as confusing and hard to manage as complex ones if they are not designed well. This becomes more of an issue with scale since, unless we are extraordinarily careful, issues in complex systems grow as (Simple ^ N) rather than (Simple x N). Most of the time when we encounter security issues and then blame the complexity of the system, environment, or ecosystem we are simply being lazy. In my experience, when you examine such issues with some intellectual honesty you often find not a problem with the complexity itself but rather a faulty design that has failed to deal with that complexity or has created confusion in the human interaction with that system.
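To make the (Simple ^ N) vs. (Simple x N) point concrete, here is a minimal sketch. It assumes, purely for illustration, that every component has the same small number of states and that components either never interact or interact freely; the function names are hypothetical.

```python
def additive_states(states_per_component: int, n: int) -> int:
    """States to reason about if components never interact (Simple x N)."""
    return states_per_component * n


def combined_states(states_per_component: int, n: int) -> int:
    """Joint states if any component can affect any other (Simple ^ N)."""
    return states_per_component ** n


# Compare the two growth rates for components with 3 states each.
for n in (1, 5, 10, 20):
    print(n, additive_states(3, n), combined_states(3, n))
```

Real systems sit somewhere between these two extremes, and good design is largely about keeping them closer to the first.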
There’s little need to enumerate every tactic that environments use to deal with complexity today, but in short they are:
Wishing it away - futile efforts to remove it that actually, as per Tesler’s Law, just move it somewhere else. This somewhere else may just hide it or, worse, put it in a place where it festers and cannot be managed.
Applying brute force methods to deal with the side-effects of complexity such as constant fault enumeration or unnecessarily complex / excessive defense in depth.
Applying excess system-wide defenses or constraints that limit the usefulness of a system because of the fear of what might happen.
Treating Causes - dealing with the force itself
Remember, our re-frame is that complexity is not the enemy of security. Bad design is. In doing this, let’s look at some examples of good design principles that I think are most useful:
Abstraction. Abstract complexity away in APIs and other interfaces. Then, think about security policy objectives and see if they require you to consider and enforce controls at multiple levels of abstraction. If they do, then the abstraction is not designed well enough. For example, network or application segmentation is a vital security principle, but it is hard work because the tools most often used operate at one level of abstraction while the policy models and the operational reality needed to make it work without breaking your business exist at other levels. To illustrate, look at this highly stylized map of the layers of the financial sector and think about a business segmentation objective and where and how you would actually implement it. If you're implementing a business process control segmentation objective at the network level then it's going to get messy really quickly.
Linked Behaviors. Establish linking conditions across elements of the system or across layers of abstraction. Setting one control objective shouldn't imply the need to set additional controls in multiple places by other elements of the control plane or by additional human toil. For example, if you have a goal to have one protected entry point into a system then the control plane should automatically enforce that goal across all elements of the stack whether it is ingress controllers, load balancers, perimeter front end gateways, service meshes or distributed firewalls. This is hard, very hard, and I don’t think any environment does this well. However, some major cloud providers are getting ever closer to this.
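As a rough sketch of what a linked behavior could look like, the snippet below derives per-layer rules from one stated objective, so that nobody has to configure each layer by hand. Everything here is an assumption made for illustration: the class, the function, the layer names (borrowed from the list above), and the rule format do not correspond to any real control plane.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EntryPointObjective:
    """One control objective: all traffic to a service enters via one gateway."""
    service: str
    gateway: str


# Hypothetical layer names, taken from the examples in the text.
LAYERS = ("ingress_controller", "load_balancer", "service_mesh", "distributed_firewall")


def derive_layer_rules(objective: EntryPointObjective) -> Dict[str, Dict[str, str]]:
    """Derive one rule per layer from the single objective."""
    return {
        layer: {
            "service": objective.service,
            "allow_from": objective.gateway,
            "default": "deny",
        }
        for layer in LAYERS
    }


rules = derive_layer_rules(EntryPointObjective(service="payments", gateway="edge-gw"))
for layer, rule in rules.items():
    print(layer, rule)
```

The point of the design is that the objective is stated once; adding a new layer means extending the derivation, not asking every team to remember another setting.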
Opinionated Defaults. Improving the handling of system-wide control objectives to cause better security properties to emerge through linked behaviors can be made easier through comprehensive setting of control defaults. This could be everything from specific control measures like having encryption on, everywhere, by default through to ensuring systems and components are closed when instantiated and then have to be more loosely configured as needed. This is where the canon of wider security principles can be applied, from least privilege, fail closed, default deny, protocol least privilege / allow-listing and so on.
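A minimal sketch of opinionated defaults follows, assuming a hypothetical storage resource whose configuration starts closed and must be loosened explicitly. The class and field names are invented for illustration and do not mirror any particular cloud API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class StorageBucketConfig:
    """Hypothetical resource config where every default is the closed choice."""
    name: str
    encrypted_at_rest: bool = True                     # encryption on by default
    public_access: bool = False                        # default deny
    allowed_principals: FrozenSet[str] = frozenset()   # nobody until explicitly granted

    def can_read(self, principal: str) -> bool:
        """Access requires public access or an explicit allow-list entry."""
        return self.public_access or principal in self.allowed_principals


default_bucket = StorageBucketConfig(name="reports")
opened_bucket = StorageBucketConfig(name="reports", allowed_principals=frozenset({"analyst"}))

print(default_bucket.can_read("analyst"))
print(opened_bucket.can_read("analyst"))
```

Someone who supplies only a name still gets an encrypted, non-public resource with an empty allow-list, so the insecure configuration is the one that takes extra effort.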
Declarative Configuration. Define a configuration of the system in a declarative manner so that the actual configuration of the system can be generated and continuously compared to the specification. Treat policy / controls as a lifecycle managed part of the configuration. If you have controls as code then you need to inspect the ability of that control configuration to match intended policy, not just inspect the instantiated environment that results from the declarative configuration. Cloud or on-premise cloud-like modern IT environments have this at their core and as a result deal with complexity much better.
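The continuous comparison of specification and reality described above can be sketched as a simple drift check. This is an illustrative toy under stated assumptions: the setting names are made up, and a real environment would compare far richer state than a flat dictionary.

```python
from typing import Any, Dict, Tuple


def diff_config(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every setting that drifts.

    A setting missing on either side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared specification and observed environment.
desired_state = {"tls_min_version": "1.2", "public_ingress": False, "audit_logging": True}
actual_state = {"tls_min_version": "1.2", "public_ingress": True, "debug_port": 8080}

drift = diff_config(desired_state, actual_state)
for key, (want, have) in drift.items():
    print(f"{key}: desired={want!r} actual={have!r}")
```

Note that this only checks the environment against the specification. Checking that the specification itself matches the intended policy is a separate review step.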
Idempotence. A by-product of a declarative approach to configuration, although by no means guaranteed, is the property of idempotence. That is, if you push something to happen, such as enforcing a particular control, then no matter how many times you push, it won’t have any effect other than sustaining the control. This is an important design property for applying declarative configurations in complex environments, as you simply assert the outcome and let the control plane worry about state.
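Here is a small sketch of an idempotent control, assuming a hypothetical in-memory firewall state. The function asserts the desired outcome rather than performing a step, so pushing it repeatedly changes nothing after the first time.

```python
from typing import Dict, Set


def enforce_deny_rule(firewall: Dict[str, Set[int]], host: str, port: int) -> bool:
    """Ensure `port` is blocked on `host`. Returns True only if state changed."""
    blocked = firewall.setdefault(host, set())
    if port in blocked:
        return False  # already enforced, so nothing to do
    blocked.add(port)
    return True


firewall_state: Dict[str, Set[int]] = {}
first = enforce_deny_rule(firewall_state, "db-1", 23)   # applies the control
second = enforce_deny_rule(firewall_state, "db-1", 23)  # sustains it, no other effect
print(first, second, firewall_state)
```

Contrast this with an imperative "append a rule" operation, which would leave duplicate rules behind every time it was retried.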
Visualization and Great UX. People think visually, and no matter how good we get at being comfortable that our goals have been expressed in the configuration specification we can still find flaws in our specification by diagrammatically representing that. Overlay visual design cues to highlight where there might be trouble, for example, single points of failure or composition of services in ways that fail to achieve an overall SLO. The tooling we provide engineers and end users should be designed to be intuitive and provide immediate feedback as to whether the intent of an action was in fact performed.
Observability and Feedback Loops. Observability of the behavior of the system overall is critical, but it is only useful if the data from observation is put into feedback loops - either positive or negative - to correctly amplify or dampen behaviors. To reiterate what we started with, security is an emergent property of a complex system. One of the ways to drive that emergence is by taking action as a result of feedback loops.
Reduce Error Messages and Guidance. Work to eliminate error messages, user guides and other configuration or set-up guidance for systems. Yes, this is an extreme statement but it should be aimed for. Instead of creating better error messages work to reduce the scope of errors needing messaging. Similarly, if your user and configuration guides keep getting bigger and more numerous then ask yourself: are you making the right design choices on defaults, linked behaviors, and levels of abstraction?
System-wide Invariants : People, Process and Technology. Like defaults, and other good elements of design, it is valuable to set system wide invariants - properties you want to be held true - and then build processes to enforce them. For example, if you want no single points of failure then build processes to find them and the feedback loops to eliminate them, and the design review practices to discourage the design patterns that lead to them. Additionally, work hard to not experience broken processes.
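As one possible sketch of an automated invariant check, the snippet below flags single points of failure in a service inventory. It assumes, for illustration only, that the inventory is a mapping of service name to replica count and that fewer than two replicas counts as a violation; real dependency analysis would be more involved.

```python
from typing import Dict, List


def single_points_of_failure(replicas: Dict[str, int]) -> List[str]:
    """Return the services that violate the 'no single point of failure' invariant."""
    return sorted(name for name, count in replicas.items() if count < 2)


# Hypothetical inventory of services and their replica counts.
inventory = {"api": 3, "auth": 1, "queue": 2, "ledger-db": 1}

violations = single_points_of_failure(inventory)
print(violations)
```

A check like this is only the technology part. The findings still need a feedback loop that gets them fixed and design reviews that stop new violations from appearing.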
Desire Lines: Principle of Least Action. One unifying theme here, borrowed from physics, is that where we see return on investment, our work aligns with the principle of least action. If our controls and approach can align with the natural "happy path" for our employees and our customers then that will likely provide the maximum returns over time. A related concept is desire lines: if you consistently see an actual or attempted engineer or end user behavior (particularly in a complex environment) then you should identify that path and make it a secure one. This is like the, perhaps apocryphal, stories of only paving public walkways after the patterns of walking (and wearing away grass) have been observed in practice.
Chaos Engineering. Creating random failures in a complex environment teaches engineers and other parts of the system to build for assumed failure. In an environment of large enough scale you likely don’t need to introduce such failures because they’ll happen anyway. The best design is not to try to avoid them (although you should not let per unit failure exceed efficiency expectations) but, rather, to make resilience a system goal, not a component goal.
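The idea of resilience as a system goal can be illustrated with a toy failure-injection simulation. This is a sketch under strong assumptions: replicas fail independently with a fixed probability, a service is down only when all of its replicas are down, and all names and numbers are hypothetical.

```python
import random
from typing import Dict


def inject_failures(replicas: Dict[str, int], failure_rate: float,
                    rng: random.Random) -> Dict[str, int]:
    """Fail each replica independently; return surviving replicas per service."""
    return {
        service: sum(1 for _ in range(count) if rng.random() >= failure_rate)
        for service, count in replicas.items()
    }


def outage_fraction(replicas: Dict[str, int], failure_rate: float,
                    trials: int, seed: int = 0) -> float:
    """Fraction of trials in which at least one service lost every replica."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    outages = 0
    for _ in range(trials):
        survivors = inject_failures(replicas, failure_rate, rng)
        if any(alive == 0 for alive in survivors.values()):
            outages += 1
    return outages / trials


fragile = outage_fraction({"api": 1, "db": 1}, failure_rate=0.2, trials=2000)
resilient = outage_fraction({"api": 3, "db": 3}, failure_rate=0.2, trials=2000)
print(fragile, resilient)
```

The per-unit failure rate is identical in both runs. Only the system design differs, which is where the resilience comes from.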
5Ys and Blame-free Post-mortems. The final aspect of good design is the ultimate feedback loop of looking for the root cause of the root cause (the system cause) when an event or close-call has occurred. Follow the 5Ys approach: keep asking "why?" until you can get no further. Doing this well needs the psychological safety of blame-free post-mortems. It is also important to remember that human error is not an explanation but rather something to be explained. Most of the time when I've seen claims of human error as an incident cause I have, when digging deeper, been amazed at how well the humans have actually been performing in the face of bad design to keep the number of incidents down.
Bottom line: complexity is not the enemy of security. Bad design is. All useful things are complex at some level of abstraction. To wish the complexity away is to fail to apply good design to deal with it.