Resilience Engineering - Step by Step

Phil Venables
Jul 15, 2023

Resilience Engineering: Concepts and Precepts is an excellent collection of standalone essays, woven into a consistent whole on the subject of resilience, complexity, safety and systems / control theory.

Rather than review this book conventionally I’m going to pick a few of the topics from each chapter that stand out as the most useful, that are in some way memorable, or particularly applicable to cyber-resilience.

Chapter 1: Resilience - The Challenge of the Unstable

A useful definition of resilience is the ability of a system or an organization to react to and recover from disturbances at an early stage with minimal effect on its dynamic stability.

Chapter 2: Essential Characteristics of Resilience

The key characteristics, or properties, of resilience are:

Buffering capacity: the size or kinds of disruptions the system can absorb or adapt to without a fundamental breakdown in performance or the system's structure.
Flexibility vs. stiffness: the system’s ability to restructure itself in response to external changes or pressure.
Margin: how close or how precarious the system is currently operating relative to one or another kind of performance boundary.
Tolerance: how a system behaves near a boundary - whether the system gracefully degrades as stress/pressure increases or collapses quickly when pressure exceeds its adaptive capacity.

An interesting insight, perhaps obvious but often not discussed, is that conflicting goals harm resilience. The insight is not to develop new techniques to deal with such conflicts but to deal with the inherent tension in conflicted goals and let more natural resilience return. In fact, dealing with the tension irrespective of the outcome should be rewarded. For example, in a production environment calling a halt when there is a safety issue needs to be rewarded even if it is, in hindsight, a false alarm.

Chapter 3: Defining Resilience

A useful reminder of the utility of the bow-tie model which I think many organizations use instinctively rather than formulaically.

Chapter 4: Complexity, Emergence, Resilience

A tour of some basic complexity science and a reminder that scale-free architectures are robust to random failures but not to attacks targeted at key nodes. Given the propensity of many real-world networks to be scale-free, this focuses our attention on creating the properties from which other types of resilience emerge.
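
The random-versus-targeted contrast can be sketched with a small simulation. This is a minimal illustration and not something taken from the book: it grows a network by preferential attachment (a common way to produce an approximately scale-free graph), then compares the largest connected component after removing 15% of nodes at random with what is left after removing the 15% best-connected nodes. The function names, network size, removal fraction and seed are all assumptions chosen for illustration.

```python
import random
from collections import deque


def preferential_attachment_graph(n, m, rng):
    """Grow a graph where each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    endpoints = []  # each node appears once per edge it participates in
    # Seed the growth with a small fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            endpoints += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))  # degree-biased pick
        for target in chosen:
            adj[new].add(target)
            adj[target].add(new)
            endpoints += [new, target]
    return adj


def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


rng = random.Random(7)  # fixed seed so the run is repeatable
graph = preferential_attachment_graph(n=500, m=2, rng=rng)
k = 75  # remove 15% of nodes in each scenario

random_removed = set(rng.sample(sorted(graph), k))
hubs = set(sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k])

random_size = largest_component(graph, random_removed)
targeted_size = largest_component(graph, hubs)
print("largest component after random failures: ", random_size)
print("largest component after targeted removal:", targeted_size)
```

The same number of nodes is lost in both scenarios; only the choice of which nodes differs, which is the point of the chapter's warning about key nodes.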

Chapter 5: A Typology of Resilience Situations

Identifies the need to be aware of the nature of different types of threats / events and the different approaches to mitigate and to prepare to respond to those.

Situation 1: The Regular Threat (e.g. road accidents)
Situation 2: The Irregular Threat (e.g. Apollo 13)
Situation 3: The Unexampled Event (e.g. 9/11)

Even in the case of situation 3 where, by definition, there can be no specific preparedness, there still needs to be a developed organizational capability so there is capacity and muscle memory to be adaptable in the face of the unpredictable. There are additional insights for many of these types of situation:

Resilience is the ability to prevent something bad from happening
Or the ability to prevent something bad from becoming worse
Or the ability to recover from something bad once it has happened

Above all we need to be attuned to faint signals that might indicate an unexampled event, and to have the requisite imagination to contemplate the capacity needed for it, no matter how unimagined.

Chapter 6: Incidents - Markers of Resilience or Brittleness?

In looking at adaptive capacity and brittleness the question arises whether the past cases of successful response to disruptions are actually stories of resilience that are predictive of future success if new disruptions occur?

In other words, how much is past documented success more about luck? I’m often reminded of incident analysis where the threshold for consideration is a monetary loss value e.g. $10,000 but often the loss amount is context, not event, specific. Under different conditions the loss could be immensely more. For example, a commercial bank has a systems outage for 30 mins during mid-morning on a quiet Tuesday in summer vs. the same outage at 15 minutes to payments cut-off deadline on a busy Friday at quarter-end after a major stock index rebalance. Same event - different losses. Incidentally, this is also why it’s important to look at unexpected gains as well as losses.

This all points to a vital part of assessing a system’s resilience: finding whether the system knows if it is operating near boundary conditions. So-called decompensation occurs in 2 phases, where the first phase of routine response can mask the pressure building up that could result in a catastrophic break:

Automated loops (system or people) compensate for a growing disturbance. The successful compensation handles this.
The second phase occurs because the automated response cannot compensate for the disturbance indefinitely. Then sudden collapse.

The question is whether the “supervisory controller” (people or systems) can detect the developing problem during the first phase. Can they detect that the lowest-order control loop is working harder and harder to compensate, and getting nearer to its capacity limit, as the external challenge persists or grows?

The critical information is not the events themselves but the “force” with which they must be resisted relative to the capabilities of the base control system. In our world of security, for example, do you know for your various layers of defense in depth what the pressure on each individual control is? Can you proactively deal with the pressure when the system or people are having to overcompensate?
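
To make the two phases concrete, here is a minimal sketch under stated assumptions: a disturbance that grows linearly, a base control loop that can compensate only up to a fixed capacity, and a supervisor that watches the loop's effort as a percentage of that capacity rather than the system's output. The `simulate` function, the numbers and the 80% warning threshold are hypothetical and not from the book.

```python
def simulate(capacity, growth_per_step, steps, warn_pct=80):
    """Return (first step the supervisor warns, first step the output degrades).

    All quantities are integers in arbitrary units to keep the arithmetic exact.
    """
    warn_step = None
    fail_step = None
    for step in range(steps):
        disturbance = growth_per_step * step
        # Phase 1: the base loop absorbs the disturbance, up to its limit.
        effort = min(disturbance, capacity)
        # Phase 2 begins when something is left over that reaches the output.
        residual = disturbance - effort
        utilization_pct = 100 * effort // capacity
        if warn_step is None and utilization_pct >= warn_pct:
            warn_step = step
        if fail_step is None and residual > 0:
            fail_step = step
    return warn_step, fail_step


warn_step, fail_step = simulate(capacity=100, growth_per_step=5, steps=40)
print("supervisor warns at step:", warn_step)
print("output first degrades at step:", fail_step)
```

With these assumed numbers the supervisor warns at step 16, five steps before the output first degrades at step 21. A monitor that only looked at the output would see nothing unusual until step 21.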

Chapter 7: Resilience Engineering - Confused Consensus

Following from the issue of decompensation is the notion of drifting into failure (there’s a whole different great book on this topic):

Detecting drift into failure that happens to seemingly safe systems before breakdowns occur is a major role of resilience engineering. We have to get smarter at predicting the next accident - the drift into safety margins during normal work.

Echoing the discussion of resilience being driven by conflicts, this is where less visible conflict can cue this drift. Drift can be the small sacrificing decisions accumulating over time. Connecting micro-level managerial decisions to macro level trade-offs is vital - but hard. Each micro-level decision to make a trade-off lower down the management chain may be valid in itself, and be a natural response to the macro-level constraints directed by more senior decision makers. That does not mean those decision makers foresaw, let alone wanted, the drift resulting from those decisions.

In many respects, drift is a function of how operations are imagined to be vs. their reality. To counter this requires a culture where management at all levels are open to more than just reassuring news. It also means it’s important to keep discussions of risk alive even when everything looks safe. Be on the lookout for any operational activities that “borrow from safety”.

I worked in one organization a while ago where many of the senior technology leadership (management and specialists) were long-tenured and had risen up through the ranks over decades. This was a tremendously positive aspect for risk management due to experience, expertise and an inherent desire to protect their legacy. However, there was an inherent downside in that their formative experiences of “how things are done” were of a different era of technology and scale and it took various approaches to counter the resistance to deal with issues of reality that didn’t conform to their outdated mental model of how things worked. So, a necessary prerequisite for dealing with resiliency issues is to constantly work to close that gap of reality vs. what is imagined at various levels of leadership.

Chapter 8: Engineering Resilience into Safety Critical Systems

This chapter covers STAMP (Systems-Theoretic Accident Modeling and Processes). Systems are viewed in STAMP as interrelated components that are in a state of dynamic equilibrium by feedback loops of information and control. Thus safety (or risk reduction) is an emergent system property. It deals with many of the following elements to drive such emergence:

Needed constraints (e.g. power is never on when the control box is open).
Required hierarchical levels of control.
Specified process models.
Enforcing a downward constraints channel - constraints applied at multiple levels.
Upward measurement channel - making sure measurement at all levels reaches the top.
Implementing models and simulation, thereby identifying discrepancies between models (human/mental, system) and the system being controlled to force constant recalibration of expectation vs. reality, which as we covered in the prior chapter can be so foundational for resilience.
Looking for where issues exist in the boundary or overlap where 2 or more controllers (human or system) govern the same process.

Chapter 9: Is Resilience Really Necessary? The Case of Railways

An interesting case study of bringing the prior chapters’ concepts together on a railway transport safety example - to illustrate problems:

No formal risk, threat and safety models.
Extreme centralization of leadership and little edge (peer to peer) communications.
Defenses eroding under production pressure. The organization cannot respond flexibly to (rapidly) changing demands and is not able to cope with unexpected situations.
Past good performance taken as a reason for future confidence (complacency) about risk control.
Fragmented problem solving masks the big picture - no shared risk picture between multiple levels of the organization.
Failure to revise risk assessments appropriately as new evidence accumulates.

Chapter 10: Structure for the Management of Weak Signals

A vital lesson is to ensure weak and diffuse issues are heard and analyzed so that early warnings, or the accumulated pressure of small signals, are correctly aggregated and dealt with. It is difficult to get people to report seemingly small issues in group settings, so there are case study examples (Swedish nuclear power plants) where the meeting culture shifts so that everyone has to report something, guaranteeing that even minor things are discussed.
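
One way to picture the aggregation step is a simple tally of minor reports by topic, flagging any topic whose accumulated count crosses a threshold. This is only a sketch: the `aggregate_weak_signals` function, the example topics and the threshold of 3 are hypothetical, and a real process would weigh far more than raw counts.

```python
from collections import Counter


def aggregate_weak_signals(reports, threshold):
    """Return the topics whose accumulated minor reports reach the threshold.

    `reports` is a list of (topic, note) pairs, one per small issue raised.
    """
    counts = Counter(topic for topic, _note in reports)
    return sorted(topic for topic, n in counts.items() if n >= threshold)


# Hypothetical minor issues raised across several meetings.
reports = [
    ("badge reader", "door held open for a delivery"),
    ("backup job", "ran 20 minutes long"),
    ("badge reader", "reader rebooted twice"),
    ("patching", "one server skipped a cycle"),
    ("badge reader", "visitor log incomplete"),
    ("backup job", "ran 35 minutes long"),
]

flagged = aggregate_weak_signals(reports, threshold=3)
print(flagged)
```

No single report here would justify attention on its own; it is the accumulation on one topic that surfaces it.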

Chapter 11: Organizational Resilience and Industrial Risk

For resilience to emerge in systems it is essential that the organization that designs and operates those systems is itself organizationally resilient, meaning:

There is strong coordination of processes by routinizing procedures in operations and organizational systems.
Increase reliability by removing unnecessary variance of individual skills to ensure the substitutability of different people through standardized selection and training to provide some degree of adaptive human capacity.
Ensuring through supervision, inspection, auditing etc. that standardization of the work process does control the actual flow of work.
Automation of routine or complex functions.

Of course, it’s worth remembering that the degree of such organizational (human structure) resilience is highly dependent on the nature of the systems the organization develops or operates. Knowledge work through to manufacturing may have wildly different approaches.

Chapter 12: Safety Management in Airlines

Airline safety, particularly in large commercial airlines, is an exemplar of the culture and process of managing safety and many resilience properties. There are key elements to why this is so:

Ensuring there are air safety reports that are sufficiently flexible to accommodate different approaches to dealing with specific issues:
- Air Traffic Incident (e.g. air traffic control)
- Technical Incident (e.g. cracks, corrosion, fires, false fire warnings, engine problems, systems failure)
- Ground incident (e.g. collision with cars, terrain)
- Operational incident (e.g. crew member issues, rejected take off, exceeding operational limit, error in load / fuel, low fuel in flight, deviation from course/height, wrong landing)
Then making sure that reports and safety information are protected and used appropriately to assure trust in the safety system overall whether it’s for airline safety information sharing or safety organization publication. Such sources and reports include:
- Air safety reports (from pilots) → risk level and action
- Investigations of accidents and incidents (team of investigators) → report and actions
- Flight data monitoring (automatic) → screened by systems and exceptions reviewed
- Cockpit voice recorder (automatic) → accident investigation
- Quality audits (qualified auditors) → recommendations and mandatory actions
- Airline data share (automatic) → interrogated when analysis needed
- Safety organization publications (publication) → browsed and stored

Chapter 13: Taking Things in One’s Stride

This chapter discusses the various aspects of human performance in emergency situations such as:

Handling soft emergencies e.g. hospital ER capacity issues in normal operation such as creating a “bump room” to deal with triage overflow.
Handling the extreme e.g. a terrorist event like a bus bombing.
Explicitly planning for the return to normal operations which, if it isn’t well managed, can create “echoes” of resiliency disturbance, even from minor perturbations, during that return to normal operation.

On this last point, I think we’ve all likely witnessed poorly executed changes during recovery efforts after a major event that collapsed the recovery or created equally serious but unrelated events. I have vivid memories of capacity issues in bringing back failed networks, where failing to stage the recovery would unintentionally DDoS key services like DHCP that weren’t designed to be hit all at once by entire buildings coming back online.
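
Staging a recovery of this kind can be sketched as grouping sites into waves so that no wave exceeds what the shared service can absorb at once. This is a minimal illustration under assumptions: the `plan_waves` function, the building names, the client counts and the per-wave capacity are all hypothetical, and a real plan would also account for retries, lease timers and dependencies between services.

```python
def plan_waves(clients_per_building, capacity_per_wave):
    """Group buildings into recovery waves, in the order given, so that the
    clients brought back in any one wave never exceed capacity_per_wave."""
    waves = []
    current = []
    load = 0
    for building, clients in clients_per_building.items():
        if clients > capacity_per_wave:
            raise ValueError(f"{building} alone exceeds the per-wave capacity")
        if load + clients > capacity_per_wave:
            # Close the current wave and start a new one.
            waves.append(current)
            current = []
            load = 0
        current.append(building)
        load += clients
    if current:
        waves.append(current)
    return waves


# Hypothetical client counts per building and service capacity per wave.
buildings = {"A": 400, "B": 300, "C": 500, "D": 200}
waves = plan_waves(buildings, capacity_per_wave=800)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Bringing all four buildings back together would present 1,400 clients at once; in this example the staged plan keeps each wave at 700.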

Chapter 14: Erosion of Managerial Resilience: Vasa to NASA

There are some great examples of the erosion over time of management resilience - that is various levels of leadership being worn down over time in the face of top down pressure that results in poor decisions or an abandonment of the right safety or security culture. Examples of this include:

The Swedish warship Vasa which sank on its maiden voyage due to a catastrophically poor design that stemmed from top management meddling (the King) and lack of managerial resilience to resist this.
The NASA Challenger disaster where a lead engineer as part of launch authorization was explicitly told to: “take off his engineering hat and put on his managerial hat” as a means to exert top down pressure to go for launch despite prior judgements to the contrary.
The Piper Alpha oil rig disaster where a lack of assertiveness from managers on adjacent rigs resulted in their being overruled by onshore supervisors preventing them reducing oil flow from their platforms which was driving pressure on the rig at risk.

There are many techniques to bolster leadership resilience at all levels to handle such pressure:

Split risk and safety roles from production line management decisions (even if those roles need to be embedded for culture and expertise alignment) such that it is enshrined that in a debate between risk/safety and production goals then risk/safety always wins by default.
Detect operational drift toward a safety boundary through many of the modeling and other techniques we’ve discussed above.
Assertiveness training for managers and a reinforcement of organizational culture training for safety and security - this can include scripts for managers to use in specific contexts e.g. a risk manager saying “it is my role to take the side of risk reduction, I have to be stringent - we can externalize the risk trade off to some other group of leaders to have an even handed debate but I can’t always internalize the trade off as an individual.”
Enactment of “whistle blowing” laws to act as a deterrent for irresponsible behavior.

I’ve witnessed first hand many situations where risk and security managers have had to have the moral courage and the tenacity to stand up for the right outcome in the face of pressure. My favorite example many years ago was in a disaster recovery situation of an inbound hurricane. The context was that, in the prior year, we’d gone into full disaster mode for a hurricane that in the end didn’t actually land in the worst case scenario and so all that triggering of response (expensive and inconvenient) was seen as a “waste” in various people’s minds. The following year there was another hurricane that required advance preparation to trigger the right response (>5 days) and there was a leadership push to: “not overreact, like last time”. To the credit of the disaster recovery team, and ultimately of leadership, two things occurred. The disaster recovery team pointed out matter-of-factly that this particular hurricane checked all of the worst-case scenario indicators such as projected problematic land-fall, time of year, full moon high tide and so on. But the real clincher was a simple reminder to leadership that “we’ve drilled for this, it’s go time.” This stiffened resolve. The rest is (good) history.

Chapter 15: Learning How to Create Resilience in Systems

This chapter is a fantastic list of good practices for creating resilience in complex adaptive systems:

Use of systems thinking approaches, in particular using control theory to develop a control model to identify key positive and negative feedback loops that contribute to resilience.
Modeling and looking for transition and recovery points over various states of the system from: healthy state, unhealthy state, and catastrophic state. Key metrics that signal transition of states from healthy to unhealthy to catastrophic and scenario modeling to see what moves those metrics.
Considering a business system as a dynamic open system with “wholes” organized at multiple levels. Each whole is defined as the sum of interactions with other wholes. In other words there is no one “whole” system, rather there are multiple levels of fractal-like abstraction. Business resilience is maintaining overall health despite sub-wholes being subject to various events.
The desired behaviors of the system stem from goals, policies, standards, processes and procedures. Goals can be in conflict.

Chapter 16: Safety & Resilience : Agonistic or Antagonistic?

Forcing a system to adopt the safety standards of the best performers is not only a naive requirement but could easily result in accelerating the collapse of the system.

Chapter 17: Properties of Resilient Organizations

A further exploration of earlier topics on organizational/cultural properties that aid resilience:

Top level commitment.
“Just” culture - fair and psychologically safe.
Learning culture.
Awareness and curiosity.
Preparedness.
Flexibility.
Opacity (knows how close it is to the edge of its “boundaries”) – collective mindfulness.

Fundamentally the ultimate indicator of organizational behavior that results in resilience is how much ownership of risk, safety and security issues and concerns are distributed throughout an organization at all levels.

Chapter 18: Auditing Resilience

Auditing for resilience, or more precisely for the presence of behaviors and capabilities that are likely to lead to the emergent property of resilience, includes examining:

Risk identification and selection.
Monitoring, feedback, learning and change management.
Procedures, rules, goals.
Availability, manpower planning.
Competence, suitability.
Commitment, conflict resolution.
Communication coordination.
Design to installation.
Inspection to repair.

This includes the assessment of resilience criteria such as:

Do defenses erode under pressure?
Is past good performance taken as a reason for future confidence (complacency) about risk control?
Does fragmented problem-solving cloud the big picture - no shared risk picture?
Is there a failure to revise risk assessments appropriately as new evidence accumulates?
Are there breakdowns at organization boundaries that impede communication and coordination?
Can the organization respond flexibly to (rapidly) changing demands, and is it able to cope with unexpected situations?
Is there a high enough devotion to safety above or alongside other system goals?
Is safety built-in inherently in the system and the way it operates by default?

Chapter 19: How to Design a Safety Organization

Key indicators for leadership to monitor their levels of resilience capability could be:

How regularly do they revise and reframe the organization’s assessment of the risks it faces and the effectiveness of its countermeasures against those risks as new evidence accumulates?
Detect when safety margins are eroding over time - monitor operating points relative to boundaries - in particular monitor the organization’s model of itself - the risk that the organization is operating closer to safety boundaries than it realizes.
Monitor risk continuously throughout the life cycle of a system so as to maintain a dynamic balance between safety and the often considerable pressures to meet production and efficiency goals.

As a result the goal of the risk (or safety / security) team becomes:

Independent voice to challenge conventional assumptions within senior management.
Constructive involvement in everyday decision-making (e.g. standards, waivers, readiness reviews, anomaly definition).
Generate information about how the organization is actually operating and the vectors of change that affect how it will operate.
Information about weaknesses in the organization and weaknesses of its adaptability.

The key point of all of this is to operate with suitable constructive tension.

Chapter 20: Distancing Through Differencing

“The future seems implausible; the past seems incredible.” Woods and Cook (2002)

It is vital to continuously learn from events, near miss incidents and accidents. Barriers to learning include:

Negative consequences for people reporting or doing the analysis.
When failure generates pressure to resolve the situation (if you know about it you have to do something) and so people are reluctant to open up the "can of worms".
Financial responsibility for ameliorating the consequences and losses from the failure.
Desire for retribution.
Confronting dissonance, changing concepts and ways of acting are painful and costly in non-economic senses.

Chapter 21: States of Resilience

As we near the end of the book there is a great summary of the various states of resilience:

Reduced functioning (irregular).
Disturbed functioning.
Repair.
Normal functioning.
Reduced functioning (regular).

Bottom line: safety and security are things an organization does rather than has. Distinguish outcomes vs. leading indicators: an organization that appears safe might just be lucky and is really a time-bomb of control breakage pressure building up. An organization might have incidents but fewer than it would without its safety or security processes given its situation or inherent risk. So, often we can only measure the potential for resilience not resilience itself. The required qualities of a resilient system are simply:

Anticipation : knowing what to expect (knowledge)
Attention : knowing what to look for (competence)
Response : knowing what to do (rational response and resources)
Learning and updating processes which surround all of this.

You need a constant sense of unease.
