
Security Leadership Master Class 6: When Disaster Strikes

  • Phil Venables

This is part 6 of a 7-part series that groups sets of prior posts into a particular theme.



No matter how good you, your team, or your wider organization are, you will always have to deal with incidents or even full-blown disasters. The true test is how you contain, respond, maintain operations (perhaps in a degraded state), recover, and then learn from the events. Doing this well, while also providing excellent customer support and taking care of your workforce, can often be a brand-enhancing event despite the initial impact.


The Importance of Capabilities, Not Just Plans

As with many things, responding to incidents is not purely a matter of natural skill. Rather, it is the muscle memory from drills and exercises applied in the use of established capabilities to respond to anything that is thrown at you, not just what might be explicitly documented in some plan. 


  • Resilience is defined by capabilities: Resilience is fundamentally about having established capabilities: the people, processes, and technology needed to absorb shocks, adjust as necessary, and continue operating in the face of adversity while still meeting obligations, even if through graceful degradation.

  • Detailed plans fail under pressure: In actual crisis situations, adrenaline-fueled personnel are highly unlikely to take the time to consult voluminous manuals to figure out what to do.

  • Crises are unique: Most significant events are unique, so contorting existing plans to fit the specific situation being faced is difficult and time-consuming.


  • Muscle memory beats documentation: When an organization is hit hard, success depends on falling back on muscle memory and adapting, which requires preparation beyond mere documentation. Instead of detailed plans, focus on equipping people with checklists and runbooks that aid that memory when executing specific tasks.

  • Capabilities are continuously maintained: Capabilities should be constantly maintained and tested independently from crisis drills.


How to Engineer Resilience Into a System

Resilience must be intentionally engineered into systems, treating it as an emergent property of complex designs that allow flexibility and adaptation. This involves identifying and addressing the boundary conditions of failure and prioritizing survivability.


  • Design for degraded states: Systems and business processes should be engineered to continue operating in some basic way, potentially with temporarily limited functions or deferred decisions, even when dependencies or surrounding services have failed.

  • Minimize blast radius and favor loose coupling: It is critical to minimize the impact scope (blast radius) of potential events and to loosen the coupling between systems and processes, including those in the supply chain, so that response efforts are easier to manage.

  • Focus on key resilience characteristics: Engineering efforts must maximize characteristics like buffering capacity (the size of disruptions the system can absorb), flexibility (the ability to restructure in response to pressure), margin (distance from performance boundaries), and tolerance (graceful degradation near a boundary).

  • Use control theory for system modeling: Implement systems thinking approaches, utilizing control theory to develop models that identify and manage key positive and negative feedback loops contributing to system resilience.

  • Enforce safety and constraints hierarchically: Ensure safety (and risk reduction) is an emergent system property by establishing needed constraints (e.g. power can never be on when the control box is open), defining hierarchical control levels, and building an upward measurement channel that ensures performance data reaches the top.
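
Two of the ideas above, designing for degraded states and enforcing a safety constraint, can be illustrated with a minimal Python sketch. Everything named here (`ControlBox`, `PaymentApprover`, `score_risk`, the 100.0 limit) is hypothetical and chosen only for illustration; it is not taken from the original posts. The first class makes "power can never be on when the control box is open" impossible to violate through its methods, and the second keeps serving a limited function and defers larger decisions when a dependency is unavailable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class InterlockError(Exception):
    """Raised when an operation would violate the safety constraint."""


@dataclass
class ControlBox:
    """Hypothetical device. Constraint: power is never on while the box is open."""
    is_open: bool = False
    power_on: bool = False

    def open_box(self) -> None:
        # Opening the box forces power off first, so the constraint always holds.
        self.power_on = False
        self.is_open = True

    def close_box(self) -> None:
        self.is_open = False

    def switch_on(self) -> None:
        if self.is_open:
            raise InterlockError("cannot power on while the control box is open")
        self.power_on = True


@dataclass
class PaymentApprover:
    """Hypothetical service that keeps operating when its risk scorer is down."""
    score_risk: Callable[[float], float]  # assumed to raise ConnectionError on outage
    small_amount_limit: float = 100.0     # illustrative threshold for degraded mode
    deferred: List[float] = field(default_factory=list)

    def approve(self, amount: float) -> str:
        try:
            return "approved" if self.score_risk(amount) < 0.5 else "declined"
        except ConnectionError:
            # Degraded state: limited function for small amounts,
            # deferred decision for everything else.
            if amount <= self.small_amount_limit:
                return "approved-degraded"
            self.deferred.append(amount)
            return "deferred"
```

In this sketch the deferred queue would be worked through once the dependency recovers; how large the degraded-mode limit should be is a business risk decision rather than an engineering one.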


Keeping the Organization Prepared and Building Muscle Memory

The ultimate goal of preparation and exercises is to instill organizational muscle memory. This is the ability to constantly arrange and rearrange capabilities in response to a developing situation. This requires frequent, focused practice that goes beyond annual, massive drills.


  • Separate drills from capability testing: Ensure that all foundational capabilities (e.g., crisis communications, back-up sites) are routinely tested to assure their operation, allowing large-scale crisis response drills to focus purely on building adaptive response muscle memory, rather than revealing basic capability failures.

  • Conduct frequent “micro-drills”: Increase the volume and frequency of practice using micro-drills. These are small tests, typically less than one hour, involving subsets of the organization to assure rapid response to various event types, such as launching an executive crisis response call or failing over to back-up systems.

  • Regularly use built-in capabilities: Run day-to-day business operations using resilient capabilities as much as possible, thereby assuring their correct operation; if specialized crisis technologies are necessary, use them regularly for non-crisis activities (e.g. holding regular staff meetings on the back-up communications system).

  • Establish rehearsed crisis leadership structures: Formalize and rehearse communication protocols within the crisis structure, ensuring separate but highly linked response forums for executives (enterprise crisis management) and operators/engineers (incident response teams) to prevent C-suite interruptions from derailing the technical response.


Learning from Events and Improving

Resilience requires a learning culture and a dedication to addressing the sometimes painful truths revealed by incidents, near-misses, and past failures.


  • Continuously learn from failure: It is vital to continuously learn from events, near-miss incidents, and accidents, integrating these lessons to constantly enhance organizational muscle memory.

  • Monitor organizational drift: Be vigilant for drift into failure. This is the slow erosion of safety margins caused by small, accumulating decisions that each sacrifice a little margin in response to macro-level constraints or production pressures.

  • Bust the "Thermocline of Truth": Leaders must deliberately create a culture, using techniques such as canary milestones or executive “pull questions”, that breaks through the barrier where managers shield higher executives from the problematic reality of projects and risks.

  • Build “Shrines of Failure”: Formalize learning by building a library of stories about past failures or even creating physical "Shrines of Failure" where artifacts and lessons learned are preserved, helping new hires and existing teams reflect on risk priorities.

  • Remove barriers to reporting: Actively counter organizational factors that inhibit learning, for example by generating information about how the organization is actually operating and by removing negative consequences for people who report issues.


_____________________________________


Ultimately, safety and security are things an organization does rather than merely has. By shifting focus from static plans to dynamic capabilities, engineering systems to absorb stress, and rigorously exercising response muscle memory, organizations can ensure they maintain core functions even in the face of extreme scenarios.


Here’s a short video (thanks to NotebookLM) covering all of this.




The blog posts used to build this video and summary are here:




© 2020 Philip Venables. 
