Resilience is about Capabilities not Plans - Updated
Over the past 2 years, since I wrote the first version of this post, we’ve had a lot of opportunity to test our collective resilience. Resilience in the face of a global pandemic and the under- and over-reactions to it, which in turn had knock-on effects that we also had to be resilient to. Resilience in the face of weather and seismic events, kinetic and cyber conflict, supply chain impacts, economic challenges, and increasing levels of disruptive and destructive crime. We will continue to be tested, and who knows what the coming months and quarters will bring.
Despite some bumps along the way I think it’s fair to say the world has shown a lot of resilience in the face of all this. Governments, organizations, and individuals have coped remarkably well. All have adapted, and much of that was done not according to a plan but through inherent adaptability, using capabilities built and established over time. For example, I know many organizations that moved confidently into the pandemic response of remote work not because they had an explicit documented plan to do so but rather because they had invested in the pervasive capability to support 100% continuous remote work for their workforce and had regularly tested that capability. This is just one example of how a portfolio of capabilities, assembled as needed in response to any event and backed by the organizational muscle memory to be flexible, beats relying on arbitrarily comprehensive binders full of detailed plans and procedures for specific events. Such operational resilience is vital. Let’s take a step back and look at why the right focus for resilience is capabilities, not plans.
Resilience can be thought of as the ability to absorb shocks, adjust as needed and continue operation in the face of adversity. In other words, to meet your obligations no matter what is thrown at you - perhaps with some graceful degradation of specific service levels. It is not simply the ability to deflect, avoid or prevent events. Events in this context can be across all business and technology risk domains - whether they are slow or fast moving - from cyber to pandemics. One of the common mistakes many organizations make is to think that resilience can be obtained by simply writing down comprehensive plans and procedures on what to do and how to respond to specific events. When someone thinks of a new event or scenario then a new plan is written and carefully filed away in a Big Book of Plans. Eventually there is a whole shelf full (or virtual equivalent) of these things. Sometimes plans are even tested to see if they actually work. There are three major problems with this approach when facing the reality of actual events:
In an actual crisis situation, adrenaline-fueled people are unlikely to take the time to consult large manuals to tell them what to do.
Most crises or significant events are unique, and even if you consulted the plans it would take a lot of effort to contort them to fit the specific situation you are facing.
Not all plans can be tested frequently and so the underlying means (people, process, technology) of implementing actions in those plans may not have been sufficiently maintained and may only be seen to be deficient when most needed.
The answer to these problems is deceptively simple but profoundly effective: focus on capabilities, not plans. Established capabilities are combined and used at a time of need by a trained workforce to deal with whatever event is thrown at them. Capabilities are constantly maintained and tested independently of crisis / event drills. Drills then focus on building crisis response muscle memory across the organization. I have seen many organizations shift to this approach and they are immensely more resilient for it - and those that operated in this way before the past 2 years of resiliency challenges have coped far better than competitors that hadn't. More specifically, (general) resilience comes from:
Baseline Capabilities. A set of people, process and technology capabilities that are maintained to defined service levels and continuously monitored as being able to meet those service levels. Examples: remote access services for your workforce able to support everyone connected simultaneously, dispersed physical offices and back-up sites, pre-negotiated contracts to expand office space or add new temporary locations, employee wellness / medical support, dispersed technology delivery, tested burst capacity, distributed voice and video communications including the capability to be used on non-corporate devices in secure ways, and critical business operations pre-dispersed among disparate locations or regions.
Use the Capabilities. Run day to day business using these capabilities as much as you can, so that they are assured of correct operation. If you can't, then test them regularly to confirm that they meet defined service levels. Example: if your crisis communications technologies are not the same as the technologies people use every day then they are unlikely to be used successfully in a crisis. Instead, create inherently resilient / survivable communications approaches - and if you do need something totally different then use it regularly across your population, such as by holding staff meetings on the back-up communications system.
Capacity. Understand the capacity constraints of your capabilities and if you can't economically run with excess capacity then conduct regular testing of your ability to quickly ramp up.
Scenario Catalogs. Develop a scenario catalog that can be used to assess whether your capabilities have the means to respond to and operate well in such a scenario. Pick scenarios that exercise the whole spectrum of the risk distribution from expected to tail events. Looking at the extremes is useful to see where your capabilities break down and if that is appropriate for your risk appetite. Remember, scenarios aren't plans, and shouldn't become detailed plans.
Capability Testing. Separate out capability operational testing from crisis response drills. I've come across many large scale drills that have had issues because of failures in basic capabilities such as crisis communications technology (e.g. bridge lines, video conferencing), deficiencies in technology at back-up sites, lack of access to back-up sites, or revealed capacity constraints that have caused the drill to fail early. At one level these are still a success because the organization learnt and fixed these things, at another level it's a failure because they never got to really fulfill the intent of the crisis drill: to build muscle memory for adaptive response. Rather, make sure that all the capabilities that are needed for resilience have regular testing so that their failure never has to be revealed during a drill.
Micro-drills. The goal of drills is to build and constantly enhance the organizational muscle memory of how to respond to events or crises. You need to constantly drill / exercise but you can't do this if you only do massive ones - the sort you can perhaps only run a few times a year. You can increase the volume and frequency of drills using "micro-drills". These are small tests, typically less than 1 hour, involving subsets of the organization to assure response to various types of events or broader scenarios, for example: launching an executive crisis response call, ramping up DDoS capacity, coordinating a leadership meeting at short notice from a back-up location, walking the floor and asking people where they would work from if a crisis event were called, rotating people to work from home periodically, fully failing over to back-up systems in the event of any IT failure. In fact, getting "trigger happy" in invoking crisis responses to any and many events is a useful practice. If you find yourself wondering whether a situation is worthy of going into full response mode then occasionally do it no matter what, just to exercise your response and sustain your muscle memory for the real deal. If you can’t afford to get trigger happy then use that as a signal that you’re not resilient enough. Think of this like Chaos Monkey for people processes.
Blast radius. Minimize the blast radius of potential events and increase loose coupling of systems and processes (including those in your supply chain) such that response to any event is easier to deal with. Apply resilience engineering principles to your systems design, maintenance and operational processes - including chaos engineering techniques.
Look around corners. Broaden how you think about threat intelligence to include sourcing data about incidents and close-calls across the spectrum of risks from all types of organizations across all sectors. Use this feed to challenge assumptions and, with your scenario catalog, do the work to assess how well your capabilities would perform. Use these as sources for future drills. Think about the worst case by combining scenarios in more extreme ways: a WOW (Worst of the Worst) scenario planning exercise in which you assess how well you can be resilient in the face of several really bad scenarios happening at once can really test your mettle.
Playbooks and checklists. Now, having said you should focus on capabilities not plans, you do need some operational documentation. However, this becomes much more abbreviated, taking the form of playbooks or checklists for the use of capabilities (e.g. how to activate a crisis call tree), time bound activities as people are forming a response (e.g. 8 things to do in the first 30 mins of a security incident), or trigger-based action plans (e.g. what to enact when W.H.O. declares a Phase 5 pandemic).
Establish effective crisis leadership structures. A large part of the success criteria for dealing with events that become serious is how leadership manages the response. This is as much about organization design and culture as it is about the inherent qualities of particular leaders. Having separate but highly linked response forums / calls for executives (enterprise crisis management) and operators/engineers (incident response teams) is critical to ensure people remain focused. How many times have you been on an incident response call when numerous Senior VPs or C-suite join at random times and ask for an immediate update? This can derail the response process. Instead, there should be rehearsed communication protocols, prepared responses (think of this as a communications toolkit capability) and designated "runners" to bridge different forums. Throughout a drill or an actual event response, constantly ask whether the team is working effectively, remembering that sometimes your best crisis leaders are not those in positions of utmost authority in regular situations.
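To make the scenario catalog, capacity and capability testing points above a little more concrete, here is a minimal sketch in Python of assessing capabilities against a catalog. Everything in it is an assumption for illustration: the names Capability, Scenario and assess, the 90 day test-age threshold, and the example figures are hypothetical and not taken from any real tool or organization. The point is only that a scenario is expressed as demands on capabilities, not as a plan.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Capability:
    name: str
    tested_capacity: int   # load the capability has actually been tested to carry
    days_since_test: int   # how long ago that test was run


@dataclass
class Scenario:
    name: str
    demands: dict[str, int]  # capability name -> load the scenario would place on it


def assess(scenario: Scenario, capabilities: dict[str, Capability],
           max_test_age_days: int = 90) -> list[str]:
    """Return the gaps a scenario exposes; an empty list means no gap was found."""
    gaps = []
    for cap_name, load in scenario.demands.items():
        cap = capabilities.get(cap_name)
        if cap is None:
            gaps.append(f"{cap_name}: capability does not exist")
            continue
        if cap.tested_capacity < load:
            gaps.append(f"{cap_name}: tested for {cap.tested_capacity}, scenario needs {load}")
        if cap.days_since_test > max_test_age_days:
            gaps.append(f"{cap_name}: last tested {cap.days_since_test} days ago")
    return gaps


# Hypothetical baseline capabilities and a two-entry scenario catalog.
capabilities = {
    "remote_access": Capability("remote_access", tested_capacity=5000, days_since_test=30),
    "backup_site": Capability("backup_site", tested_capacity=400, days_since_test=200),
}
catalog = [
    Scenario("pandemic: full remote work", {"remote_access": 5000}),
    Scenario("regional outage plus office loss",
             {"remote_access": 3000, "backup_site": 600, "crisis_comms": 50}),
]

report = {s.name: assess(s, capabilities) for s in catalog}
for name, gaps in report.items():
    print(name, "->", gaps if gaps else "no gaps found")
```

With these made-up numbers the first scenario shows no gaps, while the second exposes a missing capability, a capacity shortfall and a stale test. Whether each gap matters is then a risk appetite decision, and the fix is to build or test the capability, not to write a plan for that scenario.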
Bottom line: the most resilient organizations are so because they have a tremendous set of base capabilities (people, process and technology) already established, they have sustained organizational muscle memory to arrange (and constantly rearrange) those capabilities in response to a developing situation and the culture to constantly adjust both of those - quickly.