- Phil Venables
Resilience is about Capabilities not Plans
Resilience can be thought of as the ability to absorb shocks, adjust as needed and continue operation in the face of adversity. In other words, to meet your obligations no matter what is thrown at you - perhaps with some graceful degradation of specific service levels. It is not simply the ability to deflect, avoid or prevent events. Events in this context can be across all business and technology risk domains - whether they are slow or fast moving - from cyber to pandemics.
One of the common mistakes many organizations make is to think that resilience can be obtained by simply writing down comprehensive plans and procedures on what to do and how to respond to specific events. When someone thinks of a new event or scenario then a new plan is written and carefully filed away in a Big Book of Plans. Eventually there is a whole shelf full (or virtual equivalent) of these things. Sometimes plans are even tested to see if they actually work.
There are three major problems with this when facing the reality of actual events:
In an actual crisis situation, adrenaline-fueled people are unlikely to take the time to consult large manuals to tell them what to do.
Most crises or significant events are unique, and even if you consulted the plans it would take a lot of effort to contort them to fit the specific situation you are facing.
Not all plans can be tested frequently and so the underlying means (people, process, technology) of implementing actions in those plans may not have been sufficiently maintained and may only be seen to be deficient when most needed.
The answer to these problems is deceptively simple but profoundly effective: focus on capabilities, not plans. Established capabilities are combined / utilized at a time of need by a trained workforce to deal with whatever event is thrown at them. Capabilities are constantly maintained and tested independently of crisis / event drills. Drills then focus on building crisis response muscle memory across the organization. I have seen many organizations shift to this approach and they are more resilient for it.
More specifically (general) resilience comes from:
Baseline Capabilities. A set of people, process and technology capabilities that are maintained to defined service levels and continuously monitored as being able to meet those service levels. Examples: remote access services for your workforce able to support everyone connected simultaneously, dispersed physical offices and back-up sites, pre-negotiated contracts to expand office space or add new temporary locations, employee wellness / medical support, dispersed technology delivery, tested burst capacity, distributed voice and video communications including capability to be used on non-corporate devices, critical business operations pre-dispersed among disparate locations or regions.
Use the Capabilities. Run day to day business using these capabilities as much as you can, so that they are assured of correct operation. If you can't, then test them regularly to confirm that they meet defined service levels. Example: if your crisis communications technologies are not the same as the technologies people use every day then they are unlikely to be used successfully in a crisis. Instead, create inherently resilient / survivable communications approaches - and if you do need something totally different then use it regularly across your population, such as by holding staff meetings on the back-up communications system.
Capacity. Understand the capacity constraints of your capabilities and if you can't economically run with excess capacity then conduct regular testing of your ability to quickly ramp up.
Scenario Catalogs. Develop a scenario catalog that can be used to assess whether your capabilities have the means to respond to and operate well in such a scenario. Pick scenarios that exercise the whole spectrum of the risk distribution from expected to tail events. Looking at the extremes is useful to see where your capabilities break down and if that is appropriate for your risk appetite. Remember, scenarios aren't plans, and shouldn't become detailed plans.
Capability Testing. Separate out capability operational testing from crisis response drills. I've come across many large scale drills that have had issues because of failures in basic capabilities such as crisis communications technology (e.g. bridge lines), deficiencies in technology at back-up sites, lack of access to back-up sites, or revealed capacity constraints that have caused the drill to fail early. At one level these are still a success because the organization learnt and fixed these things, at another level it's a failure because they never got to really fulfill the intent of the crisis drill: to build muscle memory for adaptive response. Rather, make sure that all the capabilities that are needed for resilience have regular testing so that their failure never has to be revealed during a drill.
Micro-drills. The goal of drills is to build and constantly enhance the organization's muscle memory of how to respond to events or crises. You need to constantly drill / exercise but you can't do this if you only do massive drills - the sort you can perhaps only do a few times a year. You can increase the volume and frequency of drills using "micro-drills". These are small tests, typically less than 1 hour, involving subsets of the organization to assure response to various types of events or broader scenarios, for example: launching an executive crisis response call, ramping up DDoS capacity, coordinating a leadership meeting at short notice from a back-up location, walking the floor and asking people where they would work from if a crisis event were called, rotating people to work from home periodically, fully failing over to back-up systems in the event of any IT failure. In fact, getting "trigger happy" in invoking crisis responses to any and many events is a useful practice. If you find yourself wondering whether a situation is worthy of going into full response mode then occasionally do it no matter what, just to exercise your response and sustain your muscle memory for the real deal. If you can’t "afford" to get trigger happy then use that as a signal that you’re not resilient enough. Think of this as Chaos Monkey for people processes.
Blast radius. Minimize the blast radius of potential events and increase loose coupling of systems and processes (including those in your supply chain) such that response to any event is easier to deal with. Apply resilience engineering principles to your systems design, maintenance and operational processes - including chaos engineering techniques.
Look around corners. Broaden how you think about threat intelligence to include sourcing data about incidents and close-calls across the spectrum of risks from all types of organizations across all sectors. Use this feed to challenge assumptions and, with your scenario catalog, do the work to assess how well your capabilities would perform. Use these as sources for future drills. Think about the worst case by combining scenarios in more extreme ways: a WOW (Worst of the Worst) scenario planning exercise in which you assess how well you can be resilient in the face of several really bad scenarios happening at once can really test your mettle.
Playbooks and checklists. Now, having said you should focus on capabilities not plans, you do need some operational documentation. However, this becomes much more abbreviated, taking the form of playbooks or checklists for the use of capabilities (e.g. how to activate a crisis call tree), time bound activities as people are forming a response (e.g. 8 things to do in the first 30 minutes of a security incident), or trigger-based action plans (e.g. what to enact when W.H.O. declares a Phase 5 pandemic).
Establish effective crisis leadership structures. A large part of the success criteria for dealing with events that become serious is how leadership manages the response. This is as much about organization design and culture as about the inherent qualities of particular leaders. Having separate but highly linked response forums / calls for executives (enterprise crisis management) and operators/engineers (incident response teams) is critical to ensure people remain focused. How many times have you been on an incident response call when numerous Senior VPs or C-suite join at random times and ask for an immediate update? This can derail the response process. Instead, there should be rehearsed communication protocols, prepared responses (think of this as a communications toolkit capability) and designated "runners" to bridge different forums. Throughout a drill or an actual event response constantly ask: "are we an effective team?", remembering that sometimes your best crisis leaders are not those in positions of utmost authority in regular situations.
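For teams that track this in tooling, the relationship between baseline capabilities and a scenario catalog described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names, capability names and numbers are all hypothetical, and a real inventory would hold far more than a single measured value per capability. The point is that a scenario states what it demands of capabilities, not what steps to follow, and the assessment shows where capabilities would break down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """A maintained capability with a defined service level (hypothetical model)."""
    name: str
    service_level: float  # committed level, e.g. simultaneous remote users
    last_tested: float    # level demonstrated in the most recent routine test

    def meets_service_level(self) -> bool:
        return self.last_tested >= self.service_level


@dataclass(frozen=True)
class Scenario:
    """A catalog entry: what the scenario demands, not a plan for handling it."""
    name: str
    demands: dict  # capability name -> level the scenario would require


def assess(scenario, capabilities):
    """Return the gaps a scenario exposes, as capability name -> reason."""
    gaps = {}
    for cap_name, required in scenario.demands.items():
        cap = capabilities.get(cap_name)
        if cap is None:
            gaps[cap_name] = "no such capability"
        elif cap.last_tested < required:
            gaps[cap_name] = f"tested {cap.last_tested:g}, scenario needs {required:g}"
    return gaps


# Illustrative data only; none of these figures come from a real organization.
capabilities = {
    "remote_access": Capability("remote_access", 10000, 12000),
    "backup_site_seats": Capability("backup_site_seats", 500, 400),
}
catalog = [
    Scenario("pandemic", {"remote_access": 11000, "video_conferencing": 5000}),
    Scenario("primary_office_outage", {"backup_site_seats": 450}),
]

if __name__ == "__main__":
    for scenario in catalog:
        print(scenario.name, assess(scenario, capabilities))
```

In this sketch the back-up site already fails its own service level in routine testing, which is exactly the kind of finding that should surface from capability testing rather than in the middle of a drill.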
Finally, thinking about now, some organizations have coped well in the recent and ongoing Covid-19 situation not because they had a pandemic plan fully worked out (because I bet you that plan, no matter how good, didn’t perfectly describe all aspects of how Covid-19 has developed). Rather, they continue to do well because they had a tremendous set of base capabilities (people, process and technology) already established, they had sustained organizational muscle memory to arrange (and constantly rearrange) those capabilities in response to a developing situation and the culture to constantly adjust both of those - quickly.