- Phil Venables
The Bank of England has recently released a sequence of consultation papers, after an earlier discussion paper, laying out a framework for operational resilience. In their own words: "Operational Resilience is the ability of firms and the financial system as a whole to absorb and adapt to shocks, rather than contribute to them."
Why should people outside of the UK financial sector care about any of this? Two reasons:
The concept will inevitably spread to other countries and then to other sectors - courtesy of supply chain risk management processes, customer demand, and regulatory and audit adoption.
The concepts are not only important to customers but are also a useful construct for risk managers, leadership and Boards of organizations.
The launch of this has been met with some eye-rolling in some quarters, with people essentially saying: “we already do this – it’s called business continuity planning”. I think this misses the point, and fails to see the potential benefits of the subtle and not-so-subtle approaches the framework introduces, which will help converge the management of an array of operational risks – not least cyber, IT and business process risk.
So, in very simple terms, what is operational resilience, and how does it differ from existing risk frameworks and approaches?
It takes a business service oriented view of resilience - as opposed to a business function (department / process) or IT-centric view. Some organizations already have this as part of their resilience program. However, some claim or think they do, but actually have a business function approach rather than a business service approach (the customer's perspective). When you look closely you often find a business service is, naturally, made up of multiple business functions - and resiliency planning for an end to end service can require trade-offs between functions.
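To make the service-vs-function distinction concrete, here is a minimal sketch of why an end-to-end service's recovery time is constrained by its slowest underlying function. The service, function names and RTO figures are all hypothetical illustrations, not values from the framework:

```python
from dataclasses import dataclass

@dataclass
class BusinessFunction:
    name: str
    rto_hours: float  # recovery time objective for this function alone

@dataclass
class BusinessService:
    name: str
    functions: list  # the business functions this end-to-end service depends on

    def effective_rto_hours(self) -> float:
        # The service is only restored once every function it depends on is
        # restored, so its effective RTO is driven by the slowest function.
        return max(f.rto_hours for f in self.functions)

# Hypothetical example: a payments service spanning three departments.
payments = BusinessService(
    "customer payments",
    [
        BusinessFunction("payment capture", 2.0),
        BusinessFunction("sanctions screening", 8.0),
        BusinessFunction("settlement", 4.0),
    ],
)
print(payments.effective_rto_hours())  # 8.0 - screening is the constraint
```

A function-centric program might report each of these departments as "within RTO" individually, while the customer-facing service still cannot meet a tighter end-to-end target - which is exactly the seam a service-oriented view exposes.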
It looks at the interplay of all operational risks across people, process and technology, and across dimensions of risk like info/cybersecurity, physical disasters, capacity risk, software lifecycle risk and so on. This involves examining their interplay in quantitative as well as qualitative terms - and being able to articulate and test trade-offs and methods of operating in degraded states in adverse situations.
It looks at the extended enterprise – not just downstream into an organization's supply chain but also upstream into customers' processes and systems, to cover the full end to end business/digital interaction. This is especially critical when customer interaction is not just being digitized, in the sense of web/mobile access, but is also being systematized through APIs.
Finally, and in my view most significantly, it establishes impact tolerances. This is the biggest point of transformation, yet the one most confused and least talked about. Most current resilience programs focus on the ability of organizations to meet Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for punishing but not necessarily extreme events: they contemplate the failure of part of a data center or a whole data center, but not always a whole region; a supplier going away for a day, but not necessarily for weeks (or permanently); and so on. The more severe or extreme events are the real test of crisis response, beyond the more service-level-objective-oriented and "routine" resiliency management approaches many organizations have in place. Impact tolerance introduces the notion of plausible but (more) severe events, and requires establishing how long a business service can be impacted, and in what state (e.g. degraded capacity), before there is excessive or irreparable harm to the business, customers or the market as a whole. These tolerances will likely exceed more routine RTOs and require significantly more effort to sustain business services in the face of these scenarios - in fact, you'll know you're on the right track if you are selecting scenarios that push the envelope a bit. These impact tolerances will need to be agreed by the Board, and adherence regularly tested.
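The RTO-vs-impact-tolerance distinction above can be sketched in a few lines. This is purely illustrative - the service, the hour figures, and the scenarios are hypothetical assumptions, not numbers from the consultation papers:

```python
# A routine RTO targets "punishing but routine" events; an impact tolerance
# is the Board-agreed maximum disruption a business service can absorb
# before excessive or irreparable harm. The tolerance typically exceeds
# the routine RTO, sometimes by a wide margin. (Illustrative values only.)
ROUTINE_RTO_HOURS = 4
IMPACT_TOLERANCE_HOURS = 72

# Hypothetical scenarios, each with a projected disruption in hours.
# Note the deliberate inclusion of severe-but-plausible tail events.
scenarios = {
    "single data center outage": 3,
    "whole-region outage": 30,
    "key supplier fails permanently": 120,
}

for name, hours in sorted(scenarios.items(), key=lambda kv: kv[1]):
    within_rto = hours <= ROUTINE_RTO_HOURS
    within_tolerance = hours <= IMPACT_TOLERANCE_HOURS
    print(f"{name}: within routine RTO={within_rto}, "
          f"within impact tolerance={within_tolerance}")
```

The point of the exercise is the third row: a scenario that breaches even the impact tolerance forces the organization to either re-engineer the service (e.g. supplier substitutability) or get the Board to consciously accept the exposure - rather than quietly excluding the scenario from planning.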
I must confess that, on my first look at this some time ago, I was tempted to dismiss it as essentially a veneer over an existing business continuity/resilience program. But I have come to see the difference in this approach, and I now think it will be transformative, for four reasons:
Truly taking a customer's perspective on the resilience of your services can uncover seams in your resilience planning that you might not expect - and it can yield some interesting risk mitigation approaches, such as substitutability of services/functions vs. point resilience.
Many organizations have in place tested RTOs against scenarios that don't go far enough into the tail risks of low-frequency but high-severity events, which should additionally be prepared for. The shift to more severe events - with the flexibility not to simply adhere to an existing RTO - is a healthy step that increases openness of planning and robustness of services.
The key assumption of operational resilience - that you should plan to respond to events as if they will happen, rather than claiming other risk mitigations will prevent them from occurring - side-steps organizational pressure to avoid actually planning in depth for these more extreme events.
Organizations need a less binary approach than perfect resilience vs. outright failure - one that accepts operating in degraded states during the process of recovery. This framework is an opportunity to enshrine that, in addition to further driving consideration of a wider array of risks in scenario selection.