10 Fundamental (but really hard) Security Metrics
As an industry we have been trying to deal with the issue of security metrics for a long time. I’ve written about this here, and in the context of Board and risk reporting here and here. It is still a struggle though.
I’m becoming more convinced it is a struggle because we (and I absolutely include myself in this) are not being brave or anywhere near ambitious enough in our measurement goals. We have fallen into, and continue to stay mostly in, the trap of “count what you can count” as opposed to “count what counts”.
Instead, we need a smaller number of more fundamental metrics that drive bigger outcomes. We might call these “Pareto metrics” in that they represent 20% of the range of metrics one could develop but they drive at least 80% of the outcomes and behaviors to take security to the next level.
But here’s the issue: many of them are immensely difficult to measure. Also, when measured they will look like failure, because so much progress is needed to bring them into line - that is kind of the point, though. They are also inevitably subtle, so the Board and management may not initially understand them. They will need to be educated on how to think about them, use them, and set expectations for progress toward some acceptable level of adherence. As a result of all of this you will need to make liberal use of the "don’t fire me chart".
As I’ve been discussing some of these concepts with a lot of very wise and experienced people over the past few months, I’ve been getting quite a bit of pushback. This is because people immediately and correctly conclude that it will be hard to do, for all the reasons I just stated. As a result I’ve been hesitant to write this post, because I’ve been tempted to think I might just be wrong on this.
However, as I think about this more I am becoming more certain that we have to do this. Perhaps not exactly the ideas and examples I give below, but something similar and maybe better. The main point, though, is that these metrics aren’t necessarily to be driven to 100% (or 0, depending on the measure). Rather, organizations can choose where to set their goal and hence their risk appetite. Achieving the goals defined by these metrics should also create adjacent benefits beyond just mitigating cybersecurity risk - for example, driving the modernization of the technology environment with all the commercial or mission benefits that come from that. Even the act of figuring out how to measure these will bring improvements in and of itself. Here is what could be the Top 10:
1. Software Reproducibility
What percentage of your entire software estate is reproducible through a CI/CD pipeline (or equivalent)? From this comes so much in terms of security, reliability, resilience and recovery, as well as business agility. Patching and vulnerability management outcomes can stem from this. If this percentage is low then, inevitably, the time to resolve vulnerabilities or the completeness of resolution will not be what you want. In my experience there are a small number of tech companies or digital natives that are 80%+ (if not near 100%), other types of organizations would be doing well to hit 50%, but sadly a lot are in the 10-20% range.
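As a minimal sketch of how this could be computed, assuming a hypothetical component inventory where each entry records whether it can be fully rebuilt from source through a pipeline (the inventory, field names, and components here are all illustrative):

```python
def reproducibility_pct(components):
    """Percentage of software components rebuildable via a CI/CD pipeline."""
    if not components:
        return 0.0
    reproducible = sum(1 for c in components if c["pipeline_rebuildable"])
    return 100.0 * reproducible / len(components)

# Hypothetical inventory entries.
inventory = [
    {"name": "billing-api",   "pipeline_rebuildable": True},
    {"name": "legacy-erp",    "pipeline_rebuildable": False},
    {"name": "web-frontend",  "pipeline_rebuildable": True},
    {"name": "batch-reports", "pipeline_rebuildable": False},
]

print(f"{reproducibility_pct(inventory):.0f}%")  # 50%
```

The hard part, of course, is not the arithmetic but building and trusting the inventory behind it.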
2. Infrastructure Reproducibility
What percentage of your infrastructure (on-premise or cloud) is software defined, follows an immutable infrastructure pattern, and has its defining configuration code adhere to the reproducibility approach defined in metric 1?
3. Software Lifecycle Security
For each business/mission process, what is the lowest common SLSA level of the software that underpins that process? This requires a software dependency map tied to a business or mission process inventory. It means you are only as good as the weakest link in that dependency chain - which, of course, will be the thing that bites you if you don’t drive it up.
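The weakest-link computation itself is simple once the dependency map exists. A sketch, assuming a hypothetical map from process to components and per-component assessed SLSA levels (all names and levels below are illustrative):

```python
# Hypothetical assessed SLSA levels per software component.
slsa_levels = {"order-service": 3, "payments-lib": 2, "legacy-gateway": 1}

# Hypothetical map: business process -> software components it depends on.
process_dependencies = {
    "take-customer-order": ["order-service", "payments-lib"],
    "settle-payments":     ["payments-lib", "legacy-gateway"],
}

def lowest_common_slsa(process):
    """A process is only as strong as its weakest dependency."""
    return min(slsa_levels[dep] for dep in process_dependencies[process])

for proc in process_dependencies:
    print(proc, "->", lowest_common_slsa(proc))
```

In practice the effort goes into the dependency map and the honesty of the per-component assessments, not this `min()`.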
4. Time to Reboot the Company
Imagine everything you have is wiped by a destructive attack or other cause. All you have is bare metal in your own data centers or empty cloud instances and a bunch of immutable back-ups (tape, optical, or other immutable storage). Then ask: how long does it take to rehydrate/rebuild your environment? In other words, how long does it take to reboot the company? Note: there is a subtle difference here between conventional backup/restore processes and rebuild/reboot. I’ve seen a lot of companies struggle with timely recovery from ransomware events despite having otherwise good backups. This is because the backups were sufficient for a conventional server failure, site disaster, or file-deletion restore, but insufficient for a rebuild/reboot - e.g. missing catalogs, missing software, missing configuration.
From what I’ve seen there’s almost no organization that can answer this question completely, at extreme breadth, depth and accuracy - it’s unusual even to be able to answer it partially. However, I’ve seen positive outcomes in a number of organizations for certain things like desktops, core critical systems, and some key business systems, which have gone from unknown to known measurement - albeit with timeframes still too long. They have then invested to bring this down from weeks, to days, to hours. One organization I knew went from a global desktop recovery status of unknown to being able to take 80,000+ virtual and physical desktops from bare metal to full function in a few hours. Others I’ve seen can fully rehydrate their "core" (DNS, network, authentication, directory, time, and other services) in a few hours.
The great thing about this metric is that it is feasible to put a cost on each reduction in risk, and it’s usually quite clear where the diminishing returns are. For example, to go from 1 month to 1 week might be $X, 1 week to 1 day might be $10X, but 1 day to 1 minute might be $1000X. For Boards or executive leadership who might initially, naively, demand instant recovery, this gives a more palatable and balanced range of options. Again, in many organizations I have seen use a variant of this measure, management actually considered $10M a bargain to go from 1 week+ to sub-1-day for core functions. Many also use cloud and IT modernization approaches as a means of doing this and, of course, it becomes easier if you can become really good at software and infrastructure reproducibility.
5. OODA Spread (Observe, Orient, Decide, Act)
How much faster (or slower) is your OODA loop than your attackers’? Responsiveness and adaptiveness in the face of attackers’ capabilities and intent is a key signal of how likely you are to be subject to a successful attack. This is very hard to measure: doing so requires an understanding of the threats you face and the ability to monitor, or procure services to understand, their evolving TTPs. It also requires you to measure your ability to successfully tactically mitigate or strategically outclass threats (solve for whole classes of attacks). Not all organizations will need this measure, especially if you are in the class of the opportunistically targeted. A more advanced variant, especially if you are on the wrong side of the metric, is to measure its first-order derivative so you can instead watch how quickly you are closing the spread.
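To make the spread and its derivative concrete, here is a sketch assuming hypothetical periodic estimates (in days) of your loop time versus an attacker’s; all the numbers are illustrative:

```python
# Hypothetical quarterly estimates of OODA loop duration, in days.
defender_days = [30, 24, 18, 14]
attacker_days = [10, 10, 9, 9]

# Spread: positive means the attacker cycles faster than you do.
spreads = [d - a for d, a in zip(defender_days, attacker_days)]

# First-order difference: how fast the spread is closing each period.
closing_rate = [s2 - s1 for s1, s2 in zip(spreads, spreads[1:])]

print(spreads)       # [20, 14, 9, 5]
print(closing_rate)  # [-6, -5, -4] -> closing, but decelerating
```

The derivative view is what tells a Board "we are still behind, but catching up at a known rate" rather than just "we are behind".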
6. Blast Radius Index
What percentage of roles in your organization have a potential incident (insider-risk or error driven) damage blast radius greater than the organizational span of the role N steps (e.g. N=2) above them? There could be many ways to parameterize the risk appetite here and many ways of constructing the index as a risk-weighted basket across different parts of the organization. This is, like all our other Pareto metrics, hard to measure, but doing the measuring will likely reveal many issues that need resolving before you can even measure.
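One hypothetical way to construct such an index, assuming an org tree where each role records its manager, its span (headcount beneath it), and an estimated blast radius in the same headcount-affected units (the roles and numbers are invented for illustration):

```python
# Hypothetical org tree with spans and estimated incident blast radii.
roles = {
    "ceo": {"manager": None,  "span": 1000},
    "cto": {"manager": "ceo", "span": 300},
    "dba": {"manager": "cto", "span": 1, "blast_radius": 1200},
    "dev": {"manager": "cto", "span": 1, "blast_radius": 40},
}

def span_n_up(role, n):
    """Span of the role n management steps above the given role."""
    for _ in range(n):
        mgr = roles[role]["manager"]
        if mgr is None:
            break
        role = mgr
    return roles[role]["span"]

def blast_radius_index(n=2):
    """Percentage of measured roles whose blast radius exceeds the
    span of the role n steps above them."""
    measured = [r for r, v in roles.items() if "blast_radius" in v]
    breaches = sum(1 for r in measured
                   if roles[r]["blast_radius"] > span_n_up(r, n))
    return 100.0 * breaches / len(measured)

print(blast_radius_index(2))  # 50.0
```

Here the DBA role breaches the index (one person whose mistake or malice could affect more than the CEO’s entire span), which is exactly the kind of concentration the metric is meant to surface.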
7. Systems Stagnancy
I’m not a fan of the term legacy system, even though I’ve been known to use it a lot. Some legacy systems can very well be a true “legacy” of the company and might even be well maintained, reliable, secure and supported despite being of an older technology generation. The real issue, and what I think we actually mean when we casually throw around the phrase legacy systems, is systems that are stagnant - in other words, systems that are not maintained or kept up to date. As with many of these metrics, it might be acceptable to have stagnant systems under various risk conditions, but having full transparency into what is stagnant is crucial.
8. Preventative Maintenance
A big root cause of many of the issues that make these fundamental metrics hard to measure, or to drive good outcomes from, is insufficient budget or resources. This reduces a team’s ability to undertake activities like maintenance, technical debt pay-down, or other work that would in other realms be considered preventative maintenance. There’s likely no single correct level for this, but it’s reasonable for there to be an assigned budget amount, expressed as a percentage of the wider operating budget, that can go up or down each year (or quarter, depending on your budget processes). It’s reasonable for management or the Board to dictate that this budget increases according to prior failures (a signal that more maintenance is needed), or decreases once the positive effect of maintenance has been fully demonstrated (to avoid premature cutting). The key point is to make that budget eat into, or free up, the operating capital or expenditure of that business / department so that incentives are aligned.
9. Control Pressure Index
For specific lines of attack (application compromise, e-mail delivered malware, web drive-by downloads, etc.), what is the average level in the defense-in-depth stack that stops the attack, and at what point is the attack detected? The logic here: the slower the detection, or the further down the defense-in-depth stack the attack gets, the more investment needs to be made to redesign or otherwise bolster that defense-in-depth to lessen the pressure.
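A sketch of the index, assuming hypothetical incident records that note, for each line of attack, which defense-in-depth layer (1 = outermost) finally stopped it; the attack lines and layer numbers are illustrative:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident records: which layer stopped each attack.
incidents = [
    {"line": "email-malware", "stopped_at_layer": 1},
    {"line": "email-malware", "stopped_at_layer": 3},
    {"line": "web-drive-by",  "stopped_at_layer": 2},
    {"line": "web-drive-by",  "stopped_at_layer": 4},
]

by_line = defaultdict(list)
for i in incidents:
    by_line[i["line"]].append(i["stopped_at_layer"])

# Higher average -> attacks penetrating deeper -> more control pressure.
pressure = {line: mean(layers) for line, layers in by_line.items()}
print(pressure)
```

A rising average for a given line of attack is the signal to reinvest in the outer layers for that line; time-to-detect could be tracked the same way alongside it.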
10. Inventory Triangulation
What percentage of your inventories are subject to reconciliation with other inventories, and/or how much does an inventory have to be accurate because another process depends on it? Many of the other metrics rely on a set of inventories, and rely on those being accurate. Inventories are rarely accurate unless they are actively used in a process that would visibly fail without such accuracy. So, you have to measure how much this is in fact true.
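The simplest form of triangulation is a set reconciliation between two inventories that should agree. A sketch, assuming a hypothetical CMDB and the asset list a patching system actually sees (hostnames invented for illustration):

```python
# Two hypothetical inventories that should describe the same estate.
cmdb = {"host-a", "host-b", "host-c", "host-d"}
patching = {"host-b", "host-c", "host-e"}

agreed = cmdb & patching
only_cmdb = cmdb - patching       # assets nothing is patching
only_patching = patching - cmdb   # shadow assets missing from the CMDB

coverage = 100.0 * len(agreed) / len(cmdb | patching)
print(f"{coverage:.0f}% of the union reconciles")  # 40% of the union reconciles
```

The two difference sets are often more valuable than the percentage itself: each entry is a concrete gap in either patching coverage or the CMDB.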
There could be many other ideas to be developed such as what percentage of business transactions are reversible, how much will certain types of incident move your company's credit rating, what is the extent to which missions can be executed in varying degrees of degraded IT operation, what percentage of business line profitability can be compensated for by insurance coverage for disruptive events, and how many people have privileges that span toxic combinations of access that work against your blast radius index. You can likely think of many more - but they all have the same pattern: they are hard to measure, but in working to do so you get benefits from the ability to measure itself as well as revealing necessary actions to drive outcomes that have many benefits beyond just security.
The one big negative point from a lot of people I’ve spoken to about these ideas is the correct assertion that the Board or Executive leadership (perhaps even IT leadership) might not understand these metrics. This is absolutely correct. But we have to educate people. It is worrying that the consensus is still that we have to talk to the Board only on the Board’s terms. That is true to a degree, but it is not unreasonable to expect a deeper level of understanding from Boards and Executives on these metrics, provided we take the effort to educate them and are then ruthlessly consistent in their use, so that the investment of time to understand them is seen as worthwhile.
This happens in other fields and risk domains. When you look at the Boards of banks, healthcare, energy, or a variety of other companies, each of these Boards has to be adept at understanding, qualitatively and quantitatively, some quite intricate topics. This could be accounting, stressed bank capital, value at risk, pharmaceutical safety, energy regulation and safety, and a variety of other topics. Board members are selected for some of this expertise, and others are immersed in training as part of Board induction. For example, when I joined a Bank Board, despite having worked at the same Bank, I still had to go through significant training to understand and be able to deal with credit risk, market risk and liquidity risk concepts, as well as a number of non-security compliance topics I didn’t know enough about. It would have been unthinkable for me to expect the specialist teams or business units that presented to the Board to have to define terms in rudimentary ways when they were asking for Board approval of, say, a credit risk limit increase. Regulators even scrutinized the Board’s abilities, to assure at least a supervisory degree of competence. Similar things happen in other critical infrastructure sectors. Now that all companies have become digital businesses with missions founded in technology, security and resilience, expecting more expertise from Boards on this topic is entirely reasonable.
Bottom line: there are too many inconsequential security metrics - either vanity metrics, or metrics that are available to us but are not what we truly need. We need to adopt more “Pareto metrics” that drive more significant outcomes. This will be hard, but it will be worth it - both for the outcomes and adjacent benefits, and for the improvements that stem intrinsically from the ability to measure. There will be pushback that these are subtle or difficult topics that might not be intuitive to Boards or leadership, so education and consistency will be crucial. Indeed, this education may even have adjacent benefits of its own.