Delivering Security at Scale: From Artisanal to Industrial
Maturing a security program in any type of organization means not just increasing the effectiveness of specific controls but also increasing the program's scale, predictability and reliability - otherwise that effectiveness cannot be sustained. A key part of this is to move from "artisanal" ways of working to more "industrial" ones - that is, to move beyond individual team member craftsmanship toward relentlessly consistent, organization-wide outcomes.
Let's examine the difference between artisanal and industrial security programs, the metrics that measure the evolution, the forces to harness, the need for continuous controls monitoring and, finally, the perspective of end-to-end business service (or mission) assurance.
1. Artisanal vs. Industrial
I don't mean to dismiss artisanal approaches - they often produce high-quality work - but they rely on individual rather than collective performance to get the right outcomes. An industrial approach, by contrast, establishes an extended organization of people with different abilities and skills, organized around an overall objective. The people in these organizations can still be immensely skilled, but they deliver enterprise-wide outcomes with a degree of quality and predictability that is not solely dependent on any one individual.
When I use the term industrial I want you to think not of, say, a 19th-century steel mill but rather a 21st-century automated high-tech manufacturing plant. The following table highlights some of the differences between the artisanal and the industrial.
So, to shift your organization from artisanal to industrial, you need to shift individual ways of working in the same direction. But there are other things you need to do as part of this industrial design.
2. Pareto Metrics
A big part of your drive to industrialization is to shift the measurement of performance goals from solely lagging indicators to include more leading indicators. Part of this is to use metrics (or wider performance and risk indicators) and their operation as a means to drive scale, performance, and accountability in your program. I call these Pareto Metrics.
I first coined the term Pareto Metrics in this post to mean metrics that not only capture risk reduction and improvements to an organization's processes and tooling, but whose very act of measurement drives a set of activities that advance the outcome being measured. I call them Pareto Metrics because they might be the 20% of metrics that drive 80% of the outcomes you want. That post defined the properties of these types of metrics in more depth.
An industrial-scale security program thrives on defining and bringing itself into conformance with some really hard Pareto Metrics and recognizes that these specific metrics require significant team effort to deliver them.
Some examples of these include:
Software lifecycle maturity
Cold restart time
Blast radius index - systems and people
Control pressure index
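These metrics are deliberately hard to pin down, but even a crude sketch shows the kind of computation involved. As a hedged illustration only - the access map, identity names, and scoring here are all invented, not a standard definition - a simple "blast radius index" for people might look like:

```python
# Hypothetical sketch of a "blast radius index" for identities.
# Assumes an access map of {identity: set of reachable systems}; the
# index here is the largest fraction of systems any single identity
# can reach. A real implementation would weigh data sensitivity,
# transitive access, service accounts, and so on.

def blast_radius_index(access: dict[str, set[str]], total_systems: int) -> float:
    """Return the worst-case share of systems one identity can touch."""
    if total_systems == 0:
        return 0.0
    widest = max((len(systems) for systems in access.values()), default=0)
    return widest / total_systems

access = {
    "alice": {"billing", "crm"},
    "ci-bot": {"billing", "crm", "build", "deploy"},  # automation often dominates
}
print(blast_radius_index(access, total_systems=8))  # 0.5
```

Even this toy version illustrates the organizational pull of a Pareto Metric: to compute it at all, you first need an accurate, maintained inventory of identities, systems, and entitlements.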
It is perhaps a necessary characteristic of Pareto Metrics that they are difficult to measure. They're worth the effort because getting them moving in the right direction inevitably forces you to create the broader organizational processes, systems and teamwork needed to deliver them - which is what you would need to do anyway to achieve the outcomes behind narrower metrics, except that with narrow metrics those organizational needs stay buried.
A big challenge, though, is that when first measured these metrics will look like failure, because so much progress is needed to bring them into line - which is rather the point. They are also inevitably subtle, so the Board and management may not initially understand them. They will need to be educated on how to think about them, how to use them, and how to stipulate progress toward some acceptable level of adherence. As a result of all of this, you will need to educate people about the uncanny valley of security.
3. Inherent Forces to Tap Into
Another part of industrialization is to tap into forces (or megatrends) that naturally help you. Also, remember, adopting a strategy that flat out goes against one of these megatrends might be a signal that your strategy is wrong.
There are many examples of such forces in other domains ranging from regulatory structures, macro-economic forces, energy sustainability, demographic shifts, waves of new technology and so on. To pick a current example, if your strategy relies on your competition not becoming 10X more effective because of AI then you’re likely to have a problem.
In this Google Cloud blog post I cover the cloud security megatrends that drive iterative improvement in cloud-scale security. There are more general megatrends that shape how we industrialize and modernize IT, and also security, but these cloud security megatrends capture many of them.
While all of these megatrends drive industrial scale, a few deserve particular focus:
Software defined infrastructure. This is a particular advantage for security since configuration in the cloud or in cloud-like environments is inherently declarative and programmatically configured. This also means that configuration code can be overlaid with embedded policy intent (policy-as-code and controls-as-code). You can validate configuration by analysis, and then can continuously assure that configuration corresponds to reality. You can model changes and apply them with less operating risk, permitting phased-in changes and experiments. As a result, you can take more aggressive stances to apply tighter controls with less reliability risk.
Software deployment velocity. Using a continuous integration/continuous deployment (CI/CD) model is a necessity for enabling innovation through frequent improvements. This automation and increased velocity decreases the time spent waiting for fixes and features to be applied - including security features and updates - and permits fast roll-back for any reason.
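To make the software-defined infrastructure point concrete, here is a minimal, hypothetical policy-as-code check: declarative resource configuration is validated against embedded policy intent before it is applied. The config shape and rules below are invented for illustration and are not any particular tool's schema.

```python
# Hypothetical policy-as-code sketch: because cloud configuration is
# declarative, policy intent can be checked by analysis before a change
# is applied, rather than audited after the fact.

def check_policy(config: dict) -> list[str]:
    """Return a list of policy violations for a declarative resource config."""
    violations = []
    for name, bucket in config.get("buckets", {}).items():
        if bucket.get("public_access", False):
            violations.append(f"bucket {name}: public access is forbidden")
        if not bucket.get("encryption", False):
            violations.append(f"bucket {name}: encryption at rest is required")
    return violations

config = {"buckets": {"logs": {"public_access": True, "encryption": True}}}
for violation in check_policy(config):
    print(violation)  # bucket logs: public access is forbidden
```

The same check can run at proposal time (blocking a bad change) and continuously against live state (assuring configuration still corresponds to reality), which is exactly the property that lets you apply tighter controls with less reliability risk.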
4. Continuous Control Monitoring
Another essential element of your industrialized security program is to know definitively, in real time, the state of all your required controls and to learn and adapt in response to any detected failures. In other words, not just to do continuous control monitoring but also to operate what you might call a Control Reliability Engineering program.
This is so critical because the majority of attacks still follow a common pattern. They are not the result of some awesome attacker capability to exploit a hitherto unknown vulnerability, or the realization of risk from some uncontemplated combination of control weaknesses. Rather, a remarkably common pattern is that the control or controls that would have stopped the attack (or otherwise detected/contained it) were thought to be present and operational but, for some reason, were not - just when they were most needed.
There can be many reasons for these failures to sustain the correct implementation of important controls, for example: a bad change, operational error, an incomplete implementation, other faults or breakage, some wider change that nullified the in-situ control, an ongoing lack of configuration discipline, and many other operational factors. In fact, any issue that drives any type of system error can negate or disable a control. The incident/after-action reviews no doubt include the same moments of people exclaiming: "But didn't we have a [ process | tool | component ] to stop that happening?"
Incidentally, we talk about attacks but a similar pattern exists for control failures that lead to other types of incidents across the full spectrum of enterprise risk domains. I’ve seen plenty of examples where there were runaway issues of system reboots, access revocation, duplicated transactions or errant algorithms where the circuit-breaker or other control harness (that was designed to be the independent safety check) failed due to insufficient continuous testing.
So, what to do? Treat controls as first-class objects, like other parts of a system's function.
Catalog controls. Build a catalog of key controls using a well-formed ontology. OSCAL and an evolving ecosystem of continuous controls validation tooling are also making great progress here.
Control design reviews. Conduct independent assurance / design reviews for key controls. This doesn't have to be fully independent - at minimum, conduct a peer review within whatever development methodology / style you operate.
Controls as code. Treat controls (especially security controls) as automation / code. Build tests / coverage for control correctness as you would with other code.
Build-time tests. Test for the presence and integration of controls at build time in integration / deploy suites. Different styles of test will be needed depending on the nature of the controls (component, software, hardware etc.)
Continuous control monitoring. Perform continuous control monitoring to assure the continued correct operation of controls at run time and assure sustained completeness of deployment. Minimize the time between a potential control failure and the detection/correction of that failure. This can be done by collecting data from the control’s operation but it can also be detected by injecting synthetic events to test the control's operation and its liveness / effectiveness.
Uninstrumented controls are bad controls. Declare any control that doesn’t prove amenable to such assurance or doesn’t emit the data needed for continuous monitoring to be a control in need of improvement / replacement. This is irrespective of whether it is, otherwise, an effective control.
Record control incidents. When a control (or instance of a control) is detected as having failed, declare a "control incident" and handle it as if a security incident has occurred (as it might well become if not attended to quickly enough).
Manage control incidents. Treat control incidents as first class objects alongside security incidents (reporting, escalation, after action review, thematic analysis) whether or not the control incident actually resulted in a security incident. [consider close-calls as well as actual incidents]. Empower a team, in an SRE-like way, to operate as a CRE (Control Reliability Engineering) function to monitor for control failures and then be empowered to diagnose and fix the issues. Of course, this could well be the SRE team itself.
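The steps above can be sketched as a minimal Control Reliability Engineering loop: a cataloged control is probed with a synthetic event, and a failed probe is declared a "control incident." This is a hedged illustration only - the classes, field names, and probe interface are invented, not a real framework or the OSCAL schema.

```python
# Illustrative CRE loop: catalog entry -> synthetic probe -> control incident.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Control:
    control_id: str   # stable identifier in the control catalog
    name: str
    owner: str        # accountable team

@dataclass
class ControlIncident:
    control_id: str
    detected_at: datetime
    description: str
    close_call: bool = False  # track near-misses as well as real failures

incidents: list[ControlIncident] = []

def probe(control: Control, inject, was_detected) -> bool:
    """Inject a synthetic event and check the control's pipeline saw it."""
    marker = f"synthetic-probe-{control.control_id}"
    inject(marker)
    return was_detected(marker)

def assure(control: Control, inject, was_detected) -> None:
    """Declare a control incident if the liveness probe fails."""
    if not probe(control, inject, was_detected):
        incidents.append(ControlIncident(
            control.control_id, datetime.now(timezone.utc),
            "synthetic probe not detected"))
        # A real program would page the CRE/SRE rotation here and open
        # the same tracking workflow used for security incidents.

# Toy in-memory "control" whose detection pipeline is silently broken:
seen: set[str] = set()
ctl = Control("DLP-1", "Data egress filter", "data-protection-team")
assure(ctl, inject=lambda m: None, was_detected=lambda m: m in seen)
print(len(incidents))  # 1
```

The point of the sketch is the shape, not the code: the probe fails quietly, exactly like the "we thought we had a control for that" pattern, and the failure becomes a tracked incident rather than a surprise during the next attack.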
5. Business Service and Mission Assurance
Finally, the overall goal of an industrialized program is to have significant ongoing assurance that your set of business services or missions (depending on the nature of your organization) are operating securely and reliably. This requires you to adopt an operational resilience mindset in which each business service is identified. You will likely want to start with your critical services. Indeed, a key benefit of this work is actually doing the mapping and determining which services really are critical.
The key elements of this approach include:
Business service identification. Identify each business service, rank its criticality, and assign an accountable executive owner.
End to end. Map the elements of that business service across the extended enterprise, from upstream customer dependencies through your systems (applications or otherwise), mapping in data sources, and then downstream to your set of supplier (3rd party) dependencies (perhaps even to the 4th or 5th party extent).
Operational resilience. For each business service, understand the major service level objectives associated with control performance and resilience - for example, recovery time objectives (RTO) and recovery point objectives (RPO) under multiple causal factors of disruption. Develop a list of plausible but severe scenarios that aren't otherwise contemplated in the list you already have. This tests the edges of what you think you can handle, to see whether something you previously excluded is now worth planning for - either because the threats and risks have changed to make it more plausible, or because you want that extra assurance.
Integrated measurement. Map your continuous controls monitoring performance and wider sets of metrics onto each business service.
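A hedged sketch of what such an end-to-end service map might look like as data - the service names, fields, and criticality scale below are all invented for illustration:

```python
# Hypothetical business service map: each service lists its systems and
# its supplier (3rd-party) dependencies, so criticality can be traced
# end to end through the chain.

services = {
    "payments": {
        "criticality": 1,                 # 1 = most critical
        "owner": "vp-payments",
        "systems": ["payments-api", "ledger-db"],
        "suppliers": ["card-network", "fraud-scoring-vendor"],
    },
    "reporting": {
        "criticality": 3,
        "owner": "vp-finance",
        "systems": ["warehouse"],
        "suppliers": [],
    },
}

def critical_suppliers(threshold: int = 1) -> set[str]:
    """Third parties that services at or above the criticality threshold depend on."""
    return {
        supplier
        for svc in services.values()
        if svc["criticality"] <= threshold
        for supplier in svc["suppliers"]
    }

print(sorted(critical_suppliers()))  # ['card-network', 'fraud-scoring-vendor']
```

Even a map this simple answers questions the artisanal approach cannot: which third parties sit under your most critical services, and therefore where continuous control monitoring results should be rolled up first.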
Bottom line: it is vital to move beyond individual artisanal excellence and scale your security program by relentlessly and progressively industrializing every part of it. Do this in a way that does not diminish the capability of individually excellent people - in other words, you want your industrialization to amplify individuals to a highest common factor and make their actions scale, rather than commoditize performance to a lowest common denominator.