Measurement Priorities
Seth Godin asks an excellent question:
Is something important because you measure it, or is it measured because it’s important?
I find that we tend to measure what we can, rather than working toward being able to measure what we should, in large part because some variation of this question is not asked.
I’m going to pick on malware fighting as a case in point. Is there lots of nasty malware out there that can really destroy your infrastructure? Absolutely, and as a result, IT and IT Security teams put tremendous effort into detecting and cleaning malware infections.
But how much of that malware actually impacts business, either by affecting the availability of the IT environment or producing a material incident? If your business is like most, then the answer is hardly ever*.
So why does the security industry spend so much money and time (another form of money) on malware? Because we can. Never mind that the stuff you should be truly worried about if you’re talking about protecting The Business (as opposed to The Infrastructure) is the APT/Custom Malware/Targeted Threat stuff, which is invisible on the anti-virus console.
Because they can.
* While that could be changing thanks to innovations like Stuxnet, who honestly thinks that messing their business up is worth burning three 0days? Really? Get over yourself.
In the meantime, you can test this argument in your own environment. Compare the number of pieces of malware you’ve detected and cleaned (as opposed to prevented) with the number that significantly impacted more than the infected person’s machine.
We’ve had one in the past year that might meet the disruption test, versus multiple malware cleanup tickets per week (not daily, but it tends to be spiky, so the average is greater than one per day. Still…). It took out a single user who had managed to break his Anti-Virus because either he didn’t like when the full scan ran or it kept stopping him from installing the trojaned, pirated software he’d downloaded–I never quite got a clear answer on this. The infection jammed the mail queue with outbound spam, causing a degradation (but not disruption) of outbound email for a few hours.
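If you want to put a number on that comparison, here is a minimal sketch. The function name and the counts in the usage lines are placeholders I made up for illustration, not figures from our ticketing system; plug in your own.

```python
def impact_ratio(cleaned: int, business_impacting: int) -> float:
    """Fraction of cleaned infections that affected more than the
    infected person's machine.

    Both arguments are counts over the same period (hypothetical inputs:
    pull them from whatever ticketing or AV reporting you have).
    """
    if cleaned < 0 or business_impacting < 0:
        raise ValueError("counts must be non-negative")
    if business_impacting > cleaned:
        raise ValueError("impacting incidents cannot exceed cleaned infections")
    if cleaned == 0:
        return 0.0  # nothing cleaned, so nothing could have had an impact
    return business_impacting / cleaned


# Illustrative placeholder numbers only.
ratio = impact_ratio(cleaned=200, business_impacting=1)
print(f"{ratio:.2%} of cleaned infections had wider impact")
```

A very small ratio does not prove the cleanup effort is wasted, but it does tell you how much of what you are counting ever touched the business.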
Three points to make have I!
1. I’m going to pick on your malware case, even though I get the point of your original thesis. 🙂 We’ve been blessed in the last 5 years in terms of malware because it’s not regularly taking down infrastructure. Go back beyond 5 years, though, and we start talking about worms that would saturate entire Internet links, servers, networks, and systems, maybe even to the point of not being able to clean them confidently enough without a rebuild (data loss?). If we get too anti-anti-malware in security, we’re going to open ourselves back up to history repeating. To me, this is a lot of security: shoring up the possible issues so that we’re not looking like idiots when something reappears that we forgot about just because we’ve not seen it in a while. How much malware still attempts to propagate over network shares or direct connections, and how much of that has been stopped because anti-malware is often packaged with host-based firewalls now?
2. Don’t forget the impact also goes to the IT team that needs to clean up malware issues, which most probably pulls them away from other projects.
3. So what should we be measuring? (Did I just ask that? I don’t mean that in a snarky way; more like a hypothetical…)
Vamp, I thought about the large-scale worm days of yore and agree it’s been a long time since we saw people getting knocked offline by worms.
The worst of them were all mitigated by patching, though, and that’s something that people either didn’t (don’t) measure or measure dishonestly because it’s too hard to do well at the level which “looks good” as an honest metric.
For example, most larger organizations struggle to get 95% or even 90% of critical patches applied within 30 days. That’s probably adequate to allow response teams to deal with even the super-worms of yore like Code Red, SQL Slammer, et al., which were all based on months-old vulnerabilities, but it doesn’t look good when people are being taught to think in terms of “nines”–decimals of 99%.
Part of a good metric, IMHO, is that it has deliberately defined goals or thresholds rather than lazily borrowing old telco availability standards.
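To make the patching example concrete, here is a rough sketch of measuring it against a goal you chose on purpose. The record layout, function names, and the dates in the usage lines are all assumptions for illustration; the 30-day window and 90% goal echo the numbers above, but the right values are whatever your organization decides deliberately.

```python
from datetime import date, timedelta
from typing import Optional, Sequence, Tuple

# (release date of the critical patch, date applied, or None if still missing)
PatchRecord = Tuple[date, Optional[date]]


def patch_compliance(records: Sequence[PatchRecord], window_days: int = 30) -> float:
    """Fraction of patch records applied within window_days of release.

    Unapplied patches (None) count against the rate, which is what
    keeps the metric honest.
    """
    if not records:
        raise ValueError("no patch records to measure")
    window = timedelta(days=window_days)
    on_time = sum(
        1
        for released, applied in records
        if applied is not None and applied - released <= window
    )
    return on_time / len(records)


def meets_goal(rate: float, goal: float) -> bool:
    """Compare against an explicitly chosen goal, not a borrowed 'nines' target."""
    return rate >= goal


# Illustrative placeholder records only.
records = [
    (date(2024, 1, 1), date(2024, 1, 15)),  # applied in 14 days
    (date(2024, 1, 1), date(2024, 1, 31)),  # applied in exactly 30 days
    (date(2024, 1, 1), date(2024, 2, 15)),  # applied late
    (date(2024, 1, 1), None),               # never applied
]
rate = patch_compliance(records, window_days=30)
print(f"{rate:.0%} on time; goal met: {meets_goal(rate, goal=0.90)}")
```

The point of passing the goal in explicitly is that someone has to pick it and defend it, rather than inheriting 99.9% from an availability chart.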
As to what we should be measuring, I think it’s the things that are currently either too hard or not cost-effective to measure (e.g. real losses). Thus, the trick becomes determining how to reduce the effort of getting there.