It’s easy to critique the recent Voltage report on breaches. (For example, “2009 started out to be a good year for hackers; in the first three months alone, there were already 132 data breaches reported.” That there were 132 breaches does not mean that hackers are having a good year; most breaches are not caused by hackers, and most breaches are small.)

But there are some really interesting tidbits, including the claim that the log (base 10) of the size of a breach follows a normal curve with a mean of 3.5 and a standard deviation of 1.2, which means the typical (median) breach is about 3,200 people. I’ve been saying for a while that all the breaches we remember are outliers, and Voltage’s analysis would indicate that about 97.7% of breaches fall below two standard deviations above the mean, or 10^5.9 — about 790,000 people. (It’s unclear why, having done analysis of the size of breaches, they use an order-of-magnitude system for rating breaches, rather than something based on deviations.)
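Those figures are easy to check from the stated parameters. A minimal sketch (variable names are mine; this assumes Voltage's claimed log10-normal fit of mean 3.5 and standard deviation 1.2):

```python
from statistics import NormalDist

# Voltage's claimed fit: log10(breach size) ~ Normal(mu=3.5, sigma=1.2)
mu, sigma = 3.5, 1.2

median_breach = 10 ** mu                 # median breach size, ~3,162 records
two_sd_breach = 10 ** (mu + 2 * sigma)   # two SDs above the mean, ~794,000 records

# Fraction of breaches smaller than the two-sigma point (one-sided)
frac_smaller = NormalDist(0, 1).cdf(2.0)  # ~0.977

print(f"median breach:   {median_breach:,.0f} records")
print(f"two-sigma size:  {two_sd_breach:,.0f} records")
print(f"fraction smaller: {frac_smaller:.1%}")
```

Note that 10^3.5 is the *median* of a lognormal, not its mean; the distribution's long right tail pulls the arithmetic mean much higher, which is exactly the outlier effect at work.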

What’s more interesting is that they’re making testable predictions about the future:

At that rate, we should expect roughly 528 data breaches in the next 12 months [from May 2009]. If that is the case, the probability of having one or more data breaches in the next year that exposes 1 million or more records is roughly 99.9951 percent, or a virtual certainty, and we should expect to see about 14 data breaches of that size in the next year – this represents 1 in 200 adults in the US being affected.

This model also tells us that the probability of any given breach exposing 10 million or more records is 0.001769, or about 0.18 percent. This means that we can expect about 0.18 percent, or about 1 in 565, of data breaches to be that big. If that is the case, then the probability of having one or more data breaches in the next year that exposes 10 million or more records is over 60 percent – this is the equivalent of 5% of the US adult population being affected.
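The 10-million-record numbers do follow from the stated parameters. A quick verification sketch (variable names are mine; Voltage's exact method isn't published, but the arithmetic below reproduces their figures under an independence assumption):

```python
from statistics import NormalDist

mu, sigma = 3.5, 1.2   # Voltage's log10 fit parameters
n_breaches = 528       # their predicted breach count for the next 12 months

# P(a single breach exposes >= 10 million records), i.e. log10(size) >= 7
p_big = 1 - NormalDist(mu, sigma).cdf(7.0)   # ~0.00177, about 1 in 565

# P(at least one such breach among 528), treating breaches as
# independent draws from the same distribution
p_any = 1 - (1 - p_big) ** n_breaches        # ~0.61

print(f"p(single breach >= 10M): {p_big:.6f} (about 1 in {1 / p_big:.0f})")
print(f"p(at least one per year): {p_any:.1%}")
```

The "over 60 percent" figure holds only if breaches really are independent draws from a stable distribution, which is precisely what the objections below call into question.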

The interesting thing about these predictions is that they can be tested in May 2010. (It would be helpful for Voltage to say exactly what period they mean by “the next 12 months.”) While Dissent says:

I am not sure that a logarithm model will be appropriate for predicting future breaches. If organizations were to actually learn lessons from known breaches … then we might expect to see fewer large breaches rather than more.

I tend to agree, but the great thing is our agreement *doesn’t matter*. If the prediction holds, then we know something about the model. If the prediction fails, then we know something about the model. That’s the great thing about presenting predictions which are specific and measurable. So thank you, Voltage, for putting forward predictions. I look forward to seeing how they play out.

(Mortman commented on this previously in “Voltage Security’s Breach Map.”)

I think it is a stretch to call this a loss model rather than a graphical representation of previous loss data. If it were a model, then how is it used to make decisions? This model in no way takes into account the impact or cost magnitude of a breach, and it gives no context on the security posture of the companies whose breaches make up the data set.

All this model tells me is the average size of publicly known data breaches, that Voltage claims to predict the future, and that Voltage marketing is banking on the fact that if their predictions are flawed, we will not remember them. Besides, if they are trying to predict when the next big breach will occur, they should probably use a different approach: a Poisson-like approach that takes industry type into account.

Woe to the decision maker who makes an investment decision based on this data.

Hi Chris,

Who said it was a decision model?

However, if you think that a Poisson approach is the right one, I’d encourage you to publish some data, and next May we can compare.
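For readers who want to run that comparison, a Poisson version is a small change, and with Voltage's own numbers it gives essentially the same answer as their binomial-style calculation, since the event is rare and the trial count large. (The rate below is illustrative, taken straight from Voltage's figures rather than industry-adjusted as Chris proposes.)

```python
import math

# Expected number of >=10M-record breaches per year, using Voltage's
# own numbers: 528 breaches/year * 0.001769 probability each.
# This is an illustrative rate, not an industry-adjusted one.
lam = 528 * 0.001769          # ~0.93 expected big breaches per year

# Poisson model: P(at least one event) = 1 - e^(-lambda)
p_any = 1 - math.exp(-lam)    # ~0.61, matching the "over 60 percent" claim

print(f"lambda = {lam:.3f}, P(at least one big breach) = {p_any:.1%}")
```

The near-agreement is expected: a binomial with many trials and a small per-trial probability is well approximated by a Poisson with the same mean. The interesting differences would come from letting the rate vary by industry, as Chris suggests.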