OR – RISK ANALYSIS POST-INCIDENT, HOW TO DO IT RIGHT
Rob Graham called me out on something I retweeted here (seriously, who calls someone out on a retweet? Who does that?).
And that’s cool, I’m a big boy, I can take it. And Twitter doesn’t really give you a means to explain why you think that it’s too early to do a hindsight review of Fukushima, so I’ll use this forum.
Here’s the TL;DR version: It’s too early to do hindsight or causal analysis on Fukushima – there is still a non-zero chance that something really bad could happen, we’re not at a point where the uncertainty in our information has stabilized, and any analysis done now would still be predictive about a future state.
But if you’re interested in the extended remix, there are several great reasons NOT to use Fukushima for a risk management case study just yet:
- Um, the incident isn’t over. It’s closer to contained, sure, but it’s not inconceivable that more could go seriously wrong. Risk is both frequency and impact; an incident involves both primary and secondary threat agents. Expanding our thinking to include these concepts, it’s not difficult to understand that we’re a long way from being “done” with Fukushima.
- Similarly, given TEPCO’s track record of forthrightness, I’d bet we don’t know the whole story of what has happened, much less the current state. The information that has been revealed carries so much uncertainty that it’s nearly unusable for causal analysis (more on that below).
- The complexity of the situation requires real thought, not quick judgment.
Now, Rob doesn’t claim to be an expert in risk analysis (and neither do I, I just know how horribly I’ve failed in the past), so we can’t blame him. But let’s discuss two basic reasons why Rob completely poops the bed on this one, why the entire blog post is wrong. First, post-incident, our analytics aren’t nearly as predictive as pre-incident or during-incident analytics. They can still be predictive (addressing the remaining uncertainty in whatever prior distributions we’re using), but they are generally much more accurate.
Second, what Rob doesn’t seem to understand is that post-incident risk management is kind of like causal analysis, but (hopefully) with science involved. It’s a different animal.
Post-incident risk analysis involves a basic model fit review: identifying why we weren’t precise in those likelihood (1) estimations Rob talks about. It’s in this manner that Jaynes describes probability theory as the logic of science: the hypothesis you make (your “prediction”) must be examined and the model adjusted post-experiment. It’s the basic scientific method. I don’t blame Rob for getting these concepts mixed up. I see it as a function of what Dan Geer calls “premature standardization”: our industry truly believes that the sum of risk management is only what is told to them by ISACA, the ISO, OCTAVE, and NIST about IT risk (as if InfoSec were the peak of probability theory and risk management knowledge). This is another reason to question the value of the CRISC, if in the (yet unreleased) curriculum there’s no focus on model selection, model fit determination, or model adjustment.
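As a toy illustration of what a model fit review can look like (nothing here is from the Fukushima case; the model, probabilities, and outcomes are all invented), you can score a set of past forecasts against what actually happened with a proper scoring rule like the Brier score:

```python
# Hypothetical sketch of a basic model-fit review: score a model's
# pre-incident probability estimates against observed 0/1 outcomes.
# All numbers below are invented for illustration.

def brier_score(predictions, outcomes):
    """Mean squared error between forecast probabilities and outcomes.
    Lower is better; always guessing 0.5 scores 0.25."""
    pairs = list(zip(predictions, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

predicted = [0.01, 0.10, 0.50, 0.90]  # what the model said beforehand
observed = [0, 0, 1, 1]               # what actually happened

score = brier_score(predicted, observed)
print(score)
```

A bad score doesn’t tell you which determinant was wrong, only that the hypothesis needs revisiting; that’s the “examine and adjust the model post-experiment” step.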
So if the idea of doing hindsight is invalid because what we’re dealing with now is a different animal than what we would be doing post-incident, what do we do?
First, if you really want to, make another predictive model. The incident isn’t over and we don’t have all the facts, but you could create another (time-framed) predictive model adjusted for the new information we do have.
Second, we wait. As we wait, we collect more information and do more review of the current model. But we wait for the incident to be resolved, and by resolved I mean the point where it’s apparent that we have about as much information as we’ll be able to gather. THEN you do hindsight.
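The “adjusted predictive model” in that first option can be sketched very simply. Under an assumed (and deliberately toy) Beta-Binomial model with invented numbers, folding new information into a prior frequency estimate is one line of arithmetic:

```python
# Hypothetical sketch: adjusting a prior frequency estimate with new
# incident information. Model choice and all numbers are assumptions,
# not anything from the actual incident.

def update_beta(alpha, beta, incidents, trials):
    """Conjugate Beta-Binomial update of a Beta(alpha, beta) prior."""
    return alpha + incidents, beta + (trials - incidents)

def posterior_mean(alpha, beta):
    """Point estimate of the incident frequency under the Beta model."""
    return alpha / (alpha + beta)

prior = (1, 99)  # prior belief: incidents are rare (mean 0.01)

# Fold in what we've learned so far: 1 incident in 50 observation periods.
posterior = update_beta(*prior, incidents=1, trials=50)

print(posterior_mean(*prior), posterior_mean(*posterior))
```

The point is that this is still a predictive exercise: the posterior carries the remaining uncertainty forward, it doesn’t explain the causes.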
At the point of hindsight you review the two basic aspects of your risk model for accuracy: frequency and impact. Does the impact of what happened match expectations? To what degree are you off? Did you account for radioactive spinach? Did you account for panicky North Americans and Europeans buying up all the iodine pills? You get the picture. If, as in this case, there may be long-term ramifications, do we make a new impact model? (I think so, or at least create a wholly new hypothesis about long-term impact.)
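A crude way to ask “to what degree are you off?” on the impact side (the figures here are invented, not Fukushima data):

```python
# Hypothetical impact review: did the actual loss land inside the range
# the model gave, and by how much did we miss? Invented figures only.

est_low, est_high = 1.0, 20.0  # pre-incident impact range (say, $M)
actual = 35.0                  # observed impact

within_range = est_low <= actual <= est_high
miss = max(actual - est_high, est_low - actual, 0.0)
relative_miss = miss / est_high  # how far outside the range, proportionally

print(within_range, relative_miss)
```

When the actual impact blows past the high end like this, that’s your cue to build the wholly new impact hypothesis rather than tweak the old one.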
Once you’re comfortable with impact review and any further estimation, you tackle the real bear: your “frequency” determination. Depending on whether you lean toward Bayesian probabilities or frequentist counts, you’ll approach the subject differently, but with key similarities. The good news is that despite the differing approaches, the basic gist is the same: identify the factors or determinants you missed or were horribly wrong about. This is difficult because, more often than not in InfoSec, that 3rd bullet up there, the complexity thing, has ramifications. Namely:
There usually isn’t one cause, but a series of causes that create the state of failure. (The link is Dr. Richard Cook’s wonderful .pdf on how complex systems fail.)
In penetration testing (and Errata of all people should know this), it’s not just the red/blue team identifying one weakness, exploiting it, and then #winning. Usually (and especially in enterprise networks) there are sets of issues that cause a cascade of failures leading to the ultimate incident. It’s not just SQLi; it’s SQLi, malware, obtain creds, find/access conf. data, exfiltrate, anti-forensics (if you’re so inclined). And that’s not even discussing tactics like “low and slow”. Think about it: in that very simple incident description, we can identify a host of controls, processes, and policies (and I’m not even bringing error into the conversation) that can and do fail, each causing the emergent properties that lead to the next failure.
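That chain can be caricatured numerically. As a deliberately oversimplified sketch (stage probabilities are invented, and the stages are treated as independent, which real incidents violate), the incident only completes if every stage in the cascade does:

```python
# Hypothetical sketch: an incident as a cascade of stage failures,
# not one cause. Probabilities are invented; independence is assumed
# purely to keep the arithmetic simple.
from functools import reduce

chain = {
    "sqli": 0.30,            # initial foothold
    "malware": 0.50,         # payload executes
    "obtain_creds": 0.40,
    "find_conf_data": 0.60,
    "exfiltrate": 0.70,
}

# Probability the full cascade completes: the product of stage probabilities.
p_incident = reduce(lambda a, b: a * b, chain.values(), 1.0)

print(p_incident)
```

The number itself is meaningless; the structure is the point. Hardening any single link shrinks the product, which is exactly why causal analysis has to walk the whole dependency trail instead of hunting for “the” cause.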
This dependency trail is being fleshed out for Fukushima right now, but we don’t know the whole story. We certainly didn’t count on diesel generators having to resist a tsunami; but then there was the incompatibility of the first backup generators to arrive, the fact that nobody realized a big earthquake/tsunami would create a transportation nightmare that made it take a week to get new power to the station, and probably dozens of other minor causes that created the whole of the incident. Without being at an end-state determination, where we have a relatively final amount of uncertainty around the number and nature of all causes, it would be absurdly premature to start any sort of hindsight exercise.
It bears repeating: it’s too early to do hindsight or causal analysis on Fukushima. There is still a non-zero chance that something really bad could happen, we’re not at a point where the uncertainty in our information has stabilized, and any analysis done now would still be predictive about a future state.
Finally, this: “The best risk management articles wouldn’t be about what went wrong, but what went right.” is just silly. The best risk management articles are lessons learned so that we are wiser, not some self-congratulatory optimism.
(1) (BTW, gentle reader, the word “likelihood” means something very different to statisticians. Just another point where we have really premature standardization.)