Shostack + Friends Blog Archive


Human Error

In his ongoing role of “person who finds things that I will find interesting,” Adam recently sent me a link to a paper titled “THE HUMAN FACTORS ANALYSIS AND CLASSIFICATION SYSTEM–HFACS,” which discusses the role of people in aviation accidents.  From the abstract:

Human error has been implicated in 70 to 80% of all civil and military aviation accidents. Yet, most accident reporting systems are not designed around any theoretical framework of human error. As a result, most accident databases are not conducive to a traditional human error analysis, making the identification of intervention strategies onerous. What is required is a general human error framework around which new investigative methods can be designed and existing accident databases restructured. Indeed, a comprehensive human factors analysis and classification system (HFACS) has recently been developed to meet those needs.

Consider that pilots, whether private, commercial, or military, are one of the more stringently trained and regulated groups of people on the planet.  This is due, at least in part, to the history of aviation.  As the report notes,

In the early years of aviation, it could reasonably be said that, more often than not, the aircraft killed the pilot. That is, the aircraft were intrinsically unforgiving and, relative to their modern counterparts, mechanically unsafe. However, the modern era of aviation has witnessed an ironic reversal of sorts. It now appears to some that the aircrew themselves are more deadly than the aircraft they fly (Mason, 1993; cited in Murray, 1997). In fact, estimates in the literature indicate that between 70 and 80 percent of aviation accidents can be attributed, at least in part, to human error (Shappell & Wiegmann, 1996).

Once upon a time, operating an airplane was so dangerous that only highly-skilled experts could do it, and even then the equipment would get out of their control and crash.  Later (yet still almost twenty years ago), the equipment improved to the point that equipment failure no longer overshadowed operator error, but planes still get out of control and crash.

Other than the fact that pilots are almost universally still highly-skilled and/or trained operators, this doesn’t sound all that different from the evolution of computing.

Flight has obviously never seen its adoption rate explode the way PCs did in the Age of the Web, but there is still a strong parallel between aircraft accidents and Information Security failures.  The parallel becomes even stronger once the paper gets into James Reason’s “Swiss Cheese” model of understanding root causes of aircraft accidents.

Reason identifies four factors that interact with each other to increase accident rates, which I’ll paraphrase as:

  1. Unsafe Acts: The cause of the active failure (i.e., the crash), such as a poor decision, a failure to watch the instruments, or some other failure to recognize that an unsafe situation was forming or occurring
  2. Preconditions for Unsafe Acts: Situations that increase the risk of an accident, such as miscommunication among aircrew members or with others outside the aircraft (air traffic control, for example)
  3. Unsafe Supervision: Failures of management or leadership to recognize when they are, for example, pairing inexperienced pilots together in less-than-optimal conditions
  4. Organizational Influences: Usually business-level decisions, such as reducing training hours to reduce costs

How familiar does this sound?  If you’ve ever read an IT Audit report, this should seem painfully familiar, even if only analogously.  The paper provides a strong taxonomy within each area, and I could easily drill down at least one more level into each one.  Read the paper to learn more and become a better professional problem solver, security-related or otherwise.

For example, consider a real-world case I dealt with recently.  This is an easy example which ties the four levels together more neatly than many, so consider it an “Example-Size Problem” and extend as you see appropriate.

The incident was the loss of sensitive business information, which I personally believe hurt the company in a negotiation:

  1. Unsafe Act:  The VP left his unencrypted laptop unattended while at a meeting — this was the Active Failure/Unsafe Act that led to the Mishap
  2. Preconditions:  The VP assumed that others were watching his laptop, but did not explicitly confirm this fact
  3. Unsafe Supervision:  Despite knowing that Executives are high-risk users with regard to sensitive information on their laptops, the IT Executive Support Team had recommended against deploying Full-Disk Encryption on executives’ laptops because they feared being held accountable if an executive lost information due to an encryption system failure
  4. Organizational Influences:  While a Laptop Encryption Policy existed and specified that the VP’s laptop should have been encrypted for multiple reasons, the policy was widely ignored, there was no cultural pressure to ensure that mobile information was protected, and thus compliance was unacceptably low.  No pressure to comply was generated by Executive management because the cost associated with doing so was considered to be prohibitive.

In this case, the damage (opportunity cost) of lost revenue due to that single lost laptop was many multiples of the complete cost of deploying a Full-Disk Encryption system.  Unfortunately, in the absence of a comprehensive analysis of the series of failures leading up to the unsafe act, the real root cause of an incident may be ignored or mis-assigned, leading to either an incomplete or unsustainable remediation course.
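One way to keep an analysis from stopping at the unsafe act is to record findings against all four levels and check which ones are still empty.  The following is only a rough sketch of that idea in Python; the names `Level`, `Incident`, `add`, and `missing_levels` are my own illustrative choices and do not come from the HFACS paper.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Level(IntEnum):
    """The four levels paraphrased above, ordered from the active failure outward."""
    UNSAFE_ACT = 1
    PRECONDITION = 2
    UNSAFE_SUPERVISION = 3
    ORGANIZATIONAL_INFLUENCE = 4


@dataclass
class Incident:
    """Hypothetical record of one incident and the findings at each level."""
    summary: str
    factors: dict[Level, list[str]] = field(default_factory=dict)

    def add(self, level: Level, finding: str) -> None:
        # Append a finding under the given level, creating the list if needed.
        self.factors.setdefault(level, []).append(finding)

    def missing_levels(self) -> list[Level]:
        # Levels with no findings yet, i.e. where the analysis stopped short.
        return [lvl for lvl in Level if not self.factors.get(lvl)]


# An analysis of the laptop incident that stops after the first two levels.
incident = Incident("Loss of sensitive business information")
incident.add(Level.UNSAFE_ACT, "Unencrypted laptop left unattended at a meeting")
incident.add(Level.PRECONDITION, "Assumed others were watching the laptop")

print([lvl.name for lvl in incident.missing_levels()])
# prints ['UNSAFE_SUPERVISION', 'ORGANIZATIONAL_INFLUENCE']
```

An analysis that only blames the VP would leave the last two levels empty, which is exactly where the encryption recommendation and the ignored policy live.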

When incidents occur, it’s rare to see a true and honest assessment of not just what went wrong, but why.  Too often, in fact, the culture seems to be to put it down to “nobody could have predicted it.”  Reject these assessments: to improve an organization, we must refuse to accept such explanations.  Instead, find the root cause, all the way up to the Organizational Influences, and then Fix It.

2 comments on "Human Error"

  • One of the main items in the “Problem Management” part of “Systems Management” is to track the call logs to identify any consistency in issues that might point to a need for training. The point is to reduce call volumes by having properly trained personnel.

    Sounds like a fairly simple concept that could be applied to problems (accidents) in any other industry. Maybe even in Information Security??? Too often I see simple security problems going on-and-on and nothing being done about it. For example, bad habits around the choice and use of passwords … but most users still don’t know why they need to bother!

    And the point about “the cost associated with doing so was considered to be prohibitive” … seems like, subconsciously, they did a risk assessment and decided to “accept the risk” (of losing sensitive data) rather than to “mitigate it”.
