Reporting on breaches
It started with Mark Jewell of the AP, “Groups: Record data breaches in 2007.” Dissent responded to that in “Looking at 2007’s data breaches in perspective:”
The following table depicts the number of U.S. incidents reported and the corresponding number of records reported exposed by the three main sites that track such data: Attrition.org, the Privacy Rights Clearinghouse (PRC), and the Identity Theft Resource Center (ITRC).
Then Thomas Claburn wrote “Data Breaches: Getting Worse Or Better?” in InformationWeek:
The year 2007 may or may not have been a record-setting year in terms of data breaches. Whether it was or wasn’t depends on how one counts.
Then Dissent followed up again, in “Second look: What kind of year was 2007 in terms of data breaches?”
Perhaps it would be more conservative to conclude that we simply don’t know whether the total number of incidents rose, fell, or remained the same (because of the lack of a national disclosure law), but with media sources claiming that it was a “record year” in terms of number of incidents, I thought it important to point out where the data do not support that assertion.
…lots of analysis elided
The bottom line is that if we want to make any sense out of data, we need more transparency and mandatory disclosure so that we can get ALL of the numbers on ALL of the incidents.
I’m so eager to jump into this conversation, but have other writing that I need to finish. So go read what Dissent wrote, and I’ll just comment on how excited I am to see the emergence of all of this analysis around breach notices.
Seems like a storm in a teacup to me. We have data breach disclosure, now, and the rest is noise. As scientists we should be able to deal with noisy data, whether it be “most states” or “most comments”….
Thanks, Adam. When you get caught up, I also did some additional analyses on thefts, loss, and hacks in 2007 vs. 2006 that you may want to see.
Iang: yes, we should be able to (and often can) deal with noise. But this was not so much random “noise” as confounds.
So I questioned claims that there were more incidents in 2007 than in 2006, and I also disputed sweeping claims that there were more records exposed in 2007 than in 2006. I provided some analyses to make the point that the headlines were misleading.
As to more data with all the numbers: if what we are getting is a random sample, then we’re OK in terms of analyses and inferences. But how confident are you that what we’re finding out about is really a random sample or a representative sample? I really have a lot of doubts about that, which is why I want to see more data. If policy makers and legislators want to determine where to set the bar on notification, then we need a more accurate estimate of how often there is exposure in various sectors and how often data are misused. We also need to follow up for longer periods, because we’ve already seen incidents where the entities claimed that they did not think there was any real risk of misuse, and yet the data were misused over a year later.