Shostack + Friends Blog Archive


Breach Datasource Design Criteria

Most readers of these words are probably familiar with at least one of the lists of data breaches commonly referenced in the media and in specialized blogs.  Among these are Attrition’s Dataloss and the Breach Chronology.  The ID Theft Center also maintains a list (available, it seems, only as a PDF), and various academic researchers have also compiled breach datasets.  Clearly, there is interest in having a handy source of current information about such incidents.  This post is my take on what such a source should contain, and what sorts of activities it should allow.  I want to state right away, and emphatically, that these remarks are not a criticism of any existing list or database.  Indeed, those resources are the shoulders of giants on which I think we all can stand.  Seriously.

First, the datasource should not be a simple list, oriented toward on-screen viewing or printing in its entirety.  It should be a database.  I am not speaking of its physical representation or storage format, but of the kinds of uses it should be amenable to.  I’m saying that it should contain a rigidly-enforced set of data elements, and allow random access to records based on criteria pertaining to those elements.  That can be done with flat files, with a full-blown relational database, with a spreadsheet, or any number of other things.
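
As one possible illustration of "rigidly-enforced data elements" plus access by criteria, here is a minimal sketch using Python's built-in sqlite3 module.  The table name, column names, and sample rows are all my own assumptions for the example, not the layout of any existing breach list, and a flat file or spreadsheet could enforce the same rules in other ways.

```python
import sqlite3

# Hypothetical schema: the column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE breach (
    breach_id  INTEGER PRIMARY KEY,
    date_known TEXT NOT NULL
        CHECK (date_known GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
    company    TEXT NOT NULL,
    industry   TEXT NOT NULL,
    records    INTEGER CHECK (records IS NULL OR records >= 0)
)
"""

def open_store():
    """Create an in-memory database whose schema rejects malformed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

def add_breach(conn, date_known, company, industry, records):
    """Insert one record; sqlite3.IntegrityError is raised if a rule is violated."""
    conn.execute(
        "INSERT INTO breach (date_known, company, industry, records) "
        "VALUES (?, ?, ?, ?)",
        (date_known, company, industry, records),
    )

def find_by_industry(conn, industry):
    """Random access by a criterion: all breaches in one industry, oldest first."""
    cur = conn.execute(
        "SELECT company, date_known, records FROM breach "
        "WHERE industry = ? ORDER BY date_known",
        (industry,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    store = open_store()
    add_breach(store, "2007-01-17", "Example Retailer", "retail", 1000)  # made-up row
    print(find_by_industry(store, "retail"))
```

The point of the CHECK and NOT NULL constraints is that a badly formed date or a negative record count is refused when it is entered, rather than discovered later by whoever analyzes the data.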

Second, this datasource should have a web interface.  This interface should make easy the types of arbitrary queries referred to in the preceding paragraph.  For example, it should be possible for a user to obtain a listing of all breaches within a specified date range, or within a certain industry. The result sets obtained through these arbitrary queries should be made available for download in as universal a format as possible, for example as a CSV file, and should also be viewable on-line if their size permits.
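
The query-then-download idea can be sketched in a few lines of Python.  This is only a rough illustration of the logic a web interface might sit on top of: the field names and the sample records are invented for the example, and a real service would read from its own store rather than from a list in memory.

```python
import csv
import io
from datetime import date

# Made-up records; the field names are assumptions for illustration only.
BREACHES = [
    {"date_known": date(2005, 11, 3), "company": "Example Outlet", "industry": "retail", "records": 800},
    {"date_known": date(2006, 5, 22), "company": "Example Agency", "industry": "government", "records": 2000},
    {"date_known": date(2007, 1, 17), "company": "Example Retailer", "industry": "retail", "records": 45000},
    {"date_known": date(2007, 3, 2), "company": "Example University", "industry": "education", "records": 300},
]

FIELDS = ["date_known", "company", "industry", "records"]

def query(breaches, start=None, end=None, industry=None):
    """Return the breaches that satisfy every criterion that was supplied."""
    results = []
    for b in breaches:
        if start is not None and b["date_known"] < start:
            continue
        if end is not None and b["date_known"] > end:
            continue
        if industry is not None and b["industry"] != industry:
            continue
        results.append(b)
    return results

def to_csv(rows):
    """Serialize a result set as CSV text, with dates in ISO format."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "date_known": row["date_known"].isoformat()})
    return buf.getvalue()

if __name__ == "__main__":
    hits = query(BREACHES, start=date(2007, 1, 1), end=date(2007, 12, 31), industry="retail")
    print(to_csv(hits))
```

Because each criterion is optional, the same function serves a date-range query, an industry query, or both combined, and the CSV text can be offered as a download or rendered on-line when the result set is small.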

Third, the datasource should contain a wider range of data pertaining to each breach than is currently available.  Today’s datasources typically contain variables such as these:

  • Date breach became known
  • Affected Company
  • Industry of Company
  • Number of records disclosed
  • Type(s) of information revealed
  • Names of third parties involved in the breach
  • Reference to a media account of the incident
  • General Comment

These, I would say, are “the basics” and are fine for many purposes.  However, deeper understanding requires more information.  Specifically, as my fourth point, I would like to see:

  • Full address information for the Affected Company
  • Company stock symbol and exchange (where relevant)
  • Indicator of company size, such as number of employees or gross revenue
  • Dates for the discovery of the breach, the occurrence of the breach, and the reporting of the breach to “victims”, law enforcement, and regulators.
  • The proximate cause of the breach
  • Whether notice to victims took place
  • Whether this notice was required by law
  • The minimum and maximum number of affected individuals, where a precise count is not known.

All of these variables are easily represented in text form, although some discipline is needed in coding the breach cause; there are many ways of saying “web server config error”, after all.  For the same reason, maintainers of this ideal datasource should publish their coding guidelines, so that users of the datasource understand just how a “web server config error” is determined to be a breach cause, rather than (say) “improper firewall ruleset”.
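
One way to impose that discipline is a controlled vocabulary with a published mapping from free-text phrasings to codes.  The sketch below is hypothetical throughout: the two cause codes come from the examples in the paragraph above, while the synonym list and the "uncoded" fallback are assumptions I have made up to show the mechanism.

```python
from enum import Enum

class BreachCause(Enum):
    """Controlled vocabulary of breach causes (illustrative, not exhaustive)."""
    WEB_SERVER_CONFIG_ERROR = "web server config error"
    IMPROPER_FIREWALL_RULESET = "improper firewall ruleset"
    UNCODED = "uncoded"  # assumed fallback for text the guideline does not cover

# Hypothetical coding guideline: each known phrasing maps to exactly one code.
# Publishing this table is what lets users see how a cause was assigned.
SYNONYMS = {
    "web server config error": BreachCause.WEB_SERVER_CONFIG_ERROR,
    "web server misconfiguration": BreachCause.WEB_SERVER_CONFIG_ERROR,
    "misconfigured web server": BreachCause.WEB_SERVER_CONFIG_ERROR,
    "improper firewall ruleset": BreachCause.IMPROPER_FIREWALL_RULESET,
    "bad firewall rules": BreachCause.IMPROPER_FIREWALL_RULESET,
}

def code_cause(free_text):
    """Normalize case and whitespace, then look the phrase up in the guideline."""
    key = " ".join(free_text.lower().split())
    return SYNONYMS.get(key, BreachCause.UNCODED)

if __name__ == "__main__":
    print(code_cause("Misconfigured  Web Server"))
```

Anything that falls through to the fallback code is a signal that the guideline needs a new entry, which is itself a decision worth recording publicly.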

Fifth, the datasource should be easy to extend and link to other sources of information about an incident.  The DLDOS datasource has begun using a unique identifier for each of its records, which allows others (of whom I am one) to link related information.  Since, unfortunately, much of the work in amassing the data on breaches has been a part-time thing conducted by amateurs, it may be important to allow others to “connect” and reconcile different ways the same event has been recorded.
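
To show why a stable identifier matters, here is a small sketch of linking outside material to primary records by ID.  The identifier format, the field names ("id", "ref_id"), and the sample entries are all invented for the example; I am not describing how DLDOS or anyone else actually stores or numbers records.

```python
def link_by_id(primary, *other_sources):
    """Attach items from other sources to the primary record they reference.

    Returns (linked, unmatched): linked maps each primary id to its record
    and related items; unmatched holds items whose ref_id is not known.
    """
    linked = {rec["id"]: {"record": rec, "related": []} for rec in primary}
    unmatched = []
    for source in other_sources:
        for item in source:
            entry = linked.get(item["ref_id"])
            if entry is None:
                unmatched.append(item)  # needs manual reconciliation
            else:
                entry["related"].append(item)
    return linked, unmatched

if __name__ == "__main__":
    # Made-up primary records and contributed material.
    primary = [
        {"id": "B-0001", "company": "Example Retailer"},
        {"id": "B-0002", "company": "Example Agency"},
    ]
    letters = [{"ref_id": "B-0001", "kind": "notification letter"}]
    press = [
        {"ref_id": "B-0001", "kind": "media account"},
        {"ref_id": "B-9999", "kind": "media account"},
    ]
    linked, unmatched = link_by_id(primary, letters, press)
    print(len(linked["B-0001"]["related"]), len(unmatched))
```

The unmatched list is where the reconciliation work lives: an item that points at an unknown identifier may describe a new incident, or the same incident recorded differently elsewhere.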

Sixth, and related to the preceding, the datasource should allow for easy linkage of non-textual data.  My personal favorite example would be the notification letters and paperwork I have obtained from New York.  As interest in this subject grows, particularly among academics who can do this full-time (or more — grad students don’t work 9-5), notification letters contributed by their recipients may be sanitized and added.  For some breaches, I can even see Wiki-like possibilities.

Last, the datasource should be free, as in beer.  We can argue over which license is best in another post :^).

At any rate, those are some quick thoughts.  I’ve tried to instantiate them, but have been hampered by my lack of skill as a DBA and data-modeller, and my lack of time.  If others are interested in this subject, as I think the example of Attrition easily demonstrates, we may all benefit from cooperating openly.  I’m interested in reactions, criticisms, suggestions, and even flames.  Let’s hear them.

3 comments on "Breach Datasource Design Criteria"

  • Lyger says:

    All excellent points, and I’m glad you posted your “quick” thoughts. A couple of mine:
    Regarding point 2: While convenient, web interfaces shouldn’t be a standard. Yes, while point-and-click GUI goodness may be desirable, it’s the data itself that poses the biggest issue, not the file format or protocol in which it’s presented. Attrition considered a web interface at one point but decided that the security risks weren’t worth the hassle.
    Regarding point 6: additional information would certainly be welcome. As you mentioned, this would be best done by a group who could dedicate a full-time effort. As you and I both know, gathering information, expanding categories, and backfilling into a dataset almost six years old can sometimes take hours a day.
    Regarding point last: Free. And sushi.
    Would type more, but another breach probably just happened. AFK 60.. 😉

  • Chris says:

    “backfilling into a dataset almost six years old can sometimes take hours a day.”
    True dat. I want NY to make version 2.0 of their reporting document an NCS form. This data entry is an embarrassingly parallelizable task, people! :^)

  • Lyger says:

    “This data entry is an embarrassingly parallelizable task, people! :^)”
    Sometimes literally. Where’s my chiropractor?
