Shostack + Friends Blog Archive



So a while back, I posted the following to Twitter:

Thought of the Day: We don’t need to share raw data if we can share meta-data generated using uniform analytical methodologies.

Adam disagreed:

@mortman You can’t test & refine models without raw data, & you can’t ask people with the same orientation to bring diverse perspectives.

We went back and forth a bit until it became clear that this needed an actual blog post, so here it is:

I don’t disagree with Adam that we need raw data. He’s absolutely right that without it, you can’t test models. What I was trying to get at was that, even though I would absolutely love to have access to more raw data to test my own theories, it just isn’t realistic to expect that sort of access in the legal and business environment we have today. So until things change, we have to figure out another way to get at the data.

One thing that has become increasingly popular is for vendors to publish aggregate data about what they’ve seen with their customers or on their networks. Verizon and WhiteHat have used this model to great effect. Not only has it generated a lot of press for them, but we as an industry have learned a lot from these reports.

What would be even better is if people would share the models they are using when generating their data. This way, other organizations could use the models and as reports were published, the rest of us could actually compare apples to apples. This would also allow us to more quickly identify issues/errors in the models, allow for public discussion of necessary tweaks and then test said changes while limiting liability for the data owners.
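To make the apples-to-apples idea concrete, here is a minimal sketch of what a shared model could look like. Everything in it is an assumption for illustration: the function name `summarize_incidents`, the record fields `source` and `days_to_detect`, and the numbers themselves are made up and do not come from any real report. The point is only that each organization runs the same published method over its own private raw data and shares just the output.

```python
from statistics import mean, median

def summarize_incidents(incidents):
    """Hypothetical shared model: reduce raw incident records to aggregates."""
    days = [i["days_to_detect"] for i in incidents]
    external = sum(1 for i in incidents if i["source"] == "external")
    return {
        "incident_count": len(incidents),
        "mean_days_to_detect": mean(days),
        "median_days_to_detect": median(days),
        "external_share": external / len(incidents),
    }

# Made-up raw records; in practice these never leave each organization.
org_a_raw = [
    {"source": "external", "days_to_detect": 10},
    {"source": "internal", "days_to_detect": 30},
    {"source": "external", "days_to_detect": 20},
]
org_b_raw = [
    {"source": "internal", "days_to_detect": 5},
    {"source": "external", "days_to_detect": 15},
]

# Only these aggregate reports are published and compared.
report_a = summarize_incidents(org_a_raw)
report_b = summarize_incidents(org_b_raw)

for name, report in (("A", report_a), ("B", report_b)):
    print(name, report)
```

Because both reports come out of the same function, their fields mean the same thing, and anyone who spots a flaw in the method can propose a fix to the function without ever seeing the raw records.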

This is really where I was going with my initial thought above: we need common models so we can have an intelligent discussion. This is also how things generally work in the sciences (yes, Alex, I know, we’re not a science yet :). Researchers almost never publish their raw data, just their models, methods and results. I feel strongly that until we can convince people to share raw data more openly, this is our best shot at figuring out real information about what’s going on in the security world. It’s also what drove me to start developing the soon-to-be-renamed Mortman/Hutton Model that Alex and I presented at Blackhat and BSides Las Vegas.

More data, even if it’s aggregate, is better than no data.

6 comments on "Meta-Data?"

  • Jon Robinson says:

    Have you guys listed your data demands somewhere? If not, what are they? What type of data do you require to tell the future? Once you have the data and you use it to fit a model, how do you solve the problem of induction? There is no reason the future will be anything like that which the data you collect indicates. (Or maybe there is, but this is what has been on my mind lately.)

    • Lots of great questions. We should totally document our demands. Induction is a problem unless it’s for cooking and you have magnetic pans, in which case it’s great. More seriously, we deal with this issue all the time in real life without too much trouble. Look at any stock or mutual fund prospectus. I’m becoming more and more convinced that modeling in general and risk mgmt in particular isn’t about predicting the future but rather about better informing on the current state of being. This is really the stuff of a proper post though…

      • Jon Robinson says:

        Ya, they say, “past performance is not guarantee…” acknowledging the induction problem, but we as consumers look at the upward sloping chart and assume it will always be so. I think that is poor decision making.

        Please do post more about your thoughts on the modeling to inform on the current state…

        About data requirements: I ask that because I always notice demands for data sharing. I’m just wondering if you plan on just looking at the data for patterns and assuming those will always hold (I don’t think that would be a good idea) or if you have started with some a priori axioms that dictate which type of data would yield the knowledge you seek. I guess I’m wondering what your methodology/epistemology is.

    • Adam says:

      My desire for data is pretty simple: I’d like to know what went wrong and what was being done to try to prevent things from going wrong.

      Note I said simple, not easy.

  • Russell says:

    At the risk of appearing to be a nitpicker, there is a significant difference between the terms “aggregate data” and “metadata”.

    Aggregate data are summary-level statistics (averages, sums, maximum, range, etc.) on the raw (source) data. All aggregates come from some calculation or information processing procedures (a.k.a. “models”).

    In contrast, Metadata is data *about* the raw data regarding its context. Typically this includes tags such as “data type”, “format”, “confidentiality”, “date published”, “owner”, and so on. Metadata helps you make use of the data, to avoid misuse, and to maintain it.

    It seems like David Mortman’s post is focused on “aggregate data” vs. “raw data”. I would second his motion for more transparent models.
