Reproducibility, sharing, and data sensitivity

What made this particular work different was that the packets we captured came through a Tor node. Because of this difference, we took extreme caution in managing these traces and have not and will not plan to share them with other researchers.

Response to Tor Study
I won’t get into parsing what “have not and will not plan to share” means, and will simply assume it means “haven’t shared, and will not share”. So, what we have here are data that are not personally identifying, but are sensitive enough that they cannot be shared, ever, with any other researchers.
What is it about the traces that makes them sensitive, then?
Given this policy, how can this work be replicated? How can it be checked for error, if the data are not shared with anyone?

Bonus rant, unrelated to the Tor paper

I am growing increasingly perturbed at the hoarding of data by those who, as scientific researchers, are presumably interested in the free flow of information and the increase of knowledge for the betterment of humanity.
Undoubtedly, not everyone who keeps this information to themselves does so out of a base motive such as cranking out as many papers as possible before giving anyone else (especially anyone who might be gunning for that same tenure-track position) a shot, but some who play this game no doubt do. It’s unseemly and ultimately counterproductive.
It’s funny — the infosec community just went through an episode where a respected researcher said, in effect, “trust me — I found something important, but I can’t give you the information to verify my claim, lest it be misused by others less noble than we”, and various luminaries took it to be a sign of lingering institutional immaturity. Perhaps, as EE/CS becomes increasingly cross-pollinated with the likes of sociology, psychology, law, and economics, the same observation will hold. If so, we should see it coming and do the right things. This is one they teach in pre-school: “Sharing is Caring”.
As an example of what could be done, consider this and this.

5 comments on "Reproducibility, sharing, and data sensitivity"

  • Dean Loomis says:

    Of course the whole point of TOR is that nobody can trace traffic back to its origin, not even the developers (modulo Thompson Trojans) or the operators of any given node. So TOR traffic is already pre-sanitized with all identifying information removed. In spite of McCoy et al’s claims that they have contributed to the advancement of security and privacy in TOR, I’m afraid their statements show that they really don’t know what they’re doing.
    One of the key aspects that distinguishes a professional cryptographer from an amateur is that the pro knows when to stop. An amateur will just keep piling obscuring feature on top of feature willy-nilly. For example, a document encrypted four times successively with 3DES*AES*RC5*twofish isn’t significantly more secure than the same document encrypted with any one of them.
    In the case of TOR studies, the authors may be subject to liability claims from downstream targets of hacking attacks that exited their node. They do report that they received numerous takedown notices and other inquiries about non-benign activity that originated upstream from them in the TOR cloud. The fact that this was a research project in a university doesn’t immunize them against subpoenas or other legal action, regardless of their wishes to keep their data private. Unlike journalists, researchers don’t have the legal precedents that allow them to stand on the first amendment, and even journalists occasionally go to jail for refusing to reveal their sources.
    But they didn’t cite the fifth amendment as a reason for keeping their data secret, showing again that they haven’t fully thought through the reasons for their actions.

  • Dan Weber says:

    “trust me — I found something important, but I can’t give you the information to verify my claim, lest it be misused by others less noble than we”,

    You left out one very important word: yet. We couldn’t reproduce his work immediately, but we only had to wait 30 days before we could.
    Even traditional research papers sometimes don’t release all their data right away, and for very good reasons. The hilarious Lenski dialog with Conservapedia lists some very good and legitimate reasons that labs don’t necessarily give absolutely everything to just anyone who asks. Search the page for the word ‘scoop.’

  • Chris says:

    @Dan:
    You’re right about “yet”. I focused on the ‘trust me’ part, because that is what Kaminsky is saying, if only for a month. The example is a timely one, but you are correct that in the grand scheme of things a month is no biggie.
    When I wrote my rant, I had in mind instances where the amount of labor was on the order of 0.1–0.25 grad student-years. In other words, not much. The Lenski example is not in the same category, but you are quite right that it raised good points. I will simply say, as I did in my post, that this infosec/sociology/econ/whatever-it-is hybrid we are dealing with is not an established academic discipline, and consequently one should not be required to show one’s doctorate and academic library card in order to get access to the data. I do think I acknowledged the “scoop” angle.
    Maybe some performance art is in order. I’ll put together a dataset, declare a compilation copyright on it, and license it to all and sundry as long as any work which uses it contains at least one example of pirate-speak. Arrrr!

  • David Molnar says:

    Chris, in the first part of your post, when you ask
    So, what we have here are data that are not personally identifying, but are sensitive enough that they cannot be shared, ever, with any other researchers. What is it about the traces that makes them sensitive, then?

    it sounds like you’re drawing a dichotomy between “personally identifying information” and “not sensitive.” That’s a false dichotomy, as the AOL data release and the Netflix data set, among other things, show — we have examples where post-processing of “anonymized” data, together with publicly available other data, can in some cases reveal things which are embarrassing or even personally identifying. What’s more, the specific post-processing techniques were not developed at the time the data was released. So even if the _original data collected is benign_, what could be done with the data given enough analysis _may not_ be benign.
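    To make the re-identification concern concrete, here is a minimal, purely hypothetical Python sketch (the records, names, and columns are invented for illustration and are not drawn from the AOL or Netflix incidents) of a linkage attack: a “de-identified” release is joined against public records on quasi-identifiers, and any unique match re-identifies the row.

        # Hypothetical linkage attack: join a "de-identified" release against
        # public records on quasi-identifiers (zip code, birth year, sex).
        # All records below are invented for illustration.
        anonymized_release = [
            # (zip, birth_year, sex, sensitive_attribute)
            ("98107", 1971, "F", "visited-site-X"),
            ("02139", 1984, "M", "visited-site-Y"),
        ]
        public_records = [
            # (name, zip, birth_year, sex), e.g. a voter roll or a public profile
            ("Alice Example", "98107", 1971, "F"),
            ("Bob Example", "02139", 1984, "M"),
        ]

        def link(release, public):
            """Yield (name, sensitive) pairs for rows with exactly one public match."""
            for zip_code, year, sex, sensitive in release:
                matches = [name for (name, z, y, s) in public
                           if (z, y, s) == (zip_code, year, sex)]
                if len(matches) == 1:  # unique match => re-identified
                    yield matches[0], sensitive

        for name, sensitive in link(anonymized_release, public_records):
            print(name, "->", sensitive)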
    Given such a situation, where new techniques can target “anonymized” data, a conservative position to take would be to avoid releasing any data that depends on private information, unless you can show that such post-processing is not possible. Doing such an analysis, even figuring out what that should *mean*, is an active area of research — see for example what Cynthia Dwork and her collaborators are doing at Microsoft Research.
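    One concrete instance of that research direction is differential privacy. As a purely illustrative sketch of the idea (my own, not code from any of the researchers or papers mentioned here), the basic Laplace mechanism answers an aggregate query with noise calibrated to the query’s sensitivity and a privacy parameter epsilon, instead of releasing the underlying records:

        import math
        import random

        def laplace_noise(scale):
            """Sample from a Laplace(0, scale) distribution via inverse-CDF sampling."""
            u = random.random() - 0.5
            return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

        def noisy_count(records, predicate, epsilon):
            """Release a count query; a count has sensitivity 1, so scale = 1/epsilon."""
            true_count = sum(1 for r in records if predicate(r))
            return true_count + laplace_noise(1.0 / epsilon)

        # Hypothetical usage: report how many (invented) trace records match a
        # property without exposing the records themselves.
        records = [{"port": 80}, {"port": 443}, {"port": 80}]
        print(noisy_count(records, lambda r: r["port"] == 80, epsilon=0.5))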
    Yes, such a position does seriously impede the ability of others to replicate your work. Other groups have to re-do the same study and obtain their own data, then see if their findings agree with yours. That’s far from ideal. Unfortunately this looks like a balancing act where researchers have to make some tradeoff between allowing others to replicate their work and the privacy impact of releasing the data.
    I should note that I do not know what is in the Tor data set or if the authors feel this way. I’m not an author on the paper and I don’t speak for them. I’m just pointing out that data release issues here are more complicated than your comments quoted above suggest.
    There’s also a larger issue about how making the data available is just a start, and how it’s a lot of work which is currently pretty thankless in EECS, but that’s a separate comment.

  • David Molnar says:

    Two points about replication and data availability —
    1) Making the data for your study available is an excellent first step, but the gap between the data and the graphs in a paper can be surprisingly large. This past fall, I took a class whose final project was “pick a journal paper, then critique its use of statistics.” My partner and I intentionally picked a paper whose author had made its data available, but we then had to invest several days of work to make our analysis match what was reported. When we told the prof about this, he then informed us that the fact we could replicate at all was a pleasant surprise – many people in the past could not do the same with their papers, even given data published by the paper’s authors!
    The reasons for non-replication even with data are varied – ranging from different versions of statistical software to “data cleaning” procedures that aren’t well documented. Margo Seltzer’s group at Harvard actually has a project trying to address some of these issues:
    http://www.eecs.harvard.edu/~syrah/pass/
    2) Still, at least in economics, political science, and biology it does seem to be an emerging norm that data accompanies papers. I have seen less of this in EECS. I’ve heard several anecdotes in the computer systems community that people have tried to push for greater data availability, but that several things get in the way and so it hasn’t broadly happened yet. Speaking only for myself, here are a few of the issues I’ve seen:
    * Documenting data is highly time-consuming, necessary if anyone other than you is going to use it, and not really rewarded by the current system of conference reviews. As a reviewer, the fact the data is available is “nice”, but I rarely have time to actually try to replicate the results during the review process. Documenting everything you did to the data is even more time-consuming, and sometimes you don’t remember everything…
    * It’s hard to know if you did a good job documenting the data until someone tries to actually replicate the analysis. You can’t do it, since you don’t have time and you know too much about the analysis anyway. Other people won’t do it, since it doesn’t advance their research.
    * The data set may be proprietary. For example, industrial research comparing the performance of an internally developed web app to the performance of SPEC benchmark programs may not be able to name the web app or describe its access patterns, because that could reveal interesting information to a competitor (or ammo for a marketing department). There are also privacy issues with things like clickstreams or user study data, or disclosure issues with things like bugs found by a bug-finding tool.
    In the case of papers which report on the performance of an idea embodied in software, there’s also of course the issue that the software itself may be proprietary. It may not be possible to feasibly replicate the study without access to that software. Do we want to lose the reports which come out of studying such systems? I personally don’t think so, but we could disagree.
    * The code you wrote to gather and clean up the data may be an embarrassing hack. You decide you want to hold things back until you clean it up, then you never get around to it.
    Purposely erecting a barrier to entry for others may be a reason, of course, but I haven’t seen it much. Most everyone I’ve ever talked to has had a “yes, nice idea, but…” attitude towards data sharing.
    That being said, I do see a trend towards people making data and code available. Margo Seltzer, again, has released multiple data sets over the years. For another example, the Asbestos project makes their code available via anonymous CVS, which then compiles into a nifty virtual machine image which can be executed by QEMU.
