Sharing Research Data
I wanted to share an article from the November issue of the Public Library of Science, both because it’s interesting reading and because of what it tells us about the state of security research. The paper is “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.” I’ll quote the full abstract, and encourage you to read the entire six-page paper.
Background
The widespread reluctance to share published research data is often hypothesized to be due to the authors’ fear that reanalysis may expose errors in their work or may produce conclusions that contradict their own. However, these hypotheses have not previously been studied systematically.
Methods and Findings
We related the reluctance to share research data for reanalysis to 1148 statistically significant results reported in 49 papers published in two major psychology journals. We found the reluctance to share data to be associated with weaker evidence (against the null hypothesis of no effect) and a higher prevalence of apparent errors in the reporting of statistical results. The unwillingness to share data was particularly clear when reporting errors had a bearing on statistical significance.
Conclusions
Our findings on the basis of psychological papers suggest that statistical results are particularly hard to verify when reanalysis is more likely to lead to contrasting conclusions. This highlights the importance of establishing mandatory data archiving policies.
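In case “apparent errors in the reporting of statistical results” sounds abstract: the basic check is whether the p-value a paper reports is consistent with the test statistic and degrees of freedom reported alongside it. Here is a minimal sketch of that kind of recomputation in Python; the helper name and the numbers are mine for illustration, not the authors’ code or data.

```python
# Sketch of a reported-result consistency check: recompute the p-value
# implied by a reported test statistic and compare it to the reported p.
# (Illustrative only; not the method or code from the paper.)
from scipy import stats

def check_reported_p(t_value, df, reported_p, tol=0.01):
    """Return (recomputed_p, consistent?) for a reported two-tailed t-test."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    return recomputed_p, abs(recomputed_p - reported_p) <= tol

# Hypothetical example: a paper reports t(28) = 2.20, p < .01.
# Recomputation gives p of roughly .036 -- still significant at .05, but not
# at the reported level, i.e. a reporting error that bears on significance.
p, ok = check_reported_p(t_value=2.20, df=28, reported_p=0.01)
print(f"recomputed p = {p:.3f}, consistent with report: {ok}")
```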
Although the research was done on papers published in psychology journals, it can teach us a great deal about the state of security research.
First, the full paper is available for free online. Compare and contrast with too many venues in information security.
Second, the paper considers and tests alternative hypotheses:
Although our results are consistent with the notion that the reluctance to share data is generated by the author’s fear that reanalysis will expose errors and lead to opposing views on the results, our results are correlational in nature and so they are open to alternative interpretations. Although the two groups of papers are similar in terms of research fields and designs, it is possible that they differ in other regards. Notably, statistically rigorous researchers may archive their data better and may be more attentive towards statistical power than less statistically rigorous researchers. If so, more statistically rigorous researchers will more promptly share their data, conduct more powerful tests, and so report lower p-values. However, a check of the cell sizes in both categories of papers (see Text S2) did not suggest that statistical power was systematically higher in studies from which data were shared. [Ed: “Text S2” is supplemental data considering the discarded hypothesis.]
But most important, what does it say about the quality of the data we so avariciously hoard in information security? Could it have something to do with higher prevalence of apparent errors?
Probably not. It might surprise you to hear me saying that, but hear me out. We almost never have hypotheses to test, and so our ability to perform statistical reanalysis is almost irrelevant. We’re much more fond of saying things like “It calls the same DLLs as Stuxnet, so it’s clearly also by the Israelis.” Actually, there are several implied hypotheses in there:
- No code by different authors calls the same DLL
- No code calls any undocumented APIs
- Stuxnet DLLs are not documented
Stuxnet being written by the Israelis is clearly not a hypothesis, but a fact, as documented by Nostradamus.
More seriously, read the paper, see how good science is done, and ask if anyone is holding us back but ourselves.
Thanks to Cormac Herley for the pointer.
Science is based on a few related core concepts: peer review, research transparency, and embracing criticism. The hard sciences certainly don’t have a bulletproof methodology (closed for-profit journals being the most glaring failure, in my opinion), but by and large the system really is designed to continually revise understanding towards a more correct understanding. Labs that will not open up their data and are not transparent tend to be regarded as suspect by their peers, and labs that disregard criticism also generally have poor reputations (the XMRV scandal being the most public current example).
In many ways that is the reason the hard sciences scorn the social sciences, as those disciplines are much less rigorous in their processes for weeding out and refining ideas (also, because collecting data via self-reported surveys is worthless). Sadly, I would say that security research hasn’t even met the low bar set by the social sciences: the majority of security “research” is self-published and doesn’t even go through the more generous peer review process of the social sciences, much less the fairly rigorous processes employed by most hard science journals.
I think this lack of strong peer review and research transparency exists for several reasons. The majority of research is being done not by academia but by for-profit security companies, which consider their research proprietary; but even if they didn’t, it isn’t clear that the security community has set up the mechanisms for proper review and comment. If Ponemon actually wanted to vet a paper before publishing it, how would they go about engaging the community to do so, and how would they convey to the world upon publication that such a review had happened? I think at the moment the answer would be through a very ad hoc engagement with other security professionals relevant to the subject (though using Ponemon breach cost reports as a specific example: are security professionals even the relevant reviewers? I would think financial analysts would be a better authority).
The trick, I think, is to slowly convince folks like Ponemon, NSS Labs, Verizon, etc. that their proprietary reports have a lot more value if they have been peer reviewed prior to publication. In my ideal world the peer review would be akin to that employed by PLoS rather than Science or Nature, with the review, comments, and responses available for public viewing. I believe that this transparent approach to science ultimately generates more reliable results in the long run.
I just hope that eventually our field will get to the point where the default mentality is “Hey everyone, please tell me what is wrong with this so I can refine my approach to produce increasingly accurate and reliable results” rather than the current “Here are my results, trust them because I say so.”