A question of ethics
Various estimates have been made of the quantity of personally identifying information exposed by one mechanism or another. Obviously, though, we only know about what we can see, so seeing more would make such estimates better.
One way to see more would be to look in more places, for example on peer-to-peer file sharing networks.
So here’s the question: would it be ethical (and if so, under what conditions) to deliberately seek out files containing PII that have been made available via P2P networks, in order to better understand how, and to what extent, such information is exposed?
I have an opinion on this question, but I’m very interested in what others think.
Disclaimer: Not an expert, not a lawyer, actually just a student.
I just wanted to point out that people have been doing these kinds of things with Google for a long time:
http://johnny.ihackstuff.com/ghdb.php
I realize there’s a difference between:
a. finding a google hack (a query)
b. executing the query and looking at the results list
and c. clicking on one of the results in the list
I’m going to go out on a limb and say that a, b, and c are all legal and the only one with potential harm (to you) would be c.
Will just searching for filenames you think are > X% likely to have PII in them satisfy your research?
In the situation I had in mind, (a) would never happen — the technique would be to look for files shared via P2P networks, so (as I understand it) the amount of metadata available would be quite limited — file name, size, file type. “The query” would be a search using a P2P client.
In the scenario I posited, “clicking on the results” would mean transferring the files identified during the search and examining them (perhaps via an automated tool) to see if they indeed contained PII.
So, as described (no Google shenanigans, just searching for files using a P2P client, but doing so knowing that there’s a non-trivial likelihood you’ll obtain PII), would this be unethical, and if so, what could you change to make it acceptable?
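On the "automated tool" point, here is a rough sketch of what I have in mind, offered only as an illustration. It assumes the file contents are already available as text, and the pattern names and the `count_pii_hits` function are hypothetical. The patterns are deliberately crude and would produce both false positives and misses. The one design choice that matters for the ethics question is that it reports counts per category and never stores or prints the matched values.

```python
import re
from collections import Counter

# Hypothetical, deliberately crude patterns for illustration only.
# Real PII detection would need validation and far more care.
PII_PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email_like": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def count_pii_hits(text: str) -> Counter:
    """Count matches per pattern; the matched values are never kept."""
    counts = Counter()
    for label, pattern in PII_PATTERNS.items():
        counts[label] = len(pattern.findall(text))
    return counts


# Made-up sample text, not real data.
sample = "Contact: jane@example.com, SSN 123-45-6789; alt 987-65-4321."
print(dict(count_pii_hits(sample)))
```

A tool shaped like this would let a researcher record that a file appears to contain PII, and of what kind, without anyone reading the details.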
A final clarification — I have no plans to do this. The question of whether it would be acceptable ethically arose[1], and I had to consider it at length before making my decision. Now I am second-guessing myself, and would like to hear what others think.
[1] I am being circumspect about how it arose simply to keep irrelevant details from coloring the evaluation people offer as to whether the activity is ethical.
Ethics … what are they?
Ethics tend to follow economics, which follows physics. That is, if it can be done, and there is no cost to it, then it is ethical.
Where we might differ is whether there are costs. I do not see them, myself. If you find the PII, it is already disclosed, and there is a general concept that published info can be used, unless it was disclosed accidentally. In that case there may be a cost if you were to move the data from accidental disclosure to publication.
Then there are other coincidental damages such as being stopped at a border with a laptop stuffed with PII. But this is part of general life.
So, in the sense of coffee or beer talk, I’d err on the positive side: it would not be against ethics to do a search for PII on p2p. Although it would help to establish a purpose and some safeguards, I’m assuming here that this is academically motivated.
Ian:
I have the following priors:
1. There is plenty of PII being shared.
2. It can be found via simple searches.
3. The sharing is unintentional in many cases.
If an academic researcher goes trawling, and he knows to a fair degree of certainty that his actions will yield him unintentionally-disclosed PII, is this ethically problematical? Note that I am assuming the files obtained will need to actually be examined to verify the nature of their contents.
I realize that asking about Right and Wrong is pointless in the abstract, but I am assuming we have enough common ground to make the opinions of professional peers relevant.
I can’t see that looking at data which is available to the public is deeply unethical. How you use it might be.
You might also want to look at analyzing queries rather than data. Are other people doing this? Queries are broadcast in p2p systems, and so seeing how often someone queries for SSN might be interesting in and of itself.
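As a minimal sketch of the query-counting idea, assuming the query strings have already been captured by some means not shown here: the watch terms and the `tally_watch_terms` function are hypothetical, and plain substring matching will over-count (for example, "ssn" inside a longer word).

```python
from collections import Counter

# Hypothetical terms of interest; a real study would choose these carefully.
WATCH_TERMS = ("ssn", "social security", "passport", "tax return")


def tally_watch_terms(queries):
    """Count how many observed queries mention each watch term."""
    counts = Counter()
    for query in queries:
        lowered = query.lower()
        for term in WATCH_TERMS:
            if term in lowered:
                counts[term] += 1
    return counts


# Made-up example queries, not real observations.
observed = ["SSN list.xls", "tax return 2006", "holiday photos", "my ssn card"]
print(tally_watch_terms(observed))
```

This only ever touches what people asked for, not any file contents.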
Good question, Chris.
In your scenario, what, if anything, would the researcher do if PII were discovered and appeared to have come from a government database or some national security database (just to make things more challenging)? Would there be any notification? Would any data opened/transferred to verify that it was PII be securely deleted from the researcher’s drive?
And if the file being shared that you now acquired was illegally obtained by the individual who shared it, have you now just become an accessory after the fact?
But all that said, no, I don’t really think it’s unethical to do what you asked about. You’re not talking about hacking but about availing yourself of files that the individual made available. How are you to know whether they intended to make them available or not?
So… ethical but possibly illegal? 🙂
@Adam:
If I do not acquire it, I cannot make things any worse (as, for example, I might if the machine I am using to harvest files is poorly protected, stolen, made part of a botnet, etc.). I agree that if I obtain it I can easily use it evilly. The question is whether obtaining it is, even where my motives are entirely laudable and pure, wrong.
I agree that analyzing queries alone would be useful and perfectly legitimate.
@Dissent:
The “accessory after the fact” angle occurred to me, and indeed for me the entire issue here is whether the files can reasonably be thought to have been shared voluntarily. To me it is blatantly obvious that in many cases they will not have been, but is the “presumption of voluntary disclosure” enough to say harvesting them is kosher?
Arguing the other way — if I received an email message addressed to me, but which *obviously* was intended for someone else whose email address was very similar to mine, and which contained PII, I would not feel it at all unethical to use the email as data for a research project.
Gahh. I just confused myself more…
If I were to pursue this in an academic setting I’d need to run it by some sort of review panel!
“To me it is blatantly obvious that in many cases they will not have been, but is the “presumption of voluntary disclosure” enough to say harvesting them is kosher?”
Let’s go back to what Dan said. How is this really any different than an entity who does not secure their server properly and allows Google to index and cache the data? As a researcher looking for PII, you’d enter some search string, and you might see a link in Google, but you’d have to open the file to determine if there were PII in it. Would you argue that inspecting or downloading such files is “unethical” if you have reason to suspect that their exposure was unintentional?
Isn’t it more the case that it is what you do with the data after you determine what’s in it that might be unethical (cf. my complaints about SSNBreach.org)? So how is your scenario really any different? Am I missing something here?
Compare to the various efforts to read data from disks after disposal. Provided you make no improper use of the data, and the summary you publish covers only the type and scale of what you found, with no recognisable examples, I wouldn’t have a problem with it.
http://bible.cc/romans/13-10.htm