Shostack + Friends Blog Archive

 

Primary Colors, Author Unknown

In discussing private blogging at , the idea of identifying bloggers by their writing style kept coming up. The example that was used (at least) twice was the “computerized” identification of the anonymous author of Primary Colors. The trouble is, the identification wasn’t done by computer. It was done by Vassar English Professor Don Foster.

Foster has written a book about his (manual) methods, which is a short and entertaining read.

I think it would be great if someone were to write and release code that identifies authors of postings. Textual analysis is powerful because it doesn’t rely on records that may or may not be present. Contrast that with, say, Tor. If you are broadly monitoring the internet, and watching who is using Tor, you can learn a lot fairly quickly. I’m not saying you can learn a lot easily, only quickly once you make the effort. And it takes a lot of effort: if you’re using subpoenas, it’s fairly obvious to some people what you’re doing. If you tap cables, it may be less obvious, but there are physical taps. Now think about textual analysis. You don’t need to be monitoring a network and see packets go by to use it. The information you need can be gathered with a web browser, after the fact.

You might not even need a web browser, just a good library. “Does Publius use the same words and sentence structures as Hamilton?” The authorship of each of the Federalist Papers is generally attributed to one of Madison, Jay or Hamilton. There is general agreement, as this site says: “We have followed the consensus of scholars on attribution of each paper to its primary author, James Madison, John Jay, or Alexander Hamilton…”

Deciding if your sentences, word choices and order, use of subordinate phrases, etc are sufficiently unique to identify you is hard.

Deciding if your word choices and order, use of subordinate phrases, sentence structure, et cetera, are sufficiently unique to identify you is hard.

Understanding if your vocabulary and sentence structure is weird enough to nail you is hard.

As far as I know, no one has released code that measures these things. Having code would help frame a discussion of these things. Perhaps we could learn to measure some (if not all) of the personality in writing. We could test how well something like translating your text into another language and back defeats that code. We could look for measures that survive the testing. And we could help folks who want their authorship to remain unknown.
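To make "measuring these things" concrete, here is a minimal sketch of one classic stylometric measure: how often a writer uses common function words. The word list, the function names, and the choice of cosine similarity are all assumptions made for illustration; this is not Don Foster's method, and a serious tool would use a much longer word list and much more text.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short list of English function words.
# Real studies use far longer lists.
FUNCTION_WORDS = (
    "the", "of", "and", "to", "a", "in", "that", "is",
    "it", "on", "by", "upon", "while", "whilst",
)


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(p, q):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0
```

You would compute `profile()` for the anonymous text and for known samples from each candidate author, then compare them with `cosine_similarity()`. The same two functions give a crude way to run the translation test: profile a text before and after a round trip through another language and see how far the profile moves.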

For more on Don Foster’s book, CNN has a review, “Don Foster enlightens readers with ‘Author Unknown.’”

2 comments on "Primary Colors, Author Unknown"

  • Brian says:

    I think there is code for this, in two primitive ways. Nobody’s merged them yet, so far as I know, but the foundations are out there. First, Bayesian word-frequency classifiers like ifile can do some of this. As written, they’ll work almost entirely by word choice, not ordering or preferences for tense and mood. But all it would take is an analyzing tokenizer which breaks words into roots with grammatical notations to let this handle grammar as well. That’s well-understood tech.
    Second, the auto-plagiarism detectors out there, though generally of poor quality, are directly addressing this problem.

  • Adam says:

    Thanks! I hadn’t thought of the auto-plagiarism detectors. My understanding is that they’re largely closed-source, hidden algorithm. Is there anything out there in public?
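
The Bayesian word-frequency idea from the first comment can be sketched in a few lines. This is a generic multinomial naive Bayes classifier with add-one smoothing and a uniform prior over authors, not ifile's actual code; the function names and the shape of the training data are hypothetical. As the comment notes, it works purely on word choice and ignores word order and grammar.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """samples maps an author name to a list of texts known to be theirs."""
    counts = {
        author: Counter(w for t in texts for w in tokenize(t))
        for author, texts in samples.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab


def log_score(words, author_counts, vocab_size):
    """Log-likelihood of the words under one author, with add-one smoothing."""
    total = sum(author_counts.values())
    return sum(
        math.log((author_counts[w] + 1) / (total + vocab_size)) for w in words
    )


def classify(text, model):
    """Return the author whose word frequencies best explain the text."""
    counts, vocab = model
    # Words never seen in training carry no evidence, so drop them.
    words = [w for w in tokenize(text) if w in vocab]
    return max(counts, key=lambda a: log_score(words, counts[a], len(vocab)))
```

Feeding the tokenizer's output through a stemmer or part-of-speech tagger first, as the comment suggests, would let the same classifier pick up on grammar as well as vocabulary.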
