I’d like some feedback on my data analysis, below, from anyone who is an expert on spam or anti-spam technologies.
Background: I’m currently working on a small research project to model the innovation arms race between attackers and defenders. For simplicity and availability of data, I’ve chosen to focus on spam and anti-spam (filtering). One of my goals is to model how innovation by one side triggers complementary innovation on the other side, and under what conditions either side would “escalate” the innovation arms race to pre-empt the other side.
To do this, I need to understand paths of innovation and the perceived costs and benefits of each path at any point in time. In particular, I need to model the dependencies between various components and technologies as they evolve. In the world of email spam, one such technology is the encoding and formatting of the email message (plain text, HTML, CSS, etc.) that is used by the spammer to “trick” the filter and the end user.
Data: One interesting source of data is John Graham-Cumming’s “Spammers’ Compendium”, where he tracks spammer “tricks” found in the wild and codes them by method and by technology. One graph shows the number of “tricks” by technology type vs. time, shown here:
Here’s how you read this chart. The vertical axis is “number of tricks, cumulative”, and the horizontal axis is time, by quarter. In this time period (2003-2007), you can see that HTML has been the most prolific vehicle for spammer “tricks”, and continues to be popular in spite of more sophisticated technologies like PDF and Flash that have recently appeared.
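For readers who want to reproduce a chart like this from the raw Compendium entries, here is a minimal sketch of the cumulative counting involved. The record format and the sample values are hypothetical, purely for illustration; they are not the actual Compendium data:

```python
from collections import defaultdict

def cumulative_tricks(records):
    """Given (quarter, technology) records of spam tricks first seen
    in the wild, return {technology: [(quarter, cumulative_count)]}."""
    per_quarter = defaultdict(lambda: defaultdict(int))
    for quarter, tech in records:
        per_quarter[tech][quarter] += 1
    series = {}
    for tech, counts in per_quarter.items():
        running_total = 0
        series[tech] = []
        for quarter in sorted(counts):
            running_total += counts[quarter]
            series[tech].append((quarter, running_total))
    return series

# Hypothetical illustrative records, not real Compendium data:
records = [
    ("2003Q1", "Plain Text"), ("2003Q1", "HTML"),
    ("2003Q2", "HTML"), ("2003Q3", "HTML"),
    ("2007Q2", "PDF"),
]
print(cumulative_tricks(records)["HTML"])
# [('2003Q1', 1), ('2003Q2', 2), ('2003Q3', 3)]
```

Plotting one such series per technology against the quarter axis gives the shape of the chart described above.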
My Analysis: I defined a small set of inference rules over this data to derive other metrics. Namely, I’m interested in estimating the investment required to master each technology (roughly, the level of skill and effort) and also the “affordance” potential (i.e. how fruitful the technology is in allowing spammers to invent new tricks). I used these inference rules to produce the following diagram. (The rules just look for certain patterns in the data, then draw conclusions regarding the two dimensions and also the evolution paths.)
Here’s how you read this chart. “Effort” (= spammer investment) is on the horizontal axis, ranging from “low” to “high”. (It’s probably a log scale, but I haven’t quantified it yet.) “Affordance” is on the vertical axis, again ranging from “low” to “high”. Roughly speaking, this scale could be quantified by the following test: “How many spam tricks could be invented by a reasonably skilled spam team, given a fixed amount of money and time?” But, again, I haven’t formally quantified this scale.
(Of course, I’m not literally saying that spammers have to develop these capabilities in this order. It’s more about logical dependence relations. For example, any toolkit that supports HTML spamming will also have some Plain Text spam trick capabilities, and so on.)
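To make the kind of inference rule mentioned above concrete, here is one possible sketch: treat a technology’s affordance as the rate at which tricks accumulate over the quarters it has been active. This particular formula is my own assumption for illustration, not the actual rule set used in the analysis:

```python
def affordance_estimate(series):
    """Crude affordance proxy (an illustrative assumption, not the
    analysis's actual rule): tricks accumulated per quarter of active
    use.  `series` is the [(quarter, cumulative_count)] list for one
    technology, in chronological order."""
    if not series:
        return 0.0
    quarters_active = len(series)
    total_tricks = series[-1][1]
    return total_tricks / quarters_active

# A technology with 3 tricks over 3 active quarters:
print(affordance_estimate([("2003Q1", 1), ("2003Q2", 2), ("2003Q3", 3)]))
# 1.0
```

A fuller rule set would also look at patterns like how late a technology first appears (a hint about effort) and whether its trick count plateaus.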
The dashed grey lines represent constant Return on Investment (ROI), where the investment is the spammer’s time, energy, and money spent learning the technology, mastering and configuring tools for automation, and then tuning the operation for effective mass production. (Embedded in the “return” element of ROI is the capability of spam filters at that point in time, but I’m modeling that separately.)
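One way to read those constant-ROI contours: if ROI is proportional to affordance divided by effort, then on log-log axes each contour is a straight line, and technologies on the same line are equally attractive to a spammer. A minimal sketch, with hypothetical placement scores standing in for the diagram’s actual positions:

```python
def roi(affordance, effort):
    """ROI taken as affordance / effort.  Contours of constant ROI
    (affordance / effort = const) are straight lines of slope 1 on
    log-log axes."""
    return affordance / effort

# Hypothetical (affordance, effort) scores, purely illustrative --
# not the actual positions in the diagram:
techs = {"Plain Text": (1.0, 1.0), "HTML": (8.0, 2.0), "PDF": (6.0, 4.0)}
ranked = sorted(techs, key=lambda t: roi(*techs[t]), reverse=True)
print(ranked)  # ['HTML', 'PDF', 'Plain Text']
```

Under this reading, a technology sitting above the contour through its competitors is the rational next step for a spammer, which is one way to formalize “escalation” pressure.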
(FYI: other sources of data, including longitudinal case studies of spammers, will be used, so I’m not solely dependent on this one source. But this one does seem especially well suited to studying the rate of innovation on the spammers’ side, and how that innovation branches out.)
Questions for you:
- Do you think this analysis is credible? What are the holes, if any?
- Do you think I have placed PDF and Flash in the right place? According to a technology lifecycle perspective, it would appear that the spam potential of PDF and Flash had hardly been explored in 2007, but I don’t know if I can justify placing them so high on the affordance scale.
- Any other thoughts?
Thanks. You can also email me privately at russell <dot> thomas AT meritology <dot> com.