Estimating spammers' technical capabilities and pathways of innovation
I’d like some feedback on my data analysis, below, from anyone who is an expert on spam or anti-spam technologies.
Background: I’m currently working on a small research project to model the innovation arms race between attackers and defenders. For simplicity and availability of data, I’ve chosen to focus on spam and anti-spam (filtering). One of my goals is to model how innovation by one side triggers complementary innovation on the other side, and under what conditions either side would “escalate” the innovation arms race to pre-empt the other.
To do this, I need to understand paths of innovation and the perceived costs and benefits of each path at any point in time. In particular, I need to model the dependencies between various components and technologies as they evolve. In the world of email spam, one such technology is the encoding and formatting of the email message (plain text, HTML, CSS, etc.) that is used by the spammer to “trick” the filter and the end user.
Data: One interesting source of data is John Graham-Cumming’s “Spammers’ Compendium”, where he tracks spammer “tricks” found in the wild, and codes them by method and by technology. One graph shows the number of “tricks” by technology type vs. time, shown here:
Here’s how you read this chart. The vertical axis is “number of tricks, cumulative”, and the horizontal axis is time, by quarter. In this time period (2003-2007), you can see that HTML has been the most prolific vehicle for spammer “tricks”, and continues to be popular in spite of more sophisticated technologies like PDF and Flash that have recently appeared.
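A chart like this can be rebuilt from a simple list of dated observations. The following is a minimal sketch, assuming each trick is recorded as a (first-seen date, technology) pair. The records, the `quarter` helper, and the `cumulative_by_quarter` function are hypothetical placeholders for illustration, not actual Compendium data or tooling.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Hypothetical records (first-seen date, technology); NOT real Compendium data.
records = [
    (date(2003, 2, 10), "HTML"),
    (date(2003, 5, 3), "HTML"),
    (date(2003, 5, 20), "Plain Text"),
    (date(2003, 11, 1), "HTML"),
    (date(2004, 1, 15), "CSS"),
]

def quarter(d):
    """Return a sortable (year, quarter) pair for a date."""
    return (d.year, (d.month - 1) // 3 + 1)

def cumulative_by_quarter(records):
    """Map each technology to a list of ((year, quarter), cumulative count)."""
    # Count new tricks per (technology, quarter); missing keys count as 0.
    per_quarter = Counter((tech, quarter(d)) for d, tech in records)
    quarters = sorted({q for _, q in per_quarter})
    result = {}
    for tech in sorted({t for t, _ in per_quarter}):
        counts = [per_quarter[(tech, q)] for q in quarters]
        result[tech] = list(zip(quarters, accumulate(counts)))
    return result

series = cumulative_by_quarter(records)
for tech, points in series.items():
    print(tech, points)
```

Note that this sketch only emits quarters in which at least one trick (of any technology) was recorded; a real plot would probably fill in the empty quarters as well.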
My Analysis: I defined a small set of inferences regarding this data to derive other metrics. Namely, I’m interested in estimating the investment required to master each technology (roughly, the level of skill and effort) and also the “affordance” potential (i.e. how fruitful the technology is in allowing spammers to invent new tricks). I used inference rules to produce the following diagram. (The inference rules just look for certain patterns in the data, then draw conclusions regarding the two dimensions and also the evolution paths.)
Here’s how you read this chart. “Effort” (=spammer investment) is on the horizontal axis, ranging from “low” to “high”. (It’s probably a log scale, but I haven’t quantified it yet.) “Affordance” is on the vertical scale, again ranging from “low” to “high”. Roughly speaking, this scale could be quantified by the following test: “How many spam tricks could be invented by a reasonably skilled spam team, given a fixed amount of money and time?” But, again, I haven’t formally quantified this scale.
For example, what I’m asserting with this diagram is that Plain Text is easier for a spammer to master than HTML, but not by much, and that HTML offers much greater affordance for inventing new spammer tricks than Plain Text. Similar assertions can be made from the placement of the other technologies relative to the two axes. The arrows between the boxes indicate the evolution path for innovations: (roughly) spammers have to master innovating in Plain Text as a prerequisite to innovating in HTML, which is then a prerequisite for innovating in Javascript, CSS, Image spam, and others.
(Of course, I’m not literally saying that spammers have to develop these capabilities in this order. It’s more about logical dependence relations. For example, any toolkit that supports HTML spamming will also have some Plain Text spam trick capabilities, and so on.)
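These logical dependence relations can be written down as a small directed graph. The sketch below is one possible encoding, assuming only the arrows named in the text (Plain Text, then HTML, then Javascript, CSS, and Image spam); the "and others" branches are left out, and the `prerequisites` dictionary and `required_capabilities` helper are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each technology maps to the set of technologies it logically depends on.
# Only the arrows named in the text are encoded; "and others" is omitted.
prerequisites = {
    "Plain Text": set(),
    "HTML": {"Plain Text"},
    "Javascript": {"HTML"},
    "CSS": {"HTML"},
    "Image spam": {"HTML"},
}

def required_capabilities(tech, graph):
    """Return every technology that `tech` transitively depends on."""
    seen = set()
    stack = list(graph[tech])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen

# One ordering that is consistent with all of the dependence relations.
order = list(TopologicalSorter(prerequisites).static_order())
print(order)
print(required_capabilities("CSS", prerequisites))
```

The topological order is only one ordering consistent with the arrows; as the note above says, it expresses dependence, not a literal sequence in which any particular spammer acquired these capabilities.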
The dashed grey lines represent constant Return on Investment (ROI), where the investment is the spammer’s time, energy, and money spent learning the technology, mastering and configuring tools for automation, and then tuning the operation for effective mass production. (Embedded in the “return” element of ROI is the capability of spam filters at that point in time, but I’m modeling that separately.)
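To make the idea of a constant-ROI line concrete, here is a minimal sketch under a strong simplifying assumption: that ROI can be approximated as affordance divided by effort, with both measured on arbitrary positive scales. Neither scale has been quantified in this post, so the coordinates, the generic technology names, and the functions `roi` and `affordance_on_contour` are all hypothetical.

```python
# ASSUMPTION: ROI ~ affordance / effort. This ratio is an illustrative
# stand-in, not a formula defined in the post.

# Placeholder coordinates on arbitrary positive scales; NOT the diagram's values.
placements = {
    "Tech A": {"effort": 1.0, "affordance": 2.0},
    "Tech B": {"effort": 2.0, "affordance": 8.0},
    "Tech C": {"effort": 8.0, "affordance": 8.0},
}

def roi(point):
    """Return the assumed ROI (affordance per unit of effort) for a placement."""
    return point["affordance"] / point["effort"]

def affordance_on_contour(k, effort):
    """Return the affordance that yields ROI == k at the given effort."""
    return k * effort

# Rank technologies from highest to lowest assumed ROI.
ranked = sorted(placements, key=lambda name: roi(placements[name]), reverse=True)
print(ranked)

# Points along the contour that passes through "Tech A".
k = roi(placements["Tech A"])
contour = [(e, affordance_on_contour(k, e)) for e in (1.0, 2.0, 4.0, 8.0)]
print(contour)
```

Under this assumption, each constant-ROI contour is a straight line through the origin on linear axes; if both axes were logarithmic, the contours would instead be parallel lines of slope one.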
(FYI: other sources of data, including longitudinal case studies of spammers, will be used, so I’m not solely dependent on this one source. But it does seem especially good to focus on the rate of innovation on the side of spammers, and how innovation branches out.)
Questions for you:
- Do you think this analysis is credible? What are the holes, if any?
- Do you agree with my ordering of the content encoding technologies in the figure above? (For example, to me, the placement of Javascript seems low on the “affordance” scale. I would have thought it would be a very fruitful vehicle for spammer tricks, but maybe I’m missing something obvious.)
- Do you think I have placed PDF and Flash in the right place? According to a technology lifecycle perspective, it would appear that the spam potential of PDF and Flash had hardly been explored in 2007, but I don’t know if I can justify placing them so high on the affordance scale.
- Any other thoughts?
Thanks. You can also email me privately at russell <dot> thomas AT meritology <dot> com.
Hi Thomas,
I do not think PDF and Flash are very usable for spammers. Yes, both are great for concealing content and tricks, but in general people tend not to trust things in attached files compared with directly visible content such as HTML, plain text, and images. I am not saying there is no high-volume spam of this kind, not at all; I just think these two platforms work better for delivering malicious content from compromised sites than for delivering spam. If you are looking for the best approach to delivering content in terms of required skill and efficiency at getting through, first place definitely goes to plain text, closely followed by HTML, Javascript, and images. The same applies to volumes. However, I agree that PDF and Flash offer the best possibilities for tricking the user.
Kind regards,
[!v@n]
Hi, Ivan Sabo, [comment deleted by editor as personal attack]