Category: visualization

Just Landed in…

Just Landed: Processing, Twitter, MetaCarta & Hidden Data:

This got me thinking about the data that is hidden in various social network information streams – Facebook & Twitter updates in particular. People share a lot of information in their tweets – some of it shared intentionally, and some of it which could be uncovered with some rudimentary searching. I wondered if it would be possible to extract travel information from people’s public Twitter streams by searching for the term ‘Just landed in…’.


This is a cool emergent effect: people chaotically announcing themselves on Twitter, plus a MetaCarta service that gives you longitude/latitude and a bunch of other bits, all coming together to make something really cool looking.

Via Information Aesthetics

My Wolfram Alpha Demo

I got the opportunity a couple days ago to get a demo of Wolfram Alpha from Stephen Wolfram himself. It’s an impressive thing, and I can sympathize a bit with them on the overblown publicity. Wolfram said that they didn’t expect the press reaction, which I both empathize with and cast a raised eyebrow at.

There’s no difference, as you know, between an arbitrarily advanced technology and a rigged demo. And of course anyone who’s spent a lot of time trying to create something grand is going to give you the good demo. It’s hard to know what the difference is between a rigged demo and a good one.

The major problem right now with Alpha is the overblown publicity. The last time I remember such gaga effusiveness it was over the Segway before we knew it was a scooter.

Alpha has had to suffer through not only its creator’s overblown assessments, but reviews from neophiles whose minds are so open that their occipital lobes face forward.

My short assessment is that it is the anti-Wikipedia and makes a huge splat on the fine line between clever and stupid, extending equally far in both directions. What they’ve done is create something very much like the computerized idiot savant. As much as that might sound like criticism, it isn’t. Alpha is very, very, very cool. Jaw-droppingly cool. And it is also incredibly cringe-worthily dumb. Let me give some examples.

Stephen gave us a lot of things that it can compute and the way it can infer answers. You can type “gdp france / germany” and it will give you plots of that. A query like “who was the president of brazil in 1930” will get you the right answer and a smear of the surrounding Presidents of Brazil as well.

It also has lovely deductions it makes. It geolocates your IP address and so if you ask it something involving “cups” it will infer from your location whether that should be American cups or English cups and give you a quick little link to change the preference on that. Very, very, clever.

It will also use your location to make other nice deductions. Stephen asked it a question about the population of Springfield, and since he is in Massachusetts, it inferred that Springfield, and there’s a little pop-up with a long list of other Springfields, as well. It’s very, very clever.

That list, however, got me the first glimpse of the stupid. I scanned the list of Springfields and realized something. Nowhere in that list appeared the Springfield of The Simpsons. Yeah, it’s fictional, and yeah that’s in many ways a relief, but dammit, it’s supposed to be a computational engine that can compute any fact that can be computed. While that Springfield is fictional, its population is a fact.

The group of us getting the demo got tired of Stephen’s enthusiastic typing in this query and that query. Many of them are very cool but boring. Comparing stock prices, market caps, changes in portfolio whatevers is something that a zillion financial web sites can do. We wanted more. We wanted our queries.

My query, which I didn’t ask because I thought it would be disruptive, is this: Which weighs more, a pound of gold or a pound of feathers? When I get to drive, that will be the first thing I ask.

The answer, in case you don’t know this famous question, is a pound of feathers. Amusingly, Google gets it on the first link. Wolfram emphasizes that Alpha computes and is smart as opposed to Google just dumbly searching and collating.
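For anyone who wants the arithmetic behind the riddle: the usual reading is that gold is weighed in troy pounds and feathers in ordinary (avoirdupois) pounds. A minimal sketch of that comparison, assuming that reading; the variable names are mine:

```python
# Both pounds are defined in grains; one grain is exactly 64.79891 mg.
GRAIN_G = 0.06479891          # grams per grain

TROY_POUND_GRAINS = 5760      # precious metals (12 troy ounces)
AVDP_POUND_GRAINS = 7000      # everyday goods, feathers included

troy_pound_g = TROY_POUND_GRAINS * GRAIN_G
avdp_pound_g = AVDP_POUND_GRAINS * GRAIN_G

print(f"pound of gold (troy):            {troy_pound_g:.2f} g")
print(f"pound of feathers (avoirdupois): {avdp_pound_g:.2f} g")
print("feathers weigh more:", avdp_pound_g > troy_pound_g)
```

The feathers win by roughly 80 grams.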

I also didn’t really need to ask because one of the other people asked Alpha to plot swine flu in the US, and it came up with... nil. It knows nothing about swine flu. Stephen helpfully suggested, “I can show you colon cancer instead” and did.

And there it is, the line between clever and stupid, and being on both sides of it. Alpha can’t tell you about swine flu because the data it works on is “curated,” meaning they have experts vet it. I approve. I’m a Wikipedia-sneerer, and I like an anti-mob system. However, having experts curate the data means that there’s nothing about the Springfield that pops to most people’s minds (because it’s pop culture) nor anything about swine flu. We asked Stephen about sources, and specifically about Wikipedia. He said that they use Wikipedia for some sorts of folk knowledge, like knowing that The Big Apple is a synonym for New York City but not for many things other than that.

Alpha is not a Google-killer. It is never going to compute everything that can be computed. It’s a humorless idiot savant that has an impressive database (presently some ten terabytes, according to the Wolfram folks), and its Mathematica-on-steroids engine gives a lot of wows.

On the other hand, as one of the people in my demo pointed out, there’s not anything beyond a spew of facts. Another of our queries was “17/hr” and Alpha told us what that is in terms of weekly, monthly, yearly salary. It did not tell us the sort of jobs that pay 17 per hour, which would be useful not only to people who need a job, but to socioeconomic researchers. It could tell us that, and very well might rather soon. But it doesn’t.
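The conversion behind a query like “17/hr” is simple arithmetic. A minimal sketch, assuming a 40-hour week and 52 paid weeks a year; those are my assumptions, not necessarily the ones Alpha uses:

```python
HOURLY_RATE = 17.0     # the "17/hr" from the query
HOURS_PER_WEEK = 40    # assumption: full-time week
WEEKS_PER_YEAR = 52    # assumption: no unpaid weeks

weekly = HOURLY_RATE * HOURS_PER_WEEK
yearly = weekly * WEEKS_PER_YEAR
monthly = yearly / 12

print(f"weekly:  {weekly:,.2f}")
print(f"monthly: {monthly:,.2f}")
print(f"yearly:  {yearly:,.2f}")
```

Which is exactly the point: the numbers are easy, and the interesting question (what jobs pay that) is not in them.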

Alpha is an impressive tool that I can hardly wait to use (supposedly it goes on line perhaps this week). It’s something that will be a useful tool for many people and fills a much-needed niche. We need an anti-Wikipedia that has only curated facts. We need a computational engine that uses deductions and heuristics.

But we also need web resources that know about a fictional Springfield, and resources that can show you maps of the swine flu.

We also need tech reviewers who have critical faculties. Alpha is not a Google-killer. It’s also not likely as useful as Google. The gushing, open-brained reviews do us and Alpha a disservice by uncritically watching the rigged demo and refusing to ask about its limits. Alpha may straddle the line between clever and stupid, but the present reviewers all stand proudly on stupid.

Applied Security Visualization

Our publisher sent me a copy of Raffael Marty‘s Applied Security Visualization. This book is absolutely worth getting if you’re designing information visualizations. The first and third chapters are a great short intro into how to construct information visualization, and by themselves are probably worth the price of the book. They’re useful far beyond security. The chapter I didn’t like was the one on insiders, which I’ll discuss in detail further in the review.

In the intro, the author accurately scopes the book to operational security visualization. The book is deeply applied: there’s a tremendous number of graphs and the data which underlies them. Marty also lays out the challenge that most people know about either visualization or security, and sets out to introduce each to the other. In the New School of Information Security, Andrew and I talk about these sorts of dichotomies and the need to overcome them, and so I really liked how Marty called it out explicitly. One of the challenges of the book is that the first few chapters flip between their audiences. As long as readers understand that they’re building foundations, it’s not bad. For example, security folks can skim chapter 2, visualization people chapter 3.

Chapter 1, Visualization, covers the whats and whys of visualization, and then delves into some of the theory underlying how to visualize. The only thing I’d change in chapter 1 is to add a more explicit mention of Tufte’s small multiples idea.

Chapter 2, Data Sources, lays out many of the types of data you might visualize. There’s quite a bit of “run this command” and “this is what the output looks like,” which will be more useful to visualization people than to security people.

Chapter 3, Visually Representing Data, covers the many types of graphs, their properties and when they’re appropriate. He goes from pie and bar charts to link graphs, maps and tree maps, and closes with a good section on choosing the right graph. I was a little surprised to see figure 3-12 be a little heavy on the data ink (a concept that Marty discusses in chapter 1) and I’m confused by the box for DNS traffic in figure 3-13. It seems that the median and average are both below the minimum size of the packets. These are really nits; it’s a very good chapter. I wish more of the people who designed the interfaces I use regularly had read it.

Chapter 4, From Data to Graphs, covers exactly that: how to take data and get a graph from it. The chapter lays out six steps:

  1. Define the problem
  2. Assess Available Data (I’ll come back to this)
  3. Process Information
  4. Visual Transformation
  5. View Transformation
  6. Interpret and Decide
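
To make the six steps above concrete, here is a rough sketch of how they might map onto a few lines of code. This is my own illustration, not code from the book: the log format, the addresses, and the function names are all invented, and the "graph" is just a text bar chart.

```python
from collections import Counter

# 1. Define the problem (hypothetical): which sources get denied most often?
# 2. Assess available data: invented firewall log lines in a made-up format.
LOG_LINES = [
    "2009-01-01T00:00:01 DENY src=10.0.0.5 dport=22",
    "2009-01-01T00:00:02 DENY src=10.0.0.5 dport=23",
    "2009-01-01T00:00:03 ALLOW src=10.0.0.7 dport=80",
    "2009-01-01T00:00:04 DENY src=10.0.0.9 dport=22",
    "2009-01-01T00:00:05 DENY src=10.0.0.5 dport=445",
    "2009-01-01T00:00:06 ALLOW src=10.0.0.9 dport=443",
]

def parse(line):
    """3. Process information: turn a raw line into a record."""
    timestamp, action, *pairs = line.split()
    record = dict(pair.split("=", 1) for pair in pairs)
    record["time"] = timestamp
    record["action"] = action
    return record

def count_by(records, field):
    """4. Visual transformation: map records to (category, value) pairs."""
    return Counter(record[field] for record in records)

def render_bars(counts, top=3):
    """5. View transformation: keep the top entries and draw them as bars."""
    return [f"{label:<10} {'#' * n} {n}" for label, n in counts.most_common(top)]

records = [parse(line) for line in LOG_LINES]
denied = [r for r in records if r["action"] == "DENY"]
denied_by_src = count_by(denied, "src")
bars = render_bars(denied_by_src)
print("\n".join(bars))

# 6. Interpret and decide: that part stays with the human reading the chart.
```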

There’s also a list of tools for processing data, and some comparisons.

Chapter 5, Visual Security Analysis, covers reporting, historical analysis and real time analysis. He explains the difference, when you use each, and what tools to use for each.

Chapter 6, Perimeter Threat, covers visualization of traffic flows, firewalls, intrusion detection signature tuning, wireless, email and vulnerability data.

Chapter 7, Compliance, covers auditing, business process management, and risk management. Marty makes the assumption that you have a mature risk management process which produces numbers he can graph. I don’t suppose that this book should go into a long digression on risk management, but I question the somewhat breezy assumption that you’ll have numbers for risks.

I had two major problems with chapter 8, Insider Threat. The first is claims like “fewer than half (according to various studies) of various studies involve sophisticated technical means” (pg 387) and “Studies have found that a majority of subjects who stole information…” (pg 390) None of these studies are referenced or footnoted, and this in a book that footnotes a URL for sendmail. I believe those claims are wrong. Similarly, there’s a bizarre assertion that insider threats are new (pg 373). I’ve been able to track down references to claims that 70% of security incidents come from insiders back to the early 1970s. My second problem is that having mis-characterized the problem, Marty presents a set of approaches which will send IT security scurrying around chasing chimeras such as “printing files with resume in the name.” (This because a study claims that many insiders who commit information theft are looking for a new job. At least that study is cited.) I think the book would have been much stronger without this chapter, and suggest that you skip it or use it with a strongly questioning bias.

Chapter 9, Data Visualization Tools, is a guided tour of file formats, free tools, open source libraries, and online and commercial tools. It’s a great overview of the strengths and weaknesses of tools out there, and will save anyone a lot of time in finding a tool to meet various needs. The Live CD, Data Analysis and Visualization Linux, can be booted on most any computer, and used to experiment with the tools described in chapter 9. I haven’t played with it yet, and so can’t review it.

I would have liked at least a nod to the value of comparative and baseline data from other organizations. I can see that that’s a little philosophical for this book, but the reality is that security won’t become a mature discipline until we share data. Some of the compliance and risk visualizations could be made much stronger by drawing on data from organizations like the Open Security Foundation’s Data Loss DB or the Verizon Breaches Report.

Even in light of the criticism I’ve laid out, I learned a lot reading this book. I even wish that Marty had taken the time to look at non-operational concerns, like software development. I can see myself pulling this off the shelf again and again for chapters 3 and 4. This is a worthwhile book for anyone involved in Applied Security Visualization, and perhaps even anyone involved in other forms of technical visualization.

Congratulations to Raffy!


His book, Applied Security Visualization, is now out:

Last Tuesday when I arrived at BlackHat, I walked straight up to the book store. And there it was! I held it in my hands for the first time. I have to say, it was a really emotional moment. Seeing the product of 1.5 years of work was just amazing. I am really happy with how the book turned out. The color insert in the middle is a real eye-catcher for people flipping through the book and it greatly helps making some of the graphs better interpretable.

I’m really excited, and look forward to reading it!

Visualizing Risk

I really like this picture from Jack Jones, “Communicating about risk – part 2:”


Using frequency, we can account for events that occur many times within the defined timeframe as well as those that occur fewer than once in the timeframe (e.g., .01 times per year, or once in one hundred years). Of course, this raises the question of how we determine frequency, particularly for infrequent events. In the interest of keeping this post to a reasonable length, I’ll cover that another time (soon).

And I’m looking forward to how Jack says we should determine those frequencies.

One suggestion for improvement: state the timeframe on the chart label: “Loss Event Frequency (per year).”

Reporting on Data Breaches: US and Great Britain

Is the recent wave of reporting on British data breaches similar to what we’ve been seeing in the US? A couple of things seem true: the US has way more reported breaches per capita, but both locations have seen greatly accelerated reporting.
Here’s a plot of all US (Country = ‘US’) and British (Country = ‘GB’) breaches in Attrition’s DLDOS, as of March 13, 2008.
The incident count has been normalized by dividing each series by the total number of incidents in that series. The US had 840 reported incidents, Great Britain had 33.
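
For anyone wanting to reproduce the normalization, a minimal sketch follows. The per-period counts here are placeholders I made up to show the mechanics; they are not the DLDOS numbers, and only the method (divide each series by its own total) comes from the description above.

```python
def normalize(series):
    """Divide each count by the series total so the values sum to 1."""
    total = sum(series)
    return [count / total for count in series]

# Placeholder counts per period; NOT real breach data.
us_counts = [5, 10, 20, 40, 25]
gb_counts = [0, 1, 1, 2, 6]

us_share = normalize(us_counts)
gb_share = normalize(gb_counts)

for period, (us, gb) in enumerate(zip(us_share, gb_share), start=1):
    print(f"period {period}: US {us:.2f}  GB {gb:.2f}")
```

Normalizing this way puts two series of very different sizes (840 versus 33 incidents) on the same vertical scale, so the shapes can be compared even though the raw counts cannot.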


What does this mean? I’m not sure…
Update: Added vertical lines to graphic, in response to Lyger’s comment. Left one is Choicepoint 2/15/05. Right is HMRC 11/20/2007.

The Visual Display of Quantitative Lawsuits

So the Boston Globe has this chart of who’s suing whom over failures in the “Big Dig:”


What I find most fascinating is that it’s both pretty and pretty useless. Since just about everyone is suing everyone else, what would be perhaps more interesting is a representation of who’s not suing whom. That is, where there is no lawsuit. I’ll use clock positions to describe players. With some work, I’ve determined that the Mass Turnpike Authority (at about 5.30) is not suing HNTB (7.30 or so), who is the “engineering firm responsible for inspections of Big Dig after completion of the projects.” HNTB is also not being sued by Newman Renner Colony (3.30), “distributor of bolt-and-epoxy assembly that failed in the ceiling.”

(Thanks to Nicko for the pointer.)

The Two Minute Rule for Email and Slides?

So I’ve been discomfited by the thoughts expressed by Tom Ptacek and the Juice Analytics guys over what presentations are for, and a post over at Eric Mack’s blog, “A New Two Minute Rule for Email.” The thing that annoys me is the implicit assumption that all issues should be broken down into two minute chunks. That we’re all dumb enough to require summaries like “It’s a slam dunk, Mr. President.” I find myself slipping into this belief. Annoyed that the authors of “A Report on the Surveillance Society” prepared for the UK Information Commissioner didn’t make it shorter. It’s already easy to read, but it’s 102 friggin’ pages. Who wants to read 102 pages? You’re probably onto the next blog post already.

If you’re not, it may be because you recognize that there are arguments that take longer. There are also arguments that don’t take so long, and I think I’ve made mine.

PS: I don’t think that Juice or Tom would ever argue for a hard-and-fast rule of this sort, but guidelines with subtlety become rules that people get tied up about.

A Picture (or Three) Is Worth A Thousand Words

Iang over at Financial Cryptography talks about the importance of not just which cryptographic algorithm to use, but which mode it is implemented with. He uses three pictures from Mark Pustilnik’s paper “Documenting And Evaluating The Security Guarantees Of Your Apps” that are such a great illustration of the problem, that I have to include them here.
Adam and I have both been to Tufte’s courses on Presenting Data and Information and these strike me as the kind of illustrations he would appreciate. The beauty of them is that as a non-cryptographer, you don’t need to understand the technical differences between ECB and CBC modes, because the illustrations demonstrate them far better than any text could.
[Edit: In the comments, nicko points out this extremely clever idea was originally done with the Tux logo from Linux and that they can be found on Wikipedia in the section on block cipher modes of operation.]
Figure 2a Plaintext
Figure 2b ECB Encryption
Figure 2c CBC Encryption
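
If you would rather see the effect in bytes than in pictures, here is a small sketch of why the ECB image still shows the outline. It uses a toy Feistel cipher built on SHA-256 that I made up purely so the example runs with the standard library; it is not AES, it is not secure, and it is not taken from the paper. Only the mode structure (ECB encrypts each block independently, CBC XORs in the previous ciphertext block first) is the real thing.

```python
import hashlib

BLOCK = 16  # bytes per block

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _round(key, i, half):
    # Toy round function: keyed hash of one half of the block.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:BLOCK // 2]

def encrypt_block(key, block):
    """Toy 4-round Feistel cipher. Illustration only, NOT secure."""
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def ecb_encrypt(key, data):
    # ECB: every block is encrypted on its own.
    return b"".join(encrypt_block(key, b) for b in split_blocks(data))

def cbc_encrypt(key, iv, data):
    # CBC: each block is XORed with the previous ciphertext block first.
    out, prev = [], iv
    for block in split_blocks(data):
        prev = encrypt_block(key, _xor(block, prev))
        out.append(prev)
    return b"".join(out)

key = b"demo key"               # hypothetical key
iv = bytes(range(BLOCK))        # fixed IV, for a repeatable demo only
plaintext = b"A" * (BLOCK * 3)  # three identical blocks, no padding needed

ecb_blocks = split_blocks(ecb_encrypt(key, plaintext))
cbc_blocks = split_blocks(cbc_encrypt(key, iv, plaintext))

print("distinct ECB ciphertext blocks:", len(set(ecb_blocks)))
print("distinct CBC ciphertext blocks:", len(set(cbc_blocks)))
```

Identical plaintext blocks come out as identical ciphertext blocks under ECB, which is why large flat areas of an image survive encryption; under CBC the chaining makes them differ.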