A Profusion of Taxonomies

In “In the Classification Kingdom, Only the Fittest Survive,” Carol Kaesuk Yoon writes about the profusion of naming schemes for animals:

Then there’s uBio, which has sidestepped the question of codes and regulations altogether and instead aims to record every single name ever used for any organism, scientific or common, correct or incorrect, down to the last variation and misspelling, as a way of linking all information ever recorded about an organism together.

The All Species Foundation aims not only to record all names but also to find every species and describe it, all in 25 years. And then there’s Wikispecies, Species 2000, the Electronic Catalogue of Names of Known Organisms and many more. Some have already come and gone, or nearly so, and others are expiring for lack of sustained funds.

So ZooBank finds itself born in the midst of a Cambrian explosion of initiatives, a proliferation not merely of Web sites and databases but of ideas about how to accomplish the task of naming and organizing all of life. And though disorder may be the most abhorrent thing to a tidy taxonomist, sometimes a little chaos can be healthy. [mmm, chaos!]

And I used to think this was simple. But as Clay Shirky has pointed out, vocabularies are most useful for a particular task, and different tasks, even in the same domain, may require slightly different “meta-data.” (That is, the information about the data in the taxonomy.)

I’ll note that uBio sounds a lot like the CVE, which is a computer vulnerability concordance (see “concordance” at Wikipedia), even though not everyone agrees with that definition.

A Few Typologies of Bloggers

First, a very brief bit of terminology: A typology is a way to organize things, much like a taxonomy. Each item within a typology has clearly distinguishing characteristics, but there’s no hierarchy such as animals, vertebrates, mammals, hominids, humans. To be honest, I’m not sure if this is a typology or just some categories. But “A few categories…” would be far less fun as a headline.

At BlogNashville, Rebecca MacKinnon discussed the concept of “bridge bloggers,” those bloggers who make an effort to blog about their country in a way that an outsider or foreigner can understand. It’s a great concept, but I’m having trouble finding a good link. Anyone? So much of what so many bloggers say is “inside baseball,” things that are hard for folks outside the club to understand (or even to understand why you might bother to say them). This doesn’t just happen across national boundaries; it also takes place across organizational or professional lines. Milbloggers and peace bloggers often seem to be on different planets. No one takes the time to explain their orientation.

There are a few information security bridge bloggers: Steven Hofmeyer at nthWorld, the mysterious John at “Internet Security: Be Careful,” and Deb Radcliffe at “Security Chief.” Some people might say Bruce Schneier fits into the category; his last book was intended as a bridge, but his blog doesn’t always seem to fit.

In a closely related post, “An update from the Weblog Workshop,” Ethan Zuckerman posts:

Shinsuke Nakajima from NAIST introduces three ways to think about key bloggers: topic-finders, agitators and summarizers. He talks most about the second two types and methods for detecting them. Summarizers, unsurprisingly, link to lots of people. Agitators can be found by looking for a drastic change in entries posted within a thread, or a drastic change in topic.

It’s not original, but still important to note that there’s a split between personal life bloggers (the “Livejournal crowd”) and issue bloggers. Many people maintain both.

My Categories Suck

The categories I’ve set for this blog are non-functional. I have 16 categories, of which maybe 4 are ever exclusive.

Do you look at my categorization of posts? Do you look at the category archives?
Should I create a new set of categories? If so, what? (mmm, Choicepoint! Not.) Should I abandon categories and go to tagging? If so, what Movable Type/MarsEdit add-on should I use?

A Few Ideas Connected by the Tag "Folksonomy"

Nude Cybot, in an email in which he promises to emerge soon, presumably to be exceptionally cold, mentions that folksonomies have hit Wired News. The Wired article points out that there are more “cat” (16,297) tagged images than “dog” (14,041) in Flickr. But the conclusion they draw from this, “If the photo-sharing site Flickr is any indication, the world of digital photographers is dominated by cat people,” is very dependent on the search. Puppy (2,145) beats kitten (1,912). As I discuss in Economics of Taxonomies, the cost of easy classification can be difficulty in searching. Deciding which tags are close enough to kitten to be included in the count is subjective. (Flickr suggests “Related: cat, cats, cute” and that you “See also: kitty, animal, kittens, pet, animals, pets, black, sleeping, sleep, bw, white.”)
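To make that dependence on the search concrete, here is a minimal Python sketch using only the four counts quoted above. The groupings of tags are my own illustrative assumptions, and simply summing tag counts double-counts any photo that carries more than one of the tags, so treat the totals as rough.

```python
# Tag counts as quoted above from the Wired article and Flickr.
tag_counts = {"cat": 16297, "dog": 14041, "kitten": 1912, "puppy": 2145}

def total(tags):
    """Sum the counts for a chosen set of tags.

    Note: this double-counts photos carrying more than one of the tags.
    """
    return sum(tag_counts[t] for t in tags)

def winner(cat_tags, dog_tags):
    """Say which side 'wins' under one subjective choice of tags."""
    return "cat people" if total(cat_tags) > total(dog_tags) else "dog people"

# Three equally defensible searches (the groupings are assumptions).
adults_only = winner(["cat"], ["dog"])
young_only = winner(["kitten"], ["puppy"])
combined = winner(["cat", "kitten"], ["dog", "puppy"])

print(adults_only, "|", young_only, "|", combined)
```

The answer flips depending on which tags the searcher decides belong together, which is exactly the subjective step a folksonomy leaves to the searcher.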

This relates closely to the idea of Keynes’ Beauty Contests, where your goal was not really to decide which was the most beautiful woman out of a set of photos published by the newspaper, but to select the one picked by the most other people. This might indicate that those skilled at groupthink will do well in a folksonomy-centric world.

A different way to state that, which would get far fewer nods because the ideas are rarer, would be to say that those with different orientations may well be disadvantaged by their need to spend energy observing the mainstream, unless they use that analysis to guide their decisions and actions to take advantage of the orientation differences. In this way, those Microsofties with iPods could be doing their company a great service.

Folksonomies, Tested

I’ve just stumbled across this abstract comparing full-text searching to controlled vocabulary searching. The relevance to Clay’s posts on controlled vocabularies is that our intuitive belief that controlled vocabulary helps searching may be wrong. Unfortunately, the full paper is $30; perhaps someone with an academic library can comment.

…In this paper, we focus on an experiment in which different component indexing and retrieval methods were tested. The results are surprising. Earlier work had often shown that controlled vocabulary indexing and retrieval performed better than full-text indexing and retrieval…, but the differences in performance were often so small that some questioned whether those differences were worth the much greater cost of controlled vocabulary indexing and retrieval … In our experiment, we found that full-text indexing and retrieval of software components provided comparable precision but much better recall than controlled vocabulary indexing and retrieval of components. There are a number of explanations for this somewhat counter-intuitive result, including the nature of software artifacts, and the notion of relevance that was used in our experiment. We bring to the fore some fundamental questions related to reuse repositories.
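For readers who don’t juggle these terms daily: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. Here is a small Python sketch of the two measures. The component names and result sets are entirely hypothetical, invented to show the shape of a “comparable precision, much better recall” outcome; they are not data from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: four components are truly relevant to the query.
relevant = {"c1", "c2", "c3", "c4"}

# Hypothetical result sets for the two indexing methods.
controlled_vocab = ["c1", "x1"]
full_text = ["c1", "c2", "c3", "x1", "x2", "x3"]

cv_precision, cv_recall = precision_recall(controlled_vocab, relevant)
ft_precision, ft_recall = precision_recall(full_text, relevant)

print("controlled vocabulary:", cv_precision, cv_recall)
print("full text:            ", ft_precision, ft_recall)
```

In this made-up case both methods return the same proportion of junk, but full-text search finds three of the four relevant components where the controlled vocabulary finds one.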

Economics of Taxonomies

In his latest post on folksonomies, Clay argues that we have no choice about moving to folksonomies, because of the economics. I’d like to tackle those economics a bit.

(Some background: There was recently a fascinating exchange between Clay Shirky and Louis Rosenfeld on the subject of taxonomies versus “folksonomies,” lightweight, uncontrolled terms that users attach to things as classification. Now, as the name of my blog implies, I’m all in favor of such emergent and chaotic phenomena as folksonomies. At the same time, some of the work I’m doing may involve the creation of a taxonomy. Worse, it’s a taxonomy where the items being classified are subject to a great many potential classifications, and really, a folksonomy may well be a better choice. So how to decide where to go?)

I don’t think that there is a single economics of taxonomies. We could compare effort of creation to effort of use. Flickr users create a folksonomy because it’s trivial to create, and the work needed to use it for tagging is also low. In contrast, the Linnaean taxonomy of life is the subject of a huge amount of work.
Once you’ve learned to use both Flickr and the plethora of modern library systems to search, the effort to search the Flickr site is higher than the effort to search in a library. So Flickr (and perhaps all folksonomies) offload costs from classifiers to searchers.

There’s also an economic question of the cost of failure. Flickr is not there to help you find precisely the photo you’re looking for, nor the paper or book you mean to find. It’s there to make surfing easier. If you want to see specific people’s photos, you can subscribe to their site. So the folksonomy works where there’s a very low cost of not seeing a result. Does it work as well where the costs are higher? If you’re searching for a specific book in a library, and can’t guess the tags attached to it, you can fall back to other, organized search criteria. I’m finding it hard to quantify the search failure costs here, because once you move from photos to, say, reference specimens of butterflies, that specimen and its name act as an index into all sorts of scientific work.

Another tension is speed of change. Fast-changing taxa are hard to search, but easy to create. Is it worthwhile to spend the effort to enable effective searching? To whom is it worthwhile?

To relate this back to the work I’m doing, I think that the cost of failed searches may be very high. High enough to dominate? Unclear.

"Metadata for the masses"

In “Metadata for the masses,” Peter Merholz presents an interesting idea, which is to build a classification scheme from free-form data that users apply. He points to Flickr’s “Cameraphone” category, which would probably not exist if there were only a pull-down list.

He also points up problems: Many categories for one thing (nyc, NewYork, NewYorkCity), one category that means many things (“Flow, for instance, can either mean optimal creative experience, or the movement of a fluid,”), and categorizations that are wrong.
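The first of those problems, many tags for one thing, is the easiest to sketch in code. What follows is a minimal Python illustration, not anything Merholz or Flickr describes: the synonym table and the canonical name “new-york-city” are my own hypothetical choices, and somebody would have to build and maintain such a table by hand or derive it from usage.

```python
from collections import Counter

# Hypothetical synonym table mapping squashed tag spellings to one name.
CANONICAL = {
    "nyc": "new-york-city",
    "newyork": "new-york-city",
    "newyorkcity": "new-york-city",
}

def normalize(tag):
    """Lower-case a tag, drop spaces and hyphens, then apply the synonym table."""
    key = tag.lower().replace(" ", "").replace("-", "")
    return CANONICAL.get(key, key)

# Free-form tags as users might type them.
tags = ["nyc", "NewYork", "NewYorkCity", "Cameraphone", "flow"]
counts = Counter(normalize(t) for t in tags)

print(counts)
```

Note what this does not fix: “flow” still collapses two meanings into one tag, and a wrong tag stays wrong. Merging synonyms is the cheap part.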

I think there’s a tie here to memes, or ideas which encourage you to adopt them. If I see a tag which strikes me, is evocative to me, or I see as useful, I’m likely to use it myself. If I create a tag which I find evocative, but no one else does (say, “Bastiat-ic”), it’s unlikely to get picked up. I am a big fan of evolutionary, or memetic, systems like this, and am sorely tempted to try to include one in my project, but the goal of that project isn’t actually to create a taxonomy; it’s to create a useful naming scheme. I think a taxonomy is part of that, but others who get a say in the final analysis disagree, and so I’d like to focus on getting a taxonomic name space, rather than a cool evolutionary method for creating it.

The Tree of Life, COI-ly

The September 30th issue of the Economist points to an article in PLoS Biology by Hebert et al. discussing a new technique for identifying species. The technique relies on the mitochondrial gene for cytochrome c oxidase I (COI), which is a 648 base pair gene. [1]

This technique helps settle the question of “Is Astraptes fulgerator one species or several?” [2] The butterflies in question look the same as adults, but there are important variations in the caterpillar forms.

Which, as I struggle to create a taxonomy for a specific set of computer security issues, shows that I am doomed to fail, and that may just be ok.

[1] Who the heck told them they could throw a ‘c’ out in the midst of a protein name like that? Do these people have no respect for the English language?
[2] It was keeping me awake at night, too. (As many as 10 species in Costa Rica alone.)

Taxonomic Software

A small window into a large world, with its own software:
biological software, including DELTA, a DEscription Language for TAxonomy, database software, ecology software, morphometric, paleontologic, and phylogenetics software. (Hey, I need a taxonomy just to keep the breakdowns straight!)

Or DMOZ has a page, but it doesn’t seem as comprehensive.

What I want to do is to throw keywords at a database and have them organized for me. I suspected that this might be sufficiently specialized as to not have software available for it, but I’m no longer so sure.
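As a rough idea of the simplest version of “throw keywords at it and have them organized,” here is a Python sketch, assuming nothing more than a dictionary of items and their keywords. The issue names and keywords are hypothetical, and this is plain counting, not a description of any of the packages above: it builds an index from keyword to items and counts which keywords tend to show up together.

```python
from collections import defaultdict
from itertools import combinations

def build_index(items):
    """Map each keyword (lower-cased) to the set of items carrying it."""
    index = defaultdict(set)
    for name, keywords in items.items():
        for kw in keywords:
            index[kw.lower()].add(name)
    return index

def cooccurrence(items):
    """Count how often each pair of keywords appears on the same item."""
    pairs = defaultdict(int)
    for keywords in items.values():
        unique = sorted({kw.lower() for kw in keywords})
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical items and keywords, for illustration only.
items = {
    "issue-1": ["overflow", "remote"],
    "issue-2": ["overflow", "local"],
    "issue-3": ["overflow", "remote", "auth"],
}

index = build_index(items)
pairs = cooccurrence(items)

print(sorted(index["overflow"]))
print(pairs[("overflow", "remote")])
```

Keywords that frequently co-occur are candidates for grouping under one heading, though deciding what the heading should be called still takes a person.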