
ACM Conference on Web Search & Data Mining

By Matthew Berk | February 14, 2008

This week I’m attending the first ACM conference on Web Search and Data Mining (WSDM 2008). It’s amazing that this is the first conference of its kind. I’m also astounded at the turnout: roughly 250 attendees, including some of the most influential figures in Web search going back to the beginning of the field.

Yesterday I heard Qiaozhu Mei present fascinating work on the size and entropy of the Web, research conducted with Ken Church at Microsoft. The paper, “Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?”, explores an intriguing question: what if the relevant Web really amounted to the low hundreds of millions of documents, as opposed to the double-digit billions people generally talk about?

Qiaozhu introduced the notion of a set of rational upper bounds for the total pool of interesting content. I’ll paraphrase roughly here:

total personal sites < total online population < total population
total business sites < total business listings < total number of businesses

Granted, there are other kinds of content (news and products, for example), with high rates of volatility, but even these secondary data sets are typically in the low tens or hundreds of millions of documents. Analyzing search logs from Microsoft, he discovered that the total set of relevant documents, as expressed by the entropy of the document set, is actually far smaller than we think.
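To make the entropy idea concrete, here is a minimal sketch in Python. It is not the method from the paper: the click log is made up, and the function name `entropy_bits` is my own. It only shows how the entropy of a log can imply an "effective" number of documents (2 raised to the entropy) that is smaller than the raw count of distinct documents whenever clicks are unevenly distributed.

```python
import math
from collections import Counter


def entropy_bits(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy click log (assumed data, not from the Microsoft logs):
# one clicked URL per search.
clicked_urls = (
    ["a.example"] * 4
    + ["b.example"] * 2
    + ["c.example"]
    + ["d.example"]
)

h = entropy_bits(clicked_urls)        # 1.75 bits for this toy log
distinct = len(set(clicked_urls))     # 4 distinct URLs
effective_size = 2 ** h               # about 3.36 "effective" URLs

print(f"distinct URLs: {distinct}")
print(f"entropy: {h:.2f} bits")
print(f"effective size: {effective_size:.2f}")
```

In this toy log there are four distinct URLs, but because half of the clicks land on one of them, the entropy corresponds to only about 3.4 equally likely documents. The more skewed the clicks, the wider that gap becomes.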

To put these questions another way: is the long tail shorter than we think? Have we over-engineered our solution to the problem of Web search? Does the entrenched box-and-list-of-10 metaphor undercut the utility of the full index? What if an iPod could contain an index as relevant to a searcher as the tens–perhaps hundreds–of thousands of clustered machines operated by Google, Microsoft, Yahoo! and others? Or, put another way: are the costs of indexing the long tail worth the benefits?

I’ve always thought that one key driver of relevance is the reduction of noise in the channel (hence my interest in vertical search); if that’s the case, Qiaozhu’s findings present some really hard truths for us to swallow.

I’m also struck by this because I’m increasingly interested in the short tail, and whether discovery for that spectrum of the query field is best served by keywords. But that’s another story, for another day….

Topics: Data, Content
