Anatomy of a Fuzzy Space
By Matthew Berk | January 9, 2008
Ian White of Urban Mapping today announced that portions of his fantastic data set are open to developers for free. In the local field, Ian is a great thinker when it comes to the intersection of hard data and human perception, and his company has focused on trying to define fuzzy boundaries for perceived local spaces: neighborhoods. The work of Urban Mapping, which relies on a distributed pool of local knowledge, is a rare bird in the geospatial data world, since neighborhood boundaries are by definition fuzzy — and this kind of thinking doesn’t necessarily compute well among folks with, how shall I put it, a strict training in the field.
Back in 2004, when Open List was cutting its teeth on our first national collection of local businesses, we ran smack into the problem of trying to define neighborhoods for areas we didn’t know. For the first three markets we entered, we defined sets of very complex mappings between ZIP codes, town names, and neighborhoods. The results were, predictably, rough and inadequate. When we went national, we gave ourselves the simple constraints that we a) couldn’t afford to pay for a labor pool to help us with the work, and b) required greater local sensitivity than ZIP code maps could ever provide.
Without being able to pay folks for local expertise, it was really going to be difficult to scale nationally with the kind of sensitivity to place we wanted, and this kept me up at night. At the time, I was living 24 stories up, in New York City, with nice views south and west from the desk I sat at when I wasn’t in the office. And in one of those small dramatic moments I’ll not forget, I looked south, thinking about the restaurants I liked in my neighborhood (the Upper East Side), and how generally, they all sat within a blob of space I could see and delineate from above. If I cast my eyes farther south, I could easily imagine other places I frequented in my old neighborhood, Midtown East; a slight movement right with the eyes, and there would be the places I liked to meet people for lunch, in Midtown. Looking across Central Park, there was yet another cluster of dinner favorites, all on the Upper West Side.
From my bird’s eye perch, it was as if I were seeing a map of the city, with points designating businesses I knew, in neighborhoods I knew, across a conceptual and visual grid. I reasoned that if I went to a new restaurant that was somewhere between two known places in the same neighborhood, I’d most likely still be in that neighborhood. By having a known sample of businesses in neighborhoods I knew, I could build a two-dimensional map that would help me identify the neighborhoods of unknown, new places.
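The intuition above, that a new place sitting among known places probably shares their neighborhood, can be sketched as a nearest-neighbor vote. This is only an illustration, not Open List's code: the coordinates, the labels, and the function name `guess_neighborhood` are hypothetical, and the choice of `k` is an assumption.

```python
import math
from collections import Counter

# Hypothetical sample: (latitude, longitude, neighborhood) for places already known.
KNOWN_PLACES = [
    (40.7736, -73.9566, "Upper East Side"),
    (40.7750, -73.9540, "Upper East Side"),
    (40.7700, -73.9600, "Upper East Side"),
    (40.7870, -73.9754, "Upper West Side"),
    (40.7850, -73.9780, "Upper West Side"),
    (40.7549, -73.9840, "Midtown"),
]

def guess_neighborhood(lat, lon, known=KNOWN_PLACES, k=3):
    """Return the most common neighborhood among the k nearest known places."""
    # Scale longitude by cos(latitude) so both axes are in comparable units.
    scale = math.cos(math.radians(lat))

    def dist(place):
        return math.hypot(place[0] - lat, (place[1] - lon) * scale)

    nearest = sorted(known, key=dist)[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]
```

Calling `guess_neighborhood(40.7740, -73.9560)` on this toy sample lands among the three Upper East Side points, so they win the vote.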
With this insight, that I could model known points as a way to classify unknown ones, it was back to the data mines the next day with a clear path. We mined all of our crawled content, looking for pairings of the term “neighborhood” with text immediately following colons or in adjacent table cells. In the exercise, we prioritized local sources, since they were in effect local experts, and since at the time, none of the national sites had any significant neighborhood data whatsoever. We then normalized the labels, associated them with cities, and geocoded the locations. The result was a set of about 40,000 known coordinates that local experts had defined as being in a specific neighborhood. By using the mined data, we had in effect leveraged local knowledge without the cost and complexity of actually working with other people. That’s one of the beauties of data mining: you can learn things in aggregate that would otherwise be simply infeasible to understand.
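As a rough illustration of the mining step, here is a minimal sketch that pulls labels following a “neighborhood:” marker out of page text and normalizes them. The pattern, the function names `extract_labels` and `normalize_label`, and the normalization rules are all assumptions made for the example. It covers only the colon case, not adjacent table cells, and it leaves out geocoding entirely.

```python
import re
import string

# Matches "Neighborhood: Upper East Side" style pairings on a single line.
LABEL_PATTERN = re.compile(r"neighbou?rhood[ \t]*:[ \t]*([^\n|<]+)", re.IGNORECASE)

def extract_labels(text):
    """Return the raw labels found after a 'neighborhood:' marker."""
    return [m.group(1).strip() for m in LABEL_PATTERN.finditer(text)]

def normalize_label(raw):
    """Collapse whitespace, capitalize each word, and strip trailing punctuation."""
    # string.capwords splits on whitespace, capitalizes each word,
    # and rejoins with single spaces.
    return string.capwords(raw).strip(" .,;")

page = "Cuisine: Thai\nNeighborhood:  upper east side.\nPhone: 555-0100"
labels = [normalize_label(label) for label in extract_labels(page)]
```

Simple capitalization like this flattens mixed-case names such as “SoHo”, so a real normalizer would probably need a lookup table of known spellings.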
The next step, which is still in use by the system today, is to use the sample points to construct a geospatial model of the complex polygons that represent neighborhoods in various cities. Incoming, unknown points can then be compared against the model to determine whether they fall squarely within a known polygon, within one or more polygons, or within a certain distance of the edges or center of a polygon. In this way, we basically simulated getting a bird’s eye view of a city and used that view to determine where things were located, based on mined expert knowledge.
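The post does not say how the polygons are built, so the sketch below swaps in a deliberately simple stand-in: a convex hull around each neighborhood's sample points, a ray-casting containment test, and a distance-to-edge fallback. Real neighborhoods are rarely convex, coordinates are treated as planar, and the distance-to-center check is omitted. The function names and the `tolerance` parameter are hypothetical.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def contains(polygon, point):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def distance_to_edge(polygon, point):
    """Shortest distance from the point to any edge of the polygon."""
    px, py = point
    best = math.inf
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        dx, dy = x2 - x1, y2 - y1
        seg_len_sq = dx * dx + dy * dy
        t = 0.0
        if seg_len_sq > 0:
            # Clamp the projection so it stays on the segment.
            t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / seg_len_sq))
        best = min(best, math.hypot(px - (x1 + t * dx), py - (y1 + t * dy)))
    return best

def classify(point, polygons, tolerance):
    """Labels whose polygon contains the point, else those within tolerance of an edge."""
    inside = [name for name, poly in polygons.items() if contains(poly, point)]
    if inside:
        return inside
    return [name for name, poly in polygons.items()
            if distance_to_edge(poly, point) <= tolerance]

# Hypothetical sample points for two overlapping neighborhoods, "A" and "B".
samples = {
    "A": [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)],
    "B": [(3, 3), (7, 3), (7, 7), (3, 7)],
}
polygons = {name: convex_hull(pts) for name, pts in samples.items()}
```

With these toy polygons, `classify((3.5, 3.5), polygons, 0.5)` returns both labels because the point sits in the overlap, which is the "one or more polygons" case, while a point just outside "A" is matched only through the edge tolerance.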
Here are two “views” of what Open List actually “sees” for New York City, based on plotting our sample points: actual, fuzzy.
Now, there are flaws in the system, to be sure. Where there are surprising changes in the contour of the land, we often err (think of the bends in the Charles River in Boston). There are also ongoing efforts to refine the sample set and eliminate outliers. And, not surprisingly, even local experts sometimes just put a place in the wrong neighborhood.
That said, it works pretty well, and we’re constantly on the lookout for ways to improve the sample set and the algorithm. Urban Mapping’s announcement reminds me that on the one hand, neighborhood data is really local gold, but at the same time, providers like Maponics and Urban Mapping are commoditizing the ability to know neighborhoods. But if there’s one thing I hope is clear from my description of how Open List handles the problem, it’s that getting local means getting your hands dirty, and the barrier to entry is increasingly to design systems that have detailed local knowledge, at scale, sufficient to engage consumers on their own ground, literally….
Topics: Data, Local Search, Content


January 10th, 2008 at 2:32 pm
Ahh, growing pains!
January 11th, 2008 at 1:03 pm
[…] Fascinating story about how Professor Berk originally deduced which restaurants and hotels were in which New York neighborhoods. Manhattan is a city notorious for ever-changing neighborhoods but Matthew figured out a simple way to crawl the information. The resulting graphs are very cool. […]
January 16th, 2008 at 12:47 am
[…] Deriving Neighborhood data. An interesting post on OpenList’s approach to understanding neighborhoods. This aligns with my thoughts on the semantic web - most semantic meaning will be derived by ‘intelligent agents’, not defined by content owners. Thanks (Rahul and Niki). […]