I started thinking about an interesting problem several months back, and I thought I’d write up a huge, nerdy blog post, in case anyone is thinking about similar issues. But don’t say I didn’t warn you – it is both huge and nerdy.
Let’s forgo the details and just talk about a simple hypothetical scenario. You’ve got a list of local businesses, maybe cobbled together from various sources, like the yellow pages, yelp, and a web crawl you did yourself. Because your data is from a few different sources, you probably have some interesting relationships inside it.
I see two main relationships:
1) You have a couple entries for the local hardware store. This is duplicate data, but lets say you want to keep it around anyway. You could find cases where two sources agree, but one is missing, for example. And that might be interesting.
2) There could be multiple entries for different locations of the same business, like a Starbucks. This isn’t really duplicated data, but identifying the relationship between the two (or more) locations could be meaningful, depending on your domain and what you’re trying to do.
Lately, I’ve been working on finding these kinds of relationships in a similar dataset in my spare time. I’m abstracting a lot out from the specific problem and dataset I was working on, but the basic premise is the same, so just roll with it.
I think there are really only two ways to approach this. One is to look for similar names. This would work for both situations 1 & 2 above. Keep in mind, though, that your data might not be very clean. The other is to look for specific shared data, like the same phone number, or the same address, or some other bit of information you may have. Depending on what data is shared, it may work only for situation 1, or it may work with both.
Checking for the same phone number is pretty easy. But if you’re looking for multiple locations of the same business, your best hope might just be looking for similar names. Just how, exactly, do you rate strings based on their “similarness,” as opposed to actual “equality?” I’m primarily working with 2 methods:
1) Edit Distance. This is the minimum number of operations (add a letter, change a letter, swap a letter, remove a letter, etc) that are required to change String A into String B. There are a few ways of doing this, all of which are variations on the theme. Levenshtein Distance, Damerau-Levenshtein Distance, and Optimal String Alignment are a few. The upside is that it’s an easy-to-understand number (we all know what an edit distance of 2 means). The downside is that it would rate the strings “Matt Good” and “Good, Matt” as being VERY different, when in fact most humans would say they’re fairly similar. There are ways around this, which complicate the algorithm somewhat. I haven’t implemented any of those changes (yet).
2) Something I came up with (which I probably didn’t really come up with) that I am calling “Shared Tokens,” for lack of a better term. The basic gist is that it tokenizes the names, figures out which are shared between the two strings, and then (wait for it) weights each shared token by some amount, inversely related to how frequently that token appears in the corpus. For example, in the business list scenario, you’ll probably see a lot of “Inc” or “Company” or “First” or “Hardware”, and they aren’t really that meaningful. But you won’t see a lot of “GraphXpress” (to use a made-up name), which is likely to be more meaningful if you see it shared.
To scale this up, you really do actually have to compare every entity to every other entity in your set. To put it lightly, that’s not awesome. I don’t see a way around it, so one thing I did to mitigate the problem was to take advantage of modern multicore processors, and divide up the task into smaller chunks, and spin them off into separate threads. This is, of course, a scaled-down version of the idea behind MapReducethat I accidentally reverse-engineered. It’s always a good feeling getting that confirmation that you’re on the right track, or at least that better minds than yours have walked the path before you. When I compared the single-thread version to the multithreaded version, I got a 2.9x performance improvement. That’s a big deal, friends. When you do this, however, you have to make sure you divide the task properly. You still have to end up comparing every entity to every other entity, and it’s easy to think you’re doing that, but not. I’ll leave that to you to figure out how
The other weird issue with this whole problem is that relational databases are awful at storing potential matches. They really aren’t built for the kind of non-hierarchical N:M relationships that this requires. Think about how you would model that: you could have an entity id column, and a matching entity id column. But the matching algorithms are transitive, so there’s no inherent order. You might end up with a record for 1,2 and another for 2,1, but they really mean the same thing. I need to do a bit more research into this, its possible that a NoSQL-style DB or graph DB would be better for this. For a number of reasons (not the least of which is that this is just a proof of concept), I’m just hacking the crap out of the relational model for now. Another something to put on the list.
At any rate, I thought that this was a good introduction to get myself into some of the kinds of data analysis, “big data,” and “collective intelligence” topics I’ve been trying to break into lately. Any robot can build an app that tracks some data. The good work is doing “stuff with the data.”