I started thinking about an interesting problem several months back, and I thought I’d write up a huge, nerdy blog post, in case anyone is thinking about similar issues. But don’t say I didn’t warn you – it is both huge and nerdy.
Let’s forgo the details and just talk about a simple hypothetical scenario. You’ve got a list of local businesses, maybe cobbled together from various sources, like the Yellow Pages, Yelp, and a web crawl you did yourself. Because your data is from a few different sources, you probably have some interesting relationships inside it.
I see two main relationships:
1) You have a couple of entries for the local hardware store. This is duplicate data, but let’s say you want to keep it around anyway. For example, you could find cases where two sources agree but a third is missing the entry, and that might be interesting.
2) There could be multiple entries for different locations of the same business, like a Starbucks. This isn’t really duplicated data, but identifying the relationship between the two (or more) locations could be meaningful, depending on your domain and what you’re trying to do.
Lately, I’ve been working on finding these kinds of relationships in a similar dataset in my spare time. I’m abstracting a lot out from the specific problem and dataset I was working on, but the basic premise is the same, so just roll with it.
I think there are really only two ways to approach this. One is to look for similar names. This would work for both situations 1 & 2 above. Keep in mind, though, that your data might not be very clean. The other is to look for specific shared data, like the same phone number, or the same address, or some other bit of information you may have. Depending on what data is shared, it may work only for situation 1, or it may work with both.
Checking for the same phone number is pretty easy. But if you’re looking for multiple locations of the same business, your best hope might just be looking for similar names. Just how, exactly, do you rate strings based on their “similarness,” as opposed to actual “equality?” I’m primarily working with 2 methods:
1) Edit Distance. This is the minimum number of operations (add a letter, change a letter, swap a letter, remove a letter, etc.) that are required to change String A into String B. There are a few ways of doing this, all of which are variations on the theme. Levenshtein Distance, Damerau-Levenshtein Distance, and Optimal String Alignment are a few. The upside is that it’s an easy-to-understand number (we all know what an edit distance of 2 means). The downside is that it would rate the strings “Matt Good” and “Good, Matt” as being VERY different, when in fact most humans would say they’re fairly similar. There are ways around this, which complicate the algorithm somewhat. I haven’t implemented any of those changes (yet).
2) Something I came up with (which I probably didn’t really come up with) that I am calling “Shared Tokens,” for lack of a better term. The basic gist is that it tokenizes the names, figures out which are shared between the two strings, and then (wait for it) weights each shared token by some amount, inversely related to how frequently that token appears in the corpus. For example, in the business list scenario, you’ll probably see a lot of “Inc” or “Company” or “First” or “Hardware”, and they aren’t really that meaningful. But you won’t see a lot of “GraphXpress” (to use a made-up name), which is likely to be more meaningful if you see it shared.
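Here is a minimal sketch of both methods in Python. It is not the code I actually ran: the function names (`levenshtein`, `tokenize`, `token_weights`, `shared_token_score`) are made up for illustration, the edit distance is plain Levenshtein (no swaps), and the log-based weighting is just one reasonable choice for "inversely related to frequency."

```python
import math
import re
from collections import Counter


def levenshtein(a, b):
    """Plain Levenshtein distance: insert, delete, substitute (no swaps)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def tokenize(name):
    """Lowercase the name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def token_weights(names):
    """Weight each token by log(total names / names containing it).

    This is an assumed weighting: a token that appears in every name
    gets weight 0, and rarer tokens get larger weights.
    """
    doc_freq = Counter(tok for name in names for tok in tokenize(name))
    total = len(names)
    return {tok: math.log(total / count) for tok, count in doc_freq.items()}


def shared_token_score(a, b, weights):
    """Sum the weights of the tokens the two names have in common."""
    return sum(weights.get(tok, 0.0) for tok in tokenize(a) & tokenize(b))


# Hypothetical corpus, invented for this example.
names = [
    "Ace Hardware Inc",
    "Main Street Hardware Inc",
    "Downtown Hardware Company",
    "GraphXpress Inc",
    "GraphXpress Printing Company",
]
weights = token_weights(names)
```

With this toy corpus, "graphxpress" ends up weighted more heavily than "hardware" or "inc", and because tokens are compared as sets, "Matt Good" and "Good, Matt" tokenize identically even though their edit distance is large.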
To scale this up, you really do actually have to compare every entity to every other entity in your set. To put it lightly, that’s not awesome. I don’t see a way around it, so one thing I did to mitigate the problem was to take advantage of modern multicore processors, and divide up the task into smaller chunks, and spin them off into separate threads. This is, of course, a scaled-down version of the idea behind MapReduce that I accidentally reverse-engineered. It’s always a good feeling getting that confirmation that you’re on the right track, or at least that better minds than yours have walked the path before you. When I compared the single-thread version to the multithreaded version, I got a 2.9x performance improvement. That’s a big deal, friends. When you do this, however, you have to make sure you divide the task properly. You still have to end up comparing every entity to every other entity, and it’s easy to think you’re doing that when you aren’t. I’ll leave that to you to figure out how 🙂
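For the curious, here is one way the split could look, sketched in Python. This is an assumption-laden illustration, not my actual implementation: `pairs_for_worker` and `compare_all` are hypothetical names, and each worker takes interleaved rows of the "upper triangle" of pairs so that every unordered pair is handled exactly once. One caveat: in standard CPython, threads won’t speed up pure-Python CPU-bound scoring because of the GIL, so you would likely swap in `ProcessPoolExecutor` to see a real gain.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pairs_for_worker(n_items, worker, n_workers):
    """Yield every (i, j) with i < j whose row i belongs to this worker.

    Rows are interleaved (worker, worker + n_workers, ...) because early
    rows contain more pairs than late ones; interleaving balances the load.
    """
    for i in range(worker, n_items, n_workers):
        for j in range(i + 1, n_items):
            yield i, j


def compare_all(items, score, n_workers=4):
    """Score every unordered pair of items, split across n_workers tasks."""
    def run(worker):
        return [(i, j, score(items[i], items[j]))
                for i, j in pairs_for_worker(len(items), worker, n_workers)]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = list(pool.map(run, range(n_workers)))
    return [result for chunk in chunks for result in chunk]


# Sanity check: the chunks together cover each pair exactly once.
covered = sorted(pair
                 for worker in range(3)
                 for pair in pairs_for_worker(10, worker, 3))
all_pairs = list(combinations(range(10), 2))
```

The sanity check at the bottom is the important part: comparing `covered` against `all_pairs` is a cheap way to catch a split that silently skips or repeats comparisons.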
The other weird issue with this whole problem is that relational databases are awful at storing potential matches. They really aren’t built for the kind of non-hierarchical N:M relationships that this requires. Think about how you would model that: you could have an entity id column, and a matching entity id column. But the matching algorithms are symmetric (A matching B is the same as B matching A), so there’s no inherent order. You might end up with a record for 1,2 and another for 2,1, but they really mean the same thing. I need to do a bit more research into this; it’s possible that a NoSQL-style DB or graph DB would be better for this. For a number of reasons (not the least of which is that this is just a proof of concept), I’m just hacking the crap out of the relational model for now. Another something to put on the list.
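One common workaround within the relational model is to store each pair in a canonical order (smaller id first) and let the database enforce it. The sketch below uses Python’s built-in `sqlite3`; the table name `potential_match`, its columns, and the helper functions are all hypothetical, and this is not necessarily the hack I ended up with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE potential_match (
        entity_a INTEGER NOT NULL,
        entity_b INTEGER NOT NULL,
        score    REAL    NOT NULL,
        PRIMARY KEY (entity_a, entity_b),
        CHECK (entity_a < entity_b)  -- canonical order: smaller id first
    )
""")


def record_match(conn, id_1, id_2, score):
    """Store a match once, no matter which order the ids arrive in."""
    low, high = sorted((id_1, id_2))
    conn.execute(
        "INSERT OR REPLACE INTO potential_match VALUES (?, ?, ?)",
        (low, high, score),
    )


def matches_for(conn, entity_id):
    """Return the ids matched to entity_id, checking both columns."""
    rows = conn.execute(
        "SELECT entity_a, entity_b FROM potential_match "
        "WHERE entity_a = ? OR entity_b = ?",
        (entity_id, entity_id),
    )
    return sorted(b if a == entity_id else a for a, b in rows)


record_match(conn, 2, 1, 0.9)
record_match(conn, 1, 2, 0.9)  # same pair, reversed: replaces, no duplicate
record_match(conn, 2, 3, 0.4)
```

The cost of this design is that every lookup has to check both columns, which is part of why a graph DB still looks tempting.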
At any rate, I thought that this was a good introduction to get myself into some of the kinds of data analysis, “big data,” and “collective intelligence” topics I’ve been trying to break into lately. Any robot can build an app that tracks some data. The good work is doing “stuff with the data.”
“The Scream” is starting to look finished, but it has two more steps to go. Chase’s “Good Vibrations” tremolo pedal is starting to look like a Jamaican flag. Oops. I think it will still be cool. It needs text, and then it will be done.
Charleston is a pretty cool town. We rented bikes and cruised around on the beach, chasing seagulls, flying kites, and generally having a good time. Beaches in winter, who would have thought?
So I was sitting at a fairly kitschy mall-adjacent Caribbean-themed “seafood” loud hookupy barstaurant tonight, trying to amuse myself alone on business in Orlando, and suddenly, some dude started playing steel drums:
I was also drinking wine, reading Barbara Kingsolver, and destroying a Cuban sandwich. It was PREPOSTEROUS.
Thanks to Jeff Potter of Cooking For Geeks, I’m building my own DIY Sous Vide rig with a thermostatic controller and a cheap off-the-shelf slow cooker.
Mains wiring is scary as heck though.
Work in progress. I have to get a couple more parts to make sure it’s safe before I’ll plug it in, but most of the guts are there. Needless to say I’ll be doing a lot of voltage checking before I touch this thing.
I think first up will be some sous vide salmon, then hit it hot and fast in a cast iron skillet for color and Maillard reactions on the outside. Should be one heck of a good time. More to come.
I was scrounging around yesterday, looking for something I could build/make… when I realized that with but one quick trip to Radio Shack, I could build the neato PWM guitar effect pedal that Collin Cunningham video demo’d for Make Magazine. So I did:
Here’s the guts:
It’s a really cool pedal. Way different animal than other guitar effects. It makes sounds that most closely resemble a synthesizer. Hope to use it for some fake-synth parts on some of my tunes in the future. It’s kinda “glitchy” though, which I think is by design. But every now and then the pedal does something weird, and I can’t tell if there’s something wrong with it, or if that’s just the way the weird pedal sounds. For instance, it doesn’t always pick up every note, and I don’t know if that’s just how it works (it *is* glitchy) or if I’ve got something loose in there. I’ve also got some rhythmic clicking going on when the pedal is engaged but I’m not playing. I don’t think this is umm… desired behavior, but I checked my wiring and I think it looks good. It’s not awful, and since I mostly do recording, I can edit it out fairly easily, but it would be much nicer if it just didn’t happen in the first place. If I figure it out, I’ll update the post.
My tremolo pedal is done. This is my version of the Baja Trembulator, which I have christened “Good Vibrations.” A little overspray on the text stencil, but that’s okay. I’m not very good at this kind of thing, so I’ll take what I can get.
It. Sounds. Awesome.
Homemade “Vactrol” using a red LED and a photoresistor in some heat-shrink tubing. Very fun. Have to build one of these for Chase now.
My newest completed pedal, based on Beavis Audio Research’s Trotsky Drive. Real simple circuit. I didn’t even have the special Russian transistor Beavis used, but it still sounds cool. But I kept with the Soviet theme, and named it after one of my favorite composers, Dmitri Shostakovich. Though, if good ol’ Shosty were really distilled into pedal form, it would scream one or two whole hecks of a lot more than this one does.
Oh yeah, you’re not seeing things. That is an electrical junction box you see there.
Some of you may remember the Song-A-Week project I had going for a while. Writing and demoing a new song (almost) every week was a great experiment and left me with like… 60? some odd songs to pick through in various states of completion – mostly really rough, but some a lot more fleshed out. No more digging around for another song to fill out an album for me, that’s for sure.
Unfortunately, it also left me overwhelmed trying to polish some of these really rough scraps to perfection. I did work here and there on the songs, but there were so many, and I had more good ideas than good plans. I moved away from all the drummers I knew. In short, the songs sat around for a while.
It’s axiomatic that not making music is less fun than making music, so I eventually decided not to let the perfect be the enemy of the good and to put a song “out there” again.
Here’s a version of the song “Maybe Not” from week 23. You may remember the original demo – it’s not as bad as I thought it might be. Before I left Nashville, I got Seth Rouch and Ian McDermott to play drums and bass respectively on it, which is why they sound great. Take a listen:
Maybe Not by Conrad is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.
(That means you can copy the song and do whatever you want to it for noncommercial use.)
Sure the mix was done in just a couple hours and there’s a few edits I still might like to make. And yes, it’s probably quieter than a lot of your other music. I just don’t feel like squashing the heck out of my songs anymore. If it’s too quiet, turn it up.
It felt good to work on my own music again, but it reminded me that I really enjoy working on someone else’s music much more (as long as I believe in the music/person enough). I think that will have to be next on my list… Need to find some triangle musicians (and about twenty more hours a week of free time).
I got myself some ribbon mics the other day. Here’s a few clips of the first time I used them for any actual recording.
I played my junky old Sorento guitar through my Fender Prosonic on the dirty channel with a heavy dose of amp reverb, and stuck an SM57 about a foot away through one of the Octopre preamps, and used the Fat Head II on the other side of the room, pointing at the amp, and ran that through my Seventh Circle Audio A12 preamp. So we have a close/room mix. For the record, there is a slight EQ on the Fat Head tracks, mostly just a rumble (read: Heating Noise) reducer:
I forgot this was on them until I bounced out most of the tracks, so tuff luck- there’s EQ on them.
First up is a section where I’m playing a crappy guitar solo. I have no chops. I have clips of the 57 alone, FH alone, mixed, and then in the context of the song. Remember the placement is VERY different on the mics.
Fat Head Solo Guitar
57 + FH Solo Guitar
In Context 57 + FH Guitar Solo
Then I’ve got the same thing for a crunchy section.
SM57 Crunch Guitar
FHII Crunch Guitar
57 + FH Crunch
In Context 57 + FH Crunch Guitar
I really like the mic. I’m gonna try to build some portable cheap acoustic panels out of rigid fiberglass insulation to improve the room sound somewhat… I am also planning on swapping out the stock transformers for some Lundahls. They sell them this way on their website, but I can order them and do the mod myself for less money. And I’m going to mess around with my Little Labs IBP plugin for my UA card to see if that makes the mics play any nicer together, but even as is, with minimal fuss, I think the combination of 57 + ribbon adds a nice beefiness to the texture.
I also cut some demo vocals with the thing, and they sound pretty neat too. If I determine the clips are suitable (read: minimally embarrassing), I’ll have some clips of that as well.