Matthew C Good : Musician, Software Engineer, Hobbyist.

Posts Tagged computers

Doing “Stuff With Data”

I started thinking about an interesting problem several months back, and I thought I’d write up a huge, nerdy blog post, in case anyone is thinking about similar issues. But don’t say I didn’t warn you – it is both huge and nerdy.

Let’s forgo the details and just talk about a simple hypothetical scenario. You’ve got a list of local businesses, maybe cobbled together from various sources, like the Yellow Pages, Yelp, and a web crawl you did yourself. Because your data is from a few different sources, you probably have some interesting relationships inside it.

I see two main relationships:
1) You have a couple entries for the local hardware store. This is duplicate data, but let’s say you want to keep it around anyway. You could, for example, find cases where two sources agree on an entry but a third is missing it. And that might be interesting.
2) There could be multiple entries for different locations of the same business, like a Starbucks. This isn’t really duplicated data, but identifying the relationship between the two (or more) locations could be meaningful, depending on your domain and what you’re trying to do.

Lately, I’ve been working on finding these kinds of relationships in a similar dataset in my spare time. I’m abstracting a lot out from the specific problem and dataset I was working on, but the basic premise is the same, so just roll with it.

I think there are really only two ways to approach this. One is to look for similar names. This would work for both situations 1 & 2 above. Keep in mind, though, that your data might not be very clean. The other is to look for specific shared data, like the same phone number, or the same address, or some other bit of information you may have. Depending on what data is shared, it may work only for situation 1, or it may work with both.
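For the shared-data route, most of the work is normalizing each field before comparing it. Here’s a minimal sketch of what that could look like for phone numbers, assuming each entry is a plain dict with a “phone” field (the field name and the US-centric country-code handling are just illustrative assumptions):

```python
import re
from collections import defaultdict

def normalize_phone(raw):
    """Strip everything but digits so '(555) 123-4567' and '555.123.4567' compare equal."""
    digits = re.sub(r"\D", "", raw)
    # Drop a leading US country code if present (an assumption about this data).
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def group_by_phone(entries):
    """Bucket entries by normalized phone number; buckets with 2+ entries are candidate matches."""
    groups = defaultdict(list)
    for entry in entries:  # each entry is assumed to be a dict with a 'phone' field
        key = normalize_phone(entry.get("phone", ""))
        if key:
            groups[key].append(entry)
    return {phone: matches for phone, matches in groups.items() if len(matches) > 1}
```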

Checking for the same phone number is pretty easy. But if you’re looking for multiple locations of the same business, your best hope might just be looking for similar names. Just how, exactly, do you rate strings based on their “similarness,” as opposed to actual “equality?” I’m primarily working with 2 methods:

1) Edit Distance. This is the minimum number of operations (add a letter, change a letter, swap a letter, remove a letter, etc) that are required to change String A into String B. There are a few ways of doing this, all of which are variations on the theme. Levenshtein Distance, Damerau-Levenshtein Distance, and Optimal String Alignment are a few. The upside is that it’s an easy-to-understand number (we all know what an edit distance of 2 means). The downside is that it would rate the strings “Matt Good” and “Good, Matt” as being VERY different, when in fact most humans would say they’re fairly similar. There are ways around this, which complicate the algorithm somewhat. I haven’t implemented any of those changes (yet).
2) Something I came up with (which I probably didn’t really come up with) that I am calling “Shared Tokens,” for lack of a better term. The basic gist is that it tokenizes the names, figures out which tokens are shared between the two strings, and then (wait for it) weights each shared token by some amount, inversely related to how frequently that token appears in the corpus. For example, in the business list scenario, you’ll probably see a lot of “Inc” or “Company” or “First” or “Hardware”, and they aren’t really that meaningful. But you won’t see a lot of “GraphXpress” (to use a made-up name), which is likely to be more meaningful if you see it shared. A rough sketch of both methods follows this list.
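To make both methods concrete, here’s a rough Python sketch of each. The edit_distance function is a plain Levenshtein implementation, and shared_token_score does the inverse-frequency weighting described above; the function names and the simple whitespace tokenization are mine, not anything canonical:

```python
from collections import Counter

def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, and substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def token_counts(names):
    """Count how often each token shows up across the whole corpus of names."""
    counts = Counter()
    for name in names:
        counts.update(name.lower().split())
    return counts

def shared_token_score(a, b, counts):
    """Sum inverse corpus frequencies of the tokens the two names share."""
    shared = set(a.lower().split()) & set(b.lower().split())
    # 'Inc' or 'Hardware' (very common) contributes little; 'GraphXpress' (rare) contributes a lot.
    return sum(1.0 / counts[tok] for tok in shared if counts[tok])

# edit_distance("Matt Good", "Good, Matt") is large even though a human would
# call the names similar; shared_token_score handles that case much better.
```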

To scale this up, you really do have to compare every entity to every other entity in your set. To put it lightly, that’s not awesome. I don’t see a way around it, so one thing I did to mitigate the problem was to take advantage of modern multicore processors: divide the task into smaller chunks and spin them off into separate threads. This is, of course, a scaled-down version of the idea behind MapReduce that I accidentally reverse-engineered. It’s always a good feeling getting that confirmation that you’re on the right track, or at least that better minds than yours have walked the path before you. When I compared the single-threaded version to the multithreaded version, I got a 2.9x performance improvement. That’s a big deal, friends. When you do this, however, you have to make sure you divide the task properly. You still have to end up comparing every entity to every other entity, and it’s easy to think you’re doing that when you actually aren’t. I’ll leave it to you to figure out how 🙂
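Here’s a simplified sketch of that chunked, parallel comparison, reusing the edit_distance function from the sketch above (any scoring function would slot in). It splits the outer loop of the all-pairs comparison across worker processes; the chunk sizing, threshold, and worker count are made-up values:

```python
from multiprocessing import Pool

def compare_chunk(args):
    """Compare every entity in one chunk of outer indices against everything after it."""
    start, end, names, threshold = args
    matches = []
    for i in range(start, end):
        for j in range(i + 1, len(names)):  # j > i, so each pair gets scored exactly once
            score = edit_distance(names[i], names[j])
            if score <= threshold:
                matches.append((i, j, score))
    return matches

def find_matches(names, threshold=2, workers=4):
    """Split the outer loop of the all-pairs comparison across worker processes."""
    step = max(1, len(names) // workers)
    chunks = [(start, min(start + step, len(names)), names, threshold)
              for start in range(0, len(names), step)]
    with Pool(workers) as pool:
        results = pool.map(compare_chunk, chunks)
    return [match for chunk in results for match in chunk]
```

Chunking by the outer index like this still covers every pair exactly once (the inner loop only looks at indices greater than i), but the early chunks end up doing far more comparisons than the later ones, so a smarter split would balance the work better.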

The other weird issue with this whole problem is that relational databases are awful at storing potential matches. They really aren’t built for the kind of non-hierarchical N:M relationships that this requires. Think about how you would model it: you could have an entity id column and a matching entity id column. But the matching relationship is symmetric, so there’s no inherent order. You might end up with a record for 1,2 and another for 2,1, but they really mean the same thing. I need to do a bit more research into this; it’s possible that a NoSQL-style DB or graph DB would be better for it. For a number of reasons (not the least of which is that this is just a proof of concept), I’m just hacking the crap out of the relational model for now. Another something to put on the list.
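One straightforward hack within the relational model is to canonicalize the pair before storing it, so (1, 2) and (2, 1) always land on the same row. A sketch, assuming SQLite-flavored SQL and a hypothetical entity_match table:

```python
def canonical_pair(id_a, id_b):
    """Always put the smaller id first so (1, 2) and (2, 1) collapse into one row."""
    return (id_a, id_b) if id_a < id_b else (id_b, id_a)

# Hypothetical table, one row per unordered pair, enforced by the primary key:
#   CREATE TABLE entity_match (
#       entity_id_low  INTEGER NOT NULL,
#       entity_id_high INTEGER NOT NULL,
#       score          REAL,
#       PRIMARY KEY (entity_id_low, entity_id_high)
#   );
def insert_match(cursor, id_a, id_b, score):
    low, high = canonical_pair(id_a, id_b)
    cursor.execute(
        "INSERT OR IGNORE INTO entity_match (entity_id_low, entity_id_high, score) "
        "VALUES (?, ?, ?)",
        (low, high, score),
    )
```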

At any rate, I thought that this was a good introduction to get myself into some of the kinds of data analysis, “big data,” and “collective intelligence” topics I’ve been trying to break into lately. Any robot can build an app that tracks some data. The good work is doing “stuff with the data.”

FishApp – Water Change Detection

The other day, I went to the mall with Kristin.  I usually finish up quicker than her, and this time being in possession of a shiny tiny netbook, I was able to code and tweak the water change detection algorithm for the FishApp while sitting on a bench outside of Macy’s.  I had a couple rather confused onlookers.  I may or may not be on a “do-not-fly” list now.

To refresh your memory, the FishApp keeps track of water changes, and gives you a graph showing weighted-age values for the water in your fish tank.  This requires you to pay attention while you’re doing the water change, and to log in to the website and report how much water you changed when.  Well, I want to have the FishApp sense, measure, and publish water changes for me.  To those ends, I designed a system that can measure the water level in the tank over time, report it to a computer, figure out when and how much water was changed, and report that back to the main fishapp web application.  The measurement is done using a Ping))) ultrasonic rangefinder, and data from that (and other sensors) is fed into a computer.

But just how is the computer supposed to figure out when a water change happened?  Its input is just a string of numbers, and it’s gotta be smart enough to filter out random noise from tank cleanings, frisky fish, the water filter starting and stopping for one reason or another, or any of a hundred other situations.  What to do?

If you’ve read the last post about the FishApp, you know where to start – smooth out the data.  To recap, I have the sensor set to read the water level every half-second, and report it to the computer.  The raw data is pretty noisy:

but if we make new data points from the median of  every 20 samples, things smooth out pretty quickly:

The system should be able to handle this muuuch easier.

The algorithm works by keeping a queue of recent (smoothed) samples.  By comparing the oldest sample with the newest sample, you can get a “slope” value.  On the graph above, for example, it doesn’t take very many samples until the oldest will be just over 600 but the most recent will be around 800 or so, and you’ll have a large positive slope.  Once the algorithm sees this large positive slope (above a certain trigger value), it knows that a water change is beginning, and notes the water level beforehand.  At some point in the process of a water change, I start filling the tank up again, and we see a large negative slope (around 55-60 on the graph above).  The algorithm notes the turnaround and the minimum water level.  If the absolute value of the slope stays low enough for long enough, the algorithm detects a steady state, and calls it the end of the water change.  The “steps” you see on the graph are because I do my water changes bucket-by-bucket, but because the algorithm is using a slope from a queue maybe 10 samples long, it already does a pretty good job of smoothing these steps out, and not getting too confused.
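Here’s a stripped-down sketch of that state machine in Python. The parameter values are placeholders, and the level_before / level_min bookkeeping is only approximate; the real implementation surely differs in the details:

```python
from collections import deque

def detect_water_change(samples, queue_len=10, trigger=5.0, tolerance=1.0, steady_needed=30):
    """Walk the smoothed samples and return (level_before, level_min, end_index) for a change.

    Bigger readings mean lower water (the sensor measures distance down to the surface),
    so draining shows up as a large positive slope and refilling as a large negative one.
    """
    window = deque(maxlen=queue_len)
    state = "idle"
    level_before = level_min = None
    steady_count = 0

    for idx, value in enumerate(samples):
        window.append(value)
        if len(window) < queue_len:
            continue
        slope = window[-1] - window[0]  # newest minus oldest

        if state == "idle" and slope > trigger:
            state = "draining"
            level_before = min(window)   # the "before" level: smallest value in the queue
        elif state == "draining" and slope < -trigger:
            state = "filling"
            level_min = max(window)      # lowest water level seen (largest distance reading)
        elif state == "filling":
            if abs(slope) < tolerance:
                steady_count += 1
                if steady_count >= steady_needed:
                    return level_before, level_min, idx  # steady long enough: change complete
            else:
                steady_count = 0         # still sloshing around; keep waiting
    return None                          # no complete water change found
```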

After running the algorithm against the data above, it worked flawlessly.  Take a look at this graph:

The gray lines are where the important steps in the process were detected.  For example, the first gray line @ x=10 is where the algorithm first noticed we were emptying water out of the tank.  It looks late – and it is, but that’s merely a consequence of using a queue of 10 samples to generate the slope.  The actual “before” water level value it uses is not the one at the gray line, but the minimum one in the queue – which is correct (enough for rock and roll).  Then, at x=54, the algorithm detected a large negative slope and decided that we were filling the tank up again.  It started looking for the start of a steady state, which it found at x=121, and it stayed steady long enough that at x=150, the algorithm wrapped up and decided that we’ve done a water change.  NIFTY!

If you think about what is going on here, it’s really calculus under the covers.  The slope value is the derivative of the function, and we look for the points where that derivative changes sign.  But the function we’re using isn’t perfect and isn’t continuous, so we’ve got to build in a little extra wonkitude-resistance.  The algorithm has inputs for the size of the queue, the trigger value for large positive/negative slopes, a tolerance value to ignore noise during steady states, and the number of samples to wait during a steady state before deciding we’re really done with the water change – so in theory, this algorithm could be adapted and tuned for a wide range of input sources with little-to-no modification.

Right now, the algorithm is coded in Python, but I think it might even be possible to do the crunching on the Arduino.  If I did that and wired the Arduino up to an ethernet shield, I could ALMOST eliminate the computer altogether.  I still need the computer to run the webcam, however, so there’s no point in trying to run this code on the chip or anything like that.  But I think it would be possible in theory, if you don’t want the cam server running.

What A Water Change Looks Like to the FishApp

One of the goals of the FishApp is to have automatic water change detection available in phase 1. In order to do this, I have a Ping))) ultrasonic distance sensor pointed down at my fishtank. This little guy works by producing a sound above human hearing range, and listening for it to bounce back. If you know the speed of sound, you can calculate how far away the object was that caused the reflection. The sensor I am using is mounted above the tank in a piece of 1/4 inch wood to help shield it from the moisture, and samples the water level at predetermined intervals, sending its data over a serial connection to a host computer via an Arduino controller.
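For reference, the echo-time-to-distance conversion is just the round-trip time multiplied by the speed of sound and halved. A tiny sketch, assuming the reading arrives as a round-trip time in microseconds (the actual setup may well report a level directly):

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # roughly 343 m/s at room temperature

def echo_to_distance_cm(round_trip_us):
    """Convert a round-trip echo time in microseconds to distance in centimeters."""
    return (round_trip_us * SPEED_OF_SOUND_CM_PER_US) / 2.0  # halved: the sound goes out and back
```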

The host computer gets this stream of numbers, and has to have some way of determining when I’ve done a water change, and how much water I’ve changed. I got the feeling that random variation (noise) in the data from the sensor could throw off whatever method I use to compute all of this, so I needed to figure out exactly what the data looks like coming in to the computer, preferably saving it so I can test my algorithms against it without having to do a water change after each revision – that would be a heck of a lot worse than just waiting for the code to recompile.
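The logging side can be as simple as reading lines off the serial port and appending them to a file. A sketch using pyserial, with the port name, baud rate, and file name as stand-in values:

```python
import serial  # pyserial

def log_raw_readings(port="/dev/ttyUSB0", baud=9600, outfile="water_level_raw.txt"):
    """Read newline-terminated readings from the Arduino and append them to a file."""
    with serial.Serial(port, baud, timeout=2) as conn, open(outfile, "a") as out:
        while True:
            line = conn.readline().decode("ascii", errors="ignore").strip()
            if line:
                out.write(line + "\n")
                out.flush()  # keep the file usable even if the script gets killed mid-run
```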

So I fired up the Arduino and Python, did a water change, and saved the raw data from the Ping))) sensor to a file. Without further ado, this is what a water change looks like to a computer:

Pretty cool, huh?  When I do a water change, I siphon water out of the tank into a 3-gallon bucket, and empty it a bucket at a time, until I’ve taken out as much as I like, and then I re-fill the tank 3 gallons at a time.  You end up with the very visible “steps” on the graph.  While the data looks mostly consistent, you can see some wonkiness in some of the steps – which almost looks like thick lines.  The sensor isn’t quite reading the distance regularly in this case.  This looks to me like the kind of data which could really throw off my detection algorithm.  So I had the bright idea of taking a median of 5 samples for each data point and using that series for detection.  Here’s what a median-of-five graph looks like:

Median of Five Graph

You should notice two things: 1) there are fewer data points by a factor of five, because of the median, and 2) the curve is smoother.  I could prove this by computing the standard deviation on some of those trouble spots from above, but I don’t think it’s necessary: it’s plain to see when graphed.  The data could still be better, though – look at the jaggies around 100.  We’ve got plenty of data, so we should be able to create a very smooth line and still have enough resolution to see each step, etc.
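The smoothing itself is just a matter of collapsing each run of N raw readings into their median. A minimal sketch, with the window size as a parameter so different values can be tried:

```python
from statistics import median

def median_smooth(samples, window=5):
    """Collapse each run of `window` raw readings into a single median value."""
    return [median(samples[i:i + window]) for i in range(0, len(samples), window)]
```

Bumping the window parameter from 5 up to 10 or 20 is all it takes to generate the smoother series below.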

I’ll spare you all the gory details, but suffice it to say that the larger the median used, the better.  A median of 10 was better, but still not good.  A median of 15 was nearly perfect, but there was still just a little weirdness.  A median of 20 was perfect:

Median of 20 graph

That’s more like it.  We’ve still got enough data there to see the water change in detail, but smoothed out all the ugliness that could throw off the computer.  Cool stuff.

More details on the detection algorithm’s implementation to follow.  Charts generated with my new favorite tool, Python.

Why My Next Laptop Will Be A Netbook

@joshuacaddell asked me on Twitter today why I want a netbook so badly. The answer is too much for 140 characters, so here goes…

I think a lot of people don’t put nearly enough thought into how they use a computer before they purchase one.  This is especially true of laptops, since the ways people use them are (or at least potentially are) so varied and different.  They just buy something kinda flashy looking that is big and relatively powerful, because they want the kitchen sink.  The Swiss Army Laptop, as it were.  Of course, doing all things equally well sometimes means doing all things equally poorly.

Having been with my current ‘top for nearly 8 years now (and not being a very “typical” computer user), I have a pretty good idea of how I use computers.  Specifically, I like to take them places.  I crave long battery life.  I have a high-powered desktop to handle the really crazy stuff.  I have been doing a lot of web development rather than Win32 stuff.  I also have recently been very interested in microcontroller programming on platforms like the Arduino.  I don’t use MS Office much, and think I could get by pretty well with the Google Gears versions of Google Docs in many cases.

The particular netbook I’m interested in has a ridiculous battery life.  They claim up to 9.5 hours, but I’ve seen people citing real-world usage of around 6-8 hours, which is very impressive.  It’s sufficiently light, and you don’t have to carry around a bunch of crap to make it work.  (Don’t be fooled into thinking that all netbooks have great battery life.  Many are terrible.  I personally think the lackluster battery life of most netbooks defeats their purpose.)  The web dev work that I’d be doing on it is (for the most part) scripting languages and light database stuff (LAMP), which this unit could handle as a development machine.  Any ASP.NET or C# development would mostly happen off this computer, but I’ll probably do some basic edits to some stuff on this machine anyway in a generic text editor (it is occasionally handy to not have to rely on the wizards for everything).

I once heard somebody say that the easiest way to drastically improve the quality of your photographs was to carry a camera around with you everywhere.  The quality of the camera doesn’t matter; you’ll get better shots if you’re the only guy around with a camera when something cool happens.  The same principle applies here for computers, too.  I plan on making a ruggedized & modularized Arduino to go with the netbook to carry around so that virtually anywhere I am, I can do really cool computing and interface with the “real world” in literally a few minutes.  That’s roughly equivalent to a super-nerdy MacGyver.  And that’s what I want.

Plus, I’m going to be building a powerhouse beast of a recording/programming machine for home soon anyway.

P.S. I don’t have a data plan for my phone, and probably won’t get one for a while at least.  So I will have to rely on WiFi for my internet connectivity in the short term.  Ergo, no service charge.  But that’s okay, I can do what I need to offline in many cases.