Doing “Stuff With Data”
I started thinking about an interesting problem several months back, and I thought I’d write up a huge, nerdy blog post, in case anyone is thinking about similar issues. But don’t say I didn’t warn you – it is both huge and nerdy.
Let’s forgo the details and just talk about a simple hypothetical scenario. You’ve got a list of local businesses, maybe cobbled together from various sources, like the yellow pages, yelp, and a web crawl you did yourself. Because your data is from a few different sources, you probably have some interesting relationships inside it.
I see two main relationships:
1) You have a couple entries for the local hardware store. This is duplicate data, but lets say you want to keep it around anyway. You could find cases where two sources agree, but one is missing, for example. And that might be interesting.
2) There could be multiple entries for different locations of the same business, like a Starbucks. This isn’t really duplicated data, but identifying the relationship between the two (or more) locations could be meaningful, depending on your domain and what you’re trying to do.
Lately, I’ve been working on finding these kinds of relationships in a similar dataset in my spare time. I’m abstracting a lot out from the specific problem and dataset I was working on, but the basic premise is the same, so just roll with it.
I think there are really only two ways to approach this. One is to look for similar names. This would work for both situations 1 & 2 above. Keep in mind, though, that your data might not be very clean. The other is to look for specific shared data, like the same phone number, or the same address, or some other bit of information you may have. Depending on what data is shared, it may work only for situation 1, or it may work with both.
Checking for the same phone number is pretty easy. But if you’re looking for multiple locations of the same business, your best hope might just be looking for similar names. Just how, exactly, do you rate strings based on their “similarness,” as opposed to actual “equality?” I’m primarily working with 2 methods:
1) Edit Distance. This is the minimum number of operations (add a letter, change a letter, swap a letter, remove a letter, etc) that are required to change String A into String B. There are a few ways of doing this, all of which are variations on the theme. Levenshtein Distance, Damerau-Levenshtein Distance, and Optimal String Alignment are a few. The upside is that it’s an easy-to-understand number (we all know what an edit distance of 2 means). The downside is that it would rate the strings “Matt Good” and “Good, Matt” as being VERY different, when in fact most humans would say they’re fairly similar. There are ways around this, which complicate the algorithm somewhat. I haven’t implemented any of those changes (yet).
2) Something I came up with (which I probably didn’t really come up with) that I am calling “Shared Tokens,” for lack of a better term. The basic gist is that it tokenizes the names, figures out which are shared between the two strings, and then (wait for it) weights each shared token by some amount, inversely related to how frequently that token appears in the corpus. For example, in the business list scenario, you’ll probably see a lot of “Inc” or “Company” or “First” or “Hardware”, and they aren’t really that meaningful. But you won’t see a lot of “GraphXpress” (to use a made-up name), which is likely to be more meaningful if you see it shared.
To scale this up, you really do actually have to compare every entity to every other entity in your set. To put it lightly, that’s not awesome. I don’t see a way around it, so one thing I did to mitigate the problem was to take advantage of modern multicore processors, and divide up the task into smaller chunks, and spin them off into separate threads. This is, of course, a scaled-down version of the idea behind MapReducethat I accidentally reverse-engineered. It’s always a good feeling getting that confirmation that you’re on the right track, or at least that better minds than yours have walked the path before you. When I compared the single-thread version to the multithreaded version, I got a 2.9x performance improvement. That’s a big deal, friends. When you do this, however, you have to make sure you divide the task properly. You still have to end up comparing every entity to every other entity, and it’s easy to think you’re doing that, but not. I’ll leave that to you to figure out how
The other weird issue with this whole problem is that relational databases are awful at storing potential matches. They really aren’t built for the kind of non-hierarchical N:M relationships that this requires. Think about how you would model that: you could have an entity id column, and a matching entity id column. But the matching algorithms are transitive, so there’s no inherent order. You might end up with a record for 1,2 and another for 2,1, but they really mean the same thing. I need to do a bit more research into this, its possible that a NoSQL-style DB or graph DB would be better for this. For a number of reasons (not the least of which is that this is just a proof of concept), I’m just hacking the crap out of the relational model for now. Another something to put on the list.
At any rate, I thought that this was a good introduction to get myself into some of the kinds of data analysis, “big data,” and “collective intelligence” topics I’ve been trying to break into lately. Any robot can build an app that tracks some data. The good work is doing “stuff with the data.”
FishApp – Water Change Detection
The other day, I went to the mall with Kristin.Â I usually finish up quicker than her, and this time being in possession of a shiny tiny netbook, I was able to code and tweak the water change detection algorithm for the FishApp while sitting on a bench outside of Macy’s.Â I had a couple rather confused onlookers.Â I may or may not be on a “do-not-fly” list now.
To refresh your memory, the FishApp keeps track of water changes, and gives you a graph showing weighted-age values for the water in your fish tank.Â This requires you to pay attention while you’re doing the water change, and to log in to the website and report how much water you changed when.Â Well, I want to have the FishApp sense, measure, and publish water changes for me.Â To those ends, I designed a system that can measure the water level in the tank over time, report it to a computer, figure out when and how much water was changed, and report that back to the main fishapp web application.Â The measurement is done using a Ping))) ultrasonic rangefinder, and data from that (and other sensors) is fed into a computer.
But just how is the computer supposed to figure out when a water change happened?Â It’s input is just a string of numbers, and it’s gotta be smart enough to filter out random noise from tank cleanings, frisky fish, the water filter starting and stopping for one reason or another, or any of a hundred other situations.Â What to do?
If you’ve read the last post about the FishApp, you know where to start – smooth out the data.Â To recap, I have the sensor set to read the water level every half-second, and report it to the computer.Â The raw data is pretty noisy:
but if we make new data points from the median ofÂ every 20 samples, things smooth out pretty quickly:
The system should be able to handle this muuuch easier.
The algorithm works by keeping a queue of recent (smoothed) samples.Â By comparing the oldest sample with the newest sample, you can get a “slope” value.Â Â On the graph above, for example, it doesn’t take very many samples until the oldest will be just over 600 but the most recent will be around 800 or so, and you’ll have a large positive slope.Â Once the algorithm sees this large positive slope (above a certain trigger value), it knows that a water change is beginning, and notes the water level beforehand.Â At some point in the process of a water change, I start filing the tank up again, and we see a large negative slope (around 55-60 on the graph above).Â The algorithm notes the capitulation and the minimum water level.Â If the absolute value of the slope stays low enough for long enough, the algorithm detects a steady state, and calls it the end of the water change.Â The “steps” you see on the graph are because I do my water changes bucket-by-bucket, but because the algorithm is using a slope from a queue maybe 10 samples long, it already does a pretty good job of smoothing these steps out, and not getting too confused.
After running the algorithm against the data above, it worked flawlessly.Â Take a look at this graph:
The gray lines are where the important steps in the process were detected.Â For example, the first gray line @ x=10 is where the algorithm first noticed we were emptying water out of the tank.Â It looks late – and it is, but that’s merely a consequence of using a queue of 10 samples to generate the slope.Â The actual “before” water level value it uses is not the one at the gray line, but the minimum one in the queue – which is correct (enough for rock and roll).Â Then, at x=54, the algorithm detected a large negative slope and decided that we were filling the tank up again.Â It started looking for the start of a steady state, which it found at x=121, and it stayed steady long enough that at x=150, the algorithm wrapped up and decided that we’ve done a water change.Â NIFTY!
If you think about what is going on here, it’s really calculus, under the covers.Â The slope value is the derivative of the function, and we look for the points that derivative changes sign.Â But the function we’re using isn’t perfect, and isn’t continuous, and so we’ve got to build in a little extra wonkitude-resistance.Â The algorithm has inputs for the size of the queue, the trigger value for large positive/negative slopes, a tolerance value to ignore noise during steady states, and the number of samples to go during a steady state before deciding we’re really done with the water change – so in theory, this algorithm could be adapted and tuned for a wide range of input sources with little-to-no modification.
Right now, the algorithm is coded in python, but I think it might even be possible to do the crunching on the Arduino.Â If I did that and wired the Arduino up to an ethernet shield, I could ALMOST eliminate the computer altogether.Â I still need the computer to run the webcam, however, so there’s no point in trying to run this code on the chip or anything like that.Â But I think it would be possible in theory, if you don’t want the cam server running.
What A Water Change Looks Like to the FishApp
One of the goals of the FishApp is to have automatic water change detection available in phase 1. In order to do this, I have a Ping))) ultrasonic distance sensor pointed down at my fishtank. This little guy works by producing a sound above human hearing range, and listening for it to bounce back. If you know the speed of sound, you can calculate how far away the object was that caused the reflection. The sensor I am using is mounted above the tank in a piece of 1/4 inch wood to help shield it from the moisture, and samples the water level at predetermined intervals, sending its data over a serial connection to a host computer via an Arduino controller.
The host computer gets this stream of numbers, and has to have some way of determining when I’ve done a water change, and how much water I’ve changed. I got the feeling that random variation (noise) in the data from the sensor could throw off whatever method I use to compute all of this, so I needed to figure out exactly what the data looks like coming in to the computer, preferably saving it so I can test my algorithms against it without having to do a water change after each revision – that would be a heck of a lot worse than just waiting for the code to recompile.
So I fired up Arduino and Python did a water change, saving the raw data from the Ping))) sensor to a file. Without further ado, this is what a water change looks like to a computer:
Pretty cool, huh?Â When I do a water change, I siphon water out of the tank into a 3 gallon bucket, and empty it a bucket at a time, until I’ve taken out as much as I like, and then I re-fill the think 3 gallons at a time.Â You end up with the very visible “steps” on the graph.Â While the data looks mostly consistent, you can see some wonkiness in some of the steps – which almost looks like thick lines.Â The sensor isn’t quite reading the distance regularly in this case.Â This looks to me like the kind of data which could really throw of my detection algorithm.Â So I had the bright idea of taking a median of 5 samples for each data point and using that series for detection.Â Here’s what a median-of-five graph looks like:
You should notice two things: 1) there are fewer datapoints by an order of five, because of the median, and 2) the curve is smoother.Â I could prove this by computing the standard deviation on some of those trouble spots from above, but I don’t think it’s necessary: it’s plain to see when graphed.Â The data could still be better though – look at the jaggies around 100.Â We’ve got plenty of data, so we should be able to create a very smooth line and still have enough resolution to see each step, etc.
I’ll spare you all the gory details, but suffice it to say that the larger the median used, the better.Â A median of 10 was better, but still not good.Â A median of 15 was nearly perfect, but there was still just a little weirdness.Â A median of 20 was perfect:
That’s more like it.Â We’ve still got enough data there to see the water change in detail, but smoothed out all the ugliness that could throw off the computer.Â Cool stuff.
More details on the detection algorithm’s implementation to follow.Â Charts generated with my new favorite tool, Python.
FishApp Update – It’s Doing Stuff
It looks like the FishApp is already providing me with some really neat and useful insights into my aquarium.Â I recently added the ability to overlay and toggle multiple series on the graphs it displays, to make it easy to see if one parameter has an effect on another.Â Take a look at this snapshot I took today:
This chart shows the average age (in days) of the water in my tank compared with the levels of Nitrate I measure using an aquarium test kit.Â I compute the water age using information I record about water changes I do to my tank (a more detailed explanation can be found in the original FishApp post), and Nitrate is a mildly-bad chemical the can build up in your tank over time.Â It’s the end-product of the Nitrogen Cycle in most fish tanks, and can only be removed by water changes or chemical absorption (which some plants and special filter media can do).
At least that’s the theory.Â What this chart is showing me is that the theory seems to actually work out in practice, and that my tests are precise enough to actually be useful – always a good thing.Â Even though I only have five data points for the nitrate series (because I don’t always test as much as I should), it is easy to see the nitrate curve following the water age curve.Â They actually track pretty well, I think. You can see an ugly spike in the water age when I wasn’t paying enough attention to the tank, and the resulting high nitrate levels, which backed down after a series of regular water changes.Â When the water age started creeping back up again, so did Nitrate levels, and then both went down again after we moved from Tennessee to North Carolina (and changed out about 2/3 of the tank water for fresh in the process).Â Very cool.
Since nitrate is the last state that fish poo reaches in the nitrogen cycle, it would not surprise me if there were a delay between water age and its effects on nitrate levels.Â I don’t really have the data to support that right now, but I will try to be more diligent in my testing, and perhaps we can figure that out soon.
If there are any aquarium owners out there who are interested in the FishApp, you can sign up free and track your own fish tank in a similar fashion.Â Here’s a link to sign up.
Why My Next Laptop Will Be A Netbook
@joshuacaddell asked me why I wanted a netbook so bad on twitter today. The answer is too much for 140 characters, so here goes…
I think a lot of people don’t put nearly enough thought into how they use a computer before they purchase one.Â This is especially true of laptops, since the ways people use them are (or at least potentially are) so varried and different.Â They just buy something kinda flashy looking that is big and relatively powerful, because they want the kitchen sink.Â The Swiss Army Laptop, as it were.Â Of course, doing all things equally well sometimes means doing all things equally poorly.
Having been with my current ‘top for nearly 8 years now (and not being a very “typical” computer user), I have a pretty good idea of how I use computers.Â Specifically, I like to take them places.Â I crave long battery life.Â I have a high-powered desktop to handle the really crazy stuff.Â I have been doing a lot of web development rather than Win32 stuff.Â I also have recently been very interested in microcontroller programming on platforms like the Arduino.Â I don’t use MS Office much, and think I could get by pretty well with the Google Gears versions of Google Docs in many cases.
The particular netbook I’m interested in has a ridiculous battery life.Â They claim up to 9.5 hours, but I’ve seen people citing real-world usage of around 6-8 hours, which is very impressive.Â It’s sufficiently light and you don’t have to carry around a bunch of crap to make it work. (Don’t be fooled into thinking that all netbooks have great battery life.Â Many are terrible.Â I personally think the lackluster battery life of most netbooks defeats their purposes.)Â The web dev work that I’d be doing on it is (for the most part) scripting languages and light database stuff (LAMP), which this unit could handle as a development machine.Â Any ASP.NET or C#Â development would primarily be off this computer, but I’ll probably do some basic edits to some stuff on this machine anyway in a generic text editor (it is occasionally handy to not have to rely on the Wizzards for everything).
I once heard somebody say that the easiest way to drastically improve the quality of your photographs was to carry a camera around with you everywhere.Â The quality doesn’t matter; you’ll get better shots if you’re the only guy around with a camera when something cool happens.Â The same principle applies here for computers, too.Â I plan on making a rugedized & modularized Arduino to go with the netbook to carry around so that virtually anywhere I am, I can do really cool computing and interface with the “real world” in literally a few minutes.Â That’s roughly equivalent to a super-nerdy MacGyver.Â And that’s what I want.
Plus, I’m going to be building a powerhouse beast of a recording/programming machine for home soon anyway.
P.S. I don’t have a data plan for my phone, and probably won’t get one for a while at least.Â So I will have to rely on WiFi for my internet connectivity in the short term.Â Ergo, no service charge.Â But that’s okay, I can do what I need to offline in many cases.
Introducing the “I Heart Heart Cat Cat Map Map App App”
Hello everyone! I’m rolling out a stupid little app for my greenville friends today. If you’re not in my Greenville circle of friends, you may need a little background here…
Actually, this origin story has become the stuff of legend over time, so the following may be totally inaccurate.
A while back, two of my friends Kate and Katie found this horrendous sweater-shirt thing at the Greenville thrift store. It’s clearly a sweater material, but it has short sleeves? What the heck is that all about? It has the letter “I” on it, two hearts, and two cats. The inexplicable monstrosity has been ever since referred to as the “I Heart Heart Cat Cat” shirt. We’ve all taken pictures wearing it – this has sortof become a hazing ceremony for joining our group of friends. It’s a right of passage thing.
Now that a lot of our friends are traveling the world and doing cool things, we decided to take some advice from the whole “traveling garden gnome” thing, and take pictures of us wearing the shirt in cool places, in front of historic and famous monuments, and the like. Of course, we need a Google maps mashup to keep track of all this, right? RIGHT??
Well, either way, I made one. I’m calling it the I Heart Heart Cat Cat Map Map App App.
Kristin looks awesome in that one.
From a technical standpoint, not too much interesting going on here. I did use MySQL’s spatial datatypes (only POINT, really) for storing the coordinates. This isn’t technically correct since MySQL only deals with planar points and lat/long are curvilinear, but I’ve been sorta geeking out over GIS technology lately, and I really just wanted to get used to the OpenGIS stuff. Any distance calculations I’ll have to do out of the db, but I don’t really think I’ll be doing much of those anyway.
The other kinda cool thing I did was set up an approval system. Instead of having users and junk to deal with (so not just ANYONE can post an image), The system accepts all images, thumbnails them and such, and sends an email with an approval code link. I check out the thumbnail in the email, decide if it’s cool, and click a link to either approve or reject it. Neato.
So there you have it. Check it out now and then, as we should have some new additions happening soon, coming from Turkey and London (hint hint kittermans).
Faulty .NET Certification Materials – Image.Save()
After taking a MCTS certification practice test and getting several questions wrong, I started delving into them to figure out what I needed to study. It was then that I found a question asking how to save an image as a JPEG. The test claims that there are two correct answers:
Bitmap bm = new Bitmap(100,100); bm.Save("picture.jpg", ImageFormat.Jpeg);
– and –
Bitmap bm = new Bitmap(100,100); bm.Save("picture.jpg");
The only difference between the two is the parameters used in the call to Bitmap.Save(). I didn’t check the last option, because I wasn’t sure what in which format the image would be saved. The test explanation says that “If you don’t specify an image format, Bitmap.Save defaults to the JPEG format.” Hmm, really?
I had found some dissenting opinions on the net, and the MSDN documentation clearly states that “If no encoder exists for the file format of the image, the Portable Network Graphics (PNG) encoder is used.”
Of course, there’s nothing like actually trying the thing out and seeing what it REALLY does. So built a console app that simply does the following:
Bitmap bm = new Bitmap(100, 100); bm.Save("test.jpg"); bm.Save("test.png");
Both files loaded fine in Windows Picture & Fax Viewer, but trying to load the “jpeg” file in Photoshop proves the png theory with the message “Could not complete your request because a JPEG marker segment length is too short (the file may be truncated or incomplete).” Or it may be because IT’S NOT A JPEG, guys. I thought I would post a clarification on the web, since I didn’t see anything when I did a Google search. Hopefully it will help out other poor confused souls, as well.
This error is from the official Microsoft Press study guide. Let’s hope the actual test doesn’t contain these sorts of inaccuracies… I’ve got a $125 test fee riding on it.
Design Geek in the Making
I’ll admit it, I’m starting to become a bit of one.
I’ve definitely had conversations where I will refuse to say “red” when a simple “FF0000″ will suffice.
Overheard in the Vicinity of My Cubicle
Overheard in the Vicinity of My Cubicle, this 30th of January, 2007 the year of our Lord:
“Hi, do you work on the Internets?”
The speaker of said atrocity noted that she, in fact, worked on the IntrAnets. And needed a copy machine. And apparently needed one of us Internets gurus to help her with it. Funny.
Breakfast at Work – Edit, read it again.
So at work we usually have a quick dev meeting on Fridays to talk about what we’ve been working on, what’s coming up, etc. We usually have donuts and/or bagels. Its nice.
Today, my sorta-boss dude Tom brought in… an electric griddle. And bacon. And pancake mix. And sausages. And cooked all of us breakfast during the meeting. It was fantastic. And surreal.
There he is. Cooking breakfast.
And here’s one of shawn – with a big check:
Yeah, that’s about the size of a typical Christmas bonus at my company… Yeah right.