Friday, October 26, 2012

Wrapping up our first Kaggle competition!


Intro

This semester at Olin College, we (Rachel and Geoff!) are doing a self-study on machine learning. Over the past two months, we worked our way through Andrew Ng's ML course offered via Coursera. At the same time, we've been trying to hone our skills by applying some of the class's techniques to the digit recognition competition on Kaggle. As humans, we've learned a lot about how our mechanical counterparts learn (somewhat meta!). So, keep reading for a summary of what we've done, and what we've gotten out of it.


The Competition

Kaggle's competitions are normally sponsored by companies looking to find the best solutions for their data science problems. Typically, winners are rewarded with money, jobs, or some combination of the two. The digit recognition competition is unique because it offers neither of these. It is an entry-level competition for people who are learning data science for the first time, or who have never entered a Kaggle competition before. Since we fall into both of these categories, it seemed like the perfect way to get our feet wet!

We approached the competition with the mindset of "we're here to learn, not to win" (we didn't really have a choice; unlike other Kaggle competitions, this one will never declare a winner). Thus, our goal was to learn and apply several techniques, not to maximize our success with any one of them. The techniques we used will be described in this and subsequent blog posts. But first, it probably helps to understand the data we were dealing with.


The Data

We used the data provided by Kaggle, which is best described by them. You can go here to see it, but we've also copied it below for your convenience.

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine. 

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image. 

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero). 

For example, pixel31 indicates the pixel that is in the fourth column from the left and the second row from the top of the image.
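To make that indexing concrete, here is a tiny Python sketch (our own illustration, not code from Kaggle) that converts a flat pixel index into its row and column:

# Convert a flat pixel index x into (row i, column j), using x = i * 28 + j.
def pixel_to_row_col(x):
    return divmod(x, 28)  # returns (i, j), each between 0 and 27

print(pixel_to_row_col(31))  # (1, 3): second row from the top, fourth column from the left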


Approach #1: Linear classifier

We jumped into this approach before we even started the Coursera course. It was inspired by the baseline linear classifier described by LeCun et al. in their comparison of techniques for handwritten digit recognition. The algorithm we used is described in a linear algebra presentation from UC Davis, available online.



It uses the training data (all 30k samples) to calculate what an "average" n looks like, for n = 0, 1, ..., 9. Below are 10 plots, each representing an average digit. Each digit was plotted as a graph, with point (x, y) being the average value of pixel (x, y) for that digit. Because the digits were plotted as graphs, there is whitespace between points. To compensate for the whitespace, we adjusted the images for contrast in Photoshop. Cool, right?
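Here's a minimal Python sketch of the averaging step, assuming numpy and the train.csv file described above (our actual implementation is linked at the end of this post and may differ in the details):

import numpy as np

# Load the Kaggle training data: a "label" column followed by 784 pixel columns.
train = np.loadtxt('train.csv', delimiter=',', skiprows=1)
labels, pixels = train[:, 0].astype(int), train[:, 1:]

# Compute the "average" image for each digit 0-9.
averages = np.zeros((10, 784))
for digit in range(10):
    averages[digit] = pixels[labels == digit].mean(axis=0)

# Each row can be reshaped to 28x28 for plotting, e.g. averages[3].reshape(28, 28).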



After figuring out what the average digits look like, we iterated over each of the 28k test samples to figure out which average it most closely resembles; this was our prediction. All 28k predictions were submitted to Kaggle, which provides instant feedback on our success. In this case, we were 80.471% accurate.
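For the curious, here's a rough Python sketch of the prediction step, continuing from the averages array in the sketch above. The distance measure (squared Euclidean distance) and the ImageId,Label submission header are our assumptions here; the real code lives in the repository linked below and may do things differently.

import numpy as np

# Load the test data: 784 pixel columns, no label column.
test = np.loadtxt('test.csv', delimiter=',', skiprows=1)

# For each test image, pick the digit whose average image is closest
# (squared Euclidean distance is one reasonable measure of "resemblance").
distances = np.column_stack([((test - avg) ** 2).sum(axis=1) for avg in averages])
predictions = distances.argmin(axis=1)

# Write the predictions in the assumed ImageId,Label submission format.
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    for image_id, label in enumerate(predictions, start=1):
        f.write('%d,%d\n' % (image_id, label))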

To be honest, this was a lower success rate than we expected. In the paper, the authors reported an error rate of 8.4% (or, equivalently, a success rate of 91.6%). The difference between our success rates is that the authors used a bias unit in their calculation, and also deslanted the images before running their algorithm. We chose not to replicate these steps, as we were eager to try other methods. If you're interested in examining all the technicalities of our implementation, check out our GitHub repository: https://github.com/rachelbobbins/machinelearning/tree/master/linear_classifier


