Alex Minnaar
Machine Learning at University College London. Research Engineer at Nitro.



Facebook Recruiting III Keyword Extraction - Part 3

Is this a Clustering Problem?

At first glance this problem looks like a clustering problem. You can think of the tags as the cluster centroids, and based on the words in a post's title and body you can determine which clusters (tag words) the post is closest to. In this way you can assign tags to the posts in the test set.
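To make the framing concrete, here is a minimal sketch of what "tags as centroids" could look like. The training data, the word-overlap similarity, and the `rank_tags` helper are all illustrative assumptions, not the competition solution:

```python
# Hypothetical sketch: each tag's "centroid" is the set of words seen in
# training posts carrying that tag, and a test post is assigned the tags
# whose word sets overlap its own words the most.
from collections import Counter, defaultdict

# Toy training data: (post text, list of tags). Purely illustrative.
train = [
    ("how to merge two dataframes", ["python", "pandas"]),
    ("actor supervision strategies", ["scala", "akka"]),
]

# Build a word "centroid" per tag from the training posts.
centroid = defaultdict(set)
for text, tags in train:
    for tag in tags:
        centroid[tag].update(text.split())

def rank_tags(text, k=2):
    """Rank tags by word overlap with the post text (a crude similarity)."""
    words = set(text.split())
    scores = Counter({t: len(words & c) for t, c in centroid.items()})
    return [t for t, s in scores.most_common(k) if s > 0]

# A test post about merging dataframes overlaps the "python"/"pandas"
# centroids and matches neither of the Scala-related tags.
print(rank_tags("merge dataframes in python"))
```

A real system would of course use a weighted similarity (e.g. TF-IDF) rather than raw set overlap, but the shape of the assignment step is the same.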

The problem with this approach is that there is no guarantee that the training set contains all of the tags that appear in the test set. This is confirmed by a post in the Kaggle forum. So if there are new tags in the test set, then perhaps this is more of an inference problem than a clustering problem. However, we should not rely solely on a forum post, so let's gather some evidence of how many unseen tags to expect in the test set.

Time for an Experiment!

We can use the training set itself to estimate how many unseen tags will be present in the test set. The training set has ~6 million examples and the test set has ~2 million, a ratio of about 3:1. The experiment, then, is to divide the training set into that same 3:1 ratio and count how many tags appear in the small portion but not the large one. The large portion (A) will have ~4.5 million examples and the small portion (B) will have ~1.5 million. The results of this experiment are below.

examples in portion A:                            4,500,000
examples in portion B:                            1,534,194
unique tags in A:                                 41,251
unique tags in B:                                 37,179
tags in B but not in A:                           797
occurrences of these tags in B:                   972
average occurrences of each of these tags in B:   1.22
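The counts above could be reproduced with a short script along these lines. File handling is elided; this is a minimal sketch over a toy list of space-separated tag strings, with the 3:1 split ratio as the only parameter carried over from the real experiment:

```python
# Sketch of the unseen-tag experiment: split the tag strings ~3:1,
# then count tags that appear in the small portion but not the large one.
from collections import Counter

def tag_counts(rows):
    """Count tag occurrences over an iterable of space-separated tag strings."""
    counts = Counter()
    for tags in rows:
        counts.update(tags.split())
    return counts

# Toy stand-in for the training set's tag column (illustrative data).
train_tags = ["python pandas", "java", "python", "scala akka", "haskell"]

split = int(len(train_tags) * 0.75)        # ~3:1 ratio, like the real split
a_counts = tag_counts(train_tags[:split])  # portion A (large)
b_counts = tag_counts(train_tags[split:])  # portion B (small)

unseen = {t: c for t, c in b_counts.items() if t not in a_counts}
print("unique tags in A:", len(a_counts))
print("unique tags in B:", len(b_counts))
print("tags in B but not in A:", len(unseen))
print("occurrences of these tags in B:", sum(unseen.values()))
```

On the real ~6 million-example training set the split indices would be computed the same way, just over a streamed file rather than an in-memory list.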

From the above results, only ~2% of the unique tags in B (797 of 37,179) are absent from A, and those tags occur very rarely in B (about 1.2 times each on average). So the vast majority of tags that appear in B also appear in A, and by the same logic we can assume that the vast majority of tags in the test set also appear in the training set. We can therefore confidently treat this as a clustering problem!