Alex Minnaar
Machine Learning at University College London. Research Engineer at Nitro.

Email
Github

Facebook Recruiting III Keyword Extraction - Part 3

Is this a Clustering Problem?

At first glance this problem looks like a clustering problem. You can think of the tags as the cluster centroids and based on the words in the post title/body you can determine which clusters (tags words) each post is closest to. In this way you can assign posts to tags in the test set.

The problem with this is that there is no guarantee that the training set holds all of the tag words that appear in the test set. This is confirmed by this post in the Kaggle forum. So if there are new tags in the test set, then perhaps this is more of an inference problem than a clustering problem. However, we should not rely solely on a forum post so let's come up with some evidence to give us an idea of how many unseen tags will be in the test set.

Time for an Experiment!

We can use the training set to give us an idea of how many unseen tags will be present in the test set. The training set has ~6 million examples and the test set has ~2 million examples. That's a ratio of about 3:1. So the experiment will be to divide the training set into the same 3:1 ratio and we shall see how many tags appear in the small portion but not the large portion. So the large portion will have ~4.5 million examples and the small portion will have ~1.5 million examples. The following are some of the results of this experiment.

examples in portion A 4500000
examples in portion B 1534194
unique tags in A 41251
unique tags in B 37179
tags in B but not in A 797
occurrences of these tags in B 972
average number of occurrences of each of these tags in B 1.22

So from the above results, only ~2% of the tags in B are not in A. And these tags also occur very rarely in B, (only about once for each tag). Therefore, we can conclude that the vast majority of tags that appear in B also appear in A. We can also therefore assume that the vast majority of tags that appear in the test set also appear in the training set. So we can confidently assume that we can treat this as a clustering problem!