Alex Minnaar
Machine Learning at University College London. Research Engineer at Nitro.

Email
Github

Facebook Recruiting III Keyword Extraction - Part 7

Parameter Tuning

Now the time has come to perform an actual out-of-sample test of our association rule algorithm. The training data set has about 6 million rows so the association rule algorithm will be trained on 5 million rows and the remaining 1 million rows will be held out for testing. From this we can accomplish two goals

It gives us an idea of how well our algorithm will perform on the actual test set without actually having to submit anything on the Kaggle website (you are only allowed 2 submissions per day). It allows us to tune the parameters of our algorithm to achieve the best performance on our hold-out data set. However, before we start tuning parameters we should first formalize our classification algorithm.

Classification Algorithm

The classification algorithm is fairly simple. The following pseudo-code illustrates how it works

I={}
For r in AR:
if C(r)>α AND supp(r)>β AND r not in I:
I.append(r)


The input to the algorithm is the learned association rules from both the post title and post body $$AR$$. $$I$$ denotes the set of tags that will be predicted for the given post, so initially $$I$$ is empty. The algorithm searches through $$AR$$ to find association rules $$r$$ with confidence above a certain threshold $$\alpha$$ and support above a certain threshold $$\beta$$. The tags in these association rules are then added to the set $$I$$ which become the predicted tags for this post.

So we need to find the optimal parameters $$\alpha$$ and $$\beta$$. One way to do this is to vary these parameters and see how their values affect the test error. The parameter values that produce the best test performance are the ones we would use. Before we can do this, we must decide on a loss function. Kaggle says that the loss function that submissions are scored upon is the mean F1-score loss function. The mean F1-score loss function is defined as

$$F1=2\frac{p \cdot r}{p+r}$$ where $$p=\frac{tp}{tp+fp}$$ and $$r=\frac{tp}{tp+fn}$$

This loss function means that both precision ($$p$$) and recall ($$r$$) are weighted equally so both must be optimized in order to obtain the best score. The F1-score can be written in Python code as the following

def F1_score(tags,predicted):
tags = set(tags)
predicted = set(predicted)
tp = len(tags & predicted)
fp = len(predicted) - tp
fn = len(tags) - tp

if tp>0:
precision=float(tp)/(tp+fp)
recall=float(tp)/(tp+fn)

return 2*((precision*recall)/(precision+recall))
else:
return 0


predicted is the list of tags that the classification algorithm predicts for a given post in the hold-out set and tags is the list of actual tags for that post. These lists are then converted to Python sets so that tp, fp, and fn can be computed efficiently.

Varying the Parameters

Now that we have a cost function, we can run some tests to calibrate our parameters α and β. As stated earlier, we split the data set into a training set and test set - mine association rules from the training set and test their performance on the test set.

Why not use cross-validation?

Usually one should use k-fold cross-validation to learn the optimal parameter values as it would give the most unbiased representation of the algorithm's performance on unseen data. However, 5-fold cross-validation was tried initially but it became clear that validation errors for each of the 5 validation sets were very similar so taking the average would not be much different than the error of one of the validation sets. For this reason (and the fact that each test will take 1/5th of the time), a single test set was used instead.

Results

After considering many combinations of support and confidence values, the best F1-score was found to be 0.74021 with confidence=0.5 and support=5. The number 1 score was 0.81350 so I was not too far off.

There were many different strategies that I wanted to try but did not have time to implement. Here is a list of improvements/alternative strategies that I would have liked to try if I had enough time.

• About 50,000 posts in the test set were predicted to have zero tags because there were no association rules for them that fell within the acceptable threshold. These posts were scored as zero. However, F1-score can never be negative so any random guess of tags could only improve the score. A naive solution would be to simply use the baseline tags (i.e. javascript c# python php java) as predictions for these posts, however I suspect that there would be a more clever way of doing this.
• The opposite problem also occurs in some predictions. Some posts are predicted to have 8, 9, 10+ tags. Given what we know from the training set and this kaggle thread, the maximum number of tags is 5 (and it is also quite rare to have that many). This does not necessarily mean that we should cut off all predictions at 5 tags, but certainly 10+ tags is not a sensible number of tags to predict. So I think that is very likely that our score would increase if we reduced the number of tags in these cases.
• I considered each post as a bag-of-words - meaning that I only cared which words appeared in the post, not their ordering. However, I suspect that if I was able to build bigram-based association rules, I could increase the classifier performance. For example, if a post title contains the bigram "linux server", my current classifier would look at the association rules for "linux" and "server" independently and it might be unlikely that it would predict the tag "linux-server" for either of those terms. However, I suspect that a bigram association rule between "linux server" and "linux-server" would be much more likely. The trade-off is that it would be much more difficult to mine bigram association rules on a single computer in terms of memory capacity.
• Another strategy would be to abandon the association rule classification strategy altogether and consider a more classical method. One such method would be to create a very sparse binary valued matrix from the training set where each feature is a word that could appear in a post (assign it a value of 1 if it does, 0 if is doesn't). Then use well-known classification methods to model the relationship between the input matrix and the output tags. However, I would imagine that this could present problems due to the size and sparsity of this matrix.