Alex Minnaar
Machine Learning at University College London. Research Engineer at Nitro.



Recent Posts

Implementing the DistBelief Deep Neural Network Training Framework with Akka

Word2Vec Tutorial Part II: The Continuous Bag-of-Words Model

Word2Vec Tutorial Part I: The Skip-Gram Model

Distributed Online Latent Dirichlet Allocation with Apache Spark

Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

Building a Shoutbox App with Cassandra and Node.js

Building a Distributed Binary Search Tree with Akka

Introduction to the Multithreading Problem and the Akka Actor Solution

ScalaNER: A Scala Wrapper for the Stanford NER Tool with Some Added Features

Online Latent Dirichlet Allocation - The Best Option for Topic Modeling with Large Data Sets

Latent Dirichlet Allocation in Scala Part II - The Code

Facebook Recruiting III Keyword Extraction - Part 4

Finding Association Rules

The idea is to develop rules that assign tags to certain words in the post titles and bodies. For example,

\(eclpise\rightarrow java \\derivative \rightarrow calculus \\matplotlib \rightarrow python\)

where the words on the left side of the arrow are words in the title/body and words on the right side of the arrow are tags. Once these rules are found, they can be applied to the post words in the test set and the tag word predictions can be found trivially. But first we must come up with a method to generate these rules. This can be done probabilistically using word counts from the training set.

Using probabilities to generate association rules

The question that needs to be answered is given that word A appears in a post, what is the probability that tag B will also appear in that post? To get this probability, you need to count the number of posts in which word A appears and the also the number of posts in which both word A and tag B appear. The desired probability can then be calculated as


where \(|Co(A,B)|\) is the number of posts where word A and tag B co-occur and \(|A|\) is the number of posts where word A occurs. So, if \(P(B|A)\) is above a certain threshold, we can then generate the association rule \(A \rightarrow B\). However, this probability is only a point estimate and does not give us any information regarding how certain we are about the estimate. For example, it is possible to have one case where \(|Co(A_1,B_1)|=9\) and \(|A_1|=10\) thus \(P(B_1|A_1)=90\%\) and a second case where \(|Co(A_2,B_2)|=89\) and \(|A_2|=100\) thus \(P(B_2|A_2)=89\%\). And since \(P(B_1|A_1)>P(B_2|A_2)\), you might think that \(A_1 \rightarrow B_1\) is a stronger association rule than \(A_2 \rightarrow B_2\). However, it is very possible that this assumption is incorrect because we cannot be certain that \(P(B_1|A_1)\) is the true probability because the sample size is so small. Whereas for \(P(B_2|A_2)\) we can be far more certain. So for this reason we must evaluate the association rule based on two criteria - the probability \(P(B|A)\) and also the support which is \(|Co(A,b)|\). We should only consider association rules that have a support above a certain threshold value.

Potential problems with this method

  • Is it scalable? The training data has ~42 thousand unique tag words and many more unique post words. Can this association rule algorithm be performed on a single machine given the size of the data?
  • What are the threshold values? How do we determine how large the probabilities \(P(B|A)\) and support need to be in order to warrant an association rule? i.e. what is the optimal balance between overfitting and underfitting.
  • What about the likelihood of co-occurring tags? In this method we are only generating association rules between post words and tags, but there is no doubt that certain tags are more likely to occur together than others. We have that information in the training set. Shouldn't we use that information too? These are all valid questions that we must think about before any kind of submission can be made...