Finding Association Rules
The idea is to develop rules that assign tags to certain words in the post titles and bodies. For example,
\(eclpise\rightarrow java \\derivative \rightarrow calculus \\matplotlib \rightarrow python\)
where the words on the left side of the arrow are words in the title/body and words on the right side of the arrow are tags. Once these rules are found, they can be applied to the post words in the test set and the tag word predictions can be found trivially. But first we must come up with a method to generate these rules. This can be done probabilistically using word counts from the training set.
Using probabilities to generate association rules
The question that needs to be answered is given that word A appears in a post, what is the probability that tag B will also appear in that post? To get this probability, you need to count the number of posts in which word A appears and the also the number of posts in which both word A and tag B appear. The desired probability can then be calculated as
where \(|Co(A,B)|\) is the number of posts where word A and tag B co-occur and \(|A|\) is the number of posts where word A occurs. So, if \(P(B|A)\) is above a certain threshold, we can then generate the association rule \(A \rightarrow B\). However, this probability is only a point estimate and does not give us any information regarding how certain we are about the estimate. For example, it is possible to have one case where \(|Co(A_1,B_1)|=9\) and \(|A_1|=10\) thus \(P(B_1|A_1)=90\%\) and a second case where \(|Co(A_2,B_2)|=89\) and \(|A_2|=100\) thus \(P(B_2|A_2)=89\%\). And since \(P(B_1|A_1)>P(B_2|A_2)\), you might think that \(A_1 \rightarrow B_1\) is a stronger association rule than \(A_2 \rightarrow B_2\). However, it is very possible that this assumption is incorrect because we cannot be certain that \(P(B_1|A_1)\) is the true probability because the sample size is so small. Whereas for \(P(B_2|A_2)\) we can be far more certain. So for this reason we must evaluate the association rule based on two criteria - the probability \(P(B|A)\) and also the support which is \(|Co(A,b)|\). We should only consider association rules that have a support above a certain threshold value.
Potential problems with this method
- Is it scalable? The training data has ~42 thousand unique tag words and many more unique post words. Can this association rule algorithm be performed on a single machine given the size of the data?
- What are the threshold values? How do we determine how large the probabilities \(P(B|A)\) and support need to be in order to warrant an association rule? i.e. what is the optimal balance between overfitting and underfitting.
- What about the likelihood of co-occurring tags? In this method we are only generating association rules between post words and tags, but there is no doubt that certain tags are more likely to occur together than others. We have that information in the training set. Shouldn't we use that information too? These are all valid questions that we must think about before any kind of submission can be made...