Alex Minnaar's Blog

Online Latent Dirichlet Allocation - Topic Modeling for Large Data Sets

By now it is clear that Latent Dirichlet Allocation (LDA) has a variety of valuable, real-world use cases. However, most of these use cases involve large volumes of data, which can be problematic for LDA. Both of the traditional inference methods for LDA (variational inference and collapsed Gibbs sampling) require the entire corpus (or some encoding of it) to be loaded into main memory. If you are working with a single machine and a sufficiently large data set, this can be infeasible. One solution is to parallelize the algorithm and scale out until you have the required resources. However, this presents an entirely new set of problems - acquiring a cluster of machines, modifying your LDA code so that it works in a MapReduce framework, etc. A much better solution is to segment your large data set into small batches, read each batch into main memory sequentially, and update your LDA model as you go in an online fashion. This way you keep only a small fraction of the data set in main memory at any given time.

Furthermore, consider a scenario where your corpus is constantly growing, such as an online discussion forum. As the corpus grows you want to see how the topics are changing. With traditional variational inference you would have to rerun the entire batch algorithm on the old data plus the new data, but it would be much more efficient to simply update your model with only the new data. In their paper Online Learning for Latent Dirichlet Allocation, Hoffman et al. present an algorithm that provides this kind of functionality. This blog post summarizes the paper and shows some results from my own Scala implementation.
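To make the "small batches, update as you go" idea concrete, here is a minimal Python sketch of the outer loop only. It is not my Scala implementation and it is not a full LDA: the per-batch estimation step (in the paper, the variational E-step followed by a rescaling to the full corpus size) is left as a hypothetical caller-supplied function, `estimate_from_batch`. The step-size schedule (tau0 + t)^(-kappa) follows the paper, but the default values used here are illustrative assumptions, as are all function names.

```python
from itertools import islice


def mini_batches(docs, batch_size):
    """Yield successive lists of at most batch_size documents.

    Works on any iterable, so the full corpus never has to sit in memory.
    """
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def step_size(t, tau0=1.0, kappa=0.7):
    """Weight given to batch number t (0-based).

    The weight decays as more batches are seen; kappa is assumed to lie
    in (0.5, 1]. The defaults are illustrative, not tuned values.
    """
    return (tau0 + t) ** (-kappa)


def online_update(model, batch_estimate, rho):
    """Blend the current topic-word parameters with one batch's estimate."""
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(model, batch_estimate)
    ]


def fit_online(docs, init_model, estimate_from_batch, batch_size):
    """Stream over docs once, updating the model after every batch.

    estimate_from_batch(model, batch) is a hypothetical placeholder for
    the real per-batch estimation step, which is not implemented here.
    """
    model = init_model
    for t, batch in enumerate(mini_batches(docs, batch_size)):
        estimate = estimate_from_batch(model, batch)
        model = online_update(model, estimate, step_size(t))
    return model
```

Because `fit_online` only ever holds one batch at a time, the same loop covers both situations described above: a corpus that is too large for memory and a corpus that keeps growing, where new documents simply arrive as further batches.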

Time Series Classification and Clustering with Python

I recently ran into a problem at work where I had to predict whether an account would churn in the near future given the account’s time series usage in a certain time interval. So this is a binary-valued classification problem (i.e. churn or not churn) with a time series as a predictor. This was not a very straightforward problem to tackle because it seemed like there were two possible strategies to employ.

My Experience with Churn Analysis

A large chunk of my time at my last job was devoted to churn analysis and I wanted to use this blog entry to explain how I approached the various problems that it presented. This is not meant to be a very technical post and the reasoning behind this is two-fold