Machine Learning Posts –

The Third UI for LLM Applications

07 Apr 2025

Ever since GPT-3.5 entered the scene in late 2022, LLM-based applications have exploded in popularity. Even in this short amount of time it is interesting to observe how the user interfaces for these applications have evolved. In my opinion there have been two main types of UI’s that have caught on but there is a third that is starting to emerge. In this blog post I’ll delve into all three.

How to Nicely Format Streaming OpenAI API Output for the Web

07 Apr 2025

When I was building HinterviewGPT, I was pretty surprised to discover that one of the most difficult parts was making the raw OpenAI API output look nice on the screen. Everyone has used ChatGPT at this point and users are accustomed to seeing nicely formatted text, code and even math. In order for a product to look professional, the output formatting must at least be on par with ChatGPT. HinterviewGPT is a React app and I soon learned that this is not as easy as I had first assumed.

That Time I Thought I Needed an Agentic Solution but I Really Didn't

23 Mar 2025

In this blog post I’m going to talk about a time I really thought I needed to implement an agentic solution to solve an LLM related problem but it turned out that I really didn’t. Let me first describe the problem itself. I was building HinterviewGPT where there was one feature where the user could generate a practice interview question for themselves based on a description they provide. Specifically, there was a chat ui on the left half of the page where the user can chat with an LLM about the topics/industries etc. that they are interested in and once the LLM (GPT-4o in this case) has enough information it would generate a relevant interview question on the right side of the page. Initially the page looked something like this.

Semi-Supervised Retrieval for RAG

09 Oct 2023

The typical approach to RAG (retrieval augmented generation) is the following

How RepoGPT Parses Code

28 Sep 2023

If you read the previous post about RepoGPT it explains how parsing code provides context which greatly improves RepoGPT’s ability to answer questions. Specifically, extracting the methods and classes associated with the code chunk. Below is an example code chunk with its associated context (shown on top of the code chunk).

RepoGPT: Improving Question Answering Over Code Repositories by using Contextual Chunking

07 Aug 2023

tl;dr RepoGPT is an LLM-based project for question answering over code repositories that exploits contextual chunking for improved performance.

Google JAX Overview

15 Aug 2020

According to it’s README JAX is “Autograd and XLA, brought together for high-performance machine learning research” from Google. Autograd is a reference to an automatic differentiation library which was originally maintained by the Harvard Intelligent Probabilistic Systems Group (HIPS). XLA is a reference to Tensorflow’s XLA (Accelerated Linear Algebra) compiler. JAX also says “At its core, JAX is an extensible system for transforming numerical functions. Here are four of primary interest: grad, jit, vmap, and pmap”. At this point, these four functions make up the bulk of JAX so this blog post will go through each of them and doing so should provide a good overview of JAX in general.

Deep Learning GPU Performance Analysis: Optimal Parameter Checklist

15 May 2020

Typically the parameters used for deep learning models are chosen to optimize some objective function that measures predictive performance, however it important to understand that the chosen parameters can also have a significant affect on the time it takes to train that model. Ideally you would want to choose parameters that both optimize predictive performance as well as minimize the training time. NVIDIA’s deep learning performance guide reveals some simple and in some cases unintuitive tips and tricks for choosing parameters that can result in significant training speedups without negatively affecting predictive performance.

Deep Learning GPU Performance Analysis: Mixed Precision Training

02 May 2020

There are several meaningful benefits to training deep neural networks using a precision format that is lower than 32-bit floating point. Lower precision requires less memory which enables us to train larger networks and/or training with larger minibatches. Furthermore lower precision requires less memory bandwidth which means training is faster. And thirdly lower precision allows for faster math operations. Mixed precision training exploits these benefits and is possible if you are working with a Volta Nvidia GPU or newer.

Deep Learning GPU Performance Analysis: Memory Bound vs Math Bound Operations

11 Apr 2020

Determining whether a GPU operation is memory bound vs math bound is a crucial step in performance analysis because it informs the strategies to optimize the operation. Generally, a memory bound GPU operation is one where the overall computation time is dominated by memory access rather than the actual computation. Conversely a math bound operation is one in which the computation time is dominated by the actual computation rather than memory access.

Multilevel Linear Models

14 Mar 2020

Multilevel models (also called hierarchical models) are a class of statistical models that are applied to data that have a natural hierarchical or nested structure. They are useful because they can often out-perform models that don’t take this structure into account. This post will cover multilevel models applied to linear regression however it can be easily extended to logistic regression and generalized linear models.

Kalman Filter Sensor Fusion using TensorFlow

02 Nov 2019

The Kalman filter is a popular model that can use measurements from multiple sources to track an object in a process known as sensor fusion. This post will cover two sources of measurement data - radar and lidar. It will also cover an implementation of the Kalman filter using the TensorFlow framework. It might surprise some to see TensorFlow being used outside of a deep learning context, however here we are exploiting TensorFlow’s linear algebra capabilies which will be needed in the Kalman filter implementation. Additionaly, if you have tensorflow-gpu installed, TensorFlow can allow us to put GPUs behind our linear algebra computations which is a nice bonus. The code corresponding to this post can be found here.

Named Entity Recognition with RNNs in TensorFlow

22 Aug 2019

Many tutorials for RNNs applied to NLP using TensorFlow are focused on the language modelling problem. But another interesting NLP problem that can be solved with RNNs is named entity recognition (NER). This blog post will cover how to train a LSTM model in TensorFlow in the context of NER - all code mentioned in this post can be found in an associated Colab notebook.

CUDA Grid-Stride Loops: What if you Have More Data Than Threads?

02 Aug 2019

A problem that pops up from time to time in CUDA is when you want to perform a trivial parallel operation on an input array by assigning one thread per input array element but the number of elements in your input array is larger than the number of threads you have available. Or consider the scenario where you have written some CUDA code which works fine with your GPU however someone else tries to run it with an older model GPU and they run into this problem because their GPU has fewer threads than yours. An elegant way to handle this “more data than threads” problem is to use grid-stride loops within your kernels.

Implementing Convolutions in CUDA

12 Jul 2019

The convolution operation has many applications in both image processing and deep learning (i.e. convolutional neural networks). Since convolutions can be performed on different parts of the input array (or image) independently of each other, it is a great fit for parallelization which is why convolutions are commonly performed on GPU. This blog post will cover some efficient convolution implementations on GPU using CUDA. This blog post will focus on 1D convolutions but can be extended to higher dimensional cases.

Reinforcement Learning Notes Part 3: Temporal Difference Learning

02 Jul 2019

Temporal difference learning shares many of the benefits of both dynamic programming methods and Monte Carlo methods without many their disadvantages. Like dynamic programming methods, policy evaluation can be updated at each time step but unlike dynamic programming you do not need a model of the environment. Like Monte Carlo methods, you do not need a model of the environemt but unlike Monte Carlo methods you do not need to wait til the end of an episode to make a policy evaluation update. All three of these methods use the same policy iteration strategy which iterates between policy evaluation (different for each method) and policy improvement (in a greedy fashion for each method).

Reinforcement Learning Notes Part 2: Monte Carlo Methods

30 Jun 2019

In the last reinforcement learning blog post we covered dynamic programming methods. In this blog post we will cover Monte Carlo (MC) methods. The biggest difference between these two methods is that dynamic programming methods assume a complete knowledge of the environement (via a MDP), but Monte Carlo methods do not. Instead, with Monte Carlo methods, knowledge of the environment is learned through experience. Another significant difference is that Monte Carlo methods can only learn from episodic tasks i.e. ones that start and terminate.

Reinforcement Learning Notes Part 1: Dynamic Programming

29 Jun 2019

This series of blog posts is intended to be a collection of short, concise, cheat-sheet-like notes on different topics relating to reinforcement learning. This first one will cover dynamic programming methods applied to reinforcement learning.

Calling CUDA from Python to Speed Up Linear Algebra

14 May 2019

Numpy is the go-to library for linear algebra computations in python. It is a highly optimized library that uses BLAS as well as SIMD vectorization resulting in very fast computations. Having said that, there are times when it is preferable to perform linear algebra computations on the GPU i.e. using CUDA’s cuBLAS linear algebra library. For example, the linear algebra computations associated with training large deep neural networks are commonly performed on GPU. In cases like these, the vectors and matrices are so large that the parallelization offerred by GPUs allows them to outperform linear algebra libraries like numpy.

A CUDA Implementation of the K-Means Clustering Algorithm

05 Mar 2019

This blog post will cover a CUDA C implementation of the K-means clustering algorithm. K-means clustering is a hard clustering algorithm which means that each datapoint is assigned to one cluster (rather than multiple clusters with different probabilities). The algorithm starts with random cluster assignments and iterates between two steps

Building A Basic Computational Graph Engine

14 Jul 2018

Many deep learning libraries like TensorFlow use graphs to represent the computations involved in neural networks. Not only are these graphs used to compute predictions for a given input to the network but they are also used to backpropagate gradients during the training phase. The main advantage of this graph representation is that each computation can be encapsulated as a node on the graph that only cares about its input and output. This level of abstraction gives you the flexibility to build neural networks of (nearly) arbitrary sizes and shapes (eg. MLPs, CNNs, RNNs, etc.). This blog post will implement a very basic version of a computational graph engine.

The Gaussian Mixture Model and the EM Algorithm

22 May 2017

This post is about the Gaussian mixture model which is a generative probabilistic model with hidden variables and the EM algorithm which is the algorithm used to compute the maximum likelihood estimate of its parameters.

Implementing the DistBelief Deep Neural Network Training Framework with Akka

06 Sep 2015

Presently, most deep neural networks are trained using GPUs due to the enormous number of parallel computations that they can perform. Without the speed-ups provided by GPUs, deep neural networks could take days or even weeks to train on a single machine. However, using GPUs can be prohitive for several reasons

Word2Vec Tutorial Part II: The Continuous Bag-of-Words Model

18 May 2015

In the previous post the concept of word vectors was explained as was the derivation of the skip-gram model. In this post we will explore the other Word2Vec model - the continuous bag-of-words (CBOW) model. If you understand the skip-gram model then the CBOW model should be quite straight-forward because in many ways they are mirror images of each other. For instance, if you look at the model diagram

Word2Vec Tutorial Part I: The Skip-Gram Model

12 Apr 2015

In many natural language processing tasks, words are often represented by their tf-idf scores. While these scores give us some idea of a word’s relative importance in a document, they do not give us any insight into its semantic meaning. Word2Vec is the name given to a class of neural network models that, given an unlabelled training corpus, produce a vector for each word in the corpus that encodes its semantic information. These vectors are usefull for two main reasons.

Distributed Online Latent Dirichlet Allocation with Apache Spark

20 Mar 2015

In the past, I have studied the online LDA algorithm from Hoffman et al. in some depth resulting in this blog post. Before we go further I will provide a general description of how the algorithm works. In online LDA, minibatches of documents are sequentially processed to update a global topic/word matrix which defines the topics that have been learned. The processing consists of two steps:

Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

14 Feb 2015

In the last couple of years Deep Learning has received a great deal of press. This press is not without warrant - Deep Learning has produced stat-of-the-art results in many computer vision and speech processing tasks. However, I believe that the press has given people the impression that Deep Learning is some kind of imprenetrable, esoteric field that can only be understood by academics. In this blog post I want to try to erase that impression and provide a practical overview of some of Deep Learning’s basic concepts.

Building a Distributed Binary Search Tree with Akka

05 Jan 2015

In this blog post I will descibe an interesting Akka mini-project that I came across which helped me gain a deeper understanding of Akka’s asynchronous actor model. In this project we use Akka to build a distributed binary search tree where each node in the tree is an actor which allows it to be a completely asynchronous, concurrent, and distributed version of the traditional data structure. But before we get into the Akka stuff, it would be helpful to remind ourselves of some of the basic properties of a binary search tree.

Introduction to the Multithreading Problem and the Akka Actor Solution

27 Dec 2014

Nowadays, computers have multiple execution cores meaning that they can execute multiple tasks at the same time rather than sequentially. Obviously this makes things much faster but it also presents some new problems. The term multithreading refers to the process in which multiple threads execute code in the same program simultaneously. The inherent problem with multithreading lies in the fact that although each thread acts independently, their memory is shared. Therefore, it is possible for threads to change shared memory values without other threads knowing which can create problems. Let’s use a bank account as an example. Consider the following code that implements a bank account with deposit and withdraw methods.

ScalaNER: A Scala Wrapper for the Stanford NER Tool with Some Added Features

11 Nov 2014

The Stanford NER (named entity recognizer) tool is a widely-used, general purpose named entity recognition tool that Stanford has made available as part of its CoreNLP Java library. It performs named entity recognition via a CRF-based sequence model which has been known to give near state-of-the-art performance results which makes it a popular choice for open-source NER tools.

Online Latent Dirichlet Allocation - Topic Modeling for Large Data Sets

14 Oct 2014

Latent Dirichlet Allocation (LDA) has a variety of valuable, real-world use cases. However, most real-world use cases involve large volumes of data which can be problematic for LDA. This is because both of the traditional implementations of LDA (variational inference and collapsed Gibbs sampling) require the entire corpus (or some encoding of it) to be loaded into main memory. Obviously, if you are working with a single machine and a data set that is sufficiently large, this can be infeasible.

Time Series Classification and Clustering with Python

16 Apr 2014

I recently ran into a problem at work where I had to predict whether an account would churn in the near future given the account’s time series usage in a certain time interval. So this is a binary-valued classification problem (i.e. churn or not churn) with a time series as a predictor. This was not a very straight-forward problem to tackle because it seemed like there two possible strategies to employ.

My Experience with Churn Analysis

30 Mar 2014

A large chunk of my time at my last job was devoted to churn analysis and I wanted to use this blog entry to explain how I approached the various problems that it presented. This is not meant to be a very technical post and the reasoning behind this is two-fold