Alex Minnaar

Deep Learning GPU Performance Analysis: Optimal Parameter Checklist

Typically the parameters used for deep learning models are chosen to optimize some objective function that measures predictive performance, however it important to understand that the chosen parameters can also have a significant affect on the time it takes to train that model. Ideally you would want to choose parameters that both optimize predictive performance as well as minimize the training time. NVIDIA’s deep learning performance guide reveals some simple and in some cases unintuitive tips and tricks for choosing parameters that can result in significant training speedups without negatively affecting predictive performance.

Deep Learning GPU Performance Analysis: Mixed Precision Training

There are several meaningful benefits to training deep neural networks using a precision format that is lower than 32-bit floating point. Lower precision requires less memory which enables us to train larger networks and/or training with larger minibatches. Furthermore lower precision requires less memory bandwidth which means training is faster. And thirdly lower precision allows for faster math operations. Mixed precision training exploits these benefits and is possible if you are working with a Volta Nvidia GPU or newer.

Deep Learning GPU Performance Analysis: Memory Bound vs Math Bound Operations

Determining whether a GPU operation is memory bound vs math bound is a crucial step in performance analysis because it informs the strategies to optimize the operation. Generally, a memory bound GPU operation is one where the overall computation time is dominated by memory access rather than the actual computation. Conversely a math bound operation is one in which the computation time is dominated by the actual computation rather than memory access.

Multilevel Linear Models

Multilevel models (also called hierarchical models) are a class of statistical models that are applied to data that have a natural hierarchical or nested structure. They are useful because they can often out-perform models that don’t take this structure into account. This post will cover multilevel models applied to linear regression however it can be easily extended to logistic regression and generalized linear models.

Kalman Filter Sensor Fusion using TensorFlow

The Kalman filter is a popular model that can use measurements from multiple sources to track an object in a process known as sensor fusion. This post will cover two sources of measurement data - radar and lidar. It will also cover an implementation of the Kalman filter using the TensorFlow framework. It might surprise some to see TensorFlow being used outside of a deep learning context, however here we are exploiting TensorFlow’s linear algebra capabilies which will be needed in the Kalman filter implementation. Additionaly, if you have tensorflow-gpu installed, TensorFlow can allow us to put GPUs behind our linear algebra computations which is a nice bonus. The code corresponding to this post can be found here.

Named Entity Recognition with RNNs in TensorFlow

Many tutorials for RNNs applied to NLP using TensorFlow are focused on the language modelling problem. But another interesting NLP problem that can be solved with RNNs is named entity recognition (NER). This blog post will cover how to train a LSTM model in TensorFlow in the context of NER - all code mentioned in this post can be found in an associated Colab notebook.

CUDA Grid-Stride Loops: What if you Have More Data Than Threads?

A problem that pops up from time to time in CUDA is when you want to perform a trivial parallel operation on an input array by assigning one thread per input array element but the number of elements in your input array is larger than the number of threads you have available. Or consider the scenario where you have written some CUDA code which works fine with your GPU however someone else tries to run it with an older model GPU and they run into this problem because their GPU has fewer threads than yours. An elegant way to handle this “more data than threads” problem is to use grid-stride loops within your kernels.

Implementing Convolutions in CUDA

The convolution operation has many applications in both image processing and deep learning (i.e. convolutional neural networks). Since convolutions can be performed on different parts of the input array (or image) independently of each other, it is a great fit for parallelization which is why convolutions are commonly performed on GPU. This blog post will cover some efficient convolution implementations on GPU using CUDA. This blog post will focus on 1D convolutions but can be extended to higher dimensional cases.

Reinforcement Learning Notes Part 3: Temporal Difference Learning

Temporal difference learning shares many of the benefits of both dynamic programming methods and Monte Carlo methods without many their disadvantages. Like dynamic programming methods, policy evaluation can be updated at each time step but unlike dynamic programming you do not need a model of the environment. Like Monte Carlo methods, you do not need a model of the environemt but unlike Monte Carlo methods you do not need to wait til the end of an episode to make a policy evaluation update. All three of these methods use the same policy iteration strategy which iterates between policy evaluation (different for each method) and policy improvement (in a greedy fashion for each method).

Reinforcement Learning Notes Part 2: Monte Carlo Methods

In the last reinforcement learning blog post we covered dynamic programming methods. In this blog post we will cover Monte Carlo (MC) methods. The biggest difference between these two methods is that dynamic programming methods assume a complete knowledge of the environement (via a MDP), but Monte Carlo methods do not. Instead, with Monte Carlo methods, knowledge of the environment is learned through experience. Another significant difference is that Monte Carlo methods can only learn from episodic tasks i.e. ones that start and terminate.

Reinforcement Learning Notes Part 1: Dynamic Programming

This series of blog posts is intended to be a collection of short, concise, cheat-sheet-like notes on different topics relating to reinforcement learning. This first one will cover dynamic programming methods applied to reinforcement learning.

Calling CUDA from Python to Speed Up Linear Algebra

Numpy is the go-to library for linear algebra computations in python. It is a highly optimized library that uses BLAS as well as SIMD vectorization resulting in very fast computations. Having said that, there are times when it is preferable to perform linear algebra computations on the GPU i.e. using CUDA’s cuBLAS linear algebra library. For example, the linear algebra computations associated with training large deep neural networks are commonly performed on GPU. In cases like these, the vectors and matrices are so large that the parallelization offerred by GPUs allows them to outperform linear algebra libraries like numpy.

A CUDA Implementation of the K-Means Clustering Algorithm

This blog post will cover a CUDA C implementation of the K-means clustering algorithm. K-means clustering is a hard clustering algorithm which means that each datapoint is assigned to one cluster (rather than multiple clusters with different probabilities). The algorithm starts with random cluster assignments and iterates between two steps

Building A Basic Computational Graph Engine

Many deep learning libraries like TensorFlow use graphs to represent the computations involved in neural networks. Not only are these graphs used to compute predictions for a given input to the network but they are also used to backpropagate gradients during the training phase. The main advantage of this graph representation is that each computation can be encapsulated as a node on the graph that only cares about its input and output. This level of abstraction gives you the flexibility to build neural networks of (nearly) arbitrary sizes and shapes (eg. MLPs, CNNs, RNNs, etc.). This blog post will implement a very basic version of a computational graph engine.

The Gaussian Mixture Model and the EM Algorithm

This post is about the Gaussian mixture model which is a generative probabilistic model with hidden variables and the EM algorithm which is the algorithm used to compute the maximum likelihood estimate of its parameters.

Implementing the DistBelief Deep Neural Network Training Framework with Akka

Presently, most deep neural networks are trained using GPUs due to the enormous number of parallel computations that they can perform. Without the speed-ups provided by GPUs, deep neural networks could take days or even weeks to train on a single machine. However, using GPUs can be prohitive for several reasons

Word2Vec Tutorial Part II: The Continuous Bag-of-Words Model

In the previous post the concept of word vectors was explained as was the derivation of the skip-gram model. In this post we will explore the other Word2Vec model - the continuous bag-of-words (CBOW) model. If you understand the skip-gram model then the CBOW model should be quite straight-forward because in many ways they are mirror images of each other. For instance, if you look at the model diagram

Word2Vec Tutorial Part I: The Skip-Gram Model

In many natural language processing tasks, words are often represented by their tf-idf scores. While these scores give us some idea of a word’s relative importance in a document, they do not give us any insight into its semantic meaning. Word2Vec is the name given to a class of neural network models that, given an unlabelled training corpus, produce a vector for each word in the corpus that encodes its semantic information. These vectors are usefull for two main reasons.

Distributed Online Latent Dirichlet Allocation with Apache Spark

In the past, I have studied the online LDA algorithm from Hoffman et al. in some depth resulting in this blog post. Before we go further I will provide a general description of how the algorithm works. In online LDA, minibatches of documents are sequentially processed to update a global topic/word matrix which defines the topics that have been learned. The processing consists of two steps:

Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

In the last couple of years Deep Learning has received a great deal of press. This press is not without warrant - Deep Learning has produced stat-of-the-art results in many computer vision and speech processing tasks. However, I believe that the press has given people the impression that Deep Learning is some kind of imprenetrable, esoteric field that can only be understood by academics. In this blog post I want to try to erase that impression and provide a practical overview of some of Deep Learning’s basic concepts.

Building a Distributed Binary Search Tree with Akka

In this blog post I will descibe an interesting Akka mini-project that I came across which helped me gain a deeper understanding of Akka’s asynchronous actor model. In this project we use Akka to build a distributed binary search tree where each node in the tree is an actor which allows it to be a completely asynchronous, concurrent, and distributed version of the traditional data structure. But before we get into the Akka stuff, it would be helpful to remind ourselves of some of the basic properties of a binary search tree.

Introduction to the Multithreading Problem and the Akka Actor Solution

Nowadays, computers have multiple execution cores meaning that they can execute multiple tasks at the same time rather than sequentially. Obviously this makes things much faster but it also presents some new problems. The term multithreading refers to the process in which multiple threads execute code in the same program simultaneously. The inherent problem with multithreading lies in the fact that although each thread acts independently, their memory is shared. Therefore, it is possible for threads to change shared memory values without other threads knowing which can create problems. Let’s use a bank account as an example. Consider the following code that implements a bank account with deposit and withdraw methods.

ScalaNER: A Scala Wrapper for the Stanford NER Tool with Some Added Features

The Stanford NER (named entity recognizer) tool is a widely-used, general purpose named entity recognition tool that Stanford has made available as part of its CoreNLP Java library. It performs named entity recognition via a CRF-based sequence model which has been known to give near state-of-the-art performance results which makes it a popular choice for open-source NER tools.

Online Latent Dirichlet Allocation - Topic Modeling for Large Data Sets

Latent Dirichlet Allocation (LDA) has a variety of valuable, real-world use cases. However, most real-world use cases involve large volumes of data which can be problematic for LDA. This is because both of the traditional implementations of LDA (variational inference and collapsed Gibbs sampling) require the entire corpus (or some encoding of it) to be loaded into main memory. Obviously, if you are working with a single machine and a data set that is sufficiently large, this can be infeasible.

Time Series Classification and Clustering with Python

I recently ran into a problem at work where I had to predict whether an account would churn in the near future given the account’s time series usage in a certain time interval. So this is a binary-valued classification problem (i.e. churn or not churn) with a time series as a predictor. This was not a very straight-forward problem to tackle because it seemed like there two possible strategies to employ.

My Experience with Churn Analysis

A large chunk of my time at my last job was devoted to churn analysis and I wanted to use this blog entry to explain how I approached the various problems that it presented. This is not meant to be a very technical post and the reasoning behind this is two-fold