In the last reinforcement learning blog post we covered dynamic programming methods. In this blog post we will cover Monte Carlo (MC) methods. The biggest difference between these two methods is that dynamic programming methods assume a complete knowledge of the environement (via a MDP), but Monte Carlo methods do not. Instead, with Monte Carlo methods, knowledge of the environment is learned through experience. Another significant difference is that Monte Carlo methods can only learn from *episodic* tasks i.e. ones that start and terminate.

However, dynamic programming methods and Monte Carlo methods handle the control problem in generally the same way. In both methods it is an iterative procedure that alternates between policy evaluation and policy improvement (i.e. policy iteration). In fact the policy improvement step is the same in that it is performed in a greedy fashion. The main difference is in the policy evaluation step.

## Monte Carlo Policy Evaluation

Recall that policy evalution is the problem of computing the value function for a given policy i.e. . And the value function evaluated at a given state is defined as the *expected cumulative future discounted reward* starting from that state. A very simple way of doing this is to just average the rewards that have been seen for the episodes in which that state was visited. This is essentially what Monte Carlo policy evaluation does.

There are two general methods for Monte Carlo policy evaluation - first-visit MC and every-visit MC.

**First-Visit MC:**Uses the average of the returns following all visits to the state in a set of episodes.**Every-Visit MC:**Uses averaged returns only after the first vist to the state in a set of episodes.

Both methods converge to the as the number of visits go to infinity.

## The Role of Exploration

The problem with Monte Carlo policy evaluation is that you can only determine the value of a given state if your policy visits it, otherwise there will never be a reward associated with it. So if you begin the process with a deterministic policy that never visits certain states then the values associated with those states will never be known. However, it is also possible that the optimal policy would visit those states. So we need a way to insure that all states will be visited. There are two strategies to insure this - they are called on-policy MC and off-policy MC.

## On-Policy Monte Carlo

On policy methods are the simpler of the two. With on-policy methods you are still following the policy during policy evaluation however you are adding some random exploration as well. For example, -greedy is a common on-policy method where at every step you choose the action determined by your policy with probability and you choose a random action with probability . With this added randomness you can insure that exploration occurs.

## Off-Policy Monte Carlo

The idea for off-policy methods is to evaluate your policy while following a different policy. This has obvious practical advantages for cases where you can observe other policies but are unable to try your own. Say you want to evaluate policy but can only observe a different policy . As long as every action taken under is also taken under you can still evaluate . Let’s see how this is possible.

Let represent the probability of a complete episode occurring under policy and similarly let represent the probability of a complete episode occurring under policy . Also, let represent the reward observed after the first occurrence of . Then the value at state under policy can be estimated as

Where is the number of episodes. Intuitively, the way to look at this is that even though the reward is associated with the other policy , we are weighting it by the relative probability of it occurring i.e. if the reward is more likely to occur under then it is up-weighted and if it is less likely to occur under then it is down-weighted. This is very similar to importance sampling. Now the question is how to determine the probabilities and . Luckily we don’t need to compute them individually rather we only need their ratio. This can be computed by

where is the time of termination of the episode involving . The advantage here is that to compute the desired ratio all we need are the two policies and not the environment dynamics.

Thank you for reading.