Key points
- Monte Carlo methods estimate the value function from the average return observed over many episodes. Applying this to a fixed policy is known as policy evaluation.
- The key idea behind Monte Carlo methods is to use the actual returns obtained from complete episodes to update value estimates. This contrasts with methods such as temporal-difference learning, which bootstrap: they update estimates based on other estimates.
- Monte Carlo methods can be used for both policy evaluation and policy improvement. In policy evaluation, we estimate the value function for a given policy; in policy improvement, we use the estimated value function to derive a better policy.
- A main advantage of Monte Carlo methods is that they work in environments with unknown dynamics, since they do not require a model of the environment. The cost is sample efficiency: many episodes are needed to obtain accurate value estimates.
The system is model-free: the conditional probability of state transitions and rewards, $P(s', r \mid s, a)$, is not known. We can only sample from the environment to obtain the next state and reward. This is in contrast to model-based methods, where a model of the environment lets us compute the next state and reward given the current state and action. The policy's conditional probability $\pi(a \mid s)$ *is* known, however, so we can sample actions from the policy given a state. This distinction is what allows Monte Carlo methods to perform policy evaluation and improvement without a model of the environment.

Even if the policy says, "in state $s$, always take action $a$", the next state and reward can still be random: the environment's state-transition and reward functions can be stochastic, so different episodes can follow different paths under the same policy. This is a key aspect of Monte Carlo methods, since averaging returns over many episodes is precisely what captures the variability in the environment's dynamics.
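To make the sampling idea concrete, here is a minimal first-visit Monte Carlo policy-evaluation sketch. The two-state task, the `step` and `policy` functions, and the reward values are illustrative assumptions, not part of the notes: from state 0 the fixed policy's action ends the episode with reward +1 with probability 0.5, otherwise pays -0.1 and stays in state 0, so episodes differ in length even under one policy.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible sampling

# Hypothetical episodic task: stochastic transitions under a fixed policy.
def step(state, action):
    if random.random() < 0.5:
        return 1, 1.0, True    # (next_state, reward, done): reach terminal state
    return 0, -0.1, False      # stay in state 0, small step cost

def policy(state):
    return "go"                # deterministic pi(a|s): always the same action

def first_visit_mc_evaluation(num_episodes=5000, gamma=1.0):
    returns = defaultdict(list)            # state -> list of sampled returns
    for _ in range(num_episodes):
        # Generate one episode by sampling from the environment.
        episode, state, done = [], 0, False
        while not done:
            action = policy(state)
            next_state, reward, done = step(state, action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards accumulating the return G; the last write for each
        # state corresponds to its first visit in the episode.
        G, first_visit = 0.0, {}
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            first_visit[s] = G
        for s, g in first_visit.items():
            returns[s].append(g)
    # Estimated value = average return observed from each state.
    return {s: sum(g) / len(g) for s, g in returns.items()}

V = first_visit_mc_evaluation()
```

With these assumed dynamics the expected number of extra (-0.1) steps is 1, so the estimate for state 0 should settle near 0.9, illustrating how averaging returns recovers the value without ever knowing $P(s', r \mid s, a)$.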
The Process of Monte Carlo Methods
```mermaid
flowchart TD
    A[Start] --> B[Initialize arbitrary policy π]
    B --> C["Initialize Q(s,a) (e.g., zeros)"]
    C --> D[Repeat for many iterations]
    D --> E["Generate an episode<br/>by following current policy π<br/>(with exploration)"]
    E --> F["For each state-action (s,a)<br/>in the episode,<br/>compute return G from that time step"]
    F --> G["Update Q(s,a)<br/>using average of returns<br/>observed for (s,a)"]
    G --> H["Improve policy π:<br/>for each state s,<br/>choose action with highest Q(s,a)<br/>(e.g., ε-greedy)"]
    H --> D
```