The Big Picture

```mermaid
flowchart TB
    A["Bellman Equations and Returns<br/>RL foundation"]
    A --> B{"Do we know the model?"}
    B -->|Yes| C["Dynamic Programming<br/>Expectation updates"]
    B -->|No| D["Model-free learning<br/>From sampled experience"]
    D --> E{"Do we update only<br/>after full episodes?"}
    E -->|Yes| F["Monte Carlo<br/>Use complete return G_t"]
    E -->|No| G["Temporal-Difference<br/>Bootstrap using V(S_t+1) or Q(S_t+1,A_t+1)"]
    F --> H["Prediction<br/>Estimate V_pi or Q_pi"]
    F --> I["Control<br/>Improve policy from sampled returns"]
    I --> J["On-policy MC<br/>epsilon-soft / epsilon-greedy"]
    I --> K["Off-policy MC<br/>importance sampling"]
    classDef mc fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px;
    classDef td fill:#fff3e0,stroke:#ef6c00,stroke-width:1px;
    classDef dp fill:#e3f2fd,stroke:#1565c0,stroke-width:1px;
    C:::dp
    F:::mc
    G:::td
```

Source Code: Monte Carlo

Open the source code for the Monte Carlo case study

This script implements first-visit on-policy Monte Carlo control with an
epsilon-soft policy on the same 3x3 stochastic gridworld used across the
other DRL case studies.

The gridboard game

Open the game in a new tab (full-screen mode)

The game view lets you interact with the gridworld directly.

Model and Tables

Open the Monte Carlo model and tables view

This view is useful when you want the tabular details directly.

Notebook

Open the Monte Carlo notebook

The notebook is a compact runnable companion for the chapter. It calls the same
Monte Carlo implementation and can regenerate the exported JSON artifact used by
the browser views.

What Monte Carlo Means

Monte Carlo methods learn by running episodes, observing rewards, and then averaging the
actual returns obtained. Instead of asking "what return does the model say to expect from
this state?", MC asks "what return did we actually observe from this state?"

If an episode generates rewards

\[R_{t+1}, R_{t+2}, \dots, R_T\]

then the return from time $t$ is

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T\]
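
The return above can be computed efficiently by walking the reward sequence backwards, so each reward is multiplied by $\gamma$ exactly once per step. A minimal sketch (the function name and reward list are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t for the reward list [R_{t+1}, ..., R_T].

    Accumulating from the end means g always holds the return
    from the current position onward.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 0.0, 2.0], 0.5)` gives $1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$.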

MC uses sampled values of $G_t$ to estimate $V_\pi(s)$ and $Q_\pi(s,a)$.

This makes MC natural when:

  1. The task is episodic, so complete returns are available.
  2. The environment model (transition probabilities and rewards) is unknown.

Core Idea: Learn From Returns

For a fixed policy $\pi$, Monte Carlo prediction estimates value functions by averaging returns.

For state values:

\[V_\pi(s) \approx \text{average of returns observed after visits to } s\]

For action values:

\[Q_\pi(s,a) \approx \text{average of returns observed after visits to } (s,a)\]

Over time, if enough samples are collected, these averages converge to the true values under mild conditions.

First-Visit vs Every-Visit Monte Carlo

There are two standard ways to average returns.

1. First-Visit MC

Average the returns that follow only the first visit to a state (or state-action pair) within each episode.

2. Every-Visit MC

Average the returns that follow every visit, even when the same state is visited several times in one episode.

Both are valid and converge to the true values as the number of visits grows. First-visit MC is usually introduced first because it is simpler to reason about: each episode contributes at most one return sample per state, and those samples are independent across episodes.
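
The difference between the two variants can be made concrete with a small sketch. The `mc_estimates` helper and the `(state, reward)` episode format here are illustrative, not the chapter's actual implementation:

```python
from collections import defaultdict

def mc_estimates(episodes, gamma, first_visit=True):
    """Estimate V(s) by averaging returns per state.

    Each episode is a list of (state, reward) pairs, where reward is
    the reward received after leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Walk backwards to get the return from each time step.
        g, tagged = 0.0, []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            tagged.append((state, g))
        tagged.reverse()  # restore forward (time) order
        seen = set()
        for state, g in tagged:
            if first_visit and state in seen:
                continue  # skip repeat visits within this episode
            seen.add(state)
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

On an episode that visits state A twice, such as `[("A", 1.0), ("B", 0.0), ("A", 2.0)]` with $\gamma = 1$, first-visit MC records only the return 3.0 for A, while every-visit MC averages 3.0 and 2.0.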

Monte Carlo Prediction

Suppose policy $\pi$ is fixed. Then the procedure is:

  1. Generate an episode using $\pi$.
  2. For each visited state $s$, compute the return $G_t$ from that point onward.
  3. Update the estimate of $V_\pi(s)$ by averaging returns.

Incremental form:

\[V(s) \leftarrow V(s) + \alpha \bigl(G_t - V(s)\bigr)\]

This looks like a stochastic approximation update: the estimate moves a fraction $\alpha$ of the way toward the sampled target $G_t$. With $\alpha = 1/N(s)$, where $N(s)$ counts visits to $s$, it reduces exactly to the running average of observed returns.

Similarly for action values:

\[Q(s,a) \leftarrow Q(s,a) + \alpha \bigl(G_t - Q(s,a)\bigr)\]
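
A minimal sketch of this incremental update (the `mc_update` helper and the dictionary-based table are illustrative):

```python
def mc_update(q, state, action, g, alpha=0.1):
    """Move Q(state, action) a fraction alpha toward the sampled return g."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (g - old)
```

Starting from an empty table, `mc_update(q, "s0", "right", 10.0, alpha=0.5)` sets the estimate to 5.0, halfway from the default 0.0 toward the observed return.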

Why MC Needs Episodes

MC needs a complete return $G_t$, so it must wait until the episode ends.

That creates two important consequences:

  1. MC is naturally suited for episodic tasks.
  2. MC updates can have high variance, because full returns can vary a lot from one episode to another.

This is the main tradeoff: MC targets are unbiased samples of the true return but can have high variance, while bootstrapping methods such as TD trade some bias for much lower variance.

Monte Carlo Control

Prediction evaluates a policy. Control improves it.

The MC control loop is:

  1. Start with a policy $\pi$.
  2. Generate episodes following $\pi$.
  3. Estimate $Q_\pi(s,a)$ from returns.
  4. Improve the policy to be greedy or nearly greedy with respect to $Q$.
  5. Repeat.
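
The loop above can be sketched end to end on a toy corridor MDP. This is an illustrative stand-in for the chapter's 3x3 gridworld, with assumed names and dynamics throughout: states 0 to 3, actions step left/right, and a reward of +1 for reaching state 3.

```python
import random
from collections import defaultdict

ACTIONS = [-1, 1]  # step left / step right

def run_episode(q, epsilon, n_states=4, max_steps=50):
    """Roll out one episode with an epsilon-greedy policy over q."""
    s, trajectory = 0, []
    for _ in range(max_steps):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: q[(s, x)])
        s_next = min(max(s + a, 0), n_states - 1)  # clamp to the corridor
        r = 1.0 if s_next == n_states - 1 else 0.0
        trajectory.append((s, a, r))
        s = s_next
        if r == 1.0:
            break  # terminal state reached
    return trajectory

def mc_control(n_episodes=500, gamma=0.9, epsilon=0.2, seed=0):
    """On-policy first-visit Monte Carlo control."""
    random.seed(seed)
    q = defaultdict(float)  # Q(s, a) estimates
    n = defaultdict(int)    # visit counts, for exact running averages
    for _ in range(n_episodes):
        episode = run_episode(q, epsilon)
        # Compute the return from each step by walking backwards.
        g, tagged = 0.0, []
        for s, a, r in reversed(episode):
            g = r + gamma * g
            tagged.append((s, a, g))
        tagged.reverse()
        seen = set()
        for s, a, g in tagged:  # first-visit updates only
            if (s, a) in seen:
                continue
            seen.add((s, a))
            n[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / n[(s, a)]
    return q
```

Note that improvement happens implicitly: the next episode's epsilon-greedy rollout already uses the freshly updated $Q$ table, so evaluation and improvement interleave rather than alternating in full sweeps.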

This mirrors policy iteration: steps 2 and 3 play the role of policy evaluation, and step 4 plays the role of policy improvement.

Because the model is unknown, both steps are done from sampled episodes rather than exact Bellman expectations.

Exploring Starts

One classic MC control assumption is exploring starts: every state-action pair has a nonzero probability of being chosen as the starting point of an episode.

This guarantees enough exploration in theory, but it is often unrealistic in practice because
we usually cannot reset the environment to any arbitrary state-action pair.

So in practical RL, we usually prefer soft exploration policies instead.

On-Policy Monte Carlo Control

In on-policy MC control:

  1. A single policy both generates the episodes and gets improved.
  2. The policy must remain epsilon-soft, meaning $\pi(a \mid s) > 0$ for every action.

Why? If the policy ever becomes fully greedy, some actions stop being sampled, and their $Q$ estimates can never be corrected.

Typical rule: epsilon-greedy improvement, which gives the greedy action probability $1 - \epsilon + \epsilon / |\mathcal{A}(s)|$ and every other action probability $\epsilon / |\mathcal{A}(s)|$.

This ensures continued exploration while gradually favoring better actions.
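
The epsilon-greedy probabilities can be written out directly; this helper and its action names are illustrative:

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state.

    q_values maps action -> Q(s, action). Every action keeps at least
    epsilon/|A| probability; the greedy action gets the rest.
    (Ties are broken arbitrarily by max.)
    """
    n = len(q_values)
    best = max(q_values, key=q_values.get)
    return {a: (1 - epsilon) + epsilon / n if a == best else epsilon / n
            for a in q_values}
```

With two actions and $\epsilon = 0.1$, the greedy action gets probability $0.9 + 0.05 = 0.95$ and the other action keeps $0.05$, so exploration never fully stops.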

Off-Policy Monte Carlo

Off-policy learning separates two policies:

  1. The behavior policy $b$, which generates the episodes.
  2. The target policy $\pi$, whose values we actually want to estimate.

Because episodes come from $b$ but values are for $\pi$, we must correct the mismatch using
importance sampling.

Importance sampling ratio:

\[\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\]

Then weighted returns are used to estimate values under $\pi$.
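
The ratio itself is just a product over the episode's state-action pairs. A minimal sketch, where the episode format and the policy functions are illustrative:

```python
def is_ratio(steps, pi, b):
    """Importance sampling ratio rho_{t:T-1} for a list of (state, action)
    pairs, given pi(s, a) and b(s, a) as action-probability functions."""
    rho = 1.0
    for s, a in steps:
        rho *= pi(s, a) / b(s, a)
    return rho

# Example: a deterministic target policy versus a uniform behavior
# policy over two actions. Each step that matches pi doubles the ratio.
greedy_pi = lambda s, a: 1.0 if a == "right" else 0.0
uniform_b = lambda s, a: 0.5
rho = is_ratio([(0, "right"), (1, "right")], greedy_pi, uniform_b)  # 4.0
```

If the behavior policy ever takes an action the target policy would never take ($\pi(a \mid s) = 0$), the whole ratio collapses to zero and that episode contributes nothing, which is one reason off-policy MC returns can be so high variance.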

Why this matters: off-policy MC can learn about a greedy target policy while the behavior policy keeps exploring, and it can reuse logged or simulated data. The price is variance: the product of ratios can grow very large over long episodes.

MC vs DP vs TD

| Method | Model needed | Learns from samples | Bootstrapping | Update timing |
| --- | --- | --- | --- | --- |
| DP | Yes | No | Yes | During sweeps over the model |
| MC | No | Yes | No | After a full episode |
| TD | No | Yes | Yes | Every step |

Interpretation: DP computes exact expectations from a known model, MC replaces those expectations with complete sampled returns, and TD samples a single step and bootstraps the rest from current estimates.

Strengths of Monte Carlo

  1. Does not require the environment model.
  2. Conceptually simple: estimate values by averaging returns.
  3. Works directly from real or simulated experience.
  4. Useful for episodic tasks.
  5. Does not bootstrap, so targets are actual sampled returns.

Limitations of Monte Carlo

  1. Must wait until the episode terminates before updating.
  2. Not naturally suited to continuing tasks without modification.
  3. Returns can have high variance, making learning unstable or slow.
  4. Pure exploring starts is often impractical.
  5. Off-policy MC with importance sampling can become very high variance.

Intuition To Remember

MC asks "what return did we actually get?", whereas DP and TD back up from current value estimates. That single contrast is enough to place MC correctly inside the larger RL picture.

Key Points

  1. Monte Carlo is a model-free method that learns from complete episodes.
  2. MC estimates values using the sampled return $G_t$, not the Bellman expectation over a known model.
  3. MC does not bootstrap; its targets are actual returns observed after an episode unfolds.
  4. First-visit MC uses only the first occurrence of a state or state-action pair in an episode, while every-visit MC uses all occurrences.
  5. MC prediction evaluates a fixed policy by averaging returns for visited states or state-action pairs.
  6. MC control alternates between estimating $Q_\pi(s,a)$ and improving the policy to be greedy or epsilon-greedy with respect to that estimate.
  7. On-policy MC commonly uses epsilon-soft exploration so that all actions continue to be sampled.
  8. Off-policy MC learns about a target policy from data generated by a different behavior policy using importance sampling.
  9. MC is:
    • model-free
    • sample-based
    • episode-based
    • no bootstrapping
    • higher variance than TD
    • especially natural for episodic problems