Monte Carlo Gridworld

First-visit on-policy Monte Carlo control on the same 3x3 gridworld. The board shows learned action-values and the current epsilon-soft policy after training from complete episodes.
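As a rough illustration of first-visit on-policy Monte Carlo control, here is a minimal sketch. Everything specific in it is an assumption rather than this page's actual setup: the start cell (0, 0), the goal cell (2, 2), the reward of -1 per step, the 50-step episode cap, and a constant epsilon (the Epsilon Final field suggests the demo may decay epsilon instead). The names step, epsilon_greedy, and mc_control are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical 3x3 layout: start top-left, goal bottom-right, -1 reward per step.
SIZE, START, GOAL = 3, (0, 0), (2, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Move inside the grid; bumping into a wall leaves the state unchanged."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (row, col)
    return nxt, -1.0, nxt == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Epsilon-soft policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = [Q[(state, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))


def mc_control(episodes=2000, gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)     # action-value estimates, keyed by (state, action index)
    counts = defaultdict(int)  # number of first-visit returns averaged into each Q entry
    for _ in range(episodes):
        # 1. Generate one episode with the current epsilon-soft policy.
        #    Episodes cut off at max_steps are treated as complete (a simplification).
        state, trajectory = START, []
        for _ in range(max_steps):
            a = epsilon_greedy(Q, state, epsilon, rng)
            nxt, reward, done = step(state, ACTIONS[a])
            trajectory.append((state, a, reward))
            state = nxt
            if done:
                break
        # 2. Record the first time step at which each (state, action) pair occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(trajectory):
            first_visit.setdefault((s, a), t)
        # 3. Walk backward, accumulating the discounted return G, and update Q
        #    only at first visits using an incremental sample average.
        G = 0.0
        for t in range(len(trajectory) - 1, -1, -1):
            s, a, r = trajectory[t]
            G = r + gamma * G
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q
```

Calling mc_control() returns the action-value table; the greedy action in each cell is the one with the largest Q value. Under these assumed rewards, any move that enters the goal has a return of exactly -1, so its estimate settles at -1.0 after the first visit.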
Algorithm
Episodes
Gamma
Epsilon Final

Training Summary

The chart plots the moving average of the episode return. Monte Carlo updates its value estimates only after each full episode, so the learning signal arrives later than with TD methods, but it is based on complete sampled returns.
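The two quantities behind this chart can be sketched as follows. This is an illustrative sketch, not the page's own code: the helper names discounted_return and moving_average are hypothetical, and the window length is an assumption because the page does not state one.

```python
def discounted_return(rewards, gamma):
    """Return of a finished episode: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G


def moving_average(values, window):
    """Trailing mean over at most `window` values (shorter at the start)."""
    averages, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value that left the window
        averages.append(total / min(i + 1, window))
    return averages


# The return is only known once the episode's full reward list is available.
print(discounted_return([-1, -1, -1, 10], gamma=0.5))  # -0.5
print(moving_average([1, 2, 3, 4], window=2))          # [1.0, 1.5, 2.5, 3.5]
```

Because discounted_return needs the whole reward sequence, no estimate can change mid-episode, which is the delay described above.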

Agent State
Episode Reward

Simulation Log

Sample Episodes