Monte Carlo Gridworld

First-visit on-policy Monte Carlo control on the same 3x3 gridworld. The board shows learned action-values and the current epsilon-soft policy after training from complete episodes.
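A minimal sketch of first-visit on-policy Monte Carlo control may make the description above more concrete. The page does not specify the grid layout or rewards, so everything here is an assumption for illustration: the start cell (0, 0), the goal cell (2, 2), a reward of -1 per step and +10 on reaching the goal, a linear epsilon decay, and the 50-step episode cap (a capped episode is treated as complete). The names `step`, `choose_action`, `run_episode`, and `mc_control` are hypothetical and are not taken from the demo's code.

```python
import random
from collections import defaultdict

SIZE = 3
START, GOAL = (0, 0), (2, 2)                  # assumed layout
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Move one cell, clamping at the walls. Rewards are assumed values."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (r, c)
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -1.0, False

def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection, which is one kind of epsilon-soft policy."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = [Q[(state, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))

def run_episode(Q, epsilon, rng, max_steps=50):
    """Roll out one episode and return a list of (state, action, reward)."""
    state, trajectory = START, []
    for _ in range(max_steps):
        a = choose_action(Q, state, epsilon, rng)
        nxt, reward, done = step(state, ACTIONS[a])
        trajectory.append((state, a, reward))
        state = nxt
        if done:
            break
    return trajectory

def mc_control(episodes=2000, gamma=0.9, eps_start=1.0, eps_final=0.05, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)     # action-value estimates
    counts = defaultdict(int)  # first visits seen per (state, action)
    for i in range(episodes):
        # Linear decay from eps_start to eps_final (an assumed schedule).
        epsilon = eps_start + (eps_final - eps_start) * i / max(episodes - 1, 1)
        trajectory = run_episode(Q, epsilon, rng)
        # Record the first time step at which each (state, action) occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(trajectory):
            first_visit.setdefault((s, a), t)
        # Walk backward, accumulating the discounted return G.
        G = 0.0
        for t in range(len(trajectory) - 1, -1, -1):
            s, a, r = trajectory[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                # Incremental mean of first-visit returns.
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q
```

Calling `mc_control()` returns a table keyed by `(state, action_index)`; taking the highest-valued action in each cell gives the greedy part of the epsilon-soft policy that a board like this one would display.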

Q-Learning

Algorithm

Episodes

Gamma

Epsilon Final

Training Summary

The chart plots a moving average of the episode return. Monte Carlo updates only after each full episode, so the learning signal arrives later than with TD methods, but each update is based on a complete sampled return rather than a bootstrapped estimate.
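As an illustration of the two quantities mentioned here, the sketch below computes the discounted return at every step of a finished episode and a trailing moving average over a list of episode returns. The function names and the window handling (averaging over fewer values until the window fills) are assumptions; the page does not say how its chart is smoothed.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every step t, computed backward from the episode's end."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

print(discounted_returns([-1, -1, 10], 0.5))  # [1.0, 4.0, 10.0]
print(moving_average([1, 2, 3, 4], 2))        # [1.0, 1.5, 2.5, 3.5]
```

The backward pass needs the whole reward sequence, which is why no value can be updated until the episode has ended.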

Agent State

Episode Reward

Simulation Log

Sample Episodes