First-visit on-policy Monte Carlo control on the same 3x3 gridworld. The board shows learned action-values and the current epsilon-soft policy after training from complete episodes.
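As a rough illustration of how an epsilon-soft policy can be read off the action-values for one cell, here is a minimal sketch. It assumes the common epsilon-greedy form, where every action keeps probability epsilon / n and the greedy action gets the remainder. The function name `epsilon_soft_probs` and the example values are hypothetical, not taken from the demo.

```python
def epsilon_soft_probs(q_values, epsilon):
    """Action probabilities for one state under an epsilon-greedy policy.

    Assumption: ties are broken in favour of the lowest action index.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action keeps at least epsilon / n, which is what makes the policy "soft".
    probs = [epsilon / n] * n
    # The greedy action receives the remaining probability mass.
    probs[greedy] += 1.0 - epsilon
    return probs


# Hypothetical action-values for one cell with four moves.
probs = epsilon_soft_probs([0.1, 0.5, -0.2, 0.0], epsilon=0.2)
```

Because no action's probability ever drops to zero, the agent keeps exploring while it follows the policy it is improving, which is what keeps the method on-policy.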
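The chart plots the moving average of the episode return. Monte Carlo updates action-values only after an episode has finished, so the learning signal arrives later than in TD methods, which update after every step, but it is based on complete sampled returns rather than bootstrapped estimates.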
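The end-of-episode update and the charted moving average can be sketched as follows. This is an illustrative sketch, not the demo's actual code: the function names, the `(state, action, reward)` episode layout, the incremental sample-average update, and the reward values are all assumptions.

```python
from collections import defaultdict


def first_visit_mc_update(episode, Q, counts, gamma=1.0):
    """Update Q in place from one complete episode of (state, action, reward) steps."""
    # Record the first time step at which each (state, action) pair occurs.
    first_seen = {}
    for t, (s, a, _) in enumerate(episode):
        first_seen.setdefault((s, a), t)

    G = 0.0
    # Walk backwards so G accumulates the discounted return from step t onward.
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = gamma * G + r
        # First-visit rule: only the earliest occurrence contributes a return.
        if first_seen[(s, a)] == t:
            counts[(s, a)] += 1
            # Incremental sample average of the observed returns.
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        out.append(total / min(i + 1, window))
    return out


# Hypothetical episode: step cost -1, final reward +10 (assumed values).
Q = defaultdict(float)
counts = defaultdict(int)
episode = [
    ((0, 0), "right", -1.0),
    ((0, 1), "left", -1.0),
    ((0, 0), "right", -1.0),  # repeat visit, ignored by the first-visit rule
    ((0, 1), "down", 10.0),
]
first_visit_mc_update(episode, Q, counts, gamma=1.0)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```

Nothing in `Q` changes until the whole episode is available, which is the delay the chart reflects. A TD method would instead adjust its estimate after each individual step.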