The Big Picture
- Dynamic Programming (DP): works at the definition/analytic level, assuming a known model.
- Monte Carlo (MC) and TD (SARSA, Q-learning): work at the sample level, learning from experience without knowing the model.
- MC: sample-based, no bootstrapping.
- TD (SARSA, Q-learning): sample-based with bootstrapping.
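The MC-versus-TD distinction above comes down to two different update rules for the same value estimate. A minimal sketch (the two-state episode and the hyperparameters below are hypothetical, chosen only to make the contrast concrete):

```python
# Minimal sketch: MC vs. TD(0) updates for state-value prediction.
# The toy states ("A", "B", "end") and hyperparameters are hypothetical.
GAMMA, ALPHA = 0.9, 0.1

def mc_update(V, episode):
    """Monte Carlo: wait until the episode ends, then update each visited
    state toward the full observed return G (no bootstrapping)."""
    G = 0.0
    for state, reward in reversed(episode):  # episode = [(state, reward), ...]
        G = reward + GAMMA * G
        V[state] += ALPHA * (G - V[state])

def td0_update(V, s, r, s_next):
    """TD(0): update immediately after one step, bootstrapping on the
    current estimate V[s_next] instead of waiting for the true return."""
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

V = {"A": 0.0, "B": 0.0, "end": 0.0}
mc_update(V, [("A", 0.0), ("B", 1.0)])  # one full episode: A -> B -> end
td0_update(V, "A", 0.0, "B")            # one sampled transition: A -> B
```

Note that `mc_update` can only run once an episode terminates, while `td0_update` applies after every single transition; this is exactly the bootstrapping difference listed above.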
```mermaid
flowchart TB
    A["Bellman Equations (Theory)
    Definition level"]
    A --> B{"Do we know the full MDP?
    (transitions & rewards)"}
    B -->|Yes| C["Dynamic Programming (DP)
    Model-based"]
    B -->|No| D["Model-free RL
    Sample level"]
    C --> C1[Policy Evaluation]
    C --> C2[Policy Iteration]
    C --> C3[Value Iteration]
    D --> E["Monte Carlo (MC)"]
    D --> F["Temporal-Difference (TD)"]
    E:::mc
    F:::td
    F --> G["SARSA
    (on-policy TD control)"]
    F --> H["Q-learning
    (off-policy TD control)"]
    classDef mc fill:#e0f7fa,stroke:#00838f,stroke-width:1px;
    classDef td fill:#fff3e0,stroke:#ef6c00,stroke-width:1px;
```
Source Code Dynamic Programming
The gridboard game
Key Points
- DP is a planning method that computes exact value functions from a known model.
- DP uses the Bellman equations to iteratively compute values.
- DP can compute both state values and action values, which are useful for different purposes.
- DP is not a learning method in the sense of learning from experience, but it provides the theoretical foundation for the later methods that do learn from experience.
- In DP, the model $P(s', r \mid s, a)$ is known and fixed; it never changes. In Q-learning, we do not have access to this model and learn from samples instead.
- In DP we do, however, learn and optimize the policy $\pi(a \mid s)$: the environment model stays constant while the action policy changes. We start from an arbitrary policy and improve it iteratively; this process is known as policy iteration. (Bootstrapping, by contrast, refers to updating value estimates from other value estimates.)
- In Q-learning, we also learn and optimize the policy, but indirectly, by learning $Q(s,a)$ values that guide the policy.
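The DP side of this contrast can be sketched as iterative policy evaluation: repeatedly apply the Bellman expectation backup over all states, using the known model. The tiny two-state MDP and the uniform policy below are hypothetical, chosen only to keep the sweep visible:

```python
# Minimal sketch of iterative policy evaluation with a known model.
# The two-state MDP, the uniform policy, and the constants are hypothetical.
GAMMA, THETA = 0.9, 1e-8

# Known model: P[s][a] = list of (prob, next_state, reward).
# This is the fixed P(s', r | s, a) that DP assumes is given.
P = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.0)]},
}
pi = {s: {"left": 0.5, "right": 0.5} for s in P}  # fixed policy to evaluate

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:  # full sweep over all states (expectation-based backup)
        v_new = sum(
            pi[s][a] * prob * (r + GAMMA * V[s_next])
            for a in P[s]
            for prob, s_next, r in P[s][a]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:
        break
```

Every term in the backup is an exact expectation over the known model; nothing is sampled. That is the defining difference from the sample-based methods below.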
- DP is:
  - model-based
  - expectation-based
  - a full sweep over all states and actions
  - the policy is updated between iterations.
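For contrast with the DP properties above, here is the sample-based side: a tabular Q-learning sketch that never touches $P(s', r \mid s, a)$ and instead updates from sampled transitions. The two-state environment, `step` function, and hyperparameters are hypothetical:

```python
import random

# Minimal sketch of tabular Q-learning: no model P(s', r | s, a), only
# sampled (s, a, r, s') transitions. The toy environment is hypothetical.
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
ACTIONS = ["left", "right"]

def step(s, a):
    """Hypothetical environment: 'right' from s0 reaches the goal s1 (+1)."""
    if s == "s0" and a == "right":
        return "s1", 1.0
    return "s0", 0.0

Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in ACTIONS}

random.seed(0)
s = "s0"
for _ in range(200):
    # epsilon-greedy behavior policy; the learned target is greedy (off-policy)
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    s_next, r = step(s, a)
    # Q-learning backup: bootstrap on max_a' Q(s', a'), not on the action taken
    target = r + GAMMA * max(Q[(s_next, act)] for act in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    s = "s0" if s_next == "s1" else s_next  # treat s1 as terminal; restart
```

The policy here is implicit: acting greedily with respect to `Q` improves it, which is the indirect policy optimization described in the key points.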