Case Study: Dynamic Programming vs Q-Learning
This note presents two small case studies:
- Dynamic Programming on a known grid world
- Q-Learning on the same grid world without using the model
The goal is to show:
- how both methods try to solve the same decision problem
- why their update mechanisms are different
- when one is more appropriate than the other
Common Environment
Consider a 3 x 3 grid world:
+-----+-----+-----+
| S0 | S1 | S2 |
+-----+-----+-----+
| S3 | S4 | S5 |
+-----+-----+-----+
| S6 | S7 | S8 |
+-----+-----+-----+
Assumptions:
- start state:
S0 - goal state:
S8 - actions:
Up,Right,Down,Left - reward
+10for reachingS8 - reward
-1for every non-terminal move - episode ends at
S8
State layout:
S0 S1 S2
S3 S4 S5
S6 S7 S8
Action mapping:
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
State Value vs Action Value
Before comparing DP and Q-learning, it helps to separate two closely related ideas:
- state value
- action value
They are not immediate reward.
They are both about expected long-term return.
State value
The state-value function V_pi(s) means:
- if the agent is currently in state
s - and then follows policy
pi - what total future reward is expected from that point onward?
So state value is a state-level summary.
It tells us how promising it is to be in a state under a policy.
It does not directly tell us which action is best unless we also know how actions change the state.
Action value
The action-value function Q_pi(s,a) means:
- if the agent is currently in state
s - takes action
anow - and then follows policy
pi - what total future reward is expected from that point onward?
So action value is more specific than state value.
It scores an individual action inside a state.
What they are not
Two common misunderstandings are:
- state value is not just “how quickly I can reach reward”
- action value is not just the immediate reward from the action
Why?
- an action may give a small reward now but lead to a much better future
- another action may give a large reward now but lead to poor states later
So both V and Q include:
- immediate reward
- plus discounted future rewards
Small intuition example
Suppose in state S4:
Rightgives immediate reward0, but reaches a path to the goalLeftgives immediate reward+2, but leads into a poor region
Then it is possible that:
- immediate reward of
Left> immediate reward ofRight - but
Q(S4, Right) > Q(S4, Left)
because Q measures total future return, not just the reward from the next step.
Relationship between V and Q
State value and action value are related, but they are not added together.
It is not true that:
total value = V(s) + Q(s,a)
Instead:
V_pi(s)is the policy-weighted average of action values in that stateQ_pi(s,a)is the value of one specific action choice in that state
Mathematically:
V_{\pi}(s) = \sum_a \pi(a \mid s) Q_{\pi}(s,a)
For the optimal case:
V^*(s) = \max_a Q^*(s,a)
So:
- if we know all action values, we can recover the state value
- but if we only know the state value, we usually cannot recover all action values unless we also know the model
Why use state value at all?
A natural question is:
- if
Q(s,a)is more detailed, why do we needV(s)?
State value is useful because it is a compact summary of how good a state is.
It is useful when we want to:
- evaluate how good a policy is
- compare states overall
- estimate how promising a state is
- do model-based policy improvement in DP
If the environment model is known, V(s) can be enough for control, because we can look one step ahead and ask:
- if I take action
a, which next state might I reach? - how valuable is that next state?
Then we can improve the policy using the state values.
When is Q more useful?
If the model is unknown, Q(s,a) is often more useful for direct control because it already answers:
- in this state, which action is best?
That is why Q-learning learns action values directly.
Once Q(s,a) is learned, the agent can act by choosing:
\arg\max_a Q(s,a)
So in practice:
V(s)is a state-level evaluation toolQ(s,a)is an action-level decision tool
Final intuition
You can think of them like this:
V(s)asks: how good is this situation overall?Q(s,a)asks: how good is this move from this situation?
Or:
V(s): value of being there
Q(s,a): value of doing that from there
This distinction is important for the rest of the note:
- Dynamic Programming often works comfortably with
Vbecause the model is known - Q-learning focuses on
Qbecause the agent must compare actions directly from sampled experience
Why DP often uses V instead of Q
A common question is:
- if
Q(s,a)is more detailed, why does dynamic programming so often useV(s)?
The answer is that policy evaluation in DP can be written using either one:
- evaluate the policy through
V_pi(s) - or evaluate the policy through
Q_pi(s,a)
Both describe the same policy from two different levels of detail.
However, in a known-model setting, V is often enough for control.
That is because once we know the value of each next state, we can compare actions by using the transition model:
\sum_{s',r} p(s',r \mid s,a)\left[r + \gamma V(s')\right]
This quantity tells us how good action a is by looking one step ahead and then using the state values.
So in DP:
Vis often used as the compact object being updated- the model supplies the action consequences
- policy improvement uses those consequences to choose better actions
This means:
- if the model is known,
Vis often sufficient for policy improvement - if the model is unknown,
Valone is usually not enough to compare actions directly - in that unknown-model case, learning
Qis more practical
So your intuition is largely correct:
Qis more detailed thanVQis often more directly useful for choosing actions- but
Vis not redundant in DP, because the known model lets DP turn state values into action comparisons
That is why:
- DP often uses
V - Q-learning learns
Q
Case Study 1: Dynamic Programming
What DP is solving
Dynamic programming is not itself “Q-learning”.
Dynamic programming is a planning framework for solving an MDP when the full model is known.
It can be used to compute:
V_pi(s): the value of statesunder a specific policypiQ_pi(s,a): the value of taking actionain statesunder a specific policypiV*(s): the optimal state valueQ*(s,a): the optimal action value
So Q and Q* are not the same:
Q_pi(s,a)means the expected return if you take actionain statesand then continue following policypiQ*(s,a)means the best possible expected return if you take actionain statesand act optimally afterward
In short:
Q_pi: action value under a particular policy
Q*: optimal action value over all policies
This is why we should not call dynamic programming “Q* learning”.
- DP is the broader method class
Q*is one particular object that DP can compute- Q-learning is a separate model-free algorithm whose goal is to learn
Q*from sampled experience
DP is relevant here because it gives the exact Bellman target that Q-learning is trying to approximate.
If the model is known, DP can compute Q* directly by repeatedly applying Bellman optimality updates over all state-action pairs.
If the model is unknown, Q-learning uses sampled transitions to move its estimates toward that same Q*.
So the connection is:
DP with known model -> computes Q* by full expected backups
Q-learning without model -> learns Q* by sampled backups
Settings
In dynamic programming, we assume the finite MDP is fully known:
\mathcal{M} = \left(\mathcal{S}, \mathcal{A}, p(s', r \mid s, a), \gamma\right)
In plain English, this means:
- the environment is modeled as an MDP $\mathcal{M}$
- $\mathcal{S}$ is the set of all states
- $\mathcal{A}$ is the set of all actions
- $p(s’, r \mid s, a)$ is the conditional probability of getting next state $s’$ and reward $r$, given current state $s$ and action $a$
- $\gamma \in [0,1)$ is the discount factor used to discount future rewards
So in DP, the full model is known: for any current state $s$ and action $a$, we know the distribution over all possible next outcomes $(s’, r)$, and we also know the fixed discount factor $\gamma$.
So for every state-action pair $(s, a)$, DP assumes the model gives:
- all possible next states $s’$
- all possible rewards $r$
- the full transition-reward distribution $p(s’, r \mid s, a)$
Once the MDP model $\left(\mathcal{S}, \mathcal{A}, p(s’, r \mid s, a), \gamma\right)$ is known, dynamic programming uses it to compute long-term return estimates under a policy $\pi(a \mid s)$. These are the state-value function $v_\pi(s)$ and the action-value function $q_\pi(s,a)$. They are defined in terms of the same transition-reward model $p(s’, r \mid s, a)$ and the same discount factor $\gamma$.
For policy evaluation, the state-value function is:
v_{\pi}(s) = \sum_a \pi(a \mid s)\sum_{s',r} p(s', r \mid s, a)\left[r + \gamma v_{\pi}(s')\right]
For action values:
q_{\pi}(s,a) = \sum_{s',r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s') q_{\pi}(s', a')\right]
In many tabular Gymnasium toy-text environments, this model is exposed as:
env.P[s][a]
Conceptually, the probability distribution is:
p[(s, a)] = [
(prob_1, next_state_1, reward_1),
(prob_2, next_state_2, reward_2),
...
]
where the probabilities for a fixed $(s, a)$ sum to 1.
In a tabular Gymnasium toy-text implementation, this is usually stored as:
env.P: dict[int, dict[int, list[tuple[float, int, float, bool]]]]
Meaning:
env.P[s]returns all actions available from statesenv.P[s][a]returns a list of transitions for actionain states- each transition is
(prob, next_state, reward, done)
So the implementation form of the transition-reward probability distribution is:
env.P[s][a] = [
(prob_1, next_state_1, reward_1, done_1),
(prob_2, next_state_2, reward_2, done_2),
...
]
Here:
prob_icorresponds to the probability mass fromp(s', r \mid s, a)next_state_iis one possibles'reward_iis the reward paired with thats'done_iindicates whether that transition ends the episode
Example:
env.P[7][1]
# [(1.0, 8, 10.0, True)]
This means:
- with probability
1.0 - the next state is
8 - the reward is
10.0 - and the transition ends the episode
Note:
- this tabular
Pstructure is common in Gymnasium toy-text style environments, but it is not part of the generic GymnasiumEnvAPI for all environments - some codebases may write
env.p[s][a], but the common attribute name in these environments is uppercaseenv.P
Intuition
Dynamic programming computes values by applying Bellman updates repeatedly.
In this grid world:
- if moving
RightfromS7reachesS8, that action should become highly valued - then states that can reach
S7should also gain value - eventually value information propagates backward through the whole grid
Diagram
flowchart LR
S0["S0"] --> S1["S1"]
S1 --> S2["S2"]
S2 --> S5["S5"]
S5 --> S8["S8 Goal"]
S0 --> S3["S3"]
S3 --> S6["S6"]
S6 --> S7["S7"]
S7 --> S8
Dynamic programming uses the full model to evaluate all possible actions and transitions at every state.
Example Policy Iteration Code
import numpy as np
def policy_evaluation(env, policy, gamma=0.9, theta=1e-8):
num_states = env.observation_space.n
V = np.zeros(num_states)
while True:
delta = 0.0
for s in range(num_states):
old_v = V[s]
new_v = 0.0
for a, action_prob in enumerate(policy[s]):
if action_prob == 0:
continue
for prob, next_state, reward, done in env.P[s][a]:
new_v += action_prob * prob * (
reward + gamma * (1 - done) * V[next_state]
)
V[s] = new_v
delta = max(delta, abs(old_v - new_v))
if delta < theta:
break
return V
def compute_q_from_v(env, V, gamma=0.9):
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))
for s in range(num_states):
for a in range(num_actions):
for prob, next_state, reward, done in env.P[s][a]:
Q[s, a] += prob * (
reward + gamma * (1 - done) * V[next_state]
)
return Q
def policy_improvement(env, V, gamma=0.9):
Q = compute_q_from_v(env, V, gamma)
num_states, num_actions = Q.shape
policy = np.zeros((num_states, num_actions))
for s in range(num_states):
best_action = np.argmax(Q[s])
policy[s, best_action] = 1.0
return policy, Q
def policy_iteration(env, gamma=0.9, theta=1e-8):
num_states = env.observation_space.n
num_actions = env.action_space.n
policy = np.ones((num_states, num_actions)) / num_actions
while True:
V = policy_evaluation(env, policy, gamma=gamma, theta=theta)
new_policy, Q = policy_improvement(env, V, gamma=gamma)
if np.array_equal(new_policy, policy):
break
policy = new_policy
return policy, V, Q
Example Outcome
A likely optimal policy for the grid might look like:
+---------+---------+---------+
| S0 -> | S1 -> | S2 v |
+---------+---------+---------+
| S3 -> | S4 -> | S5 v |
+---------+---------+---------+
| S6 -> | S7 -> | S8 Goal |
+---------+---------+---------+
This means:
- from the top row, move right until the last column
- then move down toward the goal
What DP is doing
Dynamic programming is effectively asking:
- if I know the exact transition probabilities and rewards,
- what is the exact best value of each state,
- and therefore what is the exact best action at each state?
So DP performs full expected backups.
Case Study 2: Q-Learning
Setting
Now assume we do not know the model.
So we do not know:
env.P[s][a]- transition probabilities
- exact reward structure in table form
Instead, the agent only interacts with the environment and observes:
(state, action, reward, next_state)
Intuition
Q-learning starts with a zero Q-table and improves it from experience.
At first:
- the agent explores
- values are mostly guesses
Over many episodes:
- transitions leading toward the goal receive better value
- bad actions get lower value
- the Q-table gradually approximates the optimal action-value function
Q-Learning Update
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t)\right]
Diagram
flowchart LR
A["Current state s"] --> B["Choose action a"]
B --> C["Observe reward r and next state s'"]
C --> D["Compute target: r + gamma max_a' Q(s',a')"]
D --> E["Update Q(s,a)"]
Q-Learning Code
import numpy as np
import random
def epsilon_greedy_action(Q, state, epsilon):
if random.random() < epsilon:
return random.randrange(Q.shape[1])
return np.argmax(Q[state])
def q_learning(env, num_episodes=5000, alpha=0.1, gamma=0.9, epsilon=1.0,
epsilon_decay=0.995, epsilon_min=0.01):
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))
for episode in range(num_episodes):
state, _ = env.reset()
done = False
while not done:
action = epsilon_greedy_action(Q, state, epsilon)
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
target = reward + gamma * (1 - done) * np.max(Q[next_state])
Q[state, action] += alpha * (target - Q[state, action])
state = next_state
epsilon = max(epsilon_min, epsilon * epsilon_decay)
return Q
def greedy_policy_from_q(Q):
num_states, num_actions = Q.shape
policy = np.zeros((num_states, num_actions))
for s in range(num_states):
best_action = np.argmax(Q[s])
policy[s, best_action] = 1.0
return policy
Example Learning Story
Suppose the agent is in S7 and takes Right.
If that reaches the goal S8 with reward +10, then:
Q(S7, Right) becomes large
Then later, if the agent is in S6 and moving Right often leads to S7, then:
Q(S6, Right) also increases
So just like DP, value information propagates backward, but here it happens from sampled experience instead of from the known model.
Example Learned Policy
After enough episodes, the learned greedy policy may become:
+---------+---------+---------+
| S0 -> | S1 -> | S2 v |
+---------+---------+---------+
| S3 -> | S4 -> | S5 v |
+---------+---------+---------+
| S6 -> | S7 -> | S8 Goal |
+---------+---------+---------+
So the final policy may match the dynamic programming solution, even though the learning process is completely different.
Side-by-Side Comparison
Core Difference
Dynamic Programming:
- knows the model
- computes exact expected updates
Q-Learning:
- does not know the model
- learns from sampled experience
Backup Comparison
Dynamic programming backup:
Q^*(s,a) = \sum_{s',r} p(s',r \mid s,a)\left[r + \gamma \max_{a'} Q^*(s',a')\right]
Q-learning backup:
Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]
Interpretation:
- DP averages over all possible next outcomes
- Q-learning uses one observed sample
Diagram
flowchart TD
A["Dynamic Programming"]
B["Known model"]
C["Enumerate all next outcomes"]
D["Expected Bellman backup"]
E["Exact update"]
F["Q-Learning"]
G["Unknown model"]
H["Observe one transition sample"]
I["Sampled Bellman backup"]
J["Incremental update"]
A --> B --> C --> D --> E
F --> G --> H --> I --> J
Comparison Table
| Aspect | Dynamic Programming | Q-Learning |
|---|---|---|
| Model needed? | Yes | No |
Uses env.P[s][a]? |
Yes | No |
| Update type | Expected backup | Sampled backup |
| Learns from | Full model | Experience |
| Requires episodes? | Not necessarily | Usually trained over episodes |
| Exploration needed? | No | Yes |
| Typical setting | Known tabular MDP | Unknown environment |
| Converges to optimal? | Yes, under standard assumptions | Yes, under standard assumptions and sufficient exploration |
Why Both Matter
State value and action value answer two different questions.
The state-value function V(s) asks:
- how good is it to be in state
s?
The action-value function Q(s,a) asks:
- how good is it to take action
ain states?
So their roles are different:
V(s)gives a summary of the long-term usefulness of a state under a policyQ(s,a)gives a decision-level score for each available action
Why do we need both?
- if we only know
V(s), we know whether a state is good or bad, but we do not directly know which action caused that goodness unless we also use the model - if we know
Q(s,a), we can act immediately by choosing the action with the highest value - in model-based methods like DP,
V(s)is often convenient for policy evaluation because it summarizes each state compactly - in control methods like Q-learning,
Q(s,a)is especially useful because the agent must compare actions without knowing the model
Another way to say it:
V(s): state-level evaluation
Q(s,a): action-level evaluation
Their practical purpose is:
V(s)helps us understand how promising each state isQ(s,a)helps us choose what to do next
Their conceptual relationship is:
V(s)is useful for evaluating statesQ(s,a)is useful for improving decisions- when a model is known, we can move between them
- when a model is unknown, learning
Q(s,a)directly is often the most practical route to control
Dynamic programming is important because:
- it gives the clean mathematical foundation
- it shows what the exact Bellman solution looks like
- it explains policy evaluation, policy improvement, policy iteration, and value iteration
Q-learning is important because:
- in real problems the model is often unknown
- we still want to learn optimal behavior
- Q-learning shows how Bellman optimality can be learned from data
So the relationship is:
Dynamic Programming:
exact Bellman updates with known model
Q-Learning:
sample-based Bellman updates without known model
Final Takeaway
Both methods try to answer the same question:
- what is the best action to take in each state?
But they solve it differently:
- Dynamic Programming solves it from the model
- Q-Learning solves it from experience
So if the model is known, dynamic programming is the cleanest exact method.
If the model is unknown, Q-learning is a practical model-free alternative.
Key Insights in Sutton-Style Framing
The following ideas are closely aligned with the way Sutton and Barto motivate these methods.
Key insights behind Dynamic Programming
- Dynamic programming assumes a complete model of the environment
DP requires knowledge of:
- state transitions
- rewards
This is why DP is mainly a planning method rather than a pure learning method.
- DP is built directly on the Bellman equations
The Bellman expectation and Bellman optimality equations provide recursive definitions of value.
DP turns those recursive equations into iterative algorithms.
- Policy evaluation and policy improvement are the central decomposition
A major insight is that control can be broken into two alternating steps:
- evaluate the current policy
- improve the policy using the current values
This leads naturally to policy iteration.
- Value iteration compresses evaluation and improvement together
Instead of fully evaluating a policy before improving it, value iteration performs partial evaluation and greedy improvement in the same update.
- DP establishes the conceptual foundation for later RL methods
Even when a model is not available, the structure of DP remains important because later methods such as TD learning and Q-learning can be understood as approximations to DP-style Bellman backups.
Key insights behind Q-Learning
- Q-learning learns action values directly from experience
Rather than learning a model first, Q-learning directly estimates:
\(Q^*(s,a)\)
This is a major shift from planning with a model to learning from interaction.
- Q-learning is a sample-based version of Bellman optimality updates
Instead of averaging over all possible next outcomes, Q-learning updates from one sampled transition at a time.
So it can be seen as replacing:
- full expected backup
with:
- sampled backup
- Q-learning is off-policy
This is one of the most important conceptual insights.
The agent may behave using an exploratory policy, such as epsilon-greedy, but the update target still uses:
\(\max_{a'} Q(s',a')\)
So it learns about the greedy policy while behaving differently during learning.
- The learned Q-values are enough for control
Once the Q-table is learned, there is no need to separately learn a model or a state-value function.
The policy can be obtained directly by choosing:
\(\arg\max_a Q(s,a)\)
- Q-learning brings optimal control into the model-free setting
This is the core reason it is so influential.
It keeps the optimality idea from dynamic programming, but removes the need for a known transition model.
Shared insight across both methods
Both dynamic programming and Q-learning are driven by the same high-level idea:
- good decisions today depend on reward today plus good decisions tomorrow
In Sutton-style RL, this is the unifying role of Bellman recursion.
So:
- DP uses Bellman recursion with a known model
- Q-learning uses Bellman recursion with sampled experience
Short summary
Dynamic Programming:
planning with a known model
exact Bellman backups
policy evaluation + policy improvement
Q-Learning:
learning without a model
sampled Bellman optimality backups
direct estimation of optimal action values