Samrat Kar

exploring & experimenting

Deep Q Learning Notes

Deep Q learning

What is Q?

The action-value function $Q(s_t,a_t)$ assigns a value (the expected cumulative reward) to each combination of state $s_t$ and action $a_t$.

Recursive definition of Q

$Q(s_t,a_t)$ can be written as a recursive formula called the Bellman equation, expressing the Q value in the current state in terms of the Q values of the next states:
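A standard way to write this recursion (consistent with the optimality form given below) bootstraps the current Q-value from the immediate reward plus the discounted best Q-value of the next state:

\[Q(s_t,a_t) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q(s_{t+1},a') \right]\]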

The update rule for Q learning -
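In its standard tabular form, the update nudges the current estimate toward the bootstrapped target, with learning rate $\alpha$:

\[Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right]\]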

Q Network

Mapping states to action values

Bellman optimality equation (for $Q^{*}$):

\[Q^{*}(s_t,a_t) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1},a') \;\middle|\; s_t=s,\; a_t=a \right]\]

A neural network to implement the Q function

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        # two fully connected hidden layers with 64 nodes each
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        # Accept raw lists/arrays as well as tensors
        if not isinstance(x, torch.Tensor):
            x = torch.tensor(x, dtype=torch.float32)
        x = x.float()
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```
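As a quick sanity check (written here with an equivalent `nn.Sequential` stack so the snippet runs on its own), the network maps an 8-dimensional LunarLander observation to one Q-value per action:

```python
import torch
import torch.nn as nn

# Same shape as QNetwork above: 8-dim state in, 4 action values out
net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

state = torch.zeros(8)   # a dummy LunarLander observation
q_values = net(state)
print(q_values.shape)    # torch.Size([4]) - one Q-value per action
```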

Training the neural network with lunar lander environment - Part 1

"""
Simple LunarLander DQN-style training loop.
"""

import gymnasium as gym
import torch
import torch.nn as nn

from q_network import QNetwork

GAMMA = 0.99
LR = 1e-4
NUM_EPISODES = 10


model = QNetwork(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.MSELoss()


def to_tensor(state):
    return torch.tensor(state, dtype=torch.float32)


def select_action(net, state_tensor):
    q_values = net(state_tensor)
    return torch.argmax(q_values).item()


def calculate_loss(net, state, action, next_state, reward, done):
    state_t = to_tensor(state)
    next_state_t = to_tensor(next_state)

    q_values = net(state_t)
    current_q = q_values[action]

    with torch.no_grad():
        next_q = net(next_state_t).max()
        target_q = reward + GAMMA * next_q * (1 - int(done))

    return criterion(current_q, target_q)


env = gym.make("LunarLander-v3", render_mode="human")

for episode in range(NUM_EPISODES):
    state, _ = env.reset()
    done = False
    episode_reward = 0.0

    while not done:
        action = select_action(model, to_tensor(state))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        loss = calculate_loss(model, state, action, next_state, reward, done)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
        episode_reward += reward

    print(f"Episode {episode + 1}: reward={episode_reward:.2f}")

env.close()

The lunar lander crashes because:

  1. The agent starts with an untrained Q-network, so its Q-values are essentially random.
  2. The policy is greedy (argmax) from the start, so it repeatedly picks whatever random action currently looks best.
  3. There is no exploration strategy (no epsilon-greedy), so it does not try enough alternative actions to discover safer behavior.
  4. LunarLander needs coordinated action sequences; random/poor early choices quickly lead to bad trajectories and crashes.
  5. Training updates are noisy because they are done step-by-step on highly correlated samples (no replay buffer).
  6. There is no target network, so the learning target moves every step, making Q-learning unstable.
  7. Very short training (10 episodes) is far from enough for this task; early episodes are expected to be mostly crashes.
  8. No reward/gradient stabilization (e.g., clipping) can further increase unstable updates in early training.
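
Point 3 above is typically addressed with an epsilon-greedy policy: explore with probability epsilon, otherwise exploit the current Q-value estimates. A minimal sketch (the helper name and the stand-in linear network are illustrative, not from the original code):

```python
import random
import torch
import torch.nn as nn


def select_action_eps_greedy(net, state_tensor, epsilon, action_dim):
    # With probability epsilon, take a uniformly random action (explore);
    # otherwise pick the action with the highest predicted Q-value (exploit).
    if random.random() < epsilon:
        return random.randrange(action_dim)
    with torch.no_grad():
        q_values = net(state_tensor)
    return torch.argmax(q_values).item()


# Hypothetical usage with a stand-in network (8 state dims, 4 actions):
net = nn.Linear(8, 4)
state = torch.zeros(8)
action = select_action_eps_greedy(net, state, epsilon=0.1, action_dim=4)
```

In practice epsilon is annealed from ~1.0 toward a small floor (e.g. 0.01) over training, so the agent explores heavily at first and exploits later.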

## Improving the training loop - Part 2