# Deep Q learning

## What is Q?
The action-value function $Q(s,a)$ assigns a value (the expected cumulative reward) to each combination of state $s_t$ and action $a_t$.
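For intuition, in a small discrete problem Q can simply be a lookup table over state-action pairs; a toy sketch with made-up values:

```python
# Q-table for a toy problem with 2 states and 2 actions (values are illustrative)
Q = {
    (0, "left"): 0.1, (0, "right"): 0.5,
    (1, "left"): 0.7, (1, "right"): 0.2,
}

def best_action(state):
    # greedy choice: the action with the highest Q value in this state
    return max(["left", "right"], key=lambda a: Q[(state, a)])

print(best_action(0))  # right
print(best_action(1))  # left
```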

## Recursive definition of Q
$Q(s_t,a_t)$ can be written as a recursive formula called the Bellman equation, expressing the Q value of the current state in terms of the Q values of the next states:

$$Q(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \, Q(s_{t+1}, a_{t+1}) \,\right]$$

The update rule for Q learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\right]$$
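The update rule can be sketched for a tabular Q function (the learning rate $\alpha$ and the toy values here are illustrative assumptions):

```python
GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate

# toy Q-table: Q[state][action], 3 states and 2 actions, initialized to zero
Q = {s: {a: 0.0 for a in range(2)} for s in range(3)}

def q_update(s, a, r, s_next, done):
    # TD target: r + gamma * max_a' Q(s', a'), with no bootstrap at episode end
    target = r if done else r + GAMMA * max(Q[s_next].values())
    # move Q(s, a) a step of size alpha toward the target
    Q[s][a] += ALPHA * (target - Q[s][a])

q_update(s=0, a=1, r=1.0, s_next=1, done=False)
print(Q[0][1])  # 0.1 (all Q values started at zero)
```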

## Bellman optimality equation (for $Q^{*}$)

$$Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\right]$$

## Q Network
A Q network maps states to action values: a neural network implements the Q function, taking the state as input and producing one Q value per action.
```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        # two fully connected hidden layers with 64 nodes each
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        # accept raw NumPy/list states as well as tensors
        if not isinstance(x, torch.Tensor):
            x = torch.tensor(x, dtype=torch.float32)
        x = x.float()
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```

## Training the neural network with the lunar lander environment - Part 1

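Before training, a quick shape sanity check helps: the network should map a state vector to one Q value per action. A minimal sketch using an equivalent `nn.Sequential` stack (the 8/4 dimensions match LunarLander's state and action spaces):

```python
import torch
import torch.nn as nn

# compact stand-in with the same shape as QNetwork above (8-dim state, 4 actions)
net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

state = torch.randn(8)       # a single LunarLander state vector
print(net(state).shape)      # one Q value per action -> torch.Size([4])

batch = torch.randn(32, 8)   # a batch of 32 states
print(net(batch).shape)      # torch.Size([32, 4])
```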
"""
Simple LunarLander DQN-style training loop.
"""
import gymnasium as gym
import torch
import torch.nn as nn
from q_network import QNetwork
GAMMA = 0.99
LR = 1e-4
NUM_EPISODES = 10
model = QNetwork(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.MSELoss()
def to_tensor(state):
return torch.tensor(state, dtype=torch.float32)
def select_action(net, state_tensor):
q_values = net(state_tensor)
return torch.argmax(q_values).item()
def calculate_loss(net, state, action, next_state, reward, done):
state_t = to_tensor(state)
next_state_t = to_tensor(next_state)
q_values = net(state_t)
current_q = q_values[action]
with torch.no_grad():
next_q = net(next_state_t).max()
target_q = reward + GAMMA * next_q * (1 - int(done))
return criterion(current_q, target_q)
env = gym.make("LunarLander-v3", render_mode="human")
for episode in range(NUM_EPISODES):
state, _ = env.reset()
done = False
episode_reward = 0.0
while not done:
action = select_action(model, to_tensor(state))
next_state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
loss = calculate_loss(model, state, action, next_state, reward, done)
optimizer.zero_grad()
loss.backward()
optimizer.step()
state = next_state
episode_reward += reward
print(f"Episode {episode + 1}: reward={episode_reward:.2f}")
env.close()The lunar lander crashes. Because -
- The agent starts with an untrained Q-network, so its Q-values are essentially random.
- The policy is greedy (argmax) from the start, so it repeatedly picks whatever random action currently looks best.
- There is no exploration strategy (no epsilon-greedy), so it does not try enough alternative actions to discover safer behavior.
- LunarLander needs coordinated action sequences; random/poor early choices quickly lead to bad trajectories and crashes.
- Training updates are noisy because they are performed step-by-step on highly correlated samples (no replay buffer).
- There is no target network, so the learning target moves every step, making Q-learning unstable.
- Very short training (10 episodes) is far from enough for this task; early episodes are expected to be mostly crashes.
- No reward/gradient stabilization (e.g., clipping) can further increase unstable updates in early training.
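The exploration problem in particular is easy to illustrate. A minimal epsilon-greedy action selector, sketched with a stand-in linear network (the decay schedule and constants here are assumptions, not the values used later):

```python
import random
import torch
import torch.nn as nn

def select_action(net, state_tensor, epsilon, action_dim=4):
    # with probability epsilon, explore with a uniformly random action
    if random.random() < epsilon:
        return random.randrange(action_dim)
    # otherwise exploit the current Q estimates greedily
    with torch.no_grad():
        return torch.argmax(net(state_tensor)).item()

# demo with a stand-in linear "Q network" (8-dim state, 4 actions)
net = nn.Linear(8, 4)
state = torch.randn(8)

# decay epsilon per episode: explore a lot early, exploit more later
EPS_END, EPS_DECAY = 0.05, 0.995
epsilon = 1.0
for episode in range(3):
    action = select_action(net, state, epsilon)
    epsilon = max(EPS_END, epsilon * EPS_DECAY)
    print(f"episode {episode}: action={action}, epsilon={epsilon:.3f}")
```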
## Improving the training loop - Part 2