
Samrat Kar

exploring & experimenting

Markov Decision Process (MDP) Notes

  1. MDP - A Markov Decision Process is used to model the environment in reinforcement learning. It is a mathematical framework for modeling decision-making problems in which an agent interacts with an environment to achieve a goal.
    An MDP consists of the following components -
    • states - the set of situations the agent can be in; some states are terminal states, where the episode ends.
    • actions - the choices available to the agent in each state (for example, each action can be indexed into that state's row of a Q-value array). Taking an action always leads to a state transition (possibly back to the same state).
    • transition probabilities - which specify the probability of transitioning from one state to another given a particular action.
    • rewards - which specify the immediate reward the agent receives after taking a particular action in a particular state.
  2. At the heart of an MDP is the Markov property, which states that the future state of the environment depends only on the current state and the action taken by the agent, not on past states or actions. This property allows us to model the environment as a Markov process, which greatly simplifies the analysis and solution of reinforcement learning problems.
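  The components and the Markov property above can be sketched in a few lines of Python. The states, actions, probabilities, and rewards below are made-up illustrative values, not from any particular problem; the key point is that `step` samples the next state from a distribution that depends only on the current `(state, action)` pair, never on history:

```python
import random

# A tiny hypothetical MDP: states "A", "B", and a terminal state "T".
# P[(state, action)] -> list of (next_state, probability, reward) triples.
P = {
    ("A", 0): [("A", 0.7, 0.0), ("B", 0.3, 1.0)],
    ("A", 1): [("B", 1.0, 0.5)],
    ("B", 0): [("T", 1.0, 10.0)],                   # reaching "T" ends the episode
    ("B", 1): [("A", 0.6, 0.0), ("T", 0.4, 5.0)],
}
TERMINAL = {"T"}

def step(state, action):
    """Sample one transition. The outcome distribution depends only on
    (state, action) -- the Markov property -- not on past states or actions."""
    outcomes = P[(state, action)]
    u, cum = random.random(), 0.0
    for next_state, prob, reward in outcomes:
        cum += prob
        if u < cum:
            return next_state, reward
    return outcomes[-1][0], outcomes[-1][2]  # guard against float round-off

# Run one episode under a uniform-random policy until a terminal state.
random.seed(0)
s, total = "A", 0.0
while s not in TERMINAL:
    a = random.choice([0, 1])
    s, r = step(s, a)
    total += r
```

  Because every episode can only terminate by entering "T" (with reward 10 or 5), the accumulated return of a finished episode here is always positive.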

  3. Markov property vs Bellman equation (how both are consistent)
  • It can feel like a contradiction:
    • The Markov property says: the future depends only on the present.
    • The Bellman equation writes the value of the present state in terms of next-state values.
  • There is no contradiction, because these two statements are about different things.

  • Forward-time causality (environment dynamics):
    • Environment transitions are causal in time:
      • (S_t, A_t \rightarrow S_{t+1})
    • Markov assumption:
      • (P(S_{t+1} \mid S_t, A_t, S_{t-1}, \dots) = P(S_{t+1} \mid S_t, A_t))
    • So the future state distribution depends only on current state-action, not full history.
  • Bellman equation (value recursion, not physical causality):
    • For a policy (\pi):
      • (V^\pi(s) = \sum_a \pi(a \mid s)\sum_{s'} P(s' \mid s,a)\big[R(s,a,s') + \gamma V^\pi(s')\big])
    • This is a recursive definition for the expected return.
    • It does not mean the future state causes the current state.
    • It means: to evaluate how good state (s) is, we account for the immediate reward and the expected value of the next states.
  • Key reconciliation:
    • Causality in the MDP is still forward in time.
    • Bellman relation is backward-looking only in computation/indexing.
    • So: forward causal process + recursive value equation can coexist without conflict.