RL Tutorial: MDP Process

Reward is all you need?

It is the era of reinforcement learning!

Several Courses

Introduction

The first lecture of CS234, an introduction to reinforcement learning.

Reinforcement learning generally involves:

  • Optimization methods (General)
  • Delayed consequences
  • Exploration
  • Generalization

Evaluation and exploration!

|                        | AI Planning | SL | UL | RL | IL |
| ---------------------- | ----------- | -- | -- | -- | -- |
| Optimization           | X           |    |    | X  | X  |
| Learns from experience |             | X  | X  | X  | X  |
| Generalization         | X           | X  | X  | X  | X  |
| Delayed Consequences   | X           |    |    | X  | X  |
| Exploration            |             |    |    | X  |    |

IL: Imitation learning, like a parrot.
RLHF: Reinforcement learning from human feedback.

RLHF and DPO are classical offline RL algorithms. We do not explicitly “explore” during the training process; instead, we just inspect and learn from the collected historical data (e.g., with a policy gradient algorithm).

We can see that exploration is what makes reinforcement learning unique!

Simple Demo

  • Action Space?
  • State Space?
  • Reward?
  • How do actions transform the state?
  • Reward Hacking

Interacting with the world

At each discrete time step, the agent will do the following (a minimal interaction loop is sketched after this list):

  • Take an action $A_t$ to interact with the environment
  • Receive the reward $R_t$ and the observation $O_t$ from the environment
  • Finally, update itself for the next iteration!
  • History: $h_t = (a_1, o_1, r_1, a_2, o_2, r_2, \dots, a_t, o_t, r_t)$
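To make the loop concrete, here is a minimal Python sketch. The `RandomWalkEnv` class and its dynamics are invented purely for illustration; any environment exposing a `reset`/`step` interface would do.

```python
import random

class RandomWalkEnv:
    """A toy 1-D random-walk environment, invented here only to illustrate the loop."""

    def reset(self):
        self.state = 0
        return self.state  # initial observation

    def step(self, action):
        # action is -1 (left) or +1 (right); the environment adds some noise
        self.state += action if random.random() < 0.8 else -action
        reward = 1.0 if self.state >= 5 else 0.0
        done = abs(self.state) >= 5
        return self.state, reward, done  # observation O_t, reward R_t, termination flag

env = RandomWalkEnv()
obs = env.reset()
history = []  # the history h_t, stored as (a_t, o_t, r_t) tuples

for t in range(100):
    action = random.choice([-1, +1])       # placeholder policy: act at random
    obs, reward, done = env.step(action)   # environment returns O_t and R_t
    history.append((action, obs, reward))  # extend the history h_t
    # a real agent would update itself here (value estimates, policy, ...)
    if done:
        break
```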

Markov Decision Process, MDP

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

An MDP is formally defined by five key components: $(S, A, P, R, \gamma)$.

  • $S$ (State Space): This is the set of all possible states of the environment. A state must have the Markov property, meaning the future depends only on the current state and action, not on the sequence of events that led to it.

    For example: at step $t$, the state $S_t$ depends only on $S_{t-1}$ and $A_{t-1}$, not on the earlier history.

  • $A$ (Action Space): This is the set of all possible actions an agent can take.

  • $P$ (Transition Probability): This defines the dynamics of the environment. $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$. This is the stochastic part of the MDP, representing the uncertainty in the environment.

    • It is just a probability.
    • The state $s_t$ is Markov iff:

    $$p(s_t \mid a_{t-1}, s_{t-1}) = p(s_t \mid a_{t-1}, h_{t-1})$$

  • $R$ (Reward Function): This function determines the immediate reward the agent receives after taking action $a$ in state $s$ and landing in state $s'$. The reward is a numerical value that tells the agent how good or bad a particular action is in a given state. The ultimate goal of the agent is to maximize its cumulative reward over time.

    • reward function:

    $$r(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$$

    • or, equivalently, the reward can be defined in terms of the successor state $s_{t+1}$ as well
  • $\gamma$ (Discount Factor): This is a value between 0 and 1 that discounts future rewards. It’s used to give more weight to immediate rewards compared to rewards received in the distant future. A discount factor of 0 makes the agent “myopic” (only considers immediate rewards), while a factor closer to 1 makes the agent “farsighted” (values long-term rewards). A small numeric illustration follows this list.

    • Define the state value function:

    $$V(s) = \mathbb{E}[G_t \mid s_t = s]$$

    $$G_t = r_t + \gamma r_{t+1} + \dots + \gamma^{H} r_{t+H}$$
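As promised, a small numeric illustration of the return $G_t$ and the effect of $\gamma$; the reward sequence below is arbitrary and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + ... + gamma^H * r_{t+H}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 10.0]     # arbitrary reward sequence

print(discounted_return(rewards, 0.0))   # 0.0     -> "myopic": only the immediate reward counts
print(discounted_return(rewards, 0.9))   # ~7.371  -> "farsighted": the later reward of 10 still matters
```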

The central problem of an MDP is to find an optimal policy $(\pi)$. A policy is a strategy that tells the agent which action to take in each state. The optimal policy is the one that maximizes the agent’s total expected cumulative reward over a long sequence of decisions.

  • Deterministic: $\pi(s) = a$
  • Stochastic: $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$
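A minimal sketch of the two kinds of policy, with made-up states, actions, and probabilities:

```python
import random

# Deterministic policy: pi(s) = a
pi_det = {"s0": "left", "s1": "right"}
action = pi_det["s0"]                                 # always "left" in state s0

# Stochastic policy: pi(a | s) = Pr(a_t = a | s_t = s)
pi_stoch = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}
actions, probs = zip(*pi_stoch["s0"].items())
action = random.choices(actions, weights=probs)[0]    # sample a ~ pi(. | s0)
```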

Matrix & Linear Transformation

$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1N} \\ P_{21} & P_{22} & \cdots & P_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N1} & P_{N2} & \cdots & P_{NN} \end{pmatrix}$$

$$\pi_{t+1} = P \pi_{t}$$

Here $\pi_t$ denotes the distribution over states at time $t$ (not the policy), so applying the transition matrix is a linear map on state distributions.
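A numpy sketch of this linear map, assuming the convention that $P_{ij}$ is the probability of moving to state $i$ from state $j$ (columns sum to 1, which is what makes $\pi_{t+1} = P \pi_t$ the right update); the 2-state matrix is made up:

```python
import numpy as np

# Column-stochastic transition matrix: P[i, j] = Pr(next state = i | current state = j)
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
assert np.allclose(P.sum(axis=0), 1.0)  # each column is a probability distribution

dist = np.array([1.0, 0.0])  # pi_0: all probability mass on state 0

for t in range(5):
    dist = P @ dist          # pi_{t+1} = P pi_t
    print(t + 1, dist)       # the distribution drifts toward the stationary distribution
```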

Is the world Markov?

The world is partially observable.

  • We have the Partially Observable Markov Decision Process (POMDP).

  • In these cases, the observation reveals less than the full state. (In real-world scenarios, it actually does!)

$$b'(s') = P(s' \mid o, a, b) = \frac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a) b(s)}{P(o \mid a, b)}$$

$$V(b) = \max_{a \in A} \left[ R(b, a) + \gamma \sum_{o \in \Omega} P(o \mid a, b) V(b'_{o,a}) \right]$$

  • $V(b)$: the value of the current belief $b$.
  • $R(b, a)$: the expected immediate reward of executing action $a$ under belief $b$. It is the reward summed over all possible states, weighted by the belief: $R(b, a) = \sum_{s \in S} b(s) R(s, a)$
  • $\sum_{o \in \Omega} P(o \mid a, b) V(b'_{o,a})$: this part is the expected future value. It sums over all possible observations $o$, where each observation is weighted by its probability $P(o \mid a, b)$ and its value is the value $V(b')$ of the new belief $b'$.
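The belief-update formula above can be transcribed almost directly into code. Below is a minimal numpy sketch; the `belief_update` function name and the toy transition/observation tables are my own and exist only for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """b'(s') = P(o | s', a) * sum_s P(s' | s, a) * b(s) / P(o | a, b).

    b: belief over states, shape (S,)
    T: transition probabilities, T[a, s, s_next] = P(s' | s, a)
    Z: observation probabilities, Z[a, s_next, o] = P(o | s', a)
    """
    predicted = T[a].T @ b                     # sum_s P(s' | s, a) b(s)
    unnormalized = Z[a, :, o] * predicted      # multiply by P(o | s', a)
    return unnormalized / unnormalized.sum()   # normalize by P(o | a, b)

# Toy example: 2 states, 1 action, 2 observations (numbers invented for illustration)
T = np.array([[[0.7, 0.3],
               [0.2, 0.8]]])
Z = np.array([[[0.9, 0.1],
               [0.3, 0.7]]])
b = np.array([0.5, 0.5])

print(belief_update(b, a=0, o=1, T=T, Z=Z))    # the belief shifts toward state 1
```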
