RL Tutorial: MDP Process
Reward is all you need?
It is the era of reinforcement learning!
Several Courses
- CS285 (Berkeley): Deep Reinforcement Learning
  - The resources it provides
- Hands-on Reinforcement Learning (ACM class)
- Deep Reinforcement Learning by Hongyi Li
- CS 234 Reinforcement Learning
Introduction
The first lecture of CS234: an introduction to reinforcement learning.
Reinforcement learning generally involves:
- Optimization methods (General)
- Delayed consequences
- Exploration
- Generalization
Evaluation and exploration!
| | AI Planning | SL | UL | RL | IL |
|---|---|---|---|---|---|
| Optimization | X | | | X | X |
| Learns from experience | | X | X | X | X |
| Generalization | X | X | X | X | X |
| Delayed Consequences | X | | | X | X |
| Exploration | | | | X | |
IL: Imitation learning, like a parrot.
RLHF: Reinforcement Learning from Human Feedback.
RLHF and DPO are classical offline RL algorithms: we do not explicitly "explore" during training; instead, we simply inspect and learn from the collected historical data (e.g., with a policy-gradient algorithm).
We can see that exploration is what makes reinforcement learning unique!
Simple Demo
- Action Space?
- State Space?
- Reward?
- How do actions transform the state?
- Reward Hacking
Interacting with the world
At each discrete time step $t$, the agent will:
- Take an action $a_t$ to interact with the environment
- Receive the reward $r_t$ and the observation $o_t$
- Finally, update itself for the next iteration! (A minimal interaction loop is sketched right after this list.)
- History: $h_t = (a_1, o_1, r_1, \dots, a_t, o_t, r_t)$
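To make the loop concrete, here is a minimal sketch in Python. It assumes a Gymnasium-style environment interface (`reset`/`step`) and a made-up `RandomAgent` class; both are illustrative assumptions, not part of the course material.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
import gymnasium as gym


class RandomAgent:
    """Placeholder agent: acts randomly and records the history h_t."""

    def __init__(self, action_space):
        self.action_space = action_space
        self.history = []  # h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t)

    def act(self, observation):
        # A real agent would use a policy here.
        return self.action_space.sample()

    def update(self, action, observation, reward):
        # A real agent would update its policy / value estimates here.
        self.history.append((action, observation, reward))


env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
agent = RandomAgent(env.action_space)

for t in range(100):
    action = agent.act(obs)                                       # 1. take an action
    obs, reward, terminated, truncated, info = env.step(action)   # 2. get reward + observation
    agent.update(action, obs, reward)                             # 3. update for the next iteration
    if terminated or truncated:
        obs, info = env.reset()
```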
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
An MDP is formally defined by five key components $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$ (State Space): This is the set of all possible states of the environment. A state must have the Markov property, meaning the future depends only on the current state and action, not on the sequence of events that led to it.
  - For example: after $t$ steps, the future depends only on $s_t$ and $a_t$.
- $\mathcal{A}$ (Action Space): This is the set of all possible actions an agent can take.
- $P$ (Transition Probability): This defines the dynamics of the environment. $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$. This is the stochastic part of the MDP, representing the uncertainty in the environment.
  - It is just a probability.
  - The state $s_t$ is Markov iff $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid h_t, a_t)$, where $h_t$ is the full history.
- $R$ (Reward Function): This function determines the immediate reward the agent receives after taking action $a$ in state $s$ and landing in state $s'$. The reward is a numerical value that tells the agent how good or bad a particular action is in a given state. The ultimate goal of the agent is to maximize its cumulative reward over time.
  - Reward function: $R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
  - Or we can say the general reward for the state: $R(s) = \mathbb{E}[r_t \mid s_t = s]$
- $\gamma$ (Discount Factor): This is a value between 0 and 1 that discounts future rewards. It's used to give more weight to immediate rewards compared to rewards received in the distant future. A discount factor of 0 makes the agent "myopic" (only considers immediate rewards), while a factor closer to 1 makes the agent "farsighted" (values long-term rewards). A small worked example follows this list.
  - Define the return $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ and the state value function: $V(s) = \mathbb{E}[G_t \mid s_t = s]$
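For intuition, here is a small worked example with made-up numbers (not from the course): with $\gamma = 0.9$ and a constant reward of $+1$ at every step, the return is a geometric series,

$$
G_t = \sum_{k=0}^{\infty} \gamma^k \cdot 1 = 1 + 0.9 + 0.81 + \cdots = \frac{1}{1 - 0.9} = 10,
$$

so an agent with $\gamma = 0.9$ effectively looks about $1/(1-\gamma) = 10$ steps ahead.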
The central problem of an MDP is to find an optimal policy $\pi^*$. A policy $\pi$ is a strategy that tells the agent which action to take in each state. The optimal policy is the one that maximizes the agent's total expected cumulative reward over a long sequence of decisions.
- Deterministic: $\pi(s) = a$
- Stochastic: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$
Matrix & Linear Transformation
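Under a fixed policy $\pi$, the transition dynamics collapse into a matrix $P^\pi$ and the expected rewards into a vector $R^\pi$, so evaluating the policy, $V^\pi = R^\pi + \gamma P^\pi V^\pi$, is just a linear system: $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$. Here is a small NumPy sketch on a toy 3-state MDP; all numbers are invented for illustration, not taken from the course.

```python
import numpy as np

# Toy MDP with 3 states and a *fixed* policy already baked in:
# P_pi[s, s'] = probability of moving s -> s' under the policy,
# R_pi[s]     = expected immediate reward in state s.
# (All numbers are made up for illustration.)
gamma = 0.9
P_pi = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
R_pi = np.array([1.0, 0.0, 2.0])

# Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# The same answer via repeated Bellman backups: V <- R_pi + gamma * P_pi @ V.
V_iter = np.zeros(3)
for _ in range(1000):
    V_iter = R_pi + gamma * P_pi @ V_iter

print(V_exact)                        # state values under the fixed policy
print(np.allclose(V_exact, V_iter))   # True: both methods agree
```

Each Bellman backup $V \mapsto R^\pi + \gamma P^\pi V$ is an affine map of the value vector, which is why viewing $P^\pi$ as a linear transformation (a matrix) is so convenient here.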
Is the world Markov?
The world is partially observable.
- We then have a Partially Observable Markov Decision Process (POMDP).
- In these cases, the observation carries less information than the true state. (In real-world scenarios, it actually does!)
These are the components of the belief-state Bellman equation

$$
V(b) = \max_{a} \Big[\, R(b, a) + \gamma \sum_{o} P(o \mid b, a)\, V(b') \,\Big]:
$$

- $V(b)$: the value of the current belief $b$.
- $R(b, a)$: the expected immediate reward of taking action $a$ under belief $b$. It is the reward weighted and summed over all possible states: $R(b, a) = \sum_{s} b(s) R(s, a)$.
- $\gamma \sum_{o} P(o \mid b, a) V(b')$: the expected future value. It weights over all possible observations $o$, where each observation's weight is its probability of occurring, $P(o \mid b, a)$, and its value is the value $V(b')$ of the new belief $b'$. (A sketch of this belief update follows below.)
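As a hedged sketch of how the new belief $b'$ could be computed, here is a Bayes-filter update in NumPy. The array names and shapes (`P[s, a, s']` for transitions, `O[a, s', o]` for observation probabilities) are my own conventions, not from the course.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One Bayes-filter step for a POMDP belief.

    b: current belief over states, shape (S,)
    a: index of the action taken
    o: index of the observation received
    P: transition probabilities, P[s, a, s'] = Pr(s' | s, a), shape (S, A, S)
    O: observation probabilities, O[a, s', o] = Pr(o | s', a), shape (A, S, num_obs)
    Returns b'(s') proportional to Pr(o | s', a) * sum_s Pr(s' | s, a) * b(s).
    """
    predicted = b @ P[:, a, :]            # sum_s b(s) * Pr(s' | s, a)
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()


# Tiny usage example: 2 states, 1 action, 2 observations (made-up numbers).
P = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])     # shape (2, 1, 2)
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])       # shape (1, 2, 2)
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, O=O))    # updated belief, sums to 1
```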