Will RL Be the Future?
Original blog script: Will RL Be the AGI?
Current RL
RL has made giant leaps so far!
- AlphaGo
- DeepSeek RL Fine-tuning
- Agentic RL
The future?
- better simulations
- better reward functions
- abstraction is the way to go!
- return to the ultimate nature of high-dimensional data
- avoid imposing excessive priors or overly complex structures
Intuitively, reinforcement learning matches what people consider the "right" way to learn, and each step along this path has brought a leap in generalization:
- Starting from the most basic algorithms, a model learns nothing at all; it merely follows human-predefined workflows to complete specific input-output tasks, which is simple but generalizes poorly.
- In the machine-learning era, models trained on supervised, labeled data learn the latent features hidden in the data.
- The deep-learning era pushed this even further: massive data and rapidly growing compute let deep learning take on more fundamental, more generalizable, and more complex tasks.
- In the era of Large Language Models:
- In the pre-training stage, the massive training corpus greatly improves the model's generalization.
- What is the next step? Models' self-evolution and self-learning: Reinforcement Learning.
RLVR (Reinforcement Learning From Verifiable Rewards)
Training language models without supervised, human-labeled data.
Core challenge: currently, many problems are not verifiable by their outcome alone.
We generate, then verify. In earlier settings, the generation process was much easier than the verification process (but is that actually true?); a minimal verifiable-reward sketch follows this list:
- RL based on debate
- RL based on self-critique
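To make the verifiable-reward idea concrete, here is a minimal sketch of an outcome-level checker for a math-style task: the reward is 1 only when the model's final answer matches the reference. The `Answer:` output format and the function names are illustrative assumptions, not any particular framework's API.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer from a completion ending with 'Answer: <value>'.
    The 'Answer:' convention is an assumed output format for this sketch."""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted answer matches the reference,
    0.0 otherwise. No human labeling of the reasoning itself is required."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# The verifier only inspects the outcome, not the reasoning steps.
print(verifiable_reward("Think step by step... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Some flawed reasoning... Answer: 41", "42"))  # 0.0
```

Checkers like this exist for math and code, but as noted above, many tasks have no such outcome-level verifier, which is exactly what debate- and self-critique-based rewards try to address.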
Reward Functions
The design of reward functions (a preference-based reward-model sketch follows this list):
- RL with human feedback
- But we need higher-quality data!
- alignment-style rewards.
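As a sketch of the "RL with human feedback" item above: a learned reward model is typically trained on human preference pairs with a Bradley-Terry style loss. The `RewardModel` class, the pooled-embedding input, and the shapes below are illustrative assumptions; in practice the scoring head sits on top of a pretrained transformer.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled response embedding to a scalar score."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the chosen response's score above the rejected one's."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustration with random embeddings standing in for encoded responses.
model = RewardModel()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()  # gradients flow into the reward head
```

The quality of this reward model is where the "we need higher-quality data" concern bites: noisy or biased preference pairs get baked directly into the learned reward.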
Data
- Sub-task decomposition for problems that cannot be solved directly
- Human in the loop:
- Human judgment and expert annotation of labeled data are expensive.
- LLM judges: need alignment and carry their own biases (see the judge sketch after this list).
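A minimal sketch of the LLM-judge idea, with one known bias (position bias) mitigated by asking twice with the answer order swapped. The `call_llm` callable and the prompt wording are placeholders, not a real API; the rubric is illustrative only.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one character: A or B."""

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_llm: Callable[[str], str]) -> str:
    """Ask the judge twice with the answers swapped to reduce position bias;
    if the two votes disagree, report a tie instead of a biased verdict."""
    first = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    second = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    # Map the swapped verdict back to the original labels.
    second_mapped = "A" if second == "B" else "B"
    return first if first == second_mapped else "tie"
```

Even with this kind of debiasing, the judge still has to be aligned with the intended grading criteria, which is the point raised above.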
Exploration & Exploitation Trade-off
Exploration
Most of the exploration happens during pre-training.
For post-training: think outside the token-level box (take a higher-level view).
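For reference, the standard token-level exploration knob in RL post-training is an entropy bonus added to the objective; the point above is that this stays inside the token-level box. The sketch below uses toy logits, and the coefficient `beta` is an illustrative value, not one from the post.

```python
import torch

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the next-token distribution.
    Rewarding entropy discourages the policy from collapsing onto a few
    high-probability tokens, i.e. it preserves token-level exploration."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

# Toy illustration: batch of 2 sequences, 5 positions, vocabulary of 10 tokens.
logits = torch.randn(2, 5, 10)
beta = 0.01  # illustrative coefficient; tuned per setup
policy_loss = torch.tensor(1.23)  # stands in for the usual policy-gradient loss
total_loss = policy_loss - beta * entropy_bonus(logits)  # higher entropy lowers the loss
```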
OaK Structure
Agents learning from experience and sub-task planning.
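One concrete reading of "sub-task planning" is the options framework from hierarchical RL (discussed next): the agent plans over named sub-tasks, each with its own policy and termination condition. The field names and the toy environment below are illustrative assumptions, not the OaK specification.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """An option: a named sub-task with its own policy, an initiation
    condition, and a termination condition."""
    name: str
    can_start: Callable[[dict], bool]  # is this sub-task applicable in the current state?
    policy: Callable[[dict], str]      # sub-policy: pick a low-level action
    is_done: Callable[[dict], bool]    # has the sub-task finished?

def run_option(option: Option, state: dict,
               step: Callable[[dict, str], dict], max_steps: int = 10) -> dict:
    """Execute one option to completion: the agent plans over sub-tasks,
    while each option handles its own low-level actions."""
    if not option.can_start(state):
        return state
    for _ in range(max_steps):
        if option.is_done(state):
            break
        state = step(state, option.policy(state))
    return state

# Toy usage: a sub-task that increments a counter until it reaches a target.
env_step = lambda s, a: {**s, "x": s["x"] + 1} if a == "inc" else s
reach_five = Option(
    name="reach_five",
    can_start=lambda s: s["x"] < 5,
    policy=lambda s: "inc",
    is_done=lambda s: s["x"] >= 5,
)
print(run_option(reach_five, {"x": 0}, env_step))  # {'x': 5}
```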
Hierarchical RL (Hierarchical Reinforcement Learning)
Most reward signals live at the low token level, which leads to inefficiency.
Think outside the box, at higher levels! (Humans actually think in higher-level abstractions.) This motivates process-based rewards (see the sketch after this list):
- Add checkpoints.
- Designing process-based rewards is difficult.
- The intermediate reward can actually be independent of the final task reward.
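A minimal sketch of a process-based reward that scores intermediate checkpoints separately from the final outcome, illustrating the last bullet: a trajectory can earn step credit even when the final answer is wrong. The `step_checker` callable and the weights are illustrative assumptions.

```python
from typing import Callable

def process_based_reward(
    reasoning_steps: list[str],
    final_answer_correct: bool,
    step_checker: Callable[[str], bool],
    step_weight: float = 0.1,
    outcome_weight: float = 1.0,
) -> float:
    """Combine per-step checkpoint rewards with the final outcome reward.
    The step rewards are computed independently of the outcome."""
    step_reward = sum(step_weight for step in reasoning_steps if step_checker(step))
    outcome_reward = outcome_weight if final_answer_correct else 0.0
    return step_reward + outcome_reward

# Toy checker: a 'valid' step is simply a non-empty one here;
# a real process reward model would score each step's correctness.
steps = ["Define variables", "Set up the equation", ""]
print(process_based_reward(steps, final_answer_correct=False,
                           step_checker=lambda s: bool(s.strip())))
# 0.2: partial credit for two valid intermediate steps despite a wrong final answer.
```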