Will RL Be the Future?


Original blog script: Will RL be the AGI?

Current RL

RL has made giant leaps so far!

  • AlphaGo
  • DeepSeek RL Fine-tuning
  • Agentic RL

The future?

  • better simulations
  • better reward functions
  • the abstraction is the way to go!
    • Return to the fundamental nature of high-dimensional data.
    • Do not impose excessive priors or overly complex structures.

Intuitively, reinforcement learning matches what people regard as the "correct" learning path, and that path has come with dramatic leaps in generalization:

  • Starting from the most basic algorithms, the model does not learn anything; it simply follows a human-predefined workflow to complete specific input-output tasks. Simple, and poorly generalizing.
  • In the machine learning era, models trained on supervised, labeled data begin to learn the latent features hidden in the data.
  • The deep learning era pushes this further: massive data and rapidly growing compute let deep learning capture more fundamental patterns and generalize to more complex tasks.
  • In the era of Large Language Models:
    • In the pre-training stage, the massive training corpus greatly improves the model's generalization.
    • What is the next step? Models' self-evolution and self-learning: Reinforcement Learning.

RLVR (Reinforcement Learning with Verifiable Rewards)

Training Language Models without supervised human-labeled data.

Core challenge: many problems are currently not verifiable from their outcome alone.

We generate, then verify. In the settings discussed so far, generation is much easier than verification. (But is that actually correct?) Two directions, with a minimal verifiable-reward sketch after the list:

  • RL based on debate
  • RL based on self-critique
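
To make "verifiable" concrete: below is a minimal sketch of an outcome-only reward for math-style answers. The `#### <answer>` convention and the numeric tolerance are assumptions I made for illustration, not a reference implementation.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Assumed convention: the model ends its answer with '#### <value>'."""
    match = re.search(r"####\s*(-?[\d.,]+)", completion)
    return match.group(1).replace(",", "") if match else None

def verifiable_reward(completion: str, gold_answer: str, tol: float = 1e-6) -> float:
    """Outcome-only reward: 1.0 if the final answer matches the gold label, else 0.0."""
    pred = extract_final_answer(completion)
    if pred is None:
        return 0.0
    try:
        return float(abs(float(pred) - float(gold_answer)) < tol)
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return float(pred.strip() == gold_answer.strip())

# Usage: score a batch of sampled completions against one gold answer.
completions = ["... so the total is #### 42", "... therefore the answer is #### 41"]
print([verifiable_reward(c, "42") for c in completions])  # [1.0, 0.0]
```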

Reward Functions

The design of reward functions (a pairwise preference-loss sketch follows the list):

  • RL with human feedback
    • But we need higher-quality data!
  • alignment-style rewards.
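
As a reference point for RLHF-style rewards, here is a minimal sketch (assuming PyTorch) of the Bradley-Terry pairwise loss commonly used to train a reward model from human preference pairs. The tiny `RewardModel` and its feature inputs are placeholders, not a real LLM backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for an LLM backbone with a scalar reward head."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(features)).squeeze(-1)  # (batch,) scalar rewards

def pairwise_preference_loss(model, chosen, rejected):
    """Bradley-Terry loss: push r(chosen) above r(rejected)."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Usage with random stand-in features for (chosen, rejected) response pairs.
model = RewardModel()
chosen = torch.randn(8, 128)    # features of the human-preferred responses
rejected = torch.randn(8, 128)  # features of the dispreferred responses
loss = pairwise_preference_loss(model, chosen, rejected)
loss.backward()
```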

Data

  • Sub-task decomposition for otherwise unsolvable problems.
  • Human in the loop:
    • Human judgement and expert annotation make labeled data expensive.
    • LLM judges: they need to be aligned with human preferences and carry their own biases (see the sketch after this list).
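
One way to picture the LLM-judge problem: the sketch below scores answers against a rubric, then shifts the scores by the judge-vs-human offset measured on a few human-labeled anchor items as a crude bias correction. `call_judge_llm` is a hypothetical stub, not any real API.

```python
import json
import statistics

RUBRIC = (
    "Rate the answer from 1 to 5 for factual correctness. "
    'Reply with JSON only: {"score": <int>, "reason": "<short reason>"}'
)

def call_judge_llm(prompt: str) -> str:
    """Hypothetical stub for a judge-LLM call; swap in a real client here."""
    raise NotImplementedError

def judge_score(question: str, answer: str) -> int:
    reply = call_judge_llm(f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}")
    return int(json.loads(reply)["score"])

def calibrated_scores(items, anchors):
    """Debias the judge with a few human-labeled anchor items:
    shift every judge score by the mean (human - judge) offset on the anchors."""
    offset = statistics.mean(human - judge_score(q, a) for q, a, human in anchors)
    return [judge_score(q, a) + offset for q, a in items]
```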

Exploration & Exploitation Trade-off

Exploration

The main exploration is done in the pre-training process.

For post-training: think outside the token-level box and take a higher-level view.
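
For the trade-off itself, the textbook illustration is an epsilon-greedy bandit. The sketch below is a toy (Gaussian arms I picked arbitrarily) with nothing LLM-specific in it; it only shows the explore-vs-exploit dial the heading refers to.

```python
import random

def epsilon_greedy_bandit(true_means, steps=2000, eps=0.1, seed=0):
    """epsilon-greedy on a Gaussian bandit: explore with prob eps, else exploit."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    values = [0.0] * len(true_means)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:                        # explore: random arm
            arm = rng.randrange(len(true_means))
        else:                                         # exploit: best estimate so far
            arm = max(range(len(true_means)), key=lambda a: values[a])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return values, total / steps

estimates, avg_reward = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(estimates, avg_reward)
```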

OaK Architecture

Agents that learn from experience and plan over sub-tasks.
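
The sub-task abstraction this points at is usually formalized as an option: an initiation set, an intra-option policy, and a termination condition. The grid-world `go_to_door` option below is my own toy example, not something taken from OaK itself.

```python
from dataclasses import dataclass
from typing import Callable

State = tuple[int, int]  # toy grid-world state: (x, y)

@dataclass
class Option:
    """An option = (initiation set, intra-option policy, termination condition)."""
    can_start: Callable[[State], bool]
    policy: Callable[[State], str]       # maps a state to a primitive action
    should_stop: Callable[[State], bool]

# Toy option: walk right until reaching the door at x == 5.
go_to_door = Option(
    can_start=lambda s: s[0] < 5,
    policy=lambda s: "right",
    should_stop=lambda s: s[0] >= 5,
)

def run_option(option: Option, state: State) -> State:
    """Execute the option's policy until its termination condition fires."""
    assert option.can_start(state)
    while not option.should_stop(state):
        if option.policy(state) == "right":
            state = (state[0] + 1, state[1])
    return state

print(run_option(go_to_door, (0, 0)))  # (5, 0)
```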

Hierarchical RL

Most reward signals live at the low token level, which leads to inefficiency.

Think outside the box, at higher levels! (Humans actually reason at higher levels of abstraction.) This motivates process-based rewards.

  • Add checkpoints along the process (see the sketch after this list).
  • Designing process-based rewards is difficult.
  • The intermediate reward can actually be independent of the final task reward.
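
To make "add checkpoints" concrete, here is a minimal sketch that mixes per-step checkpoint scores with the final outcome reward. The checkpoint verifiers and the mixing weight `beta` are assumptions for illustration, and the docstring flags exactly the design difficulty listed above.

```python
from typing import Callable, Sequence

# A checkpoint verifier scores one intermediate step in [0, 1].
Checkpoint = Callable[[str], float]

def process_based_reward(
    steps: Sequence[str],
    checkpoints: Sequence[Checkpoint],
    outcome_reward: float,
    beta: float = 0.5,
) -> float:
    """Mix per-step checkpoint scores with the final outcome reward.
    Note: the intermediate rewards may be independent of (or even at odds
    with) the final task reward, which is exactly the difficulty above."""
    step_scores = [check(step) for step, check in zip(steps, checkpoints)]
    process_reward = sum(step_scores) / max(len(step_scores), 1)
    return beta * process_reward + (1 - beta) * outcome_reward

# Toy checkpoints for a two-step solution of x + 3 = 7.
steps = ["set up the equation: x + 3 = 7", "solve: x = 4"]
checkpoints = [
    lambda s: 1.0 if "x + 3 = 7" in s else 0.0,  # was the equation set up?
    lambda s: 1.0 if "x = 4" in s else 0.0,      # was x isolated correctly?
]
print(process_based_reward(steps, checkpoints, outcome_reward=1.0))  # 1.0
```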
