LLML CS224n Lecture 10 Post-Training
Introduction
About generalization and emergent abilities of large language models…
What have large language models been learning during the pre-training process?
- Syntax
- Knowledge
- More: Models of agents, beliefs and actions as well.
These are sometimes called world models, the sense in which such AI may generalize toward general intelligence.
In-Context Learning
With these emergent abilities, LLMs can perform new tasks from only a handful of in-context examples, without any gradient updates (few-shot learning).
Thus, for SOTA large language models, all we need to provide is an effective instruction, so that the task in our mind aligns with the model's comprehension of the task.
Large Language Models are few-shot learners: In-context Learning.
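As a quick illustration (the prompt and the `generate` wrapper below are hypothetical, not from the lecture), a few-shot in-context prompt might look like this:

```python
# A made-up few-shot prompt: the "training examples" live entirely in the
# context window, and the model's weights are never updated.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: apple
French:"""

# With some hypothetical wrapper around an LLM API, the model is expected to
# continue the pattern, e.g.:
# completion = generate(few_shot_prompt, max_tokens=5)   # -> "pomme"
```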
Paradigms for Parameter Fine-Tuning
- SFT
- RFT
Supervised Fine-Tuning (SFT)
SFT is fundamentally a supervised learning task. Its core mathematical principle is maximum likelihood estimation.
The model’s objective is to maximize the likelihood of generating the “correct” responses provided in a high-quality, curated dataset of instruction-response pairs. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the instruction and $y_i$ is the desired response, the model’s parameters $\theta$ are updated to maximize the following objective:
$$ \mathcal{L}_{\text{SFT}}(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) = \sum_{i=1}^{N} \sum_{t=1}^{|y_i|} \log p_\theta\big(y_{i,t} \mid x_i,\, y_{i,<t}\big) $$
The training process uses standard gradient descent (or its variants like Adam) to find the parameters that best fit the supervised data. The gradient is calculated with respect to the negative log-likelihood, which is a form of cross-entropy loss. This loss function measures how well the model’s predicted probability distribution aligns with the one-hot distribution of the correct response.
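As a minimal sketch of this objective (assuming a Hugging Face-style causal LM and tokenizer; names and details are illustrative, not the lecture's code), the per-example SFT loss is just a masked token-level cross-entropy:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, instruction: str, response: str) -> torch.Tensor:
    """Negative log-likelihood of the response tokens given the instruction.

    `model` is assumed to return logits of shape (batch, seq_len, vocab);
    `tokenizer` is assumed to be a Hugging Face-style tokenizer.
    """
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)

    # Standard causal-LM shift: position t predicts token t+1.
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:].clone()

    # Mask out the prompt so only response tokens contribute to the loss.
    targets[:, : prompt_ids.size(1) - 1] = -100

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```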
Reinforcement Fine-Tuning (RFT)
RFT is based on reinforcement learning (RL), which introduces a new element: a reward signal. Its core mathematical principle is reward maximization. RFT is teaching the model to learn how to behave by providing a reward signal, allowing it to explore and discover better ways to align with human values beyond what is explicitly shown in a fixed dataset.
- The model’s objective is to learn a policy $\pi_\theta$ (its generation strategy) that maximizes the expected cumulative reward.
- The core objective is to find a policy that maximizes the expected reward. The base objective is:
$$ J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big] $$
- Reward function $R(x, y)$: the key difference from SFT is the introduction of a reward function, which acts as a “judge” that scores the model’s output. This function is typically a Reward Model (RM) trained on human preference data.
- Optimization: unlike SFT, RFT uses an RL algorithm (like PPO, or RL-free preference methods such as DPO) to optimize the model. These algorithms don’t just mimic a target answer; they use gradients to update the model in a direction that leads to higher reward scores.
A crucial aspect of RFT is a regularization term to prevent the model from drifting too far from its pre-trained knowledge. The final objective often includes a KL divergence penalty, as in the RLHF objective:
$$ J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\text{ref}}(y \mid x)\big] $$
This term ensures that the model learns human preferences while staying grounded in the knowledge it gained during pre-training.
Instruction Finetuning
Base large language models are trained purely for text generation; by design, they are not user assistants. Language models are not aligned with user intent!
Instruction Finetuning: Collect examples of (instruction, output) pairs across many tasks and finetune an LM, then evaluate on unseen tasks.
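For concreteness, an instruction-tuning dataset is essentially a list of such pairs; the fields and examples below are made up for illustration (real collections such as FLAN-style mixtures span thousands of tasks):

```python
# Illustrative (instruction, output) pairs for instruction finetuning.
instruction_data = [
    {
        "instruction": "Classify the sentiment of this review: 'The movie was a waste of time.'",
        "output": "Negative",
    },
    {
        "instruction": "Translate to German: 'Good morning.'",
        "output": "Guten Morgen.",
    },
]
```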
High Level Lessons
- Synthetic data can work.
- You may not need to sample many instructions to tune effectively.
- "Less Is More for Alignment" (LIMA): a small set of carefully curated examples can be enough.
- Data quantity, data quality, and data diversity all matter.
- Crowdsourcing (like ImageNet) is a common way to collect instruction data.
Problems
- Lack of high-quality data
- What about open-ended tasks, where there is no single correct answer?
- All token-level mistakes are penalized equally, which makes no sense!
- Humans generate sub-optimal answers.
Thus, we will attempt to satisfy human preferences directly!
RLHF (Reinforcement Learning from Human Feedback)
Optimizing for human preferences
Begin with maths and reinforcement learning…
For a concrete scenario such as summary generation, we expect the LLM to produce an output $s$ for a given input $x$, and we assume a fair grading system that returns a scalar $R(s)$, which plays the role of the human reward in reinforcement learning.
Now we want to maximize the expected reward of samples from our large language model:
$$ \mathbb{E}_{\hat{s} \sim p_\theta(s)}\big[R(\hat{s})\big] $$
- $p_\theta$ is the model policy (how the LLM generates, viewed through the lens of reinforcement learning).
- $\theta$: the model parameters.
- $p_\theta(s \mid x)$ is the probability that the model with parameters $\theta$, given the prompt $x$, generates the target output $s$.
Make it more specific:
$$ \max_\theta\; \mathbb{E}_{\hat{s} \sim p_\theta(s \mid x)}\big[R(\hat{s})\big] = \max_\theta \sum_{s} p_\theta(s \mid x)\, R(s) $$
RL optimization
Naive gradient-based optimization will not work here, because the reward function $R$ may be non-differentiable (human judgments, for instance, cannot be backpropagated through).
We can use reinforcement learning algorithms instead: policy-gradient algorithms.
- We want to obtain
$$ \nabla_\theta\, \mathbb{E}_{\hat{s} \sim p_\theta(s)}\big[R(\hat{s})\big] = \nabla_\theta \sum_{s} R(s)\, p_\theta(s) = \sum_{s} R(s)\, \nabla_\theta p_\theta(s) $$
- Here we’ll use a very handy trick known as the log-derivative trick. Let’s try taking the gradient of $\log p_\theta(s)$:
$$ \nabla_\theta \log p_\theta(s) = \frac{\nabla_\theta p_\theta(s)}{p_\theta(s)} \quad\Longrightarrow\quad \nabla_\theta p_\theta(s) = p_\theta(s)\, \nabla_\theta \log p_\theta(s) $$
- Plug back in:
$$ \sum_{s} R(s)\, \nabla_\theta p_\theta(s) = \sum_{s} p_\theta(s)\, R(s)\, \nabla_\theta \log p_\theta(s) = \mathbb{E}_{\hat{s} \sim p_\theta(s)}\big[R(\hat{s})\, \nabla_\theta \log p_\theta(\hat{s})\big] \approx \frac{1}{m} \sum_{i=1}^{m} R(s_i)\, \nabla_\theta \log p_\theta(s_i) $$
A Brief Introduction to the Policy Gradient
The goal of this derivation is to compute the gradient with respect to the policy parameters $\theta$ so as to maximize the expected return $\mathbb{E}_{\hat{s} \sim p_\theta(s)}[R(\hat{s})]$.
- $\nabla_\theta$ denotes taking the gradient with respect to the parameters $\theta$.
- $\mathbb{E}_{\hat{s} \sim p_\theta(s)}[R(\hat{s})]$ is the expected return when the agent follows the policy $p_\theta$.
- $R(s)$ is the total return obtained for the outcome $s$.
- $p_\theta(s)$ is the policy: it gives the probability, under parameters $\theta$, that the agent's actions ultimately produce the outcome $s$.
- The first equality, $\nabla_\theta \sum_{s} R(s)\, p_\theta(s) = \sum_{s} R(s)\, \nabla_\theta p_\theta(s)$, simply swaps the order of differentiation and summation.
The derivation then introduces the log-derivative trick. Taking the gradient of the policy $p_\theta(s)$ directly is usually hard, but the gradient of its logarithm, $\nabla_\theta \log p_\theta(s)$, is often much easier to work with.
- Substituting back into the first equation converts the gradient of the policy, $\nabla_\theta p_\theta(s)$, into the policy probability $p_\theta(s)$ times the gradient of the log-policy, $\nabla_\theta \log p_\theta(s)$.
- The final form, $\sum_{s} p_\theta(s)\, R(s)\, \nabla_\theta \log p_\theta(s)$, can be read as an expectation that we estimate by sampling: $\mathbb{E}_{\hat{s} \sim p_\theta(s)}[R(\hat{s})\, \nabla_\theta \log p_\theta(\hat{s})]$. This means we do not need to enumerate every possible outcome; sampling a few trajectories from the current policy is enough to approximate the gradient. This is the core idea behind policy-gradient algorithms such as REINFORCE, sketched below.
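A minimal REINFORCE-style sketch of this sampled gradient estimate (the `policy.sample` and `reward_fn` interfaces are hypothetical placeholders, not from the lecture):

```python
def reinforce_step(policy, reward_fn, prompts, optimizer, num_samples=4):
    """One REINFORCE-style update: sample outputs, weight their log-probs by
    reward, and ascend the resulting surrogate objective.

    `policy.sample(prompt)` is assumed to return (sequence, log_prob), where
    log_prob is the differentiable sum of token log-probabilities;
    `reward_fn(prompt, sequence)` returns a scalar reward. Both are
    placeholders for whatever sampling/reward code you actually have.
    """
    surrogate = 0.0
    for prompt in prompts:
        for _ in range(num_samples):
            sequence, log_prob = policy.sample(prompt)
            reward = reward_fn(prompt, sequence)        # no gradient needed here
            surrogate = surrogate + reward * log_prob   # R(s) * log p_theta(s)

    # Maximize E[R(s) * log p_theta(s)]  <=>  minimize its negative.
    loss = -surrogate / (len(prompts) * num_samples)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```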
RLHF Pipeline
- Step 1: Instruction finetuning with supervised learning and labeled data.
- Step 2: Collect comparison data and train a reward model.
  - Human labelers rank the model's outputs, and these comparisons become the training data for the reward model.
- Step 3: Optimize a policy against the reward model using reinforcement learning.
Modeling Human preference
- We can use a learned classifier (trained on human-labeled data) as a reward model, because asking humans to grade every sample $s$ is far too expensive.
- Instead of asking for direct ratings, ask for pairwise comparisons.
  - This helps reduce the bias and noise in the human grading process.
Hence, for the reward model $RM_\phi(s)$, the training objective over sampled pairs is:
$$ \mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(s^w,\, s^l) \sim D}\Big[\log \sigma\big(RM_\phi(s^w) - RM_\phi(s^l)\big)\Big] $$
- $s^w$ denotes the winning (preferred) sample, while $s^l$ denotes the losing sample.
- We require the RM to assign a higher score to $s^w$, i.e. $RM_\phi(s^w) > RM_\phi(s^l)$.
- We use the sigmoid function $\sigma$ to map the score difference into a probability between 0 and 1, and then apply the negative log-likelihood loss.
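A minimal sketch of this pairwise loss (assuming `rm` is any module mapping token ids to one scalar score per sequence; this is illustrative, not the lecture's code):

```python
import torch.nn.functional as F

def reward_model_loss(rm, winner_ids, loser_ids):
    """Pairwise preference loss: -log sigma(RM(s^w) - RM(s^l)).

    `rm` is assumed to map token-id tensors to scores of shape (batch,);
    `winner_ids` / `loser_ids` hold the preferred and dispreferred responses.
    """
    score_w = rm(winner_ids)   # RM_phi(s^w)
    score_l = rm(loser_ids)    # RM_phi(s^l)
    # logsigmoid is numerically stabler than log(sigmoid(...)).
    return -F.logsigmoid(score_w - score_l).mean()
```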
Optimizing the parameters
Directly using $RM_\phi(s)$ as the reward function can introduce bias, because the reward model is itself a trained, imperfect approximation of human preferences, and gradient-based optimizers are highly sensitive to such errors in the reward signal.
Thus, we need to augment the design of the optimization target.
We introduce a penalty term (a KL divergence penalty) to prevent the policy model from cheating when it finds bugs in the reward model (we also need the reward model itself to be robust enough). The reward being optimized becomes:
$$ R(s) = RM_\phi(s) - \beta \log \frac{p_\theta^{RL}(s)}{p^{PT}(s)} $$
This KL divergence penalty ensures that the output of the RL-finetuned model does not drift too far from the pre-trained model. Note what post-training (especially RLHF) is for: we are not asking the model to learn much new, deep knowledge at this stage (that happens during the enormous pre-training process), but rather to align with norms and with the intended semantics. Therefore, since the pre-trained model already has considerable generalization ability, post-training finetuning must not sacrifice that generalization for the sake of alignment.
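As a rough sketch (not the lecture's code), the shaped reward that the RL step optimizes could be computed like this, assuming we already have the reward-model score and sequence-level log-probabilities under both the policy and the frozen pre-trained model:

```python
def kl_penalized_reward(rm_score, logp_policy, logp_pretrained, beta=0.1):
    """R(s) = RM_phi(s) - beta * log( p_theta^RL(s) / p^PT(s) ).

    All arguments are per-sample tensors: `rm_score` is the reward-model score,
    `logp_policy` and `logp_pretrained` are sequence log-probabilities under the
    RL policy and the frozen pre-trained model. `beta` trades reward against
    staying close to the pre-trained distribution.
    """
    kl_term = logp_policy - logp_pretrained   # log ratio, per sample
    return rm_score - beta * kl_term
```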
Direct Preference Optimization (DPO)
Limitations of RL in Fine-Tuning
Human preferences are unreliable!
- "Reward hacking" is a common problem in RL.
- Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth.
- This can result in making up facts and hallucinations.
Thus, the bottleneck for current RLHF is the design and training of the reward model.
Removing the RL from RLHF
- Recall that we want to maximize the following objective in RLHF (writing the policy as $\pi_\theta$, the pre-trained reference model as $\pi_{\text{ref}}$, and the reward model as $r_\phi$):
$$ \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\text{ref}}(y \mid x)\big] $$
There is a closed-form solution to this:
$$ \pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\exp\!\Big(\tfrac{1}{\beta}\, r_\phi(x, y)\Big) $$
$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\exp\!\big(\tfrac{1}{\beta}\, r_\phi(x, y)\big)$ is a normalization constant (the partition function).
- Rearrange this via a log transformation:
$$ r_\phi(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) $$
- This relationship holds for an arbitrary LM in place of $\pi^*$, thus we can parameterize the reward through the policy itself:
$$ r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) $$
What does that tell us? The ideal reward model can be recovered directly from the optimized policy $\pi^*$, so the language model can serve as an implicit reward model.
Algorithm
Hence, we can express the reward model directly in terms of the policy:
$$ r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) $$
We can ignore the $\beta \log Z(x)$ term: it depends only on $x$, so it cancels when we take the difference of rewards for two responses to the same prompt.
Plugging this implicit reward into the pairwise reward-model loss defines Direct Preference Optimization:
$$ \mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x,\, y^w,\, y^l) \sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^w \mid x)}{\pi_{\text{ref}}(y^w \mid x)} - \beta \log \frac{\pi_\theta(y^l \mid x)}{\pi_{\text{ref}}(y^l \mid x)}\right)\right] $$
Previously: RLHF relied heavily on a separately trained reward model (the bottleneck).
Now: DPO turns the preference objective into a single loss function, learning the policy $\pi_\theta$ directly from the preference pairs $(y^w, y^l)$.
We use the preference data directly (learn directly!).
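As a rough sketch (not from the lecture), the DPO loss above can be written directly over batched sequence log-probabilities; the argument names are placeholders for whatever log-prob computation you use:

```python
import torch.nn.functional as F

def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss: -log sigma( beta * [log-ratio(winner) - log-ratio(loser)] ).

    Each argument is the log-probability of the winning (w) or losing (l)
    response under the current policy or the frozen reference model, summed
    over the response tokens; shapes are (batch,).
    """
    ratio_w = logp_policy_w - logp_ref_w   # beta * this = implicit reward of y^w
    ratio_l = logp_policy_l - logp_ref_l   # beta * this = implicit reward of y^l
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```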