Self-Evolved Reward Learning aligns LLMs with less human feedback

self-repairing robot
Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Researchers from Fudan University, Peking University, and Microsoft have introduced Self-Evolved Reward Learning (SER), a technique that reduces the need for human-labeled data during alignment training of large language models (LLMs).

Alignment is a critical stage in creating LLMs: it teaches models to follow complex instructions and reflect human preferences. The classic way to do this is Reinforcement Learning from Human Feedback (RLHF). RLHF involves training a reward model (RM) that learns to distinguish between good and bad responses generated by an LLM. The RM is then used to guide the LLM’s training through reinforcement learning, encouraging it to generate outputs that align with human preferences.
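To make the reward-model step concrete, the sketch below shows the pairwise (Bradley-Terry style) loss that reward models are commonly trained with: the RM should assign a higher score to the preferred response than to the rejected one. This is a generic illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the preferred response
    above the reward of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scalar rewards the RM assigned to preferred vs. rejected responses.
r_chosen = torch.tensor([1.2, 0.3, 0.9])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_reward_loss(r_chosen, r_rejected).item())
```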

Classic RLHF approaches require large amounts of high-quality, human-labeled preference data, which is a major bottleneck for improving the performance of LLMs.

Reinforcement Learning from AI Feedback (RLAIF) is a set of more recent techniques that reduce dependence on human-labeled data by using AI models to train the reward model. However, these methods assume the model can already deliver high-quality and varied judgments, or they require an even stronger LLM to provide effective feedback.

Recent studies show that, given the sheer scale of their training data, LLMs can serve as world models that reason about actions and events. Self-Evolved Reward Learning uses this ability to generate the LLM’s own reward-training data: the model evaluates and provides feedback on its own generated responses, effectively acting as its own reward model.
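As a rough illustration of the self-evaluation idea, a model can be prompted to judge its own output. The `llm_complete` callable and the prompt wording below are hypothetical stand-ins for illustration, not the paper’s actual setup.

```python
# Hypothetical sketch of an LLM judging its own output. `llm_complete` stands
# in for whatever text-generation call is available.
JUDGE_TEMPLATE = """You are evaluating a response to a user request.

Request:
{prompt}

Response:
{response}

Is this response helpful and correct? Answer with a single word: GOOD or BAD."""

def self_evaluate(llm_complete, prompt: str, response: str) -> bool:
    # Ask the model to grade the response, then parse its one-word verdict.
    verdict = llm_complete(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    return verdict.strip().upper().startswith("GOOD")
```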

In SER, the process starts by training an initial RM with a small amount of human-annotated data. This “seed RM” provides a basic understanding of what constitutes good and bad responses. The RM is then used to generate feedback on a larger unlabeled dataset. These self-labeled examples are then used to retrain and improve the RM. This iterative “feedback-then-train” loop allows the RM to self-evolve, gradually refining its ability to distinguish between high-quality and low-quality responses. 
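A minimal sketch of that feedback-then-train loop might look like the following. The helper callables (`train_rm`, `label_with_rm`, `filter_confident`) are placeholders for steps the paper describes, not the authors’ actual code.

```python
def self_evolved_rm_loop(seed_data, unlabeled_data, train_rm, label_with_rm,
                         filter_confident, num_stages=3):
    """Sketch of SER's iterative "feedback-then-train" loop.

    seed_data: small human-annotated preference set (list of labeled pairs)
    unlabeled_data: larger pool of unlabeled prompt/response data
    train_rm(data) -> reward model trained on the given preference pairs
    label_with_rm(rm, unlabeled) -> preference pairs labeled by the RM itself
    filter_confident(labeled, stage) -> high-confidence subset of self-labels
    """
    rm = train_rm(seed_data)                               # seed RM from human labels
    for stage in range(num_stages):
        self_labeled = label_with_rm(rm, unlabeled_data)   # RM labels new data
        confident = filter_confident(self_labeled, stage)  # keep reliable labels only
        rm = train_rm(seed_data + confident)               # retrain the RM on the expanded set
    return rm
```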

Self-evolved reward modeling (source: arXiv)

However, the process is not straightforward and can result in diminishing returns or degraded performance as the iterations continue. To overcome this limitation, the researchers introduced data-filtering techniques that adapt across the stages of RM training. At each stage, they analyze the learning status of the RM and identify high-confidence self-labeled examples. This filtered data is then used for more efficient and robust RM training.
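One simple way to implement such filtering (an assumption for illustration, not the paper’s exact criterion) is to keep only the pairs where the RM’s implied preference probability is far from 0.5:

```python
import torch

def filter_by_confidence(rewards_a: torch.Tensor, rewards_b: torch.Tensor,
                         threshold: float = 0.9):
    """Keep only pairs where the RM is confident about which response is better.

    The probability that response A beats response B is taken as
    sigmoid(r_a - r_b); pairs whose probability is near 0.5 are discarded.
    The threshold value is illustrative, not from the paper.
    """
    prob_a_better = torch.sigmoid(rewards_a - rewards_b)
    return (prob_a_better > threshold) | (prob_a_better < 1 - threshold)

# Toy example: only the pairs with a clear reward gap survive.
r_a = torch.tensor([2.0, 0.1, -1.5])
r_b = torch.tensor([-1.0, 0.0, 1.4])
print(filter_by_confidence(r_a, r_b))  # tensor([ True, False,  True])
```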

Finally, the improved RM is used to guide the LLM’s training through reinforcement learning. “By employing this self-evolved reward learning process, where the RM continually learns from its own feedback, we reduce dependency on large human-labeled data while maintaining, or even improving, the model’s performance,” the researchers write.
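RLHF pipelines typically use PPO at this stage; the sketch below shows the simpler REINFORCE idea, with the reward model supplying the training signal. The `policy.generate` and `reward_model.score` interfaces are assumed for illustration and do not correspond to a specific library’s API.

```python
import torch

def reinforce_step(policy, reward_model, prompts, optimizer):
    """One REINFORCE-style update: sample responses from the policy LLM,
    score them with the reward model, and increase the log-probability of
    high-reward responses."""
    responses, log_probs = policy.generate(prompts)       # sampled responses + their log-probs
    with torch.no_grad():
        rewards = reward_model.score(prompts, responses)  # scalar reward per response
    baseline = rewards.mean()                             # simple variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()     # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```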

The researchers conducted extensive experiments to evaluate SER, using various LLMs, model sizes, and datasets. They found that with SER, they only need 15% of the human-annotated seed data to create a reward model that is comparable to models trained with full human-labeled datasets.

On average, SER improved model performance by 7.88% compared to seed models trained on the small human-labeled dataset. In some cases, it even surpassed the performance of models trained on the full dataset.

While the results are promising, there are still some areas to improve. “An avenue worth exploring is generating more diverse responses through LLMs,” the researchers write. “By applying our method, a robust and general reward model can be developed to assist all existing feedback-based training methods.”

SER offers a viable path toward reducing the dependency on large human-labeled datasets while maintaining or even improving LLM performance. It could prove to be an important technique for building more sophisticated and powerful LLMs with significantly less human intervention.
