
Community News
Training Language Models to Self-Correct via Reinforcement Learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision.

To address these limitations, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses, or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process toward a self-correction strategy that is effective at test time, rather than simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse, and then using a reward bonus to amplify self-correction during training.

When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Read in full here:
This thread was posted by one of our members via one of our news source trackers.
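For readers skimming before clicking through: the abstract's central training signal is a reward bonus that rewards genuine improvement between the model's first and second attempt. Below is a minimal sketch of that reward-shaping idea in Python; the function name, the bonus coefficient alpha, and the toy values are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a SCoRe-style two-attempt reward bonus, as described in
# the abstract above. The coefficient alpha and the simple additive form are
# assumptions for illustration only.

def shaped_reward(r_first: float, r_second: float, alpha: float = 1.0) -> float:
    """Reward for a two-turn correction trace.

    The second attempt's correctness is rewarded directly, plus a bonus
    proportional to the improvement over the first attempt, nudging the
    policy toward actually correcting itself rather than repeating turn one.
    """
    return r_second + alpha * (r_second - r_first)


# Toy usage: a trace that fixes a wrong first attempt earns more than one
# that was already correct and merely restates itself.
print(shaped_reward(r_first=0.0, r_second=1.0))  # 2.0: corrected an error
print(shaped_reward(r_first=1.0, r_second=1.0))  # 1.0: nothing to correct
print(shaped_reward(r_first=1.0, r_second=0.0))  # -1.0: degraded the answer
```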