CommunityNews

CommunityNews

Training Language Models to Self-Correct via Reinforcement Learning

Training Language Models to Self-Correct via Reinforcement Learning.
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model’s own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: HenryCost
I wired my tree with 500 LED lights and calculated their 3D coordinates… If you support me on Patreon at any point in December 2020 I wi...
New
First poster: bot
Neovim nightly, v0.5.0 and v0.4.4 has been released. Link: Release Nvim development (prerelease) build · neovim/neovim · GitHub Link:...
New
First poster: bot
It has some interesting features: It’s entirely wireless (the left half speaks Bluetooth to the right half, and the right half speaks B...
New
First poster: mindriot
LG 28-inch 16:18 DualUp Monitor with Ergo Stand and USB Type-C™ (28MQ780-B) | LG USA. Shop LG 28MQ780-B on the official LG.com website ...
New
CommunityNews
Docker on MacOS is slow and how to fix it. Thanks to the DALL·E 2, we finally have a very nice graphic representation of the feelings of...
New
First poster: dyowee
GitHub - TodePond/DreamBerd: perfect programming language. perfect programming language. Contribute to TodePond/DreamBerd development by...
New
First poster: fullstackplus
Why Python is terrible… Nice language, but unsuitable for most professional purposes
New
CommunityNews
Rendering Action Mailer emails with Phlex components and layouts: Clean, Composable, and Completely Ruby - Blog post by Camillo Visini
New
CommunityNews
After six months of hard work, I’m thrilled to announce the general availability of Sidekiq 8.0! :partying_face::tada: Status Sidekiq is...
New
CommunityNews
:person_lifting_weights: Modern open-source fitness coaching platform. Create workout plans, track progress, and access a comprehensive e...
New

Other popular topics Top

Devtalk
Reading something? Working on something? Planning something? Changing jobs even!? If you’re up for sharing, please let us know what you’...
1052 22283 402
New
PragmaticBookshelf
Free and open source software is the default choice for the technologies that run our world, and it’s built and maintained by people like...
New
DevotionGeo
I know that these benchmarks might not be the exact picture of real-world scenario, but still I expect a Rust web framework performing a ...
New
Rainer
My first contact with Erlang was about 2 years ago when I used RabbitMQ, which is written in Erlang, for my job. This made me curious and...
New
Exadra37
I am asking for any distro that only has the bare-bones to be able to get a shell in the server and then just install the packages as we ...
New
AstonJ
In case anyone else is wondering why Ruby 3 doesn’t show when you do asdf list-all ruby :man_facepalming: do this first: asdf plugin-upd...
New
AstonJ
Continuing the discussion from Thinking about learning Crystal, let’s discuss - I was wondering which languages don’t GC - maybe we can c...
New
PragmaticBookshelf
Use WebRTC to build web applications that stream media and data in real time directly from one user to another, all in the browser. ...
New
PragmaticBookshelf
A concise guide to MySQL 9 database administration, covering fundamental concepts, techniques, and best practices. Neil Smyth MySQL...
New
PragmaticBookshelf
Use advanced functional programming principles, practical Domain-Driven Design techniques, and production-ready Elixir code to build scal...
New