CommunityNews

CommunityNews

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Offline Reinforcement Learning for LLM Multi-Step Reasoning.
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: bot
SPWN is a programming language that compiles to Geometry Dash levels. What that means is that you can create levels by using not only the...
New
New
First poster: dyowee
Everyone seems to be striving for ‘clean’ code at the moment. You can’t read a blog post without the author telling you how clean their a...
New
First poster: bot
How a piece of advice became a lifestyle TABLE OF CONTENTS WHERE TO BEGIN… FIRST CONTACT PICKING EMACS FOR LIFE CHEATING ON EMACS SERE...
New
First poster: bot
API Gateway Trends behind Features: Apache APISIX 3.0 vs. Kong 3.0 - API7.ai. By comparing the open-source API Gateway Apache APISIX and...
New
First poster: joeb
50 Shades of Go: Traps, Gotchas, and Common Mistakes for New Golang Devs. Go is a simple and fun language, but, like any other language,...
/go
New
New
First poster: dyowee
olmOCR is an open-source tool for converting PDFs to text with high accuracy, preserving reading order and supporting tables, equations, ...
New
First poster: alvinkatojr
There are countless articles why developers should not focus on Frameworks too much and instead learn to understand the underlying langua...
New
CommunityNews
GitSyncPad is an innovative micro keypad designed for effortless Git version control. Execute commands like git add, git commit, and git ...
New

Other popular topics Top

AstonJ
What chair do you have while working… and why? Is there a ‘best’ type of chair or working position for developers?
New
ohm
Which, if any, games do you play? On what platform? I just bought (and completed) Minecraft Dungeons for my Nintendo Switch. Other than ...
New
AstonJ
SpaceVim seems to be gaining in features and popularity and I just wondered how it compares with SpaceMacs in 2020 - anyone have any thou...
New
DevotionGeo
I know that -t flag is used along with -i flag for getting an interactive shell. But I cannot digest what the man page for docker run com...
New
Exadra37
On modern versions of macOS, you simply can’t power on your computer, launch a text editor or eBook reader, and write or read, without a ...
New
Rainer
Not sure if following fits exactly this thread, or if we should have a hobby thread… For many years I’m designing and building model air...
New
Exadra37
Oh just spent so much time on this to discover now that RancherOS is in end of life but Rancher is refusing to mark the Github repo as su...
New
PragmaticBookshelf
Build highly interactive applications without ever leaving Elixir, the way the experts do. Let LiveView take care of performance, scalabi...
New
First poster: joeb
The File System Access API with Origin Private File System. WebKit supports new API that makes it possible for web apps to create, open,...
New
CommunityNews
A Brief Review of the Minisforum V3 AMD Tablet. Update: I have created an awesome-minisforum-v3 GitHub repository to list information fo...
New