CommunityNews

CommunityNews

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Offline Reinforcement Learning for LLM Multi-Step Reasoning.
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Popular General Dev topics Top

First poster: andrea
What’s the real cost of interruptions? I illustrate all the context developers keep in their head and how it starts to decay immediately ...
New
First poster: bot
Kinesis Advantage360 Ergonomic Keyboard. Split-adjustable, contoured design that maximizes comfort and boosts productivity. Mechanical s...
New
First poster: dpritchett
It’s not what programming languages do, it’s what they shepherd you to. How many of you have listened, read or taken part in a discussio...
New
OvermindDL1
Yet another rust-made text editor, though I’m really liking the looks of how this one works!
New
CommunityNews
The First Social-Media Babies Are Growing Up—And They’re Horrified. How would you feel if millions of people watched your childhood tant...
New
First poster: dyowee
Rust: The wrong people are resigning. Rust: The wrong people are resigning. GitHub Gist: instantly share code, notes, and snippets.
New
CommunityNews
9 fintech engineering mistakes. Read this list unless you want to build a money dissappearing system
New
CommunityNews
Once you get good at Rust all of these problems will go away Rust being great at big refactorings solves a largely self-inflicted issues ...
New
First poster: DevotionGeo
To avoid being replaced by LLMs, do what they can’t. What LLM’s can’t do yet
New
CommunityNews
We’re a tiny team @deepseek-ai pushing our limits in AGI exploration. Starting this week , Feb 24, 2025 we’ll open-source 5 repos – one ...
New

Other popular topics Top

Exadra37
I am thinking in building or buy a desktop computer for programing, both professionally and on my free time, and my choice of OS is Linux...
New
AstonJ
You might be thinking we should just ask who’s not using VSCode :joy: however there are some new additions in the space that might give V...
New
PragmaticBookshelf
Rust is an exciting new programming language combining the power of C with memory safety, fearless concurrency, and productivity boosters...
New
AstonJ
Just done a fresh install of macOS Big Sur and on installing Erlang I am getting: asdf install erlang 23.1.2 Configure failed. checking ...
New
dimitarvp
Small essay with thoughts on macOS vs. Linux: I know @Exadra37 is just waiting around the corner to scream at me “I TOLD YOU SO!!!” but I...
New
DevotionGeo
The V Programming Language Simple language for building maintainable programs V is already mentioned couple of times in the forum, but I...
New
mafinar
This is going to be a long an frequently posted thread. While talking to a friend of mine who has taken data structure and algorithm cou...
New
PragmaticBookshelf
Author Spotlight James Stanier @jstanier James Stanier, author of Effective Remote Work , discusses how to rethink the office as we e...
New
New
First poster: bot
Large Language Models like ChatGPT say The Darnedest Things. The Errors They MakeWhy We Need to Document Them, and What We Have Decided ...
New