CommunityNews

CommunityNews

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Offline Reinforcement Learning for LLM Multi-Step Reasoning.
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: bot
FUZIX FUZIX is a fusion of various elements from the assorted UZI forks and branches beaten together into some kind of semi-coherent pla...
New
First poster: dimitarvp
On Wednesday last week, Google’s Fiona Cicconi wrote to company employees. She announced that Google was bringing forward its timetable ...
New
First poster: bot
Last night I re-read this Steve Yegge article about learning to type as a programmer. I can touch type, but I don’t usually manage to bre...
New
First poster: dimitarvp
A career ending mistake — Bitfield Consulting. As software engineers, we’re constantly making detailed, elaborate plans for computers to...
New
First poster: cpgo
8 reasons to ditch Chrome and switch to Firefox. Chrome may dominate, but Firefox is a known name among browsers for a reason. Whether y...
New
CommunityNews
ABSTRACT In lieu of a traditional , I’ve tried to distill the essence of the talk into a collection of maxims: All programmers are API ...
New
First poster: bot
When Zig is safer and faster than Rust. There are endless debates online about Rust vs. Zig, this post explores a side of the argument I...
New
First poster: bot
Declarative GNOME configuration with NixOS. I adore tinkering with my machine, trying new tools, extensions, themes, and ideas. When I w...
New
First poster: jkdiaz
Dark mode isn’t as good for your eyes as you believe. The shadowy display mode has leagues of fans claiming it helps reduce eye strain, ...
New
First poster: adamaiken89
Why Ruby on Rails still matters. An old tool endures in a Next.js world
New

Other popular topics Top

AstonJ
If it’s a mechanical keyboard, which switches do you have? Would you recommend it? Why? What will your next keyboard be? Pics always w...
New
PragmaticBookshelf
Brace yourself for a fun challenge: build a photorealistic 3D renderer from scratch! In just a couple of weeks, build a ray tracer that r...
New
wolf4earth
@AstonJ prompted me to open this topic after I mentioned in the lockdown thread how I started to do a lot more for my fitness. https://f...
New
DevotionGeo
I know that -t flag is used along with -i flag for getting an interactive shell. But I cannot digest what the man page for docker run com...
New
AstonJ
Just done a fresh install of macOS Big Sur and on installing Erlang I am getting: asdf install erlang 23.1.2 Configure failed. checking ...
New
mafinar
Crystal recently reached version 1. I had been following it for awhile but never got to really learn it. Most languages I picked up out o...
New
foxtrottwist
A few weeks ago I started using Warp a terminal written in rust. Though in it’s current state of development there are a few caveats (tab...
New
PragmaticBookshelf
Author Spotlight Mike Riley @mriley This month, we turn the spotlight on Mike Riley, author of Portable Python Projects. Mike’s book ...
New
New
mindriot
Ok, well here are some thoughts and opinions on some of the ergonomic keyboards I have, I guess like mini review of each that I use enoug...
New