CommunityNews

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for such tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment when rewards are sparse, as they often are in multi-step reasoning. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous work on maximum entropy reinforcement learning, it jointly learns a policy model and a value function by optimizing the soft Bellman equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be reused to guide tree search at no extra training cost, which can further boost performance at test time.
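To make "optimizing the soft Bellman equation" a little more concrete, here is a rough sketch of the kind of per-step consistency check that maximum entropy RL suggests. It is not taken from the paper or its code. It assumes deterministic transitions and a KL-regularized objective with a reference policy, under which the optimal policy and value satisfy V(s_t) - V(s_{t+1}) = r_t - beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)). The function names, the `beta` parameter, and the plain squared-error loss are all illustrative assumptions.

```python
from typing import List, Sequence


def soft_bellman_residuals(
    values: Sequence[float],       # V(s_0..s_T); values[T] is the terminal state
    rewards: Sequence[float],      # r_0..r_{T-1}; often zero except the last step
    logp_policy: Sequence[float],  # log pi(a_t | s_t) for each step
    logp_ref: Sequence[float],     # log pi_ref(a_t | s_t) for each step
    beta: float = 1.0,             # assumed KL-regularization strength
) -> List[float]:
    """Per-step residual of the soft Bellman consistency condition.

    A residual of zero at step t means
    V(s_t) - V(s_{t+1}) == r_t - beta * (log pi - log pi_ref).
    """
    steps = len(rewards)
    if len(values) != steps + 1 or len(logp_policy) != steps or len(logp_ref) != steps:
        raise ValueError("expected len(values) == len(rewards) + 1 and matching log-prob lengths")
    residuals = []
    for t in range(steps):
        kl_term = beta * (logp_policy[t] - logp_ref[t])
        residuals.append(values[t] - values[t + 1] - rewards[t] + kl_term)
    return residuals


def consistency_loss(residuals: Sequence[float]) -> float:
    """Mean squared residual; a stand-in for a joint policy/value training loss."""
    return sum(r * r for r in residuals) / len(residuals)


# Hypothetical 3-step trajectory with a sparse reward on the final step only.
example = soft_bellman_residuals(
    values=[1.0, 1.0, 1.0, 0.0],
    rewards=[0.0, 0.0, 1.0],
    logp_policy=[-0.5, -0.5, -0.5],
    logp_ref=[-0.5, -0.5, -0.5],
)
```

Because the residual is defined at every step, a value estimate can spread a sparse final reward back over earlier steps, which is the intuition behind the credit-assignment claim. It also needs only single trajectories with rewards, not preference pairs.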


This thread was posted by one of our members via one of our news source trackers.


#learning #llm


2024-12-23 17:08:43 UTC




