CommunityNews

SWE-Bench+: Enhanced Coding Benchmark for LLMs

Large Language Models (LLMs) can assist with coding tasks in Software Engineering (SE). To facilitate a rigorous evaluation of LLMs in practical coding contexts, Jimenez et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench is still missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We manually screened the instances where SWE-Agent+GPT-4 successfully resolved issues, comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating, as the solutions were directly provided in the issue report or the comments. We refer to this as the solution leakage problem. 2) 31.08% of the passed patches are suspicious due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-bench Verified. In addition, over 94% of the issues were created before the LLMs' knowledge cutoff dates, posing potential data leakage issues.
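To put the headline numbers in perspective, here is a rough back-of-the-envelope sketch that converts the two resolution rates into approximate issue counts and a relative drop. It assumes both rates are measured against the full 2,294-issue dataset, which the abstract implies but does not state outright. The function names are made up for illustration and are not from the paper.

```python
# Figures quoted in the abstract.
TOTAL_ISSUES = 2294   # size of the SWE-bench dataset
RATE_BEFORE = 12.47   # % resolved by SWE-Agent+GPT-4 before filtering
RATE_AFTER = 3.97     # % resolved after removing problematic issues


def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of issues that `rate_percent` of `total` represents.

    Assumption: the rate uses the full dataset as its denominator.
    """
    return round(rate_percent / 100 * total)


def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


resolved_before = implied_count(RATE_BEFORE, TOTAL_ISSUES)
resolved_after = implied_count(RATE_AFTER, TOTAL_ISSUES)
drop = relative_drop(RATE_BEFORE, RATE_AFTER)

print(f"Resolved before filtering: ~{resolved_before} issues")
print(f"Resolved after filtering:  ~{resolved_after} issues")
print(f"Relative drop in resolution rate: {drop:.1f}%")
```

Under that assumption, the rates correspond to roughly 286 and 91 issues, so about two thirds of the reported successes disappear after filtering. The 32.67% and 31.08% figures are left out on purpose: they are shares of the successful patches that were screened, and the abstract does not give that count.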


#coding


2025-02-28 16:40:41 UTC
