CommunityNews

CommunityNews

SWE-Bench+: Enhanced Coding Benchmark for LLMs

SWE-Bench+: Enhanced Coding Benchmark for LLMs.
Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits recently are developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we addressed this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWEAgent + GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of SWE-bench leaderboard during the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments. We refer to as solution leakage problem. 2) 31.08% of the passed patches are suspicious patches due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-Bench Verified. In addition, over 94% of the issues were created before LLM’s knowledge cutoff dates, posing potential data leakage issues.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: bot
SPWN is a programming language that compiles to Geometry Dash levels. What that means is that you can create levels by using not only the...
New
First poster: AstonJ
:tada: Launching Fig I am excited to announce that, as of today, Fig is generally available to the public for download. With our public ...
New
First poster: bot
It has some interesting features: It’s entirely wireless (the left half speaks Bluetooth to the right half, and the right half speaks B...
New
First poster: bot
Developing Godot Projects with Neovim. When I started using Godot Engine, what surprised me the most is the built-in Language Server Pro...
New
First poster: bot
GitHub - lucidrains/PaLM-rlhf-pytorch: Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architectur...
New
CommunityNews
SLUM: The Shadow Library Uptime Monitor. This dashboard tracks the availability of popular shadow libraries in real time from a US-based...
New
First poster: dyowee
Software engineering job openings hit five-year low?. There are 35% fewer software developer job listings on Indeed today, than five yea...
New
New
First poster: braycarla
In beginning the NVIDIA Blackwell Linux testing with the GeForce RTX 5090 compute performance, besides all the CUDA/OpenCL/OptiX benchmar...
New
First poster: andrea
Most of what modern software engineers do involves APIs: public interfaces for communicating with a program, like this one from Twilio. I...
New

Other popular topics Top

PragmaticBookshelf
Write Elixir tests that you can be proud of. Dive into Elixir’s test philosophy and gain mastery over the terminology and concepts that u...
New
ohm
Which, if any, games do you play? On what platform? I just bought (and completed) Minecraft Dungeons for my Nintendo Switch. Other than ...
New
AstonJ
Thanks to @foxtrottwist’s and @Tomas’s posts in this thread: Poll: Which code editor do you use? I bought Onivim! :nerd_face: https://on...
New
AstonJ
If you are experiencing Rails console using 100% CPU on your dev machine, then updating your development and test gems might fix the issu...
New
PragmaticBookshelf
Build efficient applications that exploit the unique benefits of a pure functional language, learning from an engineer who uses Haskell t...
New
PragmaticBookshelf
Author Spotlight Jamis Buck @jamis This month, we have the pleasure of spotlighting author Jamis Buck, who has written Mazes for Prog...
New
PragmaticBookshelf
Author Spotlight Rebecca Skinner @RebeccaSkinner Welcome to our latest author spotlight, where we sit down with Rebecca Skinner, auth...
New
New
sir.laksmana_wenk
I’m able to do the “artistic” part of game-development; character designing/modeling, music, environment modeling, etc. However, I don’t...
New
NewsBot
Node.js v22.14.0 has been released. Link: Release 2025-02-11, Version 22.14.0 'Jod' (LTS), @aduh95 · nodejs/node · GitHub
New