CommunityNews

CommunityNews

SWE-Bench+: Enhanced Coding Benchmark for LLMs

SWE-Bench+: Enhanced Coding Benchmark for LLMs.
Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits recently are developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we addressed this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWEAgent + GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of SWE-bench leaderboard during the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments. We refer to as solution leakage problem. 2) 31.08% of the passed patches are suspicious patches due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-Bench Verified. In addition, over 94% of the issues were created before LLM’s knowledge cutoff dates, posing potential data leakage issues.

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Where Next?

Popular General Dev topics Top

First poster: bot
SPWN is a programming language that compiles to Geometry Dash levels. What that means is that you can create levels by using not only the...
New
OvermindDL1
Yet another rust-made text editor, though I’m really liking the looks of how this one works!
New
First poster: cpgo
8 reasons to ditch Chrome and switch to Firefox. Chrome may dominate, but Firefox is a known name among browsers for a reason. Whether y...
New
First poster: bot
Large Language Models like ChatGPT say The Darnedest Things. The Errors They MakeWhy We Need to Document Them, and What We Have Decided ...
New
First poster: Korbin73
Whatever happened to Elm, anyway?. I see this question pop up quite frequently in lots of different arenas - folks are curious as to wha...
New
CommunityNews
Building GitHub with Ruby and Rails | The GitHub Blog. Since the beginning, GitHub.com has been a Ruby on Rails monolith. Today, the app...
New
First poster: joeb
GitHub - crablang/crab: A community fork of a language named after a plant fungus. All of the memory-safe features you love, now with 100...
New
First poster: joeb
The new frameworks will continue until morale improves.
/js
New
First poster: alvinkatojr
There are countless articles why developers should not focus on Frameworks too much and instead learn to understand the underlying langua...
New
New

Other popular topics Top

PragmaticBookshelf
Brace yourself for a fun challenge: build a photorealistic 3D renderer from scratch! In just a couple of weeks, build a ray tracer that r...
New
AstonJ
We have a thread about the keyboards we have, but what about nice keyboards we come across that we want? If you have seen any that look n...
New
AstonJ
Thanks to @foxtrottwist’s and @Tomas’s posts in this thread: Poll: Which code editor do you use? I bought Onivim! :nerd_face: https://on...
New
PragmaticBookshelf
Use WebRTC to build web applications that stream media and data in real time directly from one user to another, all in the browser. ...
New
mafinar
This is going to be a long an frequently posted thread. While talking to a friend of mine who has taken data structure and algorithm cou...
New
PragmaticBookshelf
Rails 7 completely redefines what it means to produce fantastic user experiences and provides a way to achieve all the benefits of single...
New
PragmaticBookshelf
Author Spotlight Rebecca Skinner @RebeccaSkinner Welcome to our latest author spotlight, where we sit down with Rebecca Skinner, auth...
New
New
CommunityNews
A Brief Review of the Minisforum V3 AMD Tablet. Update: I have created an awesome-minisforum-v3 GitHub repository to list information fo...
New
PragmaticBookshelf
A concise guide to MySQL 9 database administration, covering fundamental concepts, techniques, and best practices. Neil Smyth MySQL...
New