
CommunityNews
SWE-Bench+: Enhanced Coding Benchmark for LLMs
SWE-Bench+: Enhanced Coding Benchmark for LLMs.
Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits recently are developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we addressed this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWEAgent + GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent+GPT-4 was at the top of SWE-bench leaderboard during the time of our study. Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments. We refer to as solution leakage problem. 2) 31.08% of the passed patches are suspicious patches due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-Bench Verified. In addition, over 94% of the issues were created before LLM’s knowledge cutoff dates, posing potential data leakage issues.
Read in full here:
This thread was posted by one of our members via one of our news source trackers.
Popular General Dev topics










Other popular topics










Categories:
Sub Categories:
- All
- In The News
- Dev Chat (200)
- Questions (30)
- Resources (118)
- Blogs/Talks (26)
- Jobs (3)
- Events (14)
- Code Editors (58)
- Hardware (57)
- Reviews (4)
- Sales (15)
- Design & UX (4)
- Marketing & SEO (1)
- Industry & Culture (14)
- Ethics & Privacy (19)
- Business (4)
- Learning Methods (4)
- Content Creators (7)
- DevOps & Hosting (9)
Popular Portals
- /elixir
- /rust
- /wasm
- /ruby
- /erlang
- /phoenix
- /keyboards
- /rails
- /js
- /python
- /security
- /go
- /swift
- /vim
- /clojure
- /java
- /haskell
- /emacs
- /svelte
- /onivim
- /typescript
- /crystal
- /c-plus-plus
- /tailwind
- /kotlin
- /gleam
- /react
- /flutter
- /elm
- /ocaml
- /ash
- /vscode
- /opensuse
- /centos
- /php
- /deepseek
- /html
- /scala
- /zig
- /debian
- /nixos
- /lisp
- /agda
- /sublime-text
- /textmate
- /react-native
- /kubuntu
- /arch-linux
- /ubuntu
- /revery
- /spring
- /django
- /manjaro
- /diversity
- /nodejs
- /lua
- /c
- /julia
- /slackware
- /neovim