CommunityNews

CommunityNews

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holisitically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable’s behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Read in full here:

Where Next?

Popular Ai topics Top

New
First poster: CommunityNews
Imagine you’re sitting at a casino’s poker table. Someone has explained the basic rules to you, but you’ve never played before and don’t ...
New
First poster: bot
AI Is Discovering Patterns in Pure Mathematics That Have Never Been Seen Before. We can add suggesting and proving mathematical theorems...
New
First poster: CommunityNews
A simple algorithm that revolutionizes how neural networks approach language is now taking on image classification as well. It may not st...
New
First poster: CommunityNews
OpenJourney is a Text-to-Image AI model which has the goal of bringing an open source equivalent to Midjourney to the people. It is curre...
New
First poster: alvinkatojr
Giving AI systems the ability to focus on particular brain regions can make them much better at reconstructing images of what a monkey is...
New
First poster: brennan
It’s Not a Hypothetical, I’ve Already Lost My Job to AI For The Last Year
New
First poster: mercyf
Google’s Veo 3 delivers AI videos of realistic people with sound and music. We put it to the test.
New
CommunityNews
Study shows how patterns in LLM training data can lead to “parahuman” responses.
New
First poster: jkdiaz
TechCrunch spoke to experienced coders about their time using AI-generated code about what they see as the future of vibe coding.
New

Other popular topics Top

PragmaticBookshelf
Learn from the award-winning programming series that inspired the Elixir language, and go on a step-by-step journey through the most impo...
New
ohm
Which, if any, games do you play? On what platform? I just bought (and completed) Minecraft Dungeons for my Nintendo Switch. Other than ...
New
dasdom
No chair. I have a standing desk. This post was split into a dedicated thread from our thread about chairs :slight_smile:
New
PragmaticBookshelf
From finance to artificial intelligence, genetic algorithms are a powerful tool with a wide array of applications. But you don't need an ...
New
Exadra37
I am asking for any distro that only has the bare-bones to be able to get a shell in the server and then just install the packages as we ...
New
Margaret
Hello everyone! This thread is to tell you about what authors from The Pragmatic Bookshelf are writing on Medium.
1147 29994 760
New
mafinar
This is going to be a long an frequently posted thread. While talking to a friend of mine who has taken data structure and algorithm cou...
New
PragmaticBookshelf
Programming Ruby is the most complete book on Ruby, covering both the language itself and the standard library as well as commonly used t...
New
RobertRichards
Hair Salon Games for Girls Fun Girls Hair Saloon game is mainly developed for kids. This game allows users to select virtual avatars to ...
New
PragmaticBookshelf
Use advanced functional programming principles, practical Domain-Driven Design techniques, and production-ready Elixir code to build scal...
New