You will find the 🇫🇷 French version of this article here
I’ve been coding for nearly thirty years. Twenty of them professionally. And I’m going to say something that would have seemed absurd just four years ago: artificial intelligences vastly outperform me in terms of code production. In speed, in volume, often in edge case coverage.
This isn’t a surrender. It’s an honest observation, and I’m at peace with it. These tools have made me more effective than I’ve ever been. Copilot, Claude, GPT — depending on the context, they regularly impress me. For implementing a known algorithm, wiring up an API, writing unit tests, or refactoring a function, their power is real and now undeniable.
But for a while, something had been nagging at me. An intuition I couldn’t quite articulate. This paper articulated it for me.
It’s titled SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration, published in early March 2026 on arXiv by researchers from Sun Yat-sen University and Alibaba Group. It asks a simple and unsettling question: we know LLMs write code — but do they write code that holds up over time?
To understand what this work contributes, you need to understand how LLMs are evaluated on code today. Classic benchmarks (HumanEval, SWE-bench, LiveCodeBench) all ask the same fundamental question: the agent receives a problem, produces a solution — does it pass the tests?
This is what researchers call “snapshot” evaluation: a photo at a single point in time. The model fixes a bug, generates a function, proposes a patch. We check. It works or it doesn’t.
One problem → one solution → tests pass. The agent is evaluated on a single act of production. What came before and what comes after does not exist.
SWE-CI takes the opposite approach: start from a real codebase, evolve the project across dozens of successive iterations, and measure whether the code remains maintainable over time.
The problem? In real life, software isn’t born in a single night and doesn’t die after its first deployment. It lives, mutates, ages. Features get added, interfaces change, colleagues (or agents) pick up what we’ve written. What matters then isn’t just that a working patch was produced — it’s that this patch didn’t mortgage the next fifty.
An agent that hard-codes a fragile workaround and an agent that writes clean, extensible code can both pass the same tests. Their difference only becomes visible at the third or fourth change.
This is precisely what Lehman’s laws of software evolution theorized back in the 1970s: software quality degrades naturally as it evolves. And classic literature estimates that maintenance accounts for 60 to 80% of the total lifecycle cost of software. Maintenance, not initial development.
The benchmark is carefully constructed. The researchers combed GitHub for serious Python projects: at least three years of active maintenance, at least 500 stars, a real test suite, a permissive license. From 4,923 filtered projects, they ultimately retained 100 cases from 68 distinct repositories.
For each case, they select two commits on the main branch: a starting commit (the “base”) and a target commit (the “oracle”), separated on average by 233 days and 71 commits of real development history. Between the two, at least 500 lines of source code have changed.
The agent must evolve the base toward the oracle, but not all at once. It proceeds through successive iterations, as a team would in continuous integration. At each turn:
An “architect” agent analyzes the failing tests, identifies root causes in the code, and produces a requirements document in natural language — no more than five priority requirements, framed in terms of expected behavior, without prescribing the implementation.
A “developer” agent reads this document, understands the behavioral contracts, plans its modifications, and writes the code. Without running the tests itself — the external system does that.
This dual protocol reproduces what happens in a real team. The architect doesn’t code. The developer doesn’t over-engineer. And it’s the cumulative result across the entire sequence that is measured.
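The loop described above can be sketched in a few lines. This is my own reading of the protocol, not the authors' code: `run_tests`, `architect`, and `developer` are hypothetical stand-ins for the external harness and the two agents.

```python
from dataclasses import dataclass

@dataclass
class TestResults:
    passing: set
    failing: set

# Hypothetical stand-ins for the paper's harness and agents.
def run_tests(repo):
    # The external CI system runs the suite; here, a stub keyed on repo state.
    failing = {t for t in repo["tests"] if t not in repo["fixed"]}
    return TestResults(passing=set(repo["tests"]) - failing, failing=failing)

def architect(repo, failing):
    # Analyzes failures and emits at most five behavioral requirements,
    # without prescribing the implementation.
    return [f"make {t} pass" for t in sorted(failing)][:5]

def developer(repo, requirements):
    # Plans and writes code for each requirement; never runs tests itself.
    return {req.split()[1] for req in requirements}

def evolve(repo, max_turns=10):
    for _ in range(max_turns):
        results = run_tests(repo)
        if not results.failing:
            break
        requirements = architect(repo, results.failing)
        repo["fixed"] |= developer(repo, requirements)
    return run_tests(repo)

repo = {"tests": {"t1", "t2", "t3"}, "fixed": set()}
final = evolve(repo)
print(len(final.failing))  # 0 once every test passes
```

The separation matters: because the developer agent never sees test output directly, it has to work from behavioral contracts, exactly as the paper intends.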
The researchers introduce two original metrics. The first, the normalized change, measures at each iteration how many additional tests pass relative to the base — with a symmetric penalty if tests that were passing get broken (what we call a regression).
The second, the EvoScore, aggregates these measurements across the entire sequence with increasing weight toward the later iterations. The idea is simple and sound: truly maintainable code is code that remains easy to modify as evolution progresses. An agent that succeeds in the early iterations by accumulating technical debt, then collapses afterward, will be penalized. An agent that progresses steadily, even slowly, will be rewarded.
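Here is one plausible reading of those two metrics, to make the mechanics concrete. The paper does not publish these exact formulas in the article above, so treat the weighting scheme as an assumption.

```python
# A plausible reading of the two metrics, not the paper's exact formulas.
def normalized_change(base_passing, now_passing, total):
    # Tests newly passing minus tests that regressed, relative to the total:
    # the "symmetric penalty" for breaking what worked.
    gained = len(now_passing - base_passing)
    regressed = len(base_passing - now_passing)
    return (gained - regressed) / total

def evo_score(changes, gamma=1.5):
    # Later iterations weigh more when gamma > 1 (hypothetical weighting).
    weights = [gamma ** i for i in range(len(changes))]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

base = {"t1"}
trajectory = [{"t1", "t2"}, {"t2", "t3"}, {"t1", "t2", "t3", "t4"}]
changes = [normalized_change(base, s, 4) for s in trajectory]
print([round(c, 2) for c in changes])  # [0.25, 0.25, 0.75]
```

Note the second iteration: two tests were gained but one regressed, so the net score stalls at 0.25 despite the apparent progress.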
The researchers evaluated 18 models from 8 different providers, spending over 10 billion tokens in total. Three major observations emerge.
Across all model families, recent versions systematically outperform their predecessors. And models released after early 2026 show particularly marked gains. This isn’t linear progression: it’s acceleration. What was difficult a year ago is beginning to be solved.
Over the entire observation period, the Claude Opus series stands out clearly at the top, with GLM-5 as another remarkable performer.
The γ parameter of the EvoScore allows varying the weight given to early versus late iterations. When you raise γ, you favor models that maintain quality over the long term. When you lower it, you reward immediate gains.
What the researchers observe is revealing: rankings change depending on γ. MiniMax, DeepSeek, and GPT favor long-term gains. Kimi and GLM prioritize quick returns. Qwen, Doubao, and Claude remain relatively stable regardless of weighting. The authors interpret this as a reflection of training choices — each provider orients its models differently, and it shows.
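The ranking flip the authors describe is easy to reproduce on toy trajectories. The numbers below are illustrative, not the paper's data:

```python
# Illustrative only: two invented score trajectories, not the paper's data.
def weighted_score(changes, gamma):
    weights = [gamma ** i for i in range(len(changes))]
    return sum(w * c for w, c in zip(weights, changes)) / sum(weights)

sprinter = [0.6, 0.5, 0.3, 0.2]    # strong start, fades as debt accumulates
marathoner = [0.2, 0.3, 0.5, 0.6]  # slow start, keeps improving

for gamma in (0.5, 2.0):
    s, m = weighted_score(sprinter, gamma), weighted_score(marathoner, gamma)
    leader = "sprinter" if s > m else "marathoner"
    print(f"gamma={gamma}: {leader} leads")
```

With γ below 1 the sprinter wins; above 1 the marathoner does. Same raw data, opposite ranking, which is exactly why a single leaderboard number hides these training choices.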
This is the most striking observation, and the most directly useful for anyone using AI in their projects.
A regression, in development, is when a change breaks something that was working before. It’s every experienced developer’s nightmare. And this is precisely where current LLMs struggle the most.
In concrete terms: if you ask most current LLMs to maintain a project over time, in more than 75% of cases, they will break something that was working. Not intentionally. Not through negligence. Through lack of a view of the whole — exactly like a junior developer who fixes a bug without reading the rest of the code.
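Detecting a regression between two CI runs is nothing more than set arithmetic on test results, which is what makes the metric so unambiguous (test names below are invented):

```python
# A regression is a test that passed before a change and fails after it.
def regressions(before_passing, after_passing):
    return before_passing - after_passing

before = {"test_login", "test_logout", "test_export"}
after = {"test_login", "test_export", "test_new_feature"}
print(regressions(before, after))  # {'test_logout'}
```

The new feature's test passing is irrelevant here: `test_logout` broke, and that is what the zero-regression rate counts.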
When I started building phpmetrics, the central question was: how do you know, objectively, whether a PHP project is healthy? Not whether it compiles. Not whether it passes tests. But whether the internal structure of the code will allow working with it six months from now without suffering.
Cyclomatic complexity. Coupling between modules. Class cohesion. Component instability. These metrics aren’t glamorous. They don’t answer the question “does it work?” — they answer the question “will it hold?”
ast-metrics extends this logic by going deeper into the syntactic structure of code, independent of language. The idea remains the same: give a picture of maintainability, not just functionality.
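To make one of these metrics concrete: here is a minimal cyclomatic complexity sketch for Python, using only the standard `ast` module. It illustrates the kind of structural measurement these tools perform; it is not their actual implementation, and real tools count more node types than this.

```python
import ast

# Cyclomatic complexity, minimally: 1 + one per decision point.
# A sketch of the kind of metric such tools compute, not their code.
DECISIONS = (ast.If, ast.For, ast.While, ast.ExceptHandler,
             ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, DECISIONS) for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:
        return "negative"
    for i in range(x):
        if i % 2 == 0 and i > 2:
            return "found"
    return "none"
"""
print(cyclomatic_complexity(snippet))  # 5: two ifs, one for, one boolean op, +1
```

The number itself answers nothing about whether the function works; it answers how many independent paths a maintainer must hold in their head to change it safely.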
What SWE-CI has just formalized for AI agents is exactly this distinction. And it struck me reading the paper: the researchers built, to evaluate LLMs, the same type of reasoning that has guided these tools from the beginning.
Making it work is necessary. Making it last is different. The two are not measured the same way.
LLMs excel today at making things work. They are progressing, fast, on the question of making things last. But they’re not there yet — with one exception. And this exception is not trivial: Claude Opus 4.6 reaches a zero-regression rate of 0.76. That’s remarkable. It’s also proof that it’s possible, and that the rest of the market will follow.
For me, the practical lesson is twofold.
First, maintainability metrics are not a luxury. They may have been when code was entirely human and teams naturally had a memory of the project. They become essential when code is generated at industrial speed, with tools that have no memory between sessions and no vision of the global architecture. Without external measurement, you’re flying blind.
Second, AI doesn’t replace architecture — it needs it all the more. An LLM generating a function does so in a local context, without seeing adjacent modules, without understanding the constraints that guided past decisions. The more we delegate code production to these tools, the more important it becomes for someone (a human) to maintain the overall vision, set the invariants, define the contracts.
This isn’t a criticism of AI. It’s a description of what it is today: an extraordinarily powerful production tool that needs a framework so its power doesn’t turn against itself.
Thirty years of code have taught me that the truly costly problems are almost never bugs. They’re architectural errors discovered too late, poorly thought-out dependencies, abstractions that don’t hold up over time. LLMs haven’t solved that yet. And that’s precisely why tools like phpmetrics or ast-metrics remain useful — not as a bulwark against AI, but as a necessary complement.
The SWE-CI paper is available on arXiv: arxiv.org/abs/2603.03823. It’s accessible, well-written, and its data is public on Hugging Face. If you work with AI agents on real projects, it’s worth a read.
© Jean-François Lépine, 2010 - 2026