You will find the 🇫🇷 French version of this article here
I had a simple need: get the best possible score on a language assessment task. A human-annotated dataset, 700 texts, and a way to test multiple prompts under the same conditions.
I did what everyone does. I started with an elaborate, structured prompt, following best practices. Then I tested variations. And the result surprised me: an 8-word prompt beat a prompt designed by 21 researchers.
At first I thought it was an anomaly. Then I looked into the literature, and I discovered this result had a name: instruction dilution. A documented phenomenon that affects all language models. The idea is simple and somewhat counterintuitive: when you add information to a prompt, even correct, even relevant information, performance can degrade. Not because the information is wrong, but because it dilutes the useful signal into noise.
What follows is what recent research teaches us about this phenomenon, illustrated by my experiment on those 700 texts.
The context: automated assessment of learners’ English language proficiency, according to the CEFR framework (the six levels A1 to C2). The model tested: GPT-4.1. The corpus: 700 written compositions, each evaluated by three certified human examiners.
We built four prompts. From the most elaborate to the most minimalist.
P1, the structured prompt: instructions up front, ### and """ separators, few-shot examples, format constraints. It follows the official OpenAI guide recommendations.
P2, the academic prompt: taken from Universal CEFR: Enabling Open Multilingual Research on Language Proficiency Assessment (arXiv:2506.01419), written by researchers specializing in automated assessment.
P3, the minimalist prompt: no role, no criteria, no list of levels. The model infers everything (and does it better than when you explain it).
P4, the enriched minimalist prompt: P3 plus context and the list of levels, two seemingly useful pieces of information. Result: 12 points below P3.
[Chart: accuracy (exact match) of the four prompts, GPT-4.1, n=700. Exact match on the A1-C2 scale, GPT-4.1 via API, academic corpus under CC BY-NC-SA 4.0, each text evaluated by three certified human examiners.]
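The metric itself is simple. Here is a minimal sketch of the scoring step, assuming predictions and human labels are CEFR level strings; the function name is mine, not from the experiment:

```python
def exact_match_accuracy(predictions, gold_labels):
    """Share of texts where the model's CEFR level equals the human label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("prediction/label count mismatch")
    hits = sum(p.strip().upper() == g.strip().upper()
               for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

# Toy example: 3 of 4 predictions match the human labels.
preds = ["B1", "A2", "C1", "B2"]
gold = ["B1", "A2", "B2", "B2"]
print(exact_match_accuracy(preds, gold))  # 0.75
```

Running the same scorer over all four prompts, on the same 700 texts, is what makes the comparison fair.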
P3 wins, and by a wide margin. It beats the academic prompt by 8 points, the few-shot prompt by 15 points, and its own enriched version by 12 points.
That last gap is the one that intrigued me most. Between P3 and P4, we only added two pieces of information that GPT-4.1 already knows: that the learner is studying English, and the list of CEFR levels. By repeating them, we didn’t help the model. We got in its way.
This 12-point gap between P3 and P4 is explained by what the literature calls instruction dilution: when you rephrase what the model already knows, you dilute the useful signal into noise.
The mechanism isn’t mysterious. Transformer attention allocates finite capacity on each pass. When part of that capacity is consumed by redundant information, even correct information, less remains for the task itself. Attention doesn’t disappear: it scatters across non-informative tokens. Jiang et al. (2024) describe “reduced perceptual ability due to the limited context window”: the context window is a finite resource, and every token consumed by noise is one less token for signal.
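The scattering effect is easy to see on a toy example. Softmax attention distributes a budget that always sums to 1; every non-informative token takes a share of it. This is a deliberately simplified illustration of that arithmetic, not a model of real transformer attention:

```python
import math

def softmax(logits):
    """Standard softmax: the resulting weights always sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One informative token (logit 2.0) among n neutral tokens (logit 0.0):
# the more neutral tokens, the smaller the share left for the signal.
for n_noise in (0, 10, 100):
    weights = softmax([2.0] + [0.0] * n_noise)
    print(f"{n_noise:>3} noise tokens -> weight on signal token: {weights[0]:.3f}")
```

The weight on the informative token drops from 1.0 with no noise to well under 0.1 with a hundred neutral tokens, even though nothing about the informative token itself changed.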
GPT-4.1 knows the CEFR. It was trained on massive amounts of language assessment data. When you ask it in one sentence to evaluate according to the CEFR, it activates exactly what it needs. When you provide detailed examples and instructions, however well-crafted, you’re offering a reformulation of what it already knows. And that reformulation creates friction. That’s why P1, the most elaborate prompt, finishes last.
A longer prompt is not a more precise prompt. On a domain the model has mastered, it's often a noisier prompt.
The question that follows: how systematic is this phenomenon? And what happens when you push models beyond just a few instructions?
The IFScale benchmark (Jaroslawicz et al., 2025) tested 20 models on tasks with 10 to 500 simultaneous instructions. The results reveal three distinct degradation profiles.
Three degradation profiles by instruction density
Source: IFScale (Jaroslawicz et al., arXiv:2507.11538), 2025.
Reasoning models like o3 or gemini-2.5-pro follow a threshold profile: performance stays near-perfect up to 150 or 250 instructions, then drops sharply. Gemini-2.5-pro goes from 98.4% at 100 instructions to 68.9% at 500. The model absorbs and absorbs, then gives way all at once.
Large general-purpose models like gpt-4.1 or claude-3.7-sonnet degrade steadily and predictably. GPT-4.1 goes from 95.4% to 48.9%, claude-3.7-sonnet from 94.8% to 52.7%. Each added instruction costs a bit of performance. This is the most actionable profile, because you can estimate the loss before incurring it.
Smaller models like claude-3.5-haiku or llama-4-scout collapse quickly then stabilize at a low plateau, between 7 and 15%. Beyond about a hundred instructions, adding or removing anything barely changes the result: the model has already disengaged.
It's not "shorter = better." It's that there's a threshold beyond which performance collapses; and that threshold depends on the model.
The nuance matters. Reasoning models resist remarkably well up to a critical point. General-purpose models degrade gradually but manageably. Smaller models simply aren’t designed for dense prompts. Knowing the degradation profile of the model you’re using means knowing how many instructions you can afford.
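For the linear profile, that estimate is a back-of-envelope calculation. A sketch, assuming GPT-4.1's reported IFScale endpoints (95.4% at 10 instructions, 48.9% at 500) and roughly linear degradation in between; the function and the endpoint mapping are my assumptions:

```python
def estimated_accuracy(n_instructions, p1=(10, 95.4), p2=(500, 48.9)):
    """Linear interpolation between two measured (instruction count, accuracy) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # roughly -0.095 accuracy points per added instruction
    return y1 + slope * (n_instructions - x1)

# Around 150 simultaneous instructions, expect accuracy near 82%.
print(round(estimated_accuracy(150), 1))
```

This is exactly what "the most actionable profile" means: you can price an added instruction before paying for it.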
IFScale also reveals something more subtle: models don’t fail the same way depending on instruction density.
At low density, when a model gets it wrong, it gets it wrong by a little. It approximates, it misinterprets a constraint. The researchers call this a modification error: the model tried, and fell short.
At high density, the failure mode changes radically. The model doesn’t get it wrong, it forgets. Instructions aren’t misinterpreted, they’re simply ignored. This is an omission error. And the numbers are striking: at 500 instructions, llama-4-scout shows an omission/modification ratio of 34.88x. For every approximation error, 35 instructions are simply forgotten. The model isn’t doing its best with an imperfect result. It’s giving up.
This shift has a direct consequence for how we design prompts. At moderate density, rephrasing or clarifying an instruction can help. At high density, it’s pointless, because the problem isn’t that the model misunderstands. It’s that it’s no longer reading.
There’s also a cognitive competition effect worth mentioning. The attention spent on instruction-following degrades the quality of the main task itself. IFScale shows that o3, at 500 instructions, produces about 1,500 output tokens where every third word must be an imposed keyword. The model devotes so much capacity to respecting formal constraints that not enough remains for the task.
The connection to my experiment is direct. P1 and P2 add instructions that compete with the main evaluation task. Even though those instructions are correct, they consume attention. P3 leaves all the attention to the model for what actually matters.
Instruction dilution is amplified by another well-documented problem: positional bias.
Liu et al. (Stanford, 2023) showed that LLMs exhibit a U-shaped attention bias. They process information at the beginning and end of the prompt better, and significantly degrade information placed in the middle. This is the Lost in the Middle phenomenon, and it affects even models explicitly trained for long contexts.
IFScale adds an unexpected nuance about primacy bias, the model’s tendency to favor earlier instructions. This bias follows a non-linear curve. At low density, below 100 instructions, it’s weak: the model processes instructions relatively uniformly. At moderate density, between 150 and 200 instructions, it peaks. The model becomes selective and favors the first instructions at the expense of later ones. Beyond 300 instructions, the bias converges to a low level, not because the model has become fair again, but because it’s switched to “uniform failure” mode: it ignores instructions evenly, regardless of position.
In practice, on moderate-length prompts (the most common case in production), position matters enormously. What’s at the beginning frames the task. What’s at the end acts as the last instruction before generation. What’s in the middle is the most vulnerable to being forgotten. And that’s one more reason to keep prompts short: the longer the prompt, the larger the “middle,” and the more pronounced the U-shaped bias.
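These positional findings can be encoded directly in how a prompt is assembled. A sketch of a template that puts the main instruction first, relegates bulkier material to the vulnerable middle, and ends with the critical constraints; the structure is mine, as one way to apply the U-shaped bias results:

```python
def build_prompt(task, context="", examples="", constraints=""):
    """Assemble prompt parts by attention priority: task first (primacy),
    constraints last (recency), context and examples in the weaker middle."""
    parts = [task]
    if context:
        parts.append(context)
    if examples:
        parts.append(examples)
    if constraints:
        parts.append(constraints)
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Evaluate the CEFR level of the following text.",
    context="Text: <learner composition here>",
    constraints="Answer with a single level: A1, A2, B1, B2, C1 or C2.",
)
print(prompt)
```

The ordering is the point: whatever must not be forgotten goes at an end, never in the middle.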
The message of this article is not “make shorter prompts.” It’s that beyond a certain threshold, adding information becomes counterproductive, and that threshold is lower than you think.
This threshold depends on several things. First, the model: reasoning models like o3 or gemini-2.5-pro resist instruction density far better than smaller models. A prompt that works on gpt-4.1 may collapse on claude-3.5-haiku.
Then, the task. My P3 wins because GPT-4.1 already knows the CEFR. It was trained on massive volumes of language assessment data. On a domain unknown to the model, a proprietary business framework, unpublished specialized jargon, a detailed prompt remains necessary. It’s precisely because the model doesn’t know that you need to tell it.
And finally, the nature of the added information. The entire difference lies between redundant and new information. Rephrasing what the model knows (the CEFR descriptors, the list of levels) creates noise. Providing what the model doesn’t know (a proprietary grading scale, specific business criteria) creates signal.
The rule isn't "be short." The rule is: don't rephrase what the model knows. Only add what it doesn't know.
In practice, this requires knowing the model’s boundaries, what’s part of its training and what isn’t. That boundary is rarely documented. It’s discovered by testing.
In many production use cases (RAG, document analysis, long contexts), you simply can’t shorten the prompt. The context is long because the problem demands it.
Research on prompt compression offers an interesting alternative: rather than removing information, you compress it. You eliminate noise while preserving signal.
The first family of approaches, called text-to-text, transforms text into shorter text by pruning non-informative tokens or by summarization. LLMLingua-2 from Microsoft achieves compression ratios of 3 to 6x with performance comparable to uncompressed text. On NaturalQuestions, F1 stays at 71.90 with 3.9x compression. On GSM8K, exact match reaches 79.08 at 5x. You divide context size by 4 or 5, and performance stays nearly identical.
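To make the text-to-text idea concrete, here is a deliberately naive pruning sketch: drop tokens a fixed list marks as low-information, keep the rest. LLMLingua-2 does this with a trained per-token classifier rather than a stopword list; this toy version only shows the shape of the approach:

```python
# Toy stopword list standing in for a learned per-token informativeness score.
LOW_INFORMATION = {"the", "a", "an", "of", "to", "in", "that", "is", "are",
                   "please", "kindly", "very", "really", "just"}

def prune_prompt(text):
    """Keep only tokens judged informative; real compressors score each token with a model."""
    kept = [tok for tok in text.split()
            if tok.lower().strip(".,") not in LOW_INFORMATION]
    return " ".join(kept)

original = "Please evaluate the CEFR level of the following text that is written by a learner."
compressed = prune_prompt(original)
print(compressed)
print(f"compression ratio: {len(original.split()) / len(compressed.split()):.1f}x")
```

Even this crude filter roughly halves the token count while keeping the task-bearing words; a learned scorer is what pushes the ratio to 3-6x without losing performance.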
The second family, text-to-vector, encodes text into compact vector representations that the literature calls gist tokens. The gisting technique achieves up to 26x compression with minimal loss. The idea: instead of having the model read 10,000 tokens, you encode them into a few hundred vectors that capture the essential information.
LLM-DCP (Jiang et al., 2024) goes further by modeling compression as a Markov Decision Process. Each token is evaluated sequentially, keep or remove, based on its contribution to the task. The result: 12.9x compression ratio.
Compression is an alternative to truncation. It preserves signal while eliminating noise (exactly what instruction dilution prevents us from doing manually).
When context is long, the question isn’t “how to shorten?” but “how to compress without losing signal?” It’s an engineering problem, not a writing problem.
From all of this, I take away a few principles that I now apply daily.
The first is atomicity. One instruction per prompt, stated in a single sentence if possible. When a task is complex, I break it into steps rather than making the prompt longer.
The second is position. Main instruction at the beginning, critical constraints at the end, context and examples in the middle, knowing they’ll be less well retained.
The third is testing at scale. A prompt that works on 10 examples may collapse on 700. IFScale results show that degradation profiles are predictable: it’s worth testing at different densities to find your model’s threshold.
And finally, when a prompt contains multiple constraints, a reminder at the end (“make sure you respect all constraints above”) can reduce the omission rate. It’s a palliative, not a solution, but it works on linear degradation profiles.
From these observations I drew concrete rules for production prompting: what changes when you move from conversation to API, how to separate the what from the how, and how to stabilize and evaluate prompts at scale.
Instruction dilution isn’t an intuition. It’s a measurable, reproducible phenomenon, and research is starting to understand it well. Beyond a certain threshold, adding information to a prompt degrades the model’s performance. That threshold depends on the model, the task, and the nature of the added information.
Prompting guides provide techniques. Research explains why some work and others don’t. Not because the guides are wrong, but because they’re written for a context, conversation, that doesn’t transfer directly to another, production at scale.
To write prompts that work at scale, you need to understand what’s happening in the model’s attention. Know what it already knows, what it doesn’t, and where the tipping point lies. And that can only be learned through experimentation. By getting it wrong on 700 examples and understanding why.
© Jean-François Lépine, 2010 - 2026