You will find the 🇫🇷 French version of this article here
I had a simple need: get the best possible score on a language assessment task. A human-annotated dataset, 700 texts, and a way to test multiple prompts under the same conditions.
I did what everyone does. I started with an elaborate, structured prompt, following best practices. Then I tested variations. And the result surprised me: an 8-word prompt beat a prompt designed by 21 researchers.
At first I thought it was an anomaly. Then I looked into the literature, and I discovered this result had a name: instruction dilution. A documented phenomenon that affects all language models. The idea is simple and somewhat counterintuitive: when you add information to a prompt, even correct, even relevant information, performance can degrade. Not because the information is wrong, but because it dilutes the useful signal into noise.
What follows is what recent research teaches us about this phenomenon, illustrated by my experiment on those 700 texts.
The context: automated assessment of learners’ English language proficiency, according to the CEFR framework (the six levels A1 to C2). The model tested: GPT-4.1. The corpus: 700 written compositions, each evaluated by three certified human examiners.
We built four prompts. From the most elaborate to the most minimalist.
P1, the structured prompt: instructions up front, ### and """ separators, few-shot examples, format constraints. It follows the official OpenAI guide recommendations.
P2, the academic prompt: taken from Universal CEFR: Enabling Open Multilingual Research on Language Proficiency Assessment (arXiv:2506.01419), written by researchers specializing in automated assessment.
P3, the minimalist prompt: no role, no criteria, no list of levels. The model infers everything (and does it better than when you explain it).
P4, the enriched minimalist prompt: P3 plus context and the list of levels, two seemingly useful pieces of information. Result: 12 points below P3.
[Chart: accuracy (exact match) of the four prompts, GPT-4.1, n=700. Exact match on the A1-C2 scale, GPT-4.1 via API, academic corpus under CC BY-NC-SA 4.0, each text evaluated by three certified human examiners.]
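The metric itself is simple. Here is a minimal sketch of the scoring step, assuming predictions and human labels are CEFR level strings; the function name is mine, not from the experiment:

```python
def exact_match_accuracy(predictions, gold_labels):
    """Share of texts where the model's CEFR level equals the human label."""
    if len(predictions) != len(gold_labels):
        raise ValueError("prediction/label count mismatch")
    hits = sum(p.strip().upper() == g.strip().upper()
               for p, g in zip(predictions, gold_labels))
    return hits / len(gold_labels)

# Toy example: 3 of 4 predictions match the human labels.
preds = ["B1", "A2", "C1", "B2"]
gold = ["B1", "A2", "B2", "B2"]
print(exact_match_accuracy(preds, gold))  # 0.75
```

Running the same scorer over all four prompts, on the same 700 texts, is what makes the comparison fair.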
P3 wins, and by a wide margin. It beats the academic prompt by 8 points, the few-shot prompt by 15 points, and its own enriched version by 12 points.
That last gap is the one that intrigued me most. Between P3 and P4, we only added two pieces of information that GPT-4.1 already knows: that the learner is studying English, and the list of CEFR levels. By repeating them, we didn’t help the model. We got in its way.
This 12-point gap between P3 and P4 is explained by what the literature calls instruction dilution: when you rephrase what the model already knows, you dilute the useful signal into noise.
The mechanism isn’t mysterious. Transformer attention allocates finite capacity on each pass. When part of that capacity is consumed by redundant information, even correct information, less remains for the task itself. Attention doesn’t disappear: it scatters across non-informative tokens. Jiang et al. (2024) describe “reduced perceptual ability due to the limited context window”: the context window is a finite resource, and every token consumed by noise is one less token for signal.
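The scattering effect is easy to see on a toy example. Softmax attention distributes a budget that always sums to 1; every non-informative token takes a share of it. This is a deliberately simplified illustration of that arithmetic, not a model of real transformer attention:

```python
import math

def softmax(logits):
    """Standard softmax: the resulting weights always sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One informative token (logit 2.0) among n neutral tokens (logit 0.0):
# the more neutral tokens, the smaller the share left for the signal.
for n_noise in (0, 10, 100):
    weights = softmax([2.0] + [0.0] * n_noise)
    print(f"{n_noise:>3} noise tokens -> weight on signal token: {weights[0]:.3f}")
```

The weight on the informative token drops from 1.0 with no noise to well under 0.1 with a hundred neutral tokens, even though nothing about the informative token itself changed.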
GPT-4.1 knows the CEFR. It was trained on massive amounts of language assessment data. When you ask it in one sentence to evaluate according to the CEFR, it activates exactly what it needs. When you provide detailed examples and instructions, however well-crafted, you’re offering a reformulation of what it already knows. And that reformulation creates friction. That’s why P1, the most elaborate prompt, finishes last.
A longer prompt is not a more precise prompt. On a domain the model has mastered, it's often a noisier prompt.
The question that follows: how systematic is this phenomenon? And what happens when you push models beyond just a few instructions?
The IFScale benchmark (Jaroslawicz et al., 2025) tested 20 models on tasks with 10 to 500 simultaneous instructions. The results reveal three distinct degradation profiles.
Three degradation profiles by instruction density
Source: IFScale (Jaroslawicz et al., arXiv:2507.11538), 2025.
Reasoning models like o3 or gemini-2.5-pro follow a threshold profile: performance stays near-perfect up to 150 or 250 instructions, then drops sharply. Gemini-2.5-pro goes from 98.4% at 100 instructions to 68.9% at 500. The model absorbs and absorbs, then gives way all at once.
Large general-purpose models like gpt-4.1 or claude-3.7-sonnet degrade steadily and predictably. GPT-4.1 goes from 95.4% to 48.9%, claude-3.7-sonnet from 94.8% to 52.7%. Each added instruction costs a bit of performance. This is the most actionable profile, because you can estimate the loss before incurring it.
Smaller models like claude-3.5-haiku or llama-4-scout collapse quickly then stabilize at a low plateau, between 7 and 15%. Beyond about a hundred instructions, adding or removing anything barely changes the result: the model has already disengaged.
It's not "shorter = better." It's that there's a threshold beyond which performance collapses; and that threshold depends on the model.
The nuance matters. Reasoning models resist remarkably well up to a critical point. General-purpose models degrade gradually but manageably. Smaller models simply aren’t designed for dense prompts. Knowing the degradation profile of the model you’re using means knowing how many instructions you can afford.
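For the linear profile, that estimate is a back-of-envelope calculation. A sketch, assuming GPT-4.1's reported IFScale endpoints (95.4% at 10 instructions, 48.9% at 500) and roughly linear degradation in between; the function and the endpoint mapping are my assumptions:

```python
def estimated_accuracy(n_instructions, p1=(10, 95.4), p2=(500, 48.9)):
    """Linear interpolation between two measured (instruction count, accuracy) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # roughly -0.095 accuracy points per added instruction
    return y1 + slope * (n_instructions - x1)

# Around 150 simultaneous instructions, expect accuracy near 82%.
print(round(estimated_accuracy(150), 1))
```

This is exactly what "the most actionable profile" means: you can price an added instruction before paying for it.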
IFScale also reveals something more subtle: models don’t fail the same way depending on instruction density.
At low density, when a model gets it wrong, it gets it wrong by a little. It approximates, it misinterprets a constraint. The researchers call this a modification error: the model tried, and fell short.
At high density, the failure mode changes radically. The model doesn’t get it wrong, it forgets. Instructions aren’t misinterpreted, they’re simply ignored. This is an omission error. And the numbers are striking: at 500 instructions, llama-4-scout shows an omission/modification ratio of 34.88x. For every approximation error, 35 instructions are simply forgotten. The model isn’t doing its best with an imperfect result. It’s giving up.
This shift has a direct consequence for how we design prompts. At moderate density, rephrasing or clarifying an instruction can help. At high density, it’s pointless, because the problem isn’t that the model misunderstands. It’s that it’s no longer reading.
There’s also a cognitive competition effect worth mentioning. The attention spent on instruction-following degrades the quality of the main task itself. IFScale shows that o3, at 500 instructions, produces about 1,500 output tokens where every third word must be an imposed keyword. The model devotes so much capacity to respecting formal constraints that not enough remains for the task.
The connection to my experiment is direct. P1 and P2 add instructions that compete with the main evaluation task. Even though those instructions are correct, they consume attention. P3 leaves all the attention to the model for what actually matters.
Instruction dilution is amplified by another well-documented problem: positional bias.
Liu et al. (Stanford, 2023) showed that LLMs exhibit a U-shaped attention bias. They process information at the beginning and end of the prompt better, and significantly degrade information placed in the middle. This is the Lost in the Middle phenomenon, and it affects even models explicitly trained for long contexts.
IFScale adds an unexpected nuance about primacy bias, the model’s tendency to favor earlier instructions. This bias follows a non-linear curve. At low density, below 100 instructions, it’s weak: the model processes instructions relatively uniformly. At moderate density, between 150 and 200 instructions, it peaks. The model becomes selective and favors the first instructions at the expense of later ones. Beyond 300 instructions, the bias converges to a low level, not because the model has become fair again, but because it’s switched to “uniform failure” mode: it ignores instructions evenly, regardless of position.
In practice, on moderate-length prompts (the most common case in production), position matters enormously. What’s at the beginning frames the task. What’s at the end acts as the last instruction before generation. What’s in the middle is the most vulnerable to being forgotten. And that’s one more reason to keep prompts short: the longer the prompt, the larger the “middle,” and the more pronounced the U-shaped bias.
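These positional findings can be encoded directly in how a prompt is assembled. A sketch of a template that puts the main instruction first, relegates bulkier material to the vulnerable middle, and ends with the critical constraints; the structure is mine, as one way to apply the U-shaped bias results:

```python
def build_prompt(task, context="", examples="", constraints=""):
    """Assemble prompt parts by attention priority: task first (primacy),
    constraints last (recency), context and examples in the weaker middle."""
    parts = [task]
    if context:
        parts.append(context)
    if examples:
        parts.append(examples)
    if constraints:
        parts.append(constraints)
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Evaluate the CEFR level of the following text.",
    context="Text: <learner composition here>",
    constraints="Answer with a single level: A1, A2, B1, B2, C1 or C2.",
)
print(prompt)
```

The ordering is the point: whatever must not be forgotten goes at an end, never in the middle.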
The message of this article is not “make shorter prompts.” It’s that beyond a certain threshold, adding information becomes counterproductive, and that threshold is lower than you think.
This threshold depends on several things. First, the model: reasoning models like o3 or gemini-2.5-pro resist instruction density far better than smaller models. A prompt that works on gpt-4.1 may collapse on claude-3.5-haiku.
Then, the task. My P3 wins because GPT-4.1 already knows the CEFR. It was trained on massive volumes of language assessment data. On a domain unknown to the model, a proprietary business framework, unpublished specialized jargon, a detailed prompt remains necessary. It’s precisely because the model doesn’t know that you need to tell it.
And finally, the nature of the added information. The entire difference lies between redundant and new information. Rephrasing what the model knows (the CEFR descriptors, the list of levels) creates noise. Providing what the model doesn’t know (a proprietary grading scale, specific business criteria) creates signal.
The rule isn't "be short." The rule is: don't rephrase what the model knows. Only add what it doesn't know.
In practice, this requires knowing the model’s boundaries, what’s part of its training and what isn’t. That boundary is rarely documented. It’s discovered by testing.
In many production use cases (RAG, document analysis, long contexts), you simply can’t shorten the prompt. The context is long because the problem demands it.
Research on prompt compression offers an interesting alternative: rather than removing information, you compress it. You eliminate noise while preserving signal.
The first family of approaches, called text-to-text, transforms text into shorter text by pruning non-informative tokens or by summarization. LLMLingua-2 from Microsoft achieves compression ratios of 3 to 6x with performance comparable to uncompressed text. On NaturalQuestions, F1 stays at 71.90 with 3.9x compression. On GSM8K, exact match reaches 79.08 at 5x. You divide context size by 4 or 5, and performance stays nearly identical.
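To make the text-to-text idea concrete, here is a deliberately naive pruning sketch: drop tokens a fixed list marks as low-information, keep the rest. LLMLingua-2 does this with a trained per-token classifier rather than a stopword list; this toy version only shows the shape of the approach:

```python
# Toy stopword list standing in for a learned per-token informativeness score.
LOW_INFORMATION = {"the", "a", "an", "of", "to", "in", "that", "is", "are",
                   "please", "kindly", "very", "really", "just"}

def prune_prompt(text):
    """Keep only tokens judged informative; real compressors score each token with a model."""
    kept = [tok for tok in text.split()
            if tok.lower().strip(".,") not in LOW_INFORMATION]
    return " ".join(kept)

original = "Please evaluate the CEFR level of the following text that is written by a learner."
compressed = prune_prompt(original)
print(compressed)
print(f"compression ratio: {len(original.split()) / len(compressed.split()):.1f}x")
```

Even this crude filter roughly halves the token count while keeping the task-bearing words; a learned scorer is what pushes the ratio to 3-6x without losing performance.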
The second family, text-to-vector, encodes text into compact vector representations that the literature calls gist tokens. The gisting technique achieves up to 26x compression with minimal loss. The idea: instead of having the model read 10,000 tokens, you encode them into a few hundred vectors that capture the essential information.
LLM-DCP (Jiang et al., 2024) goes further by modeling compression as a Markov Decision Process. Each token is evaluated sequentially, keep or remove, based on its contribution to the task. The result: 12.9x compression ratio.
Compression is an alternative to truncation. It preserves signal while eliminating noise (exactly what instruction dilution prevents us from doing manually).
When context is long, the question isn’t “how to shorten?” but “how to compress without losing signal?” It’s an engineering problem, not a writing problem.
From all of this, I take away a few principles that I now apply daily.
The first is atomicity. One instruction per prompt, stated in a single sentence if possible. When a task is complex, I break it into steps rather than making the prompt longer.
The second is position. Main instruction at the beginning, critical constraints at the end, context and examples in the middle, knowing they’ll be less well retained.
The third is testing at scale. A prompt that works on 10 examples may collapse on 700. IFScale results show that degradation profiles are predictable: it’s worth testing at different densities to find your model’s threshold.
And finally, when a prompt contains multiple constraints, a reminder at the end (“make sure you respect all constraints above”) can reduce the omission rate. It’s a palliative, not a solution, but it works on linear degradation profiles.
From these observations I drew concrete rules for production prompting: what changes when you move from conversation to API, how to separate the what from the how, and how to stabilize and evaluate prompts at scale.
Instruction dilution isn’t an intuition. It’s a measurable, reproducible phenomenon, and research is starting to understand it well. Beyond a certain threshold, adding information to a prompt degrades the model’s performance. That threshold depends on the model, the task, and the nature of the added information.
Prompting guides provide techniques. Research explains why some work and others don’t. Not because the guides are wrong, but because they’re written for a context, conversation, that doesn’t transfer directly to another, production at scale.
To write prompts that work at scale, you need to understand what’s happening in the model’s attention. Know what it already knows, what it doesn’t, and where the tipping point lies. And that can only be learned through experimentation. By getting it wrong on 700 examples and understanding why.
© Jean-François Lépine, 2010 - 2026