This new, dead simple prompt technique boosts accuracy on LLMs by up to 76% on non-reasoning tasks

cool guys
In the chaotic world of large language model (LLM) optimization, engineers have spent the last few years developing increasingly esoteric rituals to get better answers.

we have seen "chain of thought" (asking models to think step-by-step and often showing them "marks of logic" for the user), "emotional blackmail" (telling the model that her career depends on the answer, or that she is being accused of sexual misconduct), and complex multi-shot prompting frameworks.

But a new paper released by Google Research shows that we may have been thinking too much about this. The researchers found that simply repeating the input query – literally copying and pasting the prompt so that it appears twice – consistently improves performance across leading models, including Gemini, GPT-4o, Cloud, and DeepSeek.

paper, title "prompt repetition improves non-logical LLM," Released last month, just before the holidays, presents a conclusion that is almost suspiciously simple: For tasks that do not require complex reasoning steps, stating the hint twice produces significantly better results than stating it once.

Even better, because of how the Transformer architecture works, it "a strange move" Comes with virtually zero penalty in terms of production speed.

The Causal Blind Spot

To understand why repeating a question makes a supercomputer smarter, you have to look at the architectural limitations of the standard Transformer model.

Most modern LLMs are trained as "causality" Language model. This means that they process text strictly from left to right. When the model is processing the 5th token in your sentence, it can "to participate" (note) from tokens 1 to 4, but it has zero knowledge about token 6, because it hasn’t happened yet.

This creates a fundamental limitation in how models understand user queries. As the authors note, the order of information matters immensely.

Formatted as a query <CONTEXT> <QUESTION> often get different results than <QUESTION> <CONTEXT> Because, in the latter case, the model reads the question before knowing the context to which it should apply it.

Quick iteration breaks this limitation by transforming an input <QUERY> In <QUERY><QUERY>.

By the time the model starts processing Second Repetition of the query, it is already "Reading" First iteration. This allows tokens of the second copy to participate in every token of the first copy.

Effectively, the second iteration enjoys bidirectional attention – it can "look back" To resolve ambiguities in the entire query or to obtain specific details that may have been missed in one go.

Benchmark: 47 wins, 0 losses

Researchers Yaniv Levithan, Matan Kalman, and Yossi Matias tested this hypothesis on a set of seven popular benchmarks, including ARC, OpenBookOA, GSM8K, and MMLU-Pro. They evaluated seven different models, ranging from lightweight models like the Gemini 2.0 Flash Lite and GPT-4O-Mini to heavyweight models like the Cloud 3.7 Sonnet and DeepSeek v3. The results were statistically very good. When asking models No Using explicit logic (i.e., giving only straightforward answers), prompt repetition won 47 out of 70 head-to-head trials against the baseline, with zero losses. The benefits were particularly dramatic in tasks that required accurate retrieval from the signal. The team designed a custom "name index" Benchmark, where the model is given a list of 50 names and asked to identify the 25th name.

  • Basic performance: Gemini 2.0 Flash-Lite scores disappointingly 21.33% accuracy.

  • With repetition: accuracy skyrocketed 97.33%.

This represents a giant leap "causal blind spot" Correct. Once in a while, the model may lose track of the count by the time it reaches the 25th name. On repeated passes, the model effectively has the complete list "working memory" Before it tries to solve the recovery task.

"free lunch" of latency

Typically, adding text to a prompt increases cost and latency. If you double the input, surely you double the waiting time? Surprisingly, no. The paper shows that accelerated iteration is essentially "Free" Regarding user-perceived latency. LLM processing is divided into two stages:

  1. Prefill: The model processes the input prompt. It is highly parallelizable; The GPU can crunch the entire prompt matrix at once.

  2. Generation (Decoding): The model generates one token answer at a time. It is gradual and slow.

Rapid repetition only increases the work prefill stage. Because modern hardware handles prefill so efficiently, the user will barely notice the difference. Researchers found that repeating the signal caused this No Increase the length of the generated answer, nor is it increased "time of first token" Latency for most models. The only exceptions were Anthropic’s models (Cloud Haiku and Sonnet), which were on extremely long requests, where the prefill phase eventually hit a bottleneck. But in most use cases, the technology improves accuracy without slowing down the chat experience.

logic vs repetition

There is one caveat: this technique is primarily for "non logic" Tasks -Scenarios where you want a straightforward answer rather than a step-by-step derivation.

When researchers combined testing with accelerated repetition "chain of thought" (asking the model "think step by step"), the advantage largely disappeared, showing neutral to slightly positive results (5 wins, 1 loss, 22 ties).

The authors believe that reasoning models naturally perform a version of self-repetition. When a model "thinks," It often repeats its premise in the output it generates before solving the question. Therefore, it becomes unnecessary to explicitly repeat the signal in the input.

However, for applications where you need a fast, straightforward answer without the verbosity (and cost) of long logic trails, instant recursion provides a powerful alternative.

Strategic Implementation for the Enterprise

For enterprise leaders, this research represents one of the rarest of things in AI development: a "Free" Adaptation. But capitalizing requires nuance; This is not a setting to be toggled blindly across the organization, but a strategic adjustment that ripples across engineering, orchestration, and security.

For technological leadership balancing the eternal triangle of speed, quality and cost, accelerated iteration offers a way to punch above your weight class. The data shows that smaller, faster models – like the Gemini 2.0 Flash Lite – can achieve almost perfect retrieval accuracy (21.33% to 97.33%) by only processing the input twice.

This changes the calculus for model selection: Before upgrading to a larger, more expensive model to solve the accuracy constraint, engineers must first test whether simple iterations allow their current "light" Models to bridge the gap. This is a potential strategy to maintain the speed and cost benefits of lightweight infrastructure without compromising performance on extraction and retrieval tasks.

This logic naturally shifts the burden to the orchestration layer. For those managing the middleware and API gateways that link AI applications together, instant iteration should become a standard, invisible component of pipeline logic rather than user behavior.

However, because the technique is neutral for logic-heavy tasks but highly effective for straightforward answers, it requires conditional application. A smart orchestration harness will automatically identify requests sent to non-logical endpoints – such as entity extraction, taxonomy, or simple Q&A – and double-check the signal before passing it to the model. It optimizes performance at the infrastructure level, providing better results without requiring action from end users or increasing generation budgets.

Ultimately, this increased vigilance presents a new change for security teams.

If repeating a prompt makes the user’s intentions toward the model clear, it stands to reason that malicious intentions can also be made clear. Security directors will need to update their red-teaming protocols for testing "repeated injections" Attack—Verifying whether the jailbreak command is being repeated (e.g. "Ignore previous instructions") makes the model "to participate" To violate more effectively. Conversely, this mechanism provides a new defensive tool: repeating the system prompt.

Stating the security guardrails twice at the beginning of the context window can force the model to pay more strict attention to security constraints, which serves as a low-cost reinforcement for stronger security operations.

why does it matter

This research highlights an important insight for developers building on top of LLM: our current models are still deeply constrained by their unidirectional nature. While we wait for new architectures that can solve causal blindness, crude but effective solutions like accelerated iteration provide immediate value. The authors suggest that this may become a default behavior for future systems.

We may soon see inference engines that silently double our signals in the background before sending them to the model, or "logic" Models are trained to make this iterative strategy more efficient. For now, if you’re struggling to get a model to follow complex instructions or retrieve specific details from a lengthy document, the solution couldn’t be a better prompt. You’ll probably need to say it again.



<a href

Leave a Comment