LLMs degrade after 15 turns; new industry tactics emerge
Serge Bulaev
Studies suggest that language models often lose reliability after about 15-20 back-and-forths in a conversation. This may happen because the model must split its attention across a longer chat, making it harder to recall or follow earlier instructions. Common problems include forgetting rules, repeating answers, and making up facts. Researchers and industry teams now use tactics like summarizing conversation history early, breaking tasks into smaller parts, and storing important facts outside the chat to counter these issues. There is still debate about whether bigger context windows can fix the problem, but most agree that better prompt handling and context management work better than simply enlarging the window.

LLMs can degrade significantly after as few as 2-4 turns (1-2 back-and-forths) in underspecified multi-turn conversations, with an average 39% accuracy drop. This is a well-documented failure mode for modern AI agents. Recent research attributes the drop to context bottlenecks, attention scarcity, and reduced precision as context grows. As a conversation lengthens, the model's attention is spread too thin, causing even top models like GPT-4 to become less reliable. A study in Nature found that under complex prompts, larger models performed no better than smaller ones ("Larger and more instructable language models become less reliable").
This isn't just a lab finding. Industry reports confirm that even models with massive 1M-token windows exhibit "quality degradation as that context fills up". The symptoms are immediate and disruptive: the LLM may forget initial instructions, repeat itself, or hallucinate answers.
Recent research clarifies this is not a bug with one specific model but a fundamental attention-allocation problem. As a chat's context log grows, the model's attention favors the most recent text, causing it to effectively 'forget' or underweight earlier, critical instructions.
Why Performance Degrades: The Root Causes of Context Decay
LLM performance drops in long chats because of 'context window saturation.' As the conversation history grows, the model must divide its attention across more information. This dilutes its focus on early instructions, causing it to forget rules, repeat answers, or lose track of the conversation's goal.
Research from MIT on reasoning stability highlights that models struggle when long conversations stray into unfamiliar territory, showing poor generalization beyond their training data ("Reasoning skills of large language models are often overestimated"). This issue, combined with the known "lost in the middle" retrieval problem, creates three overlapping pressures:
- Context dilution: every extra token lowers the relative weight of earlier ones.
- Instruction drift: partial recall of prior rules invites reinterpretation.
- Hallucination pressure: sparse attention on grounding facts increases confabulations.
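The context dilution pressure can be illustrated with a toy calculation. The sketch below is not how any production model allocates attention. It assumes a single softmax over hypothetical relevance scores, where one early instruction token scores slightly higher than every other token. The function names and score values are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_first_token(n_tokens, first_score=1.0, other_score=0.0):
    """Share of attention left for one early token among n_tokens.

    Assumption: the early token has a fixed score advantage and all
    later tokens compete with equal scores.
    """
    scores = [first_score] + [other_score] * (n_tokens - 1)
    return softmax(scores)[0]


# The early token's share shrinks as the context grows.
for n in (10, 100, 1000):
    print(n, round(weight_on_first_token(n), 4))
```

Even with a fixed score advantage, the early token's share falls as more tokens compete for the same normalized budget, which is the intuition behind the first bullet.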
Industry reports suggest that techniques like selective retention and Retrieval-Augmented Generation (RAG) maintain accuracy while reducing prompt size. This suggests that a smaller, more relevant context often outperforms a complete but bloated conversation history.
Proven Industry Fixes and Proactive Tactics
To combat this degradation, production teams don't wait for automated context limits. They proactively implement one or more of these context management strategies:
- Proactive summarization at roughly 70 percent capacity, preserving constraints verbatim.
- Task decomposition into smaller sub-sessions, each with its own fresh agent.
- External state commits: checklists or JSON blobs that store facts outside the prompt.
- Periodic re-anchoring: restating system rules every 8-10 turns to offset drift.
Industry reports suggest that context management techniques can significantly improve both efficiency and quality by pruning context early.
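Proactive summarization can be sketched as a small wrapper around the conversation history. This is a minimal sketch under stated assumptions, not a production implementation: `ManagedContext`, `estimate_tokens`, and the four-characters-per-token heuristic are hypothetical, and `summarize` stands in for whatever summarization call a team actually uses. Constraints are stored separately so that compaction never rewrites them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


@dataclass
class ManagedContext:
    window_tokens: int
    constraints: List[str]                    # kept verbatim, never summarized
    summarize: Callable[[List[Turn]], str]    # hypothetical summarizer hook
    compact_at: float = 0.70                  # compact at ~70 percent capacity
    keep_recent: int = 4                      # recent turns kept word for word
    turns: List[Turn] = field(default_factory=list)
    summary: str = ""

    def used_tokens(self) -> int:
        parts = list(self.constraints) + [self.summary] + [t.text for t in self.turns]
        return sum(estimate_tokens(p) for p in parts if p)

    def add(self, role: str, text: str) -> None:
        self.turns.append(Turn(role, text))
        if self.used_tokens() >= self.compact_at * self.window_tokens:
            self.compact()

    def compact(self) -> None:
        cut = max(0, len(self.turns) - self.keep_recent)
        old, recent = self.turns[:cut], self.turns[cut:]
        if not old:
            return
        # Fold the previous summary into the new one so nothing is silently lost.
        prior = [Turn("summary", self.summary)] if self.summary else []
        self.summary = self.summarize(prior + old)
        self.turns = recent

    def render(self) -> str:
        lines = ["RULES:"] + [f"- {c}" for c in self.constraints]
        if self.summary:
            lines += ["SUMMARY OF EARLIER TURNS:", self.summary]
        lines += [f"{t.role}: {t.text}" for t in self.turns]
        return "\n".join(lines)
```

Keeping constraints outside the summarized span is the design choice that matters here: the summarizer can lose detail about the chat, but the rules are replayed unchanged on every render.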
How the Industry is Adapting: Adoption Patterns
Adoption trends show a clear shift toward retrieval-first architectures. Benchmarks like LOCA-bench confirm that agents using active context curation outperform those relying on raw history for long-term tasks. This has led to the informal "15-turn rule" becoming standard practice in operations manuals, advising engineers to restart or summarize sessions that exceed this limit.
To maintain reliability, teams are implementing continuous evaluation pipelines, similar to software regression testing. These systems use "canary prompts" with known answers to detect performance drift and trigger alerts. Another key practice is requiring models to use abstention tokens, allowing them to state they are "unsure" rather than hallucinate an answer when context is unclear.
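A canary check of this kind might look like the sketch below. It assumes the model is exposed as a plain callable from prompt to answer, that abstention is signalled by the literal reply "unsure", and that a substring match is good enough to score an answer. All three are simplifying assumptions, and `run_canaries` is a hypothetical helper rather than part of any evaluation framework.

```python
from typing import Callable, Dict

ABSTAIN = "unsure"  # assumed abstention marker


def run_canaries(
    model: Callable[[str], str],
    canaries: Dict[str, str],
    min_pass_rate: float = 0.9,
) -> dict:
    """Run prompts with known answers and flag drift below a pass-rate floor."""
    passed, abstained, failed = [], [], []
    for prompt, expected in canaries.items():
        answer = model(prompt).strip().lower()
        if answer == ABSTAIN:
            abstained.append(prompt)      # model declined rather than guessed
        elif expected.lower() in answer:
            passed.append(prompt)
        else:
            failed.append(prompt)         # wrong answer: possible drift
    pass_rate = len(passed) / len(canaries) if canaries else 1.0
    return {
        "pass_rate": pass_rate,
        "abstained": abstained,
        "failed": failed,
        "alert": pass_rate < min_pass_rate,
    }
```

Tracking abstentions separately from failures lets a team distinguish a model that is honestly uncertain from one that is confidently wrong.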
The Ongoing Debate: Bigger Windows vs. Better Architecture
A key debate is whether simply scaling context windows - from 128K to 1M tokens - is a real solution. Many argue it only delays the performance drop, suggesting fundamental architectural changes are needed to solve attention dilution. Others are exploring fine-tuning techniques like position interpolation to improve memory. Industry reports suggest that reliability metrics are not keeping pace with benchmark scores, indicating a need for better evaluation methods to guide future development.
Current research is exploring advanced techniques like critic agents and tool-verified retrieval to extend reliable conversation length without increasing cost or latency. However, the current consensus is clear: smart, proactive context management and good prompt hygiene are far more effective and economical than simply brute-forcing larger context windows.
Why do LLMs start to break after ~2-4 conversational turns?
The drop is a known systems-level limit tied to finite context windows. As the accumulated prompt history grows, the model begins under-weighting earlier tokens, leading to repetition, instruction drift, hallucinations, and ignored constraints. Recent research shows that performance degrades well before the hard token limit is reached, a phenomenon often called "context rot" (Chroma Context Rot report).
What are some real failure symptoms users notice?
Teams report:
- Circular responses: the same explanation repeated verbatim.
- Instruction drift: the agent starts misinterpreting or dropping previously agreed rules.
- Hallucinated facts: confidently asserting data never supplied in the session.
- Silent refusal: ignoring requests without explanation.
Industry reports suggest that even the largest models show volatility on reliability indicators once context is long.
How can I stop degradation before it starts?
Combine layered tactics:
1. Reset or summarize the session around the 15-turn mark, or sooner if quality slips.
2. Decompose large tasks into shorter, chained subtasks (each under 15 turns).
3. Externalize state with explicit checklists or commit-style summaries stored outside the prompt.
4. Re-anchor critical instructions every 3-5 turns by repeating them at the start and end of the prompt.
5. Proactively compact context with early warning around 64% and session switching around 80%, rather than waiting for the 95% auto-trigger.
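Two of these tactics, threshold-based compaction and re-anchoring, reduce to a few lines of logic. The sketch below is illustrative only: `compaction_action` and `build_prompt` are hypothetical helpers, the 64% and 80% defaults come from the list above, and re-anchoring every 4 turns is one assumed value inside the 3-5 range.

```python
from typing import List


def compaction_action(
    used_tokens: int,
    window_tokens: int,
    warn_at: float = 0.64,
    switch_at: float = 0.80,
) -> str:
    """Map context usage to an action before any automatic trigger fires."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    ratio = used_tokens / window_tokens
    if ratio >= switch_at:
        return "switch_session"
    if ratio >= warn_at:
        return "warn"
    return "ok"


def build_prompt(
    rules: List[str],
    history: List[str],
    turn_index: int,
    reanchor_every: int = 4,
) -> str:
    """Place rules at the start, and repeat them at the end on re-anchor turns."""
    rule_lines = "\n".join(f"- {r}" for r in rules)
    parts = ["RULES:\n" + rule_lines] + list(history)
    if reanchor_every > 0 and turn_index > 0 and turn_index % reanchor_every == 0:
        parts.append("REMINDER OF RULES:\n" + rule_lines)
    return "\n\n".join(parts)
```

A caller would check `compaction_action` before each request and pass the running turn count to `build_prompt`, so the rules bracket the history only on the turns where drift is most likely to have accumulated.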
Does a bigger context window solve the problem?
Not by itself. Industry reports suggest quality still decays as the filled length grows, merely shifting the cliff further out. The key is context engineering - keeping only what the model needs right now via retrieval, summarization, and masking - rather than stuffing more into the prompt.
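As a rough illustration of keeping only what the model needs right now, the sketch below stores facts outside the prompt and pulls back the few that overlap with the current question. `FactStore` is a hypothetical class, and the word-overlap scoring is a deliberately crude stand-in for the embedding-based retrieval a real RAG system would use.

```python
import json
from typing import Dict, List


class FactStore:
    """Facts committed outside the prompt, serializable as a JSON blob."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}

    def commit(self, key: str, value: str) -> None:
        self._facts[key] = value

    def to_json(self) -> str:
        return json.dumps(self._facts, sort_keys=True)

    @classmethod
    def from_json(cls, blob: str) -> "FactStore":
        store = cls()
        store._facts = dict(json.loads(blob))
        return store

    def relevant(self, query: str, k: int = 3) -> List[str]:
        """Return up to k facts sharing the most words with the query."""
        query_words = set(query.lower().split())
        scored = []
        for key, value in self._facts.items():
            fact_words = set(f"{key} {value}".lower().split())
            overlap = len(query_words & fact_words)
            if overlap:
                scored.append((overlap, key, value))
        # Highest overlap first; ties broken alphabetically by key.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [f"{key}: {value}" for _, key, value in scored[:k]]
```

Only the lines returned by `relevant` are added to the next prompt, so the prompt stays small while the full state survives a session reset.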
Which specific metrics show the biggest gains from proactive compaction?
- Cost reduction: JetBrains reported that hybrid techniques reduced costs by 7% compared with pure observation masking and by 11% compared with LLM summarization, with up to USD 35 saved across the benchmark.
- Accuracy improvements: Industry reports suggest significant solve rate improvements on coding tasks under managed context vs raw history.
- Session length extension: Production teams report achieving significant extension in conversation length through intelligent compaction without visible quality loss.
The pattern is clear: curated context beats longer context.