Syntheia published a useful piece this week on what they call "context rot" — the family of failures that occur when a large language model processes more text than it can reliably attend to. Their diagnosis is sharp: LLMs degrade silently on long documents, and the law firm's traditional quality-assurance architecture is not calibrated to catch the resulting errors. I agree with most of their analysis, but I want to take it further and offer solutions.
In this post, I explain the mechanics of context windows in terms aimed at the practicing lawyer, and then I propose concrete strategies to work within those constraints.
The context window, explained without jargon
Every LLM has a context window — the total amount of text it can hold in working memory for a single exchange. That window includes everything: the system instructions that tell the model how to behave, whatever documents you have uploaded or pasted in, the full history of your conversation, and the model's own response. All of it competes for the same finite space.
Context windows are measured in tokens, roughly three-quarters of a word in English. A "200,000-token context window" means roughly 150,000 words across all inputs combined, in a single conversation turn. That sounds enormous until you consider that a single commercial loan agreement can run 80,000 words and a due diligence data room can contain millions. For reference, the Claude system instruction alone — which is necessarily part of every conversation with Claude — can easily run to tens of thousands of tokens.
The critical point, and the one that most marketing materials omit, is that the advertised context window and the effective context window are not the same thing. NVIDIA's RULER benchmark tested models on long-input retrieval, multi-hop tracing, and aggregation tasks (the kind of cross-referencing that legal review demands) and found that effective performance sits at roughly 50 to 65 percent of the advertised token limit. A model with a 200,000-token window performs reliably on about 100,000 to 130,000 tokens of actual input. The number on the box is not the number that governs your work.
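The arithmetic above is easy to run for yourself. A minimal sketch, using the estimates from this post (roughly 0.75 English words per token, effective window at 50 to 65 percent of the advertised size) rather than any model's actual guarantees:

```python
# Rough token-budget arithmetic using the estimates discussed above.
# These constants are assumptions from this post, not model properties.
WORDS_PER_TOKEN = 0.75
EFFECTIVE_FRACTION = (0.50, 0.65)  # conservative and optimistic estimates

def estimated_tokens(word_count: int) -> int:
    """Convert an English word count to an approximate token count."""
    return round(word_count / WORDS_PER_TOKEN)

def fits_effective_window(word_count: int, advertised_tokens: int) -> bool:
    """True if the text fits even the conservative (50%) estimate
    of the model's effective window."""
    return estimated_tokens(word_count) <= advertised_tokens * EFFECTIVE_FRACTION[0]

# An 80,000-word loan agreement against a 200,000-token window:
print(estimated_tokens(80_000))                   # ~106,667 tokens
print(fits_effective_window(80_000, 200_000))     # exceeds the 100,000-token conservative estimate
```

On these assumptions, a single 80,000-word agreement already overruns the conservative effective window of a 200,000-token model, before a single word of instructions or conversation history is added.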
How the degradation works
The research literature identifies several distinct failure modes. They are worth understanding individually, because each one suggests a different mitigation strategy.
Positional bias. The Stanford "Lost in the Middle" research (Liu et al., TACL 2024) demonstrated that LLMs attend most strongly to text at the beginning and end of their input. In multi-document question answering, accuracy dropped by roughly 30 percentage points — from approximately 75% to approximately 45% — when relevant information moved from the first position to the middle of the context. In a 200-page agreement, the provisions that matter most are rarely on page one or page 200.
Volume-dependent reasoning decay. Du et al. (2025) isolated an even more troubling finding: reasoning accuracy degrades as context length increases even when the model has perfect access to all relevant information. They tested this by padding relevant text with whitespace (minimally distracting filler that should not confuse the model) and observed performance drops of up to 85 percent. The sheer volume of input makes the model a worse reasoner, independent of whether the right answer is present.
Conversation history displacement. When a conversation exceeds the context window, something has to go. In most current implementations, including Anthropic's Claude and OpenAI's ChatGPT, the system preserves the system prompt and truncates the oldest conversation turns first. Some platforms summarize rather than drop the earlier exchanges, though that introduces its own fidelity problems. The practical result is the same: the model loses track of what you discussed earlier in the session. The analytical framework you established, the specific issues you flagged, the constraints you set three exchanges ago, all of it becomes inaccessible. In custom or middleware implementations, the system prompt itself may also be at risk, though the major providers now treat it as pinned content.
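The oldest-first truncation described above can be sketched in a few lines. This is an illustration of the general mechanism, not any provider's actual implementation, and it simplifies token counting to word counting:

```python
# Sketch of oldest-first truncation: the system prompt is pinned,
# and the oldest conversation turns are dropped until the transcript
# fits the budget. Word counts stand in for token counts here.

def truncate_history(system_prompt: str, turns: list[str],
                     budget_words: int) -> list[str]:
    """Drop the oldest turns until system prompt + remaining turns
    fit within budget_words. Returns the surviving turns."""
    def words(s: str) -> int:
        return len(s.split())
    kept = list(turns)
    while kept and words(system_prompt) + sum(map(words, kept)) > budget_words:
        kept.pop(0)  # the oldest turn goes first
    return kept

turns = ["turn one text", "turn two text", "turn three text"]
print(truncate_history("system prompt here", turns, budget_words=9))
```

Note what the sketch makes visible: nothing about the surviving transcript signals that anything was removed. The model simply no longer has the dropped turns.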
Compression artifacts. Summarizing a document before feeding it to the model, a common workaround for length limitations, introduces its own errors. Compression algorithms often strip language that appears formulaic or repetitive, but legal documents are dense with formulaic language that carries substantive weight. "Subject to," "notwithstanding the foregoing," "except as provided in Section K": these phrases distinguish an absolute obligation from a qualified one. Pagnoni et al. (NAACL 2021) found that over 80 percent of summaries produced by the neural models evaluated contained factual errors, concentrated precisely in conditional and qualifying language. Current models perform better on standard summarization benchmarks, but the specific vulnerability to legal qualifying language persists because it is structural. Compression algorithms are designed to remove redundancy, and legal qualifiers are designed to look redundant while doing essential work.
These failure modes share a symptom: the output looks complete. It is well-formatted, internally coherent, and confident. Nothing about it signals that a substantial portion of the source material was functionally ignored. That is what distinguishes context rot from the more familiar hallucination problem, and what makes it harder to catch in review.
What to do about it
What follows are concrete approaches, ordered from simplest to most involved, that any lawyer can implement today.
1. One task, one conversation
This is probably the single highest-value habit change available to a non-technical user. Every AI conversation accumulates context: your prior messages, the model's prior responses, uploaded documents, session instructions. As the conversation grows, the model's effective reasoning capacity shrinks. Old instructions interfere with current tasks. Prior assumptions bleed into new analysis. The context fills with material that was useful ten exchanges ago and is now dead weight, what researchers call context pollution.
The fix is simple: start a new conversation for each discrete task. Do not use the same session to summarize a lease, then draft a demand letter, then review an indemnification clause. Each of those deserves a clean context window, and starting a new conversation is free, while the accuracy cost of a polluted one is invisible until something goes wrong.
I call this the OTOC rule — one task, one conversation. That's not to discourage iterative prompting. Iterative refinement of a single work product is still a single task and is an effective use of an LLM. Revising a draft and then pivoting to an unrelated analysis in the same session is two tasks crammed into one window — increasing the risk of context rot.
2. Write a durable task specification
The OTOC rule creates a practical problem: if every task gets a fresh conversation, you lose the background context the model needs to do good work. The overarching objectives, the governing law, the deal structure, the specific issues you care about — all of that vanishes when you close the session.
The solution is to write a reusable task specification: a short document (a few hundred words is usually sufficient) that captures the stable context for a project. Think of it as a briefing memo for the model. It should include the matter description, the governing jurisdiction, the relevant parties, the specific analytical framework you want applied, and any constraints or preferences that should carry across sessions.
You paste this specification at the top of each new conversation, or, better still, save it as its own file and attach it as input. The model reads it fresh every time, without the accumulated noise of prior exchanges. This is the complement to the OTOC rule: it lets you start clean without starting ignorant. Some tools (Anthropic's Claude Projects feature, for instance) let you attach persistent instructions to a project workspace that automatically prepopulate every conversation. If your platform supports it, use it.
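A skeleton of such a specification, expressed here as a reusable Python template. The headings and bracketed fields are illustrative; the point is that the stable context lives in one place and gets prepended to every fresh session:

```python
# Skeleton of a reusable task specification. The headings are
# illustrative; fill in the bracketed fields for your matter.

TASK_SPEC = """\
MATTER: [One-sentence description of the transaction or dispute]
GOVERNING LAW: [Jurisdiction]
PARTIES: [Names and roles]
FRAMEWORK: [The analytical approach to apply, e.g. a diligence checklist]
CONSTRAINTS: [Formatting, citation, or scope rules that persist across sessions]
"""

def new_conversation_opening(first_question: str) -> str:
    """Prepend the stable specification to the first prompt of a
    fresh session, per the OTOC rule."""
    return TASK_SPEC + "\n" + first_question
```

The same string works whether you paste it manually, keep it as an attached file, or load it into a persistent-instructions feature.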
3. Chunk your documents before the model reads them
If positional bias causes the model to lose track of middle-document content, and if volume alone degrades reasoning quality, then the logical response is to feed the model smaller, task-relevant segments rather than entire documents.
For a 200-page credit agreement, do not upload the entire file and ask the model to "review it." Instead, consider breaking the document into its component sections (representations and warranties, covenants, events of default, definitions, schedules) and submit each section in a separate conversation (applying the OTOC rule) with a targeted question. "Identify all financial covenants in the following section and flag any that use a trailing-twelve-month measurement period" will produce dramatically better results than "review this agreement and summarize the key terms."
One important caveat: legal documents are dense with internal cross-references (defined terms, conditions qualified by other sections, carve-outs incorporated by reference). When you chunk, you sever those links. The model analyzing the covenants will not know that a defined term in Article I changes the meaning of a financial ratio test, or that a carve-out in Schedule 3 qualifies an obligation in Section 12. The practical mitigation is to always include the definitions section (or at minimum the relevant defined terms) alongside whatever substantive section you are analyzing.
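A minimal sketch of this chunking workflow, assuming the document has already been split into named sections. The section names, placeholder text, and prompt labels are all illustrative; the structural point is that the definitions travel with every substantive chunk, and each chunk gets its own conversation:

```python
# Sketch of section-by-section chunking, with the definitions section
# included alongside every substantive chunk (per the caveat above).
# Section names and contents are placeholders, not from any real tool.

sections = {
    "definitions": "...Article I text...",
    "covenants": "...Article VI text...",
    "events_of_default": "...Article VII text...",
}

def build_chunk_prompt(section_name: str, question: str) -> str:
    """One self-contained prompt per section, each destined for its
    own fresh conversation (the OTOC rule)."""
    return "\n\n".join([
        "DEFINITIONS (for reference):",
        sections["definitions"],
        f"SECTION UNDER REVIEW ({section_name}):",
        sections[section_name],
        "TASK: " + question,
    ])

prompt = build_chunk_prompt(
    "covenants",
    "Identify all financial covenants and flag any that use a "
    "trailing-twelve-month measurement period.",
)
```

Each call produces one bounded, self-contained input; the lawyer's job is to run the loop and integrate the answers.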
Manual chunking is labor-intensive, but the labor is front-loaded and predictable. It converts one unreliable pass over an entire document into multiple reliable passes over bounded sections. The lawyer stitches the analysis back together, which is the level at which human judgment should operate regardless of whether AI is involved. For high-stakes tasks, the benefit of minimizing AI errors through manual chunking far outweighs the burden.
4. Use chain-of-thought prompting to structure the model's reasoning
Chain-of-thought prompting means explicitly instructing the model to reason through intermediate steps before reaching a conclusion. Instead of asking "Does Section 7.2 conflict with Schedule B?", you ask: "First, extract the operative language of Section 7.2 and state its requirements. Then extract the relevant provisions of Schedule B. Then identify any inconsistencies between them. Then state your conclusion."
This matters for context management because it forces the model to surface the textual evidence it is relying on before it reasons over that evidence. If the model skips a provision, you will see the gap in the intermediate step, before it gets papered over by a confident-sounding conclusion. Du et al. (2025) found that a simple version of this approach, prompting the model to recite the retrieved evidence before solving the problem, mitigated much of the performance loss caused by long contexts. The technique works because it forces the model to move relevant information into a high-attention position (the most recent output) before it reasons about it.
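The staged prompt from the example above can be generated mechanically, which keeps the step structure consistent across tasks. A sketch, with the provision names as parameters:

```python
# Sketch of a chain-of-thought prompt: the model is told to extract
# the evidence first, then reason over it. Step wording mirrors the
# example above; the provision names are placeholders.

def chain_of_thought_prompt(provision_a: str, provision_b: str) -> str:
    steps = [
        f"First, extract the operative language of {provision_a} "
        "and state its requirements.",
        f"Then extract the relevant provisions of {provision_b}.",
        "Then identify any inconsistencies between them.",
        "Then state your conclusion.",
    ]
    return "Answer step by step.\n" + "\n".join(
        f"{i}. {step}" for i, step in enumerate(steps, start=1)
    )

print(chain_of_thought_prompt("Section 7.2", "Schedule B"))
```

Because the extraction steps come first, a skipped provision shows up as a visible gap in the numbered output rather than disappearing into the conclusion.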
For legal work, chain-of-thought prompting also functions as a transparency mechanism. A model that shows its intermediate reasoning produces work product that a supervising lawyer can actually verify, because the intermediate steps expose the gaps that a polished final conclusion would conceal.
5. Place critical information strategically
The "Lost in the Middle" research has a direct practical corollary: put the most important content where the model pays the most attention. That means the beginning and end of your input, not the middle.
If you are asking the model to analyze a specific clause in the context of a larger document section, place the target clause at the top of your prompt, followed by the surrounding context, and then restate the analytical question at the end. If you are using a task specification (Strategy 2), put it at the top. If you have specific instructions about format or analytical framework, repeat them at the bottom. The worst arrangement, and the one most people default to, is pasting a large document and then typing the question at the bottom, burying the analytical instructions in a low-attention position.
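The assembly order described above can be captured in a small helper. A sketch, with illustrative section labels; the substance is the ordering, which puts the task specification and target clause in the high-attention opening position and restates the question at the high-attention end:

```python
# Sketch of the placement strategy: high-value content at the start
# (task spec, target clause), bulk context in the middle, and the
# question restated at the end. Labels are illustrative.

def assemble_prompt(task_spec: str, target_clause: str,
                    surrounding_context: str, question: str) -> str:
    return "\n\n".join([
        task_spec,                                        # top: high attention
        "TARGET CLAUSE:\n" + target_clause,               # top: high attention
        "SURROUNDING CONTEXT:\n" + surrounding_context,   # middle: low attention
        "QUESTION (restated):\n" + question,              # bottom: high attention
    ])
```

The default habit this replaces, pasting the document first and typing the question once at the very bottom, puts the bulk text in the high-attention opening slot and leaves nothing critical at the top.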
6. Verify in a separate conversation, not the one that produced the work
This follows directly from the OTOC rule. Generation and verification are different tasks, and they belong in different conversations.
When you ask the model to check its own work in the same session, the entire prior exchange sits in the context window: the assumptions, the omissions, the analytical choices the model made on its first pass. All of it exerts influence on the verification. A model reviewing its own conclusions is structurally biased toward confirming them, the equivalent of asking the same reviewer to read the same draft a second time and expecting fresh insight.
A de novo review in a fresh conversation eliminates that problem. Paste or upload the relevant source text and the model's output into a clean session. Ask: "Does this analysis accurately and completely reflect the source material? Identify every section of the source you relied on and quote the language supporting each conclusion." The new session has no prior commitments pulling it toward agreement. It is structurally analogous to the mid-level reviewing the junior's draft — fresh eyes on the same source.
A necessary warning: the model can fabricate quotations even in a clean session. It may generate text that looks like a verbatim extract but is actually a paraphrase, a conflation of multiple provisions, or an outright invention. The verification step itself requires verification — you must check the model's quoted language against the source document. That is additional work, but it is targeted work: instead of re-reading 200 pages looking for problems you do not know to expect, you are checking specific passages the model claims to have relied on. The de novo framing does not eliminate the need for human verification, but it gives you a structurally honest starting point for it.
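The quote-checking step lends itself to simple mechanical support before the human read. A sketch that flags any claimed quotation not found verbatim in the source, normalizing whitespace first since line breaks rarely survive copy-paste (the function name and sample text are illustrative):

```python
import re

def find_unsupported_quotes(source_text: str, quotes: list[str]) -> list[str]:
    """Return the claimed quotations that do NOT appear verbatim in
    the source, after whitespace normalization on both sides. A quote
    that passes still needs a human read in context; a quote that
    fails is a paraphrase, a conflation, or an invention."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    normalized_source = normalize(source_text)
    return [q for q in quotes if normalize(q) not in normalized_source]

source = "The Borrower shall,  subject to Section 8.1,\nmaintain insurance."
quotes = [
    "subject to Section 8.1, maintain insurance",        # verbatim: passes
    "the Borrower must maintain adequate insurance",     # paraphrase: flagged
]
print(find_unsupported_quotes(source, quotes))
```

A check like this catches outright fabrications cheaply; it cannot catch a real quotation deployed to support a conclusion it does not actually support, which remains the lawyer's job.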
The underlying principle
Every strategy above is a variation on a single idea: give the model less to think about, and tell it more precisely what to think about it. That runs against the grain of how most people use these tools. The natural instinct is to dump everything into the conversation and let the AI sort it out, and the marketing encourages exactly that — "upload your entire contract," "ask anything about your documents." The context window numbers are designed to suggest the model can handle it all.
It can, in the sense that it will produce output. What it cannot do — reliably, on long documents, under token pressure — is produce output accurate enough to stake a client's interests on. The strategies in this post are all ways of closing that gap: structuring the input so the model's actual capabilities match the demands of the task. The work is unglamorous — writing briefing documents for a machine, manually splitting PDFs, running the same analysis twice in separate sessions. But it maps directly onto skills lawyers already have. Scoping a task, preparing materials for review, verifying work product against source documents — these are not new professional obligations. They are existing ones, applied to a new tool.
This post draws on Liu et al., Lost in the Middle: How Language Models Use Long Contexts (TACL 2024); Du et al., Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (EMNLP 2025); NVIDIA's RULER benchmark (2024); and Pagnoni et al., Understanding Factuality in Abstractive Summarization with FRANK (NAACL 2021). Anthropic's context window documentation and context management guidance informed the discussion of conversation history displacement. For context on the data-handling and compliance dimensions of AI tool selection, see prior entries in this series on consumer-versus-commercial data handling, API compliance architecture, and the duty to counsel clients about AI privilege risks.