Many of us learned how to prompt large language models between late 2022 and the middle of 2024. The techniques that worked during that period (writing in detail, using lots of bullet points, telling the model to think step by step, setting the temperature to zero for factual work, and leaning on emphatic CAPITAL LETTERS for the rules that really count) are no longer best practices. Anthropic, OpenAI, and Google have all rewritten their prompt engineering guides over the past several months, and all three now advise against the older prompting techniques. The frontier models follow instructions with what OpenAI calls “surgical precision,” which means that contradictions and overstuffed rule lists noticeably degrade outputs, and prescriptions that worked on GPT-4 or Claude 3 can now make newer models perform worse.
Below I explain some of the key techniques and practices that changed, focusing on the ones that are most likely to affect lawyers’ use of LLMs, i.e., grounded research, citation discipline, and verification before delivery. I tried to synthesize them in a way that is model-agnostic, meaning that they should improve the output quality of any AI tool you are using.
State the outcome, not the procedure
All three vendors recommend moving away from “first, do X; then do Y; then do Z” prompts and toward prompts that describe the desired outcome and the constraints, allowing the model to decide how to get there. Anthropic’s current guidance for Claude Opus 4.7 puts it directly: “shorter, outcome-first prompts usually work better than process-heavy prompt stacks.” Google’s recent recommendations say the same: the model responds best to prompts that are “direct, well-structured, and clearly define the task and any constraints.”
The elaborate step-by-step research scripts many practitioners built up in 2024 (“first identify the issue; then list the controlling authority; then summarize each case in two sentences; then apply the law to the facts; then write a conclusion”) may now be counterproductive. Replace them with a goal, a set of success criteria, and any hard constraints. The model will decide the best path, and if it does poorly, the right fix is to clarify the outcome rather than to script more steps. The same applies to “think step by step,” which was added to coax chain-of-thought reasoning out of older models. Current reasoning models do this internally; the phrase now reads as noise.
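For readers who keep their prompts in scripts, the goal, success criteria, and constraints structure can be sketched as a small Python helper. This is a minimal illustration only: the function name and the section labels ("Goal", "A good answer will", "Hard constraints") are my own assumptions, not a format taken from any vendor guide.

```python
def build_outcome_prompt(goal, success_criteria, constraints=()):
    """Assemble a prompt from a goal, success criteria, and hard constraints.

    The labels used here are illustrative; nothing about them is a vendor format.
    """
    lines = [f"Goal: {goal.strip()}", "", "A good answer will:"]
    lines += [c.strip() for c in success_criteria]
    if constraints:
        # Hard constraints are optional and kept separate from success criteria.
        lines += ["", "Hard constraints:"]
        lines += [c.strip() for c in constraints]
    return "\n".join(lines)


prompt = build_outcome_prompt(
    goal="Advise whether the attached complaint states a claim for breach of contract.",
    success_criteria=[
        "Identify the controlling authority and apply it to the pleaded facts.",
        "State a conclusion a partner can act on.",
    ],
    constraints=["Never fabricate a citation."],
)
print(prompt)
```

Note what the helper leaves out: there is no slot for a numbered procedure. If the output disappoints, the edit goes into the goal or the success criteria.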
Use ALL CAPS and absolutes very sparingly
Anthropic now warns against “CRITICAL: You MUST...” style language, and OpenAI says the same in different words. Both companies report that emphatic CAPS and absolutist imperatives, which were sometimes useful on older models that would otherwise ignore guardrails, now cause newer models to overtrigger: the model leans into the imperative when a balanced response would have been more accurate. Reserve “must,” “never,” and ALL CAPS for true invariants, such as “Never fabricate a citation,” and use ordinary instruction language for everything else. According to the new guidance, “Use this tool when X” now outperforms “CRITICAL: You MUST use this tool when X.”
Tell the model what to do, not what not to do
A negative instruction (“do not write in bullet points,” “do not include disclaimers”) gives the model no positive target. Both Anthropic and OpenAI now recommend replacing prohibitions with affirmative descriptions of the desired behavior. Instead of “do not write in bullet points,” write “compose your response as flowing prose in complete paragraphs.” Instead of “do not hedge,” write “state your conclusions directly; tie any uncertainty to a specific missing fact or conflicting source.” The exception covers the narrow set of true invariants from the previous section. Prohibitions like “never fabricate a citation” or “never infer beyond the provided documents” work as written, because the prohibition itself is the rule and no affirmative restatement replaces it. Reserve negation for that short list.
The pattern is especially useful for lawyers drafting memos with AI assistance, where the older negation-heavy style (“do not use the passive voice,” “do not start sentences with conjunctions,” “do not use Latin phrases”) tends to produce stilted prose that has technically obeyed every rule and yet reads badly.
Give the rationale for a rule
Instead of stating a bare rule, provide the rationale for it. The LLM can generalize from a rule that comes with a reason. For example, Anthropic suggests that instead of writing “NEVER use ellipses,” you write “your response will be read aloud by a text-to-speech engine, so never use ellipses, because the engine cannot pronounce them.” The reason gives the model something to reason from when it encounters an edge case the rule did not anticipate.
For lawyers, this rewards prompts that explain the audience and the use of the output. “Write for a partner who has not seen the file before and wants to know whether to take the case” steers more reliably than “write a case-evaluation memo.” “Cite only cases decided after June 28, 2024, because that is when the U.S. Supreme Court overruled the Chevron framework” steers more reliably than “only cite cases after June 28, 2024.” This is also consistent with the recent trend toward viewing interactions with LLMs as “context engineering” rather than “prompt engineering.”
Use structure, and pick one structural style
All three best practices guides recommend using XML-style tags or Markdown headings to separate the regions of a prompt. The tags do not need to be standardized across applications; the model parses them by structure, not by name. What does help is using the tags for prompts that combine instructions with context and with examples, so the model can tell which is which.
For example,
<instructions>Draft a bulleted list of the key deadlines described in this document.
</instructions>
<context>I will use this list to prepare for a client meeting.
</context>
Other tags described in the guidance include <documents>, <examples>, <task>, and <output_format>, but you can use any descriptive label you want.
Google adds a small but practical detail: use either XML tags or Markdown headings, but not both within a given prompt. Mixing them within the same prompt degrades adherence. Many of the prompts I have seen lawyers saving and reusing use Markdown; there’s no need to change from that practice. If you don’t currently use either, start using one or the other to structure your prompts.
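If you assemble prompts programmatically, the one-style rule can be enforced with a few lines of Python. The sketch below is hypothetical (the function names are mine), and the Markdown check is deliberately crude: it refuses any section whose body contains a line starting with one to six "#" characters followed by a space, which would also reject a pasted document that happens to contain such lines.

```python
import re


def tag(name, body):
    """Wrap body in an XML-style tag pair; name can be any descriptive label."""
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        raise ValueError(f"unsuitable tag name: {name!r}")
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_tagged_prompt(sections):
    """Join (name, body) pairs, refusing bodies that contain Markdown headings."""
    for name, body in sections:
        if re.search(r"^#{1,6} ", body, flags=re.MULTILINE):
            raise ValueError(f"section {name!r} mixes Markdown headings with XML tags")
    return "\n".join(tag(name, body) for name, body in sections)


prompt = build_tagged_prompt([
    ("instructions", "Draft a bulleted list of the key deadlines described in this document."),
    ("context", "I will use this list to prepare for a client meeting."),
])
print(prompt)
```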
Use fewer, more diverse examples
The 2023-era advice (“give the model many examples of the format you want”) has been refined. Anthropic now recommends three to five examples in a few-shot prompt and warns explicitly against “a laundry list of edge cases.” Google goes further and warns that too many examples cause the model to overfit on surface features of the examples rather than the underlying pattern.
Choose examples that cover meaningfully different cases; three well-chosen examples beat ten redundant ones. And because the model picks up formatting cues regardless of whether they were intended, you should try to ensure that all of the examples use a common structural format.
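One way to keep examples few and uniformly formatted is to render them all through the same template. The Python sketch below assumes a tag layout of my own choosing (`<example>`, `<input>`, `<output>`) and treats the three-to-five range as a default you can change; neither is prescribed in this exact form by the guides.

```python
def format_examples(examples, min_count=3, max_count=5):
    """Render (input, output) pairs in one uniform tagged layout."""
    if not min_count <= len(examples) <= max_count:
        raise ValueError(
            f"expected {min_count} to {max_count} examples, got {len(examples)}"
        )
    inputs = [inp for inp, _ in examples]
    if len(set(inputs)) != len(inputs):
        # Identical inputs cannot cover meaningfully different cases.
        raise ValueError("duplicate example inputs add no new information")
    blocks = [
        f"<example>\n<input>{inp.strip()}</input>\n<output>{out.strip()}</output>\n</example>"
        for inp, out in examples
    ]
    return "<examples>\n" + "\n".join(blocks) + "\n</examples>"


shots = format_examples([
    ("Lease, para. 4: rent is due on the first of each month.",
     "Deadline: monthly, on the 1st. Source: para. 4."),
    ("Order: responses are due 30 days after service.",
     "Deadline: 30 days from service. Source: the order."),
    ("Email: no dates are mentioned.",
     "No deadline stated."),
])
print(shots)
```

The third example is the useful one: it shows the model what to do when the pattern does not apply, which a redundant fourth lease clause would not.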
Force grounding in quotes
For long-document tasks where accuracy is required, Anthropic recommends asking the model to quote the relevant portions of the document first, then answer the question. The pattern looks like this:
Before answering, identify and quote the passages of the attached opinion that bear on the question. Number each quotation. Then answer the question, citing the quotations by number.
The output reads as a worked argument: relevant text, then analysis grounded in that text. The model is much less likely to confabulate when it has just quoted the underlying source, and the reviewer can verify each conclusion against the quotations the model produced.
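The reviewer's side of this pattern can be partly automated. The Python sketch below checks whether each quotation the model produced actually appears in the source text. It assumes you have already pulled the numbered quotations out of the model's answer into a list (how you do that depends on the output format), and the function names are hypothetical. Matching is verbatim after whitespace and quotation marks are normalized, so a paraphrase presented as a quote will be flagged.

```python
import re


def normalize(text):
    """Collapse whitespace and straighten curly quotes so matching is less brittle."""
    for curly, straight in (("\u201c", '"'), ("\u201d", '"'), ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip()


def unverified_quotes(source, quotes):
    """Return the quotations that are empty or do not appear verbatim in the source."""
    haystack = normalize(source)
    return [q for q in quotes if not normalize(q) or normalize(q) not in haystack]


opinion = "The court held that the notice was untimely.  It therefore\naffirmed the dismissal."
quotes = [
    "the notice was untimely. It therefore affirmed",
    "the notice was timely",
]
print(unverified_quotes(opinion, quotes))  # ['the notice was timely']
```

A clean result means only that the quoted words exist in the document. Whether they support the conclusion drawn from them still needs a human read.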
Use a strict grounding template for closed-corpus work
Google publishes a system prompt for cases where the cost of confabulation is high enough to warrant suppressing the model’s parametric knowledge entirely. It is the most aggressive grounding template across the three vendors, and it is well suited to legal research over a defined corpus (a single statute, an administrative record, the documents produced in discovery). The template, slightly trimmed:
You are a strictly grounded assistant limited to the information provided in the User Context. Rely only on facts directly mentioned in that context. You must not access or utilize your own knowledge or common sense to answer. Do not assume or infer beyond the provided facts; report them exactly as they appear. Treat the provided context as the absolute limit of truth; any facts or details not directly mentioned in the context must be considered completely unsupported. If the exact answer is not explicitly written in the context, state that the information is not available.
This does not eliminate hallucination, but it shifts the model’s default from “produce a useful answer” to “produce only what the text supports.”
Cite only retrieved sources, and label inferences
OpenAI’s current citation guidance reads as if it were written for lawyers, and the pattern works on the other vendors too:
- Only cite sources retrieved during the current workflow. Never fabricate citations, URLs, or quoted spans.
- Attach citations to the specific claims they support, not at the end of a paragraph.
- If sources conflict, state the conflict and attribute each side.
- If the context is insufficient, narrow the answer or say the claim cannot be supported.
- If a statement is an inference rather than a directly supported fact, label it as an inference.
- When citable support is missing for a specific fact (a name, date, or figure), use a labeled placeholder rather than invent the specific.
The inference-labeling rule is particularly relevant to legal work because it forces the model to do something LLMs are otherwise poor at: distinguishing what the source says from what the model thinks the source implies. A prompt that ends with “label every inference as an inference” produces higher-quality output than a prompt that does not.
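Rules like these are easier to enforce when the output can be checked mechanically. The Python sketch below assumes conventions that are entirely my own: the prompt asks the model to cite sources with markers such as `[S1]` and to begin inferred statements with `Inference:`. Given those assumptions, it reports citation markers that do not correspond to a retrieved source, and sentences that carry neither a citation nor an inference label. The sentence splitter is naive and will stumble on abbreviations such as "v." or "U.S."

```python
import re

CITATION = re.compile(r"\[(S\d+)\]")


def unknown_citations(answer, retrieved_ids):
    """Return cited source IDs that were not retrieved in this workflow."""
    cited = set(CITATION.findall(answer))
    return sorted(cited - set(retrieved_ids))


def unsupported_sentences(answer):
    """Return sentences with neither a citation marker nor an 'Inference:' label."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [
        s for s in sentences
        if s and not CITATION.search(s) and not s.startswith("Inference:")
    ]


answer = (
    "The lease ended on March 1 [S1]. "
    "Inference: the tenant held over [S2]. "
    "The landlord accepted rent [S4]. "
    "Damages exceed $10,000."
)
print(unknown_citations(answer, {"S1", "S2", "S3"}))  # ['S4']
print(unsupported_sentences(answer))  # ['Damages exceed $10,000.']
```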
Make the model verify before finalizing
OpenAI’s current guidance includes a “verification loop” pattern that lawyers should adopt as a default closing block in any prompt with correctness criteria:
Before finalizing your answer, check that every requirement has been met, that every factual claim is backed by the provided context or by a citation, that the output matches the requested format, and that you have not introduced any unsupported specifics. If any check fails, fix the issue before producing the final answer.
While not foolproof, appending this language at the end of a prompt should catch many of the residual errors and small fabrications that escape a single pass. Pair it with the inference-labeling rule above and you have a workable defense against the most common failure modes in legal output.
Date the prompt
Google now recommends an explicit date clause for time-sensitive tasks. It looks trivial but addresses a common failure mode:
The current date is May 30, 2026. Use this date when reasoning about deadlines, statutes of limitations, recent regulatory developments, or any question whose answer depends on what has happened recently.
Lawyers run into this problem constantly. Models trained before a recent rule amendment will, by default, reason from the older version unless told otherwise. A one-line date clause at the top of the prompt prevents most of these errors. For research over a corpus that has its own internal dates (depositions taken on specific days, contracts with effective dates), name those dates too.
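If your prompts are generated by a script or a template, the date clause can be filled in automatically, so a saved prompt never carries a stale date. A minimal Python sketch, with a hypothetical function name:

```python
from datetime import date


def date_clause(today=None):
    """Return a one-line date clause; defaults to the date on this machine."""
    today = today or date.today()
    # %B is the full month name; the day is taken directly to avoid zero padding.
    stamp = f"{today.strftime('%B')} {today.day}, {today.year}"
    return (
        f"The current date is {stamp}. Use this date when reasoning about "
        "deadlines, statutes of limitations, recent regulatory developments, "
        "or any question whose answer depends on what has happened recently."
    )


print(date_clause(date(2026, 5, 30)))
```

Called with no argument, the function uses the system clock, which is only as reliable as the machine it runs on.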
An audit pass on existing prompts
Pull up some of the prompts you reuse most often and conduct a fresh read of each one. Most of the patterns described in this post are easy to spot when you read the prompt cold. Contradictions are harder to spot. A long contract-review prompt that starts with “summarize the contract in plain English” may later say “use formal legal tone,” “write for a non-lawyer client,” “be concise,” and “err on the side of thoroughness.” Try pasting the prompt back into the model and asking it to flag what it cannot reconcile. OpenAI reports that newer reasoning models “[expend] reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random,” which means a prompt full of latent contradictions costs accuracy on every run. Your audit will probably result in shorter prompts that yield higher-quality outputs.
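The surface-level patterns can also be flagged by a short script before you do the cold read. The Python sketch below is a rough aid under my own assumptions about what to look for: runs of four or more capital letters, the phrase "think step by step," and negative instructions. It will produce false positives (acronyms such as ERISA count as emphatic caps, and true invariants such as "never fabricate a citation" count as negations), and it cannot detect contradictions at all.

```python
import re

# Hypothetical checks; extend or remove entries to suit your own prompts.
CHECKS = {
    "emphatic caps": re.compile(r"\b[A-Z]{4,}\b"),
    "step-by-step cue": re.compile(r"think step by step", re.IGNORECASE),
    "negative instruction": re.compile(r"\b(?:do not|don't|never)\b", re.IGNORECASE),
}


def audit_prompt(prompt):
    """Return (line number, check label, matched text) for each flagged pattern."""
    findings = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, label, match.group(0)))
    return findings


old_prompt = (
    "CRITICAL: You MUST cite every case.\n"
    "Think step by step.\n"
    "Do not use bullet points."
)
for finding in audit_prompt(old_prompt):
    print(finding)
```

Treat each finding as a question, not an error: is this emphasis or negation guarding a true invariant, or is it left over from an older model?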
This post draws on Anthropic’s Claude 4 prompt engineering best practices and Effective context engineering for AI agents; OpenAI’s Prompt guidance and GPT-5 prompting guide; and Google’s Gemini API prompt design strategies and Gemini 3 prompting guide.