OpenAI’s recent paper on why language models hallucinate highlights how current evaluation methods create the wrong incentives, rewarding models for guessing rather than admitting uncertainty.
But this framing still focuses mainly on intrinsic hallucinations. These are cases where the model generates unsupported information from its training data alone.
In practice, most hallucinations we encounter today are extrinsic. They happen when a model misuses or misinterprets context provided at runtime: retrieved documents, repository code, tool calls, and other external inputs. That is where the real work of building reliable AI systems lies.
Hidden in plain sight: reasoning over context
Virtually no AI agent or workflow today uses a model in isolation. Almost every real application embeds the model in a pipeline that depends on external data from retrieval, tool use, or other systems.
That means the real test is not whether the model “knows” a fact, but whether it can reason correctly over the context it is given.
And that context is often:
- Missing: the retrieval system never surfaced the right document
- Conflicting: different documents say different things
- Noisy: too much irrelevant text overwhelms the model
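To make this concrete, here is a minimal sketch of how a pipeline might flag these three failure modes before the model is asked to answer. The `llm_judge` helper and the character threshold are hypothetical placeholders, not any specific library’s API.

```python
from dataclasses import dataclass


@dataclass
class ContextReport:
    missing: bool      # retrieval surfaced nothing relevant to the question
    conflicting: bool  # retrieved documents disagree with each other
    noisy: bool        # far more text than the question plausibly needs


def assess_context(question: str, docs: list[str], llm_judge,
                   max_chars: int = 8000) -> ContextReport:
    """Cheap pre-generation checks on retrieved context.

    `llm_judge` is an assumed callable: prompt in, True/False out.
    """
    missing = not docs or not any(
        llm_judge(f"Is this passage relevant to: {question}?\n\n{d}") for d in docs
    )
    # Pairwise contradiction check, again delegated to the hypothetical judge.
    conflicting = any(
        llm_judge(f"Do these passages contradict each other?\n\nA: {a}\n\nB: {b}")
        for i, a in enumerate(docs) for b in docs[i + 1:]
    )
    noisy = sum(len(d) for d in docs) > max_chars
    return ContextReport(missing, conflicting, noisy)
```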
Training regimes and benchmarks are still biased towards general knowledge and intrinsic correctness. Yet the skill that matters most in production is almost always the same: can the model reliably handle unreliable context?
This is a significantly harder skill, because it requires robustness to real-world messiness: context can be partial, ambiguous, or even contradictory, and the model still has to make sense of it without overstepping what the evidence supports.
When context must override training
Models are caught in a tug-of-war between their internal training and the external data in front of them. How they resolve that tension (when to defer to context, when to fall back on priors, and when to admit insufficiency) is the art of applied AI.
For example, imagine a financial analysis system that retrieves a company’s latest quarterly earnings report. If the model leans on its static training data about older financials instead, it will produce outdated or misleading results.
The model needs to decide when to treat the retrieved documents as authoritative, even if they contradict what it “remembers” from training. More often than not, grounding in context is essential.
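One common way to push a model toward the retrieved documents is to make that priority explicit in the prompt and to give it permission to abstain. Below is a minimal sketch, assuming an OpenAI-style chat client; the prompt wording and model name are illustrative, not a recommended template.

```python
GROUNDED_SYSTEM_PROMPT = """\
You are a financial analysis assistant.
- Treat the retrieved documents as the authoritative source, even when they
  contradict what you remember from training.
- If the documents do not contain the information needed, say so explicitly
  instead of answering from memory.
"""


def answer_from_context(client, question: str, docs: list[str]) -> str:
    # `client` is assumed to be an OpenAI-style chat client; the model name
    # below is a placeholder.
    context = "\n\n---\n\n".join(docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```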
It’s all about context engineering
Recent research documents the kinds of problems we see in production when models must reason over context, especially when that context is long, messy, or conflicting.
- Sufficient Context: A New Lens on Retrieval Augmented Generation Systems (Joren et al., ICLR 2025) shows why so many hallucinations aren’t really about generation at all. Context can be relevant yet still incomplete, which sets the model up to fail. Worse, adding insufficient context often makes models more confident, so hallucinations go up, not down. What looks like “the model making things up” is often just a context gap (a sketch of gating generation on sufficiency follows this list).
- Context Rot: How Increasing Input Tokens Impacts LLM Performance (Hong et al., 2025) shows that models often break down as context length grows. Even simple tasks fail once the relevant detail is buried in long input, which means retrieval doesn’t guarantee usability. The problem isn’t access to information, it’s reasoning over noisy, stretched, or poorly structured context.
- A Survey of Context Engineering for Large Language Models (Mei et al., 2025) underscores many of the issues behind production hallucinations. It shows that problems like being overly sensitive to noisy or long context, failing to structure or filter retrieved information properly, and lacking memory or compression mechanisms all contribute to failure. The survey argues that effective systems optimize not just what context is retrieved, but how it’s processed, compressed, ordered, and managed.
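One way to operationalize the sufficiency idea from the first paper is to gate generation on a quick judgment of whether the retrieved context can actually answer the question, and abstain otherwise. This is a rough sketch, not the paper’s method: `llm` is an assumed callable that takes a prompt string and returns text, and the prompt wording is illustrative.

```python
def answer_or_abstain(llm, question: str, docs: list[str]) -> str:
    """Only answer if the context is judged sufficient; otherwise abstain."""
    context = "\n\n".join(docs)

    # Step 1: ask whether the context is sufficient for this question.
    verdict = llm(
        "Does the context below contain enough information to answer the question? "
        "Reply YES or NO only.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
    if verdict.strip().upper().startswith("NO"):
        return "The provided documents do not contain enough information to answer this."

    # Step 2: answer strictly from the context, citing the supporting passage.
    return llm(
        "Answer the question using only the context below, and cite the passage "
        "you relied on.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )
```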
This research makes the same point: hallucinations in production are less about models fabricating facts, and more about how they reason (or fail to reason) over imperfect context.
An open problem
Ultimately, most production hallucinations come from the context provided to the model, how that context is engineered, and how the model reasons over it. What we need are benchmarks that don’t just stress-test long inputs or “needle-in-a-haystack” retrievals, but that capture the true messiness of real-world usage: missing documents, conflicting snippets, noisy retrievals, and imperfect prompting.
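As a sketch of what such a benchmark could look like, each clean test case might be perturbed into the messier variants described above. The schema and field names below are hypothetical, not an existing benchmark’s format.

```python
import random
from dataclasses import dataclass, replace


@dataclass
class RagCase:
    question: str
    answer: str
    gold_docs: list[str]     # documents that actually support the answer
    distractors: list[str]   # plausible but irrelevant documents
    conflicting: list[str]   # documents that contradict the gold answer


def perturb(case: RagCase, mode: str, rng: random.Random) -> RagCase:
    """Return a variant of the case with one real-world failure mode injected."""
    if mode == "missing":
        # Retrieval never surfaced the right document.
        return replace(case, gold_docs=[])
    if mode == "conflicting":
        # A contradictory snippet sits next to the gold one.
        extra = rng.sample(case.conflicting, min(1, len(case.conflicting)))
        return replace(case, gold_docs=case.gold_docs + extra)
    if mode == "noisy":
        # The relevant detail is buried under irrelevant text.
        padding = rng.sample(case.distractors, min(5, len(case.distractors)))
        return replace(case, gold_docs=padding + case.gold_docs)
    return case
```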
And just as important, this needs to become the consistent measuring stick for language models: not just accuracy on static benchmarks, but robustness in the face of messy, imperfect context.
Hallucinations in language models remain an open problem.