The Retrieval Problem That Makes RAG Hallucinate at Scale

May 18, 2026

It’s not the model. It’s what you fed it.

The system worked fine in staging. Internal testers loved it. Then real traffic arrived, the knowledge base tripled in size, and the answers started sounding right while being wrong.

Not obviously wrong. Not the kind of wrong that triggers an alert. The kind where a user asks a question, gets a confident three-paragraph response with the right tone and plausible structure, and only discovers the error two days later when something breaks downstream. That kind of wrong is expensive.

Most teams debug in the wrong direction. They audit the prompt, then swap the model, then add instructions like “only answer from provided context” to a system prompt that was already doing exactly that. The hallucinations persist. They persist because the problem is not in the generation layer. It is three steps earlier, in the retrieval pipeline that nobody is looking at.

The Part of RAG That Most Engineers Treat as Plumbing

Retrieval-Augmented Generation has a deceptively clean architecture on a whiteboard: a user submits a query, the system retrieves relevant documents, the model generates an answer grounded in those documents. The model’s job is to synthesize. The retrieval layer’s job is to pick the right raw material.

In practice, the retrieval layer is where most things go wrong.

Engineers consistently report that the majority of unexplained hallucinations in production RAG trace back not to the language model but to what was retrieved. Specifically, to what was retrieved that looked relevant but was not, or to relevant content that was never surfaced at all.

The LangChain State of AI Agents report put quality as the top barrier to production deployment, cited by 32% of respondents. Unreliable performance was the biggest obstacle overall at 41%. Those numbers describe what teams experience, not why it happens.

Why it happens is a retrieval problem. And retrieval fails in specific, diagnosable ways.

What Vector Search Actually Does (and What It Cannot)

Dense vector search works by representing every document chunk as a point in high-dimensional space, then finding the nearest neighbors to a query’s embedding. The intuition holds nicely at small scale. Documents that mean similar things cluster together. Queries find neighbors that share semantic weight with the question.
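
As a rough illustration of the nearest-neighbor step described above, here is a minimal sketch in plain Python. The three-dimensional vectors and chunk ids are invented for the example; a real system would use an embedding model and a vector store rather than a dictionary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs nearest to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" (assumed values); real ones have hundreds of dimensions.
index = {
    "redis-rate-limit": [0.9, 0.1, 0.0],
    "redis-persistence": [0.7, 0.6, 0.1],
    "holiday-policy": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
print(results)
```

Note that the second result is returned simply because it is the next closest vector, not because it answers the query. Top-k retrieval always returns k items, whether or not k relevant items exist.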

The intuition breaks at larger scale. This is not speculation.

An ICLR 2025 paper on retrieval-augmented generation found that increasing the number of retrieved passages initially improves model performance, then leads to a decline. More context, worse answers. The degradation is more pronounced with higher-quality retrievers than with weaker ones, because a stronger retriever surfaces more passages that look relevant without containing the answer. In other words, a better embedding model finds the problem faster.

The geometric reason matters here. When a vector space is sparsely populated, distance is meaningful. Similar documents sit close; irrelevant ones sit far. As the corpus grows and more vectors fill the space, the neighborhood around any given query gets crowded, and the gap between the true match and its nearest distractors shrinks. The retriever cannot reliably distinguish what matters from what merely sounds like it might.

This is what makes scale a structural problem, not a tuning problem. You cannot fix it by adjusting a similarity threshold. The space itself has changed.
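
One way to get a feel for the crowding effect is a toy experiment: hold a query and a similarity threshold fixed, then count how many vectors clear that threshold as the corpus grows. The sketch below uses uniformly random unit vectors, which is an assumption made purely for illustration. Real embeddings are far from uniform, so the absolute numbers mean nothing; only the direction of the trend is the point.

```python
import math
import random

def random_unit_vector(rng, dim):
    """Sample a random direction by normalizing Gaussian coordinates."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def count_above(query, corpus, threshold):
    """Count corpus vectors whose cosine similarity to the query clears the threshold."""
    # All vectors are unit length, so the dot product equals the cosine similarity.
    return sum(1 for vec in corpus if sum(q * v for q, v in zip(query, vec)) >= threshold)

rng = random.Random(0)
DIM = 16  # illustrative; kept small so the script runs quickly
query = random_unit_vector(rng, DIM)
corpus = [random_unit_vector(rng, DIM) for _ in range(20000)]

# Each larger corpus contains the smaller one, mimicking a growing knowledge base.
crowding = {n: count_above(query, corpus[:n], threshold=0.5) for n in (200, 2000, 20000)}
print(crowding)
```

Because each larger corpus is a superset of the smaller one, the count of candidates above a fixed threshold can only stay flat or grow. A threshold that isolated a handful of chunks in staging may admit far more of them in production.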

The Ingestion Layer You Probably Skipped

Before any of this retrieval geometry applies, documents pass through an ingestion pipeline. Raw text comes in from PDFs, Confluence pages, Slack exports, API responses, Markdown files. Something parses them, chunks them, embeds them, and writes them to a vector store.

That pipeline is where most teams spend the least engineering time.

Parsing is messier than it looks. Tables in PDFs lose their structure during extraction. Nested headers in Markdown collapse into flat text. OCR artifacts introduce invisible noise that survives into the embedding layer, polluting vector representations with characters that should not be there.
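
A cheap guard against this kind of debris is to measure it before embedding. The following is a minimal sketch, assuming that control characters, invisible format characters, and the Unicode replacement character are reasonable proxies for extraction noise. The 1% cutoff is an arbitrary placeholder, not a recommended value.

```python
import unicodedata

def artifact_ratio(text):
    """Fraction of characters that are likely extraction debris."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        category = unicodedata.category(ch)
        # Cc = control, Cf = invisible format chars (e.g. zero-width space),
        # Co = private use, Cn = unassigned; U+FFFD marks a failed decode.
        if ch == "\ufffd" or (category in {"Cc", "Cf", "Co", "Cn"} and ch not in "\n\t"):
            suspicious += 1
    return suspicious / len(text)

def flag_chunks(chunks, max_ratio=0.01):
    """Return indexes of chunks whose debris ratio exceeds max_ratio."""
    return [i for i, chunk in enumerate(chunks) if artifact_ratio(chunk) > max_ratio]

chunks = [
    "Rate limits reset every 60 seconds.",
    "Rate\u200b limits\ufffd\ufffd reset\x0c every 60 seconds.",
]
flagged = flag_chunks(chunks)
print(flagged)
```

Both chunks would look nearly identical in a terminal, which is the reason to count rather than eyeball. This check does not catch structural damage such as flattened tables; that needs a separate comparison against the source layout.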

Chunking is the decision that shapes everything downstream. Small chunks, around 128 to 256 tokens, give precise retrieval but strip context. A chunk that contains the answer to a question might not contain enough surrounding text for the model to use that answer correctly. Larger chunks, 512 to 1024 tokens, preserve context but introduce noise. A chunk retrieved because it mentions the right topic might contain three other topics that dilute the model’s attention.

The COLING 2025 study on RAG best practices treated chunk size as one of the configuration choices that shape response quality, alongside knowledge base size and retrieval stride. Teams often set a chunk size once during prototyping, then forget it exists.

The most common production failure mode is not a dramatic crash. It is a slow drift where retrieval quality degrades and nobody notices until users stop trusting the system.

What the Stanford Legal AI Study Actually Shows

A Stanford study published in the Journal of Empirical Legal Studies in 2025 evaluated three commercial RAG-based legal research tools: Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI.

The researchers found that Lexis+ AI, the best-performing system, still hallucinated in 17% of cases. Westlaw hallucinated in 33% of cases. For context, GPT-4 without RAG hallucinated on legal queries 43% of the time.

This is often cited as evidence that RAG works. It does reduce hallucinations compared to ungrounded generation. But the conclusion that matters is different.

Lexis+ AI answered accurately in 65% of cases. These are purpose-built systems, well-funded, with domain-specific corpora, trained legal teams maintaining data quality, and enterprise contracts that depend on reliability. The hallucination rate at the best implementation is still one in six.

One category they identified was particularly dangerous: the model cited a real source, but the source did not actually support the claim being made. The citation exists, the case is real, and the AI’s characterization of what the case said was simply wrong. That failure starts in retrieval. The retriever surfaced a document that scored well on cosine similarity, which is its only job, and the model, which never sees that score, treated the document’s presence in its context as confirmation that the source supported the claim. Relevance and accuracy are not the same thing. They just usually travel together closely enough that the distinction does not matter until it does.

This is what misgrounded retrieval looks like in production. The model gets context. The context is real. The context is just not the right context.

Two Retrieval Failures Worth Understanding

The first is retrieval noise. The retriever returns documents that match the query’s surface pattern without matching its intent. Dense embeddings capture semantic similarity, but similarity is not the same as relevance. A query about how to handle rate limiting in Redis might surface chunks that discuss Redis in other contexts, queue backpressure theory, and a section from a rate limiting article that mentions Redis only once. All of those chunks are semantically close to the query. None of them contain the answer.

This is why teams that add more documents often see answer quality drop before it improves. More documents means more noise candidates. The retriever’s job gets harder, not easier.

The second is retrieval absence. The answer exists in the knowledge base but the retriever misses it. This happens when the user’s query uses different terminology than the document. Dense embeddings handle synonyms reasonably well, but struggle with domain-specific abbreviations, product-specific naming conventions, and cross-lingual overlap. Keyword search would have found the match. Vector search missed it.
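
To make the keyword side concrete, here is a compact sketch of Okapi BM25 scoring over whitespace tokens. The documents and the product code "TBL-7" are hypothetical, and the parameters k1=1.5 and b=0.75 are common defaults rather than tuned values. A production system would use a search engine's implementation instead of this loop.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

docs = [
    "configure the token bucket limiter for the gateway",
    "set TBL-7 burst capacity before enabling the gateway",
    "gateway throttling overview and concepts",
]
scores = bm25_scores("TBL-7 burst", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The exact token "tbl-7" appears in only one document, so that document is the only one with a nonzero score. An embedding model that has never seen the abbreviation has no such anchor to rely on.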

The gap between these two failure modes is where hybrid retrieval lives.

Fixing the Architecture, Not the Prompt

On chunking:

The naive approach is fixed-size chunking. Set a token limit, split every document at that boundary, embed each piece. It is fast to implement and wrong at the boundaries, where concepts get cut mid-sentence.
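
The boundary problem is easy to reproduce. This sketch splits on whitespace-delimited words as a stand-in for model tokens, and uses a deliberately tiny chunk size so the cut is visible in a short example.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive chunks of `size` whitespace-delimited tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

text = (
    "Retries are capped at five attempts. "
    "After the fifth failure the job moves to the dead letter queue."
)
chunks = fixed_size_chunks(text, size=8)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends with "After the" and the second begins with "fifth failure". Neither chunk on its own states what happens after the fifth failure, even though the source document does.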

A better approach is hierarchical chunking. Store two representations of every document: a large parent chunk of 512 to 1024 tokens for context, and smaller child chunks of 128 to 256 tokens for precision. Run retrieval on the child chunks, where signal is strong. Inject the parent chunk into the model’s context, where the full surrounding information lives. The model gets precision in finding the right section and context for generating an accurate answer.

This pattern requires more storage and slightly more complex retrieval logic, but the quality difference in production is measurable. Splitting a concept across chunk boundaries is one of the most common causes of misgrounded retrieval.
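
The parent and child pattern can be sketched in a few lines. In this hedged example, the chunk sizes are shrunk to 12 and 4 words so the output is readable, and shared-word counting stands in for embedding similarity. Both are assumptions for illustration, and the function names are hypothetical.

```python
def split_tokens(text, size):
    """Split text into consecutive chunks of `size` whitespace-delimited tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def build_hierarchy(document, parent_size=12, child_size=4):
    """Return (parents, children); each child records the id of its parent."""
    parents = split_tokens(document, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for piece in split_tokens(parent, child_size):
            children.append({"text": piece, "parent_id": parent_id})
    return parents, children

def retrieve_parent(query, parents, children):
    """Match on child chunks, then hand back the enclosing parent chunk."""
    query_terms = set(query.lower().split())
    # Placeholder scorer: shared-word count stands in for embedding similarity.
    best = max(children, key=lambda c: len(query_terms & set(c["text"].lower().split())))
    return parents[best["parent_id"]]

document = (
    "Refunds are issued within ten days of approval by the billing team. "
    "Password resets expire after one hour and must be requested again afterwards."
)
parents, children = build_hierarchy(document)
context = retrieve_parent("password resets expire", parents, children)
print(context)
```

The match happens on a four-word child, but the model receives the full parent sentence, including the part about requesting a reset again. The extra storage cost is the `parent_id` pointer plus one copy of each parent.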

On retrieval strategy:

Pure dense vector search fails at exact matches. Pure keyword search fails at semantic variation. Hybrid retrieval combines both.

Dense search (embedding-based):

  • Catches semantic similarity and paraphrasing well
  • Misses exact terminology, product codes and abbreviations where the words matter more than the meaning
  • Gets progressively worse as corpus size grows

Keyword search (BM25):

  • Reliable on exact matches and domain-specific terminology
  • Blind to synonyms and paraphrasing
  • Scale does not hurt it the way it hurts dense search

Hybrid (both combined):

  • Typically outperforms either approach alone in practice, across domains
  • More complex to build and maintain
  • Worth doing if answer quality is a product requirement, not just a nice-to-have
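
The article does not prescribe how to merge the two result lists. One widely used option is reciprocal rank fusion, sketched below as an assumption rather than as the author's method. It needs only the rank of each document in each list, which sidesteps the fact that cosine scores and BM25 scores are on different scales. The document ids are hypothetical, and k=60 is the constant conventionally used with this method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of ids into one ranking.

    Each id earns 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical embedding results
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still stays in the candidate set. That second property is what covers the retrieval absence case.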

A re-ranking layer on top of hybrid retrieval adds another quality gate. The retriever returns a broader candidate set of 20 or 40 documents. A cross-encoder model, much smaller and cheaper than the generation model, scores each candidate against the query and reorders them. Only the top 4 or 5 go to the language model. This step removes a significant share of noise before generation begins.
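
The shape of that step is simple, whatever model does the scoring. In this sketch, `score_fn` is where a cross-encoder would plug in. The `overlap_score` function is a deliberately crude placeholder so the example runs without a model, and should not be read as a substitute for one.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Rescore every candidate against the query and keep the best few."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

def overlap_score(query, text):
    """Placeholder scorer: fraction of query words present in the text."""
    query_terms = set(query.lower().split())
    return len(query_terms & set(text.lower().split())) / len(query_terms)

candidates = [
    "redis persistence options explained",
    "rate limiting in redis with a sliding window",
    "queue backpressure theory",
    "redis rate limiting using token buckets",
]
top = rerank("redis rate limiting", candidates, overlap_score, keep=2)
print(top)
```

The key design choice is that the scorer sees the query and the candidate together, one pair at a time. That is more expensive per document than comparing precomputed embeddings, which is why it runs on 20 or 40 candidates rather than the whole corpus.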

On corpus architecture:

If your vector space is growing past tens of thousands of documents, the question is not whether to partition it. The question is how.

Partitioning by domain, time range, product line, or content type keeps each sub-space bounded. Within a bounded partition, semantic distance remains meaningful. The retriever runs within the appropriate partition rather than searching everything. Query routing, deciding which partition to search, can be handled by a lightweight classifier or by explicit metadata tagging during ingestion.
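
A router does not have to be sophisticated to be useful. The sketch below picks a partition by keyword overlap and falls back to a default when nothing matches. The partition names and trigger words are hypothetical; in practice they would come from ingestion metadata or a trained classifier.

```python
def route_query(query, partition_keywords, default="general"):
    """Pick the partition whose keyword set overlaps the query most."""
    words = set(query.lower().split())
    best_name, best_hits = default, 0
    for name, keywords in partition_keywords.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name

# Hypothetical partitions and trigger words, for illustration only.
PARTITION_KEYWORDS = {
    "billing": {"invoice", "refund", "payment"},
    "infrastructure": {"redis", "deploy", "latency"},
}
partition = route_query("why is redis latency spiking", PARTITION_KEYWORDS)
fallback = route_query("what is the holiday schedule", PARTITION_KEYWORDS)
print(partition, fallback)
```

The fallback matters as much as the routing. A query sent to the wrong partition produces retrieval absence by construction, so an uncertain router should search more broadly rather than guess.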

This is more operational complexity than a single vector store. The alternative is a single vector store where retrieval quality degrades as the corpus grows, which is not a stable foundation.

What to Audit Before Tuning the Model

If your RAG system is hallucinating, work through this list before touching the model or the prompt:

  1. Chunk boundaries: Pull 20 examples of retrieved chunks. Do they contain complete thoughts? Do answer-critical sentences appear at the start or end of a chunk, where they risk getting cut off?
  2. Retrieval noise: Log the top-5 retrieved chunks for 50 real user queries. How often does the relevant document appear in position 1 versus position 4 or 5? High noise in top results means the retriever is confused, not the model.
  3. Retrieval absence: Pick 10 questions whose answers exist in your knowledge base. Does the retriever surface the right document? If not, is it a keyword match the dense retriever missed?
  4. Ingestion artifacts: Sample 50 raw chunks from your vector store. What does the text actually look like? Encoding issues, broken headers, and OCR artifacts are invisible until you look directly at them.
  5. Corpus distribution: How many documents are in each semantic neighborhood? High-density regions cause retrieval to behave unpredictably. Sparse regions mean good documents might never surface.

If any of these audits reveals a problem, that problem will survive model upgrades, prompt rewrites, and temperature changes. Fix the data layer first.
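
Audits 2 and 3 reduce to two numbers once you have a small labeled set of queries paired with the chunk that should have been retrieved. The sketch below computes hit rate at k and mean reciprocal rank. The field names `expected` and `retrieved` and the chunk ids are assumptions about how such a log might be structured.

```python
def hit_rate_at_k(logged, k=5):
    """Share of queries whose expected chunk appears in the top k results."""
    hits = sum(1 for row in logged if row["expected"] in row["retrieved"][:k])
    return hits / len(logged)

def mean_reciprocal_rank(logged):
    """Average of 1/position of the expected chunk (0 when it is missing)."""
    total = 0.0
    for row in logged:
        if row["expected"] in row["retrieved"]:
            total += 1.0 / (row["retrieved"].index(row["expected"]) + 1)
    return total / len(logged)

logged = [
    {"expected": "c1", "retrieved": ["c1", "c7", "c9"]},
    {"expected": "c2", "retrieved": ["c8", "c2", "c3"]},
    {"expected": "c4", "retrieved": ["c5", "c6", "c9"]},
    {"expected": "c3", "retrieved": ["c9", "c8", "c7", "c3"]},
]
print(hit_rate_at_k(logged, k=3), mean_reciprocal_rank(logged))
```

A low hit rate points to retrieval absence. A decent hit rate with a low reciprocal rank points to noise: the right chunk is being found but buried under distractors.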

The Trust Gap That Retrieval Is Responsible For

Pragmatic Engineer’s 2026 AI tooling survey found that 95% of respondents use AI tools at least weekly. Only 29% say they trust the output.

That trust gap is partly cultural inertia. It is also partly earned, by systems that confidently produce wrong answers in ways that are hard to catch before they matter.

RAG is the architecture most teams use to solve the hallucination problem. It works, and the Stanford legal AI numbers confirm it does. But “works better than raw GPT-4” is not the same as “works well enough to trust,” and getting there requires treating the retrieval layer with the same engineering discipline as the model itself.

The ingestion pipeline needs data quality checks, not just a parser call. The chunking strategy needs domain-specific calibration, not a default from a framework tutorial. The retrieval layer needs observability: logging which documents were retrieved, how often the right document appeared, and what the distribution of similarity scores looks like.
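
For the observability piece, a minimal starting point is one structured log line per retrieval call. The field names below are hypothetical, and the sketch assumes `results` is a non-empty, best-first list of (chunk id, similarity) pairs.

```python
import json
import statistics

def retrieval_log_record(query, results):
    """Build one JSON line describing a retrieval call.

    `results` is a non-empty, best-first list of (chunk_id, similarity) pairs.
    """
    scores = [score for _, score in results]
    return json.dumps({
        "query": query,
        "chunk_ids": [chunk_id for chunk_id, _ in results],
        "top_score": scores[0],
        # A small gap means the retriever barely separated its top hit from the rest.
        "top_gap": round(scores[0] - scores[1], 4) if len(scores) > 1 else None,
        "score_stdev": round(statistics.pstdev(scores), 4),
    })

line = retrieval_log_record("reset api key", [("c12", 0.82), ("c40", 0.81), ("c07", 0.64)])
print(line)
```

Tracking the gap between the first and second score over time gives an early signal of crowding. If it trends toward zero as the corpus grows, the retriever is losing its ability to separate candidates well before users start complaining.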

Most teams skip this work. They build a demo that impresses in a meeting, ship it, and start getting reports that the AI is “making things up.” The AI is not making things up. It is faithfully synthesizing the wrong documents, the ones the retriever surfaced because they were close enough in vector space, which is not at all the same thing as correct. That is a fixable problem. It just requires fixing in the right place.

In the end, the model gets blamed. The retrieval layer caused it.

Originally published on Medium.
By Hafiq Iqmal