
Why Your RAG Has a 70% Gap Between Best and Worst Answers (And How to Fix It)

Research presented at the ACL 2025 Eval4NLP workshop measured gaps of up to 70% between the best and worst runs of the same LLM-based RAG query - with deterministic settings on. The variance is not a tuning bug; it is what query-time generation does. Here is the architectural fix.

Ask your LLM-based RAG system "What is the refund policy?" right now. Write down the answer. Wait ten minutes. Ask the exact same question again.

If you get a different answer, you have a production reliability problem. And statistically, you will get a different answer. Research from the ACL 2025 Eval4NLP workshop measured accuracy variations of up to 15% across identical runs with deterministic settings enabled - and a gap between best-case and worst-case performance of up to 70%.

This is not a theoretical concern. This is the state of LLM-based retrieval systems in production today. The architectural fix is pure retrieval with no generation in the hot path - the approach Ragionex takes with POST https://api.ragionex.com/v1/knowledge/search.

The Consistency Problem Nobody Talks About

Most conversations about AI search focus on relevance: "Did the system find the right document?" That is important. But there is a more fundamental question that gets far less attention: "Will the system find the right document every single time?"

In traditional software engineering, a function that returns different outputs for the same input is called a bug. In AI, it is called a feature - "creative variability," "response diversity," or whatever euphemism makes non-determinism sound intentional.

For creative writing or brainstorming, variability is genuinely useful. For factual retrieval - looking up a refund policy, a medical dosage, a compliance procedure, a configuration parameter - variability is a defect. A dangerous one.

Why LLM-Based Systems Cannot Be Deterministic

The non-determinism in LLM-based RAG systems is not a configuration problem you can fix. It is architectural. Here is why.

Temperature Is Not a Determinism Switch

The most common advice is "set temperature to zero." This does not work.

OpenAI's own documentation states that their API can only be "mostly deterministic" regardless of temperature. They introduced a seed parameter to improve reproducibility, but explicitly warn that determinism is still not guaranteed. Even with temperature=0 and a fixed seed, responses can vary because the seed only affects the sampling layer - it does not control the numerical computation that happens before sampling.
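You can observe this yourself. Here is a minimal sketch using OpenAI's Python client (v1.x) that repeats the same seeded, temperature-zero request and counts distinct outputs - the model name is illustrative, and nothing here is specific to one prompt:

    # Repeat an identical request with temperature=0 and a fixed seed,
    # then count how many distinct outputs come back.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    outputs = set()
    for _ in range(20):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": "What is the refund policy?"}],
            temperature=0,
            seed=42,  # improves reproducibility; does not guarantee it
        )
        # system_fingerprint identifies the backend configuration that served
        # the request; when it changes, even seeded outputs can change with it.
        outputs.add((resp.system_fingerprint, resp.choices[0].message.content))

    print(f"{len(outputs)} distinct (fingerprint, answer) pairs across 20 calls")

Run it enough times, across enough days, and the count climbs above one.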

Anthropic's Claude API produces slightly different outputs across calls even with identical inputs and temperature set to zero. Google's Vertex AI and Gemini behave the same way.

Floating-Point Arithmetic Is Non-Associative

GPUs use floating-point math where (a + b) + c can produce a different result than a + (b + c) due to rounding at the level of units in the last place (ULPs). Modern LLMs run billions of these operations. The order of operations depends on batch size, padding patterns, memory layout, and GPU scheduling - none of which are deterministic across requests.
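The underlying property is easy to demonstrate in isolation. This toy Python example is not the model's actual arithmetic - it just shows that grouping alone changes a floating-point result:

    # Floating-point addition is not associative: the same three numbers,
    # summed in a different grouping, produce different results.
    a, b, c = 1e16, -1e16, 1.0

    print((a + b) + c)  # 1.0  (a and b cancel exactly, then add 1.0)
    print(a + (b + c))  # 0.0  (1.0 is lost: it is below the ulp of 1e16)

Now scale that from three additions to billions, with the grouping decided at runtime by the GPU scheduler, and bitwise-identical outputs stop being something you can promise.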

A 2025 study (Yuan et al.) demonstrated that even when you run inference on your own hardware with open-source libraries, outputs still vary. The non-determinism runs deeper than configuration - it emerges from how GPU hardware executes batched operations.

Infrastructure Drift

Cloud LLM providers run models across clusters of machines. Different hardware generations, different CUDA versions, different batch compositions - all introduce micro-variations. OpenAI's system_fingerprint parameter exists specifically to track when the backend infrastructure changes, which they acknowledge happens "a few times a year." When it changes, even seeded requests produce different outputs.

Model Updates Break Reproducibility

Providers update models continuously. A prompt that produced Answer A in March may produce Answer B in April because the model weights changed. You have no control over this. You often do not even get notified.

Real-World Consequences

The Air Canada Precedent

In February 2024, a British Columbia tribunal found Air Canada liable after its AI chatbot told a customer he could apply for a bereavement fare refund within 90 days of the ticket being issued. The airline's actual policy did not allow retroactive bereavement rates. Air Canada argued that the chatbot was essentially a separate entity responsible for its own accuracy. The tribunal rejected this argument and ordered damages.

The chatbot did not hallucinate randomly. It synthesized information from multiple policy pages and generated a plausible but incorrect answer. A deterministic retrieval system would have returned the actual policy text - the same text, every time, verbatim.

Compliance and Legal Exposure

California's AB 489, effective January 2026, prohibits AI systems from implying they hold healthcare licenses. Similar legislation is advancing in other states, all responding to the same underlying problem: AI systems that produce inconsistent outputs cannot be audited, tested, or reliably certified for compliance purposes.

If your AI search system gives different compliance guidance depending on when the question is asked, you do not have a compliance system. You have a liability generator.

QA and Testing Become Impossible

In traditional software, you write a test: given input X, expect output Y. If the test passes today and fails tomorrow with no code changes, something is broken.

With LLM-based retrieval, this is the default behavior. You cannot write reliable regression tests. You cannot verify that a bug fix actually fixed the bug, because the next run might produce a different output for an unrelated reason. You cannot prove to an auditor that your system will give the same answer it gave during certification.
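For contrast, here is the regression test you would like to write - trivially expressible once retrieval is deterministic, unwritable when it is not. A sketch for pytest, with a placeholder standing in for your own retrieval call:

    # The regression test deterministic retrieval makes possible.
    # `search` is a hypothetical stand-in: wire it to your own system's
    # query function.
    def search(question: str) -> str:
        # Placeholder: replace with a call into your retrieval system.
        return "Refunds are available within 30 days of purchase."

    def test_same_question_same_answer():
        baseline = search("What is the refund policy?")
        for _ in range(100):
            assert search("What is the refund policy?") == baseline

Against an LLM-based retrieval path, this test is flaky by construction. Against a deterministic one, it is an ordinary assertion.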

Your support team cannot tell a customer "the system will tell you X" because the system might tell them Y instead.

The Case for Deterministic AI Search

Here is a straightforward principle: in factual retrieval, the same question should return the same answer. Every time. No exceptions.

This is not a radical idea. It is how every reliable information system in history has worked. Query a database with SELECT * FROM policies WHERE id = 42 and you get the same row every time. Look up a word in a dictionary and you get the same definition. Search a documentation site for "installation guide" and you get the same page.

The moment you insert an LLM into the retrieval path - to rewrite queries, summarize results, or generate answers - you lose this guarantee. The system becomes a function with hidden random state, and no amount of prompt engineering, temperature tuning, or seed parameters can fully eliminate the randomness.

Deterministic search means: same input, same output. Testable. Auditable. Predictable.

When Determinism Is Non-Negotiable

Some domains cannot tolerate answer variability:

  • Healthcare: A system that returns different drug interaction information depending on server load is a patient safety risk.
  • Legal and compliance: Regulatory guidance must be consistent. If two employees ask the same compliance question and get different answers, you have a policy enforcement gap.
  • Financial services: Product terms, fee structures, and regulatory disclosures must be identical every time they are retrieved.
  • Customer support: Customers who get conflicting answers from the same system lose trust. Support agents who cannot predict what the system will say cannot effectively assist customers.
  • Technical documentation: Developers need the same API reference, the same configuration steps, the same troubleshooting procedures regardless of when they ask.

When Non-Determinism Is Acceptable

To be clear, not all AI applications need determinism:

  • Creative writing and brainstorming: Variability is the point.
  • Conversational AI: Natural dialogue benefits from varied phrasing.
  • Exploratory analysis: Different perspectives on the same data can surface insights.

The distinction is simple: are you retrieving facts or generating content? Facts demand determinism. Content can tolerate creativity.

How to Achieve Deterministic AI Search

The solution is architectural, not a matter of configuration. You do not make an LLM-based system deterministic by tuning parameters. You make a search system deterministic by removing the non-deterministic component from the retrieval path.

This means: no LLM at query time.

Pre-process your knowledge base. Structure it. Index it. When a question comes in, perform a fast lookup against the pre-computed index. Return the matched content directly. No generation. No summarization. No rewriting.
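In outline, the shape of such a system looks like this. The sketch below uses a deliberately toy keyword index - a stand-in for a production-grade one, shown only to make the offline/query-time split concrete:

    # Toy deterministic retrieval: index offline, pure lookup at query time.
    # A stand-in for a production index; no text is ever generated.
    import re
    from collections import defaultdict

    def tokenize(text: str) -> list[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    # --- Offline: pre-process, structure, index ---
    documents = {
        "refund-policy": "Our refund policy: customers may request a refund within 30 days.",
        "shipping": "Orders ship within 2 business days.",
    }
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)

    # --- Query time: fast lookup against the pre-computed index ---
    def search(query: str) -> str:
        scores = defaultdict(int)
        for token in tokenize(query):
            for doc_id in index.get(token, ()):
                scores[doc_id] += 1
        if not scores:
            return ""
        # Tie-break on doc_id so ranking never flips between runs.
        best = min(scores, key=lambda d: (-scores[d], d))
        return documents[best]  # matched content, returned verbatim

    print(search("What is the refund policy?"))

Note the explicit tie-break: determinism has to hold all the way down, including the order in which equally scored results are ranked.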

The result is a system where:

  • The same question always returns the same answer
  • Response time is measured in milliseconds, not seconds
  • There is zero hallucination risk because no text is generated
  • Results are fully testable with standard software QA practices
  • Every response can be traced back to a specific source document
  • Behavior does not change when a model provider pushes an update

This is the approach behind Ragionex - a Context Engine that provides deterministic AI search with sub-200ms response times. No LLM runs at query time. The same question returns the same answer, every time. The system retrieves pre-processed, verified content rather than generating new text on the fly.

You can test this yourself with a free API key at ragionex.com. Send the same question a hundred times - you will get the same answer a hundred times.
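A sketch of that experiment in Python, using the endpoint mentioned above - the request field names here are assumptions, so check the API docs for the exact schema:

    # Send the same question 100 times and count distinct responses.
    import requests

    API_URL = "https://api.ragionex.com/v1/knowledge/search"
    HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # placeholder key

    answers = set()
    for _ in range(100):
        resp = requests.post(
            API_URL,
            headers=HEADERS,
            json={"query": "What is the refund policy?"},  # assumed field name
            timeout=10,
        )
        resp.raise_for_status()
        answers.add(resp.text)

    print(f"distinct answers across 100 runs: {len(answers)}")  # expect: 1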

The Reliability Engineering Perspective

In systems where "mostly works" is not acceptable - payment processing, health monitoring, security infrastructure - determinism is not optional. It is a requirement.

The AI industry has normalized a level of unpredictability that would be unacceptable in any other domain of software engineering. We would never ship a database that returns different rows for the same query. We would never deploy an authentication system that sometimes accepts invalid credentials. Yet we routinely deploy AI search systems that give different answers to the same question and call it state of the art.

Deterministic AI search is not about rejecting AI. It is about applying the same engineering standards to AI systems that we apply to every other production system. Same input, same output. Every time.

The standard is not new. We have always required this from systems that matter.

Related reading: Why We Don't Call an LLM at Query Time walks through the architectural choice in more detail. For the agent-side analogue - deterministic semantic recall instead of an LLM summarizing past conversations - see Persistent Memory Without the Vector Database.

Ready to try it?

Free API key. No credit card. Up and running in seconds.

Get Started