
How to Cut Your AI Agent's Token Cost by 90% Without Touching Your Prompts

The 90% comes from Mem0's published research. The mechanism is the same one any persistent-memory layer uses. Here is the math, the architecture, and the working integration.

The 90% number is not ours. It is conservatively rounded down from the figures Mem0 published in their Token-Efficient Memory Algorithm writeup and the accompanying paper at arXiv:2504.19413. Across the LongMemEval and LoCoMo benchmarks they report a 91% latency reduction and a token cost reduction of 3-4x compared to full-history baselines. The "90%" in the title of this post is the headline you can defend in a stand-up. The "3-4x cost cut" is the version you can defend in a procurement meeting.

Both numbers describe the same architectural change: stop sending the entire prior conversation back to the model on every turn. Replace it with a small, targeted retrieval call against a persistent memory store. The model still has to reason. Your prompts stay exactly where they are. What changes is the size of the messages array you pay for.

This post is the math, the architecture, the before-and-after code, and the trade-off. It applies to any persistent-memory layer that exposes a search endpoint. We use Ragionex Memory in the examples - POST /v1/memory/write and POST /v1/memory/search - because that is what we ship; the mechanism is the same with Mem0, Letta, or Zep.

Where Token Cost Hides in Agent Workflows

Most teams treat their token bill as a function of the model and the prompt. It is mostly neither. In a long-running agent session the system prompt is fixed, the tool descriptions are fixed, and the per-turn user message is small. The line item that grows without bound is the conversation history that gets re-sent on every turn.

Consider a coding agent that has been working with a developer for three hours. The user types a one-line follow-up: "Now apply the same pattern to the auth module." The model needs context to act on that. The naive way to give it context is to include every previous turn in the request. After three hours of back-and-forth that can be 150-200 turns. At 500 input tokens per turn on average, that is 75,000 to 100,000 tokens of history sent on every single subsequent turn. Anthropic's own field guide on Effective Context Engineering for AI Agents walks through the same arithmetic and lands on the same conclusion: history-padding dominates the bill.
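
As a back-of-envelope check, that arithmetic fits in a few lines of Python. The turn count and per-turn size below are the illustrative figures from this example, not measurements:

turns = 200                  # turns accumulated over the three-hour session
avg_tokens_per_turn = 500    # average input tokens per prior turn

history_tokens = turns * avg_tokens_per_turn
print(f"History re-sent with the next request: {history_tokens:,} tokens")
# -> History re-sent with the next request: 100,000 tokens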

The same dynamic shows up in customer-facing chatbots that span weeks or months, in support agents whose conversations span tickets, and in any agent that needs to act on something a user said yesterday. The pattern is universal: the longer the agent's life, the larger the share of the bill that is just resending the past.

Multiple independent surveys of production agent traffic put the share consumed by history padding alone in the 60-70% range, with the rest split between system prompt, tool descriptions, and the actual user turn. That is the slice the 90% number eats.

The 90% Comes From Stopping the History Stuffing

The architectural change is simple to state. Instead of sending the model "everything we have ever talked about", you send the model "the small slice of what we have talked about that is relevant to this turn". You get that slice from a persistent store via one API call.

The math at a glance, using the three-hour coding session above as the running example:

Pattern          | Per-turn input tokens | 200 turns of history
History stuffing | ~100,000              | 20,000,000 tokens billed
Targeted recall  | ~2,000                | 400,000 tokens billed
Reduction        | 50x per turn          | ~98% lifetime

The "98% lifetime" cell is what makes the 90% headline conservative. For agents whose users come back across days or weeks, the reduction approaches the full ratio of the conversation length to the recall window. Mem0's measured 3-4x cost reduction on benchmarks is the average across mixed workloads; long-session agents land higher, single-turn agents land lower.

The mechanism is not new. It is the same retrieval-then-reason loop that document Q&A systems have used since the original RAG paper, and the same architectural move that drives the cost story on the documentation side - written up in eliminate RAG hallucination AND runtime cost. What is new in 2026 is that the retrieval target is the agent's own past, not a static document corpus. For more on why the surface complaint of "amnesia" is the wrong framing for what is mechanically a retrieval problem, see Your Agent Doesn't Have a Memory Problem.

The Code: Replace History-Stuffing With Two API Calls

The before pattern is the one most teams ship first. It works, it is simple, and it gets expensive on a curve.

# BEFORE: history stuffing
# client is an anthropic.Anthropic() instance
messages = [
    *prior_turns,            # everything since session start
    {"role": "user", "content": user_message},
]
response = client.messages.create(
    model="claude-sonnet",
    max_tokens=2048,
    system=SYSTEM_PROMPT,    # Anthropic takes the system prompt as a top-level parameter
    messages=messages,
)

The after pattern replaces prior_turns with a targeted recall. The recall call returns the slices of the agent's past that are semantically related to the current user message, and you put only those slices in the context.

# AFTER: targeted recall
import requests

recall = requests.post(
    "https://api.ragionex.com/v1/memory/search",
    headers={"X-API-Key": API_KEY},
    json={
        "query": user_message,
        "scope": "segment",
        "results": 5,
        "project": "coding-agent",
    },
).json()

context = "\n\n".join(r["content"] for r in recall["results"])

messages = [
    {
        "role": "user",
        "content": f"Relevant prior context:\n{context}\n\n{user_message}",
    },
]
response = client.messages.create(
    model="claude-sonnet",
    max_tokens=2048,
    system=SYSTEM_PROMPT,    # unchanged from the BEFORE version
    messages=messages,
)

You also write to memory at points worth remembering. That is one more endpoint, called less often than search.

requests.post(
    "https://api.ragionex.com/v1/memory/write",
    headers={"X-API-Key": API_KEY},
    json={
        "content": "Decided to use Postgres advisory locks for the queue worker. Redis was rejected because the team wanted one operational dependency.",
        "project": "coding-agent",
    },
)

The write happens after a meaningful turn. The search happens at the start of every turn that needs context. In a coding agent that means roughly one search per user message and one write per architectural decision or convention discovered.
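
Put together, the per-turn loop is one function. The sketch below uses the same two endpoints shown above and assumes the Anthropic Python SDK for the model call; handle_turn and should_remember are illustrative names, not part of any SDK, and in practice you would persist a distilled note rather than the raw answer:

import requests

MEMORY_API = "https://api.ragionex.com/v1/memory"
PROJECT = "coding-agent"

def should_remember(user_message: str, answer: str) -> bool:
    # Placeholder heuristic: persist turns that record a decision. Replace with your own logic.
    return "decided" in answer.lower()

def handle_turn(client, api_key: str, system_prompt: str, user_message: str) -> str:
    # 1. Search: one recall call per user message that needs context
    recall = requests.post(
        f"{MEMORY_API}/search",
        headers={"X-API-Key": api_key},
        json={"query": user_message, "scope": "segment", "results": 5, "project": PROJECT},
        timeout=5,
    ).json()
    context = "\n\n".join(r["content"] for r in recall["results"])

    # 2. Reason: the recalled slices stand in for the full conversation history
    response = client.messages.create(
        model="claude-sonnet",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user",
                   "content": f"Relevant prior context:\n{context}\n\n{user_message}"}],
    )
    answer = response.content[0].text

    # 3. Write: persist only what is worth remembering
    if should_remember(user_message, answer):
        requests.post(
            f"{MEMORY_API}/write",
            headers={"X-API-Key": api_key},
            json={"content": answer, "project": PROJECT},
            timeout=5,
        )
    return answer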

Why Prompts Don't Need to Change

This is the title's promise. Your system prompt is the same. Your tool descriptions are the same. Your model is the same. The only line that changes is the one that builds the messages array. The recall result lands in exactly the slot where conversation history used to land - prepended to the user turn, as in the example above, or as an additional context message on providers that support one - and the model sees it the same way.

This matters because prompt engineering is brittle work. Teams that have spent weeks tuning a system prompt do not want to be told the cost optimization requires re-tuning it. They do not. The optimization is upstream of the prompt.

It also means the optimization is reversible without code changes. If you ever want to A/B test history-stuffing versus recall on the same prompt, you flip the branch that builds messages. The rest of the agent is unchanged.
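
A minimal version of that branch, where use_recall is an illustrative flag name; everything downstream of the returned list is identical on both paths:

def build_messages(user_message, prior_turns, recall_context, use_recall):
    # The A/B switch: same system prompt and same model call on both sides of the branch.
    if use_recall:
        return [{"role": "user",
                 "content": f"Relevant prior context:\n{recall_context}\n\n{user_message}"}]
    return [*prior_turns, {"role": "user", "content": user_message}]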

The cheapest token is the one you never sent. Persistent memory turns "send everything just in case" into "send only what is relevant".

When 90% Is Conservative

The 90% headline assumes a representative long-session agent. There are workloads where the actual reduction is higher and workloads where it is lower. Knowing which one you have changes the ROI calculation before you integrate.

Higher reduction (95%+): customer-facing chatbots whose conversations span months, support agents that read across past tickets, coding agents that work with the same developer across weeks of project work, sales agents that revisit accounts across quarters. In all of these the lifetime conversation is long, the relevant slice per turn is small, and the recall ratio is extreme.

Roughly the headline 90%: long single-day sessions such as multi-hour pair programming with a coding agent, a research agent walking through a long document set, or a multi-tool agent chaining many calls inside one task.

Lower reduction (or none): single-turn agents that have no history to compress, ultra-short sessions where stuffing the last three turns is already cheap, and agents where every prior turn is genuinely relevant to the current one (rare but possible in tightly-scoped task automation).

If your traffic looks like the first or second category, the math is comfortably in your favor. If it looks like the third, persistent memory is still useful for cross-session continuity but the cost story is weaker. Be honest about which one you are.

The Honest Trade-Off

Targeted recall is not free. You add one network round-trip per turn. The recall call has to run, the result has to come back, and only then does the model start generating. On Ragionex Memory the search latency runs in the tens-to-low-hundreds of milliseconds depending on result count and store size. On the providers measured in the Mem0 paper the numbers cluster in the same range.

For most agent workloads that latency is invisible because the model itself is the long pole - first-token latency on Claude Sonnet sits in the hundreds of milliseconds and the full response runs into seconds. A 50ms recall ahead of a 3-second generation is rounding error. For an interactive UX that streams output token-by-token, the recall lands before the user can read the first sentence.

Where the round-trip does matter is in latency-sensitive paths where every millisecond is on a stopwatch - voice agents with sub-second target latency, real-time autocomplete, anything with a hard SLO under 200ms total. For those, you have two options. One: keep recall but cache aggressively, since the same query repeats often in narrow workflows. Two: accept that the cost cut does not apply and stuff history for the latency-critical paths only.
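
For the first option, the cache can be as small as an in-process dict keyed on the query. A sketch, with an illustrative five-minute TTL rather than a tuned value:

import time
import requests

_cache = {}               # query -> (timestamp, search result)
CACHE_TTL_SECONDS = 300   # illustrative; tune to how quickly your memories change

def cached_search(query: str, api_key: str) -> dict:
    now = time.monotonic()
    hit = _cache.get(query)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]     # repeated query in a narrow workflow: no network round-trip
    result = requests.post(
        "https://api.ragionex.com/v1/memory/search",
        headers={"X-API-Key": api_key},
        json={"query": query, "scope": "segment", "results": 5, "project": "coding-agent"},
        timeout=2,
    ).json()
    _cache[query] = (now, result)
    return result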

The other trade-off is correctness. Recall can miss relevant context that history-stuffing would have included by brute force. In practice, well-scoped queries with project filters land the right slice. When they do not, the failure mode is the agent answering with less context, not with wrong context. That is the right failure mode for most workflows.

How to Pilot This in a Day

The smallest test that produces real numbers is to wrap one route of one agent. Pick the route with the longest conversation history, integrate the recall call, run a sample of production traffic through it for a day, and compare token billing on the two paths. The setup is one MCP config or one HTTP client. For a developer working in Cursor or Claude Code, the integration is documented in Add Persistent Memory in 3 Lines. For a backend, it is two HTTP calls.

The numbers you measure on your own traffic will not be exactly 90%. They will be a function of how long your sessions are and how much overlap there is between turns. The right hypothesis to test is "the reduction matches the published 3-4x range on my workload mix". When it does, scale it. When it does not, the pilot itself is cheap enough to be worth doing for the data alone.
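
The comparison itself needs no extra tooling beyond the usage block the API already returns. A sketch of the tally, assuming the Anthropic SDK's response.usage.input_tokens field and two route labels of your own choosing:

from collections import defaultdict

input_tokens = defaultdict(int)   # route label -> total billed input tokens

def record_usage(route: str, response) -> None:
    # Each Anthropic response reports the input tokens billed for that request
    input_tokens[route] += response.usage.input_tokens

# After a day of traffic on both paths:
stuffed = input_tokens["history_stuffing"]
recalled = input_tokens["targeted_recall"]
print(f"Measured reduction on this workload: {1 - recalled / stuffed:.0%}")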

Ready to try it?

Free API key. No credit card. Start using in seconds.

Get Started