Your agent doesn’t have a memory problem. It has a retrieval problem.
Storing tokens has been a solved problem since the 1970s. Finding the right one when you need it is what we still get wrong. The reframe matters because it changes which knobs you tune and which benchmarks you trust.
In late April 2026, an essay called “Agentic Memory Is Still an Unsolved Problem” hit the front page of Hacker News and pulled six hundred comments before it dropped off. The headline is correct in the sense that something about agents is genuinely unsolved. The headline is also misleading in a way that has been sending teams down the wrong path for two years, because the problem the essay describes - and the problem every framework on the market is actually trying to solve - is not memory in the cognitive sense. It is retrieval, dressed in costume.
This is worth being precise about, because the reframe changes what you measure, what you optimize, and which products you take seriously. If your agent is “forgetting” the user's preferences halfway through a session, it is not because storage failed. The bytes are sitting in a database somewhere. The agent is failing to find them at the right moment. Those two failure modes share a symptom and almost nothing else. The honest shape of the fix is a small, scoped, semantically searchable store - the shape Ragionex exposes through POST /v1/memory/search.
What the “memory” benchmarks are actually measuring
If you read the marketing for any of the agent-memory frameworks shipping in 2026, you will see the same three or four numbers cited over and over. They are real numbers, produced by real benchmarks, and they tell a consistent story - just not the story the marketing implies they tell. Walk through the canonical examples honestly:
LoCoMo. The benchmark Mem0 popularised in their research paper measures performance on long-conversation question-answering. The model is given a long conversation history, then asked questions whose answers depend on remembering specific facts from earlier turns. The metric is whether the model produces the right answer. That is a retrieval benchmark. It tests whether the system can find the relevant earlier turn and surface it back to the reasoning model. It does not test memory in any sense distinct from retrieval - the “memory” framing is shorthand.
Mem0's headline number: +26 points on LoCoMo. An impressive result, with real engineering behind it. What is being measured is recall accuracy on long-horizon questions versus a baseline that stuffs the full history into the prompt. The improvement comes from not stuffing the history - from retrieving a relevant subset and prompting on that. The number quantifies how much better the retrieval-based approach is than the stuff-everything approach. It is a retrieval-quality number, plain and simple.
Zep's headline number: 200ms. The number you see in Zep's marketing is the latency of getting context back from their store. That is retrieval latency. It is not the latency of remembering - the entire concept of “remembering latency” would not parse - and Zep's engineering team would tell you the same. The 200ms is a useful number for builders. It is not a number that has anything to do with memory in the cognitive sense.
Three headline numbers, three benchmarks, all measuring the same underlying capability under different rebrandings. The capability is find the right context for this query, fast. We have been calling it memory because that is what it feels like from the user's perspective. From the engineering side, it is retrieval, and pretending otherwise leads to architectural mistakes.
Why this reframe matters for your code
If memory is just retrieval, then the things you tune are the things you tune in any retrieval system: the granularity of what you store, the quality of the query you send, the relevance ranking, the result set size, the recency bias. None of those are mysterious. The retrieval literature has thirty years of useful work on every one of them. What is new in 2026 is that the queries are written in natural language by an LLM acting on the user's behalf, but the rest of the stack is recognisable from any production search system.
The mistake that comes from treating it as “memory” instead is that you start reaching for cognitive metaphors when the right tools are mechanical. You build “forgetting curves” when the right answer is a recency boost in the ranker. You build “associative recall” when the right answer is a better query. You build “working memory” abstractions when the right answer is to send fewer, more relevant results to the prompt. Every cognitive metaphor that gets baked into the architecture is a layer of indirection between you and the actual lever.
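The “recency boost in the ranker” is genuinely that small. A minimal sketch - the weight, half-life, and record shape are illustrative assumptions, not any framework's API:

```python
import math

def rank(candidates, half_life_days=30.0, recency_weight=0.3):
    """Rank stored records by semantic similarity plus a recency boost.

    Each candidate is (record_id, similarity, age_days). The boost is a
    plain exponential decay with a tunable half-life - a ranking knob,
    not a cognitive "forgetting curve".
    """
    def score(candidate):
        _, similarity, age_days = candidate
        boost = recency_weight * math.exp(-math.log(2) * age_days / half_life_days)
        return similarity + boost
    return sorted(candidates, key=score, reverse=True)

# Two records with identical similarity: the fresher one ranks first.
ranked = rank([("old", 0.80, 90.0), ("new", 0.80, 1.0)])
```

The entire “forgetting” behaviour lives in two parameters you can A/B test, which is the point: the lever is mechanical and directly instrumentable.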
Storing tokens has been a solved problem since the 1970s. Finding the right one when you need it is what we still get wrong.
The aphorism is harsher than it needs to be, but it gets the priority right. If your agent is failing, the failure is almost never in the storage layer. The failure is in the query, the ranking, the granularity of the stored unit, or the way the retrieved results are stitched into the prompt. Those are the levers. Those are the things to instrument.
What the engineering culture used to look like - and what it looks like now
For a long time the search-engine community had a culture around this. You measured precision at k. You measured recall at k. You ran A/B tests on ranker changes and tracked the ranking metrics that correlated with downstream user satisfaction. The work was unsexy and rigorous and it produced systems that were extraordinarily good at finding the right thing.
The agent-memory community in 2026 is, charitably, where search was in 1998. There are dozens of frameworks competing on different abstractions; there is not yet a shared evaluation culture; and the marketing is several steps ahead of the rigour. The teams that are doing the most credible work - Mem0, Zep, Letta, the Anthropic engineering team writing about context engineering - are converging on the same realisation: this is a retrieval problem with a chat-shaped interface. The sooner the broader industry accepts that, the sooner the work gets serious.
What an agent-memory product should compete on, honestly
If we accept the reframe, the axes a memory product should compete on become much less mysterious. They are the same axes a retrieval product competes on, with a few twists from the agent context.
Recall quality. Given a query that the agent generates from the current task, how often is the right past memory in the top three results? This is the central metric, and every framework that does not publish it should be looked at sceptically. The same precision-at-k discipline applies on the documentation side - we walked through the failure modes in why your RAG has a 70% gap between best and worst answers.
Latency at the relevant percentile. Not p50 - p95 or p99, because the long tail is what shows up in user-perceptible bad sessions. Median latency is a marketing number; tail latency is an engineering number.
Granularity of the stored unit. If you store a full session transcript as a single record, recall is bad because the matching unit is too coarse. If you store every sentence as a record, recall is bad because the matching unit is too fine. The right granularity is opinionated and it is part of what a memory product is choosing for you. The architectural choice you make about where memory lives determines who owns this decision.
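To make the granularity trade-off concrete: the same transcript can be stored as one record, one record per turn, or one record per sentence, and only the middle option gives the matcher units of roughly the right size. A toy sketch (the per-turn choice here is one reasonable default, not a universal rule):

```python
def by_turn(transcript):
    """Split a session transcript into turn-level records - a middle
    ground between one-record-per-session (too coarse for matching)
    and one-record-per-sentence (too fine to carry context)."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]

transcript = """user: should we cache the profile endpoint?
assistant: yes, with a 5-minute TTL
user: key it on auth scope"""

records = by_turn(transcript)   # 3 matchable units instead of 1 big one
```

A query about caching decisions can now match the one turn that answers it, instead of matching (or missing) the whole session blob.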
Update and delete, properly. Memory systems that pretend they are append-only databases produce stale recall the first time a user changes their mind. The ability to update and delete is not optional. It is part of correctness.
Scoping, isolation, and shape. Per-user isolation is non-negotiable for security reasons (this deserves its own treatment, and it has one). Project-level scoping inside a user's pool is what makes the recall match how engineers actually organise work.
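The first two axes are directly measurable with a handful of lines. A sketch of the two numbers a memory eval should report - the run data is invented, and the nearest-rank percentile is one of several standard methods:

```python
import math

def recall_at_3(run):
    """run: list of (retrieved_ids, relevant_id) pairs, one per labelled query."""
    hits = sum(1 for retrieved, relevant in run if relevant in retrieved[:3])
    return hits / len(run)

def p95(latencies_ms):
    """Tail latency: the value 95% of requests beat (nearest-rank method)."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

run = [(["a", "b", "c", "d"], "b"),   # hit: relevant id in top 3
       (["e", "f", "g"], "g"),        # hit
       (["h", "i", "j"], "k")]        # miss: relevant id never surfaced

latencies = [40, 45, 42, 41, 300]     # one slow outlier dominates the tail

score = recall_at_3(run)   # 2 of 3 labelled queries answered
tail = p95(latencies)      # the outlier, not the comfortable ~42ms median
```

Note how the p95 ignores the pleasant-looking median entirely - exactly why the tail percentile is the engineering number and the median is the marketing number.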
Notice what is not on this list: anything cognitive, anything “forgetting,” anything that requires the system to understand what it is storing. The store does not need to understand. The reasoning happens at the LLM that calls the store. The store needs to be fast, accurate, scoped, and editable. Those are search problems with thirty years of literature behind them.
The honest positioning
Ragionex Memory is, deliberately, a retrieval API. It does not pretend to be a cognitive architecture. It does not claim to have a model of what it is storing. It is a small, fast, scoped store - search by meaning, not exact match - with a write endpoint and a search endpoint, and the agent on top does the reasoning. We think the honesty is a feature. The closer the engineering matches the actual workload, the fewer surprises the integration produces, and the easier it is to debug when something goes wrong.
The shape is mechanical. Write a fact, scoped to a project. Query by meaning, optionally filtered to a project, with a result count and a scope (segment-level for a slice of a memory, full-level for the whole stored record). That is the entire surface, and it is the entire surface because the entire surface of the problem is retrieval:
curl -X POST https://api.ragionex.com/v1/memory/search \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What did the user decide about caching the user-profile endpoint?",
    "scope": "segment",
    "results": 5,
    "project": "acme-api"
  }'

The query is natural language because the agent is writing it. The scope is segment-level because that is usually what fits in the prompt. The result count is small because stuffing more context into the prompt makes the model worse, not better. None of the parameters are about cognition. All of them are about retrieval - which is the problem, properly named.
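On the client side, the whole integration reduces to building that payload and stitching the few results into the prompt. A Python sketch - the shape of the response records (a text field) is an assumption for illustration, not the documented schema:

```python
def build_search_payload(query, project, results=5, scope="segment"):
    """Mirror of the curl example's request body: small result count,
    segment scope, project filter."""
    return {"query": query, "scope": scope, "results": results, "project": project}

def stitch(results, budget=3):
    """Fold the top few retrieved memories into a prompt section.
    The small budget is deliberate: more context tends to degrade
    the model, not help it."""
    lines = [f"- {r['text']}" for r in results[:budget]]
    return "Relevant past facts:\n" + "\n".join(lines)

payload = build_search_payload(
    "What did the user decide about caching the user-profile endpoint?",
    project="acme-api",
)

# The HTTP call itself is the curl example; here we fake a response
# so the stitching step is visible end to end.
fake_results = [{"text": "User chose a 5-minute TTL cache."},
                {"text": "Cache key includes the auth scope."}]
prompt_block = stitch(fake_results)
```

Everything downstream of `prompt_block` is ordinary prompting; the store never needs to understand what it returned.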
What changes when you accept this
The teams that have internalised the reframe ship faster. They stop arguing about whether their agent has “real memory.” They start arguing about whether the recall@3 number is high enough on the queries they actually see in production. They instrument the right things. They write evals that measure what users care about. They graduate from the “does it remember?” era of vibes-based evaluation to the “does it find the right past fact when the agent needs it?” era of measurable engineering.
The marketing on memory products will probably keep using cognitive language for another year or two, because cognitive language sells. The engineering reality is mechanical, and once you see it that way it stays seen. Your agent does not have a memory problem. It has a retrieval problem - and that is good news, because retrieval is a problem we know how to make progress on.