
Why every AI agent has amnesia.

Stateless models do not have a context size problem. They have a state problem. The two are different, and conflating them is how teams burn six months on the wrong fix.

On the Cursor community forum, a thread titled “Cursor with Claude has memory of a goldfish” sat near the top of the “Bug Reports” tab for most of last quarter. The reply count crossed three digits before anyone from the company answered. Pull a random week of Hacker News front pages from 2026 and you will find at least one post about an agent that forgot the user halfway through a session. The complaint has a thousand variants and one shape: the assistant was helpful, the user trusted it, the conversation got long, and then the assistant quietly stopped knowing things it had been told two screens earlier.

This is not a Cursor bug. It is not a Claude bug. It is not even a bug. It is the default behaviour of a stateless function being asked to behave like a stateful one, and it is the most-complained-about agent UX failure of 2026 by a margin that is not close. The interesting question is not why it happens - the architecture makes that inevitable. The interesting question is why the most popular proposed fix, “just use a bigger context window,” is wrong in a way that is going to cost teams another year of wasted engineering time. The right fix is a small, queryable store outside the model - the shape Ragionex exposes through POST /v1/memory/write and POST /v1/memory/search.

The architectural cause: a function call, not a conversation

Every LLM in production today is a pure function. You give it a prompt, it returns a completion, and it has no idea any other call ever happened. The illusion of conversation is constructed entirely on the client side: the application keeps the message history in a Python list or a Postgres row, and on every turn it pastes the entire history back into the prompt. There is no “the model” that remembers you. There is a request that re-explains who you are, every single time.
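A minimal sketch of that default loop, in Python - chat_completion here is a stand-in for whatever model API you actually call, not a real library function:

history = []

def chat(user_message):
    # The "conversation" is just a list the application owns.
    history.append({"role": "user", "content": user_message})
    # Every turn re-sends the ENTIRE history. The model itself
    # retains nothing between calls; delete this list and the
    # "relationship" is gone.
    reply = chat_completion(messages=history)  # hypothetical model call
    history.append({"role": "assistant", "content": reply})
    return reply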

This works fine for short chats. The wheels start coming off the moment the conversation outgrows the budget the application is willing to send. That budget is always smaller than the model's nominal context window, because every token in history is a token not spent on reasoning, and because longer prompts cost more, take longer, and degrade quality. So the application throws old messages away. Sometimes it summarises them, sometimes it just truncates. Either way, the model wakes up on turn forty knowing only what fit through the window, and the user perceives this as the assistant losing its mind.
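What “throws old messages away” usually looks like in practice is something like the sketch below, with token_count standing in for your tokenizer of choice. Everything that falls outside the budget simply ceases to exist as far as the next model call is concerned:

def fit_to_budget(history, budget_tokens):
    # Walk backwards from the newest message, keeping what fits.
    kept, used = [], 0
    for msg in reversed(history):
        cost = token_count(msg["content"])  # hypothetical tokenizer
        if used + cost > budget_tokens:
            break  # everything older than this is silently dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))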

The problem is not that the model is bad at remembering. The model is not in the remembering business at all. The application that wraps it is supposed to do that, and the default implementation - shove everything back into the prompt every turn - is the architectural primitive that breaks first.

Bigger context windows do not solve this. They make it worse.

The intuitive fix is to make the prompt fit. If the model now accepts a million tokens, surely you can stop dropping history. Anthropic's own engineering team published a piece in late 2025 called “Effective context engineering for AI agents” that argues the opposite, and it should be required reading for anyone building on top of frontier models. The thesis is plain: context is a finite resource with diminishing marginal returns, and treating it as a free buffer is one of the most common architecture mistakes in agent code. Performance does not scale with the number of tokens you feed in. It often inverts.

The empirical case for this is now overwhelming. The Mem0 team, in their LongMemEval and LoCoMo work, ran the experiment that everyone shipping a chat product implicitly assumes does not exist: take a long-running conversation, evaluate the model with full history stuffed into the prompt, then evaluate the same model with a small retrieved subset of relevant turns. The retrieval-based setup did not just match the stuffed-context setup. It beat it, by double-digit points on long-horizon recall benchmarks, while costing a fraction of the tokens. The phrase the community has converged on is context rot: the longer the prompt, the worse the model attends to any single piece of it. Million-token windows do not abolish this. They give you more rope to hang yourself with.

The fix is not a longer prompt. It is a shorter one with a queryable store behind it.

If you read only one paragraph of this post, read that one. The point is not that long contexts are useless - they are extraordinary for one-shot tasks like analysing a codebase or a contract. The point is that conversation is not a one-shot task. It is a long-running process where 99% of the previous turns are irrelevant to the current one, and feeding the model 99% noise is a tax on quality, latency, and cost simultaneously.

The two failed fixes you have probably already tried

Every team that hits this wall reaches for one of two workarounds before they accept that the architecture is the problem. They are worth walking through, because they fail in instructive ways.

The markdown file. CLAUDE.md, .cursorrules, AGENTS.md, the rules file in your editor of choice. The agent reads this every session, so anything you put in it survives the reset. This works beautifully for the first three things you write down. By the tenth, the file is long enough that the agent stops attending to it carefully. By the thirtieth, it is a wall of text that no one wants to maintain, and the team has started arguing about which CLAUDE.md is canonical for a repo with three contributors. Andrej Karpathy's llm-wiki gist made the pattern famous in April 2026 and is genuinely the right answer for solo developers with a personal knowledge vault. It is not the right answer for an agent that needs to remember thirty different decisions across thirty different projects, because the file becomes the thing it was supposed to fix. We will come back to this in another post - the failure mode is interesting enough to deserve its own treatment.

Bigger context, with care. Some teams accept the context-rot data and try to engineer around it: aggressive summarisation, sliding-window heuristics, retrieval-augmented prompt construction with the agent itself as the retriever. This is closer to the right answer, but it is still treating the prompt as the source of truth. The moment your summariser drops the wrong fact, the agent has forgotten it for the rest of the session. The moment your sliding window slides past the architectural decision the user made on Tuesday, that decision is gone until the user re-types it. You have built a memory system. You have built a bad one, with no externally inspectable state, no way to update a single fact without a re-summarisation pass, and no way to share what one agent learned with another.
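The sliding-window variant, reduced to its essence - a sketch, with keep_last as an arbitrary tuning knob:

def sliding_window(history, keep_last=20):
    # Keeps only the most recent turns. Anything older - including
    # the database decision the user made on Tuesday - is gone, with
    # no way back short of the user re-typing it.
    return history[-keep_last:]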

The right primitive: a queryable store outside the model

The way out of this is not to make the prompt smarter. It is to stop treating the prompt as the place where memory lives. The prompt is a working set. Memory is a store. They are different objects with different lifecycles, and good agent architectures keep them separate.

The shape that emerges, once you accept the split, is mechanical. On every turn, the agent decides what is worth remembering and writes it to a persistent store. On every turn, the agent decides what it needs to recall and queries the same store by meaning, not by keyword. The prompt carries only the working set: the current task, the current turn, and whatever recall the agent pulled in for this specific question. Everything else lives outside the model, in a place where it can be read, written, updated, audited, and shared between agents.
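In sketch form, the turn loop becomes the following - memory_write, memory_search, extract_facts, and is_worth_remembering are all stand-ins for your store's client and your own salience heuristics, not a prescribed API:

def agent_turn(user_message, project):
    # Recall: query the store by meaning, pulling only what this
    # specific question needs.
    recalled = memory_search(query=user_message, project=project)
    # The prompt carries the working set, not the whole history.
    prompt = build_prompt(task=user_message, recall=recalled)
    reply = chat_completion(messages=prompt)  # hypothetical model call
    # Remember: persist anything durable this turn produced.
    for fact in extract_facts(user_message, reply):
        if is_worth_remembering(fact):
            memory_write(content=fact, project=project)
    return reply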

This is the architectural pattern that Zep, Mem0, and Letta have all converged on, with different opinions about the storage layer. It is what Anthropic's own context engineering writeup recommends. And it is the only architecture that has actually shown improvements on the long-horizon benchmarks - because it is the only one that addresses the actual problem, which is that the model never had memory in the first place; the application was supposed to provide it.

What the API calls should look like

If memory is just a store with a query interface, the integration becomes embarrassingly small. Here is the shape we ship at Ragionex - two HTTP calls, one for write, one for recall:

curl -X POST https://api.ragionex.com/v1/memory/write \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "content": "User decided to use Postgres over MongoDB for the billing service. Reason: needed transactional integrity across the ledger and invoice tables.",
    "project": "acme-billing"
  }'

curl -X POST https://api.ragionex.com/v1/memory/search \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Which database did we choose for billing and why?",
    "scope": "segment",
    "results": 5,
    "project": "acme-billing"
  }'

That is the entire surface. The agent writes facts as it learns them, queries the store when it needs to recall, and never has to stuff thirty turns of history into a prompt to remember a Tuesday decision. The store handles indexing in the background. The agent's prompt stays small. The model's attention stays focused on what matters this turn.
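Wired into Python with requests, the same two calls look like this. One assumption to flag: the sketch guesses that the search endpoint returns its matches under a results key, so check the response shape against the API reference before relying on it:

import requests

API = "https://api.ragionex.com/v1"
HEADERS = {"X-API-Key": "rgx_memory_...", "Content-Type": "application/json"}

def remember(content, project):
    # Write one fact; indexing happens server-side in the background.
    r = requests.post(f"{API}/memory/write", headers=HEADERS,
                      json={"content": content, "project": project})
    r.raise_for_status()

def recall(query, project, k=5):
    # Semantic search over the store, scoped to one project.
    r = requests.post(f"{API}/memory/search", headers=HEADERS,
                      json={"query": query, "scope": "segment",
                            "results": k, "project": project})
    r.raise_for_status()
    # Assumed response shape - verify against the docs.
    return r.json().get("results", [])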

What this changes for your architecture

Once you internalise that the prompt is a working set and not a persistent store, several things stop being mysterious. Why your agent gets confused on long sessions: it is being asked to read a wall of mostly-irrelevant text. Why summarisation feels lossy: because it is, irreversibly. Why every agent framework has converged on an external memory abstraction: because the alternative is engineering forever around context rot. Why the bigger-window era did not end this debate: because the debate was never about token count.

The interesting reframe, which is worth its own post, is that even “memory” is the wrong word for what the agent actually needs. What it needs is retrieval - the ability to find the right context at the moment it is needed, scoped to the current task, ranked by relevance, and small enough to fit in the working set without crowding out reasoning. Storage has been a solved problem since the 1970s. The hard part is finding the right thing again, and the design space for doing it well is the same one we explored for documentation in Context Engine vs RAG.

If your agent forgets the user halfway through a session, the fix is not to give it a bigger pocket. The fix is to give it a filing cabinet, and to teach it when to open the drawer.
