Beyond CLAUDE.md: when your agent’s memory outgrows a markdown file.
The CLAUDE.md pattern is a write-only journal. The agent can edit, but it can’t ask. Five failure modes that show up at scale, and an MCP-shaped upgrade that keeps the same /save and /recall verbs without the markdown ritual.
In April 2026, Andrej Karpathy posted a gist titled llm-wiki describing how he uses Claude Code as a personal knowledge agent operating over a markdown vault, with slash commands wired up to read, write, and search the vault. The gist crossed sixteen million views and five thousand stars within weeks, and rightly so: it is a clean, debuggable, file-based pattern that anyone can implement in an afternoon and that solves a real problem for the workload it targets. If you are a solo developer with a personal knowledge base, the right answer to “how should I give my agent memory” is probably to read that gist and stop reading blog posts like this one.
This post is for the next workload up. It is for the developer who built the Karpathy pattern, loved it, and then watched it strain - sometimes after a few weeks, sometimes after a few months, but reliably - the moment the vault grew past a few hundred files, or a second person started writing into it, or the agent started forgetting to read it before answering. The failure modes are not flaws in the pattern. They are properties of the pattern that make it the right answer at one scale and the wrong answer at the next, and they are worth being explicit about so you know which side of the line you are on. The smallest possible upgrade keeps the same /save and /recall verbs and lifts them onto a managed HTTP layer - in our case Ragionex's POST /v1/memory/write and POST /v1/memory/search - so the verbs survive and the librarianship goes away.
What the markdown-vault pattern actually does well
Before listing the failure modes, it is worth giving the pattern its full credit. A vault of plain markdown files is durable in a way that few storage formats are. You can grep it, version it in git, edit it in any tool, share it with another agent that knows nothing about your conventions, read it ten years from now without a migration project. The slash-command approach (/save, /recall, /links) gives the agent a tiny, opinionated verb surface that is easy to reason about and easy to debug. When something goes wrong, you can cat the file. There is no opaque database, no schema to migrate, no service to keep alive. For a single user with a single agent and a vault that fits in human working memory, this is unimprovable.
The Hacker News thread on “memory for AI coding agents” ran for hundreds of comments in the weeks after Karpathy's gist, and most of the agreement was on this point. The pattern is simple, transparent, and it works. The disagreement was about where it stops working, and that is the part worth walking through carefully.
Failure mode 1: manual upkeep
The first thing that breaks is the discipline. The pattern works because the user remembers to save important things and remembers to ask the agent to recall before acting. For the first week this feels great. By week three you are catching yourself two paragraphs into re-explaining something you already saved last Tuesday, because you forgot to /recall. By week six the vault has notes from one week ago that contradict notes from three weeks ago, because nobody pruned. The pattern assumes a level of librarianship that humans, including the author of this post, do not sustain.
The agent cannot help with this in any deep way, because the agent does not know which entries are stale. It can dedupe lexically; it cannot make the editorial call that the new note about Postgres replaces the old note about MongoDB. The librarian has to be a human, the human has to keep up, and the moment the human is busy, the vault decays.
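The limit of lexical deduplication is easy to show in a few lines. In this toy sketch (the note text and dates are invented for illustration), an exact repeat is trivially removed, but a newer note that supersedes an older one passes straight through, because nothing lexical connects them:

```python
# Toy illustration: lexical dedup removes exact repeats but cannot see
# that a newer note supersedes an older, contradictory one.

notes = [
    "2026-03-05: primary datastore is MongoDB",
    "2026-03-05: primary datastore is MongoDB",            # exact repeat: easy
    "2026-03-26: migrated primary datastore to Postgres",  # supersedes the above: invisible
]

def lexical_dedupe(entries):
    seen, out = set(), []
    for e in entries:
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out

deduped = lexical_dedupe(notes)
# The exact duplicate is gone, but both the MongoDB note and the Postgres
# note survive -- nothing here can decide which one is current.
```

Deciding that the Postgres note retires the MongoDB note is an editorial judgment about meaning, which is exactly the judgment the vault delegates to the human.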
Failure mode 2: the agent forgetting to read the file
This one is more insidious. The whole pattern depends on the agent reading the relevant vault file before answering. In practice, agents skip this step constantly. Sometimes the file is too long. Sometimes the question feels “simple” and the agent answers from priors. Sometimes the agent reads the file but the relevant section is buried six pages in and the agent's attention has moved on. The user perceives this as the agent “not using its memory,” but the actual failure is that the file-based recall is opt-in for the agent and the agent is opting out under load.
The fix the community has converged on is to make recall not optional - to make the recall step a tool call that the agent invokes structurally, with the result returned as a small, focused piece of context rather than a wall of markdown. That is a different shape than “read this file before answering,” and it is the shape that scales.
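The shape of "recall as a structural tool call" can be sketched in a few lines. Everything here is hypothetical: `search_memory` stands in for whatever memory tool the agent exposes, and the store is stubbed with a static dict so the sketch is self-contained:

```python
# Sketch: recall as a mandatory step on every turn, not an opt-in file read.
# `search_memory` is a stand-in for a real memory tool; stubbed here.

MEMORY = {
    "profile endpoint caching": "Decided not to cache the user-profile endpoint (2026-04-01).",
}

def search_memory(query: str, top_k: int = 3) -> list[str]:
    # Stand-in for a real semantic search call; naive word match here.
    words = query.lower().split()
    return [v for k, v in MEMORY.items() if any(w in k for w in words)][:top_k]

def answer(question: str) -> str:
    # Recall happens before the model ever sees the question, and the result
    # arrives as a small focused snippet, not a wall of markdown.
    recalled = search_memory(question)
    context = "\n".join(recalled) if recalled else "(no relevant memory)"
    return f"Relevant memory:\n{context}\n\nQuestion: {question}"
    # In a real agent, this assembled prompt goes to the model.
```

The point of the shape is that the agent cannot forget the recall step, because the recall step is not the agent's decision.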
Failure mode 3: the file grows past the effective attention budget
The third failure is the same context-rot problem that drives the broader amnesia issue. A vault file that fits in five hundred tokens is read carefully. A vault file that fits in fifty thousand tokens is skimmed. A vault file that exceeds the model's effective attention budget is, statistically, read in a way that misses things. Anthropic's engineering team is explicit about this: context is a finite resource and the marginal value of stuffing more into the prompt drops fast. The math does not change just because the file is markdown and lives on disk.
The Karpathy pattern has an answer for this - split the vault into many files and have the agent grep before reading - but the grep is keyword-based, and the moment the user's query is phrased differently from the way the note was written, the grep misses. Markdown grep does not handle “the user wrote down something about caching last month and I need to find it even though they used the word memoization” gracefully. The retrieval problem the pattern was supposed to solve is back, just hiding in the filesystem.
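The failure mode fits in a handful of lines. Using the article's own example (the note text is illustrative), keyword grep both misses the synonym and over-matches the literal term:

```python
# Keyword grep over vault notes: the note says "cache", the query
# says "memoization", and grep sees no connection between them.

vault = [
    "2026-04-01 decided not to cache the user-profile endpoint; stale data worse than latency win",
    "2026-04-07 enabled cache on the search endpoint with a 60s TTL",
]

def grep(term: str, notes: list[str]) -> list[str]:
    return [n for n in notes if term.lower() in n.lower()]

assert grep("memoization", vault) == []  # the relevant decision is invisible
assert len(grep("cache", vault)) == 2    # and the literal term over-matches
```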
Failure mode 4: team coordination
The fourth failure is what happens when a second person joins. There are now two CLAUDE.md files, or one shared file with merge conflicts, or a convention nobody agreed on about whose preferences win. The agent does not know which file is canonical. The team has discovered the central difficulty of any shared knowledge system - what is true, who decides, and how do conflicts resolve - and is solving it with markdown, which is the wrong tool.
For a team this is not a marginal annoyance. It is the difference between memory that helps and memory that produces silent disagreements in the agent's behaviour across people. The Karpathy pattern is fundamentally individual. Engineering work is fundamentally collaborative. The two do not compose without an explicit shared layer.
Failure mode 5: no semantic recall
The deepest failure is the one developers notice last. The user wrote down on April 1st: “Decided not to cache the user-profile endpoint - changes too often, stale data is worse than the latency win.” On April 28th the user asks the agent: “Should we add memoization to the profile route?” Grep on the vault for memoization returns nothing, because the original note used cache. Grep on cache returns six other notes about other endpoints. The relevant decision is in the vault. The agent cannot find it.
A markdown file is a write-only journal. The agent can edit, but it can’t ask.
This is the structural limit. Filesystem-and-grep is a key-based lookup with a thin metadata layer. Real engineering recall is meaning-based: the question and the answer are phrased differently, and the system has to bridge the gap. The grep cannot. Once you accept that this is a retrieval problem in disguise, the upgrade path is obvious.
The smallest possible upgrade: keep the verbs, lift them off the filesystem
The good news is that you do not have to give up the pattern to fix the failure modes. The Karpathy verbs - /save a fact, /recall by query - are exactly the right shape. What changes is the implementation. Instead of the verbs reading and writing markdown files, they call a small HTTP API that handles storage, retrieval, and ranking on your behalf. The agent integration is the same; the storage and recall layer is upgraded from grep-on-files to semantic search with project scoping.
If your agent supports MCP - Claude Code, Cursor, Windsurf, and an increasing list of others all do - the integration is one config block. Here is the literal MCP server registration:
```json
{
  "mcpServers": {
    "ragionex-memory": {
      "command": "npx",
      "args": ["-y", "@ragionex/mcp-memory"],
      "env": {
        "RAGIONEX_API_KEY": "rgx_memory_..."
      }
    }
  }
}
```

That registers seven tools - memory_write, memory_search, memory_list, memory_view, memory_update, memory_delete, memory_status - which are the verb surface the Karpathy pattern wanted in the first place, with the failure modes designed out. The agent calls memory_search with a natural-language query and gets back semantically ranked results. The recall is no longer keyword-dependent. The store is no longer a single growing file. The team can share an API key and operate against one pool. And the agent calls the recall tool structurally on every turn that needs context, not only when the human remembers to type a slash command.
The same flow, in plain HTTP
If you are not on an MCP-compatible agent, the API is small enough that wiring it directly into your tool layer takes an afternoon. The two calls that replace the bulk of the markdown ritual:
```shell
curl -X POST https://api.ragionex.com/v1/memory/write \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Decided not to cache the user-profile endpoint. Changes too often; stale data is worse than the latency win. Reviewed on 2026-04-01 and confirmed.",
    "project": "acme-api"
  }'

curl -X POST https://api.ragionex.com/v1/memory/search \
  -H "X-API-Key: rgx_memory_..." \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Should we add memoization to the profile route?",
    "scope": "full",
    "results": 3,
    "project": "acme-api"
  }'
```

The query says memoization. The stored note says cache. The search returns the right note anyway, because the matching is by meaning, not by token. That is the upgrade in one line. The same architecture - meaning-based retrieval over a build-time-prepared corpus - is what we run on the documentation side, written up in Context Engine vs RAG.
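If you would rather drive the same two endpoints from application code than from curl, a minimal sketch follows. It assumes the endpoints and fields exactly as shown above; the payload builders are pure functions, and `post` requires a live API key to actually run:

```python
import json
from urllib import request

BASE = "https://api.ragionex.com/v1/memory"

def write_payload(content: str, project: str) -> dict:
    return {"content": content, "project": project}

def search_payload(query: str, project: str, scope: str = "full", results: int = 3) -> dict:
    return {"query": query, "scope": scope, "results": results, "project": project}

def post(path: str, payload: dict, api_key: str) -> dict:
    # Mirrors the curl calls above: JSON body, X-API-Key header.
    req = request.Request(
        f"{BASE}/{path}",
        data=json.dumps(payload).encode(),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# post("write", write_payload("Decided not to cache ...", "acme-api"), key)
# post("search", search_payload("Should we add memoization ...", "acme-api"), key)
```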
What you keep from the Karpathy pattern
None of this requires throwing the markdown vault away. The pattern that works at solo scale - the verbs, the discipline of writing things down explicitly, the agent operating on a small number of focused tools - is the same pattern that works at the next scale up. The change is purely in the storage layer: filesystem becomes a small managed API, grep becomes semantic search, single-user becomes multi-user-via-shared-key, and the agent's recall step becomes structural rather than ritualistic.
Karpathy built the right primitive at the right time. The community absorbed it, ran with it, and is now hitting the limits any file-based system hits when it is asked to do retrieval. The honest next step is not to abandon the verbs - they are correct - but to give them an implementation that does not depend on the human being a perfect librarian or the agent being a perfect grepper. Both of those assumptions are reliably false at scale; a managed semantic store needs neither of them.
The honest summary
If your vault is small, your team is one person, and you are happy, do not change anything. The Karpathy pattern is the right answer for that workload and it will be the right answer for that workload for years. If your vault has crossed a few hundred files, or a second person has started writing into it, or your agent is reliably failing to find decisions you know are in there, the upgrade path is small and the failure modes you are seeing are predictable. Keep the verbs. Lift them off the filesystem. The afternoon you spend on the integration buys back the days you have been spending on the librarianship the pattern silently asks for.