Retrospectives shouldn't be projects

A few weeks ago, Anthropic shipped a research-preview feature in their managed-agents API and called it Dreams. A Dream is an asynchronous job: it reads an AI agent’s accumulated memory store alongside a batch of past session transcripts and produces a new, cleaned-up memory store — duplicates merged, stale or contradicted entries replaced with the latest value, new insights surfaced. The input store is never modified. The output is a separate artifact the operator can review and either adopt or discard. The name is the giveaway. Anthropic borrowed it from the cognitive scientist Erik Hoel, who has been arguing for years that this is precisely what brains do every night.

Hoel’s theory is called the overfitted brain hypothesis. Dreams, he argues, exist so today’s experience does not crowd out everything you knew before today. Sleep runs a cleanup pass on working memory and writes the durable bits to a place you can reach for them later. You don’t book a meeting for it. You can’t really opt out. The reflection is built into the loop that keeps the system functional. Hoel extended the argument earlier this year to current large language models: nothing about today’s session changes the model tomorrow. “There is nothing it is like to be a Large Language Model,” he wrote — because the cleanup pass that sleep performs simply doesn’t happen.

That gap is closing. A few weeks before Anthropic shipped Dreams, Andrej Karpathy published a markdown document on GitHub he called LLM Wiki. It is not software. It is a file that tells a coding agent how to keep its own notes — where to write things down, how to organize them, when to consolidate, what to throw out. The community of people using Claude Code picked it up almost immediately. I implemented it on my own setup and can confirm: the agent forgets less, ignores its own rules less, relitigates fewer decisions. The work didn’t improve because the model got smarter. The work improved because the agent now remembers.

Both designs share two structural moves worth pulling out. The first is that the reflection isn’t a tool you invoke. There is no button. At the end of every session, a small wrap-up runs. A separate operation runs periodically over the whole memory store to consolidate notes, remove stale claims, and resolve contradictions. In my setup, those are packaged as small reusable skills: /wrap-up at the end of each session, /lint-memory on a schedule. In Anthropic’s API, the equivalent is a Dream, dispatched asynchronously when the moment is right. Either way, the agent doesn’t have to remember to remember. The remembering happens.

The second is the file format. Neither version stores memory as a chat transcript. The memory is an indexed wiki — roughly one fact per file, cross-linked, browsable. When the agent needs context, it doesn’t replay the conversation. It looks up the entry. The distinction between a transcript and an index is the distinction between a record and a memory — a distinction Othman Gbadamassi develops at length in his recent essay on the Governance Memory System. A transcript tells you what was said on Tuesday. Memory tells you why a proposal was made, who shaped it before the vote, what happened six months later, and whether the same tension has surfaced three times under different names. Most organizations have records. Almost none have memory.
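
To make the transcript-versus-index distinction concrete, here is a minimal sketch in Python. It assumes a hypothetical layout in which each note holds one fact plus a list of links to other notes, and it keeps everything in a dict rather than in files. The names Note, MemoryIndex, lookup, and neighbors are mine, for illustration only; they are not taken from LLM Wiki or from Anthropic’s API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Note:
    """One fact per note, with links to related notes (hypothetical schema)."""
    slug: str
    fact: str
    links: list[str] = field(default_factory=list)


class MemoryIndex:
    """Stand-in for a file-per-fact wiki, held in a dict for this sketch."""

    def __init__(self) -> None:
        self._notes: dict[str, Note] = {}

    def add(self, note: Note) -> None:
        self._notes[note.slug] = note

    def lookup(self, slug: str) -> Note | None:
        # Direct lookup by entry name: no conversation replay involved.
        return self._notes.get(slug)

    def neighbors(self, slug: str) -> list[Note]:
        # Follow cross-links one hop, skipping links to notes that do not exist.
        note = self._notes.get(slug)
        if note is None:
            return []
        return [self._notes[s] for s in note.links if s in self._notes]


index = MemoryIndex()
index.add(Note("standup-time-limit", "Standups are capped at 15 minutes.", ["meeting-drift"]))
index.add(Note("meeting-drift", "Meetings without a hard stop tend to run long."))

related = [n.slug for n in index.neighbors("standup-time-limit")]
print(related)
```

The point of the sketch is the access pattern: the agent asks for an entry and its neighbors by name, instead of scanning everything that was ever said.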

This is where the story gets uncomfortable. The cleanup pass is going from speculation to neuroscience to product feature inside a single eighteen-month window — for AI agents. The version that organizations need is the one that still doesn’t exist.

Consider what happens when an organization does finally invest in self-reflection. A respected team I know recently spent several months on a governance retrospective for one of the larger DAOs. The cost ran well into six figures. Roughly thirty interviews. Polished data visualizations. I read the final report and I am not sure how much of it will translate into actual decisions. A friend who advises me opened it, glanced through, and told me she couldn’t process it. The team did not do the work badly. The format is the problem.

Retrospectives, the way organizations run them now, are projects. They have a budget, a proposal, a stakeholder list, a deliverable, and a political question about whose work will look worse after the findings come out. That is too much weight for an activity that ought to happen all the time. As a result, retrospectives are also the first ritual any team cancels when the calendar tightens. Two decades ago, Agile told us to do one after every sprint. Almost nobody does. And when organizations finally do invest in self-reflection, they make the investment so large that the loop between decision and consequence is broken before the data even arrives. By the time the report lands, the cohort that lived the story has half-rotated out.

The pattern Hoel, Karpathy, and Anthropic each describe is the structural alternative. Reflection ought to be a small thing that happens at the end of every cycle, indexed in a place the next decision can reach for, and periodically consolidated by an operation that runs whether or not anyone asked. The DAO retrospective I described above would be unrecognizable inside that model. There would be one at the end of every project, not one big one every few years. Each would produce a few atomic notes, not a hundred-page report. The aggregate would form a queryable history. The story of what the organization actually learned would no longer depend on whether the right person remembered to read this quarter’s deliverable.

This is not an obscure complaint. The gov/acc research cohort at Metagov surveyed 52 governance practitioners during their first phase. Eight raised institutional amnesia as a top problem, but those eight discussed it at the second-highest depth in the entire dataset, averaging 5.8 messages per conversation. The pattern they described was consistent: long-tenured contributors notice the same debates returning, the same mistakes recurring, the same wisdom evaporating each time a cohort rotates out. Newer participants did not flag the problem, because they have not yet been around long enough to see it loop. The people who feel institutional amnesia most acutely are the people who have stuck around. The report flags it as an expert concern. It is also a structural one — the kind that gets worse the larger and longer-running the organization is.

The advisor I quoted earlier named the tension I think this fix has to resolve. Retrospectives need to be safe. Postmortems in large organizations almost always degrade into a hunt for someone blameable, which is why teams stop running them — the agile version handles this by treating the retro as a closed room. But the learning then dies with the team. Agile retrospectives have no mechanism for synthesizing what was decided, testing whether the change worked, and feeding the result back into the organization as a whole. The lint operation, applied to organizational memory, is roughly the resolution. The raw notes stay in the team. The abstractions travel. A specific team can decide privately that their standup needs a hard time limit; the institution can pick up the more general lesson about which kinds of meetings degrade without one. The team keeps its safe room. The organization gets to learn anyway.

This is the part of the picture Harmonica is trying to fill in.

Harmonica already runs async, structured sessions — you describe what you want to learn, your team responds on their own time, the system synthesizes a summary. The piece I want to add next is the part that turns the output of each session into something an organization can accumulate. If a team runs a short retro at the end of every project cycle, the results should fall, by default, into a context layer the rest of the organization can browse. I want the first version of that layer to land in an Obsidian vault, because that is where most of the people thinking carefully about personal memory infrastructure already work, and because the file-per-fact, links-between-files model is the same one Karpathy’s LLM Wiki and Anthropic’s Dreams output both use. Each retrospective becomes a small set of atomic notes, linked to the project they came from, indexed by theme. A lint pass runs across the vault periodically and surfaces themes that keep recurring, decisions that contradict earlier ones, and obligations that have aged out without being closed. The retros themselves stay cheap enough to run every cycle. The compounding does the work that the six-figure project was supposed to do, and does it without anyone having to commission, defend, or vote it through.
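
A rough sketch of what such a lint pass could look like, in Python. This is not Harmonica’s implementation, which does not exist yet; the RetroNote schema, the lint function, and the thresholds (three projects, ninety days) are all assumptions made for illustration. It covers two of the three checks named above: recurring themes and obligations that have aged out. Spotting contradictory decisions needs semantic judgment and is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RetroNote:
    """One atomic note from a retro (hypothetical schema)."""
    project: str
    theme: str
    text: str
    created: date
    is_obligation: bool = False
    closed: bool = False


def lint(notes: list[RetroNote], today: date,
         min_recurrence: int = 3, max_age_days: int = 90) -> dict[str, list[str]]:
    """Return findings for human review. Never mutates or deletes notes."""
    # Count distinct projects per theme, so one long retro does not look like a pattern.
    projects_by_theme: dict[str, set[str]] = {}
    for n in notes:
        projects_by_theme.setdefault(n.theme, set()).add(n.project)
    recurring = sorted(t for t, p in projects_by_theme.items() if len(p) >= min_recurrence)

    # An obligation is stale if it is still open and older than the cutoff.
    cutoff = today - timedelta(days=max_age_days)
    stale = [n.text for n in notes if n.is_obligation and not n.closed and n.created < cutoff]
    return {"recurring_themes": recurring, "stale_obligations": stale}


notes = [
    RetroNote("alpha", "scope-creep", "Scope grew after kickoff.", date(2024, 1, 10)),
    RetroNote("beta", "scope-creep", "Requirements changed mid-cycle.", date(2024, 3, 5)),
    RetroNote("gamma", "scope-creep", "Late additions delayed launch.", date(2024, 5, 20)),
    RetroNote("beta", "onboarding", "Write an onboarding guide.", date(2024, 2, 1), is_obligation=True),
    RetroNote("gamma", "onboarding", "Assign a buddy to new joiners.", date(2024, 5, 25), is_obligation=True),
]
findings = lint(notes, today=date(2024, 6, 1))
print(findings)
```

In this sample data, scope-creep shows up in three separate projects and one onboarding obligation has sat open for more than ninety days, so those are the two things the pass surfaces. Everything else stays quiet.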

What gets unlocked downstream depends on the organization. For a software team it shows up as less repeated debate and faster onboarding when someone new joins. For a DAO it shows up as the ability to look at a new proposal and ask, before voting, whether the same tension has surfaced before. For a city government or a parliament it starts to look like what Othman’s Governance Memory System is reaching for — feedback loops between decisions and consequences, recurring themes surfaced rather than re-litigated, informal power made legible alongside the formal org chart. The output of a retro is no longer only for the team that produced it. It also becomes the upstream material for policy, for proposals, for the next round of design.

One thing I want to be careful about. The aspiration here is not a magical context layer that no human ever has to touch. Othman’s framing is the right one: the AI is the engine, humans are the rudder. The agents extract structure. People decide what stays, what gets archived, what shapes the next decision. The lint operation in my own setup is a useful model — it doesn’t delete entries on its own. It flags them, surfaces conflicts, and asks. Anthropic’s Dreams API makes the same design call: the input memory store is never modified by the dream, and the output is a separate artifact the operator can review and either adopt or discard. The pattern is consistent across every place this is being built. The cleanup pass surfaces structure. Human judgment decides what counts. The machinery just makes that judgment cheap enough to apply often, which is the difference between an organization that learns and one that runs the same retrospective every five years for six figures.
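
The design call described here, where the cleanup proposes and a person decides, can be sketched in a few lines of Python. This does not reproduce Anthropic’s Dreams API or my own lint skill; Entry, Proposal, consolidate, and adopt are hypothetical names, and the version field is an assumed way of ordering entries by recency.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    key: str
    value: str
    version: int  # higher means more recent (assumed ordering)


@dataclass
class Proposal:
    store: list[Entry]
    flags: list[str]


def consolidate(store: list[Entry]) -> Proposal:
    """Build a proposed store plus review flags. The input list is left untouched."""
    latest: dict[str, Entry] = {}
    flags: list[str] = []
    for entry in store:
        current = latest.get(entry.key)
        if current is None:
            latest[entry.key] = entry
        elif entry.value == current.value:
            # Duplicate: merge silently, keeping the most recent copy.
            latest[entry.key] = max(current, entry, key=lambda e: e.version)
        else:
            # Conflict: propose the newer value, and flag it for a human.
            newer, older = sorted((current, entry), key=lambda e: e.version, reverse=True)
            latest[entry.key] = newer
            flags.append(f"{entry.key}: '{older.value}' superseded by '{newer.value}'")
    return Proposal(store=list(latest.values()), flags=flags)


def adopt(proposal: Proposal, approved: bool, original: list[Entry]) -> list[Entry]:
    # The human decision is the only thing that swaps stores.
    return proposal.store if approved else original


store = [
    Entry("deploy-day", "Friday", 1),
    Entry("deploy-day", "Tuesday", 2),
    Entry("reviewers", "two", 1),
    Entry("reviewers", "two", 3),
]
proposal = consolidate(store)
final = adopt(proposal, approved=False, original=store)
print(proposal.flags)
```

Rejecting the proposal hands back the original store unchanged, which is the property that makes the pass cheap to run often: a bad consolidation costs a glance and a "no", not a recovery effort.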

If you’re already using something like LLM Wiki to build memory into your own agent setup, you’ve probably noticed it works better at the individual level than at the team level. That is because the team-level version of this doesn’t fully exist yet. The thing I am building toward at Harmonica is the version that takes a group through the same kind of reflection an agent does at the end of its session, and writes the lessons somewhere everyone can find them later. If that resonates and you’d like to be early on it, get in touch — the longer arc is interesting to several audiences at once. Civic-tech and governance people who care about institutional memory at the polity scale. Change management consultants who want their client work to compound. Software teams who want their sprint retrospectives to actually feed anything downstream. Communities trying to learn faster than their contributors rotate out. The shape of the problem is the same in each case. Only the deployment context changes.

Organizations that don’t build this rhythm will keep paying six figures every few years to remember what they already learned.