Curing the doom loop

Your local agents keep relearning what they already figured out, and you pay for it every session. Here's how darkmux makes the lessons stick.

Jun 25, 2026

A quick recap, and what’s changed

If you have followed this series, you know darkmux as the tool I have been building to run local AI as real infrastructure rather than a toy. The short version, for anyone arriving fresh: darkmux is a local-AI orchestrator. It manages the models you have loaded, dispatches coding work to them inside bounded containers, drives each one through an agent loop, and records the whole run so you can see what happened and rerun it. The series so far has been about making that loop trustworthy on local hardware, where the models are smaller and the failure modes are louder than anything you hit against a frontier API.

The last few releases pushed hard on that. There is now a loop bench that runs a single dispatch under a chosen configuration and tells you how the loop behaved. There are detectors that watch a live run and flag the pathologies (cycles, looping reasoning, failure cascades) as they happen. The dispatch-to-PR path can take a task, hand it to a local model, review the result, and gate on it. The tool grew from a model-swapper into something that actually operates a local crew.

Two habits run underneath all of that, and they are the reason this post can show its work instead of just asserting it. darkmux records every dispatch as a rerunnable lab run, so a claim about behavior is a command you can repeat, not a story you have to trust. And the design itself is kept in a lab notebook: a chronological record of what each session actually figured out, with an index of the moments where an earlier belief turned out to be wrong. Everything below, the decisions and the one honest non-result at the end, is drawn from that notebook and backed by those runs.

Which raised the obvious question: of everything left to build, what comes next? This post is the answer, and the why behind it.

Why the doom loop jumped the queue

The thing that moved it to the front was a piece of writing, not a backlog vote. Brandon Waselnuk, writing at Unblocked, published “The AI Agent Doom Loop”, and it named something I had been watching darkmux’s own dogfooding hit over and over without having a clean word for it.

The doom loop is this. An agent opens your codebase, makes a mistake, gets corrected, and ships. The next agent opens the same codebase and makes the same mistake. So does the one after that. Each session starts from zero, because the correction never survived the context reset. The article’s diagnosis is bracing in how little it is about the model: the fix is not a smarter agent or a bigger rules file. It is a structural change. Give the agent institutional context before it writes a line, and make that context automatic, current, scoped to the work in front of it, and carrying the why. The agents are not getting dumber. They are amnesiac, and amnesia repeats.

I want to give that article its due, because it did the hard part: it turned a vague irritation into a named problem with a shape. darkmux’s dogfooding had been showing me the symptom. The article gave me the diagnosis. I am not claiming I got there first, and I am not claiming the article and I disagree. The honest and more interesting story is convergence: an outside team writing about the problem in general, and a local-AI tool hitting it in particular, landed on the same prescription. This post builds that prescription, decision by decision, for a small local model, where the constraint is sharpest and the answer looks least like the cloud playbook.

The failure has a shape: the runaway search

Here is what the doom loop looks like from inside a dispatch, and why it bites a small local model harder than a frontier one. You hand a small model a task in a repo it has never seen. It can read files, but it has no map: it does not know where the relevant code lives or why the project is shaped the way it is. So it goes hunting. It searches, reads a file, searches again, re-reads the first because the context already rolled past it. It is not making progress. It is looking for where things are, and on a tight window the hunt is expensive. Every search and re-read spends tokens the model needed for the work. darkmux’s cycle detector is built to catch exactly this: the same call landing again and again, the runaway search that is really an agent compensating for what it was never told.

I do not have to reach for a hypothetical, because darkmux hit this while building this feature, and the whole run is on the record in its own flow stream. The task had two parts: copy a small path-normalizing helper, and make the brief-assembly code filter injected context down to the files in play. The model knew where the helper was, because the task said so. What it did not know was where the filtering logic lived, or how “files in play” is even determined. So it went looking. The flow stream caught what happened: thirty-three search calls, twenty file reads, zero edits. The cycle detector fired twice, “search called 3× in the last 10 tool calls,” and after roughly ten minutes the dispatch errored without writing a single line of the actual fix. That is the doom loop, captured live. Not a dumb model, a blind one, spending its entire budget looking for where to begin.
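The detector message quoted above suggests a sliding-window count, and that idea is small enough to sketch. What follows is a minimal illustration, not darkmux's actual detector: the `CycleDetector` type, its fields, and the choice to key on a plain string are all assumptions made for the example.

```rust
use std::collections::VecDeque;

/// Hypothetical sliding-window cycle detector. Names, window size, and
/// threshold are illustrative assumptions, not darkmux's real code.
pub struct CycleDetector {
    window: VecDeque<String>,
    window_size: usize,
    threshold: usize,
}

impl CycleDetector {
    pub fn new(window_size: usize, threshold: usize) -> Self {
        let window_size = window_size.max(1); // guard against an empty window
        Self {
            window: VecDeque::with_capacity(window_size),
            window_size,
            threshold,
        }
    }

    /// Record one tool call. Returns a caution message when `call_key`
    /// has appeared at least `threshold` times in the current window.
    pub fn observe(&mut self, call_key: &str) -> Option<String> {
        // Drop the oldest call once the window is full.
        if self.window.len() == self.window_size {
            self.window.pop_front();
        }
        self.window.push_back(call_key.to_string());

        let count = self
            .window
            .iter()
            .filter(|k| k.as_str() == call_key)
            .count();

        if count >= self.threshold {
            Some(format!(
                "{} called {}× in the last {} tool calls",
                call_key, count, self.window_size
            ))
        } else {
            None
        }
    }
}
```

With `CycleDetector::new(10, 3)`, the third `observe("search")` inside any ten-call window returns a caution. This sketch fires on every qualifying call; a production detector would presumably debounce so one long hunt does not produce dozens of identical records.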

So the finding did not stay in my head. It became the first entry in darkmux’s own lessons store, the one managed through darkmux lessons, written straight off that failed run:

The code that filters injected context belongs in src/mission_run.rs. The normalize_path_lexical helper to reuse already exists at runtime/src/loop_runner.rs:1432. Copy that small pure function; do not go searching loop_runner.rs for the filtering logic. It is the in-container agent loop, not brief assembly.

Then darkmux re-dispatched the same task to the same model, with that lesson now in the brief. The brief grew from about three hundred characters to roughly two thousand: the task, plus the cautions the first run’s own detector firings had generated, plus the distilled lesson. This time the model went to the right files, made its edits, and the dispatch completed and passed its gate. It still searched some. One sentence did not turn it into a different model. But it broke the loop that had stalled the first run cold. The difference between an error after ten minutes and a finished change was a single line about where the code lived.

There is a postscript that is the whole thesis in miniature. That first lesson got the model onto the right files, and the work it then did exposed a second gap. The new filtering code tried to read “files in play” from the git diff, which is empty at brief-assembly time because the worktree is freshly checked out from the base. A real regression, and again the root cause was missing institutional knowledge, not a slow model. That became the next lesson. One supplied piece of context bought enough reach to surface the next missing piece. The cure does not just stop a loop. It compounds. Each thing the agent is handed lets it get far enough to teach you the one you did not yet know to write.

A frontier model with a huge window can absorb that first flailing and still finish. A small local model cannot. It anchors on whatever is in front of it and burns its budget rediscovering, every session, the things the last session already learned. The missing piece is not intelligence. It is institutional context: where things are, and why they are that way. That is the thing to deliver, and the rest of this post is how darkmux delivers it.

The loop is short enough to say in four words: detect, distill, inject, don’t repeat.

Most of what follows is a series of engineering decisions, each one pulled out of taking that loop seriously. Several of them I built wrong on the first cut and had to tear out once the failure mode showed itself. That is the honest version.

Detections are evidence; knowledge is the conclusion

darkmux already watches its own dispatches. The detectors catch the loop pathologies as they happen, the runaway search among them, and each firing lands as a record in darkmux’s flow stream. Those records are evidence. They are raw, high-frequency, and perishable.

The durable thing is not the firing. It is the lesson you distill out of the firings: what the runs kept getting wrong, what to do instead, and why. The detector points at where to look. A human, or the frontier orchestrator standing in for one, supplies what the lesson is. darkmux keeps the two as separate concepts on purpose: the raw firings are cautions, the distilled conclusions are lessons. Evidence becomes a conclusion, and the conclusion is the part worth keeping.

That distinction sounds academic until you try to store both in one place.

Two stores, because two kinds of writing

The two kinds of writing have opposite shapes. Cautions are machine-generated, arrive dozens to a run, and are write-once: a detector fires, a record gets appended, and nothing ever goes back to mutate it. Lessons are human-curated, rare, and edited in place over time. Those are not two flavors of the same store. They are an append-only log and a mutable database, and the worst thing you can do is pretend they are one file that a churning run and a careful human both write to. That file is a race condition with a corruption problem stapled to it.

So darkmux keeps them in two stores joined by a distillation step, each matched to its write shape. The cautions ride the flow stream, which is append-only JSONL: concurrency is a non-problem there because nobody ever rewrites an existing line, they only append new ones, and an append is the one file operation that composes safely under contention. The lessons get their own store with a real write path, which is the next section, because that is where the engineering actually is. There is a small pleasure in the vocabulary too: a spacecraft has a Caution and Warning System, and the word landed on theme before anyone reached for it.
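The append-only shape is easy to show in miniature. This is a sketch under stated assumptions: the record fields (`kind`, `detector`, `message`) and the function names are hypothetical, and the JSON is hand-formatted only to keep the example dependency-free.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Append one caution as a single JSONL line. The field names are
/// assumptions for illustration, not darkmux's real record schema.
pub fn append_caution(path: &Path, detector: &str, message: &str) -> io::Result<()> {
    let line = format!(
        "{{\"kind\":\"caution\",\"detector\":\"{}\",\"message\":\"{}\"}}\n",
        json_escape(detector),
        json_escape(message)
    );
    // Open in append mode: existing lines are never read back or rewritten.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // Hand the whole line over in one call so a record is not built up
    // from several small writes.
    file.write_all(line.as_bytes())
}
```

The writer never parses or rewrites what is already in the file, which is the property the flow stream relies on. Contrast that with the read-parse-mutate-serialize cycle a single JSON document would need for every new entry.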

A memory you can lose to one bad write isn’t memory

The lessons store wanted to be a hand-editable JSON file. That is how the rest of darkmux stores config, it diffs cleanly in git, and it keeps the operator sovereign over the file. I started there, looked at the actual write pattern, and it killed the idea.

Lessons are detection-driven, so they land while runs are churning: a long mission, parallel dispatches, the store being written in the middle of all of it. A JSON document has no concurrency story to offer. Adding one entry means reading the whole file, parsing it, mutating the tree, and serializing it back. Two of those happening at once is a lost write: the last writer wins over the top of the other. Worse is the partial write: a process killed mid-serialize does not corrupt the entry it was adding, it corrupts the entire document, because JSON is one balanced tree and half a tree will not parse. And there is no migration path. The day the shape changes, every file already on disk is a hand-edit or a one-off script.

So the store is SQLite, for the boring-on-purpose substrate reasons. It opens in WAL mode, so a read can proceed while a writer holds the log instead of the two serializing against each other. It sets a busy_timeout before it takes any other lock, so a dispatch that arrives to read the store mid-write blocks for up to two seconds and then gets its read, rather than failing or coming back empty, which is the silent failure that would quietly defeat the entire point. Writes are transactional, so a process killed mid-write rolls back to the last consistent state rather than shredding the store. And PRAGMA user_version carries a real schema number on the file itself, so a shape change becomes a numbered migration instead of a prayer. The cost is that you edit through verbs, darkmux lessons add/edit/remove/export/import, instead of opening the file in an editor. For a store whose job is to survive the concurrency that produces its own contents, that is the trade you take without flinching. Durability is not a nice-to-have for institutional memory. It is the definition. A memory you can lose to one bad write is not memory.

Knowledge that doesn’t bleed

The next decision was scope. If every lesson is global, then an agent working in your weekend side project gets briefed with the conventions of your day-job repo. That is noise at best and actively wrong at worst, and “remember to tag each lesson with where it applies” is exactly the kind of discipline that fails the first busy week.

So the boundary is structural, on the model git already taught everyone. The darkmux lessons store is two tiers of SQLite: a per-repo database at <repo>/.darkmux/lessons.db for the engagement’s own rules, and a user-global one at ~/.darkmux/lessons.db for the things true everywhere, like house style. The per-repo tier resolves against the project root relative to where the dispatch runs, not against an ambient “current” pointer that could drift, so a coder dispatched into repo X opens X’s database and only X’s. Brief injection reads both tiers and unions them. Nothing is global by default; a lesson becomes global only by opting in with lessons add --global. Repo Y’s lessons cannot reach a dispatch in repo X, because the file holding them is never on the path that dispatch walks. Bleed is impossible by construction, not by remembering to be careful.
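The two-tier lookup can be sketched as a pure path computation. The database locations are the ones named above; everything else here is an assumption for illustration, including the function names and the use of a `.git` entry as the project-root marker.

```rust
use std::path::{Path, PathBuf};

/// Walk up from the dispatch directory to the nearest ancestor that
/// contains `.git`. Treating `.git` as the project-root marker is an
/// assumption of this sketch.
pub fn find_project_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}

/// Return the lesson databases a dispatch would read: the per-repo tier
/// (if the dispatch runs inside a repo) followed by the user-global tier.
/// `home` is passed in explicitly so the function stays testable.
pub fn lesson_db_paths(dispatch_dir: &Path, home: &Path) -> Vec<PathBuf> {
    let mut paths = Vec::new();
    if let Some(root) = find_project_root(dispatch_dir) {
        paths.push(root.join(".darkmux").join("lessons.db"));
    }
    paths.push(home.join(".darkmux").join("lessons.db"));
    paths
}
```

A dispatch running anywhere under repo X resolves to X's database plus the global one. Repo Y's file never appears in the list, because nothing in the walk from the dispatch directory upward passes through Y.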

Knowledge is cultivated in a ceremony

A feature that says “run the distiller whenever you remember” will never run, because it depends on memory and memory is the thing we are trying to replace. The cultivation point has to be reliable, and in darkmux the reliable point is the close of a mission. darkmux mission debrief is the ceremony: the work is done, you look back, you bank what was learned.

The debrief does something clean. A mission accumulates perishable things while it runs: the reviewer’s corrections on individual steps, the detectors’ raw cautions. When the mission closes, those evaporate. The debrief is where you sift them and decide which one is a durable lesson worth carrying into every future mission. The corrections fix this mission. The debrief banks the lessons for all of them.

This is not a new idea. NASA runs the Lessons Learned Information System, an agency-wide database of the official, reviewed lessons from its programs and projects. Each entry, in NASA’s words, includes “a summary of the original driving event and recommendations,” indexed so the next project can retrieve it. That is the carry-the-why shape exactly, the driving event paired with what to do about it, plus the reuse step. It is the authentic version of the thing every software team gestures at with the word “retrospective.”

Naming, because half a metaphor is just inconsistency

Which brings me to a darkmux decision that looks cosmetic and is not. The tool already had a Mission and a Crew. The new ceremony needed a name, and the easy choice was “retrospective,” because it is familiar from agile.

The argument against it is the article in miniature. Metaphors endure when they are coherent and relatable, not when they are literal. Xerox PARC and early Apple gave us the Desktop, the Trash, and Files. None of those things are inside the computer. They lasted for forty years because the metaphor was complete and people already understood it from the world. “Retrospective” drags Scrum baggage in and fractures the metaphor for anyone who does not live in a sprint. Committing to the whole thing, Mission to Crew to Debrief to Lessons, makes darkmux own its metaphor on purpose.

We are not sending rockets to the moon. But software ships with a rocket emoji, because the metaphor lands. Half a metaphor is just inconsistency. A whole one is something a person can hold.

There was a smaller catch inside this one, worth confessing. I had argued to reuse an existing darkmux lifecycle stage name for “consistency.” The trouble was that the existing name was an unconsidered placeholder nobody had ever defended. Consistency with a placeholder is no consistency at all. “It is already there” is not an argument.

What fits, not what you know

Now the small-model constraint comes back to collect. darkmux has a relevant store of cautions and lessons, scoped to the repo, distilled with their why. How much of it does it put in front of a model with a tight window?

For a frontier model you would not really have to ask. For the local model darkmux is dispatching to, the window is the binding constraint, so the question is not “what do we know” but “what fits, and which thing must never be the one that gets dropped.” The naive answer is a fixed cap, top twelve lessons, top ten cautions. That breaks the moment the window changes: the same cap that fits a 32K model wastes most of a 128K one and blows out an 8K one.

So the budget is proportional, not fixed. darkmux takes a fraction of this dispatch model’s context window, default fifteen percent and operator-tunable, and converts it to a character budget at roughly four characters per token. The approximation is deliberate: it avoids a per-model tokenizer dependency, and it is cheap and good enough for a budget. Then it allocates that budget across the three blocks, corrections, cautions, and lessons, with a floor under each so the loudest source cannot starve the others. The highest-authority signal, an explicit operator correction, holds its floor before the cheap, high-volume, auto-detected cautions can flood it out, and each block is taken newest-first so what survives the cut is the freshest. With a small model the constraint is not what you know. It is what fits, and which one you refuse to let fall out.
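The arithmetic is worth seeing once. This is a minimal sketch of a proportional budget with per-block floors, under stated assumptions: the `Block` type, the function names, and the two-pass allocation order are hypothetical, and only the fifteen percent default and the four-characters-per-token conversion come from the description above.

```rust
/// Roughly four characters per token, the approximation described above.
const CHARS_PER_TOKEN: usize = 4;

/// Convert a share of the context window into a character budget.
pub fn char_budget(context_window_tokens: usize, percent: usize) -> usize {
    context_window_tokens * CHARS_PER_TOKEN * percent / 100
}

/// One injectable block. Entries are assumed ordered oldest to newest.
pub struct Block {
    pub entries: Vec<String>,
    pub floor: usize,
}

/// Split `budget` across blocks listed in priority order (for example
/// corrections, cautions, lessons). Pass 1 grants each block up to its
/// floor; pass 2 hands whatever is left to blocks in priority order.
pub fn allocate(budget: usize, blocks: &[Block]) -> Vec<usize> {
    let demand: Vec<usize> = blocks
        .iter()
        .map(|b| b.entries.iter().map(|e| e.chars().count()).sum::<usize>())
        .collect();
    let mut alloc = vec![0usize; blocks.len()];
    let mut remaining = budget;

    for (i, block) in blocks.iter().enumerate() {
        let grant = block.floor.min(demand[i]).min(remaining);
        alloc[i] = grant;
        remaining -= grant;
    }
    for i in 0..blocks.len() {
        let grant = (demand[i] - alloc[i]).min(remaining);
        alloc[i] += grant;
        remaining -= grant;
    }
    alloc
}

/// Keep the newest entries that fit within `allowance` characters.
pub fn take_newest(entries: &[String], allowance: usize) -> Vec<String> {
    let mut used = 0;
    let mut kept = Vec::new();
    for entry in entries.iter().rev() {
        let len = entry.chars().count();
        if used + len > allowance {
            break;
        }
        used += len;
        kept.push(entry.clone());
    }
    kept
}
```

For a 32,768-token window at fifteen percent, `char_budget` gives 19,660 characters. In a toy case with a 100-character budget, floors of 20, and demands of 30, 500, and 60 characters, the allocation comes out as 30, 50, and 20: the flood of cautions takes most of the leftover, but the lessons block still keeps its floor.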

The rule that replaces the stale rules file

The article is blunt about this: “Rules files can’t encode reasoning. They can’t explain why your team chose a particular database migration strategy after a production incident.” A rules file rots, and a rotten rule lies with confidence. If darkmux’s cure for the doom loop were just a different file of rules, it would inherit the same disease.

So darkmux mechanizes the staleness check instead of promising freshness, and the mechanism is a content hash, not a timestamp, because a timestamp lies the moment a file is touched without being meaningfully changed. When a detector fires on a file, the runtime hashes that file’s contents at the moment of firing, inside the container where the work is happening, and stamps the hash onto the caution record. That is the part that has to happen at fire time: once the dispatch ends the original bytes are gone, so a hash captured later would be hashing the wrong thing. Later, when a caution or a lesson keyed to that file is up for injection, darkmux re-hashes the file as it is now and compares. Match, and the entry is presented as current. Mismatch, and it is ranked down, because the code it was about has moved underneath it. The rule that replaces the stale rules file is built so it cannot quietly become one.
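The stamp-then-compare step fits in a few lines. This is a sketch only: the `Caution` struct, `stamp`, and `check` are hypothetical names, and FNV-1a stands in as a dependency-free placeholder for whatever content hash darkmux actually uses. A real store would more likely want a cryptographic hash such as SHA-256.

```rust
/// FNV-1a 64-bit. A stand-in chosen only because it needs no crates;
/// it is an assumption of this sketch, not darkmux's real hash.
pub fn content_hash(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

#[derive(Debug, PartialEq)]
pub enum Freshness {
    Current,
    Stale,
}

/// A caution record carrying the hash of the file as it was at fire time.
pub struct Caution {
    pub file: String,
    pub message: String,
    pub hash_at_fire: u64,
}

/// Called when a detector fires: hash the bytes while they still exist.
pub fn stamp(file: &str, message: &str, contents_at_fire: &[u8]) -> Caution {
    Caution {
        file: file.to_string(),
        message: message.to_string(),
        hash_at_fire: content_hash(contents_at_fire),
    }
}

/// Called at injection time: re-hash the file as it is now and compare.
pub fn check(caution: &Caution, contents_now: &[u8]) -> Freshness {
    if content_hash(contents_now) == caution.hash_at_fire {
        Freshness::Current
    } else {
        Freshness::Stale
    }
}
```

Touching a file without changing its bytes leaves the hash unchanged, so the entry stays `Current`. That is exactly the case where a timestamp comparison would have reported a false change.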

The discipline that makes it trustworthy

Every slice of this went through a fresh-context, higher-tier review before it merged, the same recheck-versus-rethink discipline darkmux uses on its own dispatch-to-PR loop. It earned its keep twice in this one build. The review caught a test that was quietly reading my real flow stream during unrelated runs, which made those tests machine-dependent and slow. And it caught a missing wait-for-the-writer timeout that would have let a dispatch reading the lessons store mid-write come back empty, silently defeating the exact concurrency the database was chosen to handle.

Neither of those was “looks fine, ship it.” My own tests passed on my own work in both cases. The catch came from a different reader, in a fresh context, which is the only kind of review that reliably catches the thing the author already convinced themselves about.

The honest measurement, on two levels

Here is the part I am most careful about, because it is where it would be easiest to overclaim. darkmux is a tool that benchmarks its own loops, so the cure has to be measurable, not asserted, and there are two measurements, on two different levels of rigor.

The strongest one already happened, and I walked you through it: the same task, on the same model, errored after ten minutes of runaway search with no lesson and finished cleanly with one. That is a behavior receipt. Memory that bore on the task moved the outcome from failure to done. The honest caveat is that it was a field before-and-after, not a sterile experiment. It lives in the flow stream as two real dispatches. The second brief grew by the cautions and the lesson together rather than the lesson alone, and I am reading the cause partly from having watched it happen. An n of one, in the wild. Compelling, not clinical.

So I also reached for the clinical version. darkmux lab loop --ab runs the same task twice, once with the injected lessons and cautions and once without, and reports whether the verdict moved. I ran it. The verdict shift was: no change.

That is not a contradiction with the field result. It is the precise point, seen from the other side. The only lessons on hand for the controlled run were about darkmux’s own internals, and the test task was unrelated to them. Off-topic memory should not move behavior, and it did not. Put the two together and they say the same thing from both directions: institutional memory moves behavior exactly when it bears on the task, and not otherwise. The field run had on-topic memory and the outcome moved hard. The controlled run had off-topic memory and the outcome held still.

It helps to name the rungs. There is a premise receipt, that the failure mode is real: the first dispatch is the doom loop, captured live. There is a harness receipt, that the mechanism runs end to end: context built, injected, measured, on the record. And there is a behavior receipt, that with the right memory the verdict actually moves, earned in the field on darkmux’s own code. The one rung still worth nailing down is the sterile version of that last one: a clean A/B whose injected lessons genuinely bear on the task. The credibility is in saying which rung each piece of evidence stands on, instead of letting a field win wear a lab coat.

The cure isn’t asserted; it’s benchmarked

The doom loop is amnesia, and darkmux’s cure is institutional memory: detected as the agents work, distilled with its why through mission debrief, scoped so it does not bleed across repos, stored durably enough to survive the concurrency that produces it, budgeted to fit a small window, and stale-checked so it cannot lie. None of that is exotic. It is mostly the discipline of taking one short loop seriously and refusing, at each step, to ship the weaker version that was easier.

And because it lives in a tool that measures its own loops, the cure is not a claim at the end of an essay. It is a delta you can rerun with darkmux lab loop --ab. That is the move worth stealing, whatever you are building agents on top of: do not assert that your loop learned something. Build the bench that tells you whether it did, and be willing to read “no change” out loud.

Darkly Energized Blog
