Replay-deterministic LLM agents: why your agent runs need a ledger
LLM agent frameworks obsess over prompts and graphs while ignoring the unglamorous part: what happens when a 40-minute run crashes at minute 35. The answer is older than AI — make the run a transaction.
The failure mode nobody demos
Every agent framework demo looks the same: a task goes in, agents collaborate, a result comes out. What the demos never show is minute 35 of a 40-minute run, when the process dies — laptop sleeps, rate limit cascades, OOM, a deploy restarts the box. Now answer honestly:
- Which steps actually completed?
- If you restart, which steps will silently run twice — and double-spend?
- Can you prove the review step ran before the merge step?
For a chat toy, nobody cares. For agents that edit production code, touch customer data, or spend real money per step, these are the only questions that matter. And prompt engineering can’t answer them — they’re transaction-processing questions, and databases solved them decades ago. The fix isn’t a smarter agent. It’s making the run itself a transaction.
Mechanism 1: one timestamp, threaded everywhere
The root cause of non-replayable runs is ambient state: every
Date.now(), every Math.random(), every read of
“whatever the world looks like right now” makes the run impossible to
reproduce. loom’s kernel captures one timestamp token per state-machine
tick and threads it through every call. The token is persisted with the
run and replayed verbatim. The ban on ambient clock reads isn’t a
convention; it’s enforced by a machine-checked lint rule in the kernel.
The payoff is subtle but powerful: the same (state, timestamp, ledger) triple always produces the same trajectory. A recorded run becomes an artifact you can re-execute against changed rules, so “would the new safety invariant have caught last week’s incident?” is a query, not a thought experiment.
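A minimal sketch of the idea, not loom's actual kernel code: the clock is read once per tick at the boundary, recorded, and passed into a pure transition function. The names `Tick`, `RunState`, `step`, `runLive`, and `replay` are hypothetical and chosen only for illustration.

```typescript
// Hypothetical names throughout; this illustrates timestamp threading, not loom's API.
type Tick = { now: number; input: string };
type RunState = { step: number; log: string[] };

// Pure transition: the only clock value it ever sees is the one passed in.
function step(state: RunState, tick: Tick): RunState {
  const stamp = new Date(tick.now).toISOString();
  return {
    step: state.step + 1,
    log: [...state.log, `${stamp} ${tick.input}`],
  };
}

// Live run: read the clock once per tick, record the tick, then step.
function runLive(inputs: string[], clock: () => number, recorded: Tick[]): RunState {
  let state: RunState = { step: 0, log: [] };
  for (const input of inputs) {
    const tick: Tick = { now: clock(), input }; // the single ambient read per tick
    recorded.push(tick);                        // persisted alongside the run
    state = step(state, tick);
  }
  return state;
}

// Replay: feed the recorded ticks back verbatim; no clock is consulted.
function replay(recorded: Tick[]): RunState {
  return recorded.reduce<RunState>((s, t) => step(s, t), { step: 0, log: [] });
}
```

In a live run the `clock` argument would typically be `Date.now`; during replay nothing reads the clock at all, so `replay(recorded)` rebuilds the same state as the original run.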
Mechanism 2: the idempotency ledger
Crash recovery in most agent frameworks is some flavor of “we checkpoint sometimes, you re-run and hope.” The reliable version is boring: every effect gets a ledger row, and the row is committed in the same database transaction as the state change it represents. Either both exist or neither does — there is no window where the work happened but the record didn’t.
Recovery then stops being a feature and becomes a non-event: restart the process, replay the inputs, and the ledger absorbs every step that already ran. No double work, no double spend, no “did the implementer already commit?” archaeology. This is how payment systems have handled retries forever; agent runs that cost dollars per step deserve the same treatment.
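The pattern can be sketched as follows, under loud assumptions: `Store` is a hypothetical in-memory stand-in for a transactional database, and `applyOnce` and the key format are invented for illustration. Real code would run inside an actual database transaction; the point here is only that the ledger row and the state change are published together or not at all.

```typescript
// Hypothetical in-memory stand-in for a transactional database; not loom's storage code.
type Db = { state: Record<string, number>; ledger: Map<string, string> };

class Store {
  private db: Db = { state: {}, ledger: new Map() };

  // Run fn against a private copy; publish the copy only if fn returns normally.
  transaction<T>(fn: (tx: Db) => T): T {
    const tx: Db = { state: { ...this.db.state }, ledger: new Map(this.db.ledger) };
    const result = fn(tx); // a throw here leaves this.db untouched (rollback)
    this.db = tx;          // single assignment: both changes land or neither does
    return result;
  }

  snapshot(): Db {
    return this.db;
  }
}

// Apply a state change at most once per idempotency key.
function applyOnce(
  store: Store,
  key: string,
  effect: (state: Record<string, number>) => string,
): "applied" | "skipped" {
  return store.transaction<"applied" | "skipped">((tx) => {
    if (tx.ledger.has(key)) return "skipped"; // committed by an earlier attempt
    const summary = effect(tx.state);          // the state change...
    tx.ledger.set(key, summary);               // ...and its ledger row, same transaction
    return "applied";
  });
}
```

Replaying a step after a crash calls `applyOnce` with the same key; if the earlier attempt committed, the call returns `"skipped"` and the state is left alone. If the earlier attempt died mid-transaction, neither the row nor the change exists, so the step runs normally.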
Mechanism 3: invariants at commit time, not in the prompt
The third mechanism is the one prompt engineering pretends to provide.
Telling an agent “don’t approve your own work” is a suggestion. Encoding
it as an invariant that runs inside the database transaction —
and rolls the transaction back on violation — is a guarantee. loom’s
code bundle ships rules like:
- acceptance cannot pass while a blocking finding is open;
- if an agent modified the tests it’s judged by, the final gate must be human-approved.
The agent can argue with the prompt all it wants; it cannot argue with a rolled-back transaction. That’s the difference between alignment by persuasion and safety by construction.
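As a rough illustration of commit-time checking, here is a sketch in which every mutation is applied to a draft, all invariants are evaluated, and the draft is discarded on any violation. The record shape, the rule names, and `commit` are assumptions made for this example; they mirror the two rules listed above but are not loom's real schema or code.

```typescript
// Hypothetical shapes and names; not loom's actual schema.
type Finding = { id: string; blocking: boolean; open: boolean };
type RunRecord = {
  findings: Finding[];
  acceptance: "pending" | "passed";
  agentModifiedTests: boolean;
  finalGateApprovedBy: "human" | "auto" | null; // null: gate not decided yet
};
type Invariant = { name: string; holds: (r: RunRecord) => boolean };

const invariants: Invariant[] = [
  {
    // Acceptance cannot pass while a blocking finding is open.
    name: "no-pass-with-open-blocker",
    holds: (r) => r.acceptance !== "passed" || !r.findings.some((f) => f.blocking && f.open),
  },
  {
    // If the agent modified its own tests, the final gate may not be auto-approved.
    name: "human-gate-if-tests-modified",
    holds: (r) => !r.agentModifiedTests || r.finalGateApprovedBy !== "auto",
  },
];

// Apply the mutation to a copy, check every invariant, and only then publish the copy.
function commit(current: RunRecord, mutate: (draft: RunRecord) => void): RunRecord {
  const draft: RunRecord = { ...current, findings: current.findings.map((f) => ({ ...f })) };
  mutate(draft);
  for (const inv of invariants) {
    if (!inv.holds(draft)) {
      // Rollback: the draft is dropped and `current` was never touched.
      throw new Error(`invariant violated: ${inv.name}`);
    }
  }
  return draft;
}
```

Because the check sits in the commit path rather than in the prompt, a violating write never becomes visible: the caller gets an error and the previous record stays in place.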
“Isn’t this just Temporal?”
Close — and the comparison is a compliment. Temporal made workflow code
deterministic and replayable for microservices, and it’s the right tool
for that job. But agent work has different physics: the expensive,
nondeterministic unit is an LLM call, the critical control point is a
human decision, and the deployment target is a developer’s
machine, not a cluster. loom takes Temporal’s discipline — determinism,
event-sourced state, idempotent effects — and rebuilds it agent-shaped:
human gates as a first-class primitive with policies (human /
on-blockers / auto), safety invariants in the
commit path, and the whole thing in one SQLite file in your
repo. No cluster, no cloud, no telemetry.
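One way to read the three gate policies is sketched below. The semantics are an assumption, not taken from loom's documentation: `human` is read as always waiting for a person, `on-blockers` as waiting only when blocking findings are open, and `auto` as never waiting. `resolveGate` and `GateDecision` are hypothetical names.

```typescript
// Assumed semantics for the three policy names; check loom's own docs for the real behavior.
type GatePolicy = "human" | "on-blockers" | "auto";
type GateDecision = "wait-for-human" | "proceed";

// Decide whether a gate pauses the run, given the policy and the open blocking findings.
function resolveGate(policy: GatePolicy, openBlockers: number): GateDecision {
  switch (policy) {
    case "human":
      return "wait-for-human"; // always pause for a person
    case "on-blockers":
      return openBlockers > 0 ? "wait-for-human" : "proceed"; // pause only if something blocks
    case "auto":
      return "proceed"; // never pause
  }
}
```

Keeping the decision a pure function of recorded inputs means a gate outcome can be replayed like any other step.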
What determinism doesn’t buy you
Honesty matters here. Replay-determinism guarantees the process: the declared steps ran, in order, nothing was bypassed, irreversible actions got a human. It does not make the model’s output correct — a deterministic pipeline can deterministically produce a wrong answer. What you gain is the ability to prove which process ran and to inspect every decision behind a result. When an agent run goes wrong, “show me exactly what happened” has an answer that’s a database query, not a log-grepping session.
Try it
npm i -g @loomfsm/pipeline
loom up # web dashboard, first-run wizard
# or: loom run "add rate limiting to the login endpoint"
loom is open source (Apache-2.0), early-stage, and built in the open; the design rationale and source are public. If your agents do work that’s expensive to get wrong, the ledger is waiting.