Why loom

LLM agents are powerful and unreliable. The industry’s answer has been better prompts, bigger graphs — and lately, durable execution everywhere. loom’s answer goes one layer further: make the run a transaction, and make its safety structural.

The problem

Multi-step agent work fails in ways single prompts don’t: a crash halfway leaves half-applied steps; a retry double-spends; an agent quietly skips the review you asked for; and when something goes wrong, there’s no record of which decision produced the damage. For throwaway tasks that’s fine. For work where being wrong is expensive — touching production code, migrations, anything review-gated — it isn’t.

loom’s answer: three mechanisms

1. Replay-determinism

One timestamp token is captured per state-machine tick, threaded through every kernel call, persisted, and replayed verbatim. No kernel call reads the clock on its own, so, combined with atomic SQLite transactions, the same (state, timestamp, ledger) triple always produces the same trajectory. You can replay a recorded run against a changed invariant and ask “would this rule have caught it?”
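A minimal sketch of the idea, not loom's actual kernel: the names `Tick`, `step`, `replay`, and `needs_review` are hypothetical and chosen only for illustration. It assumes the transition function is pure and reads time only from the recorded token, so replaying the same log yields the same trajectory, and a stricter invariant can be tested against a past run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Tick:
    ts: int      # timestamp token captured once for this tick (hypothetical shape)
    event: str   # the input that drove the tick


def step(state: dict, tick: Tick) -> dict:
    """Pure transition: time comes from tick.ts, never from the system clock."""
    new = dict(state)
    new["phase"] = tick.event
    new["updated_at"] = tick.ts
    new["history"] = state.get("history", ()) + ((tick.ts, tick.event),)
    return new


def replay(
    initial: dict,
    log: List[Tick],
    invariant: Callable[[dict], bool] = lambda s: True,
) -> Tuple[List[dict], Optional[int]]:
    """Re-run recorded ticks; return the trajectory and the index of the
    first tick that violates the invariant (None if none does)."""
    state, trajectory = initial, []
    for i, tick in enumerate(log):
        state = step(state, tick)
        trajectory.append(state)
        if not invariant(state):
            return trajectory, i
    return trajectory, None


# A recorded run: the timestamps are replayed verbatim, not regenerated.
log = [Tick(1000, "plan"), Tick(1007, "implement"), Tick(1015, "accept")]
first, _ = replay({"phase": "new"}, log)
second, _ = replay({"phase": "new"}, log)


# A changed invariant: acceptance requires an earlier "review" tick.
def needs_review(state: dict) -> bool:
    phases = [event for _, event in state["history"]]
    return state["phase"] != "accept" or "review" in phases


_, violated_at = replay({"phase": "new"}, log, needs_review)
```

Here the two replays are identical, and the stricter rule flags the third tick of the recorded run, which is the “would this rule have caught it?” question in miniature.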

2. Commit-time invariants

Safety rules run inside the database transaction and roll it back on violation. They’re not prompt suggestions — they’re structural. The code bundle ships rules like “acceptance can’t pass while a blocking finding is open” and “if an agent touched the tests, the final gate must be human-approved”.
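One way to picture this, as a sketch under assumptions rather than loom's real schema or API: the tables `task` and `finding`, the function `commit_with_invariants`, and the exception `InvariantViolation` are all invented for illustration. The checks run after the mutation but before `COMMIT`, so a violation rolls the whole transaction back.

```python
import sqlite3


class InvariantViolation(Exception):
    """Raised by a check to veto the transaction (hypothetical name)."""


def no_accept_with_open_blockers(conn: sqlite3.Connection) -> None:
    # Illustrative rule: acceptance can't pass while a blocking finding is open.
    row = conn.execute(
        "SELECT 1 FROM task WHERE status = 'accepted' AND EXISTS ("
        "  SELECT 1 FROM finding WHERE blocking = 1 AND is_open = 1)"
    ).fetchone()
    if row is not None:
        raise InvariantViolation("acceptance with an open blocking finding")


INVARIANTS = [no_accept_with_open_blockers]


def commit_with_invariants(conn: sqlite3.Connection, mutate) -> None:
    """Apply `mutate`, run every invariant, and commit only if all pass."""
    conn.execute("BEGIN")
    try:
        mutate(conn)
        for check in INVARIANTS:
            check(conn)          # sees the uncommitted state of this transaction
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # the violation undoes the mutation
        raise


# isolation_level=None gives explicit control over BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE finding (id INTEGER PRIMARY KEY, blocking INTEGER, is_open INTEGER)"
)
conn.execute("INSERT INTO task VALUES (1, 'in_review')")
conn.execute("INSERT INTO finding VALUES (1, 1, 1)")


def accept(c: sqlite3.Connection) -> None:
    c.execute("UPDATE task SET status = 'accepted' WHERE id = 1")


try:
    commit_with_invariants(conn, accept)
    rolled_back = False
except InvariantViolation:
    rolled_back = True

status = conn.execute("SELECT status FROM task WHERE id = 1").fetchone()[0]
```

Because the check reads the database from inside the open transaction, it judges the state the commit would produce, not the state before it.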

3. The idempotency ledger

Every effect is recorded in a ledger row committed in the same transaction as the state change it dedupes. Crash recovery is therefore simple and exact: on restart, any step that already has a ledger row is skipped instead of re-executed. No double work, no double spend.
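The pattern can be sketched as follows, assuming a ledger keyed by a unique effect key; the `ledger` and `budget` tables, the `effect_key` column, and `apply_once` are hypothetical names, not loom's schema. The ledger insert and the state change share one transaction, so either both land or neither does.

```python
import sqlite3


def apply_once(conn: sqlite3.Connection, effect_key: str, cost: int) -> bool:
    """Apply a state change at most once per effect key.

    Returns True if the change was applied, False if the ledger already
    held the key (the step ran before, e.g. prior to a crash)."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # The PRIMARY KEY constraint rejects a second row for the same key.
        conn.execute("INSERT INTO ledger (effect_key) VALUES (?)", (effect_key,))
        conn.execute("UPDATE budget SET spent = spent + ?", (cost,))
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # duplicate key: nothing is applied
        return False
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ledger (effect_key TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE budget (spent INTEGER NOT NULL)")
conn.execute("INSERT INTO budget VALUES (0)")

applied_first = apply_once(conn, "task-1/step-3", 5)
# Simulated restart: the same step is attempted again with the same key.
applied_again = apply_once(conn, "task-1/step-3", 5)
spent = conn.execute("SELECT spent FROM budget").fetchone()[0]
```

The second attempt is a no-op, so the recorded spend stays at 5 however many times the step is retried.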

Compared to the alternatives

	Agent frameworks (LangGraph, CrewAI…)	Workflow engines (Temporal, Inngest…)	loom
Built for	authoring agent graphs	durable service workflows	review-gated agent work
Replay-deterministic runs	no	yes (workflow code)	yes (whole run, incl. agent steps)
Safety enforced at commit time	prompt-level	n/a	invariants inside the transaction
Human gates as a primitive	callbacks you wire up	signals you wire up	first-class, policy-driven dial
Infrastructure	your process	a cluster / a cloud service	one SQLite file in your repo
Vendor coupling	varies	none	zero-dependency kernel, no vendor names

The honest comparison: if you’re building a custom agent product, a framework gives you more authoring surface. If you’re orchestrating microservices, Temporal is the right tool. loom sits in between — “Temporal for LLM agents”, local-first, with human-in-the-loop and provable process as the primitives.

A platform: pluggable on three orthogonal axes

loom was designed as a platform from day one, not as a code tool that grew plugins. The kernel is generic: it knows nothing about code review or any other domain, has zero runtime dependencies, and contains no vendor, model, or transport names (enforced by CI greps). Three axes plug into it:

Bundles: the domain. A bundle declares the phases and steps of the work, the gates and who decides them, the safety invariants enforced at commit time, and typed prompt templates. The kernel supplies atomic state, the idempotency ledger, replay, and gate machinery. The code bundle (review-gated implementation) ships today; a new domain (incident-response runbooks, research pipelines, content workflows with a legal gate) is a new bundle, and the kernel never changes.
Providers: the LLM backend. Claude Code login (default, no key), OpenRouter, local Ollama, or the Anthropic API, with per-agent fallback chains.
Transports: the wire. Web dashboard, Telegram bot, MCP (inside Claude Code), CLI, and HTTP, all driving the same state machine.
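To make the per-agent fallback chains concrete, here is a generic sketch of the technique. Nothing in it is loom's API: `Provider`, `ProviderError`, `call_with_fallback`, the stub providers, and the agent names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A provider is modeled as a plain function from prompt to completion.
Provider = Callable[[str], str]


class ProviderError(Exception):
    """Raised when a backend cannot serve the request (hypothetical name)."""


def call_with_fallback(chain: List[Provider], prompt: str) -> str:
    """Try each provider in order; return the first successful answer."""
    errors = []
    for provider in chain:
        try:
            return provider(prompt)
        except ProviderError as exc:
            errors.append(str(exc))  # remember why this backend was skipped
    raise ProviderError(f"all {len(chain)} providers failed: {errors}")


# Stub backends standing in for real ones.
def unavailable_backend(prompt: str) -> str:
    raise ProviderError("backend unavailable")


def local_backend(prompt: str) -> str:
    return f"local:{prompt}"


# One chain per agent role, so each agent can have its own order of preference.
CHAINS: Dict[str, List[Provider]] = {
    "reviewer": [unavailable_backend, local_backend],
    "implementer": [local_backend],
}

result = call_with_fallback(CHAINS["reviewer"], "check diff")
```

In this sketch the reviewer's first backend fails and the call falls through to the local one; the caller never needs to know which provider answered.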

What loom doesn’t claim

It does not make the model’s output correct — it makes the process provable: the declared review ran, nothing was bypassed, irreversible steps got a human.
Not a prompt-template framework — templates live in bundles, typed.
Not an agent IDE — it runs underneath your IDE / shell / MCP host.
Not a distributed runtime — single in-flight task per project, by design.

Early-stage, in the open. loom is v0.3: used daily by its author, stable at the core, still moving at the edges. Reading the whitepaper is the fastest way to decide if the design philosophy fits how you work.