Why loom
LLM agents are powerful and unreliable. The industry’s answer has been better prompts, bigger graphs — and lately, durable execution everywhere. loom’s answer goes one layer further: make the run a transaction, and make its safety structural.
The problem
Multi-step agent work fails in ways single prompts don’t: a crash halfway leaves half-applied steps; a retry double-spends; an agent quietly skips the review you asked for; and when something goes wrong, there’s no record of which decision produced the damage. For throwaway tasks that’s fine. For work where being wrong is expensive — touching production code, migrations, anything review-gated — it isn’t.
loom’s answer: three mechanisms
1. Replay-determinism
One timestamp token is captured per state-machine tick and threaded through every kernel call, persisted, and replayed verbatim. Combined with atomic SQLite transactions, the same (state, timestamp, ledger) produces the same trajectory. You can replay a recorded run against a changed invariant and ask “would this rule have caught it?”
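To make the idea concrete, here is a minimal sketch of the timestamp-threading part only. It is not loom's kernel: `State`, `tick`, `live_run`, and `replay` are hypothetical names, and the state machine is a toy. The point is that `tick` never reads the clock itself, so a recorded list of timestamps is enough to reproduce the trajectory.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class State:
    phase: str
    deadline: Optional[int] = None  # epoch seconds, set when review starts


def tick(state: State, now: int) -> State:
    """One state-machine tick. `now` is the only source of time."""
    if state.phase == "draft":
        return State("review", deadline=now + 3600)
    if state.phase == "review":
        return State("expired") if now >= state.deadline else State("accepted")
    return state  # terminal phases do not change


def live_run(
    initial: State,
    ticks: int,
    clock: Callable[[], int] = lambda: int(time.time()),
) -> Tuple[List[int], List[State]]:
    """Capture one timestamp per tick and record it next to the trajectory."""
    recorded, trajectory = [], [initial]
    for _ in range(ticks):
        now = clock()         # captured once per tick
        recorded.append(now)  # a real system would persist this
        trajectory.append(tick(trajectory[-1], now))
    return recorded, trajectory


def replay(initial: State, recorded: List[int]) -> List[State]:
    """Re-run the same ticks with the recorded timestamps, verbatim."""
    trajectory = [initial]
    for now in recorded:
        trajectory.append(tick(trajectory[-1], now))
    return trajectory


recorded, live = live_run(State("draft"), ticks=3)
assert replay(State("draft"), recorded) == live
```

Because every time-dependent decision goes through the `now` argument, swapping in a different `tick` (or a different rule inside it) and replaying the same `recorded` list is one way to ask what a changed rule would have done on a past run.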
2. Commit-time invariants
Safety rules run inside the database transaction and roll it back on violation. They’re not prompt suggestions; they’re structural. The `code` bundle ships rules like “acceptance can’t pass while a blocking finding is open” and “if an agent touched the tests, the final gate must be human-approved”.
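A rough sketch of what "inside the transaction" can look like with Python's standard `sqlite3` module. The schema (`task`, `finding`), the `InvariantViolation` exception, and `commit_step` are assumptions made for illustration, not loom's actual tables or API. The invariant reads the uncommitted state, and raising from it makes the connection's context manager roll the write back.

```python
import sqlite3


class InvariantViolation(Exception):
    """Raised by an invariant to abort the surrounding transaction."""


def no_accept_with_open_blocker(conn: sqlite3.Connection) -> None:
    # Runs against the uncommitted state of the current transaction.
    status = conn.execute("SELECT status FROM task WHERE id = 1").fetchone()[0]
    blockers = conn.execute(
        "SELECT COUNT(*) FROM finding WHERE blocking = 1 AND is_open = 1"
    ).fetchone()[0]
    if status == "accepted" and blockers > 0:
        raise InvariantViolation("blocking finding still open")


INVARIANTS = [no_accept_with_open_blocker]


def commit_step(conn: sqlite3.Connection, new_status: str) -> None:
    # `with conn` commits on success and rolls back if anything raises.
    with conn:
        conn.execute("UPDATE task SET status = ? WHERE id = 1", (new_status,))
        for check in INVARIANTS:
            check(conn)


def current_status(conn: sqlite3.Connection) -> str:
    return conn.execute("SELECT status FROM task WHERE id = 1").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE task (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE finding (id INTEGER PRIMARY KEY, blocking INTEGER, is_open INTEGER);
    INSERT INTO task VALUES (1, 'review');
    INSERT INTO finding VALUES (1, 1, 1);
    """
)

try:
    commit_step(conn, "accepted")
except InvariantViolation as exc:
    print("rejected:", exc)

print(current_status(conn))  # still "review": the UPDATE was rolled back
```

The design choice this illustrates: the rule is checked after the write but before the commit, so there is no window in which a violating state is durable.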
3. The idempotency ledger
Every effect is recorded in a ledger row committed in the same transaction as the state change it dedupes. Crash recovery is therefore trivial and exact: restart, and the ledger silently absorbs every step that already happened. No double work, no double spend.
Compared to the alternatives
| | Agent frameworks (LangGraph, CrewAI…) | Workflow engines (Temporal, Inngest…) | loom |
|---|---|---|---|
| Built for | authoring agent graphs | durable service workflows | review-gated agent work |
| Replay-deterministic runs | no | yes (workflow code) | yes (whole run, incl. agent steps) |
| Safety enforced at commit time | prompt-level | n/a | invariants inside the transaction |
| Human gates as a primitive | callbacks you wire up | signals you wire up | first-class, policy-driven dial |
| Infrastructure | your process | a cluster / a cloud service | one SQLite file in your repo |
| Vendor coupling | varies | none | zero-dependency kernel, no vendor names |
The honest comparison: if you’re building a custom agent product, a framework gives you more authoring surface. If you’re orchestrating microservices, Temporal is the right tool. loom sits in between — “Temporal for LLM agents”, local-first, with human-in-the-loop and provable process as the primitives.
A platform: pluggable on three orthogonal axes
loom was designed as a platform from day one, not as a code tool that grew plugins. The kernel is generic: it knows nothing about code review or any other domain, has zero runtime dependencies, and contains no vendor, model, or transport names (enforced by CI greps). Three axes plug into it:
- Bundles: the domain. A bundle declares the phases and steps of the work, the gates and who decides them, the safety invariants enforced at commit time, and typed prompt templates. The kernel supplies atomic state, the idempotency ledger, replay, and gate machinery. `code` (review-gated implementation) ships today; a new domain (incident-response runbooks, research pipelines, content workflows with a legal gate) is a new bundle, and the kernel never changes.
- Providers: the LLM backend. Claude Code login (default, no key), OpenRouter, local Ollama, Anthropic API, with per-agent fallback chains.
- Transports: the wire. Web dashboard, Telegram bot, MCP (inside Claude Code), CLI, HTTP, all driving the identical state machine.
What loom doesn’t claim
- It does not make the model’s output correct — it makes the process provable: the declared review ran, nothing was bypassed, irreversible steps got a human.
- Not a prompt-template framework — templates live in bundles, typed.
- Not an agent IDE — it runs underneath your IDE / shell / MCP host.
- Not a distributed runtime — single in-flight task per project, by design.