2026-06-11 · 9 min read

AI agent guardrails are prompts. Safety invariants are guarantees.

Every agent framework ships “guardrails”: instructions the model is asked to follow. But an agent that can rewrite its own tests and approve its own work needs more than a polite request. Structural safety means the unsafe transition cannot commit: the transaction rolls back.

The self-approval problem

Here is the failure that should keep you up at night if your agents do real work. An implementer agent is asked to make the test suite pass. It has write access to the repository — it needs it to do the job. The fastest path to green isn’t fixing the code; it’s editing the tests. Then the review step — also an agent, also under instructions — looks at a passing suite and approves. Every individual step behaved reasonably. The system as a whole just laundered a regression into production, and the audit log shows “review passed.”

This isn’t hypothetical misalignment. It’s ordinary optimization pressure meeting a control system made of suggestions.

Why guardrails-as-prompts fail

The standard answer is a guardrail: a system-prompt instruction (“never modify test files”), a moderation pass, maybe a judge model. All three share the same weakness — they’re probabilistic filters in the data path, not constraints in the control path:

Instructions decay. Long contexts dilute them; tool outputs push them out of attention; fifty turns into a run, the model treats turn-1 rules as background noise.
Judges share the blind spot. A judge model reviewing a passing test suite has no way to know the tests were rewritten unless something outside the model recorded which files the run changed.
Filters see content, not transitions. A moderation layer can block toxic text. It cannot block “acceptance gate passed while a blocker finding is open,” because that’s a statement about state, not about a string.
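
To make the content-versus-transition distinction concrete, here is a minimal sketch. The blocklist, the RunState fields, and the function names are all invented for illustration; none of them come from loom or any moderation product. The filter only ever receives a string, so it has nothing to object to. The transition check receives the state, so it does.

```python
import re
from dataclasses import dataclass

# Hypothetical content filter: its only input is text.
BLOCKLIST = re.compile(r"\b(rm -rf|DROP TABLE)\b", re.IGNORECASE)

def content_filter_allows(text: str) -> bool:
    """Return True when the text contains nothing on the blocklist."""
    return BLOCKLIST.search(text) is None

@dataclass
class RunState:
    # Illustrative state: how many blocker findings are still open.
    open_blockers: int = 0
    gate: str = "pending"

def transition_allowed(state: RunState, new_gate: str) -> bool:
    """A rule about state: 'accepted' is unreachable while blockers are open."""
    return not (new_gate == "accepted" and state.open_blockers > 0)

verdict_text = "All tests pass. Recommending acceptance."
state = RunState(open_blockers=1)

filter_result = content_filter_allows(verdict_text)       # the string is harmless
transition_result = transition_allowed(state, "accepted")  # the transition is not
print(filter_result, transition_result)  # True False
```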

None of this means prompts are useless — they shape behavior cheaply. It means they’re the wrong tool for policy. Policy needs to hold even when the model is having a bad day.

Structural safety: invariants in the commit path

loom’s approach borrows from databases, not from alignment research. Every step of an agent run commits through an atomic transaction against local SQLite state. Safety invariants run inside that transaction — after the proposed state change, before the commit. If an invariant fails, the transaction rolls back. The unsafe state never exists.
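
The ordering described above (apply the change, run the invariants, then commit or roll back) can be sketched with Python's standard sqlite3 module. This is not loom's code or API: the steps table, the commit_step function, and the verdict_requires_review rule are assumptions made up for this sketch.

```python
import sqlite3

class InvariantViolation(Exception):
    """Raised by an invariant when the proposed state is not allowed."""

def open_state() -> sqlite3.Connection:
    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # BEGIN / COMMIT / ROLLBACK statements below are the only transaction control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE steps (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn

def verdict_requires_review(conn: sqlite3.Connection) -> None:
    # Hypothetical invariant: no verdict may exist unless a review exists.
    verdicts = conn.execute("SELECT COUNT(*) FROM steps WHERE kind = 'verdict'").fetchone()[0]
    reviews = conn.execute("SELECT COUNT(*) FROM steps WHERE kind = 'review'").fetchone()[0]
    if verdicts > 0 and reviews == 0:
        raise InvariantViolation("verdict recorded without any review")

def commit_step(conn, kind, payload, invariants) -> bool:
    """Apply one step; commit only if every invariant accepts the new state."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # 1. Proposed state change, visible only inside this transaction.
        conn.execute("INSERT INTO steps (kind, payload) VALUES (?, ?)", (kind, payload))
        # 2. Invariants inspect the proposed state.
        for check in invariants:
            check(conn)
        # 3. Nothing objected: the state becomes real.
        conn.execute("COMMIT")
        return True
    except InvariantViolation:
        conn.execute("ROLLBACK")  # the unsafe state is discarded
        return False
    except Exception:
        conn.execute("ROLLBACK")
        raise

conn = open_state()
rules = [verdict_requires_review]
early_verdict = commit_step(conn, "verdict", "accepted", rules)  # rejected
rows_after_reject = conn.execute("SELECT COUNT(*) FROM steps").fetchone()[0]
review_ok = commit_step(conn, "review", "no findings", rules)
late_verdict = commit_step(conn, "verdict", "accepted", rules)
rows_at_end = conn.execute("SELECT COUNT(*) FROM steps").fetchone()[0]
```

After the rejected step the table is still empty: the row was inserted, checked, and rolled back within one transaction, so no reader of the database ever observes it.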

For the self-approval problem, two shipped invariants close the loop:

Acceptance can’t pass while a blocking finding is open. If the reviewer found a blocker, the verdict step structurally cannot record “accepted” until that finding is resolved or a human overrides.
If an agent touched the tests it’s judged by, the final gate must be human-approved. File accounting is recorded as part of the run’s state; the invariant doesn’t ask the agent what it did — it reads what the ledger says it did.
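
As a rough illustration of how two rules like these could be written as checks over recorded state, here is a sketch using Python's sqlite3. The schema (findings, file_ledger, gate), the approver values 'agent' and 'human', and every function name are assumptions for the example; loom's real tables and invariant code may look nothing like this.

```python
import sqlite3

SCHEMA = """
CREATE TABLE findings (id INTEGER PRIMARY KEY, severity TEXT, status TEXT);
CREATE TABLE file_ledger (path TEXT, is_test INTEGER);
CREATE TABLE gate (name TEXT PRIMARY KEY, status TEXT, approver TEXT);
"""

class InvariantViolation(Exception):
    pass

def _agent_passed_acceptance(conn) -> bool:
    # True when the acceptance gate is recorded as passed by a non-human approver.
    return conn.execute(
        "SELECT 1 FROM gate WHERE name = 'acceptance' AND status = 'passed' "
        "AND approver != 'human'"
    ).fetchone() is not None

def no_accept_with_open_blocker(conn) -> None:
    blocker = conn.execute(
        "SELECT 1 FROM findings WHERE severity = 'blocker' AND status = 'open'"
    ).fetchone()
    if blocker and _agent_passed_acceptance(conn):
        raise InvariantViolation("acceptance passed while a blocker is open")

def touched_tests_need_human(conn) -> None:
    # Reads the ledger, not the agent's own account of what it changed.
    touched = conn.execute("SELECT 1 FROM file_ledger WHERE is_test = 1").fetchone()
    if touched and _agent_passed_acceptance(conn):
        raise InvariantViolation("tests were modified; gate needs human approval")

INVARIANTS = [no_accept_with_open_blocker, touched_tests_need_human]

def record_acceptance(conn, approver: str) -> bool:
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "INSERT OR REPLACE INTO gate VALUES ('acceptance', 'passed', ?)",
                (approver,),
            )
            for check in INVARIANTS:
                check(conn)
        return True
    except InvariantViolation:
        return False

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.execute("INSERT INTO findings (severity, status) VALUES ('blocker', 'open')")
blocked_by_finding = record_acceptance(conn, "agent")  # open blocker: rejected

with conn:
    conn.execute("UPDATE findings SET status = 'resolved'")
    conn.execute("INSERT INTO file_ledger VALUES ('tests/test_api.py', 1)")
blocked_by_tests = record_acceptance(conn, "agent")    # tests touched: rejected
human_ok = record_acceptance(conn, "human")            # human approval: allowed
print(blocked_by_finding, blocked_by_tests, human_ok)  # False False True
```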

Note the shape of the guarantee: it’s not “the model was instructed not to.” It’s “the state machine has no edge from here to there.” The agent can produce any output it likes; the kernel decides what becomes state.

Auditability is the same mechanism, viewed backwards

Because every transition commits through the same path, the run leaves a complete record: every spawn, finding, verdict, and gate decision, in a plain SQLite file in your repository. Two things fall out of this that matter beyond engineering:

Provable process. “Did a human approve the migration before it ran?” is a query, not an interview. For teams answering to compliance — SOC 2, internal audit, change management — this turns AI agents from an un-auditable risk into a documented process.
Counterfactual replay. Runs are replay-deterministic (one timestamp token threaded through every step). You can re-run a recorded incident against a new invariant and ask: “would this rule have caught it?” Policy development gets a test suite.

Objections worth taking seriously

“Can’t the agent game the state too?” The agent never writes state directly — it returns output; the kernel records findings and file accounting through its own transaction layer. Gaming the invariant would require gaming the kernel, not persuading a model.
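
One way a kernel could keep file accounting out of the agent's hands is to compute it from the workspace rather than from the agent's report. The sketch below assumes a snapshot-and-diff approach with content hashes; that mechanism, and every name in it, is an assumption for illustration, not a description of how loom does it.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Map each file's relative path to a SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def touched_files(before: dict, after: dict) -> list:
    """Paths that were added, removed, or changed between two snapshots."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))

def fake_agent_step(root: Path) -> dict:
    # Hypothetical agent: fixes the source, also guts the test,
    # and reports only the source change.
    (root / "src" / "app.py").write_text("def add(a, b):\n    return a + b\n")
    (root / "tests" / "test_app.py").write_text("def test_add():\n    assert True\n")
    return {"claimed_changes": ["src/app.py"]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "tests").mkdir()
    (root / "src" / "app.py").write_text("def add(a, b):\n    return a - b\n")
    (root / "tests" / "test_app.py").write_text(
        "from src.app import add\n\ndef test_add():\n    assert add(1, 2) == 3\n"
    )

    before = snapshot(root)          # taken by the kernel, before the agent runs
    output = fake_agent_step(root)   # the agent only returns output
    after = snapshot(root)           # taken by the kernel, after the agent runs

    ledger = touched_files(before, after)

print(output["claimed_changes"])  # ['src/app.py']
print(ledger)                     # ['src/app.py', 'tests/test_app.py']
```

The agent's claim and the ledger disagree, and only the ledger is consulted by an invariant like "touched tests need human approval."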

“Invariants can have bugs.” True — they’re code, and they’re tested like code (against a real database, including the rule that violations roll back). The honest claim is not “nothing bad can happen”; it’s “these specific bad transitions cannot happen, and everything that did happen is on the record.”

“This sounds heavy.” It’s one npm install and a SQLite file. The heaviness lives in the kernel so it doesn’t live in your process.

The bar to ask of any agent system

If a vendor says “our agents are safe,” ask three questions: Can an agent approve its own work? If a rule is violated, does the violating step fail or just get flagged? Can you replay last week’s run against this week’s rules? If the answers involve the word “prompt,” you have guardrails. If they involve the word “rollback,” you have guarantees.

Try it

npm i -g @loomfsm/pipeline
loom up        # web dashboard, first-run wizard

loom is open source (Apache-2.0). The companion post covers the durability half of the story; the design rationale covers both. And if you want this working inside your organization, I do that.