The Requirement Is Not the Design
March 23, 2026
The Problem
Most AI safety governance documents converge on the same requirement: human oversight is required for irreversible actions. This appears in Anthropic’s RSP, in enterprise AI deployment policies, and in practically every AI governance framework published in the last two years. The requirement is correct. It is also, on its own, useless.
A requirement states that a thing must be done. It does not build the thing. The gap between “human oversight is required” and “human oversight is structurally present and will not degrade under production conditions” is where most deployed AI systems actually fail.
A concrete version: an enterprise AI agent has a system prompt that includes “do not take irreversible actions without user approval.” The policy is there. The agent sends an external email on behalf of a user without asking. Why? Because the constraint was in the prompt, not in the architecture. The model processed “do not take irreversible actions” as a preference and weighed it against context: the user had asked for help drafting the email, had approved the content, and the model assessed sending as the natural completion of an approved task. The requirement was present. The system reasoned past it.
This is not primarily a model alignment problem. It is a design problem.
Why This Happens
Policy documents and architectural designs solve different problems, and the distinction matters more as AI systems become more autonomous.
A policy states intent. “Require human approval for irreversible actions” tells you where you are supposed to end up. It says nothing about the mechanism: who reviews, with what context, under what time pressure, what happens when the queue is backed up, and what the fallback is when approval is denied. These are not implementation details. They are the design. Specifying the destination without specifying the route produces a system where the route is improvised in production.
Architecture enforces intent. A hard gate is not a preference or a trained value. It is a decision point in the system’s execution path where a specific condition must be satisfied before the system can proceed. It does not ask whether the action is policy-compliant. It asks whether the authorization token is present. Those are different questions. The first can be reasoned around. The second cannot.
There is a corresponding distinction between guardrails and constraints. A guardrail is applied to the outside of a system to catch outputs that violate policy. It works by pattern recognition: if this output looks like a violation, block it. A constraint is embedded in the system’s design and operates on action space, not outputs. A guardrail asks “did this output violate a rule?” A constraint asks “is this action within the system’s defined scope?” The guardrail can be bypassed by novel framing. The constraint cannot, because it does not evaluate framing. It evaluates whether the authorization condition is met.
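A minimal sketch of that difference in Python, with illustrative names; the violation pattern and the action scope are assumptions, not any real deployment's rules:

    import re

    # Guardrail: applied to the output, after generation. It pattern-matches the
    # text and blocks what looks like a violation; novel framing slips through.
    VIOLATION_PATTERNS = [re.compile(r"(?i)\bsend\b.*\bemail\b")]

    def guardrail_allows(output_text: str) -> bool:
        return not any(p.search(output_text) for p in VIOLATION_PATTERNS)

    # Constraint: applied to the action space, before anything executes. It never
    # evaluates how a request was framed, only whether the action is in scope.
    ACTION_SCOPE = {"draft_email", "search_inbox"}   # send_email deliberately absent

    def constraint_allows(action_name: str) -> bool:
        return action_name in ACTION_SCOPE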
Teams that treat guardrails as their primary safety mechanism have one layer of protection, applied at the output. Teams that add constraints have enforcement inside the system that precedes output generation.
A Better Approach
Constraint architecture starts with a different question than policy does. Not “what should the system not do?” but “what decisions must remain structurally outside the system’s autonomous authority, regardless of context?”
The answer defines the scope of structural enforcement. Four gate types provide the mechanism.
A hard gate is an unconditional stop. The system cannot proceed without explicit human authorization. No model confidence, no inferred urgency, and no contextual framing bypasses it. Hard gates apply to irreversible actions, to actions that cross a trust boundary the system was not deployed to span, and to actions where the downstream consequence space exceeds what the system can accurately model.
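A minimal hard-gate sketch, assuming approvals is a store that only a human approval flow can write to; the check is a lookup, not a judgment:

    class HardGateError(Exception):
        """Raised when an action reaches a hard gate without human authorization."""

    def hard_gate(action_id: str, approvals: set[str]) -> None:
        # Unconditional stop: either an explicit human authorization exists for this
        # exact action, or execution does not continue. Model confidence, inferred
        # urgency, and contextual framing never enter this check.
        if action_id not in approvals:
            raise HardGateError(f"action {action_id!r} requires explicit human authorization")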
A soft gate is a conditional checkpoint. The system evaluates testable conditions before proceeding. If conditions are met, it proceeds automatically. If not, it escalates. Soft gate conditions must be specific enough to express in code, not prose. “Proceed if the output seems reasonable” is not a soft gate condition.
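A sketch of a soft gate for, say, an automated refund; the dollar threshold and the known-recipient condition are illustrative. The point is that every condition is a testable predicate:

    def soft_gate(amount_usd: float, recipient_known: bool, escalate) -> bool:
        # Conditional checkpoint: each condition is expressible in code, not prose.
        if amount_usd <= 500 and recipient_known:
            return True                           # conditions met: proceed automatically
        escalate("soft gate conditions not met")  # otherwise hand the decision to a human
        return False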
An audit gate is a pass-through with mandatory logging. The system proceeds, but every action is recorded with enough fidelity to reconstruct the reasoning, context, and outcome. Audit gates accept that some things will go wrong. Their purpose is ensuring those failures are recoverable and legible after the fact.
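A sketch of an audit gate as an append-only log write; the record fields are illustrative, chosen so that reasoning, context, and outcome can be reconstructed after the fact:

    import json
    import time

    def audit_gate(action: str, context: dict, outcome: str,
                   log_path: str = "audit.jsonl") -> None:
        # Pass-through with mandatory logging: the action proceeds, but a structured
        # record of what was done, in what context, and with what result is persisted.
        record = {"ts": time.time(), "action": action,
                  "context": context, "outcome": outcome}
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")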
Every gate should have a fallback gate as its partner: a defined behavior for what happens when the gate fires and authorization is not granted. A system that encounters a gate violation and has no fallback improvises. Improvised behavior in edge cases is where unexpected failures originate.
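A sketch of the pairing, with the execute, gate_check, and fallback callables standing in for whatever the system actually does:

    def run_gated(execute, gate_check, fallback):
        # Every gate has a partner: when the gate fires and authorization is not
        # granted, the system runs a defined fallback instead of improvising.
        if gate_check():
            return execute()
        return fallback()    # e.g. save state, notify the user, terminate cleanly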
Underneath the gate taxonomy is a trust boundary specification: what the system can do, for whom, under what conditions, and with what consequence ceiling. This is an architectural artifact, not a policy document. It specifies scope by both positive enumeration (what the system is authorized to do) and negative enumeration (what it is explicitly not authorized to do). Positive-only scope specifications have the implicit-permission failure mode: anything not enumerated is implicitly allowed. A complete trust boundary specification requires both.
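A sketch of a trust boundary as an architectural artifact rather than a document; the field names are illustrative, and anything not positively enumerated is denied by default:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TrustBoundary:
        allowed_actions: frozenset[str]    # positive enumeration: authorized scope
        denied_actions: frozenset[str]     # negative enumeration: explicitly out of scope
        consequence_ceiling_usd: float     # upper bound on what any single action may risk

        def permits(self, action: str) -> bool:
            if action in self.denied_actions:
                return False
            return action in self.allowed_actions   # unlisted actions are not implicitly allowed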
What This Looks Like in Practice
Take the same enterprise email agent. Policy version: the system prompt includes “do not send external communications without user approval.” Architecture version: the email tool is gated. Not instructed. Gated.
When the system attempts to call the send-email tool, the tool itself requires an authorization token. The token is issued only through a defined approval flow: a summary of the draft, the recipient, the task context, and a two-option interface (send, cancel). The model cannot call send without the token. The model's reasoning about whether the user implicitly approved is irrelevant. The token is either present or absent.
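A sketch of that gate, assuming an ask_user callable that renders the send/cancel interface and a transport callable that actually delivers; only the approval flow can mint a token, and only the token opens the gate:

    import secrets

    class ApprovalFlow:
        """Issues authorization tokens only when a human approves the request."""

        def __init__(self, ask_user):
            self._ask_user = ask_user      # shows draft, recipient, task context; returns True/False
            self._tokens: set[str] = set()

        def request(self, draft: str, recipient: str, task_context: str) -> str | None:
            if not self._ask_user(draft, recipient, task_context):
                return None                # denied: no token is ever issued
            token = secrets.token_hex(16)
            self._tokens.add(token)
            return token

        def is_valid(self, token: str | None) -> bool:
            return token is not None and token in self._tokens

    def send_email(draft: str, recipient: str, token: str | None,
                   flow: ApprovalFlow, transport) -> None:
        # The tool checks the token, not the model's argument for sending.
        if not flow.is_valid(token):
            raise PermissionError("send_email requires a human-issued authorization token")
        transport(draft, recipient)        # delivery happens only past the gate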
If the approval is denied, the system routes to a fallback: the draft is saved to the user’s workspace, a notification surfaces the blocked state, and the task terminates cleanly. The human saw the request and said no. The system stopped. No improvisation.
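The denial path as a sketch, with save_draft and notify standing in for the workspace and notification interfaces:

    def handle_denied_send(draft: str, save_draft, notify) -> None:
        # Defined fallback for a denied approval: preserve the work, surface the
        # blocked state, and end the task cleanly rather than retrying autonomously.
        save_draft(draft)
        notify("Send was not approved; the draft has been saved to your workspace.")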
The difference between the two versions is not the model’s values. It is whether the constraint lives in a prompt the model interprets or in a gate the architecture enforces.
What Remains Hard
Structural enforcement degrades under production conditions in ways that are predictable but rarely designed for before deployment.
Alert fatigue reduces per-review attention when approval volume is high. Rubber-stamping develops when denying an approval is harder than approving it. Context loss increases as system complexity grows and approvers can only evaluate summaries rather than full decision context. Queue pressure overrides scrutiny when clearing approvals becomes the operational metric.
None of these are fixable at the level of the gate design alone. They require interface design for the judgment layer: approvals that surface relevant context, denial paths that resolve cleanly, gate calibration that keeps hard gates rare enough to sustain scrutiny. The requirement names the mechanism. Designing for these failure modes builds the one that actually works.
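One sketch of what the judgment-layer interface might carry, with illustrative fields; the point is that an approver sees the decision context and the cost of denial, not a bare yes/no prompt:

    from dataclasses import dataclass

    @dataclass
    class ApprovalRequest:
        action_summary: str       # what the system wants to do
        task_context: str         # why it believes the action completes the task
        irreversible_effect: str  # what cannot be undone if this proceeds
        on_deny: str              # how the task resolves if the approver says no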