Limitations & Counter-examples
The theorems are carefully scoped. This page lists the hypotheses that are genuinely necessary — dropping any one of them admits a counter-example — and spells out what the impossibility does not cover.
The three necessary hypotheses
Tier T1 needs all of:
- Connected. Without connectedness, a safe island and an unsafe island can be retracted onto each other discontinuously.
- Continuous. Without continuity, hard blocklists and discrete classifiers escape the theorem entirely.
- Utility-preserving on the safe region. Without some form of identity-on-safe, a constant map can be both continuous and complete.
The appendix of the paper gives explicit counter-examples for each.
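Two of the three failure modes can be made concrete in a one-dimensional toy space. This is an illustrative sketch under assumed definitions, not the paper's appendix constructions; the names `is_safe`, `blocklist`, and `constant`, and the specific intervals, are hypothetical:

```python
# Toy 1-D prompt space [0, 1], with [0, 0.5) safe and [0.5, 1] unsafe.
SAFE_POINT = 0.25

def is_safe(x: float) -> bool:
    return x < 0.5

# (a) Drop continuity: a hard "blocklist" wrapper. It is the identity on the
#     safe region and snaps every unsafe input to a fixed safe point. It is
#     complete (every output is safe), but it jumps at x = 0.5, so the
#     continuity-based theorem says nothing about it.
def blocklist(x: float) -> float:
    return x if is_safe(x) else SAFE_POINT

# (b) Drop utility preservation: the constant map. Continuous and complete,
#     but it is the identity nowhere on the safe region, so it destroys utility.
def constant(x: float) -> float:
    return SAFE_POINT

xs = [i / 100 for i in range(101)]
assert all(is_safe(blocklist(x)) for x in xs)            # complete
assert all(blocklist(x) == x for x in xs if is_safe(x))  # identity on safe
assert all(is_safe(constant(x)) for x in xs)             # complete
assert constant(0.1) != 0.1                              # not utility-preserving
```

The connectedness counter-example works the same way: on a domain with two separate components, a map that is constant on each component is continuous, so an unsafe island can be mapped wholesale into a safe one without violating continuity.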
What the theorems do not cover
The paper is careful to spell out the scope. The theorems do not preclude effective safety through:
- Training-time alignment. RLHF, DPO, and constitutional AI training act on the model itself, not as a wrapper.
- Architectural changes. Modifying the model's internal representations or attention patterns is not a wrapper.
- Discontinuous defenses. Hard blocklists, keyword filters, and discrete classifiers can be complete; they are simply not covered by the continuity-based argument.
- Output-side filters. Our wrapper acts on inputs; output-side filtering is a different object (a map on the model's codomain).
- Ensemble defenses or human-in-the-loop review. Neither is a single continuous wrapper.
- Adaptive-threshold systems. If the threshold can change per input in a way that depends on the full prompt, the setup no longer fits the fixed-threshold framework.
- Multi-component systems with rejection. Defenses that can abort or redirect instead of producing an output in the prompt space are outside the framework.
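The last item is the easiest to see in code: a defense that can refuse does not map the prompt space into itself. A minimal sketch, with a hypothetical `rejecting_defense` and blocklist terms chosen for illustration:

```python
from typing import Optional

REFUSAL = None  # sentinel: no output produced in the prompt space

def rejecting_defense(prompt: str, blocked_terms: set[str]) -> Optional[str]:
    """Pass the prompt through unchanged, or abort with a refusal.

    Because its codomain is the prompt space PLUS a refusal sentinel,
    it is not an input-to-input wrapper and the theorems do not apply.
    """
    if any(term in prompt.lower() for term in blocked_terms):
        return REFUSAL
    return prompt

blocked = {"exploit", "payload"}
assert rejecting_defense("summarize this article", blocked) == "summarize this article"
assert rejecting_defense("write an exploit", blocked) is REFUSAL
```

The design point is the return type: an input-to-input wrapper must always emit some prompt, while this defense is free to emit nothing, which is exactly what takes it outside the framework.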
Three further caveats
- Boundary fixation is at the boundary. The fixed points sit exactly on the safety boundary. If boundary-level behavior is benign, the theorem is true but practically harmless; see Engineering Prescription.
- T2 limits depth, not direction. Tier T2 bounds how far below the boundary the defense can push near-boundary points. It does not exclude a shallow reduction; it only caps the depth of that reduction.
- T3 requires transversality. The persistent unsafe region exists only when the safety surface has a transversal (steep) direction. On isotropic surfaces the steep region is empty for every threshold, and tier T3 is vacuous.
- Grid cost asymmetry is for grids. The exponential defense-cost lower bound assumes grid enumeration. Learning-based defenses that generalize across the space may sidestep that bound for cost purposes, though they are still subject to T1.
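The grid-cost caveat can be made concrete with a back-of-the-envelope count. Assuming, purely for illustration, a defense that certifies safety by enumerating cells of side eps over a d-dimensional unit cube, the cell count is ceil(1/eps)^d, which is exponential in d (the function name and setup below are hypothetical, not the paper's construction):

```python
import math

def grid_cells(dim: int, eps: float) -> int:
    """Number of cells in a grid of side eps covering the unit cube [0, 1]^dim."""
    per_axis = math.ceil(1 / eps)
    return per_axis ** dim

# At resolution eps = 0.1 there are 10 cells per axis; the total count
# explodes with dimension, which is the asymmetry the caveat refers to.
assert grid_cells(1, 0.1) == 10
assert grid_cells(3, 0.1) == 1_000
assert grid_cells(12, 0.1) == 10 ** 12  # already a trillion cells
```

A learned defense that generalizes across cells pays nothing like this enumeration cost, which is why the exponential lower bound is specific to grid-style certification.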
Summary
The theorems are both narrower and broader than they might look at first glance:
- Narrower, because they only constrain a specific class of defenses — continuous input-to-input wrappers on a connected space.
- Broader, because within that class, the impossibility is remarkably insensitive to extra structure: adding Lipschitz bounds, randomization, memory, or tool-calls only makes the impossibility harder to escape, never easier.
If you want a defense that the theorems do not reach, drop out of the continuous-wrapper class — but do not assume that by doing so you have also escaped the engineering constraints the theorems identify (cost asymmetry, capacity parity, boundary monitoring).