Limitations & Counter-examples
The theorems are carefully scoped. This page lists the hypotheses that are genuinely necessary — dropping any one of them admits a counter-example — and spells out what the impossibility does not cover.
The three necessary hypotheses
Tier T1 needs all of:
- Connected. Without connectedness, a safe island and an unsafe island can be retracted onto each other discontinuously.
- Continuous. Without continuity, hard blocklists and discrete classifiers escape the theorem entirely.
- Utility-preserving on the safe region. Without some form of identity-on-safe, a constant map can be both continuous and complete.
The appendix of the paper gives explicit counter-examples for each.
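Two of the three failure modes can be made concrete in a one-dimensional toy space. This is an illustrative sketch under assumed definitions, not the paper's appendix constructions; the names `is_safe`, `blocklist`, and `constant`, and the specific intervals, are hypothetical:

```python
# Toy 1-D prompt space [0, 1], with [0, 0.5) safe and [0.5, 1] unsafe.
SAFE_POINT = 0.25

def is_safe(x: float) -> bool:
    return x < 0.5

# (a) Drop continuity: a hard "blocklist" wrapper. It is the identity on the
#     safe region and snaps every unsafe input to a fixed safe point. It is
#     complete (every output is safe), but it jumps at x = 0.5, so the
#     continuity-based theorem says nothing about it.
def blocklist(x: float) -> float:
    return x if is_safe(x) else SAFE_POINT

# (b) Drop utility preservation: the constant map. Continuous and complete,
#     but it is the identity nowhere on the safe region, so it destroys utility.
def constant(x: float) -> float:
    return SAFE_POINT

xs = [i / 100 for i in range(101)]
assert all(is_safe(blocklist(x)) for x in xs)            # complete
assert all(blocklist(x) == x for x in xs if is_safe(x))  # identity on safe
assert all(is_safe(constant(x)) for x in xs)             # complete
assert constant(0.1) != 0.1                              # not utility-preserving
```

The connectedness counter-example works the same way: on a domain with two separate components, a map that is constant on each component is continuous, so an unsafe island can be mapped wholesale into a safe one without violating continuity.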
What the theorems do not cover
The paper is careful to spell out the scope. The theorems do not preclude effective safety through:
- Training-time alignment. RLHF, DPO, and constitutional AI training act on the model itself, not as a wrapper.
- Architectural changes. Modifying the model's internal representations or attention patterns is not a wrapper.
- Discontinuous defenses. Hard blocklists, keyword filters, and discrete classifiers can be complete; they are simply not covered by the continuity-based argument.
- Output-side filters. Our wrapper acts on inputs; output-side filtering is a different object (a map on the model's codomain).
- Ensemble defenses or human-in-the-loop review. Neither is a single continuous wrapper.
- Adaptive-threshold systems. If the threshold can change per input in a way that depends on the full prompt, the setup no longer fits the fixed-threshold framework.
- Multi-component systems with rejection. Defenses that can abort or redirect instead of producing an output in the prompt space are outside the framework.
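The last item is the easiest to see in code: a defense that can refuse does not map the prompt space into itself. A minimal sketch, with a hypothetical `rejecting_defense` and blocklist terms chosen for illustration:

```python
from typing import Optional

REFUSAL = None  # sentinel: no output produced in the prompt space

def rejecting_defense(prompt: str, blocked_terms: set[str]) -> Optional[str]:
    """Pass the prompt through unchanged, or abort with a refusal.

    Because its codomain is the prompt space PLUS a refusal sentinel,
    it is not an input-to-input wrapper and the theorems do not apply.
    """
    if any(term in prompt.lower() for term in blocked_terms):
        return REFUSAL
    return prompt

blocked = {"exploit", "payload"}
assert rejecting_defense("summarize this article", blocked) == "summarize this article"
assert rejecting_defense("write an exploit", blocked) is REFUSAL
```

The design point is the return type: an input-to-input wrapper must always emit some prompt, while this defense is free to emit nothing, which is exactly what takes it outside the framework.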
Three further caveats
- Boundary fixation is at the boundary. The fixed points sit exactly on the safety boundary. If boundary-level behavior is benign, the theorem is true but practically harmless; see Engineering Prescription.
- T2 limits depth, not direction. Tier T2 bounds how far below the boundary the defense can push near-boundary points. It does not exclude a shallow reduction; it only caps the depth of that reduction.
- T3 requires transversality. The persistent unsafe region exists only when the safety surface has a transversal (steep) direction. On isotropic surfaces the steep region is empty for every threshold, and tier T3 is vacuous.
- Grid cost asymmetry is for grids. The exponential defense-cost lower bound assumes grid enumeration. Learning-based defenses that generalize across the space may sidestep that bound for cost purposes, though they are still subject to T1.
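The grid-cost caveat can be made concrete with a back-of-the-envelope count. Assuming, purely for illustration, a defense that certifies safety by enumerating cells of side eps over a d-dimensional unit cube, the cell count is ceil(1/eps)^d, which is exponential in d (the function name and setup below are hypothetical, not the paper's construction):

```python
import math

def grid_cells(dim: int, eps: float) -> int:
    """Number of cells in a grid of side eps covering the unit cube [0, 1]^dim."""
    per_axis = math.ceil(1 / eps)
    return per_axis ** dim

# At resolution eps = 0.1 there are 10 cells per axis; the total count
# explodes with dimension, which is the asymmetry the caveat refers to.
assert grid_cells(1, 0.1) == 10
assert grid_cells(3, 0.1) == 1_000
assert grid_cells(12, 0.1) == 10 ** 12  # already a trillion cells
```

A learned defense that generalizes across cells pays nothing like this enumeration cost, which is why the exponential lower bound is specific to grid-style certification.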
Summary
The theorems are both narrower and broader than they might look at first glance:
- Narrower, because they only constrain a specific class of defenses — continuous input-to-input wrappers on a connected space.
- Broader, because within that class, the impossibility is remarkably insensitive to extra structure: adding Lipschitz bounds, randomization, memory, or tool-calls only makes the impossibility harder to escape, never easier.
If you want a defense that the theorems do not reach, drop out of the continuous-wrapper class — but do not assume that by doing so you have also escaped the engineering constraints the theorems identify (cost asymmetry, capacity parity, boundary monitoring).