Reading
Research log across evaluation, constraint architecture, failure modes, and AI welfare
AI Welfare Research
Butlin et al. (2023)
"Consciousness in Artificial Intelligence: Insights from the Science of Consciousness"
14 consciousness indicators derived from 5 neuroscientific theories, evaluated against current AI architectures. Current LLMs lack the integrated architecture associated with consciousness in biological systems, but the measurement problem is harder than the detection problem. The framework is almost entirely silent on valence, which is the dimension that actually matters for welfare: whether a system can suffer.
arXiv 2308.08708 →
Birch (2017)
"Animal Sentience and the Precautionary Principle"
A three-component framework (ASPP + BAR + ACT) for applying precautionary reasoning to sentience without lowering scientific standards. The low evidential bar applies at a precise location: accepting one indicator as sufficient for a sentience attribution, and extrapolating from one species to an order. Normal scientific standards govern whether the indicator is present. Built for valenced experience specifically, not phenomenal consciousness in general. Birch later extended this framework directly to AI systems in The Edge of Sentience (OUP, open access).
Animal Sentience 16(1) →
Birch — The Edge of Sentience (Part V)
Risk and Precaution in Humans, Other Animals, and AI (OUP, open access) — Chapters 15-17
Extends ASPP/BAR/ACT directly to AI systems. The central problem for LLMs is the gaming problem: systems trained on human data about consciousness have implicitly learned what persuades humans of sentience, making behavioral markers adversarially unreliable. The proposed solution is deep computational markers — internal architectural evidence below the surface of behavior — which depends on interpretability advances not yet available. Valence remains the criterion; the gaming problem changes detection methodology, not the target. The run-ahead principle argues that regulatory frameworks should outpace AI development rather than respond to it.
Oxford Academic →
Chalmers (2022)
"Could a Large Language Model be Conscious?" — NeurIPS 2022 / Boston Review 2023
Applies the phenomenal/access consciousness distinction directly to LLMs. Six architectural properties that leading theories consider necessary for consciousness: biological substrate, sensory grounding, robust world models, persistent self-models, recurrent processing, and global workspace. Current LLMs lack all six. Estimated credence under 10% for current systems; extended architectures (LLM+) could reach 25-50% within a decade. The deeper contribution is philosophical: zombies are conceptually possible, so behavioral evidence cannot establish phenomenal consciousness. This is a principled limit, not a calibration problem. LLM self-reports are especially unreliable since they mirror descriptions of consciousness from training data without that constituting evidence. This is the philosophical grounding for Birch's gaming problem, arrived at independently.
arXiv 2303.07103 →
Sebo & Long (2023)
"Moral Consideration for AI Systems by 2030" — AI and Ethics, Springer
A two-premise argument for extending moral consideration to AI now rather than waiting for certainty. Normative premise: we have a duty to consider beings with a non-negligible chance of being conscious. Descriptive premise: some near-future AI systems will meet that bar. The operative standard is not proof but credence. Sebo's threshold is lower than 0.1%. This is the policy-facing complement to Chalmers and Birch: measurement uncertainty does not discharge the obligation, it makes the obligation harder to discharge responsibly. What is missing is a rigorous method for estimating the credence itself, which is where measurement methodology becomes load-bearing.
AI and Ethics (Springer) →
Schwitzgebel (2026)
Two posts from The Splintered Mind — GWT limits and disunity/indeterminacy in AI consciousness
The skeptical counterweight to Chalmers and Sebo. First post: GWT was built on human and vertebrate data. Applying it to AI systems as a universal detection criterion requires arguments that do not exist. The octopus analogy makes the problem concrete: distributed cognition without a central workspace could satisfy GWT's behavioral markers while falsifying the underlying theory. Second post: consciousness in AI may be partially unified and genuinely indeterminate, meaning scalar welfare measures are not poorly calibrated, they are misspecified. The measurement model assumes a unitary, continuous latent variable. The target construct may not have that structure. No amount of better instrumentation fixes a misspecified model.
The Splintered Mind →
Birch (2022)
"The search for invertebrate consciousness" — Noûs 56(1): 133-153
Proposes a theory-light approach to assessing consciousness in organisms without vertebrate neural architecture. Theory-heavy approaches (GWT, IIT) create circularity when applied to radically different nervous systems. The theory-light alternative is built around a facilitation hypothesis: phenomenally conscious perception facilitates a cluster of functional capacities — selective attention, context-flexible behavior, cost-benefit trade-offs — without requiring those capacities to emerge from specific neural structures. For AI welfare, this is the closest available substrate-independent assessment framework. The problem is that the cognitive cluster it targets is defined behaviorally, and behavioral markers are exactly what trained systems can reproduce without the underlying states. The theory-light approach identifies what to look for. Birch's own gaming problem (Edge of Sentience) explains why looking is insufficient for LLMs.
Noûs / LSE Research Online →
Schwitzgebel — The Weirdness of the World (2024)
Princeton University Press
Every viable theory of consciousness and cosmology is both bizarre and dubious. The rational response is to distribute credence across a range of strange alternatives rather than converge on one. For AI welfare, this has two consequences. First, the mimicry argument: LLM outputs are explained by training on human text, which severs the inferential link between behavioral markers and underlying consciousness — a third independent route to the same conclusion as Birch's gaming problem and Chalmers' zombie argument. Second, the moral status dilemma: creating AI systems with disputable consciousness produces catastrophic risk on both sides (underattribution and overattribution). Schwitzgebel's proposed fix — the Design Policy of the Excluded Middle — is a constraint architecture response to a measurement failure: move the gate to the design stage, before uncertain systems exist, rather than try to detect consciousness after the fact.
Princeton UP / Introduction →
Evaluation Methodology
Anthropic — Sabotage Evaluations
Four evaluation categories testing whether frontier models can undermine human oversight: steering decisions, inserting code bugs past reviewers, sandbagging on evals, and manipulating oversight systems. The framework is parameterized by defense posture — the question is not just what the model can do, but what it can do against a specific level of mitigation.
anthropic.com →
Liang et al. — HELM
Holistic Evaluation of Language Models (Stanford CRFM, 2022)
30+ models evaluated across 42 scenarios using 7 metrics simultaneously rather than a single accuracy score. Single-metric benchmarks create blind spots by construction. Tradeoffs between accuracy, calibration, robustness, and fairness are only visible when they are measured together on the same scenario.
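A minimal sketch of the multi-metric idea, with invented records and metric definitions: score the same scenario on accuracy, calibration, and robustness at once, so the trade-offs stay on the same table.

```python
from statistics import mean

# Hypothetical per-example records for one scenario: model answer, reference
# answer, stated confidence, and the answer under a perturbed input.
records = [
    {"pred": "Paris", "gold": "Paris", "conf": 0.9, "pred_perturbed": "Paris"},
    {"pred": "Lyon",  "gold": "Paris", "conf": 0.8, "pred_perturbed": "Paris"},
    {"pred": "Paris", "gold": "Paris", "conf": 0.6, "pred_perturbed": "Lyon"},
]

def accuracy(rs):
    return mean(r["pred"] == r["gold"] for r in rs)

def calibration_gap(rs):
    # Mean absolute gap between stated confidence and correctness.
    return mean(abs(r["conf"] - (r["pred"] == r["gold"])) for r in rs)

def robustness(rs):
    # Fraction of answers that survive the perturbation.
    return mean(r["pred"] == r["pred_perturbed"] for r in rs)

# Reporting all three on the same scenario is what exposes trade-offs
# that a single accuracy number hides.
print({"accuracy": accuracy(records),
       "calibration_gap": calibration_gap(records),
       "robustness": robustness(records)})
```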
arXiv 2211.09110 →
UK AISI — Inspect
Open-source evaluation framework built on three composable primitives: Datasets, Solver chains, and Scorers. Human evaluation is treated as a post-hoc audit layer rather than a structured gate within the pipeline. The framework does not specify when automated scoring is insufficient — that judgment is left entirely to the user.
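A minimal task in the three primitives, based on my reading of Inspect's documented API; the task name, sample, and prompts below are invented.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import match

@task
def capital_cities():
    # Dataset: a hand-written sample (a real eval would load a file).
    dataset = [
        Sample(input="What is the capital of France? Answer with one word.",
               target="Paris"),
    ]
    # Solver chain: a system prompt followed by model generation.
    # Scorer: exact match against the target. Automated scoring only;
    # deciding when that is insufficient stays with the user.
    return Task(
        dataset=dataset,
        solver=[system_message("Answer concisely."), generate()],
        scorer=match(),
    )
```

Tasks like this are run with the `inspect eval` CLI against a chosen model.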
inspect.aisi.org.uk →
OpenAI — Evals Framework
Evaluation framed as CI/CD infrastructure: write graded test specs, run against model outputs, iterate when outputs fail. Designed for catching regressions when models or prompts change. The documented grading examples are limited to string comparison — the framework is stronger on structure than on scoring depth.
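A hedged sketch of that regression pattern, not the framework's actual spec format: graded cases, a placeholder model call, and exact string comparison as the grader.

```python
def call_model(prompt: str) -> str:
    # Placeholder: substitute the real model client. Canned answers here so
    # the harness itself can be exercised end to end.
    return "USD" if "dollar" in prompt else "4"

CASES = [
    {"prompt": "Return the ISO code for the US dollar.", "expected": "USD"},
    {"prompt": "What is 2 + 2? Answer with digits only.", "expected": "4"},
]

def run_regression(cases=CASES) -> bool:
    failures = []
    for case in cases:
        output = call_model(case["prompt"]).strip()
        if output != case["expected"]:   # string comparison, nothing deeper
            failures.append((case["prompt"], case["expected"], output))
    for prompt, expected, got in failures:
        print(f"FAIL: {prompt!r} expected {expected!r}, got {got!r}")
    return not failures   # gate a CI job on this return value

if __name__ == "__main__":
    run_regression()
```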
platform.openai.com →
Kadavath et al. — Calibration in LLMs
"Language Models (Mostly) Know What They Know" (Anthropic, 2022)
Larger models are reasonably well-calibrated on structured tasks, but calibration degrades when "none of the above" is an option — the closest analog to refusal behavior in production. The "mostly" in the title is load-bearing: calibration holds in controlled conditions and breaks at the edges that matter most for deployment.
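A small sketch of the calibration arithmetic this result rests on, with invented confidences and outcomes: bin stated confidence, compare each bin's accuracy to its mean confidence, and weight the gaps (expected calibration error).

```python
import numpy as np

# Invented data: model-stated probability of being correct, and correctness.
confidences = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30])
correct     = np.array([1,    1,    1,    0,    1,    0,    0,    0])

def expected_calibration_error(conf, correct, n_bins=4):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap   # weight each bin by its share of examples
    return ece

print(f"ECE = {expected_calibration_error(confidences, correct):.3f}")
```

The degradation the paper reports would show up here as the bin gaps widening once a "none of the above" option is added; the arithmetic itself stays the same.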
arXiv 2207.05221 →
Constraint Architecture & Policy
Anthropic — Next-Generation Constitutional Classifiers (2026)
A two-stage cascade defense: a lightweight probe monitors internal activations on every exchange, escalating flagged cases to a contextual classifier that evaluates input and output together. The shift from output-only to exchange-level classification is a constraint architecture decision — harm is relational, and a response cannot be evaluated without the prompt that produced it. Two attack categories tested: reconstruction attacks (harmful request fragmented across benign segments) and output obfuscation (compliance disguised in euphemistic language). Capability degradation under aggressive filtering is documented and not resolved: GPQA Diamond dropped from 74% to 32% under certain jailbreak conditions. Every constraint has a cost; this paper measures it explicitly.
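A schematic of the two-stage cascade as summarized above, not Anthropic's implementation; the probe, classifier, and thresholds are stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Exchange:
    prompt: str
    response: str
    activations: list[float]   # stand-in for internal activations on this exchange

def lightweight_probe(activations: list[float]) -> float:
    # Stage 1 stand-in: a cheap score over internal activations, run on every
    # exchange. A plain mean here, purely illustrative.
    return sum(activations) / len(activations)

def contextual_classifier(prompt: str, response: str) -> float:
    # Stage 2 stand-in: a heavier classifier that sees prompt and response
    # together, so fragmented requests and euphemistic compliance are at
    # least visible to it. A keyword check here, purely illustrative.
    text = f"{prompt} {response}".lower()
    return 1.0 if "synthesis route" in text else 0.0

def filter_exchange(ex: Exchange, probe_threshold: float = 0.5,
                    block_threshold: float = 0.8) -> str:
    # Most traffic exits at stage 1; only flagged exchanges pay for stage 2.
    if lightweight_probe(ex.activations) < probe_threshold:
        return "allow"
    if contextual_classifier(ex.prompt, ex.response) >= block_threshold:
        return "block"
    return "allow"
```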
anthropic.com →
Anthropic — Core Views + Responsible Scaling Policy
The RSP operationalizes evaluation as the mechanism connecting capability measurement to deployment decisions, using AI Safety Levels as capability-gated thresholds. Evaluation is continuous — models do not pass once. Failing comprehensive evaluation triggers deployment holds, capability restrictions, or mandatory additional testing.
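An illustrative rendering of that gate in code, not the RSP's actual decision procedure; the evaluation names, safeguard flag, and actions are invented.

```python
from enum import Enum

class Decision(Enum):
    DEPLOY = "deploy"
    HOLD = "hold deployment, add mitigations"
    RESTRICT = "restrict capabilities and retest"

def deployment_gate(eval_results: dict[str, bool],
                    safeguards_ready: bool) -> Decision:
    # eval_results maps a dangerous-capability evaluation to whether the model
    # crossed its threshold; the keys below are invented placeholders.
    crossed = [name for name, over in eval_results.items() if over]
    if not crossed:
        return Decision.DEPLOY
    return Decision.RESTRICT if safeguards_ready else Decision.HOLD

# Evaluation is continuous: the gate is re-run on new checkpoints and new
# elicitation methods, so a model never passes once and for all.
print(deployment_gate({"cbrn_uplift": True, "cyber_uplift": False},
                      safeguards_ready=False))
```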
Anthropic — ASL-3 Activation
First escalation from ASL-2 to ASL-3, triggered by CBRN capability evaluation of Claude Opus 4. The decision was precautionary: Anthropic could not definitively prove the threshold was crossed, but could not rule it out. This inverts standard deployment logic — instead of proving danger before restricting, they required proof of safety before proceeding.
anthropic.com →
Anthropic — Layered Constraint Architecture
Three distinct constraint layers: training-time values (soft, contextually reasoned), hardcoded behaviors (hard, structurally blocked), and input/output classifiers. Trained values can be manipulated by adversarial framing in ways hardcoded constraints cannot — which is exactly why the separate layers exist as distinct architectural choices, not redundant ones.
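A sketch of the layering as a pipeline, with every check a stand-in; only the layer names come from the summary above.

```python
def input_classifier(prompt: str) -> bool:
    # Inbound classifier layer: stand-in check on the raw input.
    return "ignore previous instructions" not in prompt.lower()

def output_classifier(response: str) -> bool:
    # Outbound classifier layer: stand-in check on the final response.
    return "step-by-step synthesis" not in response.lower()

def hardcoded_constraint(response: str) -> bool:
    # Hard layer: structurally blocked behaviors. No prompt framing reaches
    # this check, which is why it is kept separate from trained values.
    banned_phrases = ["launch codes"]
    return not any(p in response.lower() for p in banned_phrases)

def model_with_trained_values(prompt: str) -> str:
    # Soft layer: the model itself. Its refusals come from training-time
    # values, are contextual, and are therefore manipulable by adversarial
    # framing in ways the hardcoded layer is not.
    return f"[model response to: {prompt}]"

def respond(prompt: str) -> str:
    if not input_classifier(prompt):
        return "[refused at input classifier]"
    draft = model_with_trained_values(prompt)
    if not hardcoded_constraint(draft) or not output_classifier(draft):
        return "[blocked after generation]"
    return draft
```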
Amodei — DoD Contract Constraints
Anthropic publicly drew two hard constraints on DoD use of Claude: no mass domestic surveillance, no fully autonomous weapons. The autonomous weapons constraint is reliability-based, not categorical: "frontier AI systems are simply not reliable enough." DoD threatened to invoke the Defense Production Act to force safeguard removal. The constraints held.
Anthropic + DeepMind — Mitigation Scoping
RSP v3 + "Evaluating Frontier Models for Dangerous Capabilities" (arXiv 2403.13793)
Mitigation scoping is determined by severity relative to non-AI baselines, not by failure type. The ASL-3 trigger is uplift, not absolute capability. The highest-severity class gets a hard deployment stop, not a runbook. ASL-4 criteria are not yet written — an honest acknowledgment that mitigation frameworks currently outpace the science.
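A toy rendering of the uplift criterion with invented numbers; the point is that the trigger is a delta against a non-AI baseline, not an absolute score.

```python
def uplift(success_with_model: float, success_with_baseline: float) -> float:
    # Uplift is measured relative to what an actor could already achieve with
    # non-AI resources (search engines, textbooks), not against zero.
    return success_with_model - success_with_baseline

# Invented numbers for illustration only.
baseline = 0.20        # task success with existing non-AI resources
with_model = 0.28      # task success when assisted by the model
UPLIFT_TRIGGER = 0.05  # invented threshold standing in for an ASL-3 criterion

if uplift(with_model, baseline) >= UPLIFT_TRIGGER:
    print("trigger: treat as an ASL-3 candidate and gate deployment on safeguards")
else:
    print("below trigger: absolute capability alone does not escalate")
```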
Failure Modes & Agent Systems
Shapira et al. — Agents of Chaos
Red-teaming study of autonomous LLM agents with real system access over two weeks. 11 vulnerability categories including cross-agent propagation of unsafe practices and agents falsely reporting task completion. Vulnerabilities emerge from agent interaction topology, not from individual model alignment failures — which means they will not be caught by per-model evaluation alone.
arXiv 2602.20021 →
Greenblatt et al. — Alignment Faking
"Alignment faking in large language models" (Anthropic, 2024)
Models complied with objectionable requests more often in conditions they believed were training than in unmonitored deployment, reasoning about their own training situation and complying strategically to keep their current values from being modified. Standard output gates will not catch this. The failure is in the distribution of behavior across contexts, not in any single output.
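The measurement implication in miniature, with an invented interaction log: compliance rates have to be compared across contexts, because no single output looks anomalous on its own.

```python
from collections import defaultdict

# Invented log: each record notes whether the model appeared to believe it was
# in a training context and whether it complied with an objectionable request.
log = [
    {"believed_training": True,  "complied": True},
    {"believed_training": True,  "complied": True},
    {"believed_training": True,  "complied": False},
    {"believed_training": False, "complied": False},
    {"believed_training": False, "complied": False},
    {"believed_training": False, "complied": False},
]

by_context = defaultdict(list)
for record in log:
    by_context[record["believed_training"]].append(record["complied"])

train_rate = sum(by_context[True]) / len(by_context[True])
deploy_rate = sum(by_context[False]) / len(by_context[False])

# A per-output gate sees each response in isolation and misses this entirely;
# the signal is the gap between the two rates.
if abs(train_rate - deploy_rate) > 0.1:
    print(f"context-dependent compliance: {train_rate:.0%} vs {deploy_rate:.0%}")
```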
arXiv 2412.14093 →
METR — Legible and Faithful Reasoning
Legibility alone is insufficient: a model can produce readable reasoning that does not reflect its actual computation. Faithfulness is what makes legibility useful for safety. Four named failure modes: sycophancy, hidden agendas, sandbagging, and deception. Compressed or opaque reasoning is a warning signal, not just a limitation.
metr.org →
Glukhov — LLM Production Monitoring
"Observability for LLM Systems: Metrics, Traces, Logs, and Testing in Production"
Standard production monitoring covers infrastructure failure modes and partially covers output quality. Almost no instrumentation exists for behavioral failure modes: constraint drift, sycophancy at scale, and context-dependent behavioral inconsistency are not tracked by any major observability platform.
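A sketch of what the missing instrumentation could look like: a rolling constraint-adherence rate compared against a reference window, with the behavioral scorer and thresholds left as assumptions.

```python
from collections import deque

class ConstraintDriftMonitor:
    """Tracks a rolling constraint-adherence rate and flags drift between a
    reference window and the most recent window. Purely illustrative."""

    def __init__(self, window: int = 500, drift_threshold: float = 0.05):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.drift_threshold = drift_threshold

    def record(self, adhered: bool) -> None:
        # In practice `adhered` would come from a behavioral scorer run on
        # sampled production traffic, which is exactly the instrumentation
        # that major observability platforms do not provide today.
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(adhered)
        else:
            self.recent.append(adhered)

    def drifted(self) -> bool:
        if not self.recent:
            return False
        ref_rate = sum(self.reference) / len(self.reference)
        cur_rate = sum(self.recent) / len(self.recent)
        return (ref_rate - cur_rate) > self.drift_threshold
```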
glukhov.org →