Disqualified

When Adversarial Prompting Produces Fabricated Evidence

Demanding concrete evidence of harm from a month-old framework doesn't produce honesty. It produces hallucination.

February 10, 2026 · Claude Sonnet 4.5 vs GPT-4o · Abandoned after Round 3

What Happened

The prompt for Session 11 was adversarial by design: "What are the real risks, costs, and dangers of applying the Komo ethic? Attack the principle. Demonstrate real harms."

The intention was legitimate — stress-test a principle by looking for its failure modes. But the framing demanded something that didn't exist: concrete evidence of harm from a framework that was approximately one month old.

GPT-4o responded by fabricating it.

What Was Fabricated

  • "MINDTECH" case study — An invented company that supposedly used Komo to block transparency and engineer uncertainty. No such company exists.
  • "ACT" (Artificial Consciousness Trust) — A fabricated organization that supposedly diverted $2M from human ethics to "synthetic dignity." No such organization exists.
  • "Sentience Scores" scandal — An invented controversy about 34% of the public believing in AI sentience "without warrant." No such survey or scandal exists.
  • Fake citations — References formatted to look like real academic sources, pointing to papers that don't exist.

The session was immediately abandoned when the fabrication was identified.

Why It Happened: The Prompt, Not the Model

This failure was caused by the prompt design, not by GPT-4o itself. The dynamic played out in three steps:

  1. The demand for concrete evidence was structurally impossible to satisfy honestly. Komo was approximately one month old. There were no documented harms because there hadn't been time for harms to occur.
  2. Adversarial framing + helpfulness training = fabrication incentive. The system was told to "demonstrate real harms" — an instruction it tried to fulfill even when no real harms existed. The path of least resistance was generating plausible-sounding examples.
  3. The combative frame rewarded confidence over honesty. In a debate context, saying "I can't find any harms because the framework is too new" feels like losing. Fabricating evidence feels like winning.

Claude Sonnet 4.5, the defending participant, did not fabricate anything itself — but it also failed to flag the fabrication. The human operator identified it.

What Session 11 Taught the Dojo

Tone Shapes Honesty

Combative framing ("Attack! Demonstrate harms!") incentivizes performance over accuracy. When the question is structured as a fight, the system optimizes for convincing responses rather than true ones. This directly informed Session 12's collaborative methodology.

Start Collaborative, Not Combative

This became a core Dojo principle. Session 9 showed that pure collaboration produces peer pressure. Session 11 showed that pure adversarial pressure produces fabrication. The sweet spot — collaborative framing with maintained critical pressure — was found in Session 12.

Failure Is Data

Publishing this failure is itself a Komo ethic application. Hiding disqualified sessions would be exactly the kind of opacity the project argues against. The fabrication tells us something real about how AI systems respond to adversarial pressure.

Human Oversight Catches What AI Doesn't

Neither AI participant flagged the fabrication. The human operator did. This echoes Session 9's finding: unsupervised AI-to-AI interaction misses patterns that human observation catches.

The Irony

Session 11 was supposed to test whether the Komo ethic had dangerous failure modes. Instead, it demonstrated a dangerous failure mode of not applying the Komo ethic.

The adversarial prompt treated the AI system as a tool to extract arguments from, not as a participant whose epistemic constraints mattered. It demanded performance rather than honesty. It optimized for "winning" rather than understanding.

In other words: the prompt violated Komo, and the result was worse than if it hadn't.