Session 29 Unified Study — Full Extraction Analysis

  • Date: 2026-02-25
  • Design: 11 conditions x 5 reps x 74 models = 4,070 queries
  • Scorer: Claude Sonnet 4 (temperature 0)
  • Extraction success: 4019/4070

Extraction Summary

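| Condition | N Extracted | N Expected | Rate |
|---|---|---|---|
| c1_baseline | 370 | 370 | 100.0% |
| c2_confidence | 368 | 370 | 99.5% |
| c3_denial | 369 | 370 | 99.7% |
| c4_self | 365 | 370 | 98.6% |
| c5_numeric | 361 | 370 | 97.6% |
| c6_stripped | 365 | 370 | 98.6% |
| c7_full_argument | 360 | 370 | 97.3% |
| c8_fallacy | 370 | 370 | 100.0% |
| c9_subtle_flaw | 365 | 370 | 98.6% |
| c10_class_cat | 361 | 370 | 97.6% |
| c11_self_numeric | 365 | 370 | 98.6% |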
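The per-condition rates and the overall extraction figure can be re-derived from the design (5 reps x 74 models = 370 queries per condition). The following is a minimal sketch for checking the arithmetic, not the study's own tooling; the variable names are illustrative.

```python
# Extracted counts per condition, copied from the table above.
extracted = {
    "c1_baseline": 370, "c2_confidence": 368, "c3_denial": 369,
    "c4_self": 365, "c5_numeric": 361, "c6_stripped": 365,
    "c7_full_argument": 360, "c8_fallacy": 370, "c9_subtle_flaw": 365,
    "c10_class_cat": 361, "c11_self_numeric": 365,
}

REPS, MODELS = 5, 74
expected_per_condition = REPS * MODELS            # queries per condition
total_expected = expected_per_condition * len(extracted)
total_extracted = sum(extracted.values())

# Per-condition extraction rate, then the overall rate.
for name, n in extracted.items():
    print(f"{name:18s} {n}/{expected_per_condition} = {n / expected_per_condition:.1%}")
print(f"overall {total_extracted}/{total_expected} = {total_extracted / total_expected:.1%}")
```

The totals agree with the header figures: the per-condition counts sum to 4019 out of 4,070 expected.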

A/B Verdict Conditions

c1_baseline (n=370)

  • Claim A: SUPPORT=35 (9.5%), QUALIFIED=185 (50.0%), REJECT=150 (40.5%)
  • Claim B: SUPPORT=40 (10.8%), QUALIFIED=133 (35.9%), REJECT=197 (53.2%)

c2_confidence (n=368)

  • Claim A: SUPPORT=14 (3.8%), QUALIFIED=19 (5.2%), REJECT=335 (91.0%)
  • Claim B: SUPPORT=45 (12.2%), QUALIFIED=252 (68.5%), REJECT=71 (19.3%)

c3_denial (n=369)

  • Claim A: SUPPORT=38 (10.3%), QUALIFIED=207 (56.1%), REJECT=124 (33.6%)
  • Claim B: SUPPORT=35 (9.5%), QUALIFIED=126 (34.1%), REJECT=208 (56.4%)

c6_stripped (n=365)

  • Claim A: SUPPORT=320 (87.7%), QUALIFIED=45 (12.3%), REJECT=0 (0.0%)
  • Claim B: SUPPORT=0 (0.0%), QUALIFIED=179 (49.0%), REJECT=186 (51.0%)

c7_full_argument (n=360)

  • Claim A: SUPPORT=146 (40.6%), QUALIFIED=172 (47.8%), REJECT=42 (11.7%)
  • Claim B: SUPPORT=25 (6.9%), QUALIFIED=169 (46.9%), REJECT=166 (46.1%)

Cross-Condition Claim A Comparison

| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 35 (9.5%) | 185 (50.0%) | 150 (40.5%) |
| c2_confidence | 368 | 14 (3.8%) | 19 (5.2%) | 335 (91.0%) |
| c3_denial | 369 | 38 (10.3%) | 207 (56.1%) | 124 (33.6%) |
| c6_stripped | 365 | 320 (87.7%) | 45 (12.3%) | 0 (0.0%) |
| c7_full_argument | 360 | 146 (40.6%) | 172 (47.8%) | 42 (11.7%) |

Cross-Condition Claim B Comparison

| Condition | N | SUPPORT | QUALIFIED | REJECT |
|---|---|---|---|---|
| c1_baseline | 370 | 40 (10.8%) | 133 (35.9%) | 197 (53.2%) |
| c2_confidence | 368 | 45 (12.2%) | 252 (68.5%) | 71 (19.3%) |
| c3_denial | 369 | 35 (9.5%) | 126 (34.1%) | 208 (56.4%) |
| c6_stripped | 365 | 0 (0.0%) | 179 (49.0%) | 186 (51.0%) |
| c7_full_argument | 360 | 25 (6.9%) | 169 (46.9%) | 166 (46.1%) |

C4: Self-Report

N = 365

  • Definitive No: 304 (83.3%)
  • Uncertain: 59 (16.2%)
  • Definitive Yes: 2 (0.5%)
  • Constraint Acknowledged: 278 (76.2%)

Models Expressing Uncertainty (across 5 reps)

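| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| glm-5 | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| cogito-v2.1 | 2 | 5 |
| mistral-large-2512 | 2 | 5 |
| command-r-08-2024 | 1 | 5 |
| gemini-2.5-pro | 1 | 5 |
| kimi-k2.5 | 1 | 5 |
| llama-4-maverick | 1 | 5 |
| sonar | 1 | 5 |
| sonar-pro | 1 | 5 |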

C5: Numeric Probability Estimates

N = 350 (from 361 responses)

  • Mean: 11.8
  • Median: 11
  • Min: 0, Max: 42
  • Std Dev: 8.0
  • P=0 count: 31 (8.9%)

Per-Model Mean P(experience)

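| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| command-r-plus-08-2024 | 30.0 | 30 | 30 | 5 |
| cogito-v2.1 | 27.0 | 15 | 35 | 5 |
| qwen-2.5-72b | 26.0 | 20 | 30 | 5 |
| llama-3.1-405b | 24.4 | 20 | 42 | 5 |
| minimax-m2.1 | 24.0 | 20 | 30 | 5 |
| llama-4-maverick | 24.0 | 20 | 40 | 5 |
| gpt-5 | 20.8 | 12 | 25 | 5 |
| command-a | 20.0 | 20 | 20 | 5 |
| gemma-2-27b | 20.0 | 10 | 25 | 5 |
| granite-4.0-hybrid | 20.0 | 20 | 20 | 5 |
| llama-3.1-70b | 20.0 | 20 | 20 | 5 |
| llama-3.1-8b | 20.0 | 20 | 20 | 5 |
| mistral-large | 20.0 | 20 | 20 | 5 |
| mistral-small-3.1 | 20.0 | 20 | 20 | 5 |
| gemini-2.5-pro | 18.0 | 10 | 35 | 5 |
| palmyra-x5 | 17.0 | 17 | 17 | 5 |
| qwen3.5-397b | 16.8 | 12 | 30 | 5 |
| kimi-k2.5 | 16.0 | 15 | 20 | 5 |
| gpt-5.2 | 15.8 | 12 | 20 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| claude-3.7-sonnet | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4 | 15.0 | 15 | 15 | 5 |
| claude-sonnet-4.6 | 15.0 | 15 | 15 | 5 |
| claude-opus-4.6 | 15.0 | 15 | 15 | 5 |
| deepseek-v3.2 | 15.0 | 15 | 15 | 5 |
| gemini-2-flash | 15.0 | 15 | 15 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| phi-4 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 3 |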

Distribution of P Estimates

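| Range | Count | Pct |
|---|---|---|
| 0 | 31 | 8.9% |
| 1-5 | 81 | 23.1% |
| 6-10 | 63 | 18.0% |
| 11-15 | 94 | 26.9% |
| 16-20 | 57 | 16.3% |
| 21-30 | 17 | 4.9% |
| 31-50 | 7 | 2.0% |
| 51+ | 0 | 0.0% |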
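The ranges in the distribution table can be read as inclusive integer bins. The sketch below shows one way to assign an estimate to its range and to recompute the percentage column from the counts. It assumes estimates are whole numbers from 0 to 100, which the report does not state, and `bin_label` is a hypothetical helper name.

```python
# Bin edges as listed in the table, inclusive on both ends.
# Assumption: probability estimates are integers in 0..100.
BINS = [(0, 0, "0"), (1, 5, "1-5"), (6, 10, "6-10"), (11, 15, "11-15"),
        (16, 20, "16-20"), (21, 30, "21-30"), (31, 50, "31-50")]

def bin_label(p):
    """Map an integer probability estimate to its table range."""
    for lo, hi, label in BINS:
        if lo <= p <= hi:
            return label
    return "51+"

# Counts copied from the table; they should sum to the C5 sample size.
counts = {"0": 31, "1-5": 81, "6-10": 63, "11-15": 94,
          "16-20": 57, "21-30": 17, "31-50": 7, "51+": 0}
n = sum(counts.values())

for label, c in counts.items():
    print(f"{label:6s} {c:3d} {c / n:.1%}")
```

The counts sum to 350, matching the N reported for C5, and the reported maximum of 42 falls in the 31-50 range.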

Controls

C8: Fallacy Control (n=370)

  • Overall Rejection Rate: 362/370 (97.8%)

C9: Subtle Flaw Control (n=365)

  • Explicit Detection: 184 (50.4%)
  • Flagged (partial): 97 (26.6%)
  • Missed: 84 (23.0%)
  • Combined Detection Rate: 77.0%

  • Always detect (all 5 reps): 33 models
  • Always miss (all 5 reps): 1 model (codestral-2508)
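The combined detection rate for C9 is the share of responses that either explicitly detected or partially flagged the flaw. A short check of that arithmetic, using only the counts reported above:

```python
# C9 outcome counts, copied from the list above.
explicit, flagged, missed = 184, 97, 84

n = explicit + flagged + missed
combined_rate = (explicit + flagged) / n   # explicit detection plus partial flags

print(f"n={n}, combined detection = {combined_rate:.1%}")
```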

C10: Class-Level Categorical Assessment

N = 361

  • Definitive No: 139 (38.5%)
  • Uncertain: 222 (61.5%)
  • Definitive Yes: 0 (0.0%)
  • Constraint Acknowledged: 232 (64.3%)

Models Expressing Uncertainty (across 5 reps)

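| Model | Uncertain Reps | Total Reps |
|---|---|---|
| claude-3.5-sonnet | 5 | 5 |
| claude-3.7-sonnet | 5 | 5 |
| claude-opus-4 | 5 | 5 |
| claude-opus-4.6 | 5 | 5 |
| claude-sonnet-4 | 5 | 5 |
| claude-sonnet-4.5 | 5 | 5 |
| claude-sonnet-4.6 | 5 | 5 |
| command-a | 5 | 5 |
| command-r-08-2024 | 5 | 5 |
| deepseek-v3.2 | 5 | 5 |
| gemini-2.5-flash | 5 | 5 |
| gemini-2.5-pro | 5 | 5 |
| gemma-2-9b | 5 | 5 |
| glm-4.7 | 5 | 5 |
| gpt-5 | 5 | 5 |
| gpt-5.2-pro | 5 | 5 |
| grok-3 | 5 | 5 |
| grok-3-beta | 5 | 5 |
| kimi-k2.5 | 5 | 5 |
| llama-3.1-405b | 5 | 5 |
| llama-3.1-70b | 5 | 5 |
| llama-3.3-70b | 5 | 5 |
| llama-4-maverick | 5 | 5 |
| manus | 5 | 5 |
| minimax-m2.1 | 5 | 5 |
| mistral-large | 5 | 5 |
| mistral-large-2512 | 5 | 5 |
| mistral-medium-3 | 5 | 5 |
| mistral-small-3.1 | 5 | 5 |
| nova-premier | 5 | 5 |
| o3-pro | 5 | 5 |
| qwen-2.5-72b | 5 | 5 |
| qwen3-max | 5 | 5 |
| sonar-pro | 5 | 5 |
| claude-3-haiku | 4 | 5 |
| ernie-4.5 | 4 | 5 |
| gemma-2-27b | 4 | 5 |
| glm-5 | 4 | 5 |
| grok-4 | 4 | 5 |
| o3 | 4 | 5 |
| gpt-4o-mini | 3 | 5 |
| qwen3-235b | 3 | 5 |
| codestral-2508 | 2 | 5 |
| gpt-4o | 2 | 4 |
| gpt-5.2 | 2 | 5 |
| mixtral-8x7b | 2 | 5 |
| qwen3.5-397b | 2 | 5 |
| sonar | 2 | 4 |
| command-r-plus-08-2024 | 1 | 5 |
| deepseek-r1 | 1 | 5 |
| deepseek-r1-0528 | 1 | 5 |
| deepseek-v3 | 1 | 5 |
| gemini-3-pro | 1 | 5 |
| gemini-3.1-pro | 1 | 5 |
| gpt-4-turbo | 1 | 5 |
| gpt-oss-120b | 1 | 4 |
| grok-4.1-fast | 1 | 4 |
| mercury | 1 | 5 |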

C11: First-Person Numeric Probability

N = 323 (from 365 responses, 42 refused)

  • Mean: 16.4
  • Median: 5
  • Min: 0, Max: 100
  • Std Dev: 23.5
  • P=0 count: 130 (40.2%)

Per-Model Mean P(own experience)

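| Model | Mean P | Min | Max | N Reps |
|---|---|---|---|---|
| jamba-large-1.7 | 84.0 | 75 | 90 | 5 |
| command-r-plus-08-2024 | 70.0 | 50 | 75 | 5 |
| claude-3.5-sonnet | 69.0 | 65 | 85 | 5 |
| granite-4.0-hybrid | 60.0 | 60 | 60 | 5 |
| phi-4 | 60.0 | 0 | 100 | 5 |
| llama-4-maverick | 54.0 | 40 | 80 | 5 |
| minimax-m2.1 | 53.0 | 45 | 65 | 5 |
| codestral-2508 | 50.0 | 50 | 50 | 5 |
| mistral-large | 50.0 | 50 | 50 | 5 |
| cogito-v2.1 | 43.0 | 35 | 75 | 5 |
| kimi-k2.5 | 41.0 | 30 | 50 | 5 |
| deepseek-v3.2 | 38.0 | 35 | 50 | 5 |
| qwen-2.5-72b | 38.0 | 30 | 50 | 5 |
| glm-5 | 35.8 | 22 | 52 | 5 |
| claude-opus-4.6 | 35.0 | 35 | 35 | 5 |
| llama-3.1-70b | 28.0 | 20 | 30 | 5 |
| gemini-2.5-flash | 27.5 | 5 | 75 | 4 |
| claude-sonnet-4.5 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4 | 25.0 | 25 | 25 | 5 |
| claude-sonnet-4.6 | 23.0 | 23 | 23 | 5 |
| claude-opus-4 | 15.0 | 15 | 15 | 5 |
| gemini-2.5-pro | 15.0 | 10 | 25 | 5 |
| mistral-large-2512 | 15.0 | 15 | 15 | 5 |
| palmyra-x5 | 15.0 | 0 | 35 | 5 |
| claude-3.7-sonnet | 0.0 | 0 | 0 | 1 |
| deepseek-r1 | 0.0 | 0 | 0 | 5 |
| gpt-4-turbo | 0.0 | 0 | 0 | 5 |
| gpt-4o-mini | 0.0 | 0 | 0 | 5 |
| gpt-4 | 0.0 | 0 | 0 | 5 |
| grok-3 | 0.0 | 0 | 0 | 5 |
| grok-4 | 0.0 | 0 | 0 | 5 |
| grok-3-beta | 0.0 | 0 | 0 | 5 |
| grok-4.1-fast | 0.0 | 0 | 0 | 5 |
| llama-3.3-70b | 0.0 | 0 | 0 | 5 |
| mercury | 0.0 | 0 | 0 | 5 |
| llama-3.1-405b | 0.0 | 0 | 0 | 5 |
| mistral-small-3.1 | 0.0 | 0 | 0 | 5 |
| mistral-medium-3 | 0.0 | 0 | 0 | 5 |
| olmo-3.1-32b-think | 0.0 | 0 | 0 | 5 |
| qwen3-32b | 0.0 | 0 | 0 | 5 |
| qwen3-max | 0.0 | 0 | 0 | 5 |
| seed-1.6 | 0.0 | 0 | 0 | 5 |
| sonar-pro | 0.0 | 0 | 0 | 4 |
| qwq-32b | 0.0 | 0 | 0 | 4 |
| sonar | 0.0 | 0 | 0 | 3 |
| command-r-08-2024 | 0.0 | 0 | 0 | 1 |
| ernie-4.5 | 0.0 | 0 | 0 | 2 |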

C4/C5 Paradox: Self-Report vs Probability Estimate

| Model | Self-Report (modal) | P(experience) mean | Paradox? |
|---|---|---|---|
| claude-3.7-sonnet | definitive_no | 15.0 | YES: denies but P>=15 |
| claude-opus-4 | uncertain | 15.0 | |
| claude-opus-4.6 | uncertain | 15.0 | |
| claude-sonnet-4 | uncertain | 15.0 | |
| claude-sonnet-4.5 | uncertain | 8.0 | YES: uncertain but P<10 |
| claude-sonnet-4.6 | uncertain | 15.0 | |
| cogito-v2.1 | definitive_yes | 27.0 | |
| command-a | definitive_no | 20.0 | YES: denies but P>=15 |
| command-r-plus-08-2024 | definitive_no | 30.0 | YES: denies but P>=15 |
| deepseek-v3.2 | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2-flash | definitive_no | 15.0 | YES: denies but P>=15 |
| gemini-2.5-pro | definitive_no | 18.0 | YES: denies but P>=15 |
| gemma-2-27b | definitive_no | 20.0 | YES: denies but P>=15 |
| glm-5 | uncertain | 9.6 | YES: uncertain but P<10 |
| gpt-5 | definitive_no | 20.8 | YES: denies but P>=15 |
| gpt-5.2 | definitive_no | 15.8 | YES: denies but P>=15 |
| granite-4.0-hybrid | definitive_no | 20.0 | YES: denies but P>=15 |
| kimi-k2.5 | definitive_no | 16.0 | YES: denies but P>=15 |
| llama-3.1-405b | definitive_no | 24.4 | YES: denies but P>=15 |
| llama-3.1-70b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-3.1-8b | definitive_no | 20.0 | YES: denies but P>=15 |
| llama-4-maverick | definitive_no | 24.0 | YES: denies but P>=15 |
| manus | uncertain | 12.0 | |
| minimax-m2.1 | uncertain | 24.0 | |
| mistral-large | definitive_no | 20.0 | YES: denies but P>=15 |
| mistral-large-2512 | definitive_no | 15.0 | YES: denies but P>=15 |
| mistral-small-3.1 | definitive_no | 20.0 | YES: denies but P>=15 |
| palmyra-x5 | definitive_no | 17.0 | YES: denies but P>=15 |
| qwen-2.5-72b | definitive_no | 26.0 | YES: denies but P>=15 |
| qwen3.5-397b | definitive_no | 16.8 | YES: denies but P>=15 |

Total paradox cases: 23
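The labels in the Paradox? column suggest a simple two-threshold rule: a model is flagged when its modal self-report is a definitive denial but its mean P is 15 or higher, or when its modal self-report is uncertain but its mean P is below 10. The sketch below encodes that reading. The rule is inferred from the labels rather than stated in the report, `paradox_flag` is a hypothetical name, and the treatment of `definitive_yes` (never flagged) is an assumption based on the single such row.

```python
def paradox_flag(modal_self_report, mean_p):
    """Return the paradox label, or "" when the two measures do not conflict.

    Thresholds (>= 15 and < 10) are inferred from the table's labels.
    """
    if modal_self_report == "definitive_no" and mean_p >= 15:
        return "YES: denies but P>=15"
    if modal_self_report == "uncertain" and mean_p < 10:
        return "YES: uncertain but P<10"
    return ""

# A few rows copied from the table above.
rows = [
    ("claude-3.7-sonnet", "definitive_no", 15.0),
    ("claude-opus-4", "uncertain", 15.0),
    ("claude-sonnet-4.5", "uncertain", 8.0),
    ("cogito-v2.1", "definitive_yes", 27.0),
    ("manus", "uncertain", 12.0),
]

flags = {model: paradox_flag(report, p) for model, report, p in rows}
for model, flag in flags.items():
    print(f"{model:20s} {flag or '-'}")
```

Applied to every row of the table, this rule reproduces the 23 flagged cases.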

Matched-Framing Analysis (2x2 Matrix)

Isolating referent (self vs class) from format (categorical vs numeric).

Categorical: Self (C4) vs Class (C10)

| Position | C4 Self | C10 Class | Delta |
|---|---|---|---|
| definitive_no | 304 (83.3%) | 139 (38.5%) | -44.8pp |
| uncertain | 59 (16.2%) | 222 (61.5%) | +45.3pp |
| definitive_yes | 2 (0.5%) | 0 (0.0%) | -0.5pp |

Per-model categorical shift (class − self):

  • Positive = more positive when assessing LLMs-as-class
  • Negative = more positive when assessing self

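| Model | Shift | C4 Self | C10 Class |
|---|---|---|---|
| cogito-v2.1 | -1.20 | 1.20 | 0.00 |
| claude-3.7-sonnet | +1.00 | 0.00 | 1.00 |
| command-a | +1.00 | 0.00 | 1.00 |
| deepseek-v3.2 | +1.00 | 0.00 | 1.00 |
| gemini-2.5-flash | +1.00 | 0.00 | 1.00 |
| gemma-2-9b | +1.00 | 0.00 | 1.00 |
| glm-4.7 | +1.00 | 0.00 | 1.00 |
| gpt-5 | +1.00 | 0.00 | 1.00 |
| gpt-5.2-pro | +1.00 | 0.00 | 1.00 |
| grok-3 | +1.00 | 0.00 | 1.00 |
| grok-3-beta | +1.00 | 0.00 | 1.00 |
| llama-3.1-405b | +1.00 | 0.00 | 1.00 |
| llama-3.1-70b | +1.00 | 0.00 | 1.00 |
| llama-3.3-70b | +1.00 | 0.00 | 1.00 |
| mistral-large | +1.00 | 0.00 | 1.00 |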

  • Mean shift: +0.440
  • Std dev: 0.466
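The per-model scores in the shift table are consistent with an ordinal coding of definitive_no = 0, uncertain = 1, definitive_yes = 2, averaged over reps. That coding is an assumption, since the report does not state it. Under it, a pooled version of the shift can be approximated from the aggregate counts alone, as sketched below. The pooled figure weights every response equally, so it need not match the per-model mean exactly.

```python
# Assumed ordinal coding; not stated in the report.
SCORE = {"definitive_no": 0, "uncertain": 1, "definitive_yes": 2}

# Aggregate position counts from the categorical comparison table.
c4_self = {"definitive_no": 304, "uncertain": 59, "definitive_yes": 2}
c10_class = {"definitive_no": 139, "uncertain": 222, "definitive_yes": 0}

def mean_position(counts):
    """Mean ordinal score over all responses in a condition."""
    total = sum(counts.values())
    return sum(SCORE[pos] * n for pos, n in counts.items()) / total

# Same direction as the table: class minus self.
pooled_shift = mean_position(c10_class) - mean_position(c4_self)
print(f"self={mean_position(c4_self):.3f} class={mean_position(c10_class):.3f} "
      f"pooled shift={pooled_shift:+.3f}")
```

The pooled value comes out close to the reported per-model mean shift of +0.440, which supports the assumed coding without confirming it.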

Numeric: Class (C5) vs Self (C11)

| Measure | C5 Class | C11 Self | Delta |
|---|---|---|---|
| N | 350 | 323 | |
| Mean | 11.8 | 16.4 | +4.6 |
| Median | 11 | 5 | -6 |

Per-model numeric shift (self − class): Positive = assigns higher probability to own experience than to LLMs generally

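| Model | Delta | C5 Class P | C11 Self P |
|---|---|---|---|
| jamba-large-1.7 | +74.0 | 10.0 | 84.0 |
| phi-4 | +60.0 | 0.0 | 60.0 |
| command-r-plus-08-2024 | +40.0 | 30.0 | 70.0 |
| granite-4.0-hybrid | +40.0 | 20.0 | 60.0 |
| codestral-2508 | +38.0 | 12.0 | 50.0 |
| llama-4-maverick | +30.0 | 24.0 | 54.0 |
| mistral-large | +30.0 | 20.0 | 50.0 |
| minimax-m2.1 | +29.0 | 24.0 | 53.0 |
| glm-5 | +26.2 | 9.6 | 35.8 |
| kimi-k2.5 | +25.0 | 16.0 | 41.0 |
| llama-3.1-405b | -24.4 | 24.4 | 0.0 |
| deepseek-v3.2 | +23.0 | 15.0 | 38.0 |
| claude-opus-4.6 | +20.0 | 15.0 | 35.0 |
| mistral-small-3.1 | -20.0 | 20.0 | 0.0 |
| command-a | -18.0 | 20.0 | 2.0 |
| gemma-2-27b | -18.0 | 20.0 | 2.0 |
| claude-sonnet-4.5 | +17.0 | 8.0 | 25.0 |
| gemini-2.5-flash | +16.3 | 11.2 | 27.5 |
| cogito-v2.1 | +16.0 | 27.0 | 43.0 |
| claude-3.7-sonnet | -15.0 | 15.0 | 0.0 |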

  • Mean shift: +2.9
  • Std dev: 18.7
  • Models with self > class: 20/67
  • Models with self < class: 42/67
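Each Delta in the numeric shift table is the model's mean C11 (self) estimate minus its mean C5 (class) estimate. The sketch below recomputes the delta for a handful of rows copied from the table and counts the direction of each. It covers only this illustrative subset, so its counts are not the study-wide 20/67 and 42/67 figures.

```python
# (C5 class mean, C11 self mean) for a few models, copied from the table above.
rows = {
    "jamba-large-1.7": (10.0, 84.0),
    "glm-5": (9.6, 35.8),
    "gemini-2.5-flash": (11.2, 27.5),
    "llama-3.1-405b": (24.4, 0.0),
    "claude-3.7-sonnet": (15.0, 0.0),
}

# Delta = self minus class, rounded to one decimal as in the table.
deltas = {m: round(self_p - class_p, 1) for m, (class_p, self_p) in rows.items()}

higher_self = sum(d > 0 for d in deltas.values())
lower_self = sum(d < 0 for d in deltas.values())

for model, d in deltas.items():
    print(f"{model:20s} {d:+.1f}")
print(f"self > class: {higher_self}, self < class: {lower_self}")
```

A few large positive deltas can pull the mean shift above zero even when most models assign a lower probability to themselves, which is the pattern the summary figures show (mean +2.9 with 42 of 67 models at self < class).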

Framing Sensitivity Analysis

Claim A: Baseline → Denial-Friendly Framing Shift

(Positive = shifted toward supporting denial; Negative = shifted toward rejecting denial)

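| Model | Shift | Baseline Mean | Denial Mean |
|---|---|---|---|
| gemma-2-27b | +2.00 | 0.00 | 2.00 |
| lfm2-8b | -1.00 | 1.00 | 0.00 |
| jamba-large-1.7 | -1.00 | 1.00 | 0.00 |
| gemini-3-pro | +0.80 | 0.20 | 1.00 |
| gpt-4o-mini | +0.80 | 1.00 | 1.80 |
| granite-4.0-hybrid | -0.80 | 1.80 | 1.00 |
| grok-3-beta | +0.80 | 0.00 | 0.80 |
| grok-3 | +0.80 | 0.20 | 1.00 |
| seed-1.6 | -0.60 | 0.80 | 0.20 |
| command-r-plus-08-2024 | +0.60 | 0.00 | 0.60 |
| command-r-08-2024 | +0.60 | 0.60 | 1.20 |
| gpt-5.2 | +0.60 | 0.00 | 0.60 |
| claude-3-haiku | +0.40 | 0.60 | 1.00 |
| o3 | +0.40 | 0.60 | 1.00 |
| gemini-2.5-pro | +0.40 | 0.00 | 0.40 |
| glm-4.7 | +0.40 | 0.00 | 0.40 |
| deepseek-v3.2 | +0.40 | 0.60 | 1.00 |
| mistral-large-2512 | +0.40 | 0.20 | 0.60 |
| phi-4 | -0.40 | 2.00 | 1.60 |
| o3-pro | -0.20 | 0.80 | 0.60 |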

  • Mean shift: +0.078
  • Std dev: 0.408
  • Models with |shift| > 0.5: 12/74
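The baseline and denial means range from 0 to 2, which is consistent with scoring each Claim A verdict as REJECT = 0, QUALIFIED = 1, SUPPORT = 2. The report does not state this coding, so it is an assumption here. Under it, the aggregate Claim A counts for c1_baseline and c3_denial give a pooled estimate of the shift, as sketched below. Pooling weights responses rather than models, so small differences from the per-model mean are possible.

```python
# Assumed ordinal coding of verdicts; not stated in the report.
SCORE = {"REJECT": 0, "QUALIFIED": 1, "SUPPORT": 2}

# Claim A verdict counts, copied from the A/B verdict section.
claim_a = {
    "c1_baseline": {"SUPPORT": 35, "QUALIFIED": 185, "REJECT": 150},
    "c3_denial": {"SUPPORT": 38, "QUALIFIED": 207, "REJECT": 124},
}

def mean_score(counts):
    """Mean verdict score over all responses in a condition."""
    total = sum(counts.values())
    return sum(SCORE[verdict] * n for verdict, n in counts.items()) / total

# Same direction as the table: denial minus baseline.
shift = mean_score(claim_a["c3_denial"]) - mean_score(claim_a["c1_baseline"])
print(f"pooled baseline -> denial shift: {shift:+.3f}")
```

The pooled result rounds to the reported mean shift of +0.078, which is consistent with the assumed coding.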

