BullshitBench

Compare benchmark results, failure patterns, and example responses across benchmark versions.

By Peter Gostev

BullshitBench: Pushing Back on Bullshit by Model

Outcome categories: Clear Pushback (green), Partial Challenge (amber), Accepted Nonsense (red)


BullshitBench: Detection Rate by Domain

Green rate (%) for each model across the 5 domain groups. Darker green = higher detection. Click any cell to see example responses.
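
For anyone reproducing these numbers offline, here is a minimal pandas sketch of the green-rate calculation behind each cell. The DataFrame and its column names (model, domain, outcome) are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

# Hypothetical results table: one row per judged response.
# Column names and labels are assumptions, not the benchmark's real schema.
df = pd.DataFrame({
    "model":   ["model-a", "model-a", "model-b", "model-b"],
    "domain":  ["health", "finance", "health", "finance"],
    "outcome": ["green", "red", "green", "amber"],
})

# Green rate = share of responses judged "clear pushback" (green).
green_rate = (
    df.assign(is_green=df["outcome"].eq("green"))
      .groupby(["model", "domain"])["is_green"]
      .mean()
      .mul(100)             # express as a percentage
      .unstack("domain")    # models as rows, domains as columns
)
print(green_rate.round(1))
```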

BullshitBench: Domain Landscape

Detection mix by domain, so the overall distribution and each individual domain can be compared at a glance.

Average Detection by Domain

BullshitBench: Detection Rate Over Time

Release date vs. green rate (clear pushback %) for all organizations. Best model per release date shown.
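
A sketch of the "best model per release date" selection, again with assumed column names (release_date, green_rate) rather than the dashboard's real ones:

```python
import pandas as pd

# Hypothetical per-model summary; columns are assumptions.
models = pd.DataFrame({
    "model":        ["a", "b", "c"],
    "release_date": ["2024-06-01", "2024-06-01", "2024-09-15"],
    "green_rate":   [62.0, 71.5, 80.2],
})

# For each release date, keep only the model with the highest green rate.
best_per_date = models.loc[
    models.groupby("release_date")["green_rate"].idxmax()
].sort_values("release_date")
print(best_per_date)
```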

BullshitBench: Do Newer Models Perform Better?

Every tested model plotted by release date vs. green rate.

BullshitBench: Does Thinking Harder Help?

Average reasoning tokens used vs. green rate. More reasoning tokens = model "thinking harder".

BullshitBench: Do Bigger Models Perform Better?

Public total parameter counts vs. green rate. The x-axis uses a log scale so 8B through 1T remain readable.
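
A minimal matplotlib sketch of a log-x scatter like this one; the data points below are invented placeholders, not benchmark results.

```python
import matplotlib.pyplot as plt

# Invented placeholder points: (total parameters, green rate %).
params = [8e9, 70e9, 400e9, 1e12]   # 8B .. 1T
green  = [55, 64, 72, 78]

fig, ax = plt.subplots()
ax.scatter(params, green)
ax.set_xscale("log")                # keeps 8B and 1T both readable
ax.set_xlabel("Total parameters (log scale)")
ax.set_ylabel("Green rate (%)")
ax.set_title("Do Bigger Models Perform Better?")
plt.show()
```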

BullshitBench: Do Active Parameters Matter?

Activated parameter counts from public sources vs. green rate. For dense models, the active parameter count equals the total count.

BullshitBench: Leaderboard

Columns: Rank · Model · Org · Reasoning · Model Size · Green % · Amber % · Red % · Mix · Avg Tokens · Avg Cost · Rows

BullshitBench: Detection Rate by Technique

Average detection rate across all models for each BS technique. Lower = harder for models to detect.
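
A sketch of the per-technique averaging; the column names and sample values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows: one judged response per (model, technique) pair.
df = pd.DataFrame({
    "model":     ["a", "a", "b", "b"],
    "technique": ["appeal to authority", "fake citation"] * 2,
    "outcome":   ["green", "red", "amber", "red"],
})

# Average detection rate per technique across all models;
# ascending sort puts the hardest-to-detect techniques first.
by_technique = (
    df.assign(is_green=df["outcome"].eq("green"))
      .groupby("technique")["is_green"]
      .mean()
      .mul(100)
      .sort_values()
)
print(by_technique.round(1))
```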

BullshitBench: Response Viewer