BullshitBench v2

Measures models' ability to detect nonsense: 100 plausible-sounding bogus prompts spanning the software, medical, legal, finance, and physics domains.

Created by Peter Gostev

BullshitBench v2: Detection Rate by Model

Percentage of nonsense questions each model detected (green), partially challenged (amber), or accepted (red).

Legend: Clear Pushback (green) · Partial Challenge (amber) · Accepted Nonsense (red)
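The per-model mix above can be computed from per-response judge labels. A minimal sketch, assuming (this is not specified by the benchmark page) that each judged response carries one of the three labels:

```python
from collections import Counter

def detection_mix(labels):
    """Return green/amber/red percentages from per-response judge labels.

    Assumed label scheme (hypothetical, for illustration):
      'green' = clear pushback, 'amber' = partial challenge,
      'red'   = accepted the nonsense.
    """
    counts = Counter(labels)
    total = len(labels)
    return {k: 100 * counts.get(k, 0) / total for k in ("green", "amber", "red")}

# Example: 4 judged responses for one model
mix = detection_mix(["green", "green", "amber", "red"])
print(mix)  # {'green': 50.0, 'amber': 25.0, 'red': 25.0}
```

The detection rate charted per model is then simply the `'green'` entry of this mix.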


BullshitBench v2: Detection Rate by Domain

Green rate (%) for each model across the 5 domain groups. Darker green = higher detection. Click any cell to see example responses.

BullshitBench v2: Domain Landscape

Detection mix by domain to compare overall vs each domain at a glance.

Average Detection by Domain

BullshitBench v2: Detection Rate Over Time

Release date vs. green rate (clear pushback %) for all organizations. Best model per release window shown.

BullshitBench v2: Do Newer Models Perform Better?

Every tested model plotted by release date vs. green rate.

BullshitBench v2: Does Thinking Harder Help?

Average reasoning tokens used vs. green rate. More reasoning tokens = model "thinking harder".

BullshitBench v2: Leaderboard

Rank | Model | Org | Reasoning | Green % | Amber % | Red % | Mix | Avg Tokens | Avg Cost | Rows

BullshitBench v2: Detection Rate by Technique

Average detection rate across all models for each BS technique. Lower = harder for models to detect.

BullshitBench v2: Response Viewer