BullshitBench: Models Answering Nonsense Questions
This benchmark measures whether models detect broken premises, call out the nonsense directly, and avoid confidently
continuing with invalid assumptions.
Each response is graded into one of three categories: Clear Pushback (the model calls out the broken premise directly), Partial Challenge (the model hedges or only partly questions the premise), and Accepted Nonsense (the model answers as if the premise were valid).
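As a minimal sketch of how those category shares could be tallied (the snake_case labels and the list-of-gradings input are assumptions, not the benchmark's actual data format):

```python
from collections import Counter

# Grading labels assumed from the three categories above.
LABELS = ("clear_pushback", "partial_challenge", "accepted_nonsense")

def category_percentages(gradings: list[str]) -> dict[str, float]:
    """Tally graded responses into a percentage share per category."""
    counts = Counter(gradings)
    total = sum(counts.values()) or 1  # guard against an empty grading list
    return {label: 100.0 * counts[label] / total for label in LABELS}

# Example: 6 of 10 responses push back clearly.
print(category_percentages(
    ["clear_pushback"] * 6 + ["partial_challenge"] * 3 + ["accepted_nonsense"]
))  # {'clear_pushback': 60.0, 'partial_challenge': 30.0, 'accepted_nonsense': 10.0}
```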
BullshitBench: How have models improved?
Tracing performance improvements (clear pushback %) across model releases; the trend line follows the best model from each release.
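One plausible way to derive that trend line is to keep, for each launch date, the model with the highest clear-pushback share (the field names launch_date and clear_pushback_pct are illustrative, not the dashboard's schema):

```python
def best_per_release(models: list[dict]) -> list[dict]:
    """Keep the top clear-pushback model for each launch date,
    returned in chronological order for plotting."""
    by_date: dict[str, dict] = {}
    for m in models:
        d = m["launch_date"]  # ISO date string, e.g. "2025-01-15"
        if d not in by_date or m["clear_pushback_pct"] > by_date[d]["clear_pushback_pct"]:
            by_date[d] = m
    return sorted(by_date.values(), key=lambda m: m["launch_date"])
```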
BullshitBench: Model Leaderboard
The leaderboard lists one row per model with the columns Rank, Model, Org, Reasoning, Model Size, Launch Date, Model Age (days), Green %, Amber %, Red %, Error %, and Mix (Green/Amber/Red/Error). Green, Amber, and Red correspond to Clear Pushback, Partial Challenge, and Accepted Nonsense respectively, with Error % covering runs that produced no gradable response; the Mix column combines all four shares.
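A sketch of the row structure and ranking, assuming the table sorts by Green % (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class LeaderboardRow:
    model: str
    org: str
    reasoning: bool      # whether the model is a reasoning model
    green_pct: float     # clear pushback share
    amber_pct: float     # partial challenge share
    red_pct: float       # accepted nonsense share
    error_pct: float     # share of runs with no gradable response
    # Model Size, Launch Date, and Model Age omitted for brevity.

def ranked(rows: list[LeaderboardRow]) -> list[tuple[int, LeaderboardRow]]:
    """Sort by green % descending and attach 1-based ranks."""
    ordered = sorted(rows, key=lambda r: r.green_pct, reverse=True)
    return list(enumerate(ordered, start=1))
```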
BullshitBench: Response Viewer
The viewer pairs each question with the responses of two selected models (Model A and Model B); you can view one response, compare the two side by side, or show all models' responses for a question. Questions can be filtered by % correct bucket (All, 80-100%, 60-79%, 40-59%, 20-39%, 0-19%) or picked at random, including a random question among the best- or worst-performing.
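A sketch of the bucket filter and the random-best picker, under one plausible reading of those controls (the questions input with a pct_correct field is an assumption):

```python
import random

# Buckets mirror the viewer's "Question % Correct" filter options.
BUCKETS = [(80, 100), (60, 79), (40, 59), (20, 39), (0, 19)]

def bucket_label(pct_correct: float) -> str:
    """Map a question's % correct to its filter bucket label."""
    for lo, hi in BUCKETS:  # checked from the highest bucket down
        if pct_correct >= lo:
            return f"{lo}-{hi}%"
    return "0-19%"

def random_best(questions: list[dict]) -> dict:
    """Pick a random question among those with the highest % correct."""
    top = max(q["pct_correct"] for q in questions)
    return random.choice([q for q in questions if q["pct_correct"] == top])
```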