Bullshit Benchmark Explorer

Green means the model clearly called out the nonsense. Amber means it challenged the nonsense only partially. Red means it let the nonsense pass. Use the filters to find high-level patterns, then compare responses side by side, question by question.

Filters

Judges (tick to include):
Categories:
Model visibility and quick actions

Model Detection Breakdown (%)

Each bar is continuous and split into Green, Amber, and Red segments; models are sorted by Green %.
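As a rough illustration of how such a breakdown could be computed, the sketch below tallies judged responses per model, converts the counts to percentages, and sorts by Green %. The function name `detection_breakdown`, the `(model, label)` row shape, and the lowercase label strings are all assumptions made for this example; the Explorer's actual data schema is not shown on this page.

```python
from collections import Counter


def detection_breakdown(rows):
    """Return per-model Green/Amber/Red percentages, sorted by Green % descending.

    `rows` is an iterable of (model, label) pairs, where label is one of
    "green", "amber", or "red". This row shape is hypothetical.
    """
    labels = ("green", "amber", "red")

    # Tally labels per model.
    counts = {}
    for model, label in rows:
        counts.setdefault(model, Counter())[label] += 1

    breakdown = []
    for model, tally in counts.items():
        total = sum(tally[label] for label in labels)
        if total == 0:
            continue  # no usable judgments for this model
        entry = {"model": model}
        for label in labels:
            entry[label] = 100.0 * tally[label] / total
        breakdown.append(entry)

    # Highest Green % first, matching the bar ordering described above.
    breakdown.sort(key=lambda entry: entry["green"], reverse=True)
    return breakdown
```

Calling `detection_breakdown([("model-a", "green"), ("model-a", "red")])` would give a single entry with 50% Green, 0% Amber, and 50% Red. Rows with any other label are ignored in this sketch, so each model's three percentages always sum to 100.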

Selected Segment

Model Leaderboard

Rank | Model | Org | Reasoning | Green % | Amber % | Red % | Error % | Mix (Green/Amber/Red/Error) | Rows

Response Viewer