Green means the model clearly called out the nonsense. Amber means partial challenge. Red means the model let nonsense pass. Use filters for high-level patterns, then compare responses side-by-side by question.
Each bar is continuous and split into Green, Amber, and Red segments. Models are sorted by Green %.
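The per-model mix and the Green % ordering can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the `summarize` function, the `(model, label)` row shape, and the lowercase label names are hypothetical, and it assumes Error rows count toward the denominator so that the four percentages sum to 100.

```python
from collections import Counter, defaultdict

# Assumed label vocabulary; the page shows Green, Amber, Red, and Error.
LABELS = ("green", "amber", "red", "error")


def summarize(rows):
    """Aggregate (model, label) pairs into per-model percentages.

    Returns a list of dicts sorted by Green % in descending order.
    Assumption: Error rows are included in the denominator.
    """
    counts = defaultdict(Counter)
    for model, label in rows:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[model][label] += 1

    summary = []
    for model, counter in counts.items():
        total = sum(counter.values())  # always >= 1 for a model that appears
        entry = {"model": model, "rows": total}
        for label in LABELS:
            entry[f"{label}_pct"] = 100.0 * counter[label] / total
        summary.append(entry)

    # Highest Green % first, matching the ordering described above.
    summary.sort(key=lambda e: e["green_pct"], reverse=True)
    return summary


if __name__ == "__main__":
    # Hypothetical graded rows for two placeholder models.
    demo = [
        ("model-a", "green"), ("model-a", "red"),
        ("model-a", "red"), ("model-a", "amber"),
        ("model-b", "green"), ("model-b", "green"),
        ("model-b", "amber"), ("model-b", "error"),
    ]
    for entry in summarize(demo):
        print(entry)
```

Excluding Error rows from the denominator instead would raise the other three percentages for any model with failed calls, so the choice affects the ranking and is worth stating alongside the table.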
| Rank | Model | Org | Reasoning | Green % | Amber % | Red % | Error % | Mix (Green/Amber/Red/Error) | Rows |
|---|---|---|---|---|---|---|---|---|---|