Eval QA

Quality audit of 8 LLM benchmarks across 1830 tasks

This report was generated using AI (Claude). There may be mistakes in our audit.

We audited the grading code of 8 major LLM benchmarks and found issues in every one of them.

All Benchmarks

Notable Bugs

All Instances
