Bugs in LLM Benchmark Grading

April 9, 2026

I used Claude to audit the grading code of eight major LLM benchmarks: SWE-bench Verified, SWE-bench Pro, Terminal Bench 2, RE-Bench, CORE-Bench, OSWorld, Cybench, and MLE-bench. I found grading issues in every one of them.

The report is here.